Special cases in tokenization


When I use Prodigy to tag my text, I want some specific phrases not to be split.
For example, below are some parts of the text that I would like not to be split on whitespace, so that I can tag them as a legal form entity.
I tried adding some exceptions to the tokenizer, but it does not seem to work. How can I do this?

'Federal administration'
'Limited Liability Limited Partnership'
'Trust Company'
'Partido político'
'Incorporated Limited Partnerships'

Thank you.

Is it splitting them into different sentences?

The name of the Limited
Partnership is ACME

If that's the case, you could try passing the --unsegmented flag (https://prodi.gy/docs/named-entity-recognition#manual-model), which might fix that issue for you. Otherwise, if you mean "white space" as in just a space, you should be able to tag all the words at once by dragging your mouse across them to mark them as a single entity.

I tend to preprocess my text to avoid issues like this. It depends on your corpus, but for me, simply splitting the text into paragraphs and then putting the document into JSONL format with --unsegmented seems to do the trick.
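As a rough illustration of that preprocessing step, here is a minimal sketch using only the Python standard library. It assumes paragraphs are separated by blank lines and that each JSONL record only needs a "text" key; the function name paragraphs_to_jsonl is made up for this example.

```python
import json


def paragraphs_to_jsonl(text: str) -> str:
    """Split text on blank lines and return one JSON object per paragraph."""
    records = []
    for block in text.split("\n\n"):
        # Collapse hard line breaks inside a paragraph so a phrase such as
        # "Limited\nPartnership" ends up on a single line.
        paragraph = " ".join(block.split())
        if paragraph:
            records.append(json.dumps({"text": paragraph}, ensure_ascii=False))
    return "\n".join(records)


sample = (
    "The name of the Limited\nPartnership is ACME\n\n"
    "SANNE Group Japan Trust Company"
)
print(paragraphs_to_jsonl(sample))
```

The output can be saved to a .jsonl file and loaded as the source for annotation.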

Not exactly.

Take for example the following company name:

SANNE Group Japan Trust Company

In this case I would like the following tokenization:
Trust Company

So basically I want Trust Company not to be split into two parts.

This may not be the answer you're looking for, but I think this is probably the place to start: https://spacy.io/api/tokenizer#add_special_case. You would then create a custom recipe in Prodigy that manages the special cases.

However, this approach seems like overkill to me, as I don't understand why you don't want those tokens split. You mention wanting to tag them as a legal form entity.

If your label is something like "LEGAL_FORM" the tokens do not need to be combined. You can tag both tokens of "Trust Company" as a single entity. I don't understand the benefit of having them as a single token, so it seems highly unnecessary to me (I admit, maybe I don't fully understand what you're trying to do).

If you're looking to tag things like "Trust Company", "Limited Partnership", and "Limited liability company" as a single entity "LEGAL_FORM", the approach I would take is not to worry about them being separate tokens and just tag multiple tokens as a single entity. I imagine you may have a list of legal forms already, so you could set up some match patterns to speed up the process and automatically tag things in your preset list. https://prodi.gy/docs/named-entity-recognition#manual-patterns
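For what it's worth, here is a minimal sketch of how such a patterns file could be generated from a list of legal forms. It assumes the token-based pattern format described at the link above (one JSON object per line with "label" and "pattern" keys) and that splitting each phrase on whitespace matches how the tokenizer splits it, which may not hold for phrases containing punctuation. The LEGAL_FORM label and the to_pattern helper are only illustrative.

```python
import json

# Legal forms taken from the examples earlier in this thread.
LEGAL_FORMS = [
    "Federal administration",
    "Limited Liability Limited Partnership",
    "Trust Company",
    "Partido político",
    "Incorporated Limited Partnerships",
]


def to_pattern(term: str, label: str = "LEGAL_FORM") -> dict:
    """Build one case-insensitive token pattern for a multi-word phrase."""
    # One {"lower": ...} entry per whitespace-separated word (assumption:
    # the tokenizer splits the phrase the same way).
    return {"label": label, "pattern": [{"lower": w.lower()} for w in term.split()]}


patterns_jsonl = "\n".join(
    json.dumps(to_pattern(term), ensure_ascii=False) for term in LEGAL_FORMS
)
print(patterns_jsonl)
```

Saved as a .jsonl file, this could then be passed to the recipe as the patterns file, so that "Trust Company" is pre-highlighted as one LEGAL_FORM span even though it remains two tokens.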
