Special cases in tokenization

mauro_svl · March 9, 2022, 4:39pm

Hi,

When I use Prodigy to tag my text, I want some specific phrases not to be split.
For example, the following are parts of the text that I would like not to be split on whitespace, so that I can tag them as the entity "legal form".
I tried adding some exceptions to the tokenizer, but it does not seem to work. How can I do this?

'Federal administration'
'Limited Liability Limited Partnership'
'Trust Company'
'Partido político'
'Incorporated Limited Partnerships'

Thank you.

Onyoursix · March 10, 2022, 7:11am

Is it splitting them into different sentences?

Like:
The name of the Limited
Liability
Limited
Partnership is ACME

If that's the case, you could try passing the --unsegmented flag (https://prodi.gy/docs/named-entity-recognition#manual-model), which might fix that issue for you. Otherwise, if by "white space" you mean just a space, you should be able to tag all the words at once by dragging your mouse across them to make a single entity.

I tend to preprocess my text to avoid issues like this. It depends on your corpus, but for me, simply splitting the text into paragraphs then putting the document into jsonl format with --unsegmented seems to do the trick.
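As a rough sketch of the preprocessing I mean (the function name is made up for illustration, and I'm assuming paragraphs are separated by blank lines), something like this turns a raw document into one JSONL record per paragraph, each with a "text" key:

```python
import json
import re


def paragraphs_to_jsonl(raw_text):
    """Split raw text on blank lines and return one JSON object per line.

    Hypothetical helper: assumes paragraphs are separated by one or more
    blank lines and that each record only needs a "text" field.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw_text)]
    records = [{"text": p} for p in paragraphs if p]
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


sample = (
    "The name of the Limited Liability Limited Partnership is ACME.\n"
    "\n"
    "SANNE Group Japan Trust Company"
)
jsonl = paragraphs_to_jsonl(sample)
```

You would write the resulting string to a .jsonl file and load that as the source. Your corpus may need a different splitting rule.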

mauro_svl · March 10, 2022, 11:08am

Not exactly.

Take for example the following company name:

SANNE Group Japan Trust Company

In this case I would like the following tokenization:
SANNE
Group
Japan
Trust Company

So basically I want "Trust Company" not to be split into two parts.

Onyoursix · March 10, 2022, 6:41pm

This may not be the answer you're looking for, but I think the place to start is the Tokenizer page in the spaCy API documentation. You could then create a custom recipe in Prodigy that manages the special cases.
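For what it's worth, here is a minimal sketch of one way to get "Trust Company" as a single token in spaCy. Note that it does not use tokenizer special cases: it merges tokens after tokenization with a PhraseMatcher and doc.retokenize(), which is a different technique from the one the Tokenizer docs describe. The list contents and the function name are just for illustration, and it assumes spaCy v3 with a blank English pipeline:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Illustrative list; in practice this would be your own set of legal forms.
LEGAL_FORMS = ["Trust Company", "Limited Liability Limited Partnership"]

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive match
matcher.add("LEGAL_FORM", [nlp.make_doc(form) for form in LEGAL_FORMS])


def merge_legal_forms(doc):
    """Merge every matched legal form into a single token (hypothetical helper)."""
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        # filter_spans drops overlapping matches, keeping the longest ones
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc


doc = merge_legal_forms(nlp.make_doc("SANNE Group Japan Trust Company"))
tokens = [token.text for token in doc]
# tokens == ["SANNE", "Group", "Japan", "Trust Company"]
```

How you wire this into a custom recipe depends on your setup, so treat it as a starting point only.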

However, this approach seems like overkill to me, as I don't understand why you don't want those tokens split. You mention that you want to tag them as a legal-form entity.

If your label is something like "LEGAL_FORM" the tokens do not need to be combined. You can tag both tokens of "Trust Company" as a single entity. I don't understand the benefit of having them as a single token, so it seems highly unnecessary to me (I admit, maybe I don't fully understand what you're trying to do).

If you're looking to tag things like "Trust Company", "Limited Partnership", and "Limited liability company" as a single entity "LEGAL_FORM", the approach I would take is not to worry about them being separate tokens and just tag multiple tokens as a single entity. I imagine you may already have a list of legal forms, so you could set up some match patterns to speed up the process and automatically tag things in your preset list. See the Named Entity Recognition page in the Prodigy docs.
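To make the match patterns idea concrete, here is a small sketch that builds token-based patterns from a list of legal forms. The helper name and the list are illustrative, and it assumes each phrase tokenizes the same way as a plain whitespace split (no punctuation or hyphens inside the phrase):

```python
import json

# Illustrative list taken from the examples earlier in the thread.
LEGAL_FORMS = [
    "Federal administration",
    "Limited Liability Limited Partnership",
    "Trust Company",
    "Partido político",
    "Incorporated Limited Partnerships",
]


def make_patterns(phrases, label="LEGAL_FORM"):
    """Build one case-insensitive token pattern per phrase (hypothetical helper)."""
    patterns = []
    for phrase in phrases:
        # One {"lower": ...} dict per whitespace-separated word
        token_pattern = [{"lower": word.lower()} for word in phrase.split()]
        patterns.append({"label": label, "pattern": token_pattern})
    return patterns


patterns = make_patterns(LEGAL_FORMS)
patterns_jsonl = "\n".join(json.dumps(p, ensure_ascii=False) for p in patterns)
```

If you save patterns_jsonl to a file, you should be able to pass it to the manual NER recipe with the --patterns option so that matches are pre-highlighted. You would still accept or correct each suggestion by hand.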
