Strange text segmentation with ner.teach recipe

damiano · September 9, 2019, 12:43am

Hello,
I am using the ner.teach recipe on long documents. After saving a few annotations to the database and exporting them to a .jsonl file (via the db-out recipe), I noticed that the "text" of each annotation is very short. Why doesn't it save the real text (I mean the full text from the source)?

How will this affect training after the teaching step?

I mean: will training the model on these new annotations "corrupt" the current NER model, which I trained on long documents?

Thanks

P.S. My current model does not have a parser (or sentencizer), so for the moment I do not split sentences.

ines · September 9, 2019, 9:38am

Are you setting the --unsegmented argument? If not, Prodigy will automatically segment the text to prevent long outliers from causing problems for the active learning.

Depending on the text, what gets segmented as a sentence differs, so if you don't want the text to be segmented, set --unsegmented to turn off the behaviour. Just make sure that the text you stream in doesn't include any examples that are huge.

damiano · September 9, 2019, 10:00am

Thank you @ines
But what does "huge" really mean? How many tokens?
At the moment I am using one label at a time.

However, please correct me if I'm wrong: I must use the original documents during training, right? I mean, "fake" chunks of the documents that contain the label are no good and will corrupt the NER model, right?

ines · September 9, 2019, 10:06am

Like, a few hundred or a few thousand tokens? It really depends, but if your incoming texts are uneven and there's suddenly an outlier like this, the whole process may die.
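One way to guard against such outliers is to filter the input before it reaches the recipe. The following is only a rough sketch: the function name `filter_long_texts`, the `max_tokens=1000` cutoff, and the whitespace-based token count are all assumptions made for illustration (a real tokenizer will give different counts), and it assumes each input line is a JSON object with a "text" key.

```python
import json


def filter_long_texts(lines, max_tokens=1000):
    """Yield parsed tasks whose approximate token count is within max_tokens.

    Tokens are approximated by whitespace splitting (an assumption; a real
    tokenizer will usually produce somewhat more tokens).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        task = json.loads(line)
        n_tokens = len(task["text"].split())
        if n_tokens <= max_tokens:
            yield task


# Two in-memory JSONL lines: one short text and one very long outlier.
sample = [
    json.dumps({"text": "Acme Corp signed the contract."}),
    json.dumps({"text": "word " * 5000}),
]
kept = list(filter_long_texts(sample, max_tokens=1000))
print(len(kept), "of", len(sample), "examples kept")
```

In practice the lines would come from the source file, and the kept tasks would be written back out as JSONL before annotation.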

No, training on sentences can be a totally fine strategy. But if there isn't a good way to meaningfully split your text, then yes, training on non-sentence chunks is probably not very useful.

damiano · September 9, 2019, 11:17am

Thanks @ines
Our documents are about 10k tokens long, more or less.
We could create a split strategy, but the problem is that the documents are semi-structured, and we can only rely on \n and upper/lower case to detect a new sentence.
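A heuristic along these lines could be sketched as follows. This is a hypothetical illustration, not something taken from Prodigy or spaCy: the function name `split_on_newline_and_case` and the exact rule (a line starting with an uppercase letter opens a new segment, any other line continues the current one) are assumptions. It keeps character offsets so that segments can be mapped back to the source document later.

```python
def split_on_newline_and_case(text):
    """Return (start_offset, segment_text) pairs.

    A new segment begins at every non-empty line whose first non-space
    character is uppercase; other non-empty lines are treated as a
    continuation of the current segment. Blank lines are skipped.
    """
    segments = []
    start = None  # start offset of the segment being built
    end = 0       # end offset (exclusive) of the segment being built
    pos = 0       # offset of the current line within text
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped:
            if start is None or stripped[0].isupper():
                if start is not None:
                    segments.append((start, text[start:end]))
                # skip leading whitespace on the first line of a segment
                start = pos + (len(line) - len(line.lstrip()))
            # end just after the last non-space character of this line
            end = pos + len(line.rstrip())
        pos += len(line)
    if start is not None:
        segments.append((start, text[start:end]))
    return segments


text = "Invoice issued by\nacme corp on Monday.\nPayment is due in 30 days.\n"
segments = split_on_newline_and_case(text)
for offset, segment in segments:
    print(offset, repr(segment))
```

The rule is deliberately crude: a continuation line that happens to start with a capitalised name will still open a new segment, so the output would need checking on real documents.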

The problem, Ines, is that if we split our documents, we could get sentences with only a few words; for example, an ORG entity can end up alone in a sentence. So I often lose the context of the entities.
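One possible mitigation, sketched here purely as an assumption, is to merge very short segments with their neighbours so that an isolated entity keeps some surrounding text. The helper name `merge_short_segments` and the `min_words` threshold are hypothetical, and whether the merged chunks carry enough context depends entirely on the documents.

```python
def merge_short_segments(text, segments, min_words=5):
    """Merge consecutive (start, end) offset pairs until each merged
    segment contains at least min_words whitespace-separated words.

    A short leftover at the end is attached to the previous merged segment.
    """
    merged = []
    cur_start = None
    for start, end in segments:
        if cur_start is None:
            cur_start = start
        if len(text[cur_start:end].split()) >= min_words:
            merged.append((cur_start, end))
            cur_start = None
    if cur_start is not None:
        # trailing segment that never reached min_words
        tail_end = segments[-1][1]
        if merged:
            merged[-1] = (merged[-1][0], tail_end)
        else:
            merged.append((cur_start, tail_end))
    return merged


text = (
    "Acme Corp\n"
    "The supplier agrees to deliver the goods by Friday.\n"
    "Globex\n"
    "The buyer pays within thirty days of delivery."
)

# Build one (start, end) pair per line as a stand-in for a real splitter.
segments = []
pos = 0
for line in text.split("\n"):
    segments.append((pos, pos + len(line)))
    pos += len(line) + 1  # +1 for the newline removed by split

merged = merge_short_segments(text, segments, min_words=5)
for start, end in merged:
    print(repr(text[start:end]))
```

Here the one-word lines "Acme Corp" and "Globex" are each joined to the line that follows them instead of becoming separate examples.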

I have not found a solution yet.

damiano · September 9, 2019, 11:17am

Or better: I have not found any solution other than using the whole document.

ines · September 9, 2019, 11:23am

Oh okay – but that's more of a general question about how you structure a task and not a problem with Prodigy or the default sentence segmentation then.

If you can't find a good way to split the input text, using a workflow like ner.teach probably makes less sense, because you don't want to be updating on random fragments you can't control. So you're probably better off just labelling the data manually and creating a gold-standard set to train on. Then you can annotate the whole text at once, or in random segments and put them back together afterwards.
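Putting segments back together mostly comes down to shifting span offsets. The sketch below assumes each annotated segment is a dict with "text" and "spans" (with "start", "end" and "label"), and that the segment's character offset in the source document was stored under `meta["offset"]` when the segments were created. That `offset` key and the `reassemble` helper are hypothetical names chosen for this illustration.

```python
def reassemble(doc_text, segment_tasks):
    """Combine per-segment span annotations into one document-level task.

    Each segment task is expected to look like:
    {"text": str, "spans": [{"start", "end", "label"}], "meta": {"offset": int}}
    where meta["offset"] (an assumed key) is the segment's start in doc_text.
    """
    spans = []
    for task in segment_tasks:
        offset = task["meta"]["offset"]
        # sanity check: the segment must match the document at that offset
        if doc_text[offset:offset + len(task["text"])] != task["text"]:
            raise ValueError(f"segment at offset {offset} does not match the document")
        for span in task.get("spans", []):
            spans.append({
                "start": span["start"] + offset,
                "end": span["end"] + offset,
                "label": span["label"],
            })
    spans.sort(key=lambda s: s["start"])
    return {"text": doc_text, "spans": spans}


doc_text = "Acme Corp hired Jane.\nShe moved to Berlin."
segment_tasks = [
    {
        "text": "She moved to Berlin.",
        "spans": [{"start": 13, "end": 19, "label": "GPE"}],
        "meta": {"offset": 22},
    },
    {
        "text": "Acme Corp hired Jane.",
        "spans": [{"start": 0, "end": 9, "label": "ORG"}],
        "meta": {"offset": 0},
    },
]
combined = reassemble(doc_text, segment_tasks)
for span in combined["spans"]:
    print(span["label"], repr(doc_text[span["start"]:span["end"]]))
```

Because the offsets are absolute, the segments can be annotated in any order and still land in the right place in the combined example.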

damiano · September 9, 2019, 12:44pm

Exactly, at the moment I am using ner.make-gold to correct the labels. Thanks!
