Hello,
I am using the ner.teach recipe on long documents. After saving a few annotations to the database and exporting them to a .jsonl file (via the db-out recipe), I noticed that the "text" of the annotations is very short. Why doesn't it save the real text? (I mean the full text from the source.)
How will this impact training after the teaching step?
I mean... will training the model on these new annotations "corrupt" the current NER model that I trained on long documents?
Thanks
P.S. My current model does not have a parser (or sentencizer), so for the moment I do not split sentences.
Are you setting the --unsegmented argument? If not, Prodigy will automatically segment the text to prevent long outliers from causing problems for the active learning.
What gets segmented as a sentence depends on the text, so if you don't want your text to be split, set --unsegmented to turn that behaviour off. Just make sure that the text you stream in doesn't include any examples that are huge.
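To give you an idea, the default behaviour amounts to roughly this (a simplified sketch using spaCy's rule-based sentencizer, not Prodigy's actual internals, and `split_stream` is just a helper name I made up):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

def split_stream(stream):
    # one small task per sentence instead of one task per long document
    for eg in stream:
        doc = nlp(eg["text"])
        for sent in doc.sents:
            yield {"text": sent.text}

for task in split_stream([{"text": "First sentence. Second sentence."}]):
    print(task)  # {'text': 'First sentence.'}, then {'text': 'Second sentence.'}
```

That's why the "text" you see in db-out is so short: each saved example is one segment, not the whole source document.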
Thank you @ines
But what does "huge" really mean? How many tokens?
At the moment I am using one label at a time.
However, please correct me if I'm wrong: I have to use the original documents for training, right? I mean, "fake" chunks of the documents that happen to contain the label are no good and would corrupt the NER model, right?
Like, a few hundred or a few thousand tokens? It really depends, but if your incoming texts are uneven and there's suddenly an outlier like this, the whole process may die.
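If your corpus is that uneven, you could also pre-filter the stream yourself before Prodigy ever sees it. A minimal sketch (the 2000-token cut-off is arbitrary and should be tuned for your corpus, and the whitespace count is only an approximation of real tokenizer output):

```python
import json

MAX_TOKENS = 2000  # arbitrary cut-off, tune it for your corpus

def stream_without_outliers(path):
    """Yield JSONL examples, skipping anything far longer than the rest."""
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            # crude whitespace token count; real tokenizer output will differ a bit
            if len(eg["text"].split()) <= MAX_TOKENS:
                yield eg
```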
No, training on sentences can be a totally fine strategy. But if there isn't a good way to meaningfully split your text, then yes, training on non-sentence chunks is probably not very useful.
Thanks @ines
Our documents are roughly 10k tokens long.
However, we can create a split strategy. The problem is that the documents are semi-structured, so the only cues we can rely on to detect a new sentence are \n plus upper/lower case, something like the sketch below.
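This is roughly what we have in mind (a rough sketch with spaCy; "newline_senter" is just a name we made up, not a built-in component):

```python
import spacy
from spacy.language import Language

@Language.component("newline_senter")  # hypothetical component name
def newline_senter(doc):
    for token in doc[1:]:
        prev = doc[token.i - 1]
        # new sentence after a newline, but only if the next word is capitalised
        token.is_sent_start = "\n" in prev.text and token.text[:1].isupper()
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("newline_senter")
doc = nlp("SECTION 2\nAcme Corp is the seller.\nthe clause continues here")
# the lowercase line after the second newline stays attached to its sentence
print([sent.text for sent in doc.sents])
```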
The problem, Ines, is that even with a rule like this, splitting our documents can produce sentences with only a few words; an ORG entity can end up in a sentence on its own, for example. So I often lose the context around the entities.
Oh okay – but then that's more of a general question about how you structure the task, not a problem with Prodigy or the default sentence segmentation.
If you can't find a good way to split the input text, a workflow like ner.teach probably makes less sense, because you don't want to be updating the model on random fragments you can't control. So you're probably better off labelling the data manually (e.g. with ner.manual) and creating a gold-standard set to train on. Then you can annotate the whole text at once, or annotate it in segments and put them back together afterwards.
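For the "put them back together" part: if you record the character offset at which each segment starts when you cut the documents up, you can shift the annotated spans back into the full document afterwards. A rough sketch ("start_offset" is a custom field you'd add yourself when creating the segments, not something Prodigy produces):

```python
def merge_segments(original_text, segments):
    """Move span offsets from segment space back into the full document.

    segments: Prodigy-style dicts with "spans" (start/end/label), plus a
    custom "start_offset" field recorded when the document was cut up.
    """
    spans = []
    for seg in segments:
        for span in seg.get("spans", []):
            spans.append({
                "start": span["start"] + seg["start_offset"],
                "end": span["end"] + seg["start_offset"],
                "label": span["label"],
            })
    return {"text": original_text, "spans": sorted(spans, key=lambda s: s["start"])}
```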