Working with longer texts

chiquadrat · September 8, 2020, 10:26am

Hi prodigy team/community,

I have a question about working with longer texts in the annotation tool. The documentation says: "For NER annotation, there’s often no benefit in annotating long documents at once, especially if you’re planning on training a model on the data." However, the text snippets that I am trying to annotate are quite long, and arbitrarily shortening them would not make much sense in my case. So the question is: how are these annotations (annotated long texts) fed into the CNN during training (prodigy train ner)? I guess the maximum number of tokens is limited somehow.

Thanks,
Paul

chiquadrat · September 9, 2020, 6:18am

Maybe to make the question clearer: does the difference between TRAIN_DATA_1 and TRAIN_DATA_2 have an impact on the training results when training an NER model?

TRAIN_DATA_1 = [
    ("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}),
    ("they pretend to care about your feelings, those horses", {"entities": [(48, 54, "ANIMAL")]}),
]

TRAIN_DATA_2 = [
    ("horses pretend to care about your feelings. they pretend to care about your feelings, those horses", {"entities": [(0, 6, "ANIMAL"), (92, 98, "ANIMAL")]}),
]
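
One quick way to sanity-check hand-written character offsets like these is to slice each text with its spans and look at what comes back. A minimal sketch, assuming the (text, {"entities": [...]}) tuple format shown above; the helper name entity_slices is made up for illustration and is not part of spaCy or Prodigy:

```python
def entity_slices(train_data):
    """Return (substring, label) for every (start, end, label) span."""
    result = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            # Python slicing uses the same start-inclusive, end-exclusive
            # convention as the character offsets in the annotations.
            result.append((text[start:end], label))
    return result


sample = [
    ("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}),
    ("they pretend to care about your feelings, those horses", {"entities": [(48, 54, "ANIMAL")]}),
]
slices = entity_slices(sample)
print(slices)  # [('horses', 'ANIMAL'), ('horses', 'ANIMAL')]
```

If a slice comes back with a leading space or a clipped last letter, the span is probably off by one, which is easy to do when two texts are concatenated by hand.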

honnibal · September 10, 2020, 12:45am

Hi Paul,

The CNN encodes 4 words of context on either side of each token. So for a token more than 4 words from an edge, the rest of the context doesn't really matter. This does allow one convenience though: it makes it relatively easy to support longer documents, because they can be processed largely in parallel during the token-vector encoding.
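
To make the fixed-window point concrete, here is a toy illustration (not spaCy's actual implementation). It assumes whitespace-separated tokens and a window of 4 tokens on either side, and compares what a token "sees" when the two example sentences are kept separate versus concatenated. The function name context_window is made up for this sketch:

```python
def context_window(tokens, i, width=4):
    """Tokens within `width` positions of index i, including token i itself."""
    return tokens[max(0, i - width): i + width + 1]


# Pre-tokenized by hand so the sketch does not depend on a real tokenizer.
first = "horses pretend to care about your feelings .".split()
second = "they pretend to care about your feelings , those horses".split()
joined = first + second

# The final "horses" is more than 4 tokens away from the join,
# so its window is identical in both versions.
far_same = context_window(second, len(second) - 1) == context_window(joined, len(joined) - 1)

# "they" sits right at the join, so its window changes once the texts are merged.
near_same = context_window(second, 0) == context_window(joined, len(first))

print(far_same, near_same)  # True False
```

Under these assumptions, only the handful of tokens near the point where the texts are joined see a different context.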

So on the one hand, you'll be able to pass documents of a few thousand words into spaCy and it will be able to process them. But it's not taking particular advantage of the long context, and the long documents are likely to be harder to work with in the annotation tool.

Our rule of thumb is that if you need more than a paragraph of context to make the decision, the machine learning models will probably struggle anyway. Also, in longer texts you can likely make a heuristic that works quite well to divide the text into paragraphs or sections. Long documents tend to come in more regular formats at least, so you can usually come up with a way to losslessly segment them.
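
As a rough sketch of such a heuristic, the snippet below splits a text at blank lines and re-bases the entity offsets for each chunk. It assumes paragraphs are separated by blank lines and that annotations are (start, end, label) character spans. The function name split_example is hypothetical, and spans that cross a chunk boundary are simply dropped here:

```python
import re


def split_example(text, entities, pattern=r"\n\s*\n"):
    """Split `text` at blank lines and shift entity offsets into each chunk."""
    # Collect the (start, end) character range of every chunk.
    bounds = []
    start = 0
    for match in re.finditer(pattern, text):
        bounds.append((start, match.start()))
        start = match.end()
    bounds.append((start, len(text)))

    chunks = []
    for chunk_start, chunk_end in bounds:
        if chunk_start == chunk_end:
            continue  # skip empty chunks
        # Keep only spans fully inside this chunk, relative to its start.
        local = [
            (s - chunk_start, e - chunk_start, label)
            for s, e, label in entities
            if s >= chunk_start and e <= chunk_end
        ]
        chunks.append((text[chunk_start:chunk_end], {"entities": local}))
    return chunks


long_text = (
    "horses pretend to care about your feelings.\n\n"
    "they pretend to care about your feelings, those horses"
)
# Locate the second mention programmatically instead of counting by hand.
last = long_text.rindex("horses")
chunks = split_example(long_text, [(0, 6, "ANIMAL"), (last, last + 6, "ANIMAL")])
for chunk_text, annotations in chunks:
    print(repr(chunk_text), annotations)
```

Because the split points are recorded as character ranges, the chunks can be mapped back to the original document later if needed. A different separator (section headings, for example) can be passed via the pattern argument.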

chiquadrat · September 10, 2020, 10:12am

Hi Matthew,
thanks for the answer, that helps us a lot.
Best,
Paul
