Working with longer texts

Hi prodigy team/community,

I have a question about working with longer texts in the annotation tool. The documentation says: "For NER annotation, there's often no benefit in annotating long documents at once, especially if you're planning on training a model on the data." However, the text snippets I am trying to annotate are quite long, and arbitrarily making them shorter would not make much sense in my case. So the question is: how are these annotations (annotated long texts) fed into the CNN during training (`prodigy train ner`)? I guess the maximum number of tokens is somehow limited.


Maybe to make the question clearer: does the difference between TRAIN_DATA_1 and TRAIN_DATA_2 have an impact on the training results when training an NER model?

    TRAIN_DATA_1 = [
        ("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}),
        ("they pretend to care about your feelings, those horses", {"entities": [(48, 54, "ANIMAL")]}),
    ]

    TRAIN_DATA_2 = [
        ("horses pretend to care about your feelings. they pretend to care about your feelings, those horses", {"entities": [(0, 6, "ANIMAL"), (92, 98, "ANIMAL")]}),
    ]
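For what it's worth, the two formats carry the same annotations, just with shifted character offsets. A quick sanity check in plain Python (using the example sentences above) shows how the offsets of the second sentence move when the texts are concatenated:

```python
# Verifying that the entity offsets in the combined text line up with
# the offsets of the two single-sentence examples.
sent_1 = "horses pretend to care about your feelings"
sent_2 = "they pretend to care about your feelings, those horses"
combined = sent_1 + ". " + sent_2

# Offsets from the single-sentence examples (end offsets are exclusive):
assert sent_1[0:6] == "horses"
assert sent_2[48:54] == "horses"

# In the combined text, the second sentence's offsets shift by
# len(sent_1) + 2 (the ". " separator), so (48, 54) becomes (92, 98):
shift = len(sent_1) + 2
print(combined[0:6], combined[92:98])
```

Checking offsets like this is a cheap way to catch misaligned entity spans before training.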

Hi Paul,

The CNN encodes 4 words of context on either side of each token. So for a token more than 4 words from an edge, the rest of the context doesn't really matter. This does allow one convenience though: it makes it relatively easy to support longer documents, because they can be processed largely in parallel during the token-vector encoding.

So on the one hand, you'll be able to pass documents of a few thousand words into spaCy and it will be able to process them. But it's not taking particular advantage of the long context, and long documents are likely to be harder to work with in the annotation tool.

Our rule of thumb is that if you need more than a paragraph of context to make the decision, the machine learning models will probably struggle anyway. Also, in longer texts you can likely make a heuristic that works quite well to divide the text into paragraphs or sections. Long documents tend to come in more regular formats at least, so you can usually come up with a way to losslessly segment them.
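As a minimal sketch of such a lossless heuristic (assuming paragraphs separated by blank lines), you can split on `"\n\n"` while tracking character offsets, so annotations on the segments can always be mapped back to the original document:

```python
# Sketch of a lossless segmentation heuristic: split a long document on
# blank lines (paragraph boundaries) while keeping character offsets.
def split_paragraphs(text):
    """Yield (start_offset, paragraph) pairs covering the original text."""
    start = 0
    for para in text.split("\n\n"):
        yield start, para
        start += len(para) + 2  # account for the "\n\n" separator

doc = "First paragraph.\n\nSecond paragraph."
for offset, para in split_paragraphs(doc):
    # Each segment can be located exactly in the original document:
    assert doc[offset:offset + len(para)] == para
```

Real documents may need a more format-specific splitter (headings, section markers, etc.), but the principle is the same: keep the offsets so the segmentation is reversible.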

Hi Matthew,
thanks for the answer, that helps us a lot.