Working with longer texts

Hi Paul,

The CNN encodes 4 words of context on either side of each token, so words more than 4 positions away have no effect on a token's vector. This does allow one convenience, though: it makes it relatively easy to support longer documents, because the token-vector encoding can be computed largely in parallel across the document.
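
Here's a quick sketch of what that locality means in practice, assuming `en_core_web_sm` is installed and that its `tok2vec` component fills `doc.tensor` (which it does in recent releases): change a word far away from a target token and the target's encoding shouldn't move.

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_sm")

doc1 = nlp("The committee approved the budget after a long debate on Monday evening")
doc2 = nlp("The committee approved the budget after a long debate on Friday evening")

# "committee" (index 1) is more than 4 tokens away from the word that
# changed ("Monday" vs "Friday", index 10), so its row in doc.tensor
# should come out the same in both documents.
print(numpy.allclose(doc1.tensor[1], doc2.tensor[1]))  # True
```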

So on the one hand, you'll be able to pass documents of a few thousand words into spaCy and it will process them without trouble. But it's not taking particular advantage of the long context, and long documents are likely to be harder to work with in the annotation tool.
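
For instance, something like this works fine as-is; the only hard limit is `nlp.max_length` (1,000,000 characters by default), which is there to guard memory use rather than model quality:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# A multi-thousand-word document in a single call, just for illustration.
long_text = " ".join(["This is a filler sentence for the demo."] * 500)
doc = nlp(long_text)
print(len(doc), "tokens; character limit:", nlp.max_length)
```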

Our rule of thumb is that if you need more than a paragraph of context to make the decision, the machine learning models will probably struggle anyway. Also, for longer texts you can likely write a heuristic that divides the text into paragraphs or sections quite reliably. Long documents tend to come in fairly regular formats, so you can usually find a way to segment them losslessly, along the lines of the sketch below.
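
Here's a minimal version of that idea, assuming paragraphs are separated by blank lines (the file name is just a placeholder): segment first, then feed the pieces to `nlp.pipe` so they're processed as a batched stream.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def split_paragraphs(text):
    # "\n\n".join(parts) recovers the original text exactly,
    # so this segmentation is lossless.
    return text.split("\n\n")

with open("long_document.txt") as f:  # hypothetical input file
    paragraphs = split_paragraphs(f.read())

for doc in nlp.pipe(paragraphs):
    print(len(doc), "tokens:", doc.text[:40], "...")
```

If your documents use section headings or numbered clauses instead of blank lines, the same pattern applies: swap in whatever delimiter your format guarantees.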