Working with longer texts

Hi Paul,

The CNN encodes 4 words of context on either side of each token, so words more than 4 positions away have no effect on a token's vector. This does allow one convenience, though: it makes it relatively easy to support longer documents, because the token-vector encoding can be computed largely in parallel across the document.
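
Here's a quick sketch of what that locality means in practice, assuming `en_core_web_sm` is installed and that its `tok2vec` component fills `doc.tensor` (which it does in recent releases): change a word far away from a target token and the target's encoding shouldn't move.

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_sm")

doc1 = nlp("The committee approved the budget after a long debate on Monday evening")
doc2 = nlp("The committee approved the budget after a long debate on Friday evening")

# "committee" (index 1) is more than 4 tokens away from the word that
# changed ("Monday" vs "Friday", index 10), so its row in doc.tensor
# should come out the same in both documents.
print(numpy.allclose(doc1.tensor[1], doc2.tensor[1]))  # True
```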

So on the one hand, you'll be able to pass documents of a few thousand words into spaCy and it will process them without trouble. But it's not taking particular advantage of the long context, and long documents are likely to be harder to work with in the annotation tool.
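
For instance, something like this works fine as-is; the only hard limit is `nlp.max_length` (1,000,000 characters by default), which is there to guard memory use rather than model quality:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# A multi-thousand-word document in a single call, just for illustration.
long_text = " ".join(["This is a filler sentence for the demo."] * 500)
doc = nlp(long_text)
print(len(doc), "tokens; character limit:", nlp.max_length)
```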

Our rule of thumb is that if you need more than a paragraph of context to make the decision, the machine learning models will probably struggle anyway. Also, for longer texts you can likely write a heuristic that divides the text into paragraphs or sections quite reliably. Long documents tend to come in fairly regular formats, so you can usually find a way to segment them losslessly, along the lines of the sketch below.
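
Here's a minimal version of that idea, assuming paragraphs are separated by blank lines (the file name is just a placeholder): segment first, then feed the pieces to `nlp.pipe` so they're processed as a batched stream.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def split_paragraphs(text):
    # "\n\n".join(parts) recovers the original text exactly,
    # so this segmentation is lossless.
    return text.split("\n\n")

with open("long_document.txt") as f:  # hypothetical input file
    paragraphs = split_paragraphs(f.read())

for doc in nlp.pipe(paragraphs):
    print(len(doc), "tokens:", doc.text[:40], "...")
```

If your documents use section headings or numbered clauses instead of blank lines, the same pattern applies: swap in whatever delimiter your format guarantees.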