I think the problem isn’t that all documents are processed at once – internally, Prodigy works with generators to process the stream, so examples are processed as they come in.
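Just to illustrate: a stream is simply a generator that yields one task dict at a time, so nothing forces the whole corpus into memory at once. A minimal sketch (the file name is just a placeholder):

```python
import json

def stream_from_jsonl(path):
    # Yield one example at a time – this is all a stream needs to be
    with open(path, encoding="utf8") as f:
        for line in f:
            example = json.loads(line)
            yield {"text": example["text"]}
```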
It seems like the difficulty here is that `ner.teach` uses beam search, so it’s trying to find the best parses for each 20k-token document – instead of only the one best parse, like `ner.pipe`. At each step, it keeps a beam of states and has to create a new state for the next token. That state representation involves a copy that’s sensitive to the length of the text. On normal-length documents, this is such a small overhead that in practice, the time is still linear. But with 20k-token documents, the copy starts to dominate: if each copy is roughly proportional to the document length, the total work grows roughly quadratically with the number of tokens instead of linearly.
In terms of accuracy, beam search also won’t perform as well on long documents, because the number of candidates per word that you’re considering (relative to the whole document) is very small. This is also one of the reasons Prodigy tries to keep the task text small (in addition to the advantages for the human annotator, like keeping focused and moving fast).
When you mentioned that the full context is important, I didn’t expect it to be that much context. I know that legal texts are pretty tricky in that respect – but the problem is, if it really is true that the annotator can’t make the decision from one or two sentences, the model is also much less likely to learn anything meaningful from the annotations.
Some ideas for solutions:
- For your use case, it might be better to start off collecting annotations with `ner.match`, which only uses the pattern matcher and will be much faster. If there’s a match, you could also truncate the text around the match, to at least exclude some parts of the full document (there’s a rough sketch of this after the list).
- Once you’ve collected a bunch of annotations from the patterns, you can pre-train a model, parse the text with spaCy, extract the predictions and annotate them statically using `mark` (see the second sketch below).
- Maybe you can think of a creative way to pre-process your documents to shorten them, or remove text that you can definitely exclude?
- This could even mean training a model to help you with pre-processing or shortening the documents – for example, a per-sentence text classifier that predicts whether a sentence is worth keeping (the last sketch below shows the idea). This is actually not a weird workflow at all, and chaining models like this is something we often recommend for more complex use cases.
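For the truncation idea, here’s a rough sketch of a pre-processing step, assuming a recent spaCy – the patterns and the window size are just placeholders. It keeps a fixed window of tokens around each pattern match and yields those shorter texts as tasks instead of the full document:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# placeholder patterns – use whatever terms your ner.match patterns target
matcher.add("TERMS", [nlp.make_doc("force majeure"), nlp.make_doc("governing law")])

WINDOW = 100  # number of tokens to keep on each side of a match

def truncated_tasks(texts):
    for doc in nlp.pipe(texts):
        for match_id, start, end in matcher(doc):
            # keep only the context window around the match
            span = doc[max(0, start - WINDOW): min(len(doc), end + WINDOW)]
            yield {"text": span.text}
```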
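For the second idea, once you have a pre-trained model, you could export its predictions as pre-annotated tasks and review them with `mark`. A minimal sketch, assuming a model trained from your pattern-based annotations (the model name and file paths are placeholders):

```python
import spacy
import srsly

nlp = spacy.load("my_pretrained_model")  # placeholder for your own model

def tasks_with_predictions(texts):
    for doc in nlp.pipe(texts):
        # convert the model's entity predictions to Prodigy-style spans
        spans = [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ]
        yield {"text": doc.text, "spans": spans}

texts = (line["text"] for line in srsly.read_jsonl("documents.jsonl"))
srsly.write_jsonl("predictions.jsonl", tasks_with_predictions(texts))
```

You could then run something like `prodigy mark your_dataset predictions.jsonl --view-id ner` to accept or reject the pre-highlighted examples.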
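And for the per-sentence classifier, the pre-processing could look roughly like this, using spaCy v3-style APIs – the model name and the RELEVANT label are just assumptions for illustration:

```python
import spacy

nlp_sent = spacy.blank("en")
nlp_sent.add_pipe("sentencizer")          # only used to split sentences
nlp_cat = spacy.load("my_textcat_model")  # placeholder for your text classifier

def shorten(text, threshold=0.5):
    # keep only the sentences the classifier considers relevant
    sentences = [sent.text for sent in nlp_sent(text).sents]
    kept = [
        sent
        for sent, doc in zip(sentences, nlp_cat.pipe(sentences))
        if doc.cats.get("RELEVANT", 0.0) >= threshold
    ]
    return " ".join(kept)
```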