Prodigy crashes on large documents

I think the problem isn’t that all documents are processed at once – internally, Prodigy works with generators to process the stream, so examples are processed as they come in.

It seems like the difficulty here is that ner.teach uses beam search, so it's trying to find the best parses for each 20k-token document, instead of only the one best parse like ner.pipe. At each step, it has a beam of states, and it has to create a new state for the next token. That state representation involves a copy whose cost is sensitive to the length of the text. On normal-length documents, this is such a small overhead that in practice, the time is still linear. But with 20k-token documents, the time complexity becomes non-linear, because the copy starts to dominate.
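To make the scaling argument concrete, here's a rough back-of-the-envelope sketch, not Prodigy's actual internals and with made-up constants, of why a per-step copy that scales with the document length turns linear parsing time into roughly quadratic time:

```python
# Toy cost model, not Prodigy's real implementation: the constants are
# invented, only the shape of the growth matters.

def greedy_cost(n_tokens, step_cost=1.0):
    # One state, constant work per token: total work is O(n).
    return n_tokens * step_cost

def beam_cost(n_tokens, beam_width=16, copy_cost_per_token=0.01):
    # Each step copies every state in the beam, and the copy scales with
    # the document length, so total work grows roughly like n ** 2.
    per_step = beam_width * copy_cost_per_token * n_tokens
    return n_tokens * per_step

for n in (1_000, 5_000, 20_000):
    print(f"{n:>6} tokens  greedy={greedy_cost(n):>12,.0f}  beam={beam_cost(n):>14,.0f}")
```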

In terms of accuracy, beam search also won’t perform as well on long documents, because the number of candidates per word that you’re considering (relative to the whole document) is very small. This is also one of the reasons Prodigy tries to keep the task text small (in addition to the advantages for the human annotator, like keeping focused and moving fast).

When you mentioned that the full context is important, I didn’t expect it to be that much context. I know that legal texts are pretty tricky in that respect – but the problem is, if it really is true that the annotator can’t make the decision from one or two sentences, the model is also much less likely to learn anything meaningful from the annotations.

Some ideas for solutions:

  • For your use case, it might be better to start off collecting annotations with ner.match, which only uses the pattern matcher and will be much faster. If there's a match, you could also truncate the text around the match, to at least exclude some parts of the full document (see the first sketch after this list).
  • Once you’ve collected a bunch of annotations from the patterns, you can pre-train a model, parse the text with spaCy, extract the predictions and annotate them statically using mark (see the second sketch after this list).
  • Maybe you can think of a creative way to pre-process your documents to shorten them, or remove text that you can definitely exclude?
  • This could even mean training a model to help you with pre-processing or shortening the documents. For example, a per-sentence text classifier that predicts whether a sentence is relevant enough to keep (see the third sketch after this list). This is actually not a weird workflow at all, and chaining models like this is something we often recommend for more complex use cases.
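For the first idea, here's a rough sketch of what truncating the text around a pattern match could look like (using spaCy's v3 Matcher API; the pattern and the 100-token window are just placeholders for your own patterns and whatever context size works for your annotators):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Placeholder pattern: swap in the patterns you're using with ner.match
matcher.add("CONTRACT_TERM", [[{"LOWER": "force"}, {"LOWER": "majeure"}]])

def truncated_tasks(texts, window=100):
    """Yield one shortened task per pattern match, keeping `window` tokens
    of context on each side instead of the full 20k-token document."""
    for doc in nlp.pipe(texts):
        for match_id, start, end in matcher(doc):
            span = doc[max(0, start - window):min(len(doc), end + window)]
            yield {"text": span.text, "meta": {"pattern": nlp.vocab.strings[match_id]}}
```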
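For the second idea, the pre-trained model's predictions just need to be written out as Prodigy-style tasks ("text" plus "spans" with character offsets and labels) so you can review them statically. The model path and file names below are placeholders:

```python
import json
import spacy

# Placeholder: model pre-trained on your pattern-based annotations
nlp = spacy.load("./pretrained_ner_model")

def predictions_to_tasks(texts):
    # Convert each document's predicted entities into the span format
    # Prodigy's annotation tasks expect.
    for doc in nlp.pipe(texts):
        spans = [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ]
        yield {"text": doc.text, "spans": spans}

texts = ["...your (shortened) document texts..."]
with open("predictions.jsonl", "w", encoding="utf8") as f:
    for task in predictions_to_tasks(texts):
        f.write(json.dumps(task) + "\n")
```

You could then load that predictions.jsonl into the mark recipe (e.g. with --view-id ner) and click through the static suggestions.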
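And for the last idea, the pre-processing step could be as simple as splitting the document into sentences and only keeping the ones a sentence-level text classifier scores as relevant. The model path and the RELEVANT label here are hypothetical, this just shows the shape of the workflow (spaCy v3 API):

```python
import spacy

# Hypothetical text classifier trained to score whether a sentence is worth keeping
textcat = spacy.load("./sentence_relevance_model")
# Separate lightweight pipeline just for sentence splitting
splitter = spacy.blank("en")
splitter.add_pipe("sentencizer")

def shorten(text, threshold=0.5):
    sentences = [sent.text for sent in splitter(text).sents]
    kept = [
        sent for sent, doc in zip(sentences, textcat.pipe(sentences))
        if doc.cats.get("RELEVANT", 0.0) >= threshold  # hypothetical label name
    ]
    return " ".join(kept)
```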