Annotate passages in long documents

I'm building a classification model for documents. Each document has on average 4 pages (20 paragraphs). Within each document there are 3 to 5 key passages / paragraphs.

We'd like to annotate all the passages of a document, splitting them into "Introduction", "Body", "Key Passage" and "Conclusion". From the demos, it seems to me that span categorization would be the closest to what we want to achieve. But span categorization seems better suited to categorizing phrases within a paragraph, while we need to categorize paragraphs within a document. The key passages are strictly tied to each document: we cannot classify a paragraph individually and independently of the whole document.

Is Prodigy suitable for this task?

hi @mchavesmartins!

Thanks for your question and welcome to the Prodigy community!

Can you clarify what you mean by "passages"? Unlike sentences or paragraphs, which are well defined in language, the term "passage" is a bit vague. Could you replace "passage" with "span", that is, a subset of words within a sentence? Or could a "passage" also be multiple sentences and/or multiple paragraphs?

If it is a span, then yes, perhaps span categorization would make sense. However, it seems you're concerned that the model wouldn't do well because you believe it would need more context from other paragraphs, correct?

Have you seen this section of our text classification documentation?

This documentation outlines our general philosophy of advocating breaking up tasks into smaller tasks. It also addresses the related question: "But what if annotation requires context from a few paragraphs earlier?". One key point to highlight is:

However, if you have an annotation task where the annotator really needs to see the whole document to make a decision, that’s often a sign that your text classification model might struggle. Current technologies struggle to put together information across sentences in complex ways.

Here's a post from Matt that explains this a little more.
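As a rough illustration of breaking a document into smaller tasks, here is a minimal sketch that turns one document into one annotation task per paragraph. The "text" and "meta" keys follow Prodigy's usual task format; the function name `paragraph_tasks`, the blank-line paragraph splitting, and the meta field names are assumptions made for this example, not something Prodigy prescribes.

```python
import json


def paragraph_tasks(doc_id, text):
    """Yield one task dict per paragraph, recording where it sits in the document.

    Assumes paragraphs are separated by blank lines (hypothetical convention).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    total = len(paragraphs)
    for index, paragraph in enumerate(paragraphs, start=1):
        yield {
            "text": paragraph,
            # Position info lets the annotator see roughly where the paragraph falls.
            "meta": {"doc_id": doc_id, "paragraph": index, "of": total},
        }


if __name__ == "__main__":
    sample = "An opening paragraph.\n\nA body paragraph.\n\nA closing paragraph."
    for task in paragraph_tasks("doc-001", sample):
        # One JSON object per line, i.e. a JSONL file you could load as a source.
        print(json.dumps(task))
```

Keeping the document ID and paragraph position in the meta means the annotator still gets a hint about the surrounding document without having to read all of it.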

Another important point is that the longer the document you show to annotators, the more complex annotation tends to become. We have additional documentation on how you can customize the UI to accommodate longer documents (e.g., expanding the card size).

Your question comes up a lot, so you may find past posts on "long documents" helpful as well. Also, since the components Prodigy trains out of the box (e.g., spancat and textcat) are really spaCy components, you may find helpful discussions on the spaCy discussion forum, like this one.

Typically, with longer documents, I take the approach of breaking things down to the sentence level and classifying there, then aggregating those predictions to the whole-document level. The assumption is that there is usually at least one sentence that indicates how a document should be classified. You could also use the predicted probabilities of each sentence to do something more complicated when labeling the entire document.
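The sentence-level approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `aggregate_document`, the max-over-sentences rule, and the 0.5 threshold are all hypothetical choices. The per-sentence scores are plain label-to-probability dicts, which is the shape spaCy's text classifier stores in `doc.cats`.

```python
from collections import defaultdict


def aggregate_document(sentence_scores, threshold=0.5):
    """Combine per-sentence label probabilities into document-level labels.

    sentence_scores: list of dicts mapping label -> probability, one per sentence.
    A label is assigned to the document if at least one sentence scores it
    at or above the threshold (i.e. the document score is the max over sentences).
    """
    doc_scores = defaultdict(float)
    for scores in sentence_scores:
        for label, prob in scores.items():
            doc_scores[label] = max(doc_scores[label], prob)
    labels = sorted(label for label, score in doc_scores.items() if score >= threshold)
    return labels, dict(doc_scores)


if __name__ == "__main__":
    # Made-up scores for a three-sentence document, purely for demonstration.
    scores = [
        {"KEY_PASSAGE": 0.10, "OTHER": 0.90},
        {"KEY_PASSAGE": 0.85, "OTHER": 0.15},
        {"KEY_PASSAGE": 0.30, "OTHER": 0.70},
    ]
    labels, doc_scores = aggregate_document(scores)
    print(labels, doc_scores)
```

Taking the maximum matches the assumption that a single strong sentence is enough to label the document. If that turns out to be too trigger-happy on your data, you could swap in a mean or require several sentences above the threshold.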

Last, one of the core design philosophies of Prodigy is that some of the best NLP solutions aren't cookie-cutter templates. They are context dependent. Prodigy's goal is to get you started on testing and iterating on your ideas as quickly as possible. If I were in your position with Prodigy, I would experiment with both. I would annotate a few hundred examples using both TextCat and SpanCat, then train a model on each to see which performs better. If set up right, you could likely do it in an afternoon with a train and/or train-curve loop. You may find out within hours (not days or weeks) which approach is most promising on your unique data before commissioning thousands of annotations for either one.

Hope this helps and happy to answer further questions you may have!