Hi @grahama!
Thanks for your question, and welcome to the Prodigy community!
Can you provide some context on the problem you're trying to solve? For example:
- what format (e.g., raw text, JSON/XML, Word docs, PDFs) do you receive the data in?
- is it consistently structured (e.g., the same titles/sections every time), or can it vary?
- are there any known organizational conventions in the docs (e.g., legal docs that follow certain sectioning guidelines)?
- who is annotating the data? Only you, or do you have other annotators? Are the annotators experts (i.e., they don't need a lot of annotation guidance) or non-experts who will need lots of guidance?
In general, Prodigy works best when you break documents down into the smallest possible unit of analysis. We have some advice on this in our docs (here and here):
> For NER annotation, there's often no benefit in annotating long documents at once, especially if you're planning on training a model on the data. Annotating with a model in the loop is also much faster if the texts aren't too long, which is why recipes like `ner.teach` and `ner.correct` split sentences by default. NER model implementations also typically use a narrow contextual window of a few tokens on either side. If a human annotator can't make a decision based on the local context, the model will struggle to learn from the data.
> If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
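To make that concrete, here's a minimal sketch of what that pre-splitting step could look like before loading the data into Prodigy. It assumes your source data is a JSONL file with `text` and `id` fields; the file names and `meta` keys are just illustrative, so adapt them to your own format:

```python
# A minimal sketch (not an official Prodigy recipe) of pre-splitting long
# documents into paragraph-sized tasks before loading them into Prodigy.
# "documents.jsonl" and its "text"/"id" keys are illustrative assumptions.
import json

def split_into_paragraphs(doc_text):
    # Split on blank lines; adjust this for your own document formatting.
    return [p.strip() for p in doc_text.split("\n\n") if p.strip()]

with open("documents.jsonl", encoding="utf8") as infile, \
     open("paragraphs.jsonl", "w", encoding="utf8") as outfile:
    for line in infile:
        doc = json.loads(line)
        for i, paragraph in enumerate(split_into_paragraphs(doc["text"])):
            # Keep a pointer back to the source document so annotations
            # can be reviewed or re-assembled at the document level later.
            task = {
                "text": paragraph,
                "meta": {"doc_id": doc.get("id"), "paragraph": i},
            }
            outfile.write(json.dumps(task) + "\n")
```

The resulting `paragraphs.jsonl` can then be passed to a recipe like `ner.manual`, and the `meta` block keeps every task traceable back to its source document.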
As you may have seen, there are lots of other posts on the forum about annotating longer texts:
Here's a related post from spaCy GitHub on textcat vs spancat for documents:
If you're the only one annotating, you can be a bit more generous and lean toward labeling longer documents.
However, if you have multiple non-expert annotators, I'd err on the side of the simplest possible decisions. I say this because we sometimes see users decide it's not worth the effort to break down their docs and hand everything to the annotators as-is, which makes for a poor annotation experience. That in turn leads to poor annotations and possibly wasted time: teams scale up to thousands of annotations only to realize those annotations don't help model training much.