Hello. I'm very new to Prodigy and spaCy in general.
I need some advice on how to structure a chunk of text for annotation. Most of my chunks are sections like the one below.
Question: Using SpanCat, is it better to annotate the entire section chunk within Prodigy? I could label sections, subjects, predicates, steps, table rows, etc., all in one go, in a big section.
Or, should I break the section text below into lines and annotate them separately?
Or does it matter at all, since I'm assuming spaCy will chunk the text later however it likes?
Personally, I'd prefer to annotate the entire section, but I don't want to go down any bad paths given my lack of experience with Prodigy.
Thanks in advance and I look forward to getting some good use out of Prodigy!
Example Section
3.1 Section Thing [Section]
This is some Text. Followed by More Text [text]
Table 1 Title [title of table]
red, green, yellow [header]
32, 46, 78 [data]
Component [subsection title]
Some text about the subsection [text]
If X is False, do the following steps in order: [precondition]
Thanks for your question and welcome to the Prodigy community!
Can you provide some context on the problem you're trying to solve? For example:
what format (e.g., raw text, json/xml, word docs, pdf) do you receive the data in?
is it consistently in the same format like titles/sections or can it vary?
are there any known organizational conventions for the docs (like legal docs that follow certain sectioning guidelines)?
who is annotating the data? only you or do you have other annotators? are annotators experts (i.e., they don't need a lot of annotation guidance) or non-experts that will need lots of guidance?
Overall, Prodigy works best when you break documents down into the smallest unit of analysis possible. We have some advice on this in our docs (here and here):
For NER annotation, there’s often no benefit in annotating long documents at once, especially if you’re planning on training a model on the data. Annotating with a model in the loop is also much faster if the texts aren’t too long, which is why recipes like ner.teach and ner.correct split sentences by default. NER model implementations also typically use a narrow contextual window of a few tokens on either side. If a human annotator can’t make a decision based on the local context, the model will struggle to learn from the data.
If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
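To make the "smaller sections" advice concrete, here's a minimal sketch of a preprocessing step: split raw text on blank lines and emit one Prodigy-style task per paragraph as JSONL. The function name and the `meta` fields are illustrative choices, not a Prodigy API — but the `{"text": ..., "meta": ...}` task shape is what Prodigy's recipes expect.

```python
import json

def paragraphs_to_tasks(text, source="example-doc"):
    """Split raw text on blank lines; one Prodigy-style task dict per paragraph."""
    tasks = []
    for i, para in enumerate(text.split("\n\n")):
        para = para.strip()
        if not para:
            continue  # skip runs of blank lines
        tasks.append({"text": para, "meta": {"source": source, "paragraph": i}})
    return tasks

doc = "3.1 Section Thing\nThis is some text.\n\nComponent\nSome text about the subsection."
for task in paragraphs_to_tasks(doc):
    print(json.dumps(task))  # one JSON object per line = JSONL input for Prodigy
```

You could pipe the output to a `.jsonl` file and load it with a recipe like `spans.manual`, keeping each annotation task small and reviewable.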
As you may have seen too, there are lots of other posts relating to longer text:
Here's a related post from spaCy GitHub on textcat vs spancat for documents:
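Since the question mentions SpanCat specifically, here's a minimal sketch of what an annotated task with overlapping spans could look like. Prodigy span annotations use character offsets (`start`, `end`, `label`); the labels themselves are made up for illustration. Overlap is the key difference: spancat allows it, NER does not.

```python
# A hypothetical annotated task. Spancat (unlike NER) allows overlapping spans,
# so both a whole table row and an individual cell can carry labels.
task = {
    "text": "red, green, yellow",
    "spans": [
        {"start": 0, "end": 18, "label": "HEADER"},  # the whole row
        {"start": 0, "end": 3,  "label": "COLOR"},   # "red" overlaps the row span
    ],
}

# Sanity-check that each offset pair points at the intended substring
for span in task["spans"]:
    print(repr(task["text"][span["start"]:span["end"]]), "->", span["label"])
```

If your labels never overlap and always align to local context, plain NER may be simpler; overlapping or nested labels like the section/row/cell structure in your example are where spancat earns its keep.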
If you're the only one annotating, then you can be a little more generous and label longer chunks.
However, if you have multiple non-expert annotators, I'd err on the side of the simplest decisions possible. I say this because we sometimes see users decide it's not worth the effort to break down their docs and hand everything to the annotators as-is, which makes for a poor annotation UX. That leads to poor annotations, and possibly wasted time: they scale up to thousands of annotations only to realize those annotations don't help model training much.