New to Prodigy: Annotation Structure Advice (Big Section of Text vs Separating Sentences)

grahama · November 17, 2023, 5:39pm

Hello. I'm very new to Prodigy and spaCy in general.
I need some advice on how to structure a chunk of text for annotation. Most of my chunks are sections like the below.

Question: Using SpanCat, is it better to annotate the entire section chunk within Prodigy? I could label sections, subjects, predicates, steps, table rows, etc., all in one go, in one big section.

Or, should I break the section text below into lines and annotate them separately?

Or does it matter, since I'm assuming spaCy will chunk them later into whatever it likes?

Personally, I'd prefer to annotate the entire section, but I don't want to go down any bad paths given my lack of experience with Prodigy.

Thanks in advance and I look forward to getting some good use out of Prodigy!

Example Section

3.1 Section Thing [Section]
This is some Text. Followed by More Text [text]
Table 1 Title [title of table]
red, green, yellow [header]
32, 46, 78 [data]

Component [subsection title]
Some text about the subsection [text]
If X is False, do the following steps in order: [precondition]

jump for joy [step]
explode [step]
reanimate [step]

Footnote 1 [external reference]
Footnote 2 [external reference]
Footnote 3 [external reference]

3.2 Next Section [section title]
...

ryanwesslen · November 20, 2023, 1:50pm

Hi @grahama!

Thanks for your question and welcome to the Prodigy community!

Can you provide some context on the problem you're trying to solve? For example:

What format (e.g., raw text, JSON/XML, Word docs, PDF) do you receive the data in?
Is it consistently in the same format, like titles/sections, or can it vary?
Are there any known organizational conventions for the docs (like legal docs that follow certain sectioning guidelines)?
Who is annotating the data? Only you, or do you have other annotators? Are the annotators experts (i.e., they don't need a lot of annotation guidance) or non-experts who will need lots of guidance?

In general, Prodigy is best suited to breaking documents down into the smallest possible unit of analysis. We have some advice in our docs (here and here) on this:

For NER annotation, there’s often no benefit in annotating long documents at once, especially if you’re planning on training a model on the data. Annotating with a model in the loop is also much faster if the texts aren’t too long, which is why recipes like ner.teach and ner.correct split sentences by default. NER model implementations also typically use a narrow contextual window of a few tokens on either side. If a human annotator can’t make a decision based on the local context, the model will struggle to learn from the data.

If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
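
As a rough illustration of that advice, here is a minimal sketch of splitting one section into paragraph-sized chunks before loading them into Prodigy. It assumes chunks are separated by blank lines, which may not hold for your documents. The function name `section_to_tasks`, the `source_id` argument, and the `source`/`chunk` keys inside `meta` are made up for this example; only the top-level `"text"` and `"meta"` keys follow Prodigy's JSONL input format.

```python
import json
import re


def section_to_tasks(section_text, source_id):
    """Split a section on blank lines and build one task dict per chunk.

    Hypothetical helper: the blank-line rule is an assumption about the data.
    """
    # Split wherever one or more blank (or whitespace-only) lines occur.
    chunks = [chunk.strip() for chunk in re.split(r"\n\s*\n", section_text)]
    tasks = []
    for index, chunk in enumerate(c for c in chunks if c):
        tasks.append({
            "text": chunk,
            # "meta" is shown alongside the task, so the annotator can
            # trace each chunk back to the section it came from.
            "meta": {"source": source_id, "chunk": index},
        })
    return tasks


section = """3.1 Section Thing
This is some Text. Followed by More Text

Component
Some text about the subsection

Footnote 1"""

tasks = section_to_tasks(section, "doc-001")

# One JSON object per line, i.e. the JSONL layout Prodigy reads.
jsonl = "\n".join(json.dumps(task) for task in tasks)
print(jsonl)
```

Saving the printed lines to a `.jsonl` file gives one annotation task per chunk rather than one per section. If a chunk is still too long to judge comfortably, the same idea works with a finer split, such as one task per line.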

As you may have seen too, there are lots of other posts relating to longer text:

Here's a related post from spaCy GitHub on textcat vs spancat for documents:

If you're the only one annotating, then you can be a little more generous and label longer documents.

However, if you have multiple non-expert annotators, I'd err on the side of the simplest decisions possible. I say this because we sometimes see users who decide it's not worth the effort to break down their docs and instead hand everything to the annotators, which makes for a poor annotation experience. That leads to poor annotations and possibly wasted time: they scale up annotators to do thousands of annotations, only to realize that those annotations don't aid model training much.

grahama · November 20, 2023, 1:55pm

Thank you, that helps quite a bit.
