Document-level annotations with Prodigy

Hi.

I'd like to use Prodigy to annotate a new NER corpus, and I would like the annotation to be in standoff format, like all the existing corpora I am working with.

This means that I would like my Prodigy annotations to be at the document-level, i.e. entity offsets are relative to the beginning of the document, not of the sentence/paragraph.

Is this possible? Is this desirable? I mean, I already trained a spaCy transformer model with document-level annotations, i.e. by calling nlp on the whole document, and the results are pretty good, but this doesn't seem to be spaCy's philosophy, and I'm wondering if there could be undesirable side effects...

Hi! There's definitely no conflict with spaCy's philosophy here – in fact, Doc stands for "document" so the idea is that a Doc object should contain a logical unit of text, i.e. a document or a paragraph. If you're also training a component to predict sentence boundaries, you typically want to train on longer documents with multiple sentences.

For Prodigy, we typically recommend working with shorter fragments like sentences and paragraphs because it makes it easier for the annotator to focus, and it can increase annotation speed and reduce mistakes. For many types of annotations, it's also helpful to annotate with a similar "context window" as the model you're planning on training, because you'll be able to spot and resolve potential problems early on. (For example, if the context of a sentence or paragraph isn't enough to make an annotation decision, an entity recognizer that looks at 4 tokens on either side will likely struggle to generalise.)

That said, you don't have to do it like that: you can also leave sentence segmentation disabled and annotate one document at a time. See here for an example: https://prodi.gy/docs/named-entity-recognition#long-text

Alternatively, if you do prefer annotating at the paragraph level but want the resulting offsets to refer to the document, you could just store the document-level offsets of each slice with the JSON objects you're streaming in. So in post-processing, it's just one simple calculation: span["start"] + paragraph_offset etc.
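To make this concrete, here's a minimal sketch of what streaming in paragraph slices with their document-level offsets could look like. The key name "document_offset_start" is just an example (any custom key works, it's not a Prodigy API), and the paragraph splitting on blank lines is an assumption about the input format:

```python
# Hypothetical sketch: slice a document into paragraph-level tasks,
# storing each slice's document-level offset so spans can be mapped
# back in post-processing. "document_offset_start" is an example key.
import json

def paragraphs_with_offsets(document):
    """Yield one task dict per paragraph, with its offset in the document."""
    offset = 0
    for para in document.split("\n\n"):
        yield {"text": para, "document_offset_start": offset}
        offset += len(para) + 2  # account for the "\n\n" separator

doc = "First paragraph.\n\nSecond paragraph with an entity."
tasks = list(paragraphs_with_offsets(doc))
print(json.dumps(tasks[1]))
# the second task records that its text starts at offset 18 of the document
```

Each line of the resulting JSONL then carries everything needed to reconstruct standoff offsets later.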


Hi @ines, thank you for this clarification.

How can I get the slice offsets? Can I save them as part of the JSONL file containing the paragraph-level annotations, or is there a way to get them at post-processing time?

Yes, any additional properties you add to the incoming JSON will be passed through and saved with the data. So you could add keys like "document_offset_start" (or whatever you prefer to call it), and even include other meta information like the document ID. When you export the data after annotation, those custom keys will be present in the data.
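For completeness, the post-processing step could then look something like this sketch. It assumes each exported task carries a custom "document_offset_start" key as described above, plus the standard "spans" list with "start"/"end" character offsets:

```python
# Hypothetical post-processing sketch: shift exported paragraph-level
# span offsets back to document-level (standoff) offsets, using the
# custom "document_offset_start" key stored with each task.
def to_document_offsets(task):
    """Return the task's spans with offsets relative to the document start."""
    offset = task.get("document_offset_start", 0)
    return [
        {**span, "start": span["start"] + offset, "end": span["end"] + offset}
        for span in task.get("spans", [])
    ]

task = {
    "text": "Second paragraph with an entity.",
    "document_offset_start": 18,
    "spans": [{"start": 0, "end": 6, "label": "ORDINAL"}],
}
print(to_document_offsets(task))
# the span's offsets are now relative to the whole document, not the paragraph
```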