Document-level annotations with Prodigy

Hi.

I'd like to use Prodigy to annotate a new NER corpus, and I would like the annotation to be in standoff format, like all the existing corpora I am working with.

This means that I would like my Prodigy annotations to be at the document-level, i.e. entity offsets are relative to the beginning of the document, not of the sentence/paragraph.

Is this possible? Is this desirable? I mean, I already trained a spaCy transformer model with document-level annotations, i.e. by calling nlp on the whole document, and the results are pretty good, but this doesn't seem to be spaCy's philosophy, and I'm wondering if there could be undesirable side effects...

Hi! There's definitely no conflict with spaCy's philosophy here – in fact, Doc stands for "document" so the idea is that a Doc object should contain a logical unit of text, i.e. a document or a paragraph. If you're also training a component to predict sentence boundaries, you typically want to train on longer documents with multiple sentences.

For Prodigy, we typically recommend working with shorter fragments like sentences and paragraphs because it makes it easier for the annotator to focus, and it can increase annotation speed and reduce mistakes. For many types of annotations, it's also helpful to annotate with a similar "context window" as the model you're planning on training, because you'll be able to spot and resolve potential problems early on. (For example, if the context of a sentence or paragraph isn't enough to make an annotation decision, an entity recognizer that looks at 4 tokens on either side will likely struggle to generalise.)

That said, you don't have to do it like that: you can also leave sentence segmentation disabled and annotate one document at a time. See here for an example: https://prodi.gy/docs/named-entity-recognition#long-text

Alternatively, if you do prefer annotating at the paragraph level but want the resulting offsets to refer to the document, you could just store the document-level offsets of each slice with the JSON objects you're streaming in. So in post-processing, it's just one simple calculation: span["start"] + paragraph_offset etc.
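To make this concrete, here's a minimal sketch of what streaming in paragraph slices with their document-level offsets could look like. The key name "document_offset_start" is just an example (any custom key works, it's not a Prodigy API), and the paragraph splitting on blank lines is an assumption about the input format:

```python
# Hypothetical sketch: slice a document into paragraph-level tasks,
# storing each slice's document-level offset so spans can be mapped
# back in post-processing. "document_offset_start" is an example key.
import json

def paragraphs_with_offsets(document):
    """Yield one task dict per paragraph, with its offset in the document."""
    offset = 0
    for para in document.split("\n\n"):
        yield {"text": para, "document_offset_start": offset}
        offset += len(para) + 2  # account for the "\n\n" separator

doc = "First paragraph.\n\nSecond paragraph with an entity."
tasks = list(paragraphs_with_offsets(doc))
print(json.dumps(tasks[1]))
# the second task records that its text starts at offset 18 of the document
```

Each line of the resulting JSONL then carries everything needed to reconstruct standoff offsets later.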


Hi @ines, thank you for this clarification.

How can I get the slice offsets? Can I save them as part of the JSONL file containing the paragraph-level annotations, or is there a way to get them at post-processing time?

Yes, any additional properties you add to the incoming JSON will be passed through and saved with the data. So you could add keys like "document_offset_start" (or whatever you prefer to call it), and even include other meta information like the document ID. When you export the data after annotation, those custom keys will be present in the data.
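For completeness, the post-processing step could then look something like this sketch. It assumes each exported task carries a custom "document_offset_start" key as described above, plus the standard "spans" list with "start"/"end" character offsets:

```python
# Hypothetical post-processing sketch: shift exported paragraph-level
# span offsets back to document-level (standoff) offsets, using the
# custom "document_offset_start" key stored with each task.
def to_document_offsets(task):
    """Return the task's spans with offsets relative to the document start."""
    offset = task.get("document_offset_start", 0)
    return [
        {**span, "start": span["start"] + offset, "end": span["end"] + offset}
        for span in task.get("spans", [])
    ]

task = {
    "text": "Second paragraph with an entity.",
    "document_offset_start": 18,
    "spans": [{"start": 0, "end": 6, "label": "ORDINAL"}],
}
print(to_document_offsets(task))
# the span's offsets are now relative to the whole document, not the paragraph
```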