Reviewing NER annotations for long documents

Hello!

:world_map: What are you trying to do with Prodigy?
I'm back with my long-document problems. We work with documents that are quite long (French legal decisions), and we have multiple expert annotators who need context within the decisions to produce accurate annotations. The context is not necessarily long, but it is not well defined: it could be in the same sentence or a few lines below. We therefore decided to annotate the documents as a whole, because that makes sense for the domain experts and eases the process for them, as a decision is a single unit in their world.

We annotated with some overlap (to measure inter-annotator agreement and allow for review of differences), and we would now like to proceed to the review step.

:face_with_raised_eyebrow: Did you find something confusing, disorienting or hard to find?
We have found that the review recipe simply fails (error 500 with no log) when trying to review the long documents. There are rarely more than 10 entities per document, but the documents themselves are quite long.

We are thinking about writing a custom recipe that would show the reviewer only the parts of the document that contain entities, but there are no examples of custom review recipes. Would it be possible to have one, or at least some guidance on how to best approach this issue?

We can cut most of the text based on the placement of the entities, but we couldn't do that reliably before annotating...

Hi @Martin,

It's true that the review interface was designed to show the diff between short snippets, as it renders the entire document once per annotator. I'm not exactly sure why it's returning a 500 (it's probably just a performance issue when trying to render such a huge diff). Even if it did render, it would be pretty unusable, as you can imagine.
Your strategy of splitting the documents into snippets and using those for review makes a lot of sense; in fact, we did something very similar for one of our consulting clients.
Before I get into the details of how you could approach it, I just want to point out that, in general, if annotators need to rely on context that is far away from the entity in question, the NER model will very likely struggle to learn it. The NER model relies on a small context window to make its decisions, so if it's impossible to assign a label based on local context, the problem is probably not a good fit for an NER task.

That said, to address your immediate question, I would suggest moving the splitting and merging of the snippets outside the main recipe and implementing them as pre- and post-processing steps. Concretely:

1. Split the annotated long documents into snippets, preserving the annotations.
2. Perform the review.
3. Merge the reviewed snippets back into documents (if necessary).
To make sure the annotations are preserved and stay in sync with the changing tokenization, you can leverage spaCy's Doc data structure and its methods. All you need is a utility to translate between the Prodigy representation and the spaCy representation, plus logic to split the documents into snippets in whatever way fits your purpose.
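For illustration, here is a minimal sketch of that translation utility (not the gist itself). It assumes a blank French pipeline for tokenization and NER examples in the standard Prodigy format with character-offset `"spans"`:

```python
import spacy

nlp = spacy.blank("fr")  # tokenizer only; swap in your own pipeline if needed

def prodigy_to_doc(example):
    """Turn a Prodigy NER example ({"text": ..., "spans": [...]}) into a spaCy Doc."""
    doc = nlp.make_doc(example["text"])
    ents = []
    for span in example.get("spans", []):
        ent = doc.char_span(
            span["start"], span["end"], label=span["label"], alignment_mode="expand"
        )
        if ent is not None:
            ents.append(ent)
    doc.ents = ents  # assumes non-overlapping entity spans
    return doc

def doc_to_prodigy(doc, **extra):
    """Turn a spaCy Doc back into a Prodigy task dict."""
    return {
        "text": doc.text,
        "spans": [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ],
        **extra,
    }
```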
You would have to adapt it to your purposes, but here's how such a workflow could look:

Once you have saved your annotated dataset, you can use it as input to split_doc.py. This script translates each annotated Prodigy example to a spaCy Doc and then uses placeholder logic to split it into snippets, which are saved to a new Prodigy dataset that can be used as input to review.
Please note that you would have to implement your own make_snippets function, but hopefully this can get you started.
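To make this more concrete, here's a rough sketch of what such a split script could look like. The dataset names are placeholders, make_snippets is the placeholder you'd replace, and it reuses the prodigy_to_doc / doc_to_prodigy helpers sketched above:

```python
# split_doc.py (sketch) - split annotated documents into snippet tasks
from prodigy.components.db import connect
from prodigy import set_hashes

SOURCE_DATASET = "legal_ner"            # your annotated dataset (placeholder name)
SNIPPET_DATASET = "legal_ner_snippets"  # new dataset used as input to review

def make_snippets(doc, window=100):
    """Placeholder: fixed windows of `window` tokens. Replace this with logic
    that fits your documents (e.g. paragraphs), never cuts through an entity,
    and produces the same snippet boundaries for every annotator's copy."""
    for start in range(0, len(doc), window):
        yield doc[start:start + window]

db = connect()
snippet_tasks = []
for eg in db.get_dataset(SOURCE_DATASET):
    doc = prodigy_to_doc(eg)
    for snippet in make_snippets(doc):
        task = doc_to_prodigy(snippet.as_doc())
        # keep enough metadata to trace the snippet back to its source document
        task["meta"] = {
            "source_input_hash": eg["_input_hash"],
            "snippet_start_char": snippet.start_char,
        }
        if "_session_id" in eg:
            task["_session_id"] = eg["_session_id"]  # keep the annotator/session info
        snippet_tasks.append(set_hashes(task, overwrite=True))

if SNIPPET_DATASET not in db.datasets:
    db.add_dataset(SNIPPET_DATASET)
db.add_examples(snippet_tasks, datasets=[SNIPPET_DATASET])
```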
Also note that it's OK to keep the snippets without annotations: they will be automatically accepted (we expect they won't differ between the annotators) if you use the -A flag with the review recipe. I think keeping all the snippets in the dataset simplifies everything a lot.
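The review step itself would then just point at the snippet dataset, for example (dataset names are again placeholders):

```
prodigy review legal_ner_review legal_ner_snippets -A
```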
Once you're done reviewing and want to put the documents back together, you can call merge_snippets.py, which does exactly the opposite.
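Again purely as a sketch under the same assumptions (placeholder dataset names, the snippet_start_char / source_input_hash metadata written by the split script, and the review output keeping the meta fields from the input snippets), the merge side could look roughly like this:

```python
# merge_snippets.py (sketch) - put reviewed snippets back into whole documents
from collections import defaultdict
from prodigy.components.db import connect
from prodigy import set_hashes

REVIEWED_DATASET = "legal_ner_review"  # output dataset of the review step
MERGED_DATASET = "legal_ner_final"     # final, document-level dataset

db = connect()
# original long documents, keyed by input hash (any annotator's copy works, we only need the text)
originals = {eg["_input_hash"]: eg for eg in db.get_dataset("legal_ner")}

spans_by_doc = defaultdict(list)
for snippet in db.get_dataset(REVIEWED_DATASET):
    offset = snippet["meta"]["snippet_start_char"]
    source = snippet["meta"]["source_input_hash"]
    for span in snippet.get("spans", []):
        spans_by_doc[source].append({
            "start": span["start"] + offset,  # shift back to document-level offsets
            "end": span["end"] + offset,
            "label": span["label"],
        })

merged = []
for input_hash, spans in spans_by_doc.items():
    task = dict(originals[input_hash])
    task["spans"] = sorted(spans, key=lambda s: s["start"])
    merged.append(set_hashes(task, overwrite=True))

if MERGED_DATASET not in db.datasets:
    db.add_dataset(MERGED_DATASET)
db.add_examples(merged, datasets=[MERGED_DATASET])
```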
As a side note, we are planning to add these utilities (especially the translation from Prodigy to spaCy) to the library as soon as we have the bandwidth. Hopefully, the gist provided can be a good starting point for you and other users in the community.
