Reviewing NER annotations for long documents

Hello!

:world_map: What are you trying to do with Prodigy?
I'm back with my long-document problems. We work with documents that are quite long (French legal decisions), and we have multiple expert annotators who need context within the decisions to produce accurate annotations. The context is not necessarily long, but it is not well defined: it could be in the same sentence or a few lines below. We therefore decided to annotate the documents as a whole, because that makes sense for the domain experts and eases the process for them, as a decision is a single unit in their world.

We annotated with some overlap (to measure inter-annotator agreement and allow for review of differences), and we would now like to proceed to the review step.

:face_with_raised_eyebrow: Did you find something confusing, disorienting or hard to find?
We have found that the review recipe simply fails (error 500 with no log) when trying to review the long documents. There are rarely more than 10 entities per document, but the documents themselves are quite long.

We are thinking about writing a custom recipe that would show the reviewer only the parts of the document that contain entities, but there are no examples of custom review recipes. Would it be possible to have one, or at least some guidance on how to best approach this issue?

We can cut most of the text based on the placement of the entities, but we couldn't do that reliably before annotating...

Hi @Martin,

It's true that the review interface was designed to show the diff between short snippets, as it renders the entire document once per annotator. I'm not exactly sure why it's returning a 500 (it's probably just a performance issue when trying to render such a huge diff). Even if it did render, it would be pretty unusable, as you can imagine.
Your strategy of splitting the documents into snippets and using those for review makes a lot of sense; in fact, we did something very similar for one of our consulting clients.
Before I get into the details of how you could approach it, I just want to point out that, in general, if annotators need to rely on context that is far away from the entity in question, the NER model will very likely struggle to learn it. The NER model relies on a small context window to make its decisions, so if it's impossible to assign a label based on local context, the problem is probably not a good fit for an NER task.

That said, to address your immediate question, I would suggest moving the splitting and merging of the snippets outside the main recipe and implementing them as pre- and post-processing steps. Concretely:

1. Split the annotated long documents into snippets, preserving the annotations.
2. Perform the review.
3. Merge the reviewed snippets back into documents (if necessary).
To make sure the annotations are preserved and stay in sync with the changing tokenization, you can leverage spaCy's Doc data structure and its methods. All you need is a utility to translate between the Prodigy representation and the spaCy representation, plus logic to split the documents into snippets in whatever way fits your purpose.
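For illustration, here is a minimal sketch of that translation utility (not the gist itself). It assumes a blank French pipeline for tokenization and NER examples in the standard Prodigy format with character-offset `"spans"`:

```python
import spacy

nlp = spacy.blank("fr")  # tokenizer only; swap in your own pipeline if needed

def prodigy_to_doc(example):
    """Turn a Prodigy NER example ({"text": ..., "spans": [...]}) into a spaCy Doc."""
    doc = nlp.make_doc(example["text"])
    ents = []
    for span in example.get("spans", []):
        ent = doc.char_span(
            span["start"], span["end"], label=span["label"], alignment_mode="expand"
        )
        if ent is not None:
            ents.append(ent)
    doc.ents = ents  # assumes non-overlapping entity spans
    return doc

def doc_to_prodigy(doc, **extra):
    """Turn a spaCy Doc back into a Prodigy task dict."""
    return {
        "text": doc.text,
        "spans": [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ],
        **extra,
    }
```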
You would have to adapt it to your purposes, but here's how such a workflow could look:

Once you have saved your annotated dataset, you can use it as input to split_doc.py. This script translates each annotated Prodigy example to a spaCy Doc and then uses placeholder logic to split it into snippets, which are saved to a new Prodigy dataset that can be used as input to review.
Please note that you would have to implement your own make_snippets function, but hopefully this can get you started.
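To make this more concrete, here's a rough sketch of what such a split script could look like. The dataset names are placeholders, make_snippets is the placeholder you'd replace, and it reuses the prodigy_to_doc / doc_to_prodigy helpers sketched above:

```python
# split_doc.py (sketch) - split annotated documents into snippet tasks
from prodigy.components.db import connect
from prodigy import set_hashes

SOURCE_DATASET = "legal_ner"            # your annotated dataset (placeholder name)
SNIPPET_DATASET = "legal_ner_snippets"  # new dataset used as input to review

def make_snippets(doc, window=100):
    """Placeholder: fixed windows of `window` tokens. Replace this with logic
    that fits your documents (e.g. paragraphs), never cuts through an entity,
    and produces the same snippet boundaries for every annotator's copy."""
    for start in range(0, len(doc), window):
        yield doc[start:start + window]

db = connect()
snippet_tasks = []
for eg in db.get_dataset(SOURCE_DATASET):
    doc = prodigy_to_doc(eg)
    for snippet in make_snippets(doc):
        task = doc_to_prodigy(snippet.as_doc())
        # keep enough metadata to trace the snippet back to its source document
        task["meta"] = {
            "source_input_hash": eg["_input_hash"],
            "snippet_start_char": snippet.start_char,
        }
        if "_session_id" in eg:
            task["_session_id"] = eg["_session_id"]  # keep the annotator/session info
        snippet_tasks.append(set_hashes(task, overwrite=True))

if SNIPPET_DATASET not in db.datasets:
    db.add_dataset(SNIPPET_DATASET)
db.add_examples(snippet_tasks, datasets=[SNIPPET_DATASET])
```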
Also note that it's OK to keep the snippets without annotations: they will be automatically accepted (we expect they won't differ between the annotators) if you use the -A flag with the review recipe. I think keeping all the snippets in the dataset simplifies everything a lot.
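The review step itself would then just point at the snippet dataset, for example (dataset names are again placeholders):

```
prodigy review legal_ner_review legal_ner_snippets -A
```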
Once you're done reviewing and want to put the documents back together, you can call merge_snippets.py, which does exactly the opposite.
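Again purely as a sketch under the same assumptions (placeholder dataset names, the snippet_start_char / source_input_hash metadata written by the split script, and the review output keeping the meta fields from the input snippets), the merge side could look roughly like this:

```python
# merge_snippets.py (sketch) - put reviewed snippets back into whole documents
from collections import defaultdict
from prodigy.components.db import connect
from prodigy import set_hashes

REVIEWED_DATASET = "legal_ner_review"  # output dataset of the review step
MERGED_DATASET = "legal_ner_final"     # final, document-level dataset

db = connect()
# original long documents, keyed by input hash (any annotator's copy works, we only need the text)
originals = {eg["_input_hash"]: eg for eg in db.get_dataset("legal_ner")}

spans_by_doc = defaultdict(list)
for snippet in db.get_dataset(REVIEWED_DATASET):
    offset = snippet["meta"]["snippet_start_char"]
    source = snippet["meta"]["source_input_hash"]
    for span in snippet.get("spans", []):
        spans_by_doc[source].append({
            "start": span["start"] + offset,  # shift back to document-level offsets
            "end": span["end"] + offset,
            "label": span["label"],
        })

merged = []
for input_hash, spans in spans_by_doc.items():
    task = dict(originals[input_hash])
    task["spans"] = sorted(spans, key=lambda s: s["start"])
    merged.append(set_hashes(task, overwrite=True))

if MERGED_DATASET not in db.datasets:
    db.add_dataset(MERGED_DATASET)
db.add_examples(merged, datasets=[MERGED_DATASET])
```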
As a side note, we are planning to add these utilities (especially the translation from Prodigy to spaCy) to the library as soon as we have the bandwidth. Hopefully, the gist provided can be a good starting point for you and other users in the community.
