Annotating coreference on NER annotated text

Hi,

I am building the following workflow on Prodigy: Annotation of named entities, then coreference annotation for entities, and finally relationship annotation between entities.

For the first step I plan to use ner.manual to label tokens. However, I am having a hard time understanding how to load the NER annotated text into the coref.manual recipe for coreference annotation. Specifically, I would like to have a coreference interface that loads my entity annotations and allows me to indicate coreference between entities and possibly other tokens in the text.

Thanks.

Hello @ale,

The coref.manual recipe was designed for cases where NER and POS annotations are not available, so it only displays model-based annotations. Nonetheless, starting coref annotation from an NER pre-annotated dataset is a very reasonable workflow and, in hindsight, it should have been supported as input. The good news is that coref.manual is just a special case of rel.manual with some coref-specific configuration, such as disabling all tokens that are not NER entities or (relevant) POS tags.
Consequently, you should be able to recreate a similar workflow directly with rel.manual by specifying the tokens to disable yourself via the --disable-patterns parameter.
If you'd like to recreate coref.manual exactly, i.e. only enable NER entities and certain POS tags, you'll need POS tags in your data. That's the tricky bit, because rel.manual currently accepts spans either from the input dataset or from the model, but not both.
As a workaround, you could add these tags in a standalone preprocessing script like this one: A script to preprocess Prodigy JSONL stream by adding POS spans from a spaCy pipeline · GitHub
This script iterates over a dataset with NER annotations and adds spans for POS tags relevant for coref.
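
To give an idea of what such a preprocessing step does, here is a minimal sketch. It is not the linked script. It assumes each task already has Prodigy-style "tokens" and "spans", and that you pass in one POS tag per token (for example collected from token.pos_ of a spaCy pipeline). The function name add_pos_spans and the COREF_POS set are made up for this example.

```python
# Hypothetical helper, not part of Prodigy or the linked script.
# Assumption: POS tags considered relevant for coreference.
COREF_POS = {"NOUN", "PROPN", "PRON", "DET"}


def add_pos_spans(task, pos_tags, keep=COREF_POS):
    """Return a copy of `task` with one single-token span per kept POS tag.

    Tokens already covered by an existing span (e.g. an NER span) are skipped,
    so the existing annotations are never overwritten.
    """
    tokens = task["tokens"]
    if len(tokens) != len(pos_tags):
        raise ValueError("pos_tags must align one-to-one with task['tokens']")

    spans = [dict(span) for span in task.get("spans", [])]

    # Collect token ids that existing spans cover (token_end is inclusive).
    covered = set()
    for span in spans:
        covered.update(range(span["token_start"], span["token_end"] + 1))

    for token, pos in zip(tokens, pos_tags):
        if pos in keep and token["id"] not in covered:
            spans.append({
                "start": token["start"],
                "end": token["end"],
                "token_start": token["id"],
                "token_end": token["id"],
                "label": pos,
            })

    spans.sort(key=lambda span: span["start"])
    return {**task, "spans": spans}
```

Keeping the NER spans and only filling in uncovered tokens means the entity labels from the first annotation step stay intact.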
The resulting dataset (which the script writes to disk) contains both NER and POS spans, so it can be used as input to rel.manual with the following pattern, which disables all tokens other than ORG (an example entity label), NOUN, PROPN, PRON, DET and NP (a label for noun phrases that the script adds):

{"pattern": [{"_": {"label": {"NOT_IN": ["ORG","NOUN","PROPN","PRON","DET","NP"]}}}]}

(Please note that for this pattern to work you'll need the workaround mentioned here.)
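
If you'd rather generate the patterns file than write it by hand, something along these lines should do. It's only a sketch: the helper names and the label list are assumptions for illustration, and the output path is whatever you choose to pass to --disable-patterns.

```python
import json
from pathlib import Path

# Assumption: the labels that should stay selectable. Replace ORG with
# the entity labels used in your own NER dataset.
ENABLED_LABELS = ["ORG", "NOUN", "PROPN", "PRON", "DET", "NP"]


def build_disable_pattern(enabled_labels):
    """Build a pattern matching every token whose label is NOT enabled."""
    return {"pattern": [{"_": {"label": {"NOT_IN": list(enabled_labels)}}}]}


def write_patterns(path, patterns):
    """Write patterns as JSONL: one JSON object per line."""
    with Path(path).open("w", encoding="utf8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")
```

For example, write_patterns("disable_patterns.jsonl", [build_disable_pattern(ENABLED_LABELS)]) writes a one-line file equivalent to the pattern above (the file name here is just a placeholder).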
If the tokenization in your NER dataset differs from the default spaCy tokenization, the nlp object instantiated in the script should, ideally, use the same tokenizer that was used for the NER annotation (to prevent POS spans from being rejected because of misaligned tokens).
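
A quick way to spot such misalignment before loading the data is to check that every span starts and ends on a token boundary. This is a rough sketch that assumes the usual Prodigy task layout with character offsets on "tokens" and "spans"; the function name misaligned_spans is made up for this example.

```python
def misaligned_spans(task):
    """Return the spans whose character offsets do not fall on token boundaries."""
    token_starts = {token["start"] for token in task["tokens"]}
    token_ends = {token["end"] for token in task["tokens"]}
    return [
        span
        for span in task.get("spans", [])
        if span["start"] not in token_starts or span["end"] not in token_ends
    ]
```

An empty result for every task suggests that the spans and the tokenization agree. Any span that is returned points to a place where the two tokenizers split the text differently.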

I realize that's quite a bit of working around. Your post definitely made us revisit the inputs of both coref.manual and rel.manual, which we intend to improve in the forthcoming release of Prodigy.


Thanks for your answer @magdaaniol.

If I wanted to annotate relationships and coreference at the same time on NER annotated data, and didn't care too much about disabling tokens, what would be a reasonable approach? I believe another forum answer mentioned that one can use rel.manual with a COREF label to indicate coreference.

Hi @ale,

That's right. There's nothing special about the COREF label per se. coref.manual has some presets, such as disabling irrelevant tokens, but otherwise it's just rel.manual under the hood. So, yes, you can add COREF as another rel.manual label.
That said, I would recommend separating these steps, as both relations and coreference are fairly complex annotation tasks. Focusing on one at a time, and disabling tokens other than noun phrases, will likely result in more efficient and less error-prone annotation, even if it means going through the data twice.
