Annotating coreference on NER-annotated text

Hello @ale,

Actually, the coref.manual recipe was designed for cases where NER and POS annotations are not available, so it only displays model-based annotations. Nonetheless, starting coref annotation from an NER pre-annotated dataset is a very reasonable workflow and, with hindsight, it should have been supported as a possible input. The good news is that coref.manual is just a specific case of rel.manual with some coref-relevant configuration, such as disabling all tokens that are not NER entities or (relevant) POS tags.
Consequently, you should be able to recreate a similar workflow directly with rel.manual by specifying the tokens to disable yourself via the --disable-patterns parameter.
If you'd like to recreate coref.manual exactly, i.e. only enable NER entities and certain POS tags, you'll need POS tags in your data. That's the tricky bit, because rel.manual currently accepts spans either from the input dataset or from the model, but not both.
As a workaround, you could add these tags in a standalone preprocessing script like this one: A script to preprocess Prodigy JSONL stream by adding POS spans from a spaCy pipeline · GitHub
This script iterates over a dataset with NER annotations and adds spans for POS tags relevant for coref.
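To give a rough idea of what such a preprocessing step involves, here is a minimal sketch. It is not the linked script: the function name add_pos_spans and the COREF_POS set are hypothetical, and it assumes the POS tags were already computed by a spaCy pipeline and are aligned one-to-one with the task's "tokens" (for example [t.pos_ for t in nlp(task["text"])]). It also assumes that tokens already covered by an NER span should keep their entity label, and it leaves out the NP (noun phrase) spans.

```python
import copy

# Assumed set of POS tags that are relevant for coreference.
COREF_POS = {"NOUN", "PROPN", "PRON", "DET"}


def add_pos_spans(task, pos_tags, keep=COREF_POS):
    """Return a copy of a Prodigy task with a span added for every token
    whose POS tag is in `keep`.

    `pos_tags` must be aligned with task["tokens"]. Tokens that are already
    covered by an existing (NER) span are left untouched.
    """
    if len(pos_tags) != len(task["tokens"]):
        raise ValueError("pos_tags must have one entry per token")
    task = copy.deepcopy(task)
    spans = task.setdefault("spans", [])
    # Collect the token indices that existing spans already cover
    # (token_end is inclusive in Prodigy's span format).
    covered = set()
    for span in spans:
        covered.update(range(span["token_start"], span["token_end"] + 1))
    for token, pos in zip(task["tokens"], pos_tags):
        if pos in keep and token["id"] not in covered:
            spans.append(
                {
                    "start": token["start"],
                    "end": token["end"],
                    "token_start": token["id"],
                    "token_end": token["id"],
                    "label": pos,
                }
            )
    spans.sort(key=lambda s: s["start"])
    return task


task = {
    "text": "Apple said it grew",
    "tokens": [
        {"text": "Apple", "start": 0, "end": 5, "id": 0},
        {"text": "said", "start": 6, "end": 10, "id": 1},
        {"text": "it", "start": 11, "end": 13, "id": 2},
        {"text": "grew", "start": 14, "end": 18, "id": 3},
    ],
    "spans": [
        {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "ORG"}
    ],
}
result = add_pos_spans(task, ["PROPN", "VERB", "PRON", "VERB"])
print([s["label"] for s in result["spans"]])  # ['ORG', 'PRON']
```

In this toy example "Apple" keeps its ORG label even though it is tagged PROPN, and only "it" gets a new PRON span.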
The resulting dataset (which the script writes to disk) contains both NER and POS spans, so it can be used as input to rel.manual with the following pattern, which disables all tokens other than ORG (an example entity label), NOUN, PROPN, PRON, DET and NP (a label for noun phrases that the script adds):

{"pattern": [{"_": {"label": {"NOT_IN": ["ORG","NOUN","PROPN","PRON","DET","NP"]}}}]}

(Please note that for this pattern to work you'll need the workaround mentioned here.)
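If your entity labels differ from the ORG example, the pattern line can be generated rather than edited by hand. The helper below is a hypothetical convenience (build_disable_pattern is not part of Prodigy); it only reproduces the JSON structure shown above for a label list of your choice.

```python
import json


def build_disable_pattern(enabled_labels):
    """Build a pattern that matches every token whose label is NOT
    in `enabled_labels`, mirroring the structure of the pattern above."""
    return {"pattern": [{"_": {"label": {"NOT_IN": list(enabled_labels)}}}]}


# One JSON object per line, as expected in a JSONL patterns file.
line = json.dumps(build_disable_pattern(["ORG", "NOUN", "PROPN", "PRON", "DET", "NP"]))
print(line)
```

Saving the printed line to a .jsonl file gives you something you can pass to rel.manual via --disable-patterns.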
If the tokenization in your NER dataset differs from the default spaCy tokenization, the nlp object instantiated in the script should ideally use the same tokenizer as the one used for the NER annotation, to prevent POS spans from being rejected because of misaligned tokens.

I realize that's quite a bit of working around, and your post definitely made us revisit the inputs of both coref.manual and rel.manual, which we intend to improve in the forthcoming release of Prodigy.