Highlighting spans that are not the entities to be labeled when using ner.correct


I wonder if it is possible to highlight spans that aren't the entities to be labeled when using ner.correct. My use case is that the entities I would like annotators to label often (though not always) appear near certain words. Highlighting those words would help draw annotators' attention to them, in case the trained model fails to identify some occurrences of the entities of interest in a document.

I tried including "spans" for each document:


{"text": "blah blah blah blah",
"spans": [{"start": 1, "end": 12, "label": "WORD_OF_INTEREST"}]}

But the highlighted terms don't show up in ner.correct mode. They do, however, show up in ner.manual mode.

I suppose a workaround method could be to use the model to make predictions on all the documents that I want to be annotated and get the spans of all the identified entities, and then create a new jsonl file that contains the spans of the model-identified entities AND the spans of the words that are of interest.

But ideally, if there is an easier way to do that, that would be great.


Yes, that's expected, because ner.correct will override all spans present in the data and replace them with what the model predicts, so you see the exact output of the model for the given text. If it also included pre-defined spans, you could easily end up with very unintuitive results, and there wouldn't be a clear answer for how to handle overlaps (if your words of interest overlap with predictions).

The easiest solution would probably be to use a custom recipe that's a slight modification of the ner.correct recipe – only that you're not overwriting the spans but adding to them. Just make sure that you decide on a strategy to deal with overlaps! For example, filter out the words of interest that overlap with a prediction (by comparing the start/end offsets).
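
The overlap strategy described above could look roughly like the sketch below. It is plain Python and not part of Prodigy: the function names `spans_overlap` and `merge_spans` are made up for illustration, and it assumes spans are dicts with "start", "end" and "label" keys using end-exclusive character offsets, like the ones in the question. Inside a custom recipe you would call something like this for each example, after the model has produced its predictions.

```python
def spans_overlap(a, b):
    """Return True if two character-offset spans share at least one character.

    Assumes end-exclusive offsets, so a span ending at 5 and one starting
    at 5 do not overlap.
    """
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_spans(predicted, words_of_interest):
    """Keep every predicted span and add only the words of interest
    that do not overlap any prediction (hypothetical helper)."""
    kept = [
        span
        for span in words_of_interest
        if not any(spans_overlap(span, pred) for pred in predicted)
    ]
    # Sort by start offset so the spans appear in document order.
    return sorted(predicted + kept, key=lambda span: span["start"])


# Toy data: one model prediction and two highlighted words.
predicted = [{"start": 0, "end": 5, "label": "ORG"}]
words = [
    {"start": 3, "end": 8, "label": "WORD_OF_INTEREST"},    # overlaps the prediction
    {"start": 10, "end": 14, "label": "WORD_OF_INTEREST"},  # does not overlap
]
merged = merge_spans(predicted, words)
```

Here the first word of interest is dropped because it overlaps the prediction, and the second is kept. Giving the model's predictions priority is just one possible choice; you could also prefer the longer span, or keep the word of interest and drop the prediction.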

Here's the recipe you can use as a template: