How can I use spaCy-format data with Prodigy's ner.manual and continue annotating?

I have a spaCy v2 NER model that I trained a long time ago, and it performs well. It has four labels: A, B, C and D.

Now I have collected more data from other sources, in the format given below. It has three labels: A, C and D. The collected data doesn't include label B.

[["Who is Shaka Khan?", {"entities": [[7, 17, "A"]]}],
 ["I like London and Berlin.", {"entities": [[7, 13, "C"], [18, 24, "D"]]}]]

How can I use my existing model in the loop and continue annotating this new data with the new label B, along with the other labels?

Please help. I have been struggling for days with how to do this using Prodigy.

Hi! Prodigy allows you to use the ner.manual workflow with pre-annotated data in Prodigy's format (see here for details), so you can load your new annotations in and add the label B manually wherever it's missing.
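
As a minimal sketch of that first option, the data in the question could be converted to Prodigy-style tasks (a "text" plus a list of "spans" with "start", "end" and "label") roughly like this. The helper names spacy_to_prodigy and write_jsonl and the output file name are made up for illustration, and the sketch assumes every entry follows the [text, {"entities": [[start, end, label], ...]}] layout shown in the question.

```python
import json


def spacy_to_prodigy(examples):
    """Convert [text, {"entities": [[start, end, label], ...]}] pairs
    into dicts with "text" and "spans" keys."""
    tasks = []
    for text, annotations in examples:
        spans = [
            {"start": start, "end": end, "label": label}
            for start, end, label in annotations.get("entities", [])
        ]
        tasks.append({"text": text, "spans": spans})
    return tasks


def write_jsonl(tasks, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")


# The two examples from the question
examples = [
    ["Who is Shaka Khan?", {"entities": [[7, 17, "A"]]}],
    ["I like London and Berlin.", {"entities": [[7, 13, "C"], [18, 24, "D"]]}],
]
tasks = spacy_to_prodigy(examples)
# write_jsonl(tasks, "annotations.jsonl")  # placeholder file name
```

The resulting file could then be loaded with a call along the lines of prodigy ner.manual your_dataset blank:en ./annotations.jsonl --label A,B,C,D, where the dataset name and file name are placeholders. It's worth checking the ner.manual docs for your Prodigy version for the exact arguments.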

Alternatively, the ner.correct workflow lets you pre-annotate everything that's predicted by the model and correct its predictions. However, this workflow won't include pre-defined annotations by default, as you can easily end up with conflicts (e.g. pre-annotated entities that overlap with something predicted by the model), and there's no easy answer for how to resolve them.

So you'd have to decide whether it's worth it to use your model to fill in the B entities, or whether it makes sense to do it manually. If it's just one label, it might make more sense to use ner.manual with your pre-annotated data, and add label B by hand.

If you want to use your model for it, you'd have to decide how you want to deal with conflicts. You can find conflicts by checking the start/end indices of the predicted spans against the existing ones in your data to see whether your data already covers one or more of the same tokens. Another, potentially easier way is to use spaCy's filter_spans helper: it'll take a list of potentially overlapping spans and filter out conflicts and overlaps. So if you get back fewer spans than you passed in, you know that there's at least one conflict:

import spacy
from spacy.util import filter_spans

# Assumption: placeholder path, point this at your trained model (labels A, B, C, D)
nlp = spacy.load("/path/to/your/model")

def make_stream(stream):
    data_tuples = ((eg["text"], eg) for eg in stream)
    # This gives you a Doc processed by your model, and the original input JSON with pre-annotated "spans"
    for doc, eg in nlp.pipe(data_tuples, as_tuples=True):
        # Entities your model annotated as B
        b_entities = [ent for ent in doc.ents if ent.label_ == "B"]
        # Existing spans annotated in your data
        existing_spans = [doc.char_span(span["start"], span["end"], span["label"]) for span in eg.get("spans", [])]
        # char_span returns None if the offsets don't line up with token boundaries
        misaligned = any(span is None for span in existing_spans)
        all_spans = [*b_entities, *(span for span in existing_spans if span is not None)]
        filtered_spans = filter_spans(all_spans)
        # Some spans got filtered out (or didn't align), so there must be a conflict
        if misaligned or len(filtered_spans) < len(all_spans):
            print("Conflicting or misaligned spans:", eg["text"], all_spans)
            yield eg  # send out original example, annotate B manually?
        else:
            # No conflict: add all of your spans to the example, including label B
            eg["spans"] = [{"start": span.start_char, "end": span.end_char, "label": span.label_} for span in filtered_spans]
            yield eg

If you have conflicts (e.g. your model predicted something as B that your data has annotated as A), you can decide how you want to deal with that. It's possible that this case is very rare, so it probably makes sense to just handle those examples manually and add B yourself (or fix existing annotations that were inconsistent).
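
The index-based check mentioned above, as an alternative to filter_spans, can also be sketched without spaCy. This is only an illustration: the function names are made up, the "predicted" spans are invented values standing in for what your model might output, and end offsets are assumed to be exclusive, as in the data from the question.

```python
def spans_overlap(a, b):
    """True if two spans share at least one character (end offsets exclusive)."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def find_conflicts(predicted, existing):
    """Return (predicted_span, existing_span) pairs that overlap."""
    return [(p, e) for p in predicted for e in existing if spans_overlap(p, e)]


# Existing annotations for "I like London and Berlin."
existing = [
    {"start": 7, "end": 13, "label": "C"},
    {"start": 18, "end": 24, "label": "D"},
]
# Hypothetical model predictions for label B
predicted = [
    {"start": 7, "end": 13, "label": "B"},  # same characters as the existing C span
    {"start": 0, "end": 1, "label": "B"},   # no overlap with existing spans
]
conflicts = find_conflicts(predicted, existing)
```

Each pair in conflicts tells you which predicted B span collides with which existing annotation, so you could route only those examples to manual review and merge the rest automatically.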