Hi! Prodigy lets you use the ner.manual workflow with pre-annotated data in Prodigy's format (see here for details), so you can load your annotations in and add the label B manually wherever it's missing.
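For reference, here's a minimal sketch of what one pre-annotated example could look like when written out as JSONL. The text, offsets and label are made up for illustration, and `to_jsonl` is just a hypothetical helper. The keys follow the usual "text" plus "spans" layout, with character offsets where "end" is exclusive:

```python
import json

# Made-up example: one text with a single pre-annotated entity of label "A".
text = "Acme Corp hired Jane Doe."
task = {
    "text": text,
    # "start"/"end" are character offsets into the text; "end" is exclusive
    "spans": [{"start": 0, "end": 9, "label": "A"}],
}


def to_jsonl(tasks):
    """Serialize a list of task dicts to JSONL: one JSON object per line."""
    return "\n".join(json.dumps(t) for t in tasks)


jsonl = to_jsonl([task])
print(jsonl)
```

It's worth double-checking that slicing the text with each span's offsets gives you back the entity string you expect before loading the file.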
Alternatively, the ner.correct workflow lets you pre-annotate everything that's predicted by the model and correct its predictions. However, this workflow won't include pre-defined annotations by default, as you can easily end up with conflicts (e.g. pre-annotated entities that overlap with something predicted by the model), and there's no easy answer for how to resolve them.
So you'd have to decide whether it's worth it to use your model to fill in the B entities, or whether it makes sense to do it manually. If it's just one label, it might make more sense to use ner.manual with your pre-annotated data, and add label B by hand.
If you want to use your model for it, you'd have to decide how you want to deal with conflicts. You can find conflicts by comparing the start/end indices of the predicted spans against the existing ones in your data and checking whether one or more tokens of a predicted span are already covered. Another, potentially easier way is to use spaCy's filter_spans helper: it'll take a list of potentially overlapping spans and filter out conflicts and overlaps. So if you end up with fewer filtered spans, you know that there's at least one conflict:
import spacy
from spacy.util import filter_spans

# Placeholder name: load whichever pipeline predicts your label B
nlp = spacy.load("your_model")

def make_stream(stream):
    data_tuples = ((eg["text"], eg) for eg in stream)
    # This gives you a Doc processed by your model, and the original input JSON with pre-annotated "spans"
    for doc, eg in nlp.pipe(data_tuples, as_tuples=True):
        # Entities your model annotated as B
        b_entities = [ent for ent in doc.ents if ent.label_ == "B"]
        # Existing spans annotated in your data (missing "spans" is treated as empty)
        existing_spans = [doc.char_span(span["start"], span["end"], label=span["label"]) for span in eg.get("spans", [])]
        # char_span returns None if the offsets don't line up with token boundaries
        if any(span is None for span in existing_spans):
            yield eg  # send out original example, check it manually
            continue
        all_spans = [*b_entities, *existing_spans]
        filtered_spans = filter_spans(all_spans)
        # Some spans got filtered out, there must be a conflict
        if len(filtered_spans) < len(all_spans):
            dropped = [span for span in all_spans if span not in filtered_spans]
            print("Overlapping entities:", dropped)
            yield eg  # send out original example, annotate B manually?
        else:
            # Add all of your spans to the example, including label B
            eg["spans"] = [{"start": span.start_char, "end": span.end_char, "label": span.label_} for span in filtered_spans]
            yield eg
If you have conflicts (e.g. your model predicted something as B that your data has annotated as A), you can decide how you want to deal with that. If this case turns out to be very rare, it probably makes sense to just handle those examples manually and add B yourself (or fix existing annotations that were inconsistent).
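If you'd rather do the start/end index comparison yourself instead of relying on filter_spans, a rough sketch could look like this. It works on plain span dicts with character offsets and assumes "end" is exclusive. The `overlaps` and `find_conflicts` helpers and the sample offsets are hypothetical, just to show the idea:

```python
def overlaps(a, b):
    """True if two spans share at least one character ("end" is exclusive)."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def find_conflicts(predicted, existing):
    """Return (predicted, existing) pairs whose character ranges overlap."""
    return [(p, e) for p in predicted for e in existing if overlaps(p, e)]


# Made-up offsets: one existing A span, two B spans predicted by the model
existing = [{"start": 0, "end": 9, "label": "A"}]
predicted = [
    {"start": 5, "end": 9, "label": "B"},    # overlaps the existing A span
    {"start": 16, "end": 24, "label": "B"},  # no overlap
]

conflicts = find_conflicts(predicted, existing)
# Predicted spans that don't touch any existing annotation
safe = [p for p in predicted if not any(overlaps(p, e) for e in existing)]
print(len(conflicts), "conflict(s),", len(safe), "non-conflicting prediction(s)")
```

What you do with the conflicting pairs is still up to you: this only tells you which predictions clash with which existing annotations, so you can route those examples to manual annotation.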