Hi! I'm looking for opinions on the best way to implement the following NER annotation workflow, which is related to the "model in the loop" idea.
Problem:
We have N pre-annotated notes that we'd like to pass through Prodigy to remove or add annotations. The preset annotations come from regexes, and we pass them to Prodigy via the spans field of a JSONL file. While reannotating these notes, we'd like the annotations we make to be applied to the remaining notes in the batch. For example, say we notice during reannotation that "car" isn't being annotated. Once we have annotated "car" in the first note, we'd like all later instances of "car" to be highlighted automatically so we don't have to annotate them again.
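For reference, the pre-annotation step could look roughly like the sketch below. The regex in `REGEXES`, the `VEHICLE` label, and the helper `regex_to_task` are hypothetical stand-ins for illustration; only the task shape (a `text` plus `spans` with character `start`, `end`, and `label`) reflects what we pass to Prodigy.

```python
import json
import re

# Hypothetical label -> regex mapping; our real expressions differ.
REGEXES = {
    "VEHICLE": re.compile(r"\b(?:truck|bus)\b", re.IGNORECASE),
}


def regex_to_task(text):
    """Build one task dict with a character-offset span per regex match."""
    spans = []
    for label, pattern in REGEXES.items():
        for match in pattern.finditer(text):
            spans.append(
                {"start": match.start(), "end": match.end(), "label": label}
            )
    # Keep spans in reading order
    spans.sort(key=lambda span: span["start"])
    return {"text": text, "spans": spans}


task = regex_to_task("The truck hit a car near the bus stop.")
print(json.dumps(task))  # one line of the JSONL file
```

Note that "car" is not covered by the regex here, which is the situation we run into during reannotation.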
Our current workflow is based on the "model in the loop" idea: we'd like to update our model after a batch of notes and rerun all unprocessed notes through the updated model. This idea is normally applied to models such as BERT, but we'd like to apply it to a PatternMatcher instance.
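The "update" half of that loop could be sketched as follows: turn the spans accepted in a batch into new match patterns. This is only a sketch under assumptions. `spans_to_patterns` is a hypothetical helper, the answers are assumed to carry `text`, `spans`, and `answer` keys, and the output uses the `{"label": ..., "pattern": ...}` shape of a patterns file with exact-string patterns. How the matcher should pick up new patterns at runtime is the part we're unsure about.

```python
def spans_to_patterns(answers, known=None):
    """Collect one string pattern per newly seen (label, phrase) pair."""
    known = set() if known is None else known
    new_patterns = []
    for eg in answers:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        for span in eg.get("spans", []):
            phrase = eg["text"][span["start"]:span["end"]]
            key = (span["label"], phrase)
            if key not in known:
                known.add(key)
                new_patterns.append({"label": span["label"], "pattern": phrase})
    return new_patterns


answers = [
    {"text": "A car passed.", "answer": "accept",
     "spans": [{"start": 2, "end": 5, "label": "VEHICLE"}]},
    {"text": "Another car.", "answer": "accept",
     "spans": [{"start": 8, "end": 11, "label": "VEHICLE"}]},
]
print(spans_to_patterns(answers))  # [{'label': 'VEHICLE', 'pattern': 'car'}]
```

Passing a shared `known` set across batches would keep the same phrase from being added twice.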
Below is pseudocode for an implementation idea.
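from typing import List, Optional

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.models.matcher import PatternMatcher
from prodigy.util import split_string


@prodigy.recipe(
    "ner.regex",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    patterns=("Optional match patterns", "option", "p", str),
)
def ner_regex(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,  # path to the patterns file
):
    # Blank English pipeline for tokenization; spacy_model is currently unused
    nlp = spacy.blank("en")
    # The matcher plays the role of the "model" in the loop
    matcher = PatternMatcher(nlp).from_disk(patterns)
    update = matcher.update
    stream = JSONL(source)
    stream = add_tokens(nlp, stream)
    predict = matcher
    # The matcher yields (score, example) tuples; keep only the examples
    stream = (eg for score, eg in predict(stream))
    return {
        "view_id": "ner_manual",  # annotation interface to use
        "dataset": dataset,  # dataset to save annotations to
        "stream": stream,  # the incoming stream of examples
        "update": update,  # the update callback
        "config": {
            "lang": "en",
            "labels": label,
        },
    }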
The idea is that the unannotated examples would be updated once the user saves their annotations, which calls the update function. However, this code doesn't work, and we believe we may have to implement a custom model. Additionally, we don't fully understand how the batch_size parameter works. In this case we'd like it to be 1, so that the unannotated examples get updated after each annotation.
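To make the batch_size question concrete, here is a library-free simulation of the behaviour we're after. It is not Prodigy code: `phrases`, `annotate`, `update`, and `run` are hypothetical stand-ins for the matcher's patterns, the stream, the update callback, and the annotation session, and Prodigy's real batching and prefetching may well differ. The point is only that a lazy stream picks up new phrases for notes that have not been pulled yet, so notes already fetched in the current batch miss them.

```python
import re
from itertools import islice

phrases = set()  # stands in for the matcher's patterns


def annotate(texts):
    """Lazily yield tasks, matching with whatever phrases exist at pull time."""
    for text in texts:
        spans = []
        for phrase in sorted(phrases):
            for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
                spans.append(
                    {"start": m.start(), "end": m.end(), "label": "VEHICLE"}
                )
        yield {"text": text, "spans": spans}


def update(answers):
    """Stand-in for the update callback: remember every annotated phrase."""
    for eg in answers:
        for span in eg["spans"]:
            phrases.add(eg["text"][span["start"]:span["end"]])


def run(texts, batch_size):
    """Pull batches and return the number of pre-filled spans per note."""
    phrases.clear()
    stream = annotate(texts)
    prefilled = []
    first = True
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break
        prefilled.extend(len(eg["spans"]) for eg in batch)
        if first:
            # Simulate the annotator marking "car" by hand in the first note
            start = batch[0]["text"].index("car")
            batch[0]["spans"].append(
                {"start": start, "end": start + 3, "label": "VEHICLE"}
            )
            first = False
        update(batch)
    return prefilled


notes = ["A car passed.", "The car stopped.", "One more car."]
print(run(notes, batch_size=1))  # [0, 1, 1]
print(run(notes, batch_size=2))  # [0, 0, 1]
```

With a batch size of 1 the second note already shows "car" highlighted; with a batch size of 2 it was fetched before the update and stays empty.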
Any feedback and help will be greatly appreciated, thank you!