I've built spans for tasks, just as you proposed, using spaCy's PatternMatcher, which returns multiple matches if available, but the ner_manual view seems a little off - a token may be. Also, the first task goes unannotated; I have to skip it to get a new task that is annotated.
Also, while building spans, I had to normalize overlapping spans and drop shorter spans in favor of longer ones, so the manual view gets data that is as clean as possible.
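For reference, here's a minimal sketch of the kind of overlap normalization I mean (the `normalize_spans` name and the `(start, end, label)` tuple format are just illustrative, not my actual code):

```python
def normalize_spans(spans):
    """Drop spans whose tokens overlap a longer span.

    Spans are (start, end, label) tuples over token indices.
    Longer spans win; ties are broken in favor of the earlier start.
    """
    spans = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    result = []
    seen = set()  # token indices already covered by a kept span
    for start, end, label in spans:
        tokens = set(range(start, end))
        if not tokens & seen:  # keep only spans that don't overlap kept ones
            result.append((start, end, label))
            seen |= tokens
    return sorted(result)

print(normalize_spans([(0, 3, "ORG"), (1, 2, "ORG"), (4, 6, "PER")]))
# The longer (0, 3) span wins over the contained (1, 2) span
```

Newer spaCy versions also ship `spacy.util.filter_spans`, which does essentially the same thing for `Span` objects.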
I could not use the ner.teach recipe because, as far as I understand, Prodigy's PatternMatcher returns only one match. I'd like to combine the knowledge from the patterns with the model's predictions in the manual view.
Here is what my recipe looks like. I'd really appreciate some quick help.
import prodigy
import spacy
from prodigy.components.loaders import get_stream
from prodigy.util import log

# get_labels, split_tokens and MyPatternMatcher are imported/defined
# elsewhere; exact module paths vary by Prodigy version.


@prodigy.recipe('ner.semi-manual',
                dataset=prodigy.recipe_args['dataset'],
                spacy_model=prodigy.recipe_args['spacy_model'],
                source=prodigy.recipe_args['source'],
                api=prodigy.recipe_args['api'],
                loader=prodigy.recipe_args['loader'],
                label=prodigy.recipe_args['label'],
                patterns=prodigy.recipe_args['patterns'],
                exclude=prodigy.recipe_args['exclude'])
def manual(dataset, spacy_model, source=None, api=None, loader=None,
           label=None, patterns=None, exclude=None):
    """
    Mark spans by token. Requires only a tokenizer and no entity recognizer,
    and doesn't do any active learning.
    """
    log("RECIPE: Starting recipe ner.semi-manual", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))
    labels = get_labels(label, nlp)
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)
    my_matcher = MyPatternMatcher(nlp).from_disk(patterns)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
    stream = split_tokens(nlp, stream)
    stream = my_matcher(stream)  # adds spans to tasks based on matched patterns
    return {
        'view_id': 'ner_manual',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'config': {'labels': labels}
    }
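For context, the patterns file I'm loading with `from_disk` follows the usual Prodigy JSONL match-patterns format, one pattern per line (these entries are just examples, not my real patterns):

```json
{"label": "ORG", "pattern": [{"lower": "apple"}, {"lower": "inc"}]}
{"label": "PER", "pattern": "John Smith"}
```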
Thanks