Update:
After experimenting with the hints given in "Forcing NER to ignore stopwords" and "patterns using regex or shape", I'm still confused...
Short summary of these topics: the goal was to automatically reject certain predictions in a custom manner. While Ines' suggestion in the first thread alters the ner.teach
recipe itself, Matt proposes writing a recipe wrapper around it.
Both answers are followed by a discussion, because these automatic rejections are not added to the database (or this has to be done explicitly in the altered recipes).
Question:
What do I have to do, e.g. in the wrapper approach, to make the auto-processed examples show up in the Prodigy web interface? I can see the history and the progress of the examples I clicked, but these do not include the auto-rejects. I assume that if the auto-rejects showed up there, they would also be saved along with the others when saving with CTRL+S?
My current recipe wrapper looks like this:
```python
import prodigy
from prodigy.recipes.ner import teach


@prodigy.recipe('custom.ner.teach', **teach.__annotations__)
def custom_ner_teach(dataset, spacy_model, source=None, api=None, loader=None,
                     label=None, patterns=None, exclude=None, unsegmented=False):
    """Custom wrapper for the ner.teach recipe that replaces the stream."""
    components = teach(**locals())
    original_stream = components['stream']
    original_update = components['update']
    bad_spans = []

    def get_modified_stream():
        nonlocal bad_spans
        for eg in original_stream:
            for span in eg['spans']:
                # Auto-reject known false positives and spans that start or
                # end with stray punctuation.
                if (span['text'].lower() in ["impressum", "imprint"]
                        or span['text'][0] in [",", ":", "|", "-"]
                        or span['text'][-1] in [",", ":", "|", "-"]):
                    print("ANSWER '{}' rejected".format(span['text']))
                    eg['answer'] = 'reject'
                    bad_spans.append(eg)
                    break
            else:
                # No span was auto-rejected, so send the example to the web app.
                yield eg

    def modified_update(batch):
        nonlocal bad_spans
        # Add the auto-rejects to the answered batch so they are included
        # in the model update.
        batch = batch + bad_spans
        print("LEN:", len(bad_spans))
        bad_spans = []
        return original_update(batch)

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    return components
```
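For completeness, I start the wrapper like this (the dataset, model and source names are just examples; the recipe is assumed to live in a file called `custom_recipe.py`):

```
prodigy custom.ner.teach my_dataset en_core_web_sm data.jsonl --label ORG -F custom_recipe.py
```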
The logging and my print statements tell me that the auto-reject works as expected and that these rejects are considered by the model update... but they aren't written to the dataset.
I can globally connect to the database and call
`db.add_examples([eg], datasets=[dataset])`
right after setting the answer to 'reject', and this will add the example to the database. But I would like to incorporate this in the web interface. Otherwise my session counter will only include my manual annotations, not the automatic ones.
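For reference, a sketch of this quick fix (the `is_bad_span` helper is just a stand-in for the checks in the wrapper above, and `connect()` with no arguments assumes the database settings from prodigy.json):

```python
from prodigy.components.db import connect

db = connect()  # assumes the database settings from prodigy.json

def get_modified_stream():
    for eg in original_stream:
        for span in eg['spans']:
            if is_bad_span(span):  # stand-in for the checks in the wrapper above
                eg['answer'] = 'reject'
                # Quick fix: write the auto-reject straight to the dataset,
                # bypassing the web app and its session counter entirely.
                db.add_examples([eg], datasets=[dataset])
                break
        else:
            yield eg
```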
Edit: Even when using the quick fix mentioned above, the added examples are missing a view_id and session_id. The latter leads to the problem that these entries are automatically re-added when starting a new session.
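The closest workaround I can think of is filling in these fields by hand before saving. Note that the `_view_id` value and the session naming below are guesses on my part, not something I've confirmed against Prodigy's internals:

```python
import datetime
from prodigy import set_hashes

def save_auto_reject(eg, dataset, db):
    """Tentative sketch: mimic the metadata the web app would add."""
    eg = set_hashes(eg)     # adds _input_hash and _task_hash
    eg['_view_id'] = 'ner'  # guess: the interface used by ner.teach
    # Guess: a timestamp, similar to Prodigy's session dataset names
    eg['_session_id'] = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    db.add_examples([eg], datasets=[dataset])
```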