NER not containing <word_list>

k.schroeer · August 28, 2019, 10:00am

Update:
After experimenting with the hints given in "Forcing NER to ignore stopwords" and "patterns using regex or shape" I'm still confused...

Short summary of these topics: the goal was to automatically reject certain predictions according to custom rules. Ines' suggestion in the first thread modifies the ner.teach recipe itself, while Matt proposes writing a wrapper around the recipe.

Both answers are followed by a discussion, because these automatic rejections are not added to the database (unless that is done explicitly in the altered recipe).

Question:
What do I have to do, e.g. in the wrapper approach, so that the Prodigy web interface also counts the automatically processed examples? I can see the history and the progress of the examples I "clicked", but these do not include the auto-rejects. I assume that if the auto-rejects showed up there, they would also be saved along with the others when saving with Ctrl+S?

My current recipe wrapper looks like this:

import prodigy
from prodigy.recipes.ner import teach


@prodigy.recipe('custom.ner.teach', **teach.__annotations__)
def custom_ner_teach(dataset, spacy_model, source=None, api=None, loader=None,
                     label=None, patterns=None, exclude=None, unsegmented=False):
    """Custom wrapper for ner.teach recipe that replaces the stream."""
    components = teach(**locals())

    original_stream = components['stream']
    original_update = components['update']
    bad_spans = []

    def get_modified_stream():
        nonlocal bad_spans
        for eg in original_stream:
            for span in eg['spans']:
                text = span['text']
                # Reject boilerplate words and spans that start or end with punctuation.
                if (text.lower() in ("impressum", "imprint")
                        or text.startswith((",", ":", "|", "-"))
                        or text.endswith((",", ":", "|", "-"))):
                    print("ANSWER '{}' rejected".format(span['text']))
                    eg['answer'] = 'reject'
                    bad_spans.append(eg)
                    break
            else:
                yield eg

    def modified_update(batch):
        nonlocal bad_spans
        batch = batch + bad_spans
        print("LEN:", len(bad_spans))
        bad_spans = []
        return original_update(batch)

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    return components
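
The rejection rule from the wrapper can also be pulled out into a plain function, so it can be checked without starting the Prodigy server. This is only a minimal sketch: the names `is_bad_span` and `split_stream` are made up for illustration and are not part of Prodigy, and the example dicts only mimic the `text`/`spans` shape used above.

```python
PUNCT = (",", ":", "|", "-")
BLOCKLIST = {"impressum", "imprint"}


def is_bad_span(text):
    """Return True if a span text should be auto-rejected."""
    return (
        text.lower() in BLOCKLIST
        or text.startswith(PUNCT)
        or text.endswith(PUNCT)
    )


def split_stream(stream, rejected):
    """Yield examples to annotate; collect auto-rejected copies in `rejected`."""
    for eg in stream:
        if any(is_bad_span(span["text"]) for span in eg.get("spans", [])):
            rejected.append(dict(eg, answer="reject"))
        else:
            yield eg


# Toy examples shaped like the tasks in the wrapper above.
examples = [
    {"text": "Impressum", "spans": [{"text": "Impressum"}]},
    {"text": "Acme GmbH", "spans": [{"text": "Acme GmbH"}]},
    {"text": "Berlin,", "spans": [{"text": "Berlin,"}]},
]
rejected = []
kept = list(split_stream(examples, rejected))
print([eg["text"] for eg in kept])      # ['Acme GmbH']
print([eg["text"] for eg in rejected])  # ['Impressum', 'Berlin,']
```

Using `startswith`/`endswith` with a tuple also avoids an IndexError if a span text is ever empty.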

The logging and my prints tell me that the auto-reject works as expected and that these rejects are included in the model update, but they aren't written to the dataset.

I can connect to the database globally and add

db.add_examples([eg], datasets=[dataset])

right after setting the answer to 'reject', and this does add the example to the database. But I would like this to be reflected in the web interface; otherwise my session counter only shows my manual annotations, not the automatic ones.

Edit: Even with the quick fix mentioned above, the added examples are missing a view_id and session_id. The latter leads to the problem that these entries are automatically re-added when starting a new session.
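
One possible direction for the missing fields is to stamp them onto the auto-rejected examples before saving. The sketch below is untested against a real Prodigy database: `FakeDB` is a stand-in so the snippet runs on its own, `save_auto_rejects` is a hypothetical helper, and the key names `_session_id` and `_view_id` as well as the `"ner"` view are assumptions that should be compared with a manually annotated record in the dataset. Whether this is enough to stop the entries from being queued again is also not verified.

```python
class FakeDB:
    """Stand-in for the database handle, only for this sketch."""

    def __init__(self):
        self.saved = []

    def add_examples(self, examples, datasets):
        # Mirrors the call shape used above: db.add_examples([eg], datasets=[dataset])
        for name in datasets:
            self.saved.extend((name, eg) for eg in examples)


def save_auto_rejects(db, dataset, rejected, session_id, view_id="ner"):
    """Stamp bookkeeping fields on auto-rejected examples, then store them."""
    stamped = []
    for eg in rejected:
        eg = dict(eg)  # copy so the caller's example is not modified
        eg["answer"] = "reject"
        # Assumed key names; existing values are kept if already present.
        eg.setdefault("_session_id", session_id)
        eg.setdefault("_view_id", view_id)
        stamped.append(eg)
    if stamped:
        db.add_examples(stamped, datasets=[dataset])
    return stamped


db = FakeDB()
auto = [{"text": "Impressum", "spans": [{"text": "Impressum"}]}]
stamped = save_auto_rejects(db, "my_dataset", auto, session_id="my_dataset-auto")
print(stamped[0]["answer"], stamped[0]["_session_id"], stamped[0]["_view_id"])
```

In the wrapper, a helper like this could be called from modified_update with the real database handle, so the auto-rejects are written once per batch instead of one at a time.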
