Forcing NER to ignore stopwords

Hi,

I am wondering if it’s possible to provide a patterns file or something similar with ‘negative’ terms for the model to reject/ignore.

I am currently using a language model with no word embeddings, so I cannot use terms.teach. Instead, I am creating my own patterns file to help start off ner.teach. I see this file only has accepted terms, not rejected ones, i.e. there is no ‘answer’ field like in the annotation files. The issue I am having is that my model is picking up a lot of stopwords – even after 1000+ annotations, when I create the model and try make-gold, it is still identifying stopwords. I don’t want to remove them from my corpus, but is there a way to force the model not to identify them as possible named entities?

Thanks.

The “antipatterns” idea is something that has been suggested before, and we’re thinking about a good way to solve this. One complication is that in general, Prodigy shouldn’t just silently mark anything as accept or reject without at least giving the user a chance to review it. This would also go against the philosophy of datasets always being an exact record of the individual annotation decisions.

One solution would be to wrap the already sorted and predicted stream in another generator that checks whether the span text is a stop word, and only presents examples for annotation that do not have stop words highlighted. The removed examples are stored and can then be added later.

STOP_WORDS = ['is', 'a', 'the']  # etc.
removed_examples = []

def extract_stopwords(stream):
    for eg in stream:
        spans = eg.get('spans', [])  # get the spans
        if any(span['text'].lower() in STOP_WORDS for span in spans):
            removed_examples.append(eg)  # store example
        else:  # only present examples without stop word spans
            yield eg

stream = prefer_uncertain(model(stream))
stream = extract_stopwords(stream)

You can then return an 'on_exit' callback from your recipe that takes the removed examples and does something with them – for example, you could save them out to a file, review them and then add them to your dataset via db-in. Or you could add them to the dataset straight away with "answer": "reject".

def on_exit(ctrl):
    nonlocal removed_examples
    # this is called when you exit the Prodigy server
    dataset_name = ctrl.dataset
    database = ctrl.db
    # do something here
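For instance, a minimal sketch of that callback could save the removed examples out to a JSONL file for later review with db-in (the filename here is just an example, not anything Prodigy expects):

```python
import json

removed_examples = []  # collected by the stream wrapper above

def on_exit(ctrl):
    # called when you exit the Prodigy server: mark the filtered
    # examples as rejected and save them for review / db-in
    with open('removed_examples.jsonl', 'w', encoding='utf8') as f:
        for eg in removed_examples:
            eg['answer'] = 'reject'
            f.write(json.dumps(eg) + '\n')
```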

This is great – thank you Ines! The annotating is already so much faster without including these obvious rejections. As suggested, on exit I assign those examples as rejected; however, I am a little uncertain how to save them back to my dataset. The relevant part of my recipe is:

def reject_stopwords(removed_examples):
    for eg in removed_examples:
        eg['answer'] = 'reject'
        print(eg)


def on_exit(ctrl):
    nonlocal removed_examples
    # this is called when you exit the Prodigy server
    dataset_name = ctrl.dataset
    database = ctrl.db
    # do something here
    reject_stopwords(removed_examples)

I am not sure whether I should collect the updated rejected examples into a list and pass that back, or whether I can update the dataset on the fly?

Thanks.

Yay, glad to hear it worked!

The controller gives you access to the database, which has an add_examples method. This lets you add a list of tasks to one or more datasets. So you could do the following, which will add your removed examples to the current dataset:

ctrl.db.add_examples(removed_examples, datasets=[ctrl.dataset])
print("Added {} rejected examples to dataset".format(len(removed_examples)))
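Putting that together with the answer assignment, a complete on_exit callback could look like this (a sketch; removed_examples is the list collected by the stream wrapper):

```python
removed_examples = []  # filled by the stream wrapper during annotation

def on_exit(ctrl):
    # called when you exit the Prodigy server
    for eg in removed_examples:
        eg['answer'] = 'reject'  # record these as rejected answers
    ctrl.db.add_examples(removed_examples, datasets=[ctrl.dataset])
    print("Added {} rejected examples to dataset".format(len(removed_examples)))
```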

Doing it on_exit is nice because you only have to write to the database once – but it also means that if your process dies for some reason, you’ll lose the removed examples.

I think it’d also be fine to add individual examples while you process the stream – at least, I don’t see why it wouldn’t. Here’s an example:

from prodigy.components.db import connect
db = connect()  # connect using settings from your prodigy.json

def extract_stopwords(stream):
    for eg in stream:
        spans = eg.get('spans', [])  # get the spans
        if any(span['text'].lower() in STOP_WORDS for span in spans):
            db.add_examples([eg], datasets=[dataset])  # dataset name from the recipe
        else:  # only present examples without stop word spans
            yield eg

Interestingly, when I implemented this I started to get a lot of duplicates, one after the other. I tried to write a method to exclude them, but they don’t seem visible in my Python function – yet they are offered up for annotation and are present in the resulting dataset.

Inside the remove_stopwords function I print the span info (text, start, end, input_hash, etc.), and none of the yielded examples are the same. However, I am sometimes being offered the exact same example up to 4 times in a row before it moves on to another example. I can see these repeated examples in the output dataset: they have different scores and task_hash, but the same input_hash, text, start and end.

I am quite confused why I cannot see these duplicates inside the remove_stopwords function and use that to avoid offering the same example repeatedly. Do you have any ideas on how I can avoid seeing the same example repetitively?

At which point are you applying the remove_stopwords wrapper to your stream? Could you double-check whether it’s at the very last step, i.e. after the prefer_uncertain sorter?

Prodigy asking about the same span with different labels is not uncommon – basically, ner.teach will use the beam search algorithm to find all possible analyses for a given text and attach scores to them. The sorter will then pick out the most uncertain ones and often, this does include one span with different label options.
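As an aside: if you do want to suppress exact repeats at the stream level, one option is to key each task on its input hash plus the span offsets and skip anything already presented. A sketch (assuming Prodigy’s _input_hash task field; note this would also hide repeats that only differ in their label):

```python
def filter_duplicates(stream):
    seen = set()  # keys of tasks already presented
    for eg in stream:
        key = tuple((eg.get('_input_hash'), span['start'], span['end'])
                    for span in eg.get('spans', []))
        if key not in seen:  # skip tasks whose spans were already shown
            seen.add(key)
            yield eg
```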

One possible explanation for what’s going on here: The model in the loop makes suggestions for the stop word entities – but it never actually receives any negative feedback on them, so it doesn’t know that it should stop asking about them. reject is actually super important in the recipes that use a model in the loop. (This is also something I hadn’t considered in my initial example.)

So one thing you could try is to also call the update function that updates the model when you filter out the rejected spans. This is the same function that’s returned by the recipe as 'update': update.

def extract_stopwords(stream, update):
    for eg in stream:
        spans = eg.get('spans', [])  # get the spans
        if any(span['text'].lower() in STOP_WORDS for span in spans):
            db.add_examples([eg], datasets=[dataset])
            update([eg])
        else:  # only present examples without stop word spans
            yield eg

stream = prefer_uncertain(model(stream))
stream = extract_stopwords(stream, update)

So far my stopwords function looks like this:

def extract_stopwords(stream):
    span_list = []
    for eg in stream:
        spans = eg.get('spans', [])  # get the spans
        for span in spans:
            span_string = ' '.join([str(span['input_hash']), str(span['start']), str(span['end'])])

            # REMOVE STOPWORDS
            if span['text'].lower() in stopwords:
                removed_examples.append(eg)  # store example
                # print('CONTAINS STOPWORD')

            # REMOVE SPANS WITH NON-ALPHABETIC CHARACTERS
            elif not span['text'].isalpha():
                removed_examples.append(eg)  # store example
                # print('CONTAINS NON-ALPHABETIC CHARACTER')

            else:
                print(span)
                # CHECK IF DUPLICATE
                if span_string in span_list:
                    duplicates.append(eg)
                    print('DUPLICATE')

                # PRESENT EXAMPLE
                else:
                    span_list.append(span_string)
                    print('ALLOWED EXAMPLE')
                    yield eg  # only present other examples

It has evolved to also remove words with non-alphabetic characters, and I tried to add the deduplication here. I call it as you suggested:

stream = prefer_uncertain(model(stream))
stream = extract_stopwords(stream)

I tried adding update([eg]) to both the stopword and non-alpha branches, but this threw an error:

File "cython_src/prodigy/models/matcher.pyx", line 108, in prodigy.models.matcher.PatternMatcher.update
KeyError: 'answer'

This looks good so far!

I think you might have to call your reject_stopwords function first, or manually add the answer you want to update the model with:

eg['answer'] = 'reject'
update([eg])
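For reference, the stream wrapper with that fix folded in might look like this (a sketch combining the stop word check, the answer assignment and the model update; the update callback comes from the recipe):

```python
STOP_WORDS = ['is', 'a', 'the']  # etc.
removed_examples = []

def extract_stopwords(stream, update):
    for eg in stream:
        spans = eg.get('spans', [])
        if any(span['text'].lower() in STOP_WORDS for span in spans):
            eg['answer'] = 'reject'      # the update function expects an answer
            update([eg])                 # give the model negative feedback
            removed_examples.append(eg)  # keep a record to save on exit
        else:
            yield eg
```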

Amazing! That fixed it :smiley:

Huge thank you Ines. Hope you have a good rest of your weekend.