NER not containing <word_list>

Hi!

Situation
I use a blank model to train a new NER entity, which is the only entity in the model. Inputs are spans of varying length (approx. 1-15 tokens), which may or may not contain this entity (or in some cases the given span should be recognized as this entity entirely).

There is a set of words/tokens that will never be part of this entity. I used ner.manual and ner.teach to generate a small(?) dataset of 2000 entries.

Problem
This dataset contains many entries in which the "no-go" words/tokens mentioned above are explicitly rejected as single entities (e.g. commas, "|" or the word "Impressum"). Still, after batch-training my dataset to a new model (accuracy 85+%) and looking at its predictions, it still often marks these as single-token entities.

Question
How many examples explicitly rejecting these tokens do I need to avoid this behaviour? I know I could add a filter in the pipeline, eliminating these obvious false positives, but I think this is not the way to go.
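
Just to illustrate what I mean by "a filter in the pipeline", here is a minimal sketch in spaCy v2 style (BAD_TOKENS is a made-up name for my no-go list, nlp is the loaded model with the trained NER):

# Sketch only: drop predicted entities that consist of blacklisted tokens.
BAD_TOKENS = {",", ":", "|", "-", "impressum", "imprint"}

def remove_blacklisted_ents(doc):
    doc.ents = [ent for ent in doc.ents if ent.text.lower() not in BAD_TOKENS]
    return doc

nlp.add_pipe(remove_blacklisted_ents, after="ner")

But as said, this only hides the symptom; I'd rather have the model learn it.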

I found topics like Forcing NER to ignore stopwords that I could adapt to my question. Still, I was wondering whether there is another possibility in the newer versions to tell spaCy/Prodigy that some words can't be part of an entity.

A last addition: Is it possible to constrain the model in such a way that it can only predict ONE or NO span of my entity type per input document? Like if there are plenty of entity guesses, only take the highest-scoring one? Maybe this would already help with the problem mentioned before. How does one access this internal NER entity scoring in general?

Thank you for your help!

Update:
After experimenting with the hints given in "Forcing NER to ignore stopwords" and "patterns using regex or shape" I'm still confused...

Short summary of these topics: The goal was to automatically reject given predictions in a custom manner. While Ines' suggestion in the first one alters the ner.teach recipe itself, Matt proposes writing a recipe wrapper around it.

Both answers are followed by a discussion, because these automatic rejections are not added to the database (or this has to be done explicitly in the altered recipes).

Question:
What do I have to do, e.g. in the wrapper approach, to update the processed examples in the Prodigy web server interface? I mean, I can see the history and the progress of "clicked" examples, but these do not include the auto-rejects. I assume if these auto-rejects showed up there, they would also be saved along with the others when saving with CTRL+S?

My current recipe wrapper looks like this:

import prodigy
from prodigy.recipes.ner import teach  # the built-in recipe being wrapped

@prodigy.recipe('custom.ner.teach', **teach.__annotations__)
def custom_ner_teach(dataset, spacy_model, source=None, api=None, loader=None,
                     label=None, patterns=None, exclude=None, unsegmented=False):
    """Custom wrapper for ner.teach recipe that replaces the stream."""
    components = teach(**locals())

    original_stream = components['stream']
    original_update = components['update']
    bad_spans = []

    def get_modified_stream():
        # Auto-reject tasks whose suggested span hits the blacklist and buffer
        # them; everything else is passed through for manual annotation.
        nonlocal bad_spans
        for eg in original_stream:
            for span in eg['spans']:
                if (span['text'].lower() in ["impressum", "imprint"]) or (span['text'][0] in [",", ":", "|", "-"]) or (span['text'][-1] in [",", ":", "|", "-"]):
                    print("ANSWER '{}' rejected".format(span['text']))
                    eg['answer'] = 'reject'
                    bad_spans.append(eg)
                    break
            else:
                yield eg

    def modified_update(batch):
        # Include the buffered auto-rejects in the model update as well.
        nonlocal bad_spans
        batch = batch + bad_spans
        print("LEN:", len(bad_spans))
        bad_spans = []
        return original_update(batch)

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    return components

The logging and my prints tell me that the auto-reject works as expected and that these rejects are considered by the model update... but they aren't written to the dataset.

I can globally connect to the database and add

db.add_examples([eg], datasets=[dataset])

right after setting the answer to 'reject', and this will add to the database. But I would like to incorporate this into the web interface. Otherwise my session counter will only count my manual annotations, not the automatic ones.
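
(For reference, the global db handle I use for this comes from Prodigy's database helper:)

from prodigy.components.db import connect

db = connect()  # uses the database settings from prodigy.json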

Edit: Even when using the above-mentioned quick fix, the added examples are missing a view_id and session_id. The latter leads to the problem that these entries are automatically re-added when starting a new session.

Hi Kevin,

I have a few ideas for what the problem could be, but I'm not 100% sure. A few questions:

Is this accuracy figure on a held-out dataset, or a random split of the data? If you're taking a random split and you've used the ner.teach recipe (which makes a non-uniform sample) you might be getting results that aren't really representative of performance on other data.

It's hard to tell, but it's possible this task isn't best modeled as entity recognition. If the phrases are really long, the model might require many examples to learn the problem, because the boundaries will be a bit vague. This might be one reason why the model isn't performing that well.

2000 entries is also a fairly small dataset for named entity recognition. It could be that the model isn't really sure what theory to come up with based on your annotations, and it settles on just predicting capitalised words.

Are you training with --no-missing? I think this might be the problem: if the model doesn't have many examples, you really need to have complete annotations with --no-missing in order to get good performance. Otherwise the model doesn't know whether some word that's not annotated is actually an entity.

There's not really any good way to say unfortunately. Different problems are different --- if the annotations are consistent and the policy is easy to learn, sometimes you need very few examples. But if there's a deeper problem, or you're starting from a blank model with incomplete annotations, it will require a lot of examples.

Perhaps you could consider text classification? It really sounds like it might be a better fit for your problem, and it will be much easier to get good performance.

Hi Matt,

first of all, thank you very much for your very detailed answer to my numerous questions!

It was a random split, but I'm currently generating a gold/eval set for better comparability. I got somewhat confused by the difference some recipes would make in this particular use case, so I opened a separate topic (Gold/Silver Dataset Confusion).

I think I was unclear in my explanation: 1-15 token spans referred to the length of the input sentences; the entities are mostly 1-4 tokens. I wanted to avoid the word "sentence" because these lines are not necessarily sentences in a grammatical sense. I generate the data by simply cutting at the first newline character and suppressing further segmentation during teaching with the -U flag.

The 2000 were just for a first glimpse; I'm confident that something can be achieved here, because the entities are often "framed" by a set of typical stopwords like "-" or "in".

Then, no :sweat_smile:. Now, yes! I explain my workflow in the above-mentioned Gold/Silver Dataset Confusion topic, but in short: it greatly improved the predictions, I have only a few false positives now, and I am trying to improve the accuracy with more examples (and a dedicated eval set :wink: )

How many would "a lot" be in this context? Do you have some kind of documentation on how to experiment with the hyperparameters of batch-train? It feels like you need much experience in the field to get some intuition for when things like the beam width, the number of iterations or the batch size have to be changed. For quick results I am happy with the wise defaults of the Prodigy recipes, but of course I want to get a deeper understanding for the final improvements.

I think this solved itself due to the misunderstanding above. I am already using text classification for other tasks, mainly the verification (i.e. plausibility) of rule-based entities.

Remaining Question
Regarding my update with the recipe wrapper... is it possible to write a wrapper that can auto-accept/reject examples and still keep them in the normal controller workflow? I mean, I can add explicit commands to add them to the database and incorporate them in the model update, but I will always filter them from the example stream in order to avoid them being displayed.
But inside the wrapper I have no access to the controller (except in on_exit), so I can't update things like the total of processed examples in the current session.

What you could do is create the controller object within the recipe. Instead of returning the dict, just use it to construct the controller:


recipe_args = {
    "dataset": dataset,
    "stream": stream,
    # etc
}
controller = prodigy.core.Controller(**recipe_args)

You can return a controller object from the recipe instead of the dict, in order to support exactly this type of use-case.

Just for my understanding, would this collide with the ner.teach recipe controller I'm wrapping?

def custom_ner_teach(dataset, spacy_model, source=None, api=None, loader=None,
                     label=None, patterns=None, exclude=None, unsegmented=False):
    """Custom wrapper for ner.teach recipe that replaces the stream."""

    components = teach(**locals())

    input_hashes = db.get_input_hashes(dataset)
    original_stream = components['stream']
    original_update = components['update']
    bad_spans = []


    def get_modified_stream():
        nonlocal bad_spans
        for eg in filter_inputs(original_stream, input_hashes):
            for span in eg['spans']:
                if is_bad(span['text']):
                    print("ANSWER '{}' rejected".format(span['text']))
                    eg['answer'] = 'reject'
                    bad_spans.append(eg)
                    # manually add to the database
                    db.add_examples([eg], datasets=[dataset])
                    break
            else:
                yield eg

    def modified_update(batch):
        nonlocal bad_spans
        batch = batch + bad_spans
        print("LEN:", len(bad_spans))
        bad_spans = []
        return original_update(batch)

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    return components
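
(For completeness, the helpers referenced above, roughly as I have them: is_bad() is just the blacklist check from my first wrapper factored out, written so it accepts either the span dict or its text, and filter_inputs / db come from Prodigy's components:)

from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

db = connect()

def is_bad(span):
    # Accept either the span dict or the raw text, since I call it both ways.
    text = span["text"] if isinstance(span, dict) else span
    return (text.lower() in ("impressum", "imprint")
            or text[0] in ",:|-"
            or text[-1] in ",:|-")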

I'd guess that the teach call will already communicate with the formerly created controller? If I create a new one inside my recipe, will I lose the information teach added? From my understanding, I only have to add a controller.receive_answers() call for my auto-rejected examples and leave the rest untouched.

I tried to create a controller inside my recipe, but it failed because it lacked some arguments. Where can I get the missing defaults?

Sorry to ask, but the fiddling with the controller is completely new for me.

Ah, you might find the prodigy.util.get_config function helpful --- if there are any arguments you need to fill in from your prodigy.json or environment variables, that's the easiest way to get them. You can find docs on the arguments the controller expects in the readme. In case you don't have it handy:

## Controller

The controller takes care of putting the individual recipe components together
and exposes methods that allow the application to interact with the REST API.
This is usually done when you use the `@recipe` decorator on a function that
returns a dictionary of components. However, you can also choose to initialise
a `Controller` yourself and make your recipe return it.

### <kbd>METHOD</kbd> Controller.\_\_init\_\_

Initialise the controller.


    from prodigy.controller import Controller
    controller = Controller(dataset, view_id, stream, update, db, progress, on_load,
                            on_exit, get_session_id, exclude, config)


| Argument         | Type         | Description                                                                 |
| ---------------- | ------------ | --------------------------------------------------------------------------- |
| `dataset`        | unicode      | The ID of the current project.                                              |
| `view_id`        | unicode      | The annotation interface to use.                                            |
| `stream`         | iterable     | The stream of annotation tasks.                                             |
| `update`         | callable     | The update function called when annotated tasks are received.               |
| `db`             | callable     | The database ID, component or custom storage function.                      |
| `progress`       | callable     | The progress function that computes the annotation progress.                |
| `on_load`        | callable     | The on load function that gets called when Prodigy is started.              |
| `on_exit`        | callable     | The on exit function that gets called when the user exits Prodigy.          |
| `get_session_id` | callable     | Function that returns a custom session ID. If not set, a timestamp is used. |
| `exclude`        | list         | List of dataset IDs whose annotations to exclude.                           |
| `config`         | dict         | Recipe-specific configuration.                                              |
| **RETURNS**      | `Controller` | The recipe controller.                                                      |

All arguments of the controller are also accessible as attributes, for example
`controller.db`. In addition, the controller exposes the following
attributes:

| Argument            | Type      | Description                                                                      |
| ------------------- | --------- | -------------------------------------------------------------------------------- |
| `home`              | `Path`    | Path to Prodigy home directory.                                                  |
| `session_id`        | unicode   | ID of the current session, generated from a timestamp.                           |
| `batch_size`        | int       | The number of tasks to return at once. Taken from `config` and defaults to `10`. |
| `queue`             | generator | The batched-up stream of annotation tasks.                                       |
| `total_annotated`   | int       | Number of tasks annotated in the current project.                                |
| `session_annotated` | int       | Number of tasks annotated in the current session.                                |

Finally, don't worry about any 'formerly created controller' --- the teach() call above isn't creating a controller, it just returns the components dict. So you just need to do something like:


controller = Controller(**components)

to create your controller, so that you can use it within your recipe.

Well...

This is what I naively tried before, but the teach recipe only returns the components view_id, dataset, stream, update, exclude and config. When testing my recipe I get the error

controller = prodigy.core.Controller(**components)
File "cython_src\prodigy\core.pyx", line 22, in prodigy.core.Controller.__init__
TypeError: __init__() takes exactly 12 positional arguments (5 given)

which makes sense, because I'll have to add the other arguments as mentioned in the readme. But these arguments include things like the progress or on_load callables, and THESE defaults I don't know.

I'm not sure if prodigy.util.get_config will help me there...

Ah right. Set None for the callbacks like on_load etc. For progress, you can use:


progress = prodigy.components.progress.ProgressEstimator()

Thank you very much! I'm sorry to still bother you, but though I have no more errors, there is also no alteration of my stream anymore!

@prodigy.recipe('custom.ner.teach2', **teach.__annotations__)
def custom_ner_teach2(dataset, spacy_model, source=None, api=None, loader=None,
                     label=None, patterns=None, exclude=None, unsegmented=False):
    """Custom wrapper for ner.teach recipe that replaces the stream."""
    components = teach(**locals())
    input_hashes = db.get_input_hashes(dataset)
    original_stream = components['stream']

    controller = prodigy.core.Controller(on_load=None,
                                         on_exit=None,
                                         progress=prodigy.components.progress.ProgressEstimator(),
                                         db=None,
                                         get_session_id=None,
                                         **components)

    def get_modified_stream():
        nonlocal controller
        for eg in filter_inputs(original_stream, input_hashes):
            for span in eg['spans']:
                if is_bad(span):
                    print("ANSWER '{}' rejected".format(span['text']))
                    eg['answer'] = 'reject'
                    controller.receive_answers([eg])
                    break
            else:
                yield eg

    controller.stream = get_modified_stream()
    return controller

I thought I could modify the stream, look for my auto-rejections and in this case automatically trigger an answer reception of the controller via controller.receive_answers to integrate this into the default handling.

Side question: Is it normal that starting a recipe with custom controller takes significantly longer than usual?

I think the problem is probably due to the way you're creating the stream twice there. Can you do this instead?

@prodigy.recipe('custom.ner.teach2', **teach.__annotations__)
def custom_ner_teach2(dataset, spacy_model, source=None, api=None, loader=None,
                     label=None, patterns=None, exclude=None, unsegmented=False):
    """Custom wrapper for ner.teach recipe that replaces the stream."""
    components = teach(**locals())
    input_hashes = db.get_input_hashes(dataset)
    original_stream = components['stream']

    controller = None
    def get_modified_stream():
        nonlocal controller
        for eg in filter_inputs(original_stream, input_hashes):
            for span in eg['spans']:
                if is_bad(span):
                    print("ANSWER '{}' rejected".format(span['text']))
                    eg['answer'] = 'reject'
                    controller.receive_answers([eg])
                    break
            else:
                yield eg

    components["stream"] = get_modified_stream()
    controller = prodigy.core.Controller(on_load=None,
                                         on_exit=None,
                                         progress=prodigy.components.progress.ProgressEstimator(),
                                         db=None,
                                         get_session_id=None,
                                         **components)
    return controller

Yes, that fixed it! I'll give a short summary, in case anyone is interested in what this recipe achieves:

  • If one is able to define a function that decides with certainty whether an example can be accepted/rejected, the above code integrates this in the form of a teach recipe wrapper.
  • The filter_inputs part was necessary, because I found that otherwise the automatically added examples are NOT skipped when loading the dataset again to resume annotation.
  • The auto-answered examples will not appear in the history (why should they?), but they will update the TOTAL count, as well as the progress bar. From my understanding of controller.receive_answers, these auto-answered examples are immediately saved and will also update the model.

Minor remaining problems:
I inspected the generated dataset and saw that the automatically generated entries have an empty _view_id (fixed by adding eg['_view_id'] = 'ner' in the get_modified_stream() function, sketched below) and an empty _session_id. The normally saved ones have 'no_dataset-default' as session id.

I assume this has something to do with giving None for the get_session_id callable, but I'm confused why this has a different impact on my two different answer types (auto/manual).
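
For reference, the _view_id fix mentioned above is just one extra line in the auto-reject branch of get_modified_stream():

if is_bad(span):
    eg['answer'] = 'reject'
    eg['_view_id'] = 'ner'  # otherwise the auto-answered example is stored without a view ID
    controller.receive_answers([eg])
    break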

But again, I already want to thank you very much for your support and for a response time that is almost reaching chat quality :wink: