ner.match error with exact string patterns

Hey,

I’m having problems using the ner.match recipe for my project. I want to bootstrap a new entity type, so I made a JSONL file with a few exact strings to match:

{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx807"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx808"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx809"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx810"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx811"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx812"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx813"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx814"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx815"}]}
{"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx816"}]}

However, when I try to run ner.match with the patterns file, I get this error:

→ prodigy ner.match maintenance_reports_annotations en_core_web_lg cleaned_events.jsonl --patterns ./patterns/master_equipment_patterns.jsonl
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/prodigy/__main__.py", line 254, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 152, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 50, in match
    model.update(existing)
  File "cython_src/prodigy/models/matcher.pyx", line 109, in prodigy.models.matcher.PatternMatcher.update
KeyError: 'spans'
(env)

Am I doing something wrong? I guess there has to be something with the patterns file?

The documentation says:

…patterns file can include exact strings, regular expressions, or token patterns for use with spaCy’s Matcher class…

Thanks in advance! :upside_down_face:

Edit: I just tested adding some random words with terms.teach and then used terms.to-patterns. However, I’m getting the same error with the new patterns file.

I just had a look at the recipe and it seems like the error occurs when the existing annotations from the dataset are added to the pattern matcher model. Is it possible that your set maintenance_reports_annotations includes other types of annotations that are not NER annotations? Or something without a span? And does the error go away if you use an empty dataset? (You can also set the environment variable PRODIGY_LOGGING=basic to see what’s going on behind the scenes.)
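If you want to check this by hand, something like the following could flag the examples that have no "spans" key. It’s only a rough sketch: it assumes you’ve already exported the dataset to a list of dicts (for example from a JSONL export), and find_examples_without_spans is a made-up helper name, not part of Prodigy.

```python
def find_examples_without_spans(examples):
    """Return (index, example) pairs for examples that lack a 'spans' key.

    `examples` is assumed to be a list of annotation dicts, e.g. loaded
    from a JSONL export of the dataset. Hypothetical helper, not a Prodigy API.
    """
    return [(i, eg) for i, eg in enumerate(examples) if "spans" not in eg]


# Toy data standing in for an exported dataset
examples = [
    {"text": "Pump 18mx807 failed", "spans": [{"start": 5, "end": 12, "label": "MASTER_EQUIPMENT"}]},
    {"text": "No spans here", "answer": "accept"},
]

for index, eg in find_examples_without_spans(examples):
    print(index, eg["text"])
```

Anything this prints is a candidate for the KeyError above, since the matcher update expects every existing example to carry spans.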

Btw, it’s probably not ideal that ner.match updates the matcher and resumes by default. It’s a nice feature, but we should probably put it behind a flag.

Yes, that was it! Thank you! I’m not sure how they got there :thinking:

Thanks for updating – glad it worked! (I’ve moved this functionality behind a --resume flag btw, which is also more consistent with the other recipes.)

I’m running the latest version of Prodigy, and I keep getting this error when I try to use the patterns file. My patterns file has exactly 6083 entries, which seems to fit with the error.

→ prodigy ner.match reports_annotations en_core_web_sm cleaned_events.jsonl --patterns ./patterns/el_equip_patterns.jsonl
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 54, in match
    model.update(existing)
  File "cython_src/prodigy/models/matcher.pyx", line 112, in prodigy.models.matcher.PatternMatcher.update
IndexError: index 7389 is out of bounds for axis 0 with size 6083

Is this connected with the --resume flag? It seems to assume there are many more entries, maybe because I just ran the recipe with a much bigger patterns file? What can I do?

Edit: Same thing with another patterns file, but with a different index.

→ prodigy ner.match reports_annotations en_core_web_sm cleaned_events.jsonl --patterns ./patterns/fire_and_gas_patterns.jsonl
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/ocselvig/Code/master_thesis_ner/env/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 54, in match
    model.update(existing)
  File "cython_src/prodigy/models/matcher.pyx", line 112, in prodigy.models.matcher.PatternMatcher.update
IndexError: index 2643 is out of bounds for axis 0 with size 1558

Thanks for the report! Unfortunately, the --resume flag on ner.match didn’t actually make it into v1.5.0. I squeezed it in last minute but after we’d published the release, it turned out that the build process had only built from the previous commit :sob: Very sorry about that – it’ll definitely come in v1.5.1.

About the error: Yes, I think your analysis is correct. The scores are stored in a numpy array where the first column is the pattern ID (the value of the span’s "pattern") and the second is the annotation decision. The array is created with the same number of rows as your patterns file – so if your existing annotations include examples with a higher pattern ID, the matcher fails to update the scores.
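To illustrate the failure mode, here’s a minimal sketch with a toy scores array. The exact layout (one row per pattern, one column per decision) and the update_scores helper are assumptions made for illustration, not the actual matcher internals.

```python
import numpy as np

n_patterns = 3  # number of entries in the *current* patterns file
# Assumed layout: one row per pattern ID, columns = [accepted, rejected]
scores = np.zeros((n_patterns, 2), dtype=int)


def update_scores(scores, pattern_id, accepted):
    """Increment the accept or reject count for one pattern (hypothetical helper)."""
    column = 0 if accepted else 1
    scores[pattern_id, column] += 1


update_scores(scores, 1, accepted=True)  # fine: ID 1 exists in the current file

try:
    # An older annotation created with a bigger patterns file refers to ID 5
    update_scores(scores, 5, accepted=True)
except IndexError as err:
    message = str(err)
    print(message)
```

The second call fails in the same way as the traceback above, because the array only has as many rows as the patterns file that was loaded for this session.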

This is an edge case we hadn’t considered. We could add a simple check for this, but now that I think about it, we probably want a more sophisticated solution for pattern-heavy use cases. For example, we could hash the patterns to generate the IDs (similar to the input and task hashes).

Makes sense! Hehe, no worry :grin:

So if I get this correctly, the current implementation of Prodigy assumes one single patterns file, rather than multiple files for bootstrapping different entity types?

Yes, because all patterns have a "label" assigned, you can keep them all in one file. I definitely see the problems in ner.match, though – unlike ner.teach and most other recipes, it doesn’t actually support a --label argument at the moment, so there’s no convenient way to filter the examples you’re seeing. This is also something we should add.

In the meantime, you could just copy-paste the label argument from ner.teach and then add a stream wrapper like this:

def filter_by_label(stream, labels):
    # Yield each example at most once, if any of its spans has a wanted label
    for eg in stream:
        if any(span['label'] in labels for span in eg.get('spans', [])):
            yield eg

And then add it last:

stream = (eg for _, eg in model(stream))
stream = filter_by_label(stream, label)

Quick update: Just tested the proposed solution and it’s working well so far. So in the next version, patterns will be referenced by a hash based on the "label" and "pattern" values. The matcher will also respect a predefined "id" (to allow users to implement their own pattern ID system, if needed).

If the --resume flag is set, the matcher will be updated from the existing annotations and the hashes ensure that the scores for the correct pattern are incremented, even if different pattern files were used between sessions.
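Roughly, the idea looks like the sketch below. To be clear, this is not the real implementation: sketch_pattern_id and the SHA-1/JSON hashing scheme are stand-ins chosen for illustration, and the scores are kept in a plain dict keyed by the hash.

```python
import hashlib
import json
from collections import defaultdict


def sketch_pattern_id(entry):
    """Illustrative ID: a predefined "id" wins, else a hash of label + pattern."""
    if "id" in entry:
        return entry["id"]
    key = json.dumps({"label": entry["label"], "pattern": entry["pattern"]}, sort_keys=True)
    return hashlib.sha1(key.encode("utf8")).hexdigest()


def update_scores(scores, annotated_examples):
    """Increment [accepted, rejected] counts per pattern ID from existing annotations."""
    for eg in annotated_examples:
        for span in eg.get("spans", []):
            pattern_id = span.get("pattern")
            if pattern_id is None:  # span was not created by a pattern
                continue
            if eg.get("answer") == "accept":
                scores[pattern_id][0] += 1
            elif eg.get("answer") == "reject":
                scores[pattern_id][1] += 1


entry = {"label": "MASTER_EQUIPMENT", "pattern": [{"lower": "18mx807"}]}
pid = sketch_pattern_id(entry)

scores = defaultdict(lambda: [0, 0])
existing = [{"answer": "accept", "spans": [{"pattern": pid, "label": "MASTER_EQUIPMENT"}]}]
update_scores(scores, existing)
```

Because the ID depends only on the pattern’s content, it stays the same no matter which file the pattern came from or where in that file it sits, so there’s no row index that can go out of bounds.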

The only downside is that it’s difficult to make the change fully backwards-compatible – but if you have the original list of patterns, you could update the pattern IDs by iterating over your annotations, getting the pattern at index i and generating its hash using the get_pattern_id helper:

from prodigy.models.matcher import get_pattern_id

patterns = [...]  # <-- your list of patterns here

new_dataset = []
for eg in dataset:  # your dataset
    pattern_index = eg.get('meta', {}).get('pattern')
    if pattern_index is not None:  # span was created by a pattern (index 0 is valid too)
        pattern = patterns[pattern_index]
        pattern_id = get_pattern_id(pattern)
        eg['meta']['pattern'] = pattern_id
        for span in eg.get('spans', []):   # also edit the ID in the span
            span['pattern'] = pattern_id
    new_dataset.append(eg)

Edit: Just released v1.5.1, which includes this update! :tada: