ner correct with prodigy 1.11.8

Hi all,

I'm again facing some issues and things I don't understand.

I ran this command with the latest versions of Prodigy (1.11.8) and spaCy (3.4.3):

prodigy ner.correct trainset_reviewed model/model_spacy3/model-last trainset.jsonl --label x1,x2 --unsegmented

Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 63, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 160, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 104, in prodigy.components.feeds.Feed.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 150, in prodigy.components.feeds.Feed._init_stream
  File "cython_src/prodigy/components/stream.pyx", line 107, in prodigy.components.stream.Stream.__init__
  File "cython_src/prodigy/components/stream.pyx", line 58, in prodigy.components.stream.validate_stream
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/recipes/ner.py", line 244, in make_tasks
    for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1545, in pipe
    for doc in docs:
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1589, in pipe
    for doc in docs:
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1651, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 233, in pipe
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1600, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1651, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 73, in pipe
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1600, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1586, in <genexpr>
    docs = (self._ensure_doc(text) for text in texts)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1535, in <genexpr>
    docs_with_contexts = (
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/recipes/ner.py", line 243, in <genexpr>
    texts = ((eg["text"], eg) for eg in stream)
  File "cython_src/prodigy/components/preprocess.pyx", line 167, in add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 263, in prodigy.components.preprocess._add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 225, in prodigy.components.preprocess.sync_spans_to_tokens
KeyError: 'id'

I tried using model-best, but got the same error. I don't think I had this error with the previous version of Prodigy.

Do you have some ideas?
Thank you
Best regards
Julie

hi @JulieSarah!

In spaCy 3.4.3, can you load/run your model?

import spacy

nlp = spacy.load("model/model_spacy3/model-last")  # or model-best

texts = ["Provide a sentence you'd expect to contain one of your entities."]

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])

Do you remember what versions of Prodigy/spaCy you used to create your model? Alternatively, do you remember which version of Prodigy you last ran before this became a problem (if you were previously able to run this command)?

Did you use prodigy train or spacy train with a config.cfg to create your model?

It may also be helpful to provide the meta.json in either of your model-last or model-best folders.
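If it helps, here's a quick way to pull the relevant fields out of a meta.json (a minimal sketch; summarize_meta is a hypothetical helper, and the directory layout is just an example):

```python
import json
from pathlib import Path

def summarize_meta(model_dir: str) -> dict:
    """Read a spaCy pipeline's meta.json and return its identifying fields."""
    meta = json.loads(Path(model_dir, "meta.json").read_text(encoding="utf8"))
    return {key: meta.get(key) for key in ("lang", "name", "version", "spacy_version")}
```

You could run this against both model-last and model-best and paste the output here.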

Also, just curious, but can you run:

python -m prodigy ner.correct test_dataset en_core_web_sm trainset.jsonl --label ORG,PERSON --unsegmented

Let me know if any of these steps turn up problems.

Dear @ryanwesslen ,

  • Running the ner.correct command with en_core_web_sm led to:

Using 1 label(s): Org
/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py:885: UserWarning: [W094] Model 'en_core_web_sm' (2.2.0) specifies an under-constrained spaCy version requirement: >=2.2.0. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.4.3,<3.5.0
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 364, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/anaconda3/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/recipes/ner.py", line 215, in correct
    nlp = spacy.load(spacy_model)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/__init__.py", line 54, in load
    return util.load_model(
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 432, in load_model
    return load_model_from_package(name, **kwargs)  # type: ignore[arg-type]
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 468, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, enable=enable, exclude=exclude, config=config)  # type: ignore[attr-defined]
  File "/usr/local/anaconda3/lib/python3.8/site-packages/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 649, in load_model_from_init_py
    return load_model_from_path(
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 505, in load_model_from_path
    config = load_config(config_path, overrides=overrides)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 681, in load_config
    raise IOError(Errors.E053.format(path=config_path, name="config file"))
OSError: [E053] Could not read config file from /usr/local/anaconda3/lib/python3.8/site-packages/en_core_web_sm/en_core_web_sm-2.2.0/config.cfg
  • The Python snippet ran without any error (but no entities were detected).

We used Prodigy 1.10.7 before.

Thank you for your unfailing support.
Regards
Julie

hi @JulieSarah!

Not sure if this answers your full question, but I noticed this:

Using 1 label(s): Org
/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py:885: UserWarning: [W094] Model 'en_core_web_sm' (2.2.0) specifies an under-constrained spaCy version requirement: >=2.2.0. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.4.3,<3.5.0

Can you install a newer version of en_core_web_sm? You can do this by running python -m spacy download en_core_web_sm. It seems like you're running a spaCy v2.x pipeline on spaCy v3.x.

Dear @ryanwesslen, I have done that.

But I still get the same error:

Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 63, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 160, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 104, in prodigy.components.feeds.Feed.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 150, in prodigy.components.feeds.Feed._init_stream
  File "cython_src/prodigy/components/stream.pyx", line 107, in prodigy.components.stream.Stream.__init__
  File "cython_src/prodigy/components/stream.pyx", line 58, in prodigy.components.stream.validate_stream
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/recipes/ner.py", line 244, in make_tasks
    for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1545, in pipe
    for doc in docs:
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1589, in pipe
    for doc in docs:
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1651, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 233, in pipe
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1600, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1651, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 73, in pipe
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/util.py", line 1600, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1586, in <genexpr>
    docs = (self._ensure_doc(text) for text in texts)
  File "/usr/local/anaconda3/lib/python3.8/site-packages/spacy/language.py", line 1535, in <genexpr>
    docs_with_contexts = (
  File "/usr/local/anaconda3/lib/python3.8/site-packages/prodigy/recipes/ner.py", line 243, in <genexpr>
    texts = ((eg["text"], eg) for eg in stream)
  File "cython_src/prodigy/components/preprocess.pyx", line 167, in add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 263, in prodigy.components.preprocess._add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 225, in prodigy.components.preprocess.sync_spans_to_tokens
KeyError: 'id'

hi @JulieSarah!

I'm still thinking this could be with spaCy. Since you moved from Prodigy v1.10.7 to v1.11.8, the biggest change in v1.11 was compatibility with spaCy v3.x.

Models built with spaCy v2.x need to be retrained with spaCy v3.x.

Can you run and provide:

python -m spacy info

Also, compare that to the "spacy_version" in each pipeline's meta.json file.
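As a rough illustration of what to look for: spaCy v3 pipelines generally need to be loaded under the same major.minor spaCy release they were trained with. A toy check (the helper name is made up; real pipelines declare a full version range in meta.json, so this only sketches the idea):

```python
def same_minor(installed: str, model_trained_with: str) -> bool:
    """Rough compatibility check: compare major.minor of two version strings."""
    def major_minor(version: str):
        parts = version.split(".")
        return int(parts[0]), int(parts[1])
    return major_minor(installed) == major_minor(model_trained_with)

# e.g. a pipeline trained under spaCy 2.2.0 will not load in 3.4.3:
print(same_minor("3.4.3", "2.2.0"))  # False
print(same_minor("3.4.3", "3.4.1"))  # True
```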

I'm wondering if when you upgraded to Prodigy, it upgraded your spaCy version to something that was incompatible with your original model.

Prodigy v1.10.7 predates spaCy v3.0 support.

Just curious - can you run other recipes?
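One more idea: your traceback ends in sync_spans_to_tokens with KeyError: 'id', which hints that some records in your stream carry "tokens" entries without an "id" field. A quick stdlib check along these lines could flag malformed records before they reach Prodigy (a sketch assuming Prodigy's usual task format; the filename is just an example):

```python
import json

def find_bad_tokens(jsonl_path: str):
    """Yield (line_number, problem) for records whose tokens lack an 'id'."""
    with open(jsonl_path, encoding="utf8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                task = json.loads(line)
            except json.JSONDecodeError as err:
                yield i, f"invalid JSON: {err}"
                continue
            for tok in task.get("tokens", []):
                if "id" not in tok:
                    yield i, f"token missing 'id': {tok}"

# Usage, e.g.:
# for lineno, problem in find_bad_tokens("trainset.jsonl"):
#     print(lineno, problem)
```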

Hello @ryanwesslen

I retrained with the new version of spaCy and still get the error mentioned.
I can run ner.manual without issue...

spaCy is 3.4.3.

Thank you
Regards
Julie

Hello @ryanwesslen, do you have any clue? What would you recommend if I can't solve this bug? Do you have any ideas for working around it?

Thank you
Regards
Julie

hi @JulieSarah!

Let's take a step back.

The problem occurs when Prodigy uses a spaCy model in the loop, which makes me think this is more of a spaCy issue than a Prodigy issue.

What's weird is that you cannot even use the en_core_web_sm model with ner.correct.

You mentioned getting this same error: can you confirm whether that was with your custom model or with en_core_web_sm?

I was asking to see if you can run it for en_core_web_sm, not your custom model:

python -m prodigy ner.correct test_dataset en_core_web_sm trainset.jsonl --label Org --unsegmented

If you can't run a ner.correct recipe with en_core_web_sm, this suggests to me there is something incorrect with the way spaCy is installed. The first goal should be to find a way to get ner.correct to work with en_core_web_sm because that should not be happening.

I know your original question was more about your custom model, but until you can get en_core_web_sm working, I believe there is something wrong with your spaCy installation or versioning (e.g., a model version incompatible with your spaCy version).

Suggested steps: rebuild in a clean venv

Sorry if it's redundant, but can you confirm that you have tried installing Prodigy 1.11.8 in a new, clean virtual environment? Ideally, start with only Prodigy and then install en_core_web_sm via python -m spacy download en_core_web_sm.

After you've activated your new virtual environment, please run (this will again confirm the versions you wanted were installed):

python -m prodigy stats
python -m spacy info

And try to load your en_core_web_sm:

import spacy
# confirm same spaCy version
print(spacy.__version__)

# confirm that en_core_web_sm works for that spaCy version
nlp = spacy.load("en_core_web_sm")
texts = ["Provide a sentence you'd expect to contain one of your entities."]
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])

Then either try:

python -m prodigy ner.correct test_dataset en_core_web_sm trainset.jsonl --label Org --unsegmented

Or, just to be safe, you may even want to run this example from the docs on a generic dataset:

python -m prodigy ner.correct ner_finance_news en_core_web_sm ./raw_shares-newsapi.jsonl --label PERSON,ORG,MONEY,TICKER --unsegmented

If you can't run this generic test recipe with en_core_web_sm, then something else is wrong.

However, if you can, this gives me hope that at least the clean environment fixed the critical issue.

Then at this point, I'd try your custom model that you previously used. If you're having issues there, it may be a spaCy versioning for that specific model.

Hope this helps and let us know how it turns out!

Dear @ryanwesslen

I cleaned up the environment and it still didn't work.
I then found the error: a file in the database is corrupted, and now I have to find out why.

I may reach out to you again.
Have a great day
Julie


Dear @ryanwesslen

I suppose the error I had was due to bad processing of a JSONL file.

Still, I don't understand some things:

I trained a model with a new tokenizer I created.
The annotations the model was trained on had been annotated with the default tokenizer.
When I run ner.correct with the model (trained with the custom tokenizer), I am not able to select the entities that should be highlightable with the custom tokenizer.

What happened? My goal is to correct past annotations with the new tokenizer. Is that feasible?
Thank you

Best regards, and wishing you a nice end-of-year vacation :blush:

Julie

hi @JulieSarah!

Mismatched tokenization can be a big problem, and many users don't realize how important it is until it happens to them. Typically it occurs when users load pre-annotated spans/entities into manual recipes: spans/entities annotated in another tool, formatted for Prodigy, and then used in a manual recipe (e.g., ner.manual) with a different tokenizer (e.g., blank:en or en_core_web_sm).

This is a good post that highlights the issue and provides some context on how to identify it.
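To make the failure mode concrete, here's a toy illustration (not Prodigy code): a character-offset span only survives if its start and end land exactly on token boundaries of whichever tokenizer is in use.

```python
def aligns(span, tokens):
    """True if the span's character offsets land on token boundaries."""
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    return span["start"] in starts and span["end"] in ends

text = "e-mail me"
# Tokenizer A keeps "e-mail" as one token:
tokens_a = [{"text": "e-mail", "start": 0, "end": 6},
            {"text": "me", "start": 7, "end": 9}]
# Tokenizer B splits on the hyphen:
tokens_b = [{"text": "e", "start": 0, "end": 1},
            {"text": "-", "start": 1, "end": 2},
            {"text": "mail", "start": 2, "end": 6},
            {"text": "me", "start": 7, "end": 9}]

span = {"start": 2, "end": 6}  # annotation over "mail"
print(aligns(span, tokens_a))  # False: offset 2 is mid-token under tokenizer A
print(aligns(span, tokens_b))  # True
```

An annotation made under one tokenizer can therefore be unselectable (or silently dropped) under another.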

When you say you had a new tokenizer you created -- was this from scratch like this:

Did you train an ner component in a spaCy pipeline that included a custom tokenizer different from the tokenizer used to create the ner annotations?

Yeah - that could cause problems. Is there a compelling reason why you didn't make your annotations with the same custom tokenizer that you intend to include in the pipeline?

I suspect you made the annotations first, then after reviewing them found some problems with the default spaCy tokenizer, so you decided to build a custom tokenizer but didn't want to redo all of the annotations.

I would recommend using the code from the earlier post to determine which annotations have a mismatch.

import spacy
import srsly  # installed with spaCy; reads/writes JSONL

nlp = spacy.load("my_custom_model")  # model with your custom tokenizer
examples = srsly.read_jsonl("trainset.jsonl")  # your existing annotations

for example in examples:
    doc = nlp(example["text"])
    for span in example.get("spans", []):
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to token boundaries
            print("Misaligned tokens", example["text"], span)

Typically, only a small number of examples will have a mismatch. You could re-annotate just those mismatched annotations, this time with your custom tokenizer.
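If you go that route, a small helper can split the dataset into aligned vs. mismatched files, so only the mismatched examples go back into annotation (a sketch; the filenames are made up, and you'd plug the doc.char_span check in as the predicate):

```python
import json

def partition_jsonl(in_path, ok_path, bad_path, is_aligned):
    """Write aligned tasks to ok_path and mismatched ones to bad_path.

    is_aligned: callable taking one task dict and returning True/False
    (e.g. a wrapper around nlp(...) / doc.char_span).
    """
    kept = dropped = 0
    with open(in_path, encoding="utf8") as fin, \
         open(ok_path, "w", encoding="utf8") as fok, \
         open(bad_path, "w", encoding="utf8") as fbad:
        for line in fin:
            if not line.strip():
                continue
            task = json.loads(line)
            if is_aligned(task):
                fok.write(json.dumps(task) + "\n")
                kept += 1
            else:
                fbad.write(json.dumps(task) + "\n")
                dropped += 1
    return kept, dropped
```

You could then feed the "bad" file back into ner.manual with your custom-tokenizer model.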

Alternatively, you could try this package to "align" your annotations' tokenization to your new custom tokenizer:

I haven't used this package so I can't provide a lot more suggestions (but hopefully the package is self-explanatory).

Hope this helps!