Problem with adding a new label

Hi, I'm trying to add a new NER label to the en_core_web_lg model, as part of ner.teach. However, it seems that the model does not contain any tokens that match my hand-made patterns. When running the following command:
python -m prodigy ner.teach AVHerald_headlines_aircraft_types en_core_web_lg d:\incidents_headlines.jsonl --label AIRCRAFT_TYPE --patterns d:\aircraft_types.jsonl
I only get:

ValueError: Invalid pattern: {'LABEL': 'AIRCRAFT_TYPE', 'pattern': [{'orth': 'A124'}]}

Based on the additional error message below, I suppose I need to train a new model from scratch. Or is there a way to add previously absent entities to an existing model? Any confirmation or advice?

ERROR: Can't find label 'AIRCRAFT_TYPE' in model en_core_web_lg
ner.teach will only show entities with one of the specified labels. If a
label is not available in the model, Prodigy won't be able to propose
entities for annotation. To add a new label, you can specify a patterns file
containing examples of the new entity as the --patterns argument or
pre-train your model with examples of the new entity and load it back in.

Try replacing 'LABEL' with 'label'! The pattern validator currently checks if a key 'label' is present in the pattern and will complain if it's not.
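If the patterns file has many entries, a small script can do the renaming. This is only a sketch: it assumes each line of the file is one JSON object like the one in the error message, and the helper names `lowercase_keys` and `fix_patterns` are made up for illustration, not part of Prodigy.

```python
import json


def lowercase_keys(entry):
    """Return a copy of a pattern entry with its top-level keys lowercased."""
    return {key.lower(): value for key, value in entry.items()}


def fix_patterns(lines):
    """Rewrite JSONL lines so that 'LABEL' becomes 'label' (other keys likewise)."""
    fixed = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fixed.append(json.dumps(lowercase_keys(json.loads(line))))
    return fixed


# The entry from the error message, as it would appear in the JSONL file
broken = ['{"LABEL": "AIRCRAFT_TYPE", "pattern": [{"orth": "A124"}]}']
fixed = fix_patterns(broken)
print(fixed[0])
# {"label": "AIRCRAFT_TYPE", "pattern": [{"orth": "A124"}]}
```

To apply it to a real file, you would read the lines from your patterns file and write the result to a new file, keeping the original as a backup.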

That error is raised if you run ner.teach without any patterns and with a completely new label that the model has never seen before. In that case, the recipe won't be able to propose any entities for annotation, since it doesn't output any predictions for the unseen label. This problem is solved by providing patterns, which suggest entity candidates you can accept and reject until the model has learned enough about the new label that it can also propose spans.
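For a label like this one, where the candidates are short codes, the patterns file can be generated from a plain list. A minimal sketch, assuming the same entry shape as in the error message above (with a lowercase 'label' key); the code list and the `make_pattern` helper are illustrative only.

```python
import json

LABEL = "AIRCRAFT_TYPE"

# A few type codes for illustration; extend this with your own list
codes = ["A124", "A320", "A359", "B712", "B763"]


def make_pattern(code, label=LABEL):
    """Build one patterns entry that matches a single token exactly."""
    return {"label": label, "pattern": [{"orth": code}]}


# One JSON object per line, which is the JSONL layout the patterns file uses
patterns_jsonl = "\n".join(json.dumps(make_pattern(code)) for code in codes)
print(patterns_jsonl)
```

Each entry only matches one exact token, so codes that aren't in the list won't be suggested until the model starts proposing spans for the label itself.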


@ines Thanks very much, that was quick :smiley:
You should probably adjust the PATTERNS.JSONL example at https://prodi.gy/docs/cookbook

Now I get the following error. Please help.

(nlp) C:\WINDOWS\system32>python -m prodigy ner.teach AVHerald_headlines_aircraft_types en_core_web_lg d:\incidents_headlines.jsonl --label AIRCRAFT_TYPE --patterns d:\aircraft_types.jsonl
Using 1 labels: AIRCRAFT_TYPE
C:\Programs\Anaconda3\envs\nlp\lib\site-packages\toolz\itertoolz.py:368: RuntimeWarning: Mean of empty slice.
  return next(iter(seq))
C:\Programs\Anaconda3\envs\nlp\lib\site-packages\numpy\core\_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
C:\Programs\Anaconda3\envs\nlp\lib\site-packages\toolz\itertoolz.py:368: RuntimeWarning: Degrees of freedom <= 0 for slice
  return next(iter(seq))
C:\Programs\Anaconda3\envs\nlp\lib\site-packages\numpy\core\_methods.py:110: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
C:\Programs\Anaconda3\envs\nlp\lib\site-packages\numpy\core\_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "cython_src\prodigy\core.pyx", line 55, in prodigy.core.Controller.__init__
  File "C:\Programs\Anaconda3\envs\nlp\lib\site-packages\toolz\itertoolz.py", line 368, in first
    return next(iter(seq))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Programs\Anaconda3\envs\nlp\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Programs\Anaconda3\envs\nlp\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Programs\Anaconda3\envs\nlp\lib\site-packages\prodigy\__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src\prodigy\core.pyx", line 60, in prodigy.core.Controller.__init__
ValueError: Error while validating stream: no first batch. This likely means that your stream is empty.

Here's a sample of the documents that I want to use ner.teach on:

{"headline":"Canada A320 at San Francisco on Jul 7th 2017, lined up with taxiway for landing"}
{"headline":"Cathay Pacific A359 at Copenhagen on Sep 24th 2018, lightning strike"}
{"headline":"Baltic BCS3 at Riga on Sep 25th 2018, bird strike"}
{"headline":"Volotea B712 near Athens on Sep 25th 2018, engine trouble"}
{"headline":"Delta B763 near Newark on Sep 24th 2018, hydraulic failure"}
{"headline":"Southwest B737 at San Antonio on Sep 24th 2018, hydraulic failure"}
{"headline":"Jet2 B733 near Rennes on Sep 25th 2018, pilot felt unwell"}
{"headline":"British Airways B789 near Iqaluit on Sep 12th 2018, fumes in cockpit"}
{"headline":"IRS F100 near GANLA on May 10th 2014, loss of radio contact and emergency landing"}

Ah, damn, I had no idea this was in our docs. Sorry about that! (Prodigy used to be less strict, but we gradually added more validation to be able to output better error messages, warnings and diagnostics.)

This error usually indicates that the stream has an invalid format or that Prodigy can't find any loadable examples in the input data. In your case, it's most likely because your examples use a "headline" property, whereas Prodigy looks for a "text" key by default. (There has to be some convention: Prodigy accepts arbitrary JSON data, and it can't guess which property contains the text you want to render and annotate.)

So changing "headline" to "text" in your data should make the stream loadable as expected.
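If you'd rather not edit the file by hand, the rename can be scripted. A rough sketch, assuming every line is a JSON object like the samples above; `headline_to_text` is a hypothetical helper name, not a Prodigy function.

```python
import json


def headline_to_text(record):
    """Return a copy of the record with 'headline' renamed to 'text'."""
    record = dict(record)  # don't modify the caller's dict
    if "text" not in record and "headline" in record:
        record["text"] = record.pop("headline")
    return record


# Two lines taken from the sample data above
sample = [
    '{"headline":"Baltic BCS3 at Riga on Sep 25th 2018, bird strike"}',
    '{"headline":"Volotea B712 near Athens on Sep 25th 2018, engine trouble"}',
]

converted = [json.dumps(headline_to_text(json.loads(line))) for line in sample]
for line in converted:
    print(line)
```

Records that already have a "text" key are left untouched, so it's safe to run the conversion over a file that mixes both shapes.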
