Adding labels in ner.batch-train

jerbob92 · February 20, 2018, 1:59pm

I'm trying to create a new model with ner.manual and then train it further with ner.teach.
I was able to annotate my new labels, for which I used the following command:
prodigy ner.manual new_set en_core_web_sm train.jsonl --label labels.txt

Now I want to improve that dataset with other data by using ner.teach. How do I do this?
I tried to create a new model from the dataset to use in ner.teach:
prodigy ner.batch-train new_set en_core_web_sm --output /tmp/model --eval-split 0.5 --label labels.txt

However, this resulted in the following error:

Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.5/dist-packages/prodigy/__main__.py", line 253, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/usr/local/lib/python3.5/dist-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/usr/local/lib/python3.5/dist-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/usr/local/lib/python3.5/dist-packages/prodigy/recipes/ner.py", line 400, in batch_train
drop=dropout, beam_width=beam_width)
File "cython_src/prodigy/models/ner.pyx", line 309, in prodigy.models.ner.EntityRecognizer.batch_train
File "cython_src/prodigy/models/ner.pyx", line 370, in prodigy.models.ner.EntityRecognizer._update
File "cython_src/prodigy/models/ner.pyx", line 364, in prodigy.models.ner.EntityRecognizer._update
File "cython_src/prodigy/models/ner.pyx", line 365, in prodigy.models.ner.EntityRecognizer._update
File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 415, in update
proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
File "nn_parser.pyx", line 558, in spacy.syntax.nn_parser.Parser.update
File "nn_parser.pyx", line 676, in spacy.syntax.nn_parser.Parser._init_gold_batch
File "ner.pyx", line 119, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
File "ner.pyx", line 178, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: 'B-IDENTIFIER'

jerbob92 · February 20, 2018, 2:07pm

Never mind. I found the ner.gold-to-spacy recipe.

ines · February 20, 2018, 2:14pm

Thanks for updating – and yes, this works as well! Your workflow definitely makes sense, and after pre-training the model, you can simply load it into ner.teach using the path to the model directory:

prodigy ner.teach your_dataset /path/to/pretrained-model your_data.jsonl --label SOME_LABEL

About your initial report: I think the problem here is that the input format of the --label argument currently isn’t 100% consistent. For ner.manual, we introduced the option to load labels from a file (since you often want to load in a larger label set) – but all other recipes currently expect the labels to be a string. There’s also a slight inconsistency around adding unknown labels to the model, which we’ve already fixed for the upcoming release.

In the meantime, adding the following to ner.batch-train should work:

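from prodigy.util import get_labels  # helper that also supports loading labels from a file

ner = nlp.get_pipe('ner')   # get the model's entity recognizer
labels = get_labels(label)  # label is the value passed to the recipe's --label argument
for l in labels:
    ner.add_label(l)        # add label to the model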

Alternatively, you could also iterate over the examples and their spans, and add each span['label'] to the model (add_label will ignore labels that are already present in the model, so you don’t have to worry about filtering out the new ones).
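
The alternative described above could be sketched roughly like this. It is only an illustration, not Prodigy's own code: it assumes each example is a dict with an optional "spans" list whose entries carry a "label" key, and collect_span_labels and add_labels_to_ner are hypothetical helper names. The ner argument is expected to be the entity recognizer returned by nlp.get_pipe('ner').

```python
def collect_span_labels(examples):
    """Return the distinct span labels found in the examples, in first-seen order."""
    labels = []
    for eg in examples:
        # Examples without annotations may have no "spans" key (or a None value)
        for span in eg.get("spans") or []:
            label = span.get("label")
            if label and label not in labels:
                labels.append(label)
    return labels


def add_labels_to_ner(ner, examples):
    """Call ner.add_label for every label seen in the annotated spans."""
    labels = collect_span_labels(examples)
    for label in labels:
        ner.add_label(label)  # labels already in the model are ignored
    return labels


# Hypothetical sample data in the assumed shape
examples = [
    {"text": "Order AB-123 shipped",
     "spans": [{"start": 6, "end": 12, "label": "IDENTIFIER"}]},
    {"text": "No entities here"},
]
print(collect_span_labels(examples))
```

In the recipe, add_labels_to_ner(nlp.get_pipe('ner'), examples) would then be called before training starts.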

I’ll also experiment with better ways of handling the --label argument. Plac (which Prodigy uses for the recipes CLI) supports converter functions – so we could handle all loading in a function that checks whether the value is a path or a string of comma-separated labels, and returns them as a list (similar to util.get_labels).
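
A converter along those lines might look roughly like the sketch below. This is not the implementation Prodigy ships: parse_labels is a hypothetical name, and the sketch assumes that a label file contains one label per line.

```python
import os


def parse_labels(value):
    """Return a list of labels from a file path or a comma-separated string."""
    if not value:
        return []
    if os.path.isfile(value):
        # Assumption: the file lists one label per line; blank lines are skipped
        with open(value, encoding="utf8") as f:
            return [line.strip() for line in f if line.strip()]
    # Otherwise treat the value as comma-separated labels
    return [label.strip() for label in value.split(",") if label.strip()]


print(parse_labels("PERSON, ORG,IDENTIFIER"))
```

Because the function takes a single string and returns a list, it has the shape plac expects of a converter, so every recipe could accept either form of --label.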

jerbob92 · February 20, 2018, 2:57pm

Ah, that makes sense. For now, ner.gold-to-spacy worked fine.
It needed some tweaking because you can’t directly load the JSONL into the example for creating a new model, but it worked out and I’m able to use ner.teach on it now.
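
For anyone hitting the same step, the kind of tweak described above might look something like this. The exact output format is not shown in this thread, so the sketch assumes each line of the exported file is a JSON array of the form [text, {"entities": [[start, end, label], ...]}]; load_training_data is a hypothetical helper name.

```python
import json


def load_training_data(path):
    """Read an exported JSONL file into a list of (text, annotations) tuples."""
    train_data = []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumed line shape: [text, {"entities": [[start, end, label], ...]}]
            text, annotations = json.loads(line)
            entities = [tuple(ent) for ent in annotations.get("entities", [])]
            train_data.append((text, {"entities": entities}))
    return train_data
```

The returned list has the same shape as the TRAIN_DATA list used in spaCy's example script for training a new entity type.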
