different dataset for ner.batch-train

aslitoj · August 28, 2019, 2:49pm

Hello,
I trained my model twice with the annotations I collected using ner.make-gold, and got a 20% improvement in overall accuracy across labels.
As a next step, I looked at the precision/recall results for each label using the Scorer from spaCy and figured out which labels need to be improved.
For that, I already had some ground-truth label examples stored locally, as follows:

{"label": "SPORTS", "pattern": [{"lower": "abseiling"}]}
{"label": "SPORTS", "pattern": [{"lower": "adventure racing"}]}

I imported these labels into the database and tried to run the ner.batch-train by using the model that I trained earlier, but got an error on merging spans:

Traceback (most recent call last):
  File "/Users/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 552, in batch_train
    examples = merge_spans(DB.get_dataset(dataset))
  File "cython_src/prodigy/models/ner.pyx", line 40, in prodigy.models.ner.merge_spans
KeyError: 'text'

Could you please help me understand how I can use this type of dataset in ner.batch-train with an existing pre-trained model?
Thanks,

ines · August 28, 2019, 3:03pm

Your workflow sounds good, but the problem is that you imported match patterns to your dataset instead of labelled examples. To train a model, you need labelled examples of the entity types in context – for example, with a "text" and a list of "spans" describing the entities in the text. You can run db-out to export your previous dataset to see how those examples look. Your PRODIGY_README.html also has examples of the expected format.
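To make the difference concrete, here is a minimal sketch of a pattern record next to a labelled example. The sentence is made up for illustration, `missing_text` is a hypothetical helper (not part of Prodigy), and the "answer" field is my assumption about what your exported annotations will contain. Compare it against your own db-out export rather than treating it as the definitive format.

```python
import json

# A pattern record, like the ones imported into the dataset in the question.
pattern_record = {"label": "SPORTS", "pattern": [{"lower": "abseiling"}]}

# A labelled example: the text plus character offsets for each entity.
# The sentence is invented for illustration.
text = "We went abseiling in Wales last summer."
start = text.index("abseiling")
labelled_example = {
    "text": text,
    "spans": [{"start": start, "end": start + len("abseiling"), "label": "SPORTS"}],
    "answer": "accept",  # assumption: accepted annotations carry this field
}


def missing_text(records):
    """Return the records that have no "text" key (hypothetical helper)."""
    return [record for record in records if "text" not in record]


# Only the pattern record is flagged, which is the kind of record
# that triggers KeyError: 'text' during training.
bad = missing_text([pattern_record, labelled_example])
print(json.dumps(labelled_example))
```

Running a check like this over an exported dataset should show quickly whether any pattern records ended up mixed in with the training examples.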

If you want to use your patterns to find more examples, you need to actually match them in your text, make sure the matches are correct, and then save those examples to your training set.
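As a rough sketch of the "match them in your text" step, the snippet below finds candidate spans with plain regular expressions. This is a stand-in for spaCy's token-based Matcher, not the same thing: it works on characters rather than tokens. `find_candidates` and the sample sentence are hypothetical. The candidates it produces still need to be reviewed by hand before they go into a training set.

```python
import re


def find_candidates(text, terms, label):
    """Return a task dict with one span per case-insensitive whole-word match."""
    spans = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append({"start": match.start(), "end": match.end(), "label": label})
    # Keep spans in reading order so they are easier to review.
    spans.sort(key=lambda span: span["start"])
    return {"text": text, "spans": spans}


task = find_candidates(
    "Adventure racing and abseiling are popular here.",
    ["abseiling", "adventure racing"],
    "SPORTS",
)
for span in task["spans"]:
    print(span, task["text"][span["start"]:span["end"]])
```

Each resulting dict has a "text" and a list of "spans", which is the shape a training example needs once a human has confirmed the matches.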

Also, one small detail:

{"label": "SPORTS", "pattern": [{"lower": "adventure racing"}]}

This pattern will likely never match. Each dict in the pattern describes one token, so this pattern only matches a single token whose lowercase text equals "adventure racing". That will not happen in practice, because the tokenizer will split that string into two tokens. See the Matcher docs for more details: "Rule-based matching" in the spaCy usage documentation.
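A minimal sketch of rewriting a multi-word phrase as one dict per token is shown below. `phrase_to_token_pattern` is a hypothetical helper, and it assumes that splitting on whitespace gives the same pieces as the tokenizer. That holds for simple phrases like this one, but hyphens and punctuation may be tokenized differently.

```python
def phrase_to_token_pattern(label, phrase):
    """Build a pattern entry with one {"lower": ...} dict per whitespace-separated word."""
    # Assumption: whitespace splitting approximates the tokenizer for simple phrases.
    tokens = [{"lower": word} for word in phrase.lower().split()]
    return {"label": label, "pattern": tokens}


fixed = phrase_to_token_pattern("SPORTS", "adventure racing")
print(fixed)
```

For the example above, this produces a pattern with two token dicts, one for "adventure" and one for "racing".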
