KeyError: 'text' when using ner.batch-train

Hi,
I've used spaCy before, but I'm fairly new to Prodigy. So far I've managed to find all the answers here, and I've seen this topic:

which seems to discuss a similar problem. When I run ner.batch-train I get the following error:
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 385, in batch_train
    examples = merge_spans(DB.get_dataset(dataset))
  File "cython_src/prodigy/models/ner.pyx", line 37, in prodigy.models.ner.merge_spans
KeyError: 'text'

I checked the training.jsonl file and found two occurrences of empty spans. I guess that's what's causing the trouble? How do I fix that? I have run a couple of rounds of ner.teach and ner.batch-train with no problem until now. I didn't add the spans myself. If it helps, the spans are empty for tokens "\r" and "\n".

Hi! Empty spans should be fine, because this is a totally valid annotation – a text can just not have any entity spans. In your case, it looks like the recipe comes across an example with no "text", which is very strange :thinking: If you export your dataset using db-out, is there any example that doesn't have a top-level "text" property?
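
For example, you could run db-out to export the dataset to a file and then scan it with a few lines of Python. Just a quick sketch, and the dataset / file names are placeholders:

import json

# Look for exported examples that are missing the top-level "text" key.
# "my_dataset.jsonl" stands in for whatever file you exported with db-out,
# e.g. via: prodigy db-out my_dataset > my_dataset.jsonl
with open("my_dataset.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        if "text" not in example:
            print(f"Line {i} has no 'text' key:", list(example.keys()))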

Yes, I found around 200 examples with no "text". I suspect they came from ner.eval-ab, which I ran once to compare datasets, because they all contain "mapping": {"B": "accept", "A": "reject"} and so on. What did I do wrong this time? :smile:

Ah, glad you found them! I think the problem here is that you've added different "types" of annotations to the same dataset. Ideally, if you want to run a different recipe like ner.eval-ab, you also want to save those annotations to a different dataset, so they're separate from the annotations you'll be using for training later on.

(If possible, I’d recommend to always use separate datasets for different things you’re doing. Even if you’re collecting annotations of the same type, merging datasets is usually easier than splitting them.)
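
If it helps for splitting things apart after the fact, something along these lines should work with the Python database API (a rough sketch: the dataset names are placeholders, and the filter just keeps examples that have a "text" and drops the eval-ab records you described):

from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
examples = db.get_dataset("my_mixed_dataset")  # placeholder source dataset

# Keep the NER annotations, i.e. examples with a top-level "text", and drop
# the eval-ab records (which have the "mapping" key instead).
ner_examples = [eg for eg in examples if "text" in eg and "mapping" not in eg]

db.add_dataset("my_ner_dataset")  # placeholder target dataset
db.add_examples(ner_examples, datasets=["my_ner_dataset"])
print(f"Copied {len(ner_examples)} of {len(examples)} examples")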

Thanks for the response, I managed to copy the correct parts of the dataset to a new one and it worked :slight_smile: I tried to train the model some more, and even though it had about 90% accuracy, it failed on one of the entities quite often. So I tried ner.manual for that particular label, and now the overall accuracy has dropped to 30% and then got even lower; the model seems to have forgotten the other labels. Why is that, and how do I fix it? I use --patterns to help Prodigy suggest correct entities, and it worked very well until now. Now I get very random suggestions that are mostly wrong (similar to those I got at the very beginning, before I used patterns). I understand that this entity is problematic because it consists of multiple tokens, but why are the other entities not working anymore?

There could be a couple of explanations for that. One thing to check is that the evaluation data hasn’t changed. If you’re not using a dedicated evaluation set, but are instead using the random split, then your evaluation data will be changing as you conduct more annotation. When you focus on the most difficult entity type, you’re making the evaluation harder. If you’re still using a random evaluation split, you’ll want to switch over to getting a dedicated test set soon, so you can more reliably reason about what’s going on.
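
One way to set that up is to export your annotations, make a fixed split once, and keep reusing the same evaluation examples. A minimal sketch, assuming the file names below (they're placeholders):

import json
import random

# Make a one-off train/eval split so the evaluation data stays the same
# across training runs. "my_ner_dataset.jsonl" is the file exported with db-out.
random.seed(0)
with open("my_ner_dataset.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
n_eval = int(len(examples) * 0.2)  # hold out roughly 20% for evaluation
for path, subset in [("eval.jsonl", examples[:n_eval]),
                     ("train.jsonl", examples[n_eval:])]:
    with open(path, "w", encoding="utf8") as out:
        out.write("\n".join(json.dumps(eg) for eg in subset) + "\n")

You can then load the two files into separate datasets with db-in and point ner.batch-train at the held-out set. If I remember correctly that's the --eval-id argument, but please check the recipe's help output, since I'm going from memory here.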

The drop to 30% is pretty significant, though. So, maybe it's not just the evaluation. Depending on the sequence of operations you've done, you might have ended up training a new entity type on top of a model with a number of previous entity types, which sometimes produces bad results. When you run ner.batch-train, you usually want to make sure you're passing in a new model that doesn't have pre-trained named entities.
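
For example, instead of training on top of a model that already has entity types in it, you could save out a blank pipeline and pass its path as the model argument. A quick sketch (the output path is just an example, and depending on your versions the recipe may need to add the NER component itself):

import spacy

# Save a blank English pipeline so training doesn't start from a model that
# already has pre-trained entity types. "./blank_en_model" is an example path.
nlp = spacy.blank("en")
nlp.to_disk("./blank_en_model")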

Another way things can go wrong is that the algorithm that learns from incomplete annotations (such as those produced by ner.teach) doesn't work as well when there are lots of entity types. As you get more annotations, it's useful to build up a sample of complete and correct annotations, perhaps using the silver-to-gold recipe here: https://github.com/explosion/prodigy-recipes/blob/master/ner/ner_silver_to_gold.py If you have a dataset where you know that all of the examples have all the entities you want to learn fully annotated, you can then use the --no-missing flag during ner.batch-train. This makes training much more accurate.
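
Once you have a dataset you believe is fully annotated, a quick sanity check before switching on --no-missing could look something like this (a sketch using the Python database API; the dataset name is a placeholder, and it only inspects answers and spans, so it can't prove the annotations are exhaustive):

from collections import Counter
from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
examples = db.get_dataset("my_gold_dataset")  # placeholder dataset name

# Count the answers, and list accepted examples with no spans at all. Those
# are still valid (a text can contain no entities), but they're worth
# eyeballing before you train with --no-missing.
answers = Counter(eg.get("answer") for eg in examples)
no_spans = sum(1 for eg in examples
               if eg.get("answer") == "accept" and not eg.get("spans"))
print("Answer counts:", dict(answers))
print("Accepted examples without spans:", no_spans)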

Apologies that my response is a bit vague — I’m not 100% sure what the problem is, and I don’t want to mislead you!

Thank you for your suggestions, I’ll experiment with the model/data and see what happens :slight_smile: