I just want to be sure I’m understanding this error correctly. I’m working on further training the ‘ORG’ label from a spaCy model. The model had been trained once in spaCy and saved as “ner_train_v3”, and I used ner_train_v3 with Prodigy’s ner.teach recipe.
That all goes okay, but when I try Prodigy’s ner.batch-train recipe, I invariably get an error like the one below:
Navigating to that line in the relevant dataset, I can see the sentence has some issues, because the stream data I’m feeding into the dataset hasn’t been meticulously cleaned:
{"text":"Romania posted the European Union**\u201a\u00c4\u00f4s** highest economic growth rate in the third quarter at 8.8 percent year-on-year, but it also had the largest rate of household deprivation, Eurostat data showed, with one in two Romanians struggling to keep their home warm or pay their bills on time.","spans":[{"start":176,"end":184,"text":"Eurostat","rank":0,"label":"ORG","score":0.6449537913,"source":"core_web_lg","input_hash":762417285}],"meta":{"score":0.6449537913},"_input_hash":762417285,"_task_hash":-1449057813,"answer":"accept"}
Does the inclusion of quotes ( \u201a, etc) throw it off, or is it the strange break at the end of the sentence?
Thanks for the report and finding the example it likely fails on! Errors like this are often difficult to debug, so having a concrete example is very valuable. “Weird” formatting and unicode characters should never cause a segfault – this is definitely a bug, either in Prodigy’s NER model or somewhere in spaCy.
(In the meantime, you could always try removing that example from your set and see if you can run ner.batch-train without any problems?)
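If it helps, here’s a minimal sketch of how you could filter that example out of a JSONL export of the dataset, keyed on its _input_hash. The file names are hypothetical; the hash is the one from the example above.

```python
import json

INFILE = "annotations.jsonl"         # hypothetical export of your dataset
OUTFILE = "annotations_clean.jsonl"
BAD_HASHES = {762417285}             # _input_hash of the suspect example

with open(INFILE, encoding="utf-8") as fin, \
     open(OUTFILE, "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        task = json.loads(line)
        if task.get("_input_hash") in BAD_HASHES:
            continue  # drop the suspect example
        fout.write(json.dumps(task) + "\n")
```

You could then import the cleaned file into a fresh dataset with db-in and point ner.batch-train at that.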
Hi @mmars9, @ines, I’m experiencing a similar problem training the NER model on anything but a very small set of examples. Training on anything over 1,000 examples throws the following error. Is this a memory error? Has either of you come up with a temporary workaround?
Example Error messages when running prodigy:
line 1: 41665 Segmentation fault: 11 python -m prodigy "$@"
Info about spaCy
Python version: 3.6.3
spaCy version: 2.0.5
Models: en, en_core_web_sm
Platform: MacOS
I note that I got the same error when trying to train using both (a) the Prodigy ner.batch-train recipe and (b) the regular spaCy train_ner.py script.
Unfortunately, I now also have to fight with this error. If I only have a few annotations (from ner.teach), ner.batch-train works fine. But once I’ve processed about 1,000 texts, the error appears – even though I don’t see any problem with memory or CPU usage.
Segmentation fault: 11
python3.6 -m prodigy ner.batch-train db /Users/frederik/mdl/modell -l label1,label2 -e db -o modell2
I think I found my problem. I knew about spaCy’s limit on very long texts, and in the database I found a few very long strings. I dropped them and training now works marvelously! (Except that 1107 sentences took 14 GB of RAM and 5 GB of swap on Ubuntu 18.04.)
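In case anyone wants to do the same, here’s a rough sketch of how I’d drop them from a JSONL export. The file names and the 10,000-character cut-off are just assumptions – pick a limit that fits your data.

```python
import json

INFILE = "annotations.jsonl"    # hypothetical export of the dataset
OUTFILE = "annotations_short.jsonl"
MAX_CHARS = 10_000              # assumed cut-off – tune for your data

kept = dropped = 0
with open(INFILE, encoding="utf-8") as fin, \
     open(OUTFILE, "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        task = json.loads(line)
        if len(task.get("text", "")) > MAX_CHARS:
            dropped += 1  # skip very long texts that can blow up training
            continue
        fout.write(json.dumps(task) + "\n")
        kept += 1

print(f"kept {kept}, dropped {dropped} overly long texts")
```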
Same error, prodigy: line 1: 9693 Segmentation fault: 11 python -m prodigy "$@"
This was during an annotation task, which was launched via prodigy ner.teach menu_brand_tagging en_core_web_sm menu_data_1018.jsonl --label BRAND --patterns brand_patterns.jsonl
Unfortunately it was accompanied by an error message in the annotation front-end as well!
I don’t think I lost too many, but a good reminder to hit save often.
What type of texts are in your menu_data_1018.jsonl? Are they long or short? Any particularly long texts, or texts with lots of whitespace?
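A quick way to check – a rough sketch, assuming the file is the menu_data_1018.jsonl from your command – is to scan the stream for the longest texts and for texts that are mostly whitespace:

```python
import json

with open("menu_data_1018.jsonl", encoding="utf-8") as f:
    texts = [json.loads(line)["text"] for line in f if line.strip()]

# The longest texts first – very long inputs were the culprit earlier in this thread.
for text in sorted(texts, key=len, reverse=True)[:5]:
    print(len(text), repr(text[:80]))

# Texts that are more than half whitespace.
for text in texts:
    ws = sum(ch.isspace() for ch in text)
    if text and ws / len(text) > 0.5:
        print(f"{ws}/{len(text)} whitespace:", repr(text[:80]))
```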
The errors in the app are a direct response to the server dying, btw. As soon as the Prodigy app fails to connect to the server, it will show you the error, so you know that something is up. (Otherwise, you’d have to keep checking the terminal, which is pretty inconvenient.) Prodigy auto-saves the annotations in batches and also uses them to update the model in the loop (if you’re using an active learning recipe like ner.teach).
If you’re using the default batch size of 10, the maximum number of annotations you could theoretically ever lose at a time is 19 (10 items in the history and 9 waiting to be sent out as soon as they become 10). If Prodigy is unable to save, the examples are still all in your browser, btw – so you can always restart the server in the terminal, then hit save in the web app, and the “stranded” examples should be saved.