ner.train-curve error on whitespace

david.keeling · December 20, 2019, 5:17pm

Hi, we've been running into an error trying to use train-curve. I found the thread "ner.batch-train after ner.maual results error (Value error : [E024])" and some other answers (regarding both Prodigy and spaCy) that say this is caused by tokens that begin or end with whitespace, and that the solution is to remove the bad spans, since they would be "reject" annotations anyway. However, we used exclusively manual labeling for the dataset in question, so the dataset is all accepts and this solution doesn't seem right. I was wondering if you could help me understand: does Prodigy create tokens for ner.manual that begin or end with whitespace? If so, wouldn't that mean those tokens are unusable for training without additional processing?

Follow-up: if I understand correctly, this only affects NER spans, but the Prodigy JSONL format includes references to the original tokens in addition to the character indices in the text. When I correct the whitespace issue, do I also have to change the start/end of the original token, or is it enough to adjust the start/end character indices of the NER span?

honnibal · December 25, 2019, 11:45am

Prodigy shouldn't be creating entities with whitespace, I wouldn't think. So maybe the tokenization is mismatched?

The easiest thing would be to find and review the spans that it says are mismatched. Have you been able to print them out and review them? If not, I can suggest some code that should help find them. As a first step, you could also take a quick look at the dataset with ner.print-dataset, which I normally pipe into the less command.
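
For reference, a minimal sketch of the kind of check described above, in plain Python. It assumes the dataset has been exported to JSONL and that each record has a "text" string, a "spans" list with character offsets "start" and "end", and optionally a "tokens" list with the same offsets. The function names (load_jsonl, find_suspect_spans, trim_span) are hypothetical helpers, not part of Prodigy or spaCy. The trim helper only adjusts character offsets; it leaves any "token_start"/"token_end" fields untouched, so whether those also need updating still has to be checked separately.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumed export format)."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_suspect_spans(examples):
    """Yield (example_index, span, reason) for spans worth reviewing.

    reason is "whitespace" if the span text starts or ends with whitespace,
    or "misaligned" if the span offsets do not fall on token boundaries.
    """
    for i, eg in enumerate(examples):
        text = eg.get("text", "")
        tokens = eg.get("tokens", [])
        token_starts = {t["start"] for t in tokens}
        token_ends = {t["end"] for t in tokens}
        for span in eg.get("spans", []):
            snippet = text[span["start"]:span["end"]]
            if snippet != snippet.strip():
                yield i, span, "whitespace"
            elif tokens and (
                span["start"] not in token_starts
                or span["end"] not in token_ends
            ):
                yield i, span, "misaligned"


def trim_span(text, span):
    """Return a copy of span with surrounding whitespace trimmed off.

    Only the character offsets are changed. Returns None if the span
    covers nothing but whitespace.
    """
    snippet = text[span["start"]:span["end"]]
    if not snippet.strip():
        return None
    lead = len(snippet) - len(snippet.lstrip())
    trail = len(snippet) - len(snippet.rstrip())
    fixed = dict(span)
    fixed["start"] = span["start"] + lead
    fixed["end"] = span["end"] - trail
    return fixed


if __name__ == "__main__":
    # Tiny in-memory example; replace with load_jsonl("your_export.jsonl").
    demo = [{
        "text": "Apple Inc. is big",
        "tokens": [
            {"text": "Apple", "start": 0, "end": 5},
            {"text": "Inc.", "start": 6, "end": 10},
            {"text": "is", "start": 11, "end": 13},
            {"text": "big", "start": 14, "end": 17},
        ],
        "spans": [{"start": 0, "end": 11, "label": "ORG"}],
    }]
    for idx, span, reason in find_suspect_spans(demo):
        print(idx, reason, repr(demo[idx]["text"][span["start"]:span["end"]]))
```

Printing the flagged spans with repr makes stray spaces and newlines visible, which should help show whether the problem is whitespace inside the span or a tokenization mismatch.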
