I'm trying to annotate a dataset with the following string:
"The partnership comes after Baidu received approval earlier this week from regulators to test its self-driving cars in California, where Tesla Motors Inc.,Ford Motor Co. and Google parent Alphabet Inc., among others, are testing their autonomous-driving cars on the road."
Notice the period and comma between Inc and Ford – trying to load this dataset will trigger a UI error:
I cannot change/remove the blue box next to the token Ford (correctly identified as ORG, but somehow prodigy is tripped up by the leading comma). Trying to remove the entity label leads to a JS error...
I've seen this happen before with similarly malformed data, which leads all identified entity labels to shift by one token.
I've manually changed the data for now. Ideally, prodigy would not split longer strings in those cases where this bug might occur.
Thanks for looking into this!
Hi! Which version of Prodigy are you using and how are you loading in your examples? Are you using a recipe like ner.correct? And are you using your own tokenization?
I wasn't able to reproduce this using the text, both with token-based and character-based annotation. The result in ner.correct looks like this, and the missing space only causes the tokens to not be separated, which makes sense:
Hi! I'm trying this on
I'm loading the data from a JSONL file, and this is the recipe I'm using for annotating:
prodigy ner.make-gold dataset spacy_model my_data.jsonl --label ORG
spacy_model is based on en_core_web_lg, to which I added an EntityRuler with a custom entity, and saved it to disk. Perhaps this is where the issue lies?
Ah, interesting! I think the most likely explanation here is that the data that gets sent from the recipe to the app is somehow mismatched – mismatched tokenization, spans etc. But it's unlikely that it's just the model and/or entity ruler producing this, because then spaCy would just error much earlier and tell you about it. spaCy isn't going to output a Doc that's invalid or mismatched.
Does setting --unsegmented solve the problem? This would indicate that the problem is in Prodigy's sentence segmentation wrapper and the spans and tokens aren't translated correctly.
Alternatively, I'd be interested to see the underlying JSON that ended up in the app and likely confused it. An easy way to output the JSON of the current example in the app is this:
- Open your browser's developer console and type: window.prodigy.content. This should output the JSON object.
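Once you have that JSON, a quick way to look for the kind of mismatch described above is to check every span against the token offsets. The following is only a rough sketch: it assumes the task uses the usual "text", "tokens" (with "text", "start", "end", "id") and "spans" (with "start", "end", "token_start", "token_end") keys, the function name find_misaligned_spans is made up for illustration, and the example tokens are written by hand rather than produced by spaCy.

```python
def find_misaligned_spans(task):
    """Return (item, reason) pairs for tokens/spans whose offsets disagree."""
    text = task["text"]
    tokens = task.get("tokens", [])
    # Map character offsets to token ids so spans can be checked against them.
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    problems = []
    for tok in tokens:
        if text[tok["start"]:tok["end"]] != tok["text"]:
            problems.append((tok, "token text does not match its character offsets"))
    for span in task.get("spans", []):
        if span["start"] not in starts:
            problems.append((span, "span start is not a token boundary"))
        elif span.get("token_start", starts[span["start"]]) != starts[span["start"]]:
            problems.append((span, "token_start disagrees with character start"))
        if span["end"] not in ends:
            problems.append((span, "span end is not a token boundary"))
        elif span.get("token_end", ends[span["end"]]) != ends[span["end"]]:
            problems.append((span, "token_end disagrees with character end"))
    return problems


# Hand-written example (hypothetical tokenization of the problematic fragment).
example = {
    "text": "Inc.,Ford Motor Co.",
    "tokens": [
        {"text": "Inc.", "start": 0, "end": 4, "id": 0},
        {"text": ",", "start": 4, "end": 5, "id": 1},
        {"text": "Ford", "start": 5, "end": 9, "id": 2},
        {"text": "Motor", "start": 10, "end": 15, "id": 3},
        {"text": "Co.", "start": 16, "end": 19, "id": 4},
    ],
    "spans": [
        {"start": 5, "end": 19, "token_start": 2, "token_end": 4, "label": "ORG"},
    ],
}

# Same task, but with the span's token indices shifted by one token.
shifted = dict(example, spans=[
    {"start": 5, "end": 19, "token_start": 1, "token_end": 3, "label": "ORG"},
])

for item, reason in find_misaligned_spans(shifted):
    print(reason, item)
```

If this reports anything for the JSON you copied from the console, that would point to the recipe sending mismatched tokens and spans rather than to the app itself.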