Token boundary bug in web interface

dc17 · July 22, 2020, 9:23am

I'm trying to annotate a dataset with the following string:

"The partnership comes after Baidu received approval earlier this week from regulators to test its self-driving cars in California, where Tesla Motors Inc.,Ford Motor Co. and Google parent Alphabet Inc., among others, are testing their autonomous-driving cars on the road."

Notice the period and comma between Inc and Ford, with no space after them – trying to load this dataset will trigger a UI error:

I cannot change/remove the blue box next to the token Ford (correctly identified as ORG, but somehow Prodigy is tripped up by the leading comma). Trying to remove the entity label leads to a JS error...

I've seen this happen before with similarly malformed data, which causes all identified entity labels to shift by one token.

I've manually changed the data for now. Ideally, Prodigy would not split longer strings in those cases where this bug might occur.

Thanks for looking into this!

ines · July 22, 2020, 9:32am

Hi! Which version of Prodigy are you using and how are you loading in your examples? Are you using a recipe like ner.correct? And are you using your own tokenization?

I wasn't able to reproduce this using the text, both with token-based and character-based annotation. The result in ner.correct looks like this, and the missing space only causes the tokens to not be separated, which makes sense:

dc17 · July 22, 2020, 10:00am

Hi! I'm trying this on prodigy-1.10.2.

I'm loading the data from a JSONL file, and this is the recipe I'm using for annotating:

prodigy ner.make-gold dataset spacy_model my_data.jsonl --label ORG

spacy_model is based on en_core_web_lg, to which I added an EntityRuler with a custom entity, and saved it to disk. Perhaps this is where the issue lies?

ines · July 22, 2020, 10:09am

Ah, interesting! I think the most likely explanation here is that the data that gets sent from the recipe to the app is somehow mismatched – mismatched tokenization, spans, etc. But it's unlikely that it's just the model and/or entity ruler producing this, because then spaCy would raise an error much earlier and tell you about it. spaCy isn't going to output a Doc that's invalid or mismatched.

Does setting --unsegmented solve the problem? This would indicate that the problem is in Prodigy's sentence segmentation wrapper and the spans and tokens aren't translated correctly.

Alternatively, I'd be interested to see the underlying JSON that ended up in the app and likely confused it. An easy way to output the JSON of the current example in the app is this:

1. Add "javascript": "console.log('JS enabled')" to your prodigy.json. This means Prodigy executes custom JavaScript and will add its state to window.prodigy.
2. Open your browser's developer console and type: window.prodigy.content. This should output the JSON object.
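
Once the JSON has been copied out of the console, a quick script can show whether any span is out of step with the tokens. The following is only a rough sketch, not part of Prodigy: the function name find_misaligned_spans is made up, and it assumes the task has a "text", a list of "tokens" with "start", "end" and "id", and a list of "spans" with "start", "end", "label" and optionally "token_start" and "token_end" (treated as inclusive). The sample data is hand-written for illustration and is not the output from this thread.

```python
import json


def find_misaligned_spans(task):
    """Return (span, reason) pairs for spans that do not line up with tokens.

    Hypothetical helper. Assumes each token has "start", "end" and "id",
    and that a span's "token_end" (if present) is an inclusive token index.
    """
    tokens = task.get("tokens", [])
    starts = {t["start"]: t["id"] for t in tokens}  # char offset -> token id
    ends = {t["end"]: t["id"] for t in tokens}      # char offset -> token id
    problems = []
    for span in task.get("spans", []):
        start_id = starts.get(span["start"])
        end_id = ends.get(span["end"])
        if start_id is None or end_id is None:
            # The span begins or ends in the middle of a token.
            problems.append((span, "character offsets fall inside a token"))
        elif (span.get("token_start", start_id) != start_id
              or span.get("token_end", end_id) != end_id):
            problems.append((span, "token indices disagree with character offsets"))
    return problems


# Hand-written sample: "Inc.,Ford" is a single token, but the span starts at "Ford".
pasted = """
{
  "text": "Inc.,Ford Motor Co.",
  "tokens": [
    {"text": "Inc.,Ford", "start": 0, "end": 9, "id": 0},
    {"text": "Motor", "start": 10, "end": 15, "id": 1},
    {"text": "Co.", "start": 16, "end": 19, "id": 2}
  ],
  "spans": [
    {"start": 5, "end": 19, "token_start": 0, "token_end": 2, "label": "ORG"}
  ]
}
"""

task = json.loads(pasted)
problems = find_misaligned_spans(task)
for span, reason in problems:
    snippet = task["text"][span["start"]:span["end"]]
    print(span["label"], repr(snippet), "->", reason)
```

If this reports nothing for the example that breaks in the app, the mismatch is probably somewhere other than the span offsets, and the raw JSON itself is still the most useful thing to share.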
