Inconsistency in "token_end" in prodigy/spacy entities

I have a sentence in which I want to find entries from a dictionary, and I am using spaCy's Matcher for this. From the matched span I take "start", "end", "start_char" and "end_char" to bootstrap Prodigy's manual annotation. However, the token end I get from the Matcher is +1 compared to the token end that Prodigy assigns when labeling by hand. For example, the pattern I generated from the dictionary looks like this:
{"id": "8efeb7c9b4be571cee362f60cd6705bc", "text": "Waited on weather to pull production riser Meanwhile: Performed TBT for displacing riser to clean SW", "spans": [{"dict_id": "3731", "text": "Waited on weather", "start": 0, "end": 17, "token_start": 0, "token_end": 3, "label": "Well Problem", "normal_form": "waited on weather"}, {"dict_id": "1341", "text": "production riser", "start": 26, "end": 42, "token_start": 5, "token_end": 7, "label": "Equipment", "normal_form": "production riser"}, {"dict_id": "168", "text": "TBT", "start": 65, "end": 68, "token_start": 11, "token_end": 12, "label": "Action", "normal_form": "through bore tree"}, {"dict_id": "851", "text": "riser", "start": 84, "end": 89, "token_start": 14, "token_end": 15, "label": "Equipment", "normal_form": "riser"}, {"dict_id": "3348", "text": "SW", "start": 99, "end": 101, "token_start": 17, "token_end": 18, "label": "Fluid Additive", "normal_form": "SW"}]}

and the Prodigy hand-labeled example looks like this:
{"id":"8efeb7c9b4be571cee362f60cd6705bc","text":"Waited on weather to pull production riser Meanwhile: Performed TBT for displacing riser to clean SW","spans":[{"start":0,"end":17,"token_start":0,"token_end":2,"label":"Well Problem"},{"start":21,"end":25,"token_start":4,"token_end":4,"label":"Action"},{"start":26,"end":42,"token_start":5,"token_end":6,"label":"Equipment"},{"start":55,"end":64,"token_start":10,"token_end":10,"label":"Action"},{"start":65,"end":68,"token_start":11,"token_end":11,"label":"Important Action"},{"start":73,"end":83,"token_start":13,"token_end":13,"label":"Action"},{"start":84,"end":89,"token_start":14,"token_end":14,"label":"Equipment"},{"start":99,"end":101,"token_start":17,"token_end":17,"label":"Fluid Additive"}],"_input_hash":1724628538,"_task_hash":-516837807,"tokens":[{"text":"Waited","start":0,"end":6,"id":0},{"text":"on","start":7,"end":9,"id":1},{"text":"weather","start":10,"end":17,"id":2},{"text":"to","start":18,"end":20,"id":3},{"text":"pull","start":21,"end":25,"id":4},{"text":"production","start":26,"end":36,"id":5},{"text":"riser","start":37,"end":42,"id":6},{"text":" ","start":43,"end":44,"id":7},{"text":"Meanwhile","start":44,"end":53,"id":8},{"text":":","start":53,"end":54,"id":9},{"text":"Performed","start":55,"end":64,"id":10},{"text":"TBT","start":65,"end":68,"id":11},{"text":"for","start":69,"end":72,"id":12},{"text":"displacing","start":73,"end":83,"id":13},{"text":"riser","start":84,"end":89,"id":14},{"text":"to","start":90,"end":92,"id":15},{"text":"clean","start":93,"end":98,"id":16},{"text":"SW","start":99,"end":101,"id":17}],"answer":"accept"}

My questions are:

  1. Is this an intended difference? If so, why?
  2. Right now I am adjusting the spaCy Matcher's "token_end" by -1 and pushing the data to the Prodigy database for bootstrapping, manual label correction and training. It seems to be working so far, but I would like to know which convention will be followed consistently, because manual annotation is expensive in both time and expertise.

It was quite hard to identify this, because I assumed Prodigy and spaCy behave the same. Suggestions would be very helpful, because at some point I will generate the Prodigy database bootstrap for manual correction automatically, and mistakes there would be difficult to undo or redo.

Thanks,
Arul.

The token_start and token_end in Prodigy are mostly used internally and map to the exact "id" values of the given entry in the "tokens". So the JSONL format itself is agnostic to the order of the tokens or what their IDs "mean".

But you're right that this is inconsistent with spaCy and we should have probably used the same logic for the indices. We actually thought about this before, but it'd be a breaking backwards-incompatible change, so we'd have to wait till Prodigy 2.

If you're using the same model / language for tokenization in spaCy and Prodigy, you do not have to include the token_start and token_end yourself btw. Prodigy's manual recipes will assign those if they're not present in the data and as long as the character offsets map to valid tokens (which they always will if tokenization is consistent), that should be no problem.
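For example, a task with character offsets only should be enough; a minimal sketch (the blank "en" tokenizer here stands in for whatever pipeline you also pass to the recipe):

```python
import spacy

nlp = spacy.blank("en")  # or whichever pipeline you also pass to the Prodigy recipe
text = "Waited on weather to pull production riser"
doc = nlp(text)

# If the character offsets align with token boundaries, char_span returns a Span;
# if it returns None, the offsets wouldn't map to valid tokens in Prodigy either.
assert doc.char_span(0, 17, label="Well Problem") is not None

# Task with character offsets only: the manual recipe will add "tokens" and the
# "token_start" / "token_end" values for you.
task = {
    "text": text,
    "spans": [{"start": 0, "end": 17, "label": "Well Problem"}],
}
```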


Thank you @ines.
I asked about this only for bootstrapping the manual label correction. Once Prodigy accepts the data for label correction, it produces a very consistent and compatible dataset for training.

I am applying the "-1" adjustment to the dictionary-identified labels. I generate the initial JSONL from scratch with these dictionary-identified label spans and push it to the Prodigy database using `db-in`; then the manual correction starts. In that case I have to generate the token numbers on my own. (I think the token indices follow the same convention for spaCy's default named entities, too.) I hope the "-1" adjustment is ok.
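A sketch of how I am currently generating that JSONL for `db-in` (simplified; the real matcher is built from the whole dictionary, and the dataset name below is just a placeholder):

```python
import json
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: same tokenizer as the model used in the Prodigy recipe
matcher = Matcher(nlp.vocab)
matcher.add("Equipment", [[{"LOWER": "production"}, {"LOWER": "riser"}]])  # one entry shown

def doc_to_task(doc):
    """Build one Prodigy task per doc, applying the '-1' adjustment to token_end."""
    spans = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        spans.append({
            "text": span.text,
            "start": span.start_char,
            "end": span.end_char,
            "token_start": start,
            "token_end": end - 1,  # spaCy's end is exclusive, Prodigy's token_end is inclusive
            "label": nlp.vocab.strings[match_id],
        })
    tokens = [{"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
              for t in doc]
    return {"text": doc.text, "tokens": tokens, "spans": spans}

texts = ["Waited on weather to pull production riser"]  # the real corpus goes here
with open("bootstrap.jsonl", "w", encoding="utf8") as f:
    for doc in nlp.pipe(texts):
        f.write(json.dumps(doc_to_task(doc)) + "\n")

# then: prodigy db-in my_dataset bootstrap.jsonl
```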