Inconsistency in "token_end" in prodigy/spacy entities

I have a sentence in which I want to find entries from a dictionary, and I am using spaCy's Matcher for this. From the matched span I take "start", "end", "start_char" and "end_char" to bootstrap Prodigy's manual annotation. However, the token end I get from the Matcher is +1 compared to the token end that Prodigy assigns when labeling by hand. For example, the pattern I generated from the dictionary looks like this:
{"id": "8efeb7c9b4be571cee362f60cd6705bc", "text": "Waited on weather to pull production riser Meanwhile: Performed TBT for displacing riser to clean SW", "spans": [{"dict_id": "3731", "text": "Waited on weather", "start": 0, "end": 17, "token_start": 0, "token_end": 3, "label": "Well Problem", "normal_form": "waited on weather"}, {"dict_id": "1341", "text": "production riser", "start": 26, "end": 42, "token_start": 5, "token_end": 7, "label": "Equipment", "normal_form": "production riser"}, {"dict_id": "168", "text": "TBT", "start": 65, "end": 68, "token_start": 11, "token_end": 12, "label": "Action", "normal_form": "through bore tree"}, {"dict_id": "851", "text": "riser", "start": 84, "end": 89, "token_start": 14, "token_end": 15, "label": "Equipment", "normal_form": "riser"}, {"dict_id": "3348", "text": "SW", "start": 99, "end": 101, "token_start": 17, "token_end": 18, "label": "Fluid Additive", "normal_form": "SW"}]}

and the Prodigy hand-labeled example looks like this:
{"id":"8efeb7c9b4be571cee362f60cd6705bc","text":"Waited on weather to pull production riser Meanwhile: Performed TBT for displacing riser to clean SW","spans":[{"start":0,"end":17,"token_start":0,"token_end":2,"label":"Well Problem"},{"start":21,"end":25,"token_start":4,"token_end":4,"label":"Action"},{"start":26,"end":42,"token_start":5,"token_end":6,"label":"Equipment"},{"start":55,"end":64,"token_start":10,"token_end":10,"label":"Action"},{"start":65,"end":68,"token_start":11,"token_end":11,"label":"Important Action"},{"start":73,"end":83,"token_start":13,"token_end":13,"label":"Action"},{"start":84,"end":89,"token_start":14,"token_end":14,"label":"Equipment"},{"start":99,"end":101,"token_start":17,"token_end":17,"label":"Fluid Additive"}],"_input_hash":1724628538,"_task_hash":-516837807,"tokens":[{"text":"Waited","start":0,"end":6,"id":0},{"text":"on","start":7,"end":9,"id":1},{"text":"weather","start":10,"end":17,"id":2},{"text":"to","start":18,"end":20,"id":3},{"text":"pull","start":21,"end":25,"id":4},{"text":"production","start":26,"end":36,"id":5},{"text":"riser","start":37,"end":42,"id":6},{"text":" ","start":43,"end":44,"id":7},{"text":"Meanwhile","start":44,"end":53,"id":8},{"text":":","start":53,"end":54,"id":9},{"text":"Performed","start":55,"end":64,"id":10},{"text":"TBT","start":65,"end":68,"id":11},{"text":"for","start":69,"end":72,"id":12},{"text":"displacing","start":73,"end":83,"id":13},{"text":"riser","start":84,"end":89,"id":14},{"text":"to","start":90,"end":92,"id":15},{"text":"clean","start":93,"end":98,"id":16},{"text":"SW","start":99,"end":101,"id":17}],"answer":"accept"}

My questions are:

  1. Is this an intended difference? If so, why?
  2. Right now I am adjusting the spaCy Matcher's "token_end" by -1 and pushing the data to the Prodigy database for bootstrapping, manual label correction and training. It seems to be working so far, but I would like to know which convention will be followed consistently, because manual annotation is expensive in both time and expertise.

It was quite hard to identify this, because I assumed Prodigy and spaCy behave the same. Suggestions would be very helpful, because at some point I will generate the Prodigy database bootstrap for manual correction automatically, and mistakes there would be difficult to undo or redo.

Thanks,
Arul.

The token_start and token_end in Prodigy are mostly used internally and map to the exact "id" values of the given entry in the "tokens". So the JSONL format itself is agnostic to the order of the tokens or what their IDs "mean".

But you're right that this is inconsistent with spaCy and we should have probably used the same logic for the indices. We actually thought about this before, but it'd be a breaking backwards-incompatible change, so we'd have to wait till Prodigy 2.

If you're using the same model / language for tokenization in spaCy and Prodigy, you do not have to include the token_start and token_end yourself btw. Prodigy's manual recipes will assign those if they're not present in the data and as long as the character offsets map to valid tokens (which they always will if tokenization is consistent), that should be no problem.
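For example, a task with character offsets only should be enough; a minimal sketch (the blank "en" tokenizer here stands in for whatever pipeline you also pass to the recipe):

```python
import spacy

nlp = spacy.blank("en")  # or whichever pipeline you also pass to the Prodigy recipe
text = "Waited on weather to pull production riser"
doc = nlp(text)

# If the character offsets align with token boundaries, char_span returns a Span;
# if it returns None, the offsets wouldn't map to valid tokens in Prodigy either.
assert doc.char_span(0, 17, label="Well Problem") is not None

# Task with character offsets only: the manual recipe will add "tokens" and the
# "token_start" / "token_end" values for you.
task = {
    "text": text,
    "spans": [{"start": 0, "end": 17, "label": "Well Problem"}],
}
```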


Thank you @ines.
I asked about this only for bootstrapping the manual label correction. Once Prodigy accepts the data for label correction, it produces a very consistent and compatible dataset for training.

I am applying the "-1" adjustment to the dictionary-identified labels. I generate the initial JSONL from scratch with these dictionary-identified label spans and push it to the Prodigy database using `db-in`; then the manual correction starts. In that case I have to generate the token numbers on my own. (I think the token indices follow the same convention for spaCy's default named entities, too.) I hope the "-1" adjustment is ok.
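A sketch of how I am currently generating that JSONL for `db-in` (simplified; the real matcher is built from the whole dictionary, and the dataset name below is just a placeholder):

```python
import json
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: same tokenizer as the model used in the Prodigy recipe
matcher = Matcher(nlp.vocab)
matcher.add("Equipment", [[{"LOWER": "production"}, {"LOWER": "riser"}]])  # one entry shown

def doc_to_task(doc):
    """Build one Prodigy task per doc, applying the '-1' adjustment to token_end."""
    spans = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        spans.append({
            "text": span.text,
            "start": span.start_char,
            "end": span.end_char,
            "token_start": start,
            "token_end": end - 1,  # spaCy's end is exclusive, Prodigy's token_end is inclusive
            "label": nlp.vocab.strings[match_id],
        })
    tokens = [{"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
              for t in doc]
    return {"text": doc.text, "tokens": tokens, "spans": spans}

texts = ["Waited on weather to pull production riser"]  # the real corpus goes here
with open("bootstrap.jsonl", "w", encoding="utf8") as f:
    for doc in nlp.pipe(texts):
        f.write(json.dumps(doc_to_task(doc)) + "\n")

# then: prodigy db-in my_dataset bootstrap.jsonl
```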