I have a sentence in which I want to find entries from a dictionary, and I am using the spaCy Matcher to do it. From each match I take “start”, “end”, “start_char” and “end_char” to bootstrap Prodigy manual annotation. The token_end I get from the Matcher is one higher than the token_end that Prodigy stores for the same span after hand labeling. For example, the record I generated from the dictionary matches looks like this:
{"id": "8efeb7c9b4be571cee362f60cd6705bc", "text": "Waited on weather to pull production riser Meanwhile: Performed TBT for displacing riser to clean SW", "spans": [{"dict_id": "3731", "text": "Waited on weather", "start": 0, "end": 17, "token_start": 0, "token_end": 3, "label": "Well Problem", "normal_form": "waited on weather"}, {"dict_id": "1341", "text": "production riser", "start": 26, "end": 42, "token_start": 5, "token_end": 7, "label": "Equipment", "normal_form": "production riser"}, {"dict_id": "168", "text": "TBT", "start": 65, "end": 68, "token_start": 11, "token_end": 12, "label": "Action", "normal_form": "through bore tree"}, {"dict_id": "851", "text": "riser", "start": 84, "end": 89, "token_start": 14, "token_end": 15, "label": "Equipment", "normal_form": "riser"}, {"dict_id": "3348", "text": "SW", "start": 99, "end": 101, "token_start": 17, "token_end": 18, "label": "Fluid Additive", "normal_form": "SW"}]}
and the same text after hand labeling in Prodigy looks like this:
{"id":"8efeb7c9b4be571cee362f60cd6705bc","text":"Waited on weather to pull production riser Meanwhile: Performed TBT for displacing riser to clean SW","spans":[{"start":0,"end":17,"token_start":0,"token_end":2,"label":"Well Problem"},{"start":21,"end":25,"token_start":4,"token_end":4,"label":"Action"},{"start":26,"end":42,"token_start":5,"token_end":6,"label":"Equipment"},{"start":55,"end":64,"token_start":10,"token_end":10,"label":"Action"},{"start":65,"end":68,"token_start":11,"token_end":11,"label":"Important Action"},{"start":73,"end":83,"token_start":13,"token_end":13,"label":"Action"},{"start":84,"end":89,"token_start":14,"token_end":14,"label":"Equipment"},{"start":99,"end":101,"token_start":17,"token_end":17,"label":"Fluid Additive"}],"_input_hash":1724628538,"_task_hash":-516837807,"tokens":[{"text":"Waited","start":0,"end":6,"id":0},{"text":"on","start":7,"end":9,"id":1},{"text":"weather","start":10,"end":17,"id":2},{"text":"to","start":18,"end":20,"id":3},{"text":"pull","start":21,"end":25,"id":4},{"text":"production","start":26,"end":36,"id":5},{"text":"riser","start":37,"end":42,"id":6},{"text":" ","start":43,"end":44,"id":7},{"text":"Meanwhile","start":44,"end":53,"id":8},{"text":":","start":53,"end":54,"id":9},{"text":"Performed","start":55,"end":64,"id":10},{"text":"TBT","start":65,"end":68,"id":11},{"text":"for","start":69,"end":72,"id":12},{"text":"displacing","start":73,"end":83,"id":13},{"text":"riser","start":84,"end":89,"id":14},{"text":"to","start":90,"end":92,"id":15},{"text":"clean","start":93,"end":98,"id":16},{"text":"SW","start":99,"end":101,"id":17}],"answer":"accept"}
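To make the difference concrete, here is a small check in plain Python (no spaCy or Prodigy calls; the helper name `token_end_convention` is my own and purely illustrative). It compares a span's character `end` with the `end` offsets of the tokens to see whether `token_end` points at the last token of the span (inclusive) or at the token after it (exclusive). The token list is shortened to the first four tokens of the record above.

```python
def token_end_convention(span, tokens):
    """Return 'inclusive', 'exclusive' or 'unknown' for span['token_end']."""
    by_id = {tok["id"]: tok for tok in tokens}

    # Inclusive: token_end is the id of the last token inside the span.
    last = by_id.get(span["token_end"])
    if last is not None and last["end"] == span["end"]:
        return "inclusive"

    # Exclusive: token_end is one past the last token inside the span.
    prev = by_id.get(span["token_end"] - 1)
    if prev is not None and prev["end"] == span["end"]:
        return "exclusive"

    return "unknown"


# First four tokens copied from the hand-labeled record.
tokens = [
    {"text": "Waited", "start": 0, "end": 6, "id": 0},
    {"text": "on", "start": 7, "end": 9, "id": 1},
    {"text": "weather", "start": 10, "end": 17, "id": 2},
    {"text": "to", "start": 18, "end": 20, "id": 3},
]

# "Waited on weather" as written by my Matcher script and by Prodigy.
matcher_span = {"start": 0, "end": 17, "token_start": 0, "token_end": 3}
prodigy_span = {"start": 0, "end": 17, "token_start": 0, "token_end": 2}

print(token_end_convention(matcher_span, tokens))  # exclusive
print(token_end_convention(prodigy_span, tokens))  # inclusive
```

So, at least for this record, the Matcher-derived value behaves like a Python slice end, while the hand-labeled value is the index of the last token itself.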
My questions are:
- Is this difference intended? If so, why?
- Right now I am reducing the spaCy Matcher’s “token_end” by 1 before pushing the spans to the Prodigy database for bootstrapping, manual label correction and training. It seems to be working fine for now, but I would like to know which convention will be followed consistently, because manual annotation is quite expensive in both time and expertise.
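Roughly, the adjustment looks like the sketch below. It is plain Python on the task dictionary and assumes that every incoming `token_end` is exclusive (one past the last token), as in my Matcher-derived record; the function name `to_inclusive_token_end` is hypothetical and not part of spaCy or Prodigy.

```python
import copy


def to_inclusive_token_end(task):
    """Return a copy of the task with each span's token_end reduced by 1.

    Assumes the incoming token_end is exclusive (slice-style).
    """
    converted = copy.deepcopy(task)
    for span in converted.get("spans", []):
        # An exclusive end is always greater than the start, so this guard
        # catches single-token spans that were already converted once.
        # It cannot detect a double adjustment on multi-token spans.
        if span["token_end"] <= span["token_start"]:
            raise ValueError(f"token_end does not look exclusive: {span}")
        span["token_end"] -= 1
    return converted


# Two spans copied from the Matcher-derived record, other keys omitted.
task = {
    "text": "Waited on weather to pull production riser Meanwhile: "
            "Performed TBT for displacing riser to clean SW",
    "spans": [
        {"start": 0, "end": 17, "token_start": 0, "token_end": 3,
         "label": "Well Problem"},
        {"start": 65, "end": 68, "token_start": 11, "token_end": 12,
         "label": "Action"},
    ],
}

fixed = to_inclusive_token_end(task)
print([(s["token_start"], s["token_end"]) for s in fixed["spans"]])
# [(0, 2), (11, 11)]
```

The function works on a deep copy, so the original Matcher output stays available if the adjustment ever has to be redone.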
It was quite hard to identify this, as I had assumed everything was the same in Prodigy and spaCy. Suggestions would be very helpful, because at some point I will generate the Prodigy database bootstrap for manual correction automatically, and that could be difficult to undo or redo.
Thanks,
Arul.