Mismatching spans

Hello!

What I did:
I took three files (text, BERT tokens, and IOB labels) and converted them to Prodigy's format, like here:

Those files contain both correct and incorrect predictions from a PyTorch model.

Format I have (text, tokens, spans):

{"text":"U Profile BIKARAEROSPACE GmbH fast bayrische Verh\u00e4ltnisse Und am Wittgensteiner Land rollt der Verkehr vorbei Was die Verkehrsanbindung Wittgenstein anbetrifft werden wir nicht nachlassen verspricht der Hauptgesch\u00e4ftsf\u00fchrer Lesen Sie mehr Dabei zeichnet sich die Dann gibt es die M\u00f6glichkeit im Betrieb zu arbeiten und an den Wochenenden die Uni zu besuchen Platten Bleche Zuschnitte Ronden Ringe Stangen Rohre und Profile Dazu kommt die Stangenware also Stangen Rohre und Profile in allen m\u00f6glichen Abmessungen","tokens":[{"text":"U","start":0,"end":1,"id":0},{"text":"Profil","start":2,"end":8,"id":1},{"text":"##e","start":8,"end":11,"id":2},{"text":"B","start":12,"end":13,"id":3},{"text":"##IK","start":13,"end":17,"id":4},{"text":"##AR","start":17,"end":21,"id":5},{"text":"##AE","start":21,"end":25,"id":6},{"text":"##RO","start":25,"end":29,"id":7},{"text":"##SP","start":29,"end":33,"id":8},{"text":"##ACE","start":33,"end":38,"id":9},{"text":"GmbH","start":39,"end":43,"id":10},{"text":"fast","start":44,"end":48,"id":11},{"text":"ba","start":49,"end":51,"id":12},{"text":"##yr","start":51,"end":55,"id":13},{"text":"##ische","start":55,"end":62,"id":14},{"text":"Verh","start":63,"end":67,"id":15},{"text":"##alt","start":67,"end":72,"id":16},{"text":"##nisse","start":72,"end":79,"id":17},{"text":"Und","start":80,"end":83,"id":18},{"text":"am","start":84,"end":86,"id":19},{"text":"Witt","start":87,"end":91,"id":20},{"text":"##gens","start":91,"end":97,"id":21},{"text":"##te","start":97,"end":101,"id":22},{"text":"##iner","start":101,"end":107,"id":23},{"text":"Land","start":108,"end":112,"id":24},{"text":"ro","start":113,"end":115,"id":25},{"text":"##llt","start":115,"end":120,"id":26},{"text":"der","start":121,"end":124,"id":27},{"text":"Verkehr","start":125,"end":132,"id":28},{"text":"vorbei","start":133,"end":139,"id":29},{"text":"Was","start":140,"end":143,"id":30},{"text":"die","start":144,"end":147,"id":31},{"text":"Verkehrs","start":1
48,"end":156,"id":32},{"text":"##an","start":156,"end":160,"id":33},{"text":"##bindung","start":160,"end":169,"id":34},{"text":"Witt","start":170,"end":174,"id":35},{"text":"##gens","start":174,"end":180,"id":36},{"text":"##te","start":180,"end":184,"id":37},{"text":"##in","start":184,"end":188,"id":38},{"text":"an","start":189,"end":191,"id":39},{"text":"##bet","start":191,"end":196,"id":40},{"text":"##rifft","start":196,"end":203,"id":41},{"text":"werden","start":204,"end":210,"id":42},{"text":"wir","start":211,"end":214,"id":43},{"text":"nicht","start":215,"end":220,"id":44},{"text":"nach","start":221,"end":225,"id":45},{"text":"##lassen","start":225,"end":233,"id":46},{"text":"verspricht","start":234,"end":244,"id":47},{"text":"der","start":245,"end":248,"id":48},{"text":"Haupt","start":249,"end":254,"id":49},{"text":"##gesch","start":254,"end":261,"id":50},{"text":"##aft","start":261,"end":266,"id":51},{"text":"##sf","start":266,"end":270,"id":52},{"text":"##uhr","start":270,"end":275,"id":53},{"text":"##er","start":275,"end":279,"id":54},{"text":"Lesen","start":280,"end":285,"id":55},{"text":"Sie","start":286,"end":289,"id":56},{"text":"mehr","start":290,"end":294,"id":57},{"text":"Dabei","start":295,"end":300,"id":58},{"text":"zeichnet","start":301,"end":309,"id":59},{"text":"sich","start":310,"end":314,"id":60},{"text":"die","start":315,"end":318,"id":61},{"text":"Dann","start":319,"end":323,"id":62},{"text":"gibt","start":324,"end":328,"id":63},{"text":"es","start":329,"end":331,"id":64},{"text":"die","start":332,"end":335,"id":65},{"text":"Mog","start":336,"end":339,"id":66},{"text":"##lichkeit","start":339,"end":349,"id":67},{"text":"im","start":350,"end":352,"id":68},{"text":"Betrieb","start":353,"end":360,"id":69},{"text":"zu","start":361,"end":363,"id":70},{"text":"arbeiten","start":364,"end":372,"id":71},{"text":"und","start":373,"end":376,"id":72},{"text":"an","start":377,"end":379,"id":73},{"text":"den","start":380,"end":383,"id":74},{"text":"Wochen
enden","start":384,"end":395,"id":75},{"text":"die","start":396,"end":399,"id":76},{"text":"Uni","start":400,"end":403,"id":77},{"text":"zu","start":404,"end":406,"id":78},{"text":"besuchen","start":407,"end":415,"id":79},{"text":"Platten","start":416,"end":423,"id":80},{"text":"Blech","start":424,"end":429,"id":81},{"text":"##e","start":429,"end":432,"id":82},{"text":"Zusch","start":433,"end":438,"id":83},{"text":"##nitt","start":438,"end":444,"id":84},{"text":"##e","start":444,"end":447,"id":85},{"text":"Ron","start":448,"end":451,"id":86},{"text":"##den","start":451,"end":456,"id":87},{"text":"Ringe","start":457,"end":462,"id":88},{"text":"Stan","start":463,"end":467,"id":89},{"text":"##gen","start":467,"end":472,"id":90},{"text":"Rohr","start":473,"end":477,"id":91},{"text":"##e","start":477,"end":480,"id":92},{"text":"und","start":481,"end":484,"id":93},{"text":"Profil","start":485,"end":491,"id":94},{"text":"##e","start":491,"end":494,"id":95},{"text":"Dazu","start":495,"end":499,"id":96},{"text":"kommt","start":500,"end":505,"id":97},{"text":"die","start":506,"end":509,"id":98},{"text":"Stan","start":510,"end":514,"id":99},{"text":"##gen","start":514,"end":519,"id":100},{"text":"##ware","start":519,"end":525,"id":101},{"text":"also","start":526,"end":530,"id":102},{"text":"Stan","start":531,"end":535,"id":103},{"text":"##gen","start":535,"end":540,"id":104},{"text":"Rohr","start":541,"end":545,"id":105},{"text":"##e","start":545,"end":548,"id":106},{"text":"und","start":549,"end":552,"id":107},{"text":"Profil","start":553,"end":559,"id":108},{"text":"##e","start":559,"end":562,"id":109},{"text":"in","start":563,"end":565,"id":110},{"text":"allen","start":566,"end":571,"id":111},{"text":"mo","start":572,"end":574,"id":112},{"text":"##glichen","start":574,"end":583,"id":113},{"text":"Abmessungen","start":584,"end":595,"id":114}],"spans":[{"start":2,"end":11,"label":"PRODNAME"},{"start":404,"end":411,"label":"PRODNAME"},{"start":412,"end":420,"label":"PRODNAME"}
,{"start":421,"end":435,"label":"PRODNAME"},{"start":436,"end":444,"label":"PRODNAME"},{"start":445,"end":450,"label":"PRODNAME"},{"start":451,"end":460,"label":"PRODNAME"},{"start":461,"end":468,"label":"PRODNAME"},{"start":473,"end":482,"label":"PRODNAME"},{"start":498,"end":502,"label":"PRODNAME"},{"start":519,"end":528,"label":"PRODNAME"},{"start":529,"end":536,"label":"PRODNAME"},{"start":541,"end":550,"label":"PRODNAME"}]}
{"text":"70 x 70 x 25 60 x 60 x 25 Kupfer Sechs##kant##sta##ng##en Aluminium##Gu##ss##platten FOR##MO##DA##L 02##3 Eben##heit mm##m FOR##MO##DA##L BM##50##83 \u2013 Pra##zi##si##ons##Wa##l##z##platte EN AW 50##83 Univers##ell einsetz##bare Aluminium##platten fur erh##oh##te Anforderungen im Werkzeug Formen und Modell##bau Hoch##feste Aluminium##Wa##l##z##platten FOR##MO##DA##L BM##400","tokens":[{"text":"70","start":0,"end":2,"id":0},{"text":"x","start":3,"end":4,"id":1},{"text":"70","start":5,"end":7,"id":2},{"text":"x","start":8,"end":9,"id":3},{"text":"25","start":10,"end":12,"id":4},{"text":"60","start":13,"end":15,"id":5},{"text":"x","start":16,"end":17,"id":6},{"text":"60","start":18,"end":20,"id":7},{"text":"x","start":21,"end":22,"id":8},{"text":"25","start":23,"end":25,"id":9},{"text":"Kupfer","start":26,"end":32,"id":10},{"text":"Sechs","start":33,"end":38,"id":11},{"text":"##kant","start":38,"end":44,"id":12},{"text":"##sta","start":44,"end":49,"id":13},{"text":"##ng","start":49,"end":53,"id":14},{"text":"##en","start":53,"end":57,"id":15},{"text":"Aluminium","start":58,"end":67,"id":16},{"text":"##Gu","start":67,"end":71,"id":17},{"text":"##ss","start":71,"end":75,"id":18},{"text":"##platten","start":75,"end":84,"id":19},{"text":"FOR","start":85,"end":88,"id":20},{"text":"##MO","start":88,"end":92,"id":21},{"text":"##DA","start":92,"end":96,"id":22},{"text":"##L","start":96,"end":99,"id":23},{"text":"02","start":100,"end":102,"id":24},{"text":"##3","start":102,"end":105,"id":25},{"text":"Eben","start":106,"end":110,"id":26},{"text":"##heit","start":110,"end":116,"id":27},{"text":"mm","start":117,"end":119,"id":28},{"text":"##m","start":119,"end":122,"id":29},{"text":"FOR","start":123,"end":126,"id":30},{"text":"##MO","start":126,"end":130,"id":31},{"text":"##DA","start":130,"end":134,"id":32},{"text":"##L","start":134,"end":137,"id":33},{"text":"BM","start":138,"end":140,"id":34},{"text":"##50","start":140,"end":144,"id":35},{"text":"##83","start":144,"e
nd":148,"id":36},{"text":"\u2013","start":149,"end":150,"id":37},{"text":"Pra","start":151,"end":154,"id":38},{"text":"##zi","start":154,"end":158,"id":39},{"text":"##si","start":158,"end":162,"id":40},{"text":"##ons","start":162,"end":167,"id":41},{"text":"##Wa","start":167,"end":171,"id":42},{"text":"##l","start":171,"end":174,"id":43},{"text":"##z","start":174,"end":177,"id":44},{"text":"##platte","start":177,"end":185,"id":45},{"text":"EN","start":186,"end":188,"id":46},{"text":"AW","start":189,"end":191,"id":47},{"text":"50","start":192,"end":194,"id":48},{"text":"##83","start":194,"end":198,"id":49},{"text":"Univers","start":199,"end":206,"id":50},{"text":"##ell","start":206,"end":211,"id":51},{"text":"einsetz","start":212,"end":219,"id":52},{"text":"##bare","start":219,"end":225,"id":53},{"text":"Aluminium","start":226,"end":235,"id":54},{"text":"##platten","start":235,"end":244,"id":55},{"text":"fur","start":245,"end":248,"id":56},{"text":"erh","start":249,"end":252,"id":57},{"text":"##oh","start":252,"end":256,"id":58},{"text":"##te","start":256,"end":260,"id":59},{"text":"Anforderungen","start":261,"end":274,"id":60},{"text":"im","start":275,"end":277,"id":61},{"text":"Werkzeug","start":278,"end":286,"id":62},{"text":"Formen","start":287,"end":293,"id":63},{"text":"und","start":294,"end":297,"id":64},{"text":"Modell","start":298,"end":304,"id":65},{"text":"##bau","start":304,"end":309,"id":66},{"text":"Hoch","start":310,"end":314,"id":67},{"text":"##feste","start":314,"end":321,"id":68},{"text":"Aluminium","start":322,"end":331,"id":69},{"text":"##Wa","start":331,"end":335,"id":70},{"text":"##l","start":335,"end":338,"id":71},{"text":"##z","start":338,"end":341,"id":72},{"text":"##platten","start":341,"end":350,"id":73},{"text":"FOR","start":351,"end":354,"id":74},{"text":"##MO","start":354,"end":358,"id":75},{"text":"##DA","start":358,"end":362,"id":76},{"text":"##L","start":362,"end":365,"id":77},{"text":"BM","start":366,"end":368,"id":78},{"text":"##400
","start":368,"end":373,"id":79}],"spans":[{"start":2,"end":11,"label":"PRODNAME"},{"start":404,"end":411,"label":"PRODNAME"},{"start":412,"end":420,"label":"PRODNAME"},{"start":421,"end":435,"label":"PRODNAME"},{"start":436,"end":444,"label":"PRODNAME"},{"start":445,"end":450,"label":"PRODNAME"},{"start":451,"end":460,"label":"PRODNAME"},{"start":461,"end":468,"label":"PRODNAME"},{"start":473,"end":482,"label":"PRODNAME"},{"start":498,"end":502,"label":"PRODNAME"},{"start":519,"end":528,"label":"PRODNAME"},{"start":529,"end":536,"label":"PRODNAME"},{"start":541,"end":550,"label":"PRODNAME"}]}
{"text":"30 x 10 x 15 30 x 20 x 20 mm a x b x a x s FOR##MO##DA##L 07 sehr geehrt##er Gesch##aft##spartner Andere un##ed##le Metall##e einschlie\u00dflich Stan##gen Nickel##matt##e Nickel##oxid##sin##ter und andere Hoch##feste Pra##zi##si##ons##wal##z##platten","tokens":[{"text":"30","start":0,"end":2,"id":0},{"text":"x","start":3,"end":4,"id":1},{"text":"10","start":5,"end":7,"id":2},{"text":"x","start":8,"end":9,"id":3},{"text":"15","start":10,"end":12,"id":4},{"text":"30","start":13,"end":15,"id":5},{"text":"x","start":16,"end":17,"id":6},{"text":"20","start":18,"end":20,"id":7},{"text":"x","start":21,"end":22,"id":8},{"text":"20","start":23,"end":25,"id":9},{"text":"mm","start":26,"end":28,"id":10},{"text":"a","start":29,"end":30,"id":11},{"text":"x","start":31,"end":32,"id":12},{"text":"b","start":33,"end":34,"id":13},{"text":"x","start":35,"end":36,"id":14},{"text":"a","start":37,"end":38,"id":15},{"text":"x","start":39,"end":40,"id":16},{"text":"s","start":41,"end":42,"id":17},{"text":"FOR","start":43,"end":46,"id":18},{"text":"##MO","start":46,"end":50,"id":19},{"text":"##DA","start":50,"end":54,"id":20},{"text":"##L","start":54,"end":57,"id":21},{"text":"07","start":58,"end":60,"id":22},{"text":"sehr","start":61,"end":65,"id":23},{"text":"geehrt","start":66,"end":72,"id":24},{"text":"##er","start":72,"end":76,"id":25},{"text":"Gesch","start":77,"end":82,"id":26},{"text":"##aft","start":82,"end":87,"id":27},{"text":"##spartner","start":87,"end":97,"id":28},{"text":"Andere","start":98,"end":104,"id":29},{"text":"un","start":105,"end":107,"id":30},{"text":"##ed","start":107,"end":111,"id":31},{"text":"##le","start":111,"end":115,"id":32},{"text":"Metall","start":116,"end":122,"id":33},{"text":"##e","start":122,"end":125,"id":34},{"text":"einschlie\u00dflich","start":126,"end":140,"id":35},{"text":"Stan","start":141,"end":145,"id":36},{"text":"##gen","start":145,"end":150,"id":37},{"text":"Nickel","start":151,"end":157,"id":38},{"text":"##matt","start":157,"end
":163,"id":39},{"text":"##e","start":163,"end":166,"id":40},{"text":"Nickel","start":167,"end":173,"id":41},{"text":"##oxid","start":173,"end":179,"id":42},{"text":"##sin","start":179,"end":184,"id":43},{"text":"##ter","start":184,"end":189,"id":44},{"text":"und","start":190,"end":193,"id":45},{"text":"andere","start":194,"end":200,"id":46},{"text":"Hoch","start":201,"end":205,"id":47},{"text":"##feste","start":205,"end":212,"id":48},{"text":"Pra","start":213,"end":216,"id":49},{"text":"##zi","start":216,"end":220,"id":50},{"text":"##si","start":220,"end":224,"id":51},{"text":"##ons","start":224,"end":229,"id":52},{"text":"##wal","start":229,"end":234,"id":53},{"text":"##z","start":234,"end":237,"id":54},{"text":"##platten","start":237,"end":246,"id":55}],"spans":[{"start":2,"end":11,"label":"PRODNAME"},{"start":404,"end":411,"label":"PRODNAME"},{"start":412,"end":420,"label":"PRODNAME"},{"start":421,"end":435,"label":"PRODNAME"},{"start":436,"end":444,"label":"PRODNAME"},{"start":445,"end":450,"label":"PRODNAME"},{"start":451,"end":460,"label":"PRODNAME"},{"start":461,"end":468,"label":"PRODNAME"},{"start":473,"end":482,"label":"PRODNAME"},{"start":498,"end":502,"label":"PRODNAME"},{"start":519,"end":528,"label":"PRODNAME"},{"start":529,"end":536,"label":"PRODNAME"},{"start":541,"end":550,"label":"PRODNAME"}]}
{"text":"42##5 mm 42##9 mm 115 mm 117 mm 60 mm 61 mm 200 x 5 Vier##kant##sta##ng##en Blech##e und Platten aus Kunststoff Poly##vin##yl##iden##flu##ori##d P##VD##F 25 x 20 27 x 18 100 x 80 102 x 78","tokens":[{"text":"42","start":0,"end":2,"id":0},{"text":"##5","start":2,"end":5,"id":1},{"text":"mm","start":6,"end":8,"id":2},{"text":"42","start":9,"end":11,"id":3},{"text":"##9","start":11,"end":14,"id":4},{"text":"mm","start":15,"end":17,"id":5},{"text":"115","start":18,"end":21,"id":6},{"text":"mm","start":22,"end":24,"id":7},{"text":"117","start":25,"end":28,"id":8},{"text":"mm","start":29,"end":31,"id":9},{"text":"60","start":32,"end":34,"id":10},{"text":"mm","start":35,"end":37,"id":11},{"text":"61","start":38,"end":40,"id":12},{"text":"mm","start":41,"end":43,"id":13},{"text":"200","start":44,"end":47,"id":14},{"text":"x","start":48,"end":49,"id":15},{"text":"5","start":50,"end":51,"id":16},{"text":"Vier","start":52,"end":56,"id":17},{"text":"##kant","start":56,"end":62,"id":18},{"text":"##sta","start":62,"end":67,"id":19},{"text":"##ng","start":67,"end":71,"id":20},{"text":"##en","start":71,"end":75,"id":21},{"text":"Blech","start":76,"end":81,"id":22},{"text":"##e","start":81,"end":84,"id":23},{"text":"und","start":85,"end":88,"id":24},{"text":"Platten","start":89,"end":96,"id":25},{"text":"aus","start":97,"end":100,"id":26},{"text":"Kunststoff","start":101,"end":111,"id":27},{"text":"Poly","start":112,"end":116,"id":28},{"text":"##vin","start":116,"end":121,"id":29},{"text":"##yl","start":121,"end":125,"id":30},{"text":"##iden","start":125,"end":131,"id":31},{"text":"##flu","start":131,"end":136,"id":32},{"text":"##ori","start":136,"end":141,"id":33},{"text":"##d","start":141,"end":144,"id":34},{"text":"P","start":145,"end":146,"id":35},{"text":"##VD","start":146,"end":150,"id":36},{"text":"##F","start":150,"end":153,"id":37},{"text":"25","start":154,"end":156,"id":38},{"text":"x","start":157,"end":158,"id":39},{"text":"20","start":159,"end":161,"id":40},{"te
xt":"27","start":162,"end":164,"id":41},{"text":"x","start":165,"end":166,"id":42},{"text":"18","start":167,"end":169,"id":43},{"text":"100","start":170,"end":173,"id":44},{"text":"x","start":174,"end":175,"id":45},{"text":"80","start":176,"end":178,"id":46},{"text":"102","start":179,"end":182,"id":47},{"text":"x","start":183,"end":184,"id":48},{"text":"78","start":185,"end":187,"id":49}],"spans":[{"start":2,"end":11,"label":"PRODNAME"},{"start":404,"end":411,"label":"PRODNAME"},{"start":412,"end":420,"label":"PRODNAME"},{"start":421,"end":435,"label":"PRODNAME"},{"start":436,"end":444,"label":"PRODNAME"},{"start":445,"end":450,"label":"PRODNAME"},{"start":451,"end":460,"label":"PRODNAME"},{"start":461,"end":468,"label":"PRODNAME"},{"start":473,"end":482,"label":"PRODNAME"},{"start":498,"end":502,"label":"PRODNAME"},{"start":519,"end":528,"label":"PRODNAME"},{"start":529,"end":536,"label":"PRODNAME"},{"start":541,"end":550,"label":"PRODNAME"}]}

What for:
I wanted to use this command to see the entities highlighted and to accept or correct them.

python -m prodigy ner.manual output_db de_core_news_sm danil_test.jsonl -l PRODNAME,MTRL,ENNUM,TEMPER

But when I ran it, I got this error:

ValueError: Mismatched tokenization. Can't resolve span to token index 411. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

{'start': 404, 'end': 411, 'label': 'PRODNAME', 'token_start': 78}

Question: do I understand correctly that Prodigy won't show me the text with highlighted entities until all of the spans line up with the tokens? My goal is to see what the model predicted (even if it's wrong) and to correct it using Prodigy.

Hi,

The error you are seeing here refers to "mismatched tokenization", which means that the spans in your example data do not align with the tokens. A span should always start at the start of a token and end at the end of another (or the same) token. If you try to define spans that start or stop in the middle of a token, you'll get this error.
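To find the offending spans before loading the file into Prodigy, a small check along these lines may help. It is only a sketch: `find_misaligned_spans` is a hypothetical helper, not part of Prodigy, and it assumes each task is a dict with `tokens` and `spans` in the format shown above. The sample data is cut down to the two tokens around the span named in the error message.

```python
def find_misaligned_spans(task):
    """Return the spans that do not start and end on token boundaries."""
    token_starts = {tok["start"] for tok in task["tokens"]}
    token_ends = {tok["end"] for tok in task["tokens"]}
    return [
        span
        for span in task.get("spans", [])
        if span["start"] not in token_starts or span["end"] not in token_ends
    ]


# Two tokens and the failing span from the first example, plus one span
# that does align, for comparison.
task = {
    "tokens": [
        {"text": "zu", "start": 404, "end": 406, "id": 78},
        {"text": "besuchen", "start": 407, "end": 415, "id": 79},
    ],
    "spans": [
        {"start": 404, "end": 411, "label": "PRODNAME"},  # ends inside "besuchen"
        {"start": 407, "end": 415, "label": "PRODNAME"},  # matches "besuchen"
    ],
}

bad = find_misaligned_spans(task)
print(bad)
```

Here the span from 404 to 411 is reported because 411 falls inside the token "besuchen" (407 to 415), which matches the index in the error message.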

Looking at the 4 data samples you've cited, it looks like something may have gone wrong in a preprocessing step. While you have 4 different texts, each with different tokens, the spans annotation is exactly the same for all 4 examples:

"spans":[{"start":2,"end":11,"label":"PRODNAME"},{"start":404,"end":411,"label":"PRODNAME"},{"start":412,"end":420,"label":"PRODNAME"},{"start":421,"end":435,"label":"PRODNAME"},{"start":436,"end":444,"label":"PRODNAME"},{"start":445,"end":450,"label":"PRODNAME"},{"start":451,"end":460,"label":"PRODNAME"},{"start":461,"end":468,"label":"PRODNAME"},{"start":473,"end":482,"label":"PRODNAME"},{"start":498,"end":502,"label":"PRODNAME"},{"start":519,"end":528,"label":"PRODNAME"},{"start":529,"end":536,"label":"PRODNAME"},{"start":541,"end":550,"label":"PRODNAME"}]}

This doesn't seem right?
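One way to catch this kind of preprocessing slip is to audit the whole file for spans that run past the end of their text and for records that share an identical span list. The following is a rough sketch, not a Prodigy feature: `audit_spans` is a hypothetical helper, and the two short records are made-up stand-ins for the real data (in the samples above, the last example's final token ends at offset 187 while its spans reach 550).

```python
import json
from collections import defaultdict


def audit_spans(tasks):
    """Return (indices of tasks with out-of-range spans,
    groups of task indices that share an identical span list)."""
    out_of_range = []
    by_spans = defaultdict(list)
    for i, task in enumerate(tasks):
        spans = task.get("spans", [])
        # A span cannot end beyond the last character of its own text.
        if any(span["end"] > len(task["text"]) for span in spans):
            out_of_range.append(i)
        if spans:
            # Serialize the span list so it can be used as a dict key.
            by_spans[json.dumps(spans, sort_keys=True)].append(i)
    duplicated = [idxs for idxs in by_spans.values() if len(idxs) > 1]
    return out_of_range, duplicated


# Made-up records: different texts, same spans copied onto both.
shared_spans = [{"start": 2, "end": 11, "label": "PRODNAME"}]
tasks = [
    {"text": "30 x 10", "spans": shared_spans},
    {"text": "42 mm", "spans": shared_spans},
]

out_of_range, duplicated = audit_spans(tasks)
print(out_of_range, duplicated)
```

Identical span lists on different texts are not always a bug, so the second result is best read as a hint about where to look rather than as a hard error.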

Hi! Yes, you're right! I checked it and fixed the tokenization. Now it works fine.
Thanks a lot!

Happy to hear it!