Binary annotated data missing when making gold data

I used the ner.silver-to-gold recipe to convert the binary annotations to gold data. One example is given below: the same text (same input hash) appears with both accept and reject answers in the binary annotations, but I do not find this text in the gold data.

{"text":"POOH to 30 '' casing shoe.","_input_hash":-1258106350,"_task_hash":808499138,"tokens":[{"text":"POOH","start":0,"end":4,"id":0},{"text":"to","start":5,"end":7,"id":1},{"text":"30","start":8,"end":10,"id":2},{"text":"''","start":11,"end":13,"id":3},{"text":"casing","start":14,"end":20,"id":4},{"text":"shoe","start":21,"end":25,"id":5},{"text":".","start":25,"end":26,"id":6}],"spans":[{"start":0,"end":4,"text":"POOH","rank":0,"label":"Action","score":0.6760335891,"source":"xx_model","input_hash":-1258106350}],"meta":{"score":0.6760335891},"answer":"accept"}
{"text":"POOH to 30 '' casing shoe.","_input_hash":-1258106350,"_task_hash":484451321,"tokens":[{"text":"POOH","start":0,"end":4,"id":0},{"text":"to","start":5,"end":7,"id":1},{"text":"30","start":8,"end":10,"id":2},{"text":"''","start":11,"end":13,"id":3},{"text":"casing","start":14,"end":20,"id":4},{"text":"shoe","start":21,"end":25,"id":5},{"text":".","start":25,"end":26,"id":6}],"spans":[{"text":"shoe","start":21,"end":25,"priority":0.5,"score":0.5,"pattern":521189801,"label":"Equipment"}],"meta":{"score":0.5,"pattern":3189},"answer":"reject"}
{"text":"POOH to 30 '' casing shoe.","_input_hash":-1258106350,"_task_hash":1080809320,"tokens":[{"text":"POOH","start":0,"end":4,"id":0},{"text":"to","start":5,"end":7,"id":1},{"text":"30","start":8,"end":10,"id":2},{"text":"''","start":11,"end":13,"id":3},{"text":"casing","start":14,"end":20,"id":4},{"text":"shoe","start":21,"end":25,"id":5},{"text":".","start":25,"end":26,"id":6}],"spans":[{"start":14,"end":20,"text":"casing","rank":0,"label":"Fluid Additive","score":0.6164347514,"source":"xx_model","input_hash":-1258106350}],"meta":{"score":0.6164347514},"answer":"reject"}
{"text":"POOH to 30 '' casing shoe.","_input_hash":-1258106350,"_task_hash":-571148332,"tokens":[{"text":"POOH","start":0,"end":4,"id":0},{"text":"to","start":5,"end":7,"id":1},{"text":"30","start":8,"end":10,"id":2},{"text":"''","start":11,"end":13,"id":3},{"text":"casing","start":14,"end":20,"id":4},{"text":"shoe","start":21,"end":25,"id":5},{"text":".","start":25,"end":26,"id":6}],"spans":[{"start":14,"end":20,"text":"casing","rank":0,"label":"Action","score":0.6226858742,"source":"xx_model","input_hash":-1258106350}],"meta":{"score":0.6226858742},"answer":"reject"}
{"text":"POOH to 30 '' casing shoe.","_input_hash":-1258106350,"_task_hash":-885625635,"tokens":[{"text":"POOH","start":0,"end":4,"id":0},{"text":"to","start":5,"end":7,"id":1},{"text":"30","start":8,"end":10,"id":2},{"text":"''","start":11,"end":13,"id":3},{"text":"casing","start":14,"end":20,"id":4},{"text":"shoe","start":21,"end":25,"id":5},{"text":".","start":25,"end":26,"id":6}],"spans":[{"start":14,"end":20,"text":"casing","rank":0,"label":"Organization","score":0.5286895079,"source":"xx_model","input_hash":-1258106350}],"meta":{"score":0.5286895079},"answer":"reject"}

I do not understand why this is not included in the gold data.
Like this, 100+ of the annotated texts are missing out of a total of 600+. The interface shows "No tasks available" for this dataset. I don't know if there is anything I missed here.

Another question:
In this command, should I add

--exclude gold_dataset

so that it excludes the existing annotations in the gold dataset?

prodigy ner.silver-to-gold silver_dataset gold_dataset model -F ner_silver_to_gold.py
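i.e. run it as something like this (assuming the custom recipe passes --exclude through the same way the built-in recipes do):

prodigy ner.silver-to-gold silver_dataset gold_dataset model --exclude gold_dataset -F ner_silver_to_gold.py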

It looks like there might be a bug here, so thanks for the example and the clear report.

One thing that looks suspicious in your data is that you've got spaces in your entity labels. We normally never have spaces in ours, so I wonder whether that could be causing problems. Could you try making all your labels use only characters in [A-Z_]? That is, all uppercase, with underscores instead of spaces? You should be able to export with prodigy db-out to get a JSONL file, do the find-and-replace there, and then use db-in to load it into a new dataset.
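For example, a minimal sketch of that find-and-replace step in Python (the file and dataset names here are just placeholders, assuming you exported with prodigy db-out silver_dataset > silver.jsonl):

import json

# Rewrite span labels like "Fluid Additive" to "FLUID_ADDITIVE".
with open("silver.jsonl", encoding="utf8") as f_in, \
     open("silver_fixed.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        task = json.loads(line)
        for span in task.get("spans", []):
            span["label"] = span["label"].upper().replace(" ", "_")
        f_out.write(json.dumps(task) + "\n")

# Then load the fixed file into a new dataset:
# prodigy db-in silver_fixed silver_fixed.jsonl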

If that solves the problem, then it should be easy to either raise an error if labels have whitespace, or ensure they’re handled correctly.


Thank you for the note on labels. I will try it and check.
After it said "No tasks available", I killed the server and restarted it, and it worked. So I'm not sure if that was the problem. I will still try it and see if there is any difference in training after replacing the spaces. I have quite a few labels with spaces.

@ines So I'm trying to convert a silver NER dataset to gold, but when I run ner.silver-to-gold, it just shows all sentences without any binary annotations. Is this the way it should be? I believe it should show all my binary annotations for each sentence and give me the ability to edit them or add any missing labels. I'm using version 1.8.5.

@HSM Could you try again with the latest version? We've made some improvements to the ner.silver-to-gold recipe that might be relevant.

Btw, the recipe does behave a little differently from what you describe: the idea is to find the highest-scoring parse that's consistent with the annotations. It doesn't necessarily show all the binary annotations.
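To make that concrete, here's a rough sketch of the idea (illustrative only, not the recipe's actual implementation): with the "POOH to 30 '' casing shoe." example above, the accepted and rejected spans act as constraints, and the recipe keeps the best-scoring analysis that satisfies them.

# Illustrative sketch only -- not the real recipe code.
def best_consistent_parse(candidates, accepted, rejected):
    """Return the highest-scoring candidate parse that contains every
    accepted span and none of the rejected spans."""
    consistent = [
        (score, spans) for score, spans in candidates
        if accepted <= spans and not (rejected & spans)
    ]
    return max(consistent, key=lambda c: c[0], default=None)

# Spans are (start, end, label) tuples taken from the example above.
accepted = {(0, 4, "Action")}                 # "POOH" -> Action, answer: accept
rejected = {(21, 25, "Equipment"),            # "shoe" -> Equipment, answer: reject
            (14, 20, "Fluid Additive"),
            (14, 20, "Action"),
            (14, 20, "Organization")}
candidates = [
    (0.68, {(0, 4, "Action")}),
    (0.62, {(0, 4, "Action"), (14, 20, "Action")}),
]
print(best_consistent_parse(candidates, accepted, rejected))
# -> (0.68, {(0, 4, 'Action')})

So a binary "accept" or "reject" narrows down which analyses are allowed, but the suggestions you see come from the model's best consistent parse rather than from replaying every binary annotation.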