I work for a medium-sized startup, so many people are involved in using our data.
Our procedure is to train an initial model with ner.manual, using an initial text file and an NER dataset. Once a basic model is available, we annotate a second text file with the initial model in a make-gold session, writing to the same dataset. I see that this query was included in each of those text files, so perhaps that is why the hash of the first input example (with its annotation) did not match during the make-gold session with the second file?
I ran an experiment that involved annotating the same query twice: once with ner.manual and once with ner.make-gold, each time with different labels. This resulted in two JSON entries in the dataset:
prodigy db-out duplicate_test
{"text":"Play alternative music.","_input_hash":-1439002215,"_task_hash":1687991231,"tokens":[{"text":"Play","start":0,"end":4,"id":0},{"text":"alternative","start":5,"end":16,"id":1},{"text":"music","start":17,"end":22,"id":2},{"text":".","start":22,"end":23,"id":3}],"spans":[{"start":5,"end":16,"token_start":1,"token_end":1,"label":"genre"}],"answer":"accept"}
{"text":"Play alternative music.","_input_hash":-1439002215,"_task_hash":-840412699,"tokens":[{"text":"Play","start":0,"end":4,"id":0},{"text":"alternative","start":5,"end":16,"id":1},{"text":"music","start":17,"end":22,"id":2},{"text":".","start":22,"end":23,"id":3}],"spans":[{"start":5,"end":22,"token_start":1,"token_end":2,"label":"genre"}],"answer":"accept"}
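In case it's useful to anyone else, here is a small helper script I used to spot this pattern in the db-out output: it groups the JSONL lines by `_input_hash` and reports inputs that produced multiple distinct `_task_hash` values. This is my own throwaway code, not part of Prodigy.

```python
import json
from collections import defaultdict

def find_duplicate_inputs(jsonl_lines):
    """Group annotation tasks by _input_hash; any input that produced
    more than one distinct _task_hash was annotated multiple times."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        task = json.loads(line)
        groups[task["_input_hash"]].append(task["_task_hash"])
    # Keep only inputs with multiple distinct task hashes.
    return {ih: ths for ih, ths in groups.items() if len(set(ths)) > 1}

# Feed it the lines of `prodigy db-out duplicate_test`:
# with open("duplicate_test.jsonl") as f:
#     print(find_duplicate_inputs(f))
```

Running it over the dump above reports the one input hash with its two task hashes.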
It seems that is not what happened, since they ended up as two entries. There were some changes to the entity labels in between, so I tried replicating those label changes, but the result was again two entries, with exactly the same input and task hashes as above…
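That the hashes came out identical on replication makes sense to me if the hashing is deterministic: as I understand it, the input hash is derived from the raw input (the text), while the task hash also covers the annotations, so the same text with the same spans always yields the same pair of hashes. A toy sketch of that idea (using md5 for illustration; this is not Prodigy's actual hashing algorithm):

```python
import hashlib
import json

def input_hash(task):
    # Depends only on the raw input text, so re-annotating the same
    # text always reproduces the same input hash.
    return hashlib.md5(task["text"].encode("utf8")).hexdigest()

def task_hash(task):
    # Also covers the annotations: two different span sets over the
    # same text yield different task hashes.
    spans = json.dumps(task.get("spans", []), sort_keys=True)
    return hashlib.md5((task["text"] + spans).encode("utf8")).hexdigest()
```

Under this model, the two entries above share an `_input_hash` because the text is identical, but get distinct `_task_hash` values because the spans differ.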
Let me know if you have more thoughts about what the cause could be.
Data validation sounds like a very useful tool.
Thanks again!