mismatch between accuracy on holdout set and batch-train accuracy

I made a holdout set to evaluate the model using ner.manual. When I run ner.batch-train with --eval-id pointing to this set, it reports a maximum of 20% accuracy. However, if I run ner.print-stream on the holdout set with my trained model, the results look really strong (like 80-90%).

Any ideas of why this may be happening?

I’ve had a look through the source, and I can’t find an easy answer. The entities are being set the same way, and the evaluation code within batch train is doing sensible things. Could you do prodigy db-out my-eval-set | less and paste the first two records?

{"answer":"accept","spans":[{"label":"ZIPCODE","token_end":42,"token_start":42,"end":176,"start":171}],"_input_hash":1917893008,"_task_hash":285615731,"text":"<AGENT>  Hi Js ! I hope your day is going well . What is the order number , name and billing address ? \t<CUSTOMER> W592997642 some - name here 3146 bunche ave , fake city MI 00011 \t","tokens":[{"text":"<","id":0,"end":1,"start":0},{"text":"AGENT","id":1,"end":6,"start":1},{"text":">","id":2,"end":7,"start":6},{"text":" ","id":3,"end":9,"start":8},{"text":"Hi","id":4,"end":11,"start":9},{"text":"Js","id":5,"end":14,"start":12},{"text":"!","id":6,"end":16,"start":15},{"text":"I","id":7,"end":18,"start":17},{"text":"hope","id":8,"end":23,"start":19},{"text":"your","id":9,"end":28,"start":24},{"text":"day","id":10,"end":32,"start":29},{"text":"is","id":11,"end":35,"start":33},{"text":"going","id":12,"end":41,"start":36},{"text":"well","id":13,"end":46,"start":42},{"text":".","id":14,"end":48,"start":47},{"text":"What","id":15,"end":53,"start":49},{"text":"is","id":16,"end":56,"start":54},{"text":"the","id":17,"end":60,"start":57},{"text":"order","id":18,"end":66,"start":61},{"text":"number","id":19,"end":73,"start":67},{"text":",","id":20,"end":75,"start":74},{"text":"name","id":21,"end":80,"start":76},{"text":"and","id":22,"end":84,"start":81},{"text":"billing","id":23,"end":92,"start":85},{"text":"address","id":24,"end":100,"start":93},{"text":"?","id":25,"end":102,"start":101},{"text":"\t","id":26,"end":104,"start":103},{"text":"<","id":27,"end":105,"start":104},{"text":"CUSTOMER","id":28,"end":113,"start":105},{"text":">","id":29,"end":114,"start":113},{"text":"W592997642","id":30,"end":125,"start":115},{"text":"some","id":31,"end":129,"start":126},{"text":"-","id":32,"end":131,"start":130},{"text":"soo","id":33,"end":135,"start":132},{"text":"name","id":34,"end":139,"start":136},{"text":"31345","id":35,"end":144,"start":140},{"text":"fake","id":36,"end":151,"start":145},{"text":"ave","id":37,"end":155,"start":152},{"text":",","id":38,"end":157,"start":156},{"text":"some ","id":39,"end":161,"start":158},{"text":"city","id":40,"end":167,"start":162},{"text":"MI","id":41,"end":170,"start":168},{"text":"00011","id":42,"end":176,"start":171},{"text":"\t","id":43,"end":178,"start":177}]}
{"_input_hash":2042399920,"_task_hash":868964077,"spans":[{"token_end":26,"start":117,"end":122,"token_start":26,"label":"ZIPCODE"}],"answer":"accept","text":"<AGENT>  For security purposes can you please confirm the post code / zip code of your billing address ? \t<CUSTOMER> 95242 \t","tokens":[{"start":0,"end":1,"id":0,"text":"<"},{"start":1,"end":6,"id":1,"text":"AGENT"},{"start":6,"end":7,"id":2,"text":">"},{"start":8,"end":9,"id":3,"text":" "},{"start":9,"end":12,"id":4,"text":"For"},{"start":13,"end":21,"id":5,"text":"security"},{"start":22,"end":30,"id":6,"text":"purposes"},{"start":31,"end":34,"id":7,"text":"can"},{"start":35,"end":38,"id":8,"text":"you"},{"start":39,"end":45,"id":9,"text":"please"},{"start":46,"end":53,"id":10,"text":"confirm"},{"start":54,"end":57,"id":11,"text":"the"},{"start":58,"end":62,"id":12,"text":"post"},{"start":63,"end":67,"id":13,"text":"code"},{"start":68,"end":69,"id":14,"text":"/"},{"start":70,"end":73,"id":15,"text":"zip"},{"start":74,"end":78,"id":16,"text":"code"},{"start":79,"end":81,"id":17,"text":"of"},{"start":82,"end":86,"id":18,"text":"your"},{"start":87,"end":94,"id":19,"text":"billing"},{"start":95,"end":102,"id":20,"text":"address"},{"start":103,"end":104,"id":21,"text":"?"},{"start":105,"end":106,"id":22,"text":"\t"},{"start":106,"end":107,"id":23,"text":"<"},{"start":107,"end":115,"id":24,"text":"CUSTOMER"},{"start":115,"end":116,"id":25,"text":">"},{"start":117,"end":122,"id":26,"text":"95242"},{"start":123,"end":124,"id":27,"text":"\t"}]}

As you can see, the text is quite long. Somewhere in the back end each example from the validation set is split into sentences, and the model is evaluated on those individual sentences. Since I had about 5 sentences per example, the maximum accuracy I could get is 20%. I think this comes from the default behavior of ner.manual, which does not split up the input text.
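A rough, self-contained way to see why the ceiling ends up so low (this is not Prodigy's internal code; the pipeline and example text are just stand-ins): the span offsets were recorded against the full text, so after splitting, at most one sentence can still contain the annotation.

```python
# Rough illustration of the reasoning above -- not Prodigy internals.
# The span offsets were recorded against the FULL text, so once the example
# is split into sentences, at most one sentence can line up with the span.
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with sentence segmentation

text = ("For security purposes can you please confirm the post code of your "
        "billing address? Sure. It is 95242. Thanks. You're welcome.")
span_start = text.index("95242")          # character offsets into the full text
span_end = span_start + len("95242")

sentences = list(nlp(text).sents)
containing = [s for s in sentences
              if s.start_char <= span_start and span_end <= s.end_char]
print(f"{len(sentences)} sentences, span falls inside {len(containing)} of them")
# With ~5 sentences per example, only 1 piece can ever score as correct,
# which matches the ~20% ceiling described above.
```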

I am having the same issue with a custom version of ner.teach that does not split the text into sentences. When I run ner.batch-train on those annotations, it consistently reports the same horrible accuracy.
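Roughly, the custom recipe is just the built-in ner.teach flow without the sentence-splitting step. A sketch along these lines (the recipe name and loader are placeholders, and exact imports may vary between Prodigy v1.x versions):

```python
# Minimal sketch of a teach recipe without sentence splitting (Prodigy v1.x).
# Names and arguments are placeholders; the only relevant difference from the
# built-in ner.teach is that split_sentences() is never applied to the stream.
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.ner import EntityRecognizer


@prodigy.recipe("ner.teach-unsegmented")
def teach_unsegmented(dataset, spacy_model, source, label=None):
    nlp = spacy.load(spacy_model)
    model = EntityRecognizer(nlp, label=label)
    stream = JSONL(source)                     # no split_sentences() here
    stream = prefer_uncertain(model(stream))   # score and sort by uncertainty
    return {
        "dataset": dataset,
        "stream": stream,
        "update": model.update,
        "view_id": "ner",
    }
```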

@mhigginslp This should be fixed in the new v1.4 release — sorry for the inconvenience!

Is there a Prodigy recipe that takes in a model and a test set and reports accuracy (F-score, precision and recall would be nice too)?

@mhigginslp This recipe contributed by @farlee2121 (or a modified version of it) might be what you're looking for:
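In the meantime, here's a minimal sketch of what such an evaluation script can look like (this isn't necessarily the contributed recipe; the model path and dataset name are placeholders, and it assumes spaCy 2.x's GoldParse and Scorer):

```python
# Minimal sketch of an NER evaluation script -- not necessarily the recipe
# linked above. Assumes spaCy 2.x (GoldParse/Scorer) and placeholder names.
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer
from prodigy.components.db import connect


def evaluate(model_path, eval_dataset):
    nlp = spacy.load(model_path)
    db = connect()
    scorer = Scorer()
    for eg in db.get_dataset(eval_dataset):
        if eg.get("answer") != "accept":
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        gold = GoldParse(nlp.make_doc(eg["text"]), entities=entities)
        scorer.score(nlp(eg["text"]), gold)  # compare predictions to gold spans
    return scorer.scores                     # includes ents_p, ents_r, ents_f


if __name__ == "__main__":
    scores = evaluate("/path/to/trained-model", "my_eval_set")
    print({k: scores[k] for k in ("ents_p", "ents_r", "ents_f")})
```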

I am getting some weird results with ner.batch-train now …

There are actually over 1000 annotations in the zip_street db.
The same behavior is occurring when I annotate a single label, where the teach recipe did not split the sentences.

I started using v1.4 today.

Hmm! I hope we didn’t break something. At the moment the prime suspect is the split_sentences function, so if you could quickly try with --unsegmented that would help.

The other thing to try is prodigy ner.print-dataset. This gives you a quick sanity check that the annotations line up with the text the way you expect.

--unsegmented did the trick when I used --eval-split, but not when I used --eval-id pointing to a validation set that I annotated with ner.manual. In that case the numbers of corrects/incorrects were an order of magnitude off.
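For reference, here's a quick way I'd cross-check those totals against what's actually stored in the evaluation dataset (the dataset name is a placeholder):

```python
# Quick cross-check of the reported totals against what is actually stored
# in the evaluation dataset ("my_eval_set" is a placeholder name).
from prodigy.components.db import connect

db = connect()
accepted = [eg for eg in db.get_dataset("my_eval_set")
            if eg.get("answer") == "accept"]
n_spans = sum(len(eg.get("spans", [])) for eg in accepted)
print(f"{len(accepted)} accepted examples, {n_spans} annotated spans")
```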