Hi,
Since I have both a gold dataset and two binary datasets, I wanted to try out the new feature in v1.11 for training on gold and binary datasets at the same time.
The datasets look as follows:
- ~700 sentences with gold-standard annotations (no missing entities) of entities A and B (almost all of them contain NEs)
- ~1100 binary annotations of entity A only
- ~600 binary annotations of entity B only, on the same sentences as the binary set above (100% sentence overlap; see the sketch right after this list)
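To double-check those counts and the sentence overlap between the two binary sets, something like this can be run against the Prodigy database (a minimal sketch; the dataset names are placeholders for my actual ones):

```python
# Minimal sketch using the Prodigy database API; dataset names are placeholders.
from prodigy.components.db import connect

db = connect()
binary_a = db.get_dataset("ner_binary_a")  # ~1100 binary annotations of entity A
binary_b = db.get_dataset("ner_binary_b")  # ~600 binary annotations of entity B

texts_a = {eg["text"] for eg in binary_a}
texts_b = {eg["text"] for eg in binary_b}

print("entity A examples:", len(binary_a))
print("entity B examples:", len(binary_b))
print("B sentences also present in the A set:", len(texts_b & texts_a), "of", len(texts_b))
```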
Since I could not find in the documentation whether I should specify -NM in this case, I ran the following 4 experiments:
- training on gold dataset only
- training on gold dataset only with -NM specified (I know this does not make sense, but I wanted to compare it with the runs below)
- training on gold + binary datasets
- training on gold + binary datasets with -NM specified
I used the following command (with -NM added for experiments 2 and 4):
prodigy train -n datasets -m en_core_sci_lg -L -V
Both the resulting F-scores and the actual accuracy of the resulting models looked strange to me.
Regarding F-scores, the highest scores were as follows:
1 (gold only): 0.87
2 (gold only, -NM): 0.86
3 (gold + binary): 0.80
4 (gold + binary, -NM): 0.90
When -NM is not specified, adding the binary datasets decreased the F-score quite a lot (0.87 -> 0.80), so maybe it would indeed be correct to specify -NM?
When -NM is specified, adding the binary datasets increased the F-score a bit (0.86 -> 0.90), but not by much; I would have expected a bigger increase given the larger number of binary annotations compared to gold annotations (but this might be normal).
The biggest issue I faced is the actual accuracy of the resulting models.
When I use ner.correct with the highest-scoring model (4), it suggests MANY strange tokens on all sentences: 'and', punctuation, or words that are not NEs at all. I would say only ~30% of the predicted entities are correct. The model labels very aggressively, sometimes highlighting ~80% of the words in a sentence as NEs. The results for model 2 are similar, but even worse.
Models 1 and 3 seem MUCH better, in line with their high F-scores. Actually, 3 seems better than 1, which would be weird given the F-scores.
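To put a number on that impression, I guess the trained pipelines could also be scored directly against the gold sentences with spaCy, roughly like below (just a sketch assuming the spaCy 3.x API that v1.11 uses; the model path and the gold example are placeholders):

```python
# Sketch: score a pipeline produced by prodigy train against gold sentences (spaCy 3.x).
import spacy
from spacy.training import Example

# Placeholder gold data: (text, list of (start_char, end_char, label)) tuples.
gold_data = [
    ("Example sentence mentioning EntityA here.", [(28, 35, "A")]),
]

nlp = spacy.load("output/model-best")  # placeholder path to one of the trained models

examples = [
    Example.from_dict(nlp.make_doc(text), {"entities": spans})
    for text, spans in gold_data
]
scores = nlp.evaluate(examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
```

If the precision computed this way is also low, that would at least confirm the impression from ner.correct is not just something about how I am reading the UI.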
My questions following from this are:
- In this case, should I indeed specify -NM?
- Could it be that the scoring calculation is incorrect when training on gold + binary datasets, with or without -NM specified? (Maybe because I did not specify a separate gold-standard evaluation set? See also the splitting sketch below this list.)
- What could explain the strange "aggressive labeling" behavior of the models trained with -NM (assuming this flag is the correct one)?
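In case the answer to the second question is that a dedicated evaluation set is needed: would splitting the gold dataset into a train and an eval dataset roughly like this be the right approach? (Just a sketch; the dataset names and the split ratio are placeholders.)

```python
# Sketch: split the gold dataset into a train set and a held-out eval set
# (Prodigy database API; dataset names are placeholders).
import random
from prodigy.components.db import connect

db = connect()
gold = db.get_dataset("ner_gold")  # the ~700 gold-annotated sentences
random.Random(0).shuffle(gold)     # fixed seed so the split is reproducible

split = int(len(gold) * 0.8)       # 80/20 split; the ratio is a placeholder
for name, examples in [("ner_gold_train", gold[:split]), ("ner_gold_eval", gold[split:])]:
    if name not in db.datasets:
        db.add_dataset(name)
    db.add_examples(examples, datasets=[name])
```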
For reference, I also performed the above experiments without specifying the en_core_sci_lg model, and repeated the model-less training with v1.10.a11 as well; the results are similar, especially the weird behaviour of the models trained with -NM.
Would be great to get your thoughts on this, thanks a lot!
Kind regards,
Tom