hi @kim_ds!
Interesting problem! (And thanks for joining the Prodigy Community!)
Is this the approximate distribution of your tags? That is, do you have ~1,000 annotations for label A, ~600 annotations for label B, etc.? If not, can you provide the distribution?
Nothing stands out for now. I have a few questions:
- How are you training: `prodigy train` or `spacy train`? If you're using `prodigy train`, did you add the `--label-stats` argument so it prints per-label stats after training? If so, please share that output (there's an example command after this list).
- Are you providing a custom spaCy config file? If so, can you provide details?
- How are you setting your evaluation dataset? If you're using `prodigy train` and you don't specify a dedicated hold-out (eval) dataset, it will automatically create one for you. It's usually best practice to create a dedicated evaluation dataset and pass it with the `eval:` dataset prefix (again, see the example command after this list). I'm thinking there's a chance that, if you created your own hold-out set, there's an error in it (e.g., you forgot to include one of the labels during processing).
- If you're using spaCy projects (i.e., you have a `config.cfg` file), can you run `spacy debug data`? This will print helpful output, including the NER label details spaCy is reading, like the example below.
```
python -m spacy debug data ./config.cfg
...
========================== Named Entity Recognition ==========================
ℹ 18 new labels, 0 existing labels
528978 missing values (tokens with '-' label)
New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL'
(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122),
'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC'
(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace
...
```
- How did you get the annotations? Did you use a Prodigy recipe or create them some other way? I'm wondering whether, if they were created externally, there could have been an issue with how the data was formatted (see the sketch after this list).
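For reference, here's a rough sketch of the kind of `prodigy train` call I mean (the output path `./output` and the dataset names `ner_train` / `ner_eval` are just placeholders for whatever you're using):

```
# Train an NER component from a Prodigy dataset, evaluate on a dedicated
# hold-out dataset (note the eval: prefix), and print per-label stats at the end
python -m prodigy train ./output --ner ner_train,eval:ner_eval --label-stats
```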
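And on that last point: if the annotations were created outside Prodigy and imported (e.g. with `prodigy db-in`), it's worth double-checking that each JSONL line roughly matches Prodigy's NER format, i.e. character-offset spans over the raw text. A made-up example:

```
{"text": "Apple is opening a new office in Paris.", "spans": [{"start": 0, "end": 5, "label": "ORG"}, {"start": 33, "end": 38, "label": "GPE"}], "answer": "accept"}
```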
Thank you!