Missing entity result

hi @kim_ds!

Interesting problem! (And thanks for joining the Prodigy Community :wave:)

Is this the approximate distribution of your tags? That is, do you have ~1,000 annotations for label A, ~600 for label B, etc.?

If not, can you provide the distributions?
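If it helps, here's one way to compute that distribution from a JSONL export (e.g., from prodigy db-out). This is just a sketch: it assumes your examples have the usual "answer" and "spans" fields, and the file path is a placeholder.

```python
import json
from collections import Counter

def label_counts(jsonl_path):
    """Count entity spans per label across accepted examples
    in a Prodigy-style JSONL export."""
    counts = Counter()
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            # Only count examples you actually accepted
            if eg.get("answer") != "accept":
                continue
            for span in eg.get("spans", []):
                counts[span["label"]] += 1
    return counts

# Example (path is a placeholder):
# print(label_counts("./my_ner_dataset.jsonl"))
```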

Nothing stands out for now. I have a few questions:

  • How are you training: prodigy train or spacy train?

If you're using prodigy train, did you add the --label-stats argument to print per-label stats after training? If so, please share that output :slight_smile:
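For example (the output directory and dataset name are placeholders):

```shell
python -m prodigy train ./output --ner my_ner_dataset --label-stats
```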

  • Are you providing a custom spaCy config file? If so, can you provide details?

  • How are you setting your evaluation dataset?

If you're using prodigy train and you don't specify a dedicated holdout (eval) dataset, it will automatically create one for you by splitting off part of your training data. It's usually best practice to create a dedicated eval dataset and pass it using the eval: dataset prefix. I'm thinking there's a chance that if you created one on your own, there's an error in your holdout (e.g., you forgot to include one of the labels when processing the data).
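For reference, here's what passing a dedicated eval set with the eval: prefix looks like (dataset names are placeholders):

```shell
python -m prodigy train ./output --ner my_train_set,eval:my_eval_set
```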

  • If you're using spaCy projects (aka have a config.cfg file), can you run spacy debug data?

This will print helpful output, including the NER label details spaCy is reading, like below.

python -m spacy debug data ./config.cfg
...

========================== Named Entity Recognition ==========================
ℹ 18 new labels, 0 existing labels
528978 missing values (tokens with '-' label)
New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL'
(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122),
'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC'
(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338)
✔ Good amount of examples for all labels
✔ Examples without occurences available for all labels
✔ No entities consisting of or starting/ending with whitespace
...
  • How did you get the annotations? Did you use a Prodigy recipe or create them some other way?

I'm wondering whether, if they were created externally, there could have been an issue with the formatting of the data.
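One quick sanity check for externally created data: verify that each span's character offsets actually line up with the text. Misaligned offsets and entities with leading/trailing whitespace are the most common formatting issues I see. A rough sketch (assumes the usual "text"/"spans" JSONL layout; the path is a placeholder):

```python
import json

def check_spans(jsonl_path):
    """Flag examples whose span offsets are out of bounds or whose
    entity text starts/ends with whitespace."""
    problems = []
    with open(jsonl_path, encoding="utf8") as f:
        for i, line in enumerate(f):
            eg = json.loads(line)
            text = eg.get("text", "")
            for span in eg.get("spans", []):
                start, end = span.get("start"), span.get("end")
                if start is None or end is None or not (0 <= start < end <= len(text)):
                    # Offsets missing or outside the text
                    problems.append((i, span))
                elif text[start:end] != text[start:end].strip():
                    # Entity consists of or starts/ends with whitespace
                    problems.append((i, span))
    return problems

# Example (path is a placeholder):
# print(check_spans("./my_external_data.jsonl"))
```

If this returns anything, those examples are worth fixing before training.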

Thank you!