Stats for spaCy's training

My question might be a very simple one, but I want to clarify whether Prodigy's ner.batch-train is similar to spaCy's training?

After annotating my data with ner.make-gold, I used spaCy's "training a new entity type" example. If they do the same thing, I'd like to know whether there is already a recipe that produces stats (like the accuracy, FP and FN in ner.batch-train's output) that I can use for the model that resulted from spaCy's "training a new entity type" example.

For now in the code, I only have the losses:

    import random
    from spacy.util import minibatch, compounding

    sizes = compounding(1.0, 1000.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.20, losses=losses)
        print("Losses", losses)

I hope my question is not stupid.

You can use a model's evaluate() function to calculate performance on a dataset. If you have DEV_DATA in the same format as TRAIN_DATA, you can use:

scorer = nlp.evaluate(DEV_DATA)

The relevant scores for NER are: scorer.ents_p, scorer.ents_r, scorer.ents_f, scorer.ents_per_type
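Those scores are the standard precision/recall/F-score definitions over entity spans. As a small self-contained sketch of the arithmetic behind them (the counts below are made up for illustration, not from a real model):

```python
# Standard precision/recall/F-score arithmetic behind
# scorer.ents_p, scorer.ents_r and scorer.ents_f.
# The counts are hypothetical, purely for illustration.
tp = 42  # predicted entity spans that exactly match a gold span
fp = 8   # predicted spans with no matching gold span
fn = 14  # gold spans the model missed entirely

precision = tp / (tp + fp)  # how many predicted spans were right
recall = tp / (tp + fn)     # how many gold spans were found
f_score = 2 * precision * recall / (precision + recall)

print(f"P={precision:.1%} R={recall:.1%} F={f_score:.1%}")
```

scorer.ents_per_type gives you this same breakdown separately for each entity label, which is useful for spotting a single label that drags the overall score down.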

As an alternative, if you have data in a format that you can easily convert to spaCy's training format (typically with python -m spacy convert), you can convert it and use the train CLI and/or the evaluate CLI for training and evaluating new models.
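As a rough sketch of that CLI workflow (the file names here are placeholders, and the exact arguments depend on your spaCy version):

```shell
# Convert source data (e.g. CoNLL) to spaCy's training format.
# ./train.conll, ./dev.conll and ./corpus are placeholder paths.
python -m spacy convert ./train.conll ./corpus
python -m spacy convert ./dev.conll ./corpus

# Train a model, then evaluate it on the held-out data (spaCy v2-style CLI)
python -m spacy train en ./output ./corpus/train.json ./corpus/dev.json
python -m spacy evaluate ./output/model-best ./corpus/dev.json
```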


Adriane's answer above should give you everything you need – but just to address this specific point: Yes, under the hood, ner.batch-train also loops over the examples and calls into nlp.update, just like all the example scripts we provide.

Prodigy's built-in training recipes are basically wrappers around spaCy's training methods that are optimised for quick experiments and give you nicely-formatted output. They take care of loading and converting Prodigy's data format, merging annotations on the same text (e.g. if you've accepted/rejected multiple entities in the same sentence) and handling both gold-standard and incomplete annotations (if you only know that one span is wrong, but don't know the answer for all tokens in the text).
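To make the "merging annotations on the same text" idea concrete, here is a rough, hypothetical sketch (the record layout is simplified; it is not Prodigy's actual implementation) of collapsing several accepted single-span answers on the same sentence into one spaCy-style training example:

```python
# Hypothetical sketch: merge several accepted single-span annotations
# on the same text into one spaCy-style training example.
# The record layout below is simplified for illustration.
records = [
    {"text": "Uber was founded in San Francisco.",
     "spans": [{"start": 0, "end": 4, "label": "ORG"}], "answer": "accept"},
    {"text": "Uber was founded in San Francisco.",
     "spans": [{"start": 20, "end": 33, "label": "GPE"}], "answer": "accept"},
]

merged = {}
for rec in records:
    if rec["answer"] != "accept":
        continue  # rejected spans need different handling (incomplete info)
    ents = merged.setdefault(rec["text"], [])
    for span in rec["spans"]:
        ents.append((span["start"], span["end"], span["label"]))

# One example per unique text, with all accepted entity spans combined
TRAIN_DATA = [(text, {"entities": ents}) for text, ents in merged.items()]
print(TRAIN_DATA)
```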

Thank you very much for your reply Adriane.
I'll try.

Thanks a lot Ines for the clarification.