Stats for spaCy's training

My question might be a very simple one, but I want to clarify whether Prodigy's ner.batch-train is similar to spaCy's training?

After annotating my data with ner.make-gold, I used spaCy's "training a new entity type" example. If they do the same thing, I'd like to know whether there is already a recipe that produces stats (like the accuracy, FP and FN in ner.batch-train's output) that I can use for the model that resulted from spaCy's "training a new entity type" example.

For now in the code, I only have the losses:

    import random
    from spacy.util import minibatch, compounding

    sizes = compounding(1.0, 1000.0, 1.001)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.20, losses=losses)
        print("Losses", losses)

I hope my question is not stupid.

You can use a model's evaluate() function to calculate performance on a dataset. If you have DEV_DATA in the same format as TRAIN_DATA, you can use:

scorer = nlp.evaluate(DEV_DATA)

The relevant scores for NER are: scorer.ents_p, scorer.ents_r, scorer.ents_f, scorer.ents_per_type
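Those scores are the standard precision/recall/F-score definitions over entity spans. As a small self-contained sketch of the arithmetic behind them (the counts below are made up for illustration, not from a real model):

```python
# Standard precision/recall/F-score arithmetic behind
# scorer.ents_p, scorer.ents_r and scorer.ents_f.
# The counts are hypothetical, purely for illustration.
tp = 42  # predicted entity spans that exactly match a gold span
fp = 8   # predicted spans with no matching gold span
fn = 14  # gold spans the model missed entirely

precision = tp / (tp + fp)  # how many predicted spans were right
recall = tp / (tp + fn)     # how many gold spans were found
f_score = 2 * precision * recall / (precision + recall)

print(f"P={precision:.1%} R={recall:.1%} F={f_score:.1%}")
```

scorer.ents_per_type gives you this same breakdown separately for each entity label, which is useful for spotting a single label that drags the overall score down.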

As an alternative, if you have data in a format that you can easily convert to spaCy's training format (typically with python -m spacy convert), you can convert it and use the train CLI and/or the evaluate CLI for training and evaluating new models.
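As a rough sketch of that CLI workflow (the file names here are placeholders, and the exact arguments depend on your spaCy version):

```shell
# Convert source data (e.g. CoNLL) to spaCy's training format.
# ./train.conll, ./dev.conll and ./corpus are placeholder paths.
python -m spacy convert ./train.conll ./corpus
python -m spacy convert ./dev.conll ./corpus

# Train a model, then evaluate it on the held-out data (spaCy v2-style CLI)
python -m spacy train en ./output ./corpus/train.json ./corpus/dev.json
python -m spacy evaluate ./output/model-best ./corpus/dev.json
```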


Adriane's answer above should give you everything you need – but just to address this specific point: Yes, under the hood, ner.batch-train also loops over the examples and calls into nlp.update, just like all the example scripts we provide.

Prodigy's built-in training recipes are basically wrappers around spaCy's training methods that are optimised for quick experiments and give you nicely-formatted output. They take care of loading and converting Prodigy's data format, merging annotations on the same text (e.g. if you've accepted/rejected multiple entities in the same sentence) and handling both gold-standard and incomplete annotations (if you only know that one span is wrong, but don't know the answer for all tokens in the text).
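To make the "merging annotations on the same text" idea concrete, here is a rough, hypothetical sketch (the record layout is simplified; it is not Prodigy's actual implementation) of collapsing several accepted single-span answers on the same sentence into one spaCy-style training example:

```python
# Hypothetical sketch: merge several accepted single-span annotations
# on the same text into one spaCy-style training example.
# The record layout below is simplified for illustration.
records = [
    {"text": "Uber was founded in San Francisco.",
     "spans": [{"start": 0, "end": 4, "label": "ORG"}], "answer": "accept"},
    {"text": "Uber was founded in San Francisco.",
     "spans": [{"start": 20, "end": 33, "label": "GPE"}], "answer": "accept"},
]

merged = {}
for rec in records:
    if rec["answer"] != "accept":
        continue  # rejected spans need different handling (incomplete info)
    ents = merged.setdefault(rec["text"], [])
    for span in rec["spans"]:
        ents.append((span["start"], span["end"], span["label"]))

# One example per unique text, with all accepted entity spans combined
TRAIN_DATA = [(text, {"entities": ents}) for text, ents in merged.items()]
print(TRAIN_DATA)
```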

Thank you very much for your reply Adriane.
I'll try.

Thanks a lot Ines for the clarification.