For a few reasons (such as facilitating hyper-parameter optimization), I’d like to move from training NER models in Prodigy to training them in spaCy. I’m wondering how I can reproduce the Prodigy ner.batch-train results, such as cross-validated accuracy, directly in spaCy.
I have a few questions related to this issue.
- Model evaluation in Prodigy: Based on the Prodigy documentation, it isn’t clear to me exactly what is being used for model training and evaluation in ner.batch-train. As an example, for one model, the following is printed to the console during ner.batch-train, but the numbers don’t add up to anything in the database that I can figure out:
Using 50% of accept/reject examples (383) for evaluation
Using 100% of remaining examples (399) for training
This output is for a database with 150 examples labeled via manual and 1350 examples labeled via teach (715 of which were accepted). E.g., 50% of the accept/reject examples does not equal 383. The numbers are closer (but still don’t seem exact) if I assume that items in the db with the same _input_hash are being merged.
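One way to test the merging hypothesis is to count how many distinct _input_hash values remain among the accept/reject rows. The sketch below is only an illustration: it assumes the dataset has been exported to records carrying the _input_hash and answer fields, and the sample records and the function name count_by_input_hash are made up for the example. It does not claim to reproduce how ner.batch-train actually splits the data.

```python
from collections import defaultdict

# Hypothetical records shaped like rows exported from a Prodigy dataset.
# Only the two fields needed for the count are included here.
records = [
    {"_input_hash": 1, "answer": "accept"},
    {"_input_hash": 1, "answer": "reject"},
    {"_input_hash": 2, "answer": "accept"},
    {"_input_hash": 3, "answer": "ignore"},
    {"_input_hash": 4, "answer": "reject"},
]


def count_by_input_hash(records):
    """Return (number of accept/reject rows, number of distinct inputs among them)."""
    groups = defaultdict(list)
    n_decided = 0
    for rec in records:
        # Skip anything that is neither accepted nor rejected (e.g. "ignore").
        if rec.get("answer") not in ("accept", "reject"):
            continue
        n_decided += 1
        groups[rec["_input_hash"]].append(rec)
    return n_decided, len(groups)


n_decided, n_unique = count_by_input_hash(records)
print(f"accept/reject rows: {n_decided}, distinct inputs: {n_unique}")
```

If the distinct-input count for the real dataset lands near 383 + 399 = 782, that would be consistent with merging on _input_hash, though it would not by itself confirm it.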
If possible, can you clarify exactly what’s being used for both the training and evaluation here?
- Are data from both ‘manual’ and ‘teach’ being used during training? During evaluation?
- Are both positive and negative (accept and reject) examples being used during training? During evaluation?
- Are rows with matching _input_hash values combined? For instances with multiple accepted entity spans, is the model treated as correct only if it identifies all labeled entities?
- Training models from Prodigy databases in spaCy:
I’ve seen how to export Prodigy data to a format for spaCy here:
Mixing in gold data to avoid catastrophic forgetting
and it seems relatively easy to modify that code so that I can transform examples highlighted via the manual method.
Given the reformatted data, will updating an NER model (either starting from a blank model or a pretrained model) with the method given here:
https://spacy.io/usage/training#example-train-ner
or here:
https://spacy.io/usage/training#training-simple-style
approximately replicate what Prodigy is doing during ner.batch-train?
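For reference, the reformatting step mentioned above could look roughly like the sketch below. It assumes each exported record has text, answer, and spans fields (with start, end, and label on each span) and converts accepted records into the (text, {"entities": [...]}) tuples used in the spaCy training examples linked above. The function name to_spacy_format and the sample records are hypothetical. Note the simplification: rejected examples are dropped, and accepted spans are treated as the complete set of entities for the text, which holds for manual annotations but not necessarily for binary teach annotations.

```python
def to_spacy_format(records):
    """Convert accepted Prodigy-style records to (text, {"entities": [...]}) tuples.

    Assumption: every accepted record lists all entities in its text.
    Rejected and ignored records are skipped entirely.
    """
    train_data = []
    for rec in records:
        if rec.get("answer") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in rec.get("spans", [])
        ]
        train_data.append((rec["text"], {"entities": entities}))
    return train_data


# Hypothetical sample records for illustration only.
records = [
    {
        "text": "Apple is based in Cupertino",
        "answer": "accept",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 18, "end": 27, "label": "GPE"},
        ],
    },
    {
        "text": "I ate an apple",
        "answer": "reject",
        "spans": [{"start": 9, "end": 14, "label": "ORG"}],
    },
]

train_data = to_spacy_format(records)
print(train_data)
```

Because this discards the information carried by rejected spans, a model trained on the result may well behave differently from one trained with ner.batch-train.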