Reproducing Prodigy ner.batch-train in spaCy: cross-validation results and output model

For a few reasons (such as facilitating hyper-parameter optimization) I’d like to move from training NER models in Prodigy to spaCy. I’m wondering how I can reproduce the Prodigy ner.batch-train results directly in spaCy (including things like the cross-validated accuracy).

I have a few questions related to this issue.

  1. Model evaluation in Prodigy: Based on the Prodigy documentation, it isn’t clear to me exactly what is used for model training and evaluation in ner.batch-train. As an example, for one model, the following is printed to the console during ner.batch-train, but the numbers don’t add up precisely to anything in the database that I can figure out:

     Using 50% of accept/reject examples (383) for evaluation
     Using 100% of remaining examples (399) for training

This output is for a database with 150 examples labeled via manual and 1350 examples labeled via teach (715 of which were accepted). E.g., 50% of the accept/reject examples does not equal 383. The numbers are closer (but still don’t seem exact) if I assume that items in the db with the same _input_hash are being merged.

If possible, can you clarify exactly what’s being used for both the training and evaluation here?
- Are data from both ‘manual’ and ‘teach’ being used during training? During evaluation?
- Are both positive and negative (accept and reject) examples being used during training? During evaluation?
- Are rows with matching _input_hash values combined? For instances with multiple accepted entity spans, is the model treated as correct only if it identifies all labeled entities?
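For what it’s worth, my rough mental model of the merging (purely my own guess, not Prodigy’s actual internals) is something like:

```python
from collections import defaultdict

def merge_by_input_hash(examples):
    # Group rows that share an _input_hash and pool their accepted spans.
    # This is a guess at the merging behaviour, not Prodigy's actual code.
    merged = defaultdict(lambda: {"text": None, "spans": []})
    for eg in examples:
        entry = merged[eg["_input_hash"]]
        entry["text"] = eg["text"]
        if eg.get("answer") == "accept":
            entry["spans"].extend(eg.get("spans", []))
    return dict(merged)

rows = [
    {"_input_hash": 1, "text": "Acme hired Bo.", "answer": "accept",
     "spans": [{"start": 0, "end": 4, "label": "ORG"}]},
    {"_input_hash": 1, "text": "Acme hired Bo.", "answer": "accept",
     "spans": [{"start": 11, "end": 13, "label": "PERSON"}]},
    {"_input_hash": 2, "text": "No entities here.", "answer": "reject",
     "spans": [{"start": 0, "end": 2, "label": "ORG"}]},
]
merged = merge_by_input_hash(rows)
print(len(merged))              # two unique inputs
print(len(merged[1]["spans"]))  # two pooled spans for the first input
```

Under this guess, the 1500 rows would collapse to fewer unique inputs, which is roughly (but not exactly) consistent with the 383/399 split above.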

  2. Training models from Prodigy dbs in spaCy:

I’ve seen how to export prodigy data to a format for spacy here:
Mixing in gold data to avoid catastrophic forgetting
and it seems relatively easy to modify that code so that I can transform examples highlighted via the manual method.

Given the reformatted data, will updating an NER model (either starting from a blank model or a pretrained model) with the method given here:
or here:
approximately replicate what Prodigy is doing during ner.batch-train?
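For concreteness, the conversion I have in mind (adapted from the thread linked above, so treat it as a sketch rather than the canonical version) looks roughly like:

```python
import json

def prodigy_to_spacy(jsonl_path):
    # Turn accepted examples from `prodigy db-out` into spaCy's
    # (text, {"entities": [(start, end, label), ...]}) training format.
    train_data = []
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            if eg.get("answer") != "accept":
                continue  # skip rejected/ignored examples
            ents = [(s["start"], s["end"], s["label"])
                    for s in eg.get("spans", [])]
            train_data.append((eg["text"], {"entities": ents}))
    return train_data
```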

It’s definitely a good idea to move the data out of Prodigy and experiment with general-purpose NLP tools at some point. Obviously we typically use spaCy, but other tools can be useful as well, especially for text classification. So the data export is definitely something we care about. However, there are a couple of subtleties to keep in mind.

Yes, although mixing data from the two modes is a bit awkward at the moment. The problem is that the batch-train command takes a flag, --no-missing, that tells it about the nature of the data: is the data fully annotated (like the data from manual), or does the data have entities which may be correct but are not annotated (like the data from ner.teach)? This flag applies to the whole dataset, so you can’t currently get optimal results from mixing the two types of example.

Yes, in both training and evaluation, for both types of data (manual and teach).

We do merge annotations from the same example together, using the input hash. During evaluation, each entity is marked as correct or incorrect individually; the model isn’t scored all-or-nothing on the whole text.
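To make that concrete, here’s a toy sketch of per-entity scoring (a simplification for illustration, not the actual implementation):

```python
def per_entity_accuracy(examples):
    # Score each annotated entity span independently: a text with three
    # spans contributes three decisions, not one all-or-nothing decision.
    correct = total = 0
    for eg in examples:
        for span in eg["spans"]:
            total += 1
            if span["predicted_label"] == span["gold_label"]:
                correct += 1
    return correct / total if total else 0.0

examples = [
    {"text": "Acme hired Bo.",
     "spans": [{"predicted_label": "ORG", "gold_label": "ORG"},
               {"predicted_label": "ORG", "gold_label": "PERSON"}]},
]
print(per_entity_accuracy(examples))  # 0.5: one of two entities correct
```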

For data from manual, yes. But for data from ner.teach, not quite! Prodigy implements a custom algorithm for training from the binary annotations produced by ner.teach. spaCy’s standard training assumes the data has no missing labels, which isn’t true for the binary annotations.

The best solution is to use the ner.print-best command, which will run the parser over your ner.teach dataset, and use your annotations to find the best parse. During training, this best parse would be marked as the gold-standard (even though it might be incorrect). If you output these best parses and use them to train spaCy, you’ll be doing much the same thing as the Prodigy training algorithm.

An even better approach is to pipe the output of ner.print-best into the ner.manual recipe. This lets you correct any errors in the data, effectively up-converting the ner.teach data into normal gold-standard annotations, which are much easier to use in other tools.

First of all, thanks for the detailed response!

I tried using the ner.print-best method, as you suggested, to produce a dataset that could then be passed to spaCy for training. But the outputs produced by print-best seemed very problematic. Perhaps I’m doing something wrong, though? Here’s what I did:

  1. Created a labeled dataset in prodigy with about 1500 examples for a single new entity (200 using ner.manual and 1300 using ner.teach).
  2. I fit two models: (a) trained starting from the en_core_web_sm model, and (b) trained starting from a blank English model I generated via spaCy.

In both cases I used this command to train the model, changing the initial spaCy model as appropriate:
prodigy ner.batch-train new_entity en_core_web_sm --output new_entity_model --dropout 0.5 --n-iter 25 -U
  3. I then used print-best as follows:
prodigy ner.print-best pulse models/pulse_1 > new_entity_print_best.jsonl

The outputs from print-best seem very poor (despite Prodigy reporting upwards of 90% accuracy during its internal cross-validation procedure, independent of which spacy_model I start with).

  • For the model trained from the blank English model, the model wildly over-predicts the new entity (often assigning the label 3 or more times within my single-sentence excerpts, which contain at most one instance of the label).
  • For the model trained from en_core_web_sm, the new entity seems to be applied judiciously, but the default entity labels in en_core_web_sm get applied all over the place, despite not being relevant to our texts (e.g., “WORK_OF_ART” is applied on average 5x per excerpt).

I’m wondering if you know what might be going on? Is it possible that the cross-validation is somehow (especially when training from a blank model) ignoring false positives that the model is producing? Maybe this happens because it isn’t assuming our data has gold-standard labels, so it can’t assume that the new entity doesn’t apply to the text except when an explicit “reject” response has been given to a highlighted span?
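To illustrate the worry, here’s a toy sketch (purely my guess at the scoring logic, not Prodigy’s actual code) of how scoring only the annotated spans could hide false positives:

```python
def score_with_missing_labels(predicted_spans, annotated_spans):
    # Only spans we have an accept/reject decision for are scored;
    # any extra predictions are treated as "unknown", not as wrong.
    correct = 0
    for span, answer in annotated_spans:
        hit = span in predicted_spans
        if (answer == "accept" and hit) or (answer == "reject" and not hit):
            correct += 1
    return correct / len(annotated_spans)

# A model that over-predicts wildly...
predicted = [(0, 4), (5, 9), (10, 14), (15, 19)]
# ...but the only annotated span was accepted and found:
annotated = [((0, 4), "accept")]
print(score_with_missing_labels(predicted, annotated))  # 1.0
```

If something like this is happening, the three spurious spans above would never count against the model, which would explain a high cross-validation score alongside terrible print-best output.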

I’d say that’s very likely to be the problem. Try adding the --no-missing argument to the batch training. The model directory also has jsonl files with the training and evaluation split data, so you can pipe exactly that data through the model to see why it thinks it’s getting 90% accuracy.