Prodigy ner.batch-train vs spaCy train

Hi everyone, perhaps my question is not professional, but does Prodigy's ner.batch-train use the same algorithm as spaCy's spacy train?

Perhaps @ines or @honnibal can explain?

If I want to train NER with spaCy, is it the same as training with Prodigy?

Again, sorry if my question is silly.

Thank you.

Hi – this is a totally valid question :slightly_smiling_face:

Since Prodigy focuses a lot on usage as a developer tool, the built-in batch-train commands were also designed with that development aspect in mind. They’re optimised to train from Prodigy-style annotations and smaller datasets, include more complex logic to handle evaluation sets, and output more detailed training statistics.

Prodigy’s ner.batch-train workflow also supports training from “incomplete” annotations out-of-the-box, e.g. a selection of examples biased by the score, and binary decisions collected with recipes like ner.teach. There’s not really an easy way to train from the sparse data formats created by the active learning workflow using spaCy – at least not out-of-the-box.

spaCy’s spacy train command on the other hand was designed for training from larger corpora, often annotated for several components (named entities, part-of-speech tags, dependencies etc.). It also supports more configuration options and settings to tune hyperparameters.

TL;DR: If you want to run quick experiments, train from binary annotations, or export prototype models from your Prodigy annotations, use the batch-train recipes. If you want to train your production model on a large corpus of annotations, use spacy train.

Thx @ines really good explanation :smile:

1 Like

We have annotated around 10,000 paragraphs in Prodigy. We used Python code to train the spaCy NER model, but we are now planning to use the spacy train command-line utility, as it supports many configurable hyperparameters and also training on a GPU.
We are finding it hard to convert the Prodigy-annotated data into the JSON input spacy train expects: https://spacy.io/api/annotation#json-input.
We checked ner.gold-to-spacy, but even this is not the format spaCy expects.
Any pointers on how to convert to the spaCy format?

Thanks

Hi @PuneethaPai,

Yes, I think using spacy train is better once you have a reasonably sized data set.

I think the easiest way is to use the spacy convert command, which supports the jsonl format that Prodigy produces. So you should be able to just use prodigy db-out, and then pass that file through spacy convert. If you set the extension to .jsonl, it should select the correct converter automatically. But in case it doesn't, you can also specify it explicitly with --converter jsonl
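For reference, here is a rough sketch of the shape each line of that JSONL file needs for the jsonl converter to pick up the entities: a "text" field plus a "spans" list of character offsets and labels. The example text and offsets below are made up for illustration.

```python
import json

# Minimal shape of one record in a JSONL file for `spacy convert -c jsonl`:
# "text" plus a "spans" list (possibly empty) of character-offset entities.
record = {
    "text": "Apple is opening an office in Berlin",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 30, "end": 36, "label": "GPE"},
    ],
}
line = json.dumps(record)  # one such JSON object per line in the .jsonl file
print(json.loads(line)["spans"][1]["label"])  # → GPE
```

The offsets are plain character indices into "text", so `record["text"][30:36]` gives back "Berlin".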

Hi @honnibal,
Thanks for the reply. Actually found this thread and was following the same steps:
Unable to use Prodigy annotations with SpaCy CLI train.

But I am getting a different error.

prodigy db-out train_data_set > /tmp/test.jsonl
python -m spacy convert -t /tmp/test.jsonl -c jsonl --lang en test.jsonl .

/lib/python3.7/site-packages/spacy/cli/converters/jsonl2json.py", line 24, in ner_jsonl2json
    ents = record["spans"]
KeyError: 'spans'

I am using prodigy==1.8.3 and spacy==2.2.2.
Is this a problem with the versions, or is the command I am using wrong?

Thank You

1 Like

Based on the error, it looks like at least one of your examples in train_data_set doesn't contain a "spans" property. Maybe you have annotations from a previous experiment with a different recipe in there? If you collect annotations with an NER recipe like ner.manual, they should always have "spans" (even if it's just an empty list). I think the easiest solution is to just inspect the JSONL file manually and see if you can find the example(s). Or you could add a quick hack to the script and make it ents = record.get("spans", []).
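If it helps, a small script along these lines can locate the offending records before you run the converter (the sample records below are invented for illustration):

```python
import json

def find_missing_spans(lines):
    """Return the 1-based line numbers of JSONL records without a "spans" key."""
    return [i for i, line in enumerate(lines, start=1)
            if "spans" not in json.loads(line)]

# In practice you'd read these lines from your db-out file.
lines = [
    '{"text": "Acme Corp hired Jane", "spans": [{"start": 0, "end": 9, "label": "ORG"}]}',
    '{"text": "nothing to annotate here", "spans": []}',
    '{"text": "stray example from another recipe"}',
]
print(find_missing_spans(lines))  # → [3]
```

Note that the second record is fine: an empty "spans" list is still a valid NER annotation, only a missing key triggers the error.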

Hello, I would like to ask more about this (spaCy convert). Is there any way for us to convert the db-out output from Prodigy after annotating with textcat.teach to the format spaCy expects? We tried spaCy convert, but it only gives us the NER jsonl format and not the text, label:true format. Sorry if my question should be placed somewhere else.

Best regards
Jan

@Rumlerja Training the text classifier via spacy train was only added to spaCy very recently. The converter currently only handles the NER annotations, because that's the most complex part. For text classifier annotations, you can just add a "cats" entry to each document – see here for an example.
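As a rough sketch of that conversion: the field names "text", "label" and "answer" below follow Prodigy's binary textcat output, but double-check them against your own db-out file. The idea is just to expand each binary decision into a "cats" dict over all labels.

```python
def to_cats(example, all_labels):
    """Turn one binary textcat annotation into a spaCy-style "cats" dict:
    the accepted label gets 1.0, everything else 0.0."""
    cats = {label: 0.0 for label in all_labels}
    if example.get("answer") == "accept":
        cats[example["label"]] = 1.0
    return {"text": example["text"], "cats": cats}

ex = {"text": "Great value for money", "label": "POSITIVE", "answer": "accept"}
print(to_cats(ex, ["POSITIVE", "NEGATIVE"]))
# → {'text': 'Great value for money', 'cats': {'POSITIVE': 1.0, 'NEGATIVE': 0.0}}
```

How to treat rejected examples is up to you; here they simply get 0.0 for every label.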

Based on this answer my inclination is to use spaCy rather than Prodigy as a training tool. However, it does not appear that the spaCy trainer supports training NER models with a new entity type over existing models. Am I confused and there is some way to do this with spaCy, or do I have to use Prodigy, or do I have to write my own training iteration code when adding entity types?

I think what's happening here is that the spacy train command expects the base model you want to update to already have all labels added that you want to train. (It processes the data as a stream, so it's not going to compile all labels upfront and silently add them on the fly.) So if you want to update an existing pretrained model and add a new label, you should be able to just add the label and save out the base model:

import spacy

nlp = spacy.load("en_core_web_sm")  # or whichever base model you're updating
ner = nlp.get_pipe("ner")
ner.add_label("YOUR_LABEL")
nlp.to_disk("./base-model")

That's not quite writing no code, but it's pretty close. :grinning: Thanks.

@ines

I could not find the code for the ner.batch-train recipe anywhere on GitHub. I expected to find it here - https://github.com/explosion/prodigy-recipes

My goal is to be able to write a python script / function for STEP 4 in this project - https://github.com/explosion/projects/tree/master/ner-food-ingredients

In our project (NER skill extraction) we intend to automate all the steps (except labeling/annotation) as python scripts which can be called from a controlling data pipeline.

Please advise.

Thanks,
Kapil

We don't have versions of all recipes in there and mostly focus on the annotation recipes. The ner.batch-train recipe (and train in v1.9) are really mostly wrappers around spaCy's training API. We do ship the source of all recipes with Prodigy, though. So you can always look at the prodigy/recipes directory of your Prodigy installation. You can find the location by running prodigy stats.

The training recipes are just simple Python functions, so you can always import and call them programmatically from Python (see spacy.cli.train and prodigy.recipes.train). If you're ready to train your final model, you might want to use spaCy directly, e.g. by running Prodigy's data-to-spacy and then running spacy train.