Hi everyone, perhaps my question is not professional. But is Prodigy ner.batch-train using the same algorithm as spacy train?
Perhaps @ines or @honnibal can explain?
If I want to train NER with Spacy, is it the same as I train with Prodigy?
Again, sorry if my question is silly.
Hi – this is a totally valid question!
Since Prodigy focuses a lot on usage as a developer tool, the built-in batch-train commands were also designed with the development aspect in mind. They're optimised to train from Prodigy-style annotations and smaller datasets, include more complex logic to handle evaluation sets, and output more detailed training statistics.
The ner.batch-train workflow also supports training from "incomplete" annotations out-of-the-box, e.g. a selection of examples biased by the score, and binary decisions collected with recipes like ner.teach. There's not really an easy way to train from the sparse data formats created with the active learning workflow using spaCy – at least not out-of-the-box.
The spacy train command, on the other hand, was designed for training from larger corpora, often annotated for several components (named entities, part-of-speech tags, dependencies etc.). It also supports more configuration options and settings to tune hyperparameters.
TL;DR: If you want to run quick experiments, train from binary annotations, or export prototype models from your Prodigy annotations, use the batch-train recipes. If you want to train your production model on a large corpus of annotations, use spacy train.
Thanks @ines, really good explanation!
We have annotated around 10,000 paragraphs in Prodigy. We used Python code to train the spaCy NER model, but we are now planning to use the spacy train command-line utility, as it supports many configurable hyperparameters and also training on GPU. We are finding it hard to convert the Prodigy-annotated data into the JSON input spacy train expects: https://spacy.io/api/annotation#json-input.
We tried ner.gold-to-spacy, but even this is not the format spacy train expects.
Any pointers how to convert to the spacy format?
Yes, I think using spacy train is better once you have a reasonably sized data set.
I think the easiest way is to use the spacy convert command, which supports the JSONL format that Prodigy produces. So you should be able to just use prodigy db-out, and then pass that file through spacy convert. If you set the extension to .jsonl, it should select the correct converter automatically. But in case it doesn't, you can also specify it explicitly with the -c jsonl argument.
Thanks for the reply. Actually found this thread and was following the same steps:
Unable to use Prodigy annotations with SpaCy CLI train.
But I am getting a different error.
prodigy db-out train_data_set > /tmp/test.jsonl
python -m spacy convert -t /tmp/test.jsonl -c jsonl --lang en test.jsonl .
/lib/python3.7/site-packages/spacy/cli/converters/jsonl2json.py", line 24, in ner_jsonl2json
ents = record["spans"]
Is this a problem with the version I am using, or is the command wrong?
Based on the error, it looks like at least one of the examples in train_data_set doesn't contain a "spans" property. Maybe you have annotations from a previous experiment with a different recipe in there? If you collect annotations with an NER recipe like ner.manual, they should always have "spans" (even if it's just an empty list). I think the easiest solution is to just inspect the JSONL file manually and see if you can find the example(s). Or you could add a quick hack to the script and make it ents = record.get("spans", []), so missing spans default to an empty list.
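If it helps, finding the offending records is a few lines of plain Python over the exported JSONL file – a minimal sketch (the function name and file path are just illustrative):

```python
import json

def find_missing_spans(path):
    """Return the 1-based line numbers of JSONL records
    that have no "spans" property at all."""
    missing = []
    with open(path, encoding="utf8") as f:
        for i, line in enumerate(f, start=1):
            if line.strip() and "spans" not in json.loads(line):
                missing.append(i)
    return missing

# Usage (path is a placeholder for your db-out export):
# print(find_missing_spans("/tmp/test.jsonl"))
```

You can then either fix those records or filter them out before running spacy convert.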
Hello, I would like to ask more about this (spacy convert). Is there any way for us to convert the db-out output from Prodigy after annotating with textcat.teach into the format spaCy expects? We tried spacy convert, but it only gives us the NER JSONL format, not the text plus label: true format. Sorry if my question should be placed somewhere else.
@Rumlerja Training the text classifier via spacy train was only added to spaCy very recently. The converter currently only handles the NER annotations, because that's the most complex part. For text classifier annotations, you can just add a "cats" entry to each document – see here for an example.
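As a rough illustration of that idea, here's one way to collapse binary textcat.teach annotations (one label per record, with an accept/reject answer) into one "cats" dict per text. The function name is made up, and it assumes the usual db-out fields ("text", "label", "answer"):

```python
def textcat_to_cats(examples, labels):
    """Merge binary Prodigy textcat annotations into one record per
    text, with a "cats" mapping of label -> 1.0 (accept) / 0.0
    (reject), as spaCy's text classifier expects."""
    docs = {}
    for eg in examples:
        if eg.get("answer") not in ("accept", "reject"):
            continue  # skip ignored examples
        # Start every label at 0.0, then flip accepted ones to 1.0.
        cats = docs.setdefault(eg["text"], {label: 0.0 for label in labels})
        cats[eg["label"]] = 1.0 if eg["answer"] == "accept" else 0.0
    return [{"text": text, "cats": cats} for text, cats in docs.items()]
```

Unlabelled labels simply stay at 0.0 here; depending on your task, you may want to treat "not annotated" differently from "rejected".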
Based on this answer my inclination is to use spacy rather than prodigy as a training tool. However, it does not appear that the spacy trainer supports training NER models with a new entity type over existing models. Am I confused and there is some way to do this with spacy, or do I have to use prodigy, or do I have to write my own training iteration code when adding entity types?
I think what's happening here is that the spacy train command expects the base model you want to update to already have all labels added that you want to train. (It processes the data as a stream, so it's not going to compile all labels upfront and silently add them on the fly.) So if you want to update an existing pretrained model and add a new label, you should be able to just add the label and save out the base model:
ner = nlp.get_pipe("ner")
That's not quite writing no code, but it's pretty close. Thanks.
I could not find the code for the ner.batch-train recipe anywhere on GitHub. I expected to find it here - https://github.com/explosion/prodigy-recipes
My goal is to be able to write a python script / function for STEP 4 in this project - https://github.com/explosion/projects/tree/master/ner-food-ingredients
In our project (NER skill extraction) we intend to automate all the steps (except labeling/annotation) as python scripts which can be called from a controlling data pipeline.
We don't have versions of all recipes in there and mostly focus on the annotation recipes. The ner.batch-train recipe (and train in v1.9) are really mostly wrappers around spaCy's training API. We do ship the source of all recipes with Prodigy, though, so you can always look at the prodigy/recipes directory of your Prodigy installation. You can find the location by running prodigy stats.
The training recipes are just simple Python functions, so you can always just import and call them programmatically from Python if you need to. If you're ready to train your final model, you might want to use spaCy directly, e.g. by running Prodigy's data-to-spacy and then running spacy train.