Prodigy ner.batch-train vs Spacy train

Hi everyone, perhaps my question is not professional. But is Prodigy ner.batch-train using the same algorithm as Spacy train?

Perhaps @ines or @honnibal can explain?

If I want to train NER with Spacy, is it the same as training with Prodigy?

Again, sorry if my question is silly.

Thank you.

Hi – this is a totally valid question :slightly_smiling_face:

Since Prodigy focuses a lot on usage as a developer tool, the built-in batch-train commands were also designed with the development aspect in mind. They’re optimised to train from Prodigy-style annotations and smaller datasets, include more complex logic to handle evaluation sets and output more detailed training statistics.

Prodigy’s ner.batch-train workflow also supports training from “incomplete” annotations out-of-the-box, e.g. a selection of examples biased by the score, and binary decisions collected with recipes like ner.teach. There’s not really an easy way to train from the sparse data formats created with the active learning workflow using spaCy – at least not out-of-the-box.

spaCy’s spacy train command on the other hand was designed for training from larger corpora, often annotated for several components (named entities, part-of-speech tags, dependencies etc.). It also supports more configuration options and settings to tune hyperparameters.

TL;DR: If you want to run quick experiments, train from binary annotations, or export prototype models from your Prodigy annotations, use the batch-train recipes. If you want to train your production model on a large corpus of annotations, use spacy train.

Thx @ines really good explanation :smile:

We have annotated around 10,000 paragraphs in Prodigy. We used Python code to train the spaCy NER model, but we are now planning to use the spacy train command-line utility, as it supports many configurable hyperparameters and also training on GPU.
We are finding it hard to convert Prodigy-annotated data into the JSON input that spacy train expects: https://spacy.io/api/annotation#json-input.
We checked ner.gold-to-spacy, but even this is not the format spacy train expects.
Any pointers on how to convert to the spaCy format?

Thanks

Hi @PuneethaPai,

Yes, I think using spacy train is better once you have a reasonably sized data set.

I think the easiest way is to use the spacy convert command, which supports the JSONL format that Prodigy produces. So you should be able to just use prodigy db-out, and then pass that file through spacy convert. If you set the extension to .jsonl, it should select the correct converter automatically. But in case it doesn't, you can also specify it explicitly with --converter jsonl.

Hi @honnibal,
Thanks for the reply. Actually found this thread and was following the same steps:
Unable to use Prodigy annotations with SpaCy CLI train.

But I am getting a different error.

prodigy db-out train_data_set > /tmp/test.jsonl
python -m spacy convert -t /tmp/test.jsonl -c jsonl --lang en test.jsonl .

/lib/python3.7/site-packages/spacy/cli/converters/jsonl2json.py", line 24, in ner_jsonl2json
    ents = record["spans"]
KeyError: 'spans'

I am using prodigy==1.8.3 and spacy==2.2.2.
Is this a problem with the version, or is the command I am using wrong?

Thank You

Based on the error, it looks like at least one of your examples in train_data_set doesn't contain a "spans" property. Maybe you have annotations from a previous experiment with a different recipe in there? If you collect annotations with an NER recipe like ner.manual, they should always have "spans" (even if it's just an empty list). I think the easiest solution is to just inspect the JSONL file manually and see if you can find the example(s). Or you could add a quick hack to the script and make it ents = record.get("spans", []).
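If the file is too large to inspect by eye, a small script can list the offending examples. This is only a sketch: the helper name find_records_missing_key is made up for illustration, and the sample records are invented to show the idea (one of them stands in for an annotation left over from a different recipe).

```python
import json


def find_records_missing_key(lines, key="spans"):
    """Return (line_number, record) pairs for JSONL records that lack `key`.

    `lines` can be any iterable of strings, e.g. an open file handle.
    This helper is hypothetical and not part of Prodigy or spaCy.
    """
    missing = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if key not in record:
            missing.append((line_number, record))
    return missing


# Invented sample data standing in for the lines of a db-out export.
sample = [
    '{"text": "Apple is a company", "spans": [{"start": 0, "end": 5, "label": "ORG"}]}',
    '{"text": "No entities here", "spans": []}',
    '{"text": "Leftover from another recipe", "label": "SPORTS"}',
]

missing = find_records_missing_key(sample)
for line_number, record in missing:
    print(line_number, record.get("text", "")[:80])
```

To run it on the real export, pass an open file instead of the sample list, e.g. open("/tmp/test.jsonl", encoding="utf8"). Note that a record with an empty "spans" list is not reported, since the key is present.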

Hello, I would like to ask more about this (spaCy convert). Is there any way for us to convert the db-out output from Prodigy, after annotating with textcat.teach, to the format spaCy expects? We tried spacy convert, but it only gives us the NER JSONL format and not the text, label:true format. Sorry if my question should be placed somewhere else.

Best regards
Jan

@Rumlerja Training the text classifier via spacy train was only added to spaCy very recently. The converter currently only handles the NER annotations, because that's the most complex part. For text classifier annotations, you can just add a "cats" entry to each document – see here for an example.
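
As a rough sketch of that step, the binary decisions can be collapsed into one "cats" mapping per text. Everything here is an assumption to check against your own export: the field names "text", "label" and "answer" (with values "accept", "reject" and "ignore"), the helper name collect_cats, and the output layout, which is only illustrative. Where exactly the "cats" entry belongs in spaCy's training JSON should be taken from the linked example.

```python
import json
from collections import defaultdict


def collect_cats(lines):
    """Group binary text classification decisions into one entry per text.

    Assumes each JSONL record has "text", "label" and "answer" fields.
    Hypothetical helper, not part of Prodigy or spaCy.
    """
    cats_by_text = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        answer = record.get("answer")
        if answer not in ("accept", "reject"):
            continue  # skip ignored examples and anything unexpected
        if "text" not in record or "label" not in record:
            continue  # skip records from other recipes
        # accept -> True, reject -> False for this label
        cats_by_text[record["text"]][record["label"]] = answer == "accept"
    return [{"text": text, "cats": cats} for text, cats in cats_by_text.items()]


# Invented sample data standing in for the lines of a db-out export.
sample = [
    '{"text": "Great match last night", "label": "SPORTS", "answer": "accept"}',
    '{"text": "Great match last night", "label": "POLITICS", "answer": "reject"}',
    '{"text": "Skipped example", "label": "SPORTS", "answer": "ignore"}',
]

docs = collect_cats(sample)
print(json.dumps(docs, indent=2))
```

Grouping by text keeps several decisions about the same example together, so a text that was accepted for one label and rejected for another ends up with both values in a single "cats" entry.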