Evaluating spaCy v3 models on a custom dataset

Hi guys,

I have three custom datasets for the parser, tagger, and NER that were generated with Prodigy. Now I want to use them to evaluate the parser, tagger, and NER components of the en_core_web_trf model. Is there a way to do this without using Prodigy? The format of my data looks very different from the one described for spaCy's Example class.

Any guidance is appreciated,
Thank you

Hi! You can use the data-to-spacy recipe to export your annotations, which in Prodigy v1.10 will give you a corpus in spaCy v2's JSON format. If you're using spaCy v3, you can then run spacy convert (https://spacy.io/api/cli#convert) to convert it to the binary format used by spacy train. You'll then be able to train and evaluate your model using a transformer-based config.
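
Once you have a converted .spacy file with your gold annotations, something along these lines should work for the evaluation part (the file name dev.spacy is just a placeholder for whatever your converted corpus is called):

```python
import spacy
from spacy.training import Corpus

nlp = spacy.load("en_core_web_trf")

# Corpus reads the binary .spacy file and yields Example objects that pair
# the gold-standard docs with unannotated copies of the same texts.
corpus = Corpus("./dev.spacy")
examples = list(corpus(nlp))

# Language.evaluate runs the pipeline over the examples and scores the
# predictions against the gold annotations. The returned dict includes
# keys like ents_p/ents_r/ents_f, tag_acc and dep_uas/dep_las.
scores = nlp.evaluate(examples)
print(scores)
```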

Btw, under the hood, spaCy v3's binary format is just a collection of annotated Doc objects, which also makes it much easier to generate programmatically: https://spacy.io/api/data-formats#binary-training
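
For example, if your exported annotations are NER spans with character offsets over raw text, a rough sketch like this would produce a .spacy file directly. The file name and the record fields ("text", "spans", "start", "end", "label", "answer") are based on typical Prodigy NER output, so adjust them to whatever your data actually contains:

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")   # only used for tokenization here
doc_bin = DocBin()

with open("ner_annotations.jsonl", encoding="utf8") as f:   # placeholder file name
    for line in f:
        record = json.loads(line)
        if record.get("answer") != "accept":   # skip rejected/ignored examples
            continue
        doc = nlp.make_doc(record["text"])
        ents = []
        for span in record.get("spans", []):
            # char_span returns None if the offsets don't align with token boundaries
            ent = doc.char_span(span["start"], span["end"], label=span["label"])
            if ent is not None:
                ents.append(ent)
        doc.ents = ents
        doc_bin.add(doc)

doc_bin.to_disk("./ner_eval.spacy")   # usable with spacy evaluate or nlp.evaluate
```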

The upcoming Prodigy v1.11, currently available as a nightly pre-release, will allow you to export your data in spaCy's .spacy format out-of-the-box.

Hi Ines,

Thanks for the response!
My data is in JSONL format, and it looks like spacy convert doesn't work on JSONL. Am I wrong?

That's right: the spacy convert command expects data in spaCy's JSON training format, not the raw annotations you've exported from Prodigy. You can create that file with the data-to-spacy command, which will merge the annotations from your different datasets and export a corpus in spaCy's format.
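
After running data-to-spacy and spacy convert, you can also double-check what ended up in the corpus by loading it back in Python (the path here is a placeholder for wherever you wrote the output):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./corpus/train.spacy")   # placeholder path
docs = list(doc_bin.get_docs(nlp.vocab))

print(f"{len(docs)} docs in the corpus")
for doc in docs[:3]:
    print(doc.text[:60])
    print("  ents:", [(ent.text, ent.label_) for ent in doc.ents])
    print("  tags:", [token.tag_ for token in doc[:10]])
```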
