How to train an NER model using only spaCy 3, starting from Prodigy (1.11) JSON files?

Hi! I'd like to give a client who doesn't have a Prodigy license a way to train an NER model from annotation data I generated with Prodigy 1.11.2. Ideally, I would like to enable them to train a model

  • without needing Prodigy, using just spaCy 3
  • from JSON/JSONL files handed over as "raw" training data, rather than binary .spacy files

Is this possible?

Something that works (but doesn't generate the JSON/JSONL files that I'd like to hand over as the data source) is the following:

python -m prodigy data-to-spacy --ds . #  doesn't generate intermediate JSON files, so not great for me
python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy

To allow the training data to be generated from JSONL files rather than binary files, I tried the following:

python -m prodigy db-out ds ds # to generate a JSON annotation file

Then I created a train/dev split with pandas/sklearn in a Python console:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the exported annotations, one JSON object per line
data = pd.read_json("ds/ds.jsonl", lines=True)
# Random split using sklearn's default proportions
train, dev = train_test_split(data)
# Write both splits back out in the original JSONL format
train.to_json("train.jsonl", orient="records", lines=True)
dev.to_json("dev.jsonl", orient="records", lines=True)

The problem arises when I try to convert the train and dev files to the .spacy format in order to train with spaCy 3. An attempt using the script in projects/ at 3c17ba90490301e8665503a6516adf3f77bb5b07 in the explosion/projects GitHub repository:

python en train.jsonl train.spacy

gives me a ValueError: Trailing data error. spacy convert also doesn't seem to help here, as it appears to handle only the spaCy v2 JSON format (?).

How can I solve this?

Hi! Under the hood, the binary .spacy format is a serialized DocBin, i.e. a collection of Doc objects. So if you want to create training data from pretty much any format, you can simply construct Doc objects, set the annotations you need (doc.ents, doc.cats, other token-based tags) and save out a DocBin.

Here's an example script that reads in annotations in Prodigy's JSON format with text, tokens and spans and creates a DocBin from them:
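As a minimal sketch of this kind of conversion (not necessarily identical to the script referred to above), the function below reads a JSONL file line by line, builds one Doc per accepted example, and writes a DocBin. It assumes each record has a "text" key, an optional "spans" list with "start", "end" and "label", and an optional "answer" key. The function name convert and its return value are hypothetical, and for simplicity it re-tokenizes the text with a blank pipeline instead of reusing the stored "tokens".

```python
import json
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def convert(lang, input_path, output_path):
    """Convert Prodigy-style JSONL NER annotations into a .spacy file.

    Returns (number of docs written, number of spans skipped).
    """
    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    doc_bin = DocBin()
    skipped = 0
    with Path(input_path).open(encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            eg = json.loads(line)  # one JSON object per line (JSONL)
            # Assumption: records without an "answer" key count as accepted
            if eg.get("answer", "accept") != "accept":
                continue
            doc = nlp.make_doc(eg["text"])
            ents = []
            for span in eg.get("spans", []):
                # char_span returns None if the character offsets
                # don't line up with token boundaries
                ent = doc.char_span(span["start"], span["end"], label=span["label"])
                if ent is None:
                    skipped += 1
                else:
                    ents.append(ent)
            # Note: assigning overlapping spans to doc.ents raises an error
            doc.ents = ents
            doc_bin.add(doc)
    doc_bin.to_disk(output_path)
    return len(doc_bin), skipped
```

Calling convert("en", "train.jsonl", "train.spacy") and the same for dev.jsonl should then give the client two files that can be passed to spacy train, while the JSONL files remain the data source. A non-zero skipped count is worth inspecting, since it usually points to a tokenization mismatch.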

One thing to keep in mind is that the data-to-spacy command, which you'd normally use within Prodigy to output data in spaCy's format, also includes an additional step for merging annotations on the same data into a single example, using the _input_hash to identify examples with the same input. This is important if you have multiple annotations on the same text, e.g. two datasets annotating the same example with different labels, or an NER dataset + a text classification dataset. So if that's relevant for your data, you want to start by grouping all examples by _input_hash before you create one Doc object per example (plus maybe some checks and validation, in case there are conflicts in the data).
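The grouping step could look roughly like the sketch below. This is not how data-to-spacy is implemented internally; it is a simplified illustration that only merges NER spans. The function name merge_by_input_hash is hypothetical, and treating overlapping spans as a conflict is an assumption about what validation you might want.

```python
from collections import defaultdict


def merge_by_input_hash(examples):
    """Merge accepted examples that share an _input_hash into one example."""
    grouped = defaultdict(list)
    for eg in examples:
        # Assumption: records without an "answer" key count as accepted
        if eg.get("answer", "accept") == "accept":
            grouped[eg["_input_hash"]].append(eg)

    merged = []
    for input_hash, egs in grouped.items():
        seen = set()
        spans = []
        for eg in egs:
            for span in eg.get("spans", []):
                key = (span["start"], span["end"], span["label"])
                if key not in seen:  # drop exact duplicate spans
                    seen.add(key)
                    spans.append(span)
        spans.sort(key=lambda s: (s["start"], s["end"]))
        # Simple validation: overlapping spans can't all be set as doc.ents
        for prev, cur in zip(spans, spans[1:]):
            if cur["start"] < prev["end"]:
                raise ValueError(f"Overlapping spans for input hash {input_hash}")
        merged.append({"text": egs[0]["text"], "_input_hash": input_hash, "spans": spans})
    return merged
```

The merged records keep the "text" and "spans" keys, so they can be written back to JSONL and converted to Doc objects one per input text.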
