Hi! I'd like to give a client who doesn't have a Prodigy license a way to train an NER model from annotation data I generated with Prodigy 1.11.2. Ideally, I'd like to enable them to train a model:
- without needing Prodigy, just spaCy 3
- by handing them JSON/JSONL as "raw" training data, rather than binary .spacy files
Is this possible?
Something that works (but doesn't generate the JSON/JSONL files I'd like to hand over as the data source) is the following:
python -m prodigy data-to-spacy . --ner ds # doesn't generate intermediate JSON files, so not great for me
python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
To generate the training data from JSONL files rather than binary ones, I tried the following:
python -m prodigy db-out ds ds # export the annotations as a JSONL file (ds/ds.jsonl)
Then create a train/dev split with pandas/scikit-learn in a Python console:
import pandas as pd
from sklearn.model_selection import train_test_split

# load the exported annotations and split them into train/dev sets
data = pd.read_json("ds/ds.jsonl", lines=True)
train, dev = train_test_split(data)

# write the splits back out, preserving the original JSONL format
train.to_json("train.jsonl", orient="records", lines=True)
dev.to_json("dev.jsonl", orient="records", lines=True)
Hi! Under the hood, the binary .spacy format is a serialized DocBin, i.e. a collection of Doc objects. So if you want to create training data from pretty much any format, you can simply construct Doc objects, set the annotations you need (doc.ents, doc.cats, other token-based tags) and save out a DocBin: https://spacy.io/usage/training/#training-data
Here's an example script that reads in annotations in Prodigy's JSON format with text, tokens and spans and creates a DocBin from them:
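A minimal sketch of what that script can look like, assuming a blank English pipeline and the train.jsonl file produced by the split above (the "en" language and the file names are placeholders, adapt them to your data):

import json
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")  # assumption: English data; use your own language code
doc_bin = DocBin()

with open("train.jsonl", encoding="utf8") as f:  # file exported/split above
    for line in f:
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue  # skip rejected/ignored annotations
        # reconstruct the Doc from Prodigy's stored tokenization
        words = [token["text"] for token in example["tokens"]]
        spaces = [token.get("ws", True) for token in example["tokens"]]
        doc = Doc(nlp.vocab, words=words, spaces=spaces)
        # add the annotated spans as entities
        ents = []
        for span in example.get("spans", []):
            ent = doc.char_span(span["start"], span["end"], label=span["label"])
            if ent is not None:  # char offsets should align with the tokens
                ents.append(ent)
        doc.ents = ents
        doc_bin.add(doc)

doc_bin.to_disk("train.spacy")  # ready to use with spacy train

Your client can run the same script on dev.jsonl to produce dev.spacy, and then train with python -m spacy train using only spaCy 3, no Prodigy required.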
One thing to keep in mind is that the data-to-spacy command, which you'd normally use within Prodigy to output data in spaCy's format, also includes an additional step that merges annotations on the same data into a single example, using the _input_hash to identify examples with the same input. This is important if you have multiple annotations on the same text, e.g. two datasets annotating the same example with different labels, or an NER dataset + a text classification dataset. So if that's relevant for your data, you want to start by grouping all examples by _input_hash before you create one Doc object per example (plus maybe some checks and validation, in case there are conflicts in the data).
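For illustration, that grouping step could look something like the following, a simplified sketch that reuses the ds/ds.jsonl export from above (Prodigy's own merging does more validation than this):

import json
from collections import defaultdict

# group all annotated examples by their _input_hash
grouped = defaultdict(list)
with open("ds/ds.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        grouped[example["_input_hash"]].append(example)

# naively merge all annotations on the same input into one example
merged = []
for input_hash, examples in grouped.items():
    merged_example = dict(examples[0])
    # note: this doesn't resolve conflicting or overlapping spans
    merged_example["spans"] = [
        span for eg in examples for span in eg.get("spans", [])
    ]
    merged.append(merged_example)

The merged list can then be fed into the DocBin script above, one Doc per entry.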