Hi! I'd like to give a client who doesn't have a Prodigy license a way to train an NER model from annotation data I generated with Prodigy 1.11.2. Ideally, I would like to enable them to train a model:
- without needing Prodigy, just spaCy 3
- from JSON/JSONL files handed over as "raw" training data, rather than binary files
Is this possible?
Something that works (but doesn't generate the JSON/JSONL files I'd like to hand over as the data source) is the following:
python -m prodigy data-to-spacy --ds . # doesn't generate intermediate JSON files, so not great for me
python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
To allow the training data to be generated from JSONL files rather than binary files, I tried the following:
python -m prodigy db-out ds ds # to generate a JSON annotation file
Then I created a train and dev split with pandas/sklearn in a Python console:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_json("ds/ds.jsonl", lines=True)
train, dev = train_test_split(data)
train.to_json("train.jsonl", orient="records", lines=True)  # respect the original JSONL format
dev.to_json("dev.jsonl", orient="records", lines=True)  # respect the original JSONL format
The problem arises when I try to convert the train and dev files to the .spacy format so that I can train with spaCy 3. An attempt using the convert.py script from explosion/projects on GitHub (at commit 3c17ba90490301e8665503a6516adf3f77bb5b07):
python convert.py en train.jsonl train.spacy
gives me a
ValueError: Trailing data error.
spacy convert doesn't seem to help here either, as its JSON converter appears to handle only the spaCy v2 JSON training format (?).
How can I solve this?
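
One direction that might work, as a minimal sketch: the "Trailing data" error is likely raised because convert.py parses its input as a single JSON document, while train.jsonl holds one JSON object per line. Assuming the script expects a JSON list of [text, {"entities": [[start, end, label], ...]}] pairs (an assumption about that script, not something confirmed here), and that each Prodigy record carries "text", "spans" (with "start", "end", "label") and "answer" keys, the JSONL could be reshaped into a plain JSON file using only the standard library. The function names prodigy_jsonl_to_pairs and write_pairs are hypothetical.

```python
import json


def prodigy_jsonl_to_pairs(jsonl_path):
    """Read Prodigy-style JSONL and return [text, {"entities": [...]}] pairs."""
    pairs = []
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)  # parse one JSON object per line
            # Assumption: only accepted annotations should be used for training.
            if record.get("answer", "accept") != "accept":
                continue
            # "spans" may be missing or null (e.g. after a pandas round trip).
            entities = [
                [span["start"], span["end"], span["label"]]
                for span in (record.get("spans") or [])
            ]
            pairs.append([record["text"], {"entities": entities}])
    return pairs


def write_pairs(pairs, json_path):
    """Write the pairs as one JSON document (a list), not as JSONL."""
    with open(json_path, "w", encoding="utf8") as f:
        json.dump(pairs, f, ensure_ascii=False, indent=2)
```

Calling write_pairs(prodigy_jsonl_to_pairs("train.jsonl"), "train.json") would then give a human-readable file that could be handed over and, if the assumption about the input format holds, passed to convert.py in place of train.jsonl.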