How to train a NER model using spaCy 3 only, starting from prodigy (1.11) JSON files?

davidefiocco · August 21, 2021, 2:35pm

Hi! I'd like to give a client who doesn't have a Prodigy license a way to train an NER model from annotation data I generated with Prodigy 1.11.2. Ideally, they should be able to train a model:

- without needing Prodigy, just spaCy 3
- from JSON/JSONL files that I hand over as "raw" training data, rather than binary .spacy files

Is this possible?

Something that works (but doesn't generate JSON/JSONL files which I'd like to hand over as data source) is the following:

python -m prodigy data-to-spacy --ds . #  doesn't generate intermediate JSON files, so not great for me
python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy

To allow the training data to be generated from JSONL files rather than binary files, I tried the following:

python -m prodigy db-out ds ds # to generate a JSON annotation file

Then I created a train and dev split with pandas/sklearn in a Python console:

import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_json("ds/ds.jsonl", lines = True)
train, dev = train_test_split(data)
train.to_json("train.jsonl", orient = "records", lines = True) # respect the original JSONL format
dev.to_json("dev.jsonl", orient = "records", lines = True) # respect the original JSONL format

The problem arises when I try to convert the train and dev files to the .spacy format so that I can train with spaCy 3. An attempt using the script at projects/convert.py at 3c17ba90490301e8665503a6516adf3f77bb5b07 · explosion/projects · GitHub

python convert.py en train.jsonl train.spacy

gives me "ValueError: Trailing data". spacy convert doesn't seem to help here either, as it appears to handle only the spaCy v2 JSON format (?).

How can I solve this?

ines · August 22, 2021, 11:50pm

Hi! Under the hood, the binary .spacy format is a serialized DocBin, i.e. a collection of Doc objects. So if you want to create training data from pretty much any format, you can simply construct Doc objects, set the annotations you need (doc.ents, doc.cats, other token-based tags) and save out a DocBin: https://spacy.io/usage/training/#training-data

Here's an example script that reads in annotations in Prodigy's JSON format with text, tokens and spans and creates a DocBin from them:

explosion/projects/blob/v3/tutorials/ner_fashion_brands/scripts/preprocess.py

import typer
import srsly
from pathlib import Path
from spacy.util import get_words_and_spaces
from spacy.tokens import Doc, DocBin
import spacy


def main(
    input_path: Path = typer.Argument(..., exists=True, dir_okay=False),
    output_path: Path = typer.Argument(..., dir_okay=False),
):
    nlp = spacy.blank("en")
    doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
    for eg in srsly.read_jsonl(input_path):
        if eg["answer"] != "accept":
            continue
        tokens = [token["text"] for token in eg["tokens"]]
        words, spaces = get_words_and_spaces(tokens, eg["text"])
        doc = Doc(nlp.vocab, words=words, spaces=spaces)

This file has been truncated.
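
Because the excerpt above is cut off, here is a minimal, self-contained sketch of the same idea: read Prodigy-style records with text, tokens and spans, build Doc objects, set doc.ents and collect everything in a DocBin. It is not the remainder of the linked script. The function names examples_to_docbin and convert_file are hypothetical, and silently skipping spans that don't align with token boundaries is an assumption you may want to replace with an error or a log message.

```python
from pathlib import Path

import spacy
import srsly
from spacy.tokens import Doc, DocBin
from spacy.util import get_words_and_spaces


def examples_to_docbin(examples, lang="en"):
    """Build a DocBin from Prodigy-style NER records (hypothetical helper)."""
    nlp = spacy.blank(lang)
    doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
    for eg in examples:
        # Only accepted annotations become training examples.
        if eg.get("answer") != "accept":
            continue
        tokens = [token["text"] for token in eg["tokens"]]
        words, spaces = get_words_and_spaces(tokens, eg["text"])
        doc = Doc(nlp.vocab, words=words, spaces=spaces)
        ents = []
        for span in eg.get("spans", []):
            # char_span returns None if the offsets don't match token boundaries.
            ent = doc.char_span(span["start"], span["end"], label=span["label"])
            if ent is not None:  # assumption: skip misaligned spans
                ents.append(ent)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin


def convert_file(input_path, output_path, lang="en"):
    """Read a JSONL file, convert it and write a .spacy file (hypothetical helper)."""
    doc_bin = examples_to_docbin(srsly.read_jsonl(Path(input_path)), lang=lang)
    doc_bin.to_disk(Path(output_path))
    return len(doc_bin)
```

Calling convert_file("train.jsonl", "train.spacy") and convert_file("dev.jsonl", "dev.spacy") would then produce the two files expected by the spacy train command shown in the question.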

One thing to keep in mind is that the data-to-spacy command, which you'd normally use within Prodigy to output data in spaCy's format, also includes an additional step for merging annotations on the same data into a single example, using the _input_hash to identify examples with the same input. This is important if you have multiple annotations on the same text, e.g. two datasets annotating the same example with different labels, or an NER dataset + a text classification dataset. So if that's relevant for your data, you want to start by grouping all examples by _input_hash before you create one Doc object per example (plus maybe some checks and validation, in case there are conflicts in the data).
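
The grouping step described above could be sketched roughly as follows. This is a simplified illustration, not Prodigy's actual merging logic: the function name merge_by_input_hash is hypothetical, only NER spans are merged, and treating any overlapping spans as a conflict that raises an error is an assumption.

```python
from collections import defaultdict


def merge_by_input_hash(examples):
    """Merge accepted NER annotations that share the same _input_hash."""
    grouped = defaultdict(list)
    for eg in examples:
        if eg.get("answer") != "accept":
            continue
        grouped[eg["_input_hash"]].append(eg)

    merged = []
    for input_hash, group in grouped.items():
        base = dict(group[0])  # keep text, tokens etc. from the first record
        seen = set()
        spans = []
        for eg in group:
            for span in eg.get("spans", []):
                key = (span["start"], span["end"], span["label"])
                if key not in seen:  # drop exact duplicates
                    seen.add(key)
                    spans.append(span)
        spans.sort(key=lambda s: (s["start"], s["end"]))
        # Assumption: overlapping spans count as a conflict, because doc.ents
        # cannot hold overlapping entities.
        for prev, cur in zip(spans, spans[1:]):
            if cur["start"] < prev["end"]:
                raise ValueError(
                    f"Conflicting spans for _input_hash {input_hash}: {prev} / {cur}"
                )
        base["spans"] = spans
        merged.append(base)
    return merged
```

The merged records keep the same text, tokens and spans layout, so they can be passed to whatever code builds the Doc objects.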
