spaCy training from data-to-spacy output?

Hi,

I've had trouble understanding how to configure model training in spaCy.

I read the documentation: https://spacy.io/api/cli#train

But I don't really understand how to use the JSON file produced by the data-to-spacy command in Prodigy. I'd like to train my own custom NER model, and maybe reuse an existing language model alongside it. Can you help?

Thanks.

Hi! If you're using spaCy v2, you can provide it directly as an argument to spacy train: https://v2.spacy.io/api/cli#train

If you're using spaCy v3, you can convert it to the new binary .spacy format using spacy convert (https://spacy.io/api/cli#convert). You can then provide the path to the converted file in your config or via the CLI overrides --paths.train and --paths.dev. See the documentation for how to get started: https://spacy.io/usage/training#quickstart

Hi Ines,

Thanks for your quick reply. Is it mandatory to provide development data for evaluation? As I said, I'm having trouble running the command.
Could you explain in more detail how to get a custom NER model from the data-to-spacy output (in JSON)? I used ner.manual to annotate the data.
I don't understand the documentation; it's the first time I'm training my own NER model.

Thanks

(Maybe an example command line would be clearer, sorry.)

Hi Sebastien,

Which version of spaCy are you using?

In the meantime, to quickly answer your question about whether development data is mandatory:

Yes. Otherwise, there's nothing to evaluate the model on. (If you're training a model, you typically want to know how accurate it is, and that's done by running it over annotated examples it hasn't seen during training. That data shouldn't change, so you can meaningfully compare results over time and see if your model is improving.) You can create your development data by shuffling and splitting your training data, holding back between 20 and 50% of your data, depending on the number of examples. In Prodigy, you can also create a separate dataset just for evaluation and then export that separately.
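For illustration, the shuffle-and-split step could be sketched in plain Python as below. This is only a minimal sketch, not what Prodigy or spaCy run internally: the function name split_examples, the dev_fraction and seed parameters, and the toy example format are all assumptions made up for this example.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle annotated examples and hold back a portion for evaluation.

    Hypothetical helper for illustration; returns (train, dev).
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy examples in a made-up format; a real export will look different.
examples = [{"text": f"example {i}", "spans": []} for i in range(10)]
train, dev = split_examples(examples, dev_fraction=0.2)
print(len(train), len(dev))  # 8 2
```

Fixing the seed matters here: if the split changed on every run, the evaluation scores would no longer be comparable between experiments.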

Hi Sofie. I'm still using spaCy v2 and I would like to migrate to spaCy v3.

I understand, Ines. How do you automate that process? Should I create the development data with a specific command? If I understand correctly, the development data also has to be annotated, and that annotation serves as a gold standard. Am I right? And the development data is then used to test whether the model finds the correct entities "blindly"? So during training, we test the model's ability to generalize?

If you're using spaCy v2, the commands could look like this:

python -m prodigy data-to-spacy ./train.json ./dev.json --lang en --ner your_ner_dataset --eval-split 0.5
python -m spacy train en ./output ./train.json ./dev.json

The first command exports your dataset (your_ner_dataset, which you'll need to replace with your own dataset name) and splits it into a training portion and an evaluation portion. The --eval-split option lets you control what fraction of the examples is split off for the evaluation set, e.g. 0.5 for 50%. If you leave out the path to the development file, Prodigy will export a single file only. The second command then trains on train.json, evaluates on dev.json and saves the model to ./output.

Yes, exactly. The data you evaluate on should be representative of your training data (and obviously of what your model will see at runtime). So one way to create that data is to shuffle all the examples you've annotated and then separate them into two portions. Another option would be to annotate two datasets in Prodigy: a training dataset (that you can keep adding more examples to), and an evaluation dataset (that doesn't change). This way, you can keep evaluating on the exact same data, which means that you can compare the results across your experiments and as your training data changes.
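If everything lives in one growing pool of annotations rather than in two separate Prodigy datasets, one possible way to keep the evaluation portion stable is to assign each example to a side based on a hash of its text. To be clear, this is a different technique from the two options described above, and nothing here is a Prodigy or spaCy API: is_eval_example, stable_split and the "text" key are hypothetical names for this sketch.

```python
import hashlib


def is_eval_example(text, eval_fraction=0.2):
    """Decide from the text alone whether an example goes to the eval set."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < eval_fraction


def stable_split(examples, eval_fraction=0.2):
    """Split examples so that each text always lands on the same side."""
    train, dev = [], []
    for eg in examples:
        target = dev if is_eval_example(eg["text"], eval_fraction) else train
        target.append(eg)
    return train, dev


first_batch = [{"text": f"sentence {i}"} for i in range(50)]
train_1, dev_1 = stable_split(first_batch)

# Later, more annotations arrive; the earlier examples keep their side.
second_batch = first_batch + [{"text": f"new sentence {i}"} for i in range(50)]
train_2, dev_2 = stable_split(second_batch)
print(all(eg in dev_2 for eg in dev_1))  # True
```

The trade-off is that eval_fraction is only approximate for small datasets, since each example is assigned independently. In exchange, an example never moves from the evaluation side to the training side as the pool grows, and duplicate texts always end up on the same side.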

It's much clearer, thanks for your detailed explanation!

I thought it was common to have a train, validation and test split. Is this possible with the data-to-spacy command? Here, only a training file (train.json) and a validation file (dev.json) are created.