spaCy training from data-to-spacy output?

Sebastien · March 2, 2021, 10:07am

Hi,

I'm having trouble understanding how to configure model training in spaCy.

I read the documentation: https://spacy.io/api/cli#train

But I don't really understand how to use the JSON file obtained from the data-to-spacy command in Prodigy. I'd like to configure my own custom NER model, and maybe reuse an existing language model alongside it. Can you help?

Thanks.

ines · March 2, 2021, 1:30pm

Hi! If you're using spaCy v2, you can provide it directly as an argument to spacy train: https://v2.spacy.io/api/cli#train

If you're using spaCy v3, you can convert it to the new binary .spacy format using spacy convert (https://spacy.io/api/cli#convert). You can then provide the path in your config or via the CLI overrides --paths.train and --paths.dev. See the documentation here for how to get started: https://spacy.io/usage/training#quickstart

Sebastien · March 2, 2021, 3:08pm

Hi Ines,

Thanks for your quick reply. Is it mandatory to provide development data for evaluation? I'm having trouble running the command, as I said previously.
Could you explain in more detail how to obtain a custom NER model from the data-to-spacy output (in JSON)? I used ner.manual to annotate the data.
I don't understand the documentation; this is the first time I'm training my own NER model.

Thanks

(maybe an example of a command line would be clearer, sorry)

SofieVL · March 2, 2021, 4:51pm

Hi Sebastien,

Which version of spaCy are you using?

ines · March 3, 2021, 1:20am

To quickly answer this question in the meantime:

Yes, you need to provide development data. Otherwise, there's nothing to evaluate the model on. (If you're training a model, you typically want to know how accurate it is, and that's done by running it over annotated examples it hasn't seen during training. That data shouldn't change, so you can meaningfully compare results over time and see if your model is improving.) You can create your development data by shuffling and splitting your training data, holding back between 20 and 50% of your data, depending on the number of examples. In Prodigy, you can also create a separate dataset just for evaluation and then export that separately.
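
The shuffle-and-split step could be sketched in plain Python as below. This is only an illustration, not what Prodigy or spaCy run internally: the function name split_examples, its parameters and the toy examples are all made up for this sketch, and it assumes your annotated examples are already loaded as a list.

```python
import random


def split_examples(examples, eval_split=0.2, seed=0):
    """Shuffle a copy of the examples and hold back a share for evaluation.

    Hypothetical helper for illustration; not part of Prodigy or spaCy.
    """
    if not 0.0 < eval_split < 1.0:
        raise ValueError("eval_split must be between 0 and 1")
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_split))
    # First n_eval shuffled examples become the development set
    return shuffled[n_eval:], shuffled[:n_eval]


# Toy data standing in for annotated examples
examples = [{"text": f"Example {i}"} for i in range(10)]
train, dev = split_examples(examples, eval_split=0.3)
print(len(train), len(dev))
```

Using a fixed seed means the same examples end up in the development set each time you rerun the split on the same data, which keeps results comparable between runs.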

Sebastien · March 3, 2021, 8:04am

Hi Sofie. I'm still using spaCy v2 and I would like to migrate to spaCy v3.

I understand, Ines. How do you automate that process? Should I create the development data with a specific command? If I understand correctly, the development data should also be annotated, and that annotation serves as a gold standard. Am I right? And the development data is then used to test whether the model finds the correct entities "blindly"? So during the training process we test the model's capacity to generalize?

ines · March 3, 2021, 11:59pm

If you're using spaCy v2, the commands could look like this:

python -m prodigy data-to-spacy ./train.json ./dev.json --lang en --ner your_ner_dataset --eval-split 0.5
python -m spacy train en ./output ./train.json ./dev.json

The first command will export your dataset (your_ner_dataset – you'll obviously need to replace that with your dataset name) and split it into a training and an evaluation portion. The --eval-split option lets you control how many examples are split off for the evaluation set, e.g. 0.5 for 50%. If you leave out the path to the development dataset, Prodigy will export only a single file.

Yes, exactly. The data you evaluate on should be representative of your training data (and obviously of what your model will see at runtime). So one way to create that data is to shuffle all the examples you've annotated and then separate them into two portions. Another option would be to annotate two datasets in Prodigy: a training dataset (that you can keep adding more examples to), and an evaluation dataset (that doesn't change). This way, you can keep evaluating on the exact same data, which means that you can compare the results across your experiments and as your training data changes.
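
To make the "gold standard" idea concrete, here is a rough sketch of how predictions can be scored against annotated evaluation data. It assumes entities are represented as (start, end, label) tuples and counts a prediction as correct only if it matches a gold entity exactly. The function score_entities and the toy spans are made up for illustration; this is not spaCy's own scorer.

```python
def score_entities(gold, predicted):
    """Compute precision, recall and F-score over exact entity matches.

    gold and predicted are lists with one entry per document, each entry
    being a list of (start, end, label) tuples. Hypothetical helper.
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        gold_set, pred_set = set(gold_spans), set(pred_spans)
        tp += len(gold_set & pred_set)  # predicted and in the gold data
        fp += len(pred_set - gold_set)  # predicted but not in the gold data
        fn += len(gold_set - pred_set)  # in the gold data but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Toy evaluation set: two documents with gold and predicted entities
gold = [[(0, 5, "PERSON"), (10, 16, "ORG")], [(3, 9, "GPE")]]
predicted = [[(0, 5, "PERSON"), (10, 16, "GPE")], [(3, 9, "GPE")]]
scores = score_entities(gold, predicted)
print(scores)
```

In the toy data, the second entity of the first document has the right boundaries but the wrong label, so it counts both as a wrong prediction and as a missed gold entity. Because the evaluation data stays fixed, scores like these can be compared from one experiment to the next.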

Sebastien · March 4, 2021, 7:54am

It's much clearer now, thanks for your detailed explanation!

yllwpr · June 14, 2022, 10:00am

I thought it was common to have a train, validation and test split. Is this possible with the data-to-spacy command? Here, only a training set (train.json) and a validation set (dev.json) are created.
