data-to-spacy for transformers

I have an annotated dataset for NER, and I've trained a model successfully using the prodigy train command. I wanted to train a new model using transformers, so I used the starter config downloaded from the spaCy quickstart and filled it in with the fill-config command. However, the results were much worse than the non-transformer model. The .spacy files were created using the data-to-spacy command with no base model. I'm assuming the reason for the poor transformer results is that the data-to-spacy command did not tokenize the examples for transformer training. How do I use the data-to-spacy command to convert the Prodigy data to spaCy data that can be trained using transformers?

Could you share your commands explicitly? That might help me understand the steps a bit better.

Technically I don't think you need to run data-to-spacy, because you could also just use prodigy train directly. You can pass the --config there as well. Could you try training a model that way?
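For example, something along these lines, with your own dataset name and config path filled in:

prodigy train ./output --ner your_dataset --config config.cfg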

It's always possible that something in your config file is keeping the transformer model from converging. Are you able to confirm that the system has converged and that running for more epochs does not help?
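If you want to rule that out, one option is to override the stopping criteria on the command line; --training.max_steps and --training.patience map directly onto the [training] section of the config, and the values and paths below are only meant as an illustration:

python -m spacy train config.cfg --training.max_steps 40000 --training.patience 5000 --paths.train train.spacy --paths.dev dev.spacy --output model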

Thanks for your response. The reason I'm using the data-to-spacy method is that I train the model in Colab, and converting everything to spaCy format makes it easier to transfer the directory from my local machine.

As for the commands I'm using: first I run the data-to-spacy command.

prodigy data-to-spacy my_directory --ner my_dataset

Then I export the config file from the spaCy quickstart tool with the settings NER, GPU, and Accuracy.

In Colab I install the following:

!pip install -U pip setuptools wheel
!pip install -U 'spacy[cuda113,transformers,lookups]'
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

I fill in the base_config file with:

python -m spacy init fill-config base_config.cfg config.cfg

Then I train with:

!python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy --output model --gpu-id 0

I'm curious: do you get errors if you don't install torchvision and torchaudio? I could be wrong, but I didn't expect spaCy to need these as dependencies.
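If they aren't needed, it might also be worth trying a fresh Colab environment without that extra install, so that spacy[transformers] can pull in a compatible torch on its own; something like this, though I haven't tested it myself:

!pip install -U pip setuptools wheel
!pip install -U 'spacy[cuda113,transformers,lookups]'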

Ah, I wasn't aware of the Colab situation. Just to check: does Colab complain when you try to train a CPU model, one that's optimized for speed? Or did you run the CPU model locally and not on Colab?
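To take the GPU out of the equation entirely, you could try the same data with a CPU config exported from the quickstart with the efficiency setting; I'm calling it config_cpu.cfg here purely for illustration:

!python -m spacy train config_cpu.cfg --paths.train train.spacy --paths.dev dev.spacy --output model_cpu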

It does seem like other people are able to run spaCy on Colab, although it's unclear whether they're interested in using the GPU.

Since this question might be more related to spaCy than to Prodigy, it would be a good idea to post it in the spaCy discussion forum. The spaCy maintainers check that forum and might be better equipped to answer.

My initial question was, "How do I use the data-to-spacy command to convert the Prodigy data to spaCy data that can be trained using transformers?" The data-to-spacy command is a Prodigy command, not a spaCy command. Accordingly, is there a way to use this Prodigy command to generate training data suitable for transformers?

This command should already take care of that:

prodigy data-to-spacy my_directory --ner my_dataset

If you're training a spaCy pipeline with a transformer, spaCy will take care of aligning your annotations with the transformer's tokens on your behalf.
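If you'd like extra reassurance before training, you can also run spaCy's data debugging command against the exported files and your transformer config; the paths below are the ones from your earlier commands:

python -m spacy debug data config.cfg --paths.train train.spacy --paths.dev dev.spacy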

You can see me report on all of the required steps here. Since you mentioned you're running similar steps, but on Colab, that's why I'm thinking this might be a spaCy issue on top of Colab.

It's working now. Per your suggestion, I removed the PyTorch installation line:

!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Maybe that's it :man_shrugging:

In any case, thanks very much for your help!
