data-to-spacy for transformers

I have an annotated dataset for NER, and I've trained a model successfully using the prodigy train command. I wanted to train a new model using transformers, so I used the starter config downloaded from the spaCy quickstart and filled it in with the fill-config command. However, the results were much worse than the non-transformer model. The .spacy files were created using the data-to-spacy command with no base model. I'm assuming the reason for the poor transformer results is that the data-to-spacy command did not tokenize the examples for transformer training. How do I use the data-to-spacy command to convert the Prodigy data to spaCy data that can be trained using transformers?

Could you share your commands explicitly? That might help me understand the steps a bit better.

Technically I don't think you need to run data-to-spacy, because you could also just use prodigy train directly. You can pass the --config there as well. Could you try training a model that way?
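For example, something along these lines, with your own dataset name and config path filled in:

prodigy train ./output --ner your_dataset --config config.cfg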

It's always possible that something in your config file is keeping the transformer model from converging. Are you able to confirm that the system has converged and that running for more epochs does not help?
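If you want to rule that out, one option is to override the stopping criteria on the command line; --training.max_steps and --training.patience map directly onto the [training] section of the config, and the values and paths below are only meant as an illustration:

python -m spacy train config.cfg --training.max_steps 40000 --training.patience 5000 --paths.train train.spacy --paths.dev dev.spacy --output model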

Thanks for your response. The reason I'm using the data-to-spacy method is that I train the model in Colab, and converting everything to spaCy format makes it easier to transfer the directory from my local machine.

As for the commands I'm using: first I run the data-to-spacy command.

prodigy data-to-spacy my_directory --ner my_dataset

Then I export the config file from the spaCy quickstart tool with the settings NER, GPU, and Accuracy.

In Colab I install the following:

!pip install -U pip setuptools wheel
!pip install -U 'spacy[cuda113,transformers,lookups]'
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

I fill in the base_config file with:

python -m spacy init fill-config base_config.cfg config.cfg

Then I train with:

!python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy --output model --gpu-id 0

I'm curious: do you get errors if you don't install torchvision and torchaudio? I could be wrong, but I didn't expect spaCy to need these as dependencies.
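If they aren't needed, it might also be worth trying a fresh Colab environment without that extra install, so that spacy[transformers] can pull in a compatible torch on its own; something like this, though I haven't tested it myself:

!pip install -U pip setuptools wheel
!pip install -U 'spacy[cuda113,transformers,lookups]'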

Ah, I wasn't aware of the Colab situation. Just to check: does Colab complain when you try to train a CPU model, one that's optimized for speed? Or did you run the CPU model locally and not on Colab?
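To take the GPU out of the equation entirely, you could try the same data with a CPU config exported from the quickstart with the efficiency setting; I'm calling it config_cpu.cfg here purely for illustration:

!python -m spacy train config_cpu.cfg --paths.train train.spacy --paths.dev dev.spacy --output model_cpu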

It does seem like other people are able to run spaCy on Colab, although it's unclear whether they're interested in using the GPU.

Since this question might be more related to spaCy than to Prodigy, it would be a good idea to post it in the spaCy discussion forum. The spaCy maintainers check that forum and might be better equipped to answer.

My initial question was, "How do I use the data-to-spacy command to convert the Prodigy data to spaCy data that can be trained using transformers?" The data-to-spacy command is a Prodigy command, not a spaCy command. Accordingly, is there a way to use this Prodigy command to generate training data suitable for transformers?

This command should already take care of that:

prodigy data-to-spacy my_directory --ner my_dataset

If you're training a spaCy pipeline with a transformer, spaCy will take care of aligning your annotations with the transformer's tokens on your behalf.
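If you'd like extra reassurance before training, you can also run spaCy's data debugging command against the exported files and your transformer config; the paths below are the ones from your earlier commands:

python -m spacy debug data config.cfg --paths.train train.spacy --paths.dev dev.spacy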

You can see me report on all of the required steps here. Since you mentioned you're running similar steps, but on Colab, that's why I'm thinking this might be a spaCy issue on top of Colab.

It's working now. Per your suggestion, I removed the PyTorch installation line:

!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Maybe that's it :man_shrugging:

In any case, thanks very much for your help!
