Using transformer models inside Prodigy and fine-tuning

Hi,
Does Prodigy support using transformer models like BERT or XLNet for text classification? Currently I am only able to use en_core_web_sm, en_core_web_md, en_core_web_lg, etc.

Hi! Prodigy's training recipes currently use spaCy's regular textcat model – but you can always export the data and use the training scripts provided in spacy-transformers.

I've been experimenting a lot with workflows for annotating with transformers in the loop, but it's been pretty tricky so far. The models typically require larger batch sizes and don't always respond that well to small single updates. They're also a bit slow, especially on CPU, so you can't really avoid having to wait around for the model to finish updating. A more promising approach IMO is to use Prodigy to collect a small dataset with textcat.manual, use it to fine-tune a large transformer model (ideally to very high accuracy), and then use that to semi-automatically create large volumes of training data for a more efficient (and just as accurate) runtime model. I've been working on some recipes for this that I'll hopefully be able to share soon :slightly_smiling_face:

Sure, thanks Ines!

Hi,
I have a JSONL file that I exported from a Prodigy dataset after manually annotating about 2500 examples. I would like to use it to fine-tune a BERT transformer from inside spaCy for text classification. Can you please guide me on how to do this? Thanks!

I came across this but it looks too complicated for a starter. https://github.com/explosion/spacy-transformers/blob/master/examples/train_textcat.py

Merged the two transformers threads to keep things in one place :slightly_smiling_face:

Well, this is the full end-to-end training loop with preprocessing, various settings, evaluation logic, early stopping, pretty output and so on. Usage is pretty straightforward – a single command. You can even run it without data to test it on example data (IMDB dataset):

python train_textcat.py en_trf_bertbaseuncased_lg

You can run python train_textcat.py --help to see the available arguments. To run it on your own data, create a directory with two files, training.jsonl and evaluation.jsonl, each containing one entry per line in the following format, and then pass it in as the input directory.

["This is a text", {"cats": {"LABEL_ONE": 0.0, "LABEL_TWO": 1.0}}]

This should be very easy to create from Prodigy annotations collected with textcat.manual – the "accept" list of each annotation contains the correct labels. All other labels are zero.
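As a rough illustration of that conversion, here is a minimal sketch. It assumes each exported annotation is a dict with "text", "accept" and "answer" keys, as produced by textcat.manual with multiple labels. The function names and the sample annotations are made up for the example, and dropping every answer other than "accept" is an assumption, not something the training script requires.

```python
import json


def prodigy_to_textcat(examples, labels):
    """Convert annotations into [text, {"cats": {...}}] rows.

    `examples` is an iterable of annotation dicts and `labels` is the full
    label set. Both names are hypothetical helpers for this sketch.
    """
    rows = []
    for eg in examples:
        # Assumption: only keep examples the annotator accepted.
        if eg.get("answer") != "accept":
            continue
        selected = set(eg.get("accept", []))
        # Selected labels get 1.0, all other labels get 0.0.
        cats = {label: 1.0 if label in selected else 0.0 for label in labels}
        rows.append([eg["text"], {"cats": cats}])
    return rows


def write_jsonl(path, rows):
    """Write one JSON-encoded row per line."""
    with open(path, "w", encoding="utf8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Made-up annotations, for illustration only.
annotations = [
    {"text": "This is a text", "accept": ["LABEL_TWO"], "answer": "accept"},
    {"text": "Skipped one", "accept": [], "answer": "ignore"},
]
rows = prodigy_to_textcat(annotations, ["LABEL_ONE", "LABEL_TWO"])
print(json.dumps(rows[0]))
# prints: ["This is a text", {"cats": {"LABEL_ONE": 0.0, "LABEL_TWO": 1.0}}]
```

From there, you could split the rows into two lists and call write_jsonl once for training.jsonl and once for evaluation.jsonl.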

Great! Thank you very much for the support Ines!

Hi Ines,
What about those annotations where "answer" was "reject" instead of "accept"? Those will not have an "accept" field. Please correct me if I am wrong, but I assume we cannot use those annotations for training or evaluation, so they need not be added to training.jsonl or evaluation.jsonl.

Also, can you please explain how evaluation is done using evaluation.jsonl file? When I run the command, by default, it takes positive label for evaluation as one of the possible labels.

Hi Ines!

Just stumbled upon this post. I'm currently collecting data to fine-tune a transformers model for text classification.

Would you mind elaborating on the third step you are suggesting: "to semi-automatically create large volumes of training data for a more efficient (and just as accurate) runtime model"?

  • How would this be done? Is this some kind of data augmentation technique you are talking about?

Any pointers or resources would be really helpful! :slight_smile:

Best regards,
Simon

I think people also refer to this as "uptraining". Basically, a workflow could look like this:

  1. Manually label a small set of examples from scratch and train a large transformer-based model. That model may be pretty accurate already, but possibly very slow and unwieldy.
  2. Use that model to predict categories in your text and stream those in for annotation.
    • If the score is above a certain threshold, automatically add the example to your dataset and assume it's correct.
    • If the score is below a certain threshold, skip it entirely.
    • If the score is between the low and high threshold, send it out for annotation again and correct the model if it's wrong.
  3. Train a new and more efficient model on all collected data. This model now ideally has a lot of data to learn from and may achieve the same accuracy as a much larger model trained on fewer examples. But you didn't have to spend time and create all those annotations from scratch.

A workflow like this should be pretty straightforward to implement with Prodigy and a custom recipe that processes the incoming texts and only yields out selected examples and uses the database API to auto-add annotations to the dataset.
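The threshold logic from step 2 can be sketched in plain Python. This is only an illustration under some assumptions: score_fn stands in for whatever returns the transformer's score for a label, the threshold values are arbitrary, and the function and variable names are invented for the example. In a real custom recipe, the auto-accepted list would be saved through the database API and the review list would be yielded as the stream.

```python
def route_by_score(stream, score_fn, label, low=0.2, high=0.9):
    """Split incoming examples into auto-accepted tasks and tasks to review.

    `score_fn` is a hypothetical callable mapping a text to a score in [0, 1].
    """
    auto_accepted, to_review = [], []
    for eg in stream:
        score = score_fn(eg["text"])
        task = {"text": eg["text"], "label": label, "meta": {"score": score}}
        if score >= high:
            # Confident prediction: add it without asking the annotator.
            task["answer"] = "accept"
            auto_accepted.append(task)
        elif score > low:
            # Uncertain prediction: send it out for annotation.
            to_review.append(task)
        # Scores at or below `low` are skipped entirely.
    return auto_accepted, to_review


# Made-up scores standing in for a fine-tuned transformer model.
fake_scores = {"great product": 0.97, "it was okay": 0.55, "see attachment": 0.05}
stream = [{"text": text} for text in fake_scores]
auto, review = route_by_score(stream, fake_scores.get, "POSITIVE")
print([t["text"] for t in auto])    # ['great product']
print([t["text"] for t in review])  # ['it was okay']
```

Keeping the score in "meta" makes it easier to inspect the auto-accepted examples later and adjust the thresholds.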

Really cool, I will experiment with it. Thanks a lot for the explanation! :slight_smile:

@Simpan Cool, if you end up trying this, let me know how you go! I had a look and here's an old gist I found from some early experiments – maybe you can use that as inspiration or a starting point. (It still refers to spacy_pytorch_transformers instead of spacy_transformers and there may be a few other things that need to be adjusted. But it's more about the proposed workflow. The textcat.pytt.create-data is the interesting part – you can kinda ignore the rest.)

The trickier parts were finding the right thresholds and the right model for the task – and also the right task that's difficult enough that it'd benefit enough from a large transformer. For some of the datasets I created, the transformer model didn't actually beat a simpler CNN + bag of words architecture, or at least not by much.
