Using transformer models inside prodigy and finetuning

RohitRanga · February 28, 2020, 7:47am

Hi,
Does Prodigy support usage of Transformer models like BERT or XLNET for performing text classification? Currently I am able to use en_core_web_lg, sm, md, etc. only.

ines · February 28, 2020, 9:36am

Hi! Prodigy's training recipes currently use spaCy's regular textcat model – but you can always export the data and use the training scripts provided in spacy-transformers.

I've been experimenting a lot with workflows for annotating with transformers in the loop, but it's been pretty tricky so far. The models typically require larger batch sizes and don't always respond that well to small single updates. They're also a bit slow, especially on CPU, so you can't really avoid having to wait around for the model to finish updating. A more promising approach IMO is to use Prodigy to collect a small dataset with textcat.manual, use it to fine-tune a large transformer model (ideally to very high accuracy), and then use that to semi-automatically create large volumes of training data for a more efficient (and just as accurate) runtime model. I've been working on some recipes for this that I'll hopefully be able to share soon.

RohitRanga · February 28, 2020, 9:52am

Sure, thanks Ines!

RohitRanga · February 28, 2020, 12:00pm

Hi,
I have my jsonl file which has been exported from prodigy dataset after manually annotating about 2500 examples. I would like to use this to finetune a BERT transformer from inside spaCy for text classification. Can you please guide me on how I can do this? Thanks!

I came across this but it looks too complicated for a starter. https://github.com/explosion/spacy-transformers/blob/master/examples/train_textcat.py

ines · February 28, 2020, 2:06pm

Merged the two transformers threads to keep things in one place.

Well, this is the full end-to-end training loop with preprocessing, various settings, evaluation logic, early stopping, pretty output and so on. Usage is pretty straightforward – a single command. You can even run it without data to test it on example data (IMDB dataset):

python train_textcat.py en_trf_bertbaseuncased_lg

You can run python train_textcat.py --help to see the available arguments. To run it on your own data, create a directory with two files, training.jsonl and evaluation.jsonl, that look like this, and then pass it in as the input directory.

["This is a text", {"cats": {"LABEL_ONE": 0.0, "LABEL_TWO": 1.0}}]

This should be very easy to create from Prodigy annotations collected with textcat.manual – the "accept" list of each annotation contains the correct labels. All other labels are zero.
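A minimal sketch of that conversion, not an official script. It assumes each exported record is a choice-style annotation with "text", "accept" and "answer" keys, that you supply the full label list yourself, and that records whose answer is not "accept" are simply skipped. The function names and the sample labels are made up for illustration.

```python
import json


def to_textcat_row(example, labels):
    """Turn one annotation dict into [text, {"cats": {...}}]."""
    accepted = set(example.get("accept", []))
    # Accepted labels get 1.0, all other labels get 0.0.
    cats = {label: 1.0 if label in accepted else 0.0 for label in labels}
    return [example["text"], {"cats": cats}]


def convert_lines(lines, labels):
    """Yield training rows from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        # Assumption: only keep examples the annotator accepted.
        if example.get("answer") != "accept":
            continue
        yield to_textcat_row(example, labels)


def write_jsonl(path, rows):
    """Write one JSON array per line, matching the format shown above."""
    with open(path, "w", encoding="utf8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    labels = ["LABEL_ONE", "LABEL_TWO"]  # hypothetical label set
    sample = [
        '{"text": "This is a text", "accept": ["LABEL_TWO"], "answer": "accept"}',
        '{"text": "Skipped text", "accept": [], "answer": "ignore"}',
    ]
    for row in convert_lines(sample, labels):
        print(json.dumps(row))
```

To build the two files, you could run convert_lines over your exported file, hold out a portion of the rows, and call write_jsonl once for training.jsonl and once for evaluation.jsonl.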

RohitRanga · February 29, 2020, 4:21pm

Great! Thank you very much for the support Ines!

RohitRanga · March 1, 2020, 5:05pm

Hi Ines,
What about those annotations where "answer" was "reject" instead of "accept"? Those will not have an "accept" field. Please correct me if I am wrong, but I assume we cannot use those annotations for training or evaluation, and hence they need not be added to training.jsonl or evaluation.jsonl.

Also, can you please explain how evaluation is done using evaluation.jsonl file? When I run the command, by default, it takes positive label for evaluation as one of the possible labels.

Simpan · April 29, 2020, 4:48pm

Hi Ines!

Just stumbled upon this post. I'm currently collecting data to fine-tune a transformers model for text classification.

Would you mind elaborating on the third step you are suggesting: "to semi-automatically create large volumes of training data for a more efficient (and just as accurate) runtime model."

How would this be done? Is this some kind of data augmentation technique you are talking about?

Any pointers or resources would be really helpful!

Best regards,
Simon

ines · April 30, 2020, 10:01am

I think people also refer to this as "uptraining". Basically, a workflow could look like this:

1. Manually label a small set of examples from scratch and train a large transformer-based model. That model may be pretty accurate already, but possibly very slow and unwieldy.
2. Use that model to predict categories in your text and stream those in for annotation.
   - If the score is above a certain threshold, automatically add the example to your dataset and assume it's correct.
   - If the score is below a certain threshold, skip it entirely.
   - If the score is between the low and high threshold, send it out for annotation again and correct the model if it's wrong.
3. Train a new and more efficient model on all collected data. This model now ideally has a lot of data to learn from and may achieve the same accuracy as a much larger model trained on fewer examples. But you didn't have to spend time and create all those annotations from scratch.

A workflow like this should be pretty straightforward to implement with Prodigy and a custom recipe that processes the incoming texts and only yields out selected examples and uses the database API to auto-add annotations to the dataset.
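The routing logic in that workflow can be sketched in plain Python, independent of Prodigy. This is only an illustration: the threshold values, the function names, the task dict layout and the handling of scores exactly on a threshold are all assumptions, and in a real custom recipe the auto-accepted examples would be saved to the dataset while the rest are yielded to the annotation stream.

```python
def route(score, low=0.3, high=0.9):
    """Decide what to do with one prediction, based on its score."""
    if score >= high:
        return "auto_accept"  # trust the transformer's prediction
    if score < low:
        return "skip"  # drop the example entirely
    return "annotate"  # uncertain: ask a human


def partition(scored, low=0.3, high=0.9):
    """Split (text, label, score) tuples into auto-added and to-annotate lists."""
    auto_added, to_annotate = [], []
    for text, label, score in scored:
        decision = route(score, low, high)
        if decision == "auto_accept":
            # Hypothetical task layout for an automatically accepted example.
            auto_added.append({"text": text, "label": label, "answer": "accept"})
        elif decision == "annotate":
            to_annotate.append({"text": text, "label": label, "meta": {"score": score}})
    return auto_added, to_annotate


if __name__ == "__main__":
    predictions = [
        ("Great product", "POSITIVE", 0.97),
        ("It arrived on Tuesday", "POSITIVE", 0.12),
        ("Not bad, I guess", "POSITIVE", 0.55),
    ]
    auto_added, to_annotate = partition(predictions)
    print(len(auto_added), "auto-added,", len(to_annotate), "sent for annotation")
```

Keeping the decision in one small function makes it easy to try different thresholds against a held-out set before committing to them.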

Simpan · May 1, 2020, 10:08am

Really cool, will experiment with it. Thanks a lot for the explanation!

ines · May 1, 2020, 12:09pm

@Simpan Cool, if you end up trying this, let me know how you go! I had a look and here's an old gist I found from some early experiments – maybe you can use that as inspiration or a starting point. (It still refers to spacy_pytorch_transformers instead of spacy_transformers and there may be a few other things that need to be adjusted. But it's more about the proposed workflow. The textcat.pytt.create-data is the interesting part – you can kinda ignore the rest.)

https://gist.github.com/ines/dd618b5bdc544b4ff49b363e98c6368a

prodigy_textcat.py

"""
Very experimental (!) Prodigy recipes for text classification annotation with
transformer models. Requires Prodigy (https://prodi.gy) to be installed.

By taking advantage of transformer models like BERT and XLNet, we can train
a highly accurate text classifier using only a very small set of labelled
examples. Transformers are also very large and slow and not always a good fit for
production. However, we can use them to supervise a smaller and more efficient
runtime model (e.g. spaCy's built-in text classifier). First, we can create
a small manually labelled set and use it to fine-tune the pretrained transformer

(The gist preview is truncated here; see the link above for the full file.)

The trickier parts were finding the right thresholds and the right model for the task – and also the right task that's difficult enough that it'd benefit enough from a large transformer. For some of the datasets I created, the transformer model didn't actually beat a simpler CNN + bag of words architecture, or at least not by much.
