Can Prodigy be used to automatically train/predict on any BERT model in Hugging Face's cloud?

Hi :slightly_smiling_face:

I am considering using Prodigy for my next token classification task. I would like to fine-tune a BERT model from the Hugging Face repository. It's a Hebrew model (avichr/heBERT · Hugging Face).

  1. Will I be able to benefit from the automatic "learn as you tag" provided by Prodigy with this model?

  2. Will it be possible to fine-tune this model later on for a token classification task using the tagged dataset I will create?

Thank you.

Sure, if you can load the model in Python, you can use it in Prodigy :slightly_smiling_face: For instance, here's an example of using a BERT tokenizer to create NER annotations for updating/training/fine-tuning a transformer model:

After you've annotated your data, you can export it and use it to train/fine-tune your model. The data you export includes everything you need for this: the original text, the tokens, the annotated spans and their labels (see here for an example).
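As a rough illustration of what you can do with the exported data, here is a minimal sketch that turns one NER record into per-token BIO tags, which is the shape most token classification trainers expect. The record below is hand-written for illustration and only assumes the fields mentioned above (text, tokens, spans with labels); `spans_to_bio` is a hypothetical helper, not part of Prodigy, and it assumes `token_end` is inclusive.

```python
def spans_to_bio(example):
    """Convert one Prodigy-style NER record into a list of BIO tags, one per token."""
    tags = ["O"] * len(example["tokens"])
    for span in example.get("spans", []):
        # Assumption: token_start/token_end are token indices and token_end is inclusive
        start, end = span["token_start"], span["token_end"]
        tags[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + span["label"]
    return tags


# Hand-written record for illustration (not real exported data)
example = {
    "text": "Dana lives in Tel Aviv",
    "tokens": [
        {"text": "Dana", "start": 0, "end": 4, "id": 0},
        {"text": "lives", "start": 5, "end": 10, "id": 1},
        {"text": "in", "start": 11, "end": 13, "id": 2},
        {"text": "Tel", "start": 14, "end": 17, "id": 3},
        {"text": "Aviv", "start": 18, "end": 22, "id": 4},
    ],
    "spans": [
        {"start": 0, "end": 4, "token_start": 0, "token_end": 0, "label": "PERSON"},
        {"start": 14, "end": 22, "token_start": 3, "token_end": 4, "label": "GPE"},
    ],
    "answer": "accept",
}

bio_tags = spans_to_bio(example)
print(list(zip([t["text"] for t in example["tokens"]], bio_tags)))
```

If you tokenized with the model's own word-piece tokenizer during annotation, the tags line up with the model's tokens directly; otherwise you would still need an alignment step before training.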

Whether you can update the model in the loop is a different question and will depend on the model you're actually training. Prodigy will always let you update a model via the update callback, so if your model supports "online learning", you'll be able to update it in the loop in Prodigy. However, if you're working with a large transformer model, you typically want to update it in larger batches and make multiple passes over the data. The model might also be a bit too slow to do both inference and updating in the loop (at least on CPU), so it might not be very efficient or effective for the actual annotation process.
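To make the trade-off concrete, here is a minimal sketch of an update callback that does not train on every small batch of answers, but buffers accepted examples and only calls a training function once enough have accumulated. `BufferedUpdater`, `train_fn`, `min_batch` and `make_components` are hypothetical names used for illustration; the sketch only assumes that a custom recipe returns a dictionary of components and that the update callback receives a list of annotated examples.

```python
class BufferedUpdater:
    """Collect accepted answers and train only once a larger batch is available."""

    def __init__(self, train_fn, min_batch=64):
        self.train_fn = train_fn    # hypothetical: your own fine-tuning function
        self.min_batch = min_batch  # assumption: tune this to your model and hardware
        self.buffer = []

    def __call__(self, answers):
        # Keep only the examples the annotator accepted
        accepted = [eg for eg in answers if eg.get("answer") == "accept"]
        self.buffer.extend(accepted)
        if len(self.buffer) >= self.min_batch:
            # Hand over the whole buffer and start a fresh one
            batch, self.buffer = self.buffer, []
            self.train_fn(batch)


def make_components(dataset, stream, updater):
    """Components a custom recipe function would return (sketch, not a full recipe)."""
    return {
        "dataset": dataset,
        "stream": stream,
        "update": updater,
        "view_id": "ner_manual",
    }
```

In a real recipe this dictionary would be returned from a function decorated with `@prodigy.recipe`. Even with buffering, a large transformer may be too slow to retrain while you annotate, which is why training offline on the exported data is usually the more practical route.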

That said, if you start off with good embeddings, you won't need as much training data either. So you're probably better off just creating some annotations from scratch to bootstrap your model and train it on the initial data. You can then have the model suggest annotations and only correct it if it's wrong, which should save you a lot of time and let you create a larger dataset very quickly.

Sounds cool! And thanks for the detailed answer :relaxed:

I'm still confused about which doc or recipe explains where/how I enter my transformer model's "name" or load it. I can see where the dataset and vocab are being set, but not where the BERT transformer name can be set, which is a bit confusing.

spaCy 3.0 is new and I still don't understand the relation between it and Prodigy, so if there is some interface code to write, it would be great if you could point me to an example.

Thanks, and cheers for the cool new tool! I'm really looking forward to working with it :smile:

Well, in this example it's just using the tokenizer and it's loading it from a vocabulary file – but you can also use the AutoTokenizer or AutoModel classes provided by the transformers library to load a transformer from a string name and set up everything automatically. You could even use the inference API to process the texts and call that from Python if you want.
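As a sketch of where the model name goes when you load by string: the function below passes the name from the question to `AutoTokenizer.from_pretrained` and `AutoModelForTokenClassification.from_pretrained`. The helper names, the default `num_labels` and the choice of the token classification head are assumptions for illustration. The second helper converts a tokenizer's offset mapping into token dictionaries with `text`, `start`, `end` and `id`, which is the shape assumed here for pre-tokenized annotation tasks.

```python
def load_token_classifier(name="avichr/heBERT", num_labels=3):
    """Load tokenizer and model by their hub name (downloads on first use)."""
    # Imported lazily so this module can be imported without transformers installed
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    # num_labels is an assumption: set it to the size of your own tag set
    model = AutoModelForTokenClassification.from_pretrained(name, num_labels=num_labels)
    return tokenizer, model


def offsets_to_tokens(text, offsets):
    """Turn (start, end) character offsets into a list of token dictionaries."""
    tokens = []
    for start, end in offsets:
        if start == end:
            # Special tokens such as [CLS] and [SEP] have empty offsets; skip them
            continue
        tokens.append(
            {"text": text[start:end], "start": start, "end": end, "id": len(tokens)}
        )
    return tokens


# Usage (needs the transformers package, a fast tokenizer and a model download):
# tokenizer, model = load_token_classifier()
# encoding = tokenizer("some text", return_offsets_mapping=True)
# tokens = offsets_to_tokens("some text", encoding["offset_mapping"])
```

Note that `return_offsets_mapping=True` is only supported by fast tokenizers, so this part of the sketch assumes one is available for the model you load.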

If the transformer you're loading predicts named entities or text categories, you can feed those predictions into Prodigy to view and correct them – this way, you get a feeling for the model, and you only need to correct its mistakes. The only piece of code that you have to write is the part that takes the output produced by the model, and sends out data in Prodigy's format to annotate. For example, {"text": "...", "label": "SOME_LABEL"} for text classification, or {"text": "...", "tokens": [...], "spans": [...]} for annotating named entities. The JSON you need depends on the annotation interface you want to use – you can find an overview of the interfaces and data formats they expect here:
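To illustrate the conversion step described above, here is a minimal sketch of that "glue" code. It assumes the predictions look like the output of a transformers token classification pipeline with an aggregation strategy set (dictionaries with `entity_group`, `start` and `end`); the function names and the fake predictions are hypothetical and only there to show the shape of the data.

```python
def ner_predictions_to_task(text, predictions):
    """Build a span-annotation task from character-offset entity predictions."""
    spans = [
        {"start": int(p["start"]), "end": int(p["end"]), "label": p["entity_group"]}
        for p in predictions
    ]
    return {"text": text, "spans": spans}


def textcat_prediction_to_task(text, label):
    """Build a text classification task from a single predicted label."""
    return {"text": text, "label": label}


def make_stream(texts, predict_fn):
    """Yield one task per text; predict_fn is your own model call (hypothetical)."""
    for text in texts:
        yield ner_predictions_to_task(text, predict_fn(text))


# Fake predictions standing in for real model output
fake_predictions = [
    {"entity_group": "PERSON", "score": 0.98, "word": "Dana", "start": 0, "end": 4},
    {"entity_group": "GPE", "score": 0.95, "word": "Tel Aviv", "start": 14, "end": 22},
]
task = ner_predictions_to_task("Dana lives in Tel Aviv", fake_predictions)
print(task)
```

For an interface that works on pre-tokenized text you would also add a `"tokens"` list to each task, as in the second JSON example above.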

Here's another example that shows how to plug in any custom model for NER:

Prodigy comes with out-of-the-box support for spaCy (because we develop both libraries) and it includes various workflows for annotating data with a spaCy model in the loop, training a spaCy model etc. But you don't have to use Prodigy with spaCy – you can also use any other library (or no library at all if you want to annotate manually and just export your data later on).

spaCy v3 brings easier support for training custom models with shared transformer embeddings – so instead of just starting with word vectors or no pretrained embeddings at all, you can now train a Hebrew model with heBERT embeddings and likely train more accurate models on less data. This is pretty independent of Prodigy – in fact, the current release version of Prodigy still uses spaCy 2 for the spaCy-powered workflows (and we have a Prodigy nightly version that updates to spaCy v3, which is mostly relevant for training).

Thanks, that's nice to hear! :smiley: