Well, in this example it's just using the tokenizer and loading it from a vocabulary file – but you can also use the `AutoTokenizer` or `AutoModel` classes provided by the `transformers` library to load a tokenizer or model from a string name and set everything up automatically. You could even use the inference API to process the texts and call that from Python if you want.
If the transformer you're loading predicts named entities or text categories, you can feed those predictions into Prodigy to view and correct them – this way, you get a feeling for the model, and you only need to correct its mistakes. The only piece of code that you have to write is the part that takes the output produced by the model and sends out data in Prodigy's format to annotate. For example, `{"text": "...", "label": "SOME_LABEL"}` for text classification, or `{"text": "...", "tokens": [...], "spans": [...]}` for annotating named entities. The JSON you need depends on the annotation interface you want to use – you can find an overview of the interfaces and the data formats they expect in the Prodigy docs under "Annotation interfaces".
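To make the "takes the model output and sends out Prodigy's format" step more concrete, here's a minimal sketch. The function names (`make_textcat_task`, `simple_tokenize`, `make_ner_task`, `stream_ner_tasks`) and the shape of the model predictions (a list of dicts with character `start`, `end` and `label`) are assumptions made up for illustration, and the regex tokenizer is only a stand-in for whatever tokenization your model uses. The token and span keys follow my understanding of the format described in the docs, so double-check them against the interface you pick.

```python
import json
import re


def make_textcat_task(text, label):
    """Build a text classification task: {"text": ..., "label": ...}."""
    return {"text": text, "label": label}


def simple_tokenize(text):
    """Stand-in tokenizer (assumption): words and punctuation with character offsets."""
    tokens = []
    for i, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        end = match.end()
        tokens.append({
            "text": match.group(),
            "start": match.start(),
            "end": end,
            "id": i,
            # True if the token is followed by whitespace in the original text
            "ws": end < len(text) and text[end].isspace(),
        })
    return tokens


def make_ner_task(text, entities):
    """Build an NER task from predicted entities.

    entities: list of {"start": int, "end": int, "label": str} with character
    offsets. This shape is hypothetical; adapt it to your model's output.
    """
    tokens = simple_tokenize(text)
    start_to_id = {t["start"]: t["id"] for t in tokens}
    end_to_id = {t["end"]: t["id"] for t in tokens}
    spans = []
    for ent in entities:
        # Skip predictions that don't line up with token boundaries
        if ent["start"] not in start_to_id or ent["end"] not in end_to_id:
            continue
        spans.append({
            "start": ent["start"],
            "end": ent["end"],
            "label": ent["label"],
            "token_start": start_to_id[ent["start"]],
            "token_end": end_to_id[ent["end"]],  # index of the last token in the span
        })
    return {"text": text, "tokens": tokens, "spans": spans}


def stream_ner_tasks(texts, predict):
    """Yield one NER task per text; `predict` is any callable returning entities."""
    for text in texts:
        yield make_ner_task(text, predict(text))


if __name__ == "__main__":
    # Dummy "model" that always predicts the first five characters as an ORG
    dummy_predict = lambda text: [{"start": 0, "end": 5, "label": "ORG"}]
    for task in stream_ner_tasks(["Apple hired Ines in Berlin."], dummy_predict):
        print(json.dumps(task))  # one JSON object per line (JSONL)
```

You'd swap `dummy_predict` for a function that calls your transformer and maps its predictions to character offsets. Dropping spans that don't align with token boundaries is a design choice in this sketch; you might prefer to snap them to the nearest token instead.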
Here's another example that shows how to plug in any custom model for NER: see "Named Entity Recognition" in the Prodigy docs.
Prodigy comes with out-of-the-box support for spaCy (because we develop both libraries) and it includes various workflows for annotating data with a spaCy model in the loop, training a spaCy model etc. But you don't have to use Prodigy with spaCy – you can also use any other library (or no library at all if you want to annotate manually and just export your data later on).
spaCy v3 brings easier support for training custom models with shared transformer embeddings – so instead of just starting with word vectors or no pretrained embeddings at all, you can now train a Hebrew model with heBERT embeddings and will likely get more accurate models with less data. This is pretty independent of Prodigy – in fact, the current release version of Prodigy still uses spaCy v2 for the spaCy-powered workflows (and we have a Prodigy nightly version that updates to spaCy v3, which is mostly relevant for training).
Thanks, that's nice to hear!