Using xlm-roberta model for tokenization

Hi, we have a requirement to use the xlm-roberta model for tokenization. CMIIW, based on the list here spacy-models/compatibility.json at master · explosion/spacy-models · GitHub, spaCy does not support the xlm-roberta model. Could you suggest how we could implement this? Thank you for the hints :')

Hi! The pipelines listed here were mostly added as pre-packaged examples of some common transformers for use with spaCy v2 – the spacy-transformers library lets you load and use any transformer weights, including xlm-roberta. It'll also take care of aligning the word piece tokenization of the transformer with the linguistic tokenization provided by spaCy. In spaCy v3, you can plug in any trained transformer embeddings and train a pipeline with them.
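For reference, this is roughly what the relevant block of a spaCy v3 training config looks like – a minimal sketch from memory of the default transformer config, so double-check it against the output of `spacy init config`:

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "xlm-roberta-base"
tokenizer_config = {"use_fast": true}
```

The `name` is just the Hugging Face model identifier, so swapping in a different transformer later only means changing that one string.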

The automatic tokenization alignment also means that you'll be able to annotate using the linguistic tokenization, which is usually more intuitive and means you can work with real "words" instead of arbitrary chunks. You can then train your model with spaCy and initialise it with the transformer embeddings you want to use. This also makes it easier to swap out the transformer if you want to try out different ones (that may also use different tokenizers).

The alternative would be annotating the word piece tokens, which can sometimes be confusing because of the added characters like ## etc., or annotating at the character level (e.g. with --highlight-chars) and risking annotations that don't map to actual tokens, which isn't ideal. If you want to see the exact tokenization of the transformer, here's an example recipe that you can adjust to use the respective tokenizer: prodigy-recipes/ at master · explosion/prodigy-recipes · GitHub
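To make the ## issue concrete, here's a toy greedy WordPiece-style splitter. This is an illustration only, not the real transformer tokenizer (and note that xlm-roberta actually uses SentencePiece markers rather than BERT-style ##), but it shows how subword pieces stop lining up with annotation-friendly words:

```python
# Toy WordPiece-style tokenizer: greedy longest-match subword split.
# Real tokenizers use a large learned vocab; this tiny vocab is made up.
def toy_wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at all
        start = end
    return pieces

vocab = {"justin", "bi", "##eber"}
tokens = []
for word in "justin bieber".split():
    tokens.extend(toy_wordpiece(word, vocab))
print(tokens)  # ['justin', 'bi', '##eber']
```

An annotator asked to highlight "bieber" now has to select two pieces, one of which starts with ##, which is exactly the confusion described above.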

Hi Ines, thank you for your suggestions :D.

To give you a little bit more context, we have tasks similar to NER where the annotators should highlight phrases of a sentence and assign a topic label to each phrase. What we're currently trying is the ner.manual recipe with a blank:en model. One requirement is to use the xlm-roberta tokenizer, so that annotation uses the same tokenization as the model we train.

The first solution seems appealing to me as it has automatic tokenization alignment. I wonder if we can do this in Prodigy 1.10 (as we prefer to use a stable release). CMIIW, spaCy v3 will be supported in the upcoming Prodigy 1.11. Do you have a rough estimate of when v1.11 will become a stable release?

I also stumbled on this thread, Custom Tokenizer, where you suggested saving out the model and packaging it with spacy package. Do you think this solution is valid for my problem?

The model training can be entirely separate and you wouldn't have to do that with Prodigy – you can use prodigy data-to-spacy to export your annotations and spacy convert in spaCy v3 to create a corpus in spaCy's format that you can train from. You can do this in a separate environment with spaCy v3, independent from your Prodigy environment.
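Under those assumptions, the workflow could look roughly like this. The dataset name, file paths, and exact flags below are placeholders sketched from memory, not verified against the docs, so check `prodigy data-to-spacy --help` and `spacy convert --help` for the current signatures:

```shell
# 1) Export the annotations from Prodigy into spaCy's JSON training format
prodigy data-to-spacy ./train.json --lang en --ner my_ner_dataset

# 2) In a separate spaCy v3 environment, convert to the binary .spacy format
python -m spacy convert ./train.json ./corpus

# 3) Train with a config that plugs in the xlm-roberta embeddings
python -m spacy train config.cfg --paths.train ./corpus/train.spacy
```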

In that case, the tokenization alignment would be taken care of automatically under the hood. You can also do the alignment yourself if you need to. Here's the library we use for it in spacy-transformers: GitHub - explosion/tokenizations: Robust and Fast tokenizations alignment library for Rust and Python
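If you want a feel for what such an alignment produces, here's a simplified pure-Python sketch. The real tokenizations library handles lowercasing, normalization, and fuzzy matches robustly; this toy version assumes the pieces concatenate back to the words exactly:

```python
# Simplified character-offset alignment between two tokenizations of the
# same text. Illustration only; use the tokenizations library in practice.
def char_spans(tokens):
    """Map each token to a (start, end) character span, stripping '##'."""
    spans, pos = [], 0
    for tok in tokens:
        surface = tok[2:] if tok.startswith("##") else tok
        spans.append((pos, pos + len(surface)))
        pos += len(surface)
    return spans

def align(coarse, fine):
    """For each coarse token, list the indices of overlapping fine tokens."""
    c_spans, f_spans = char_spans(coarse), char_spans(fine)
    return [
        [j for j, (fs, fe) in enumerate(f_spans) if fs < ce and fe > cs]
        for cs, ce in c_spans
    ]

words = ["justin", "bieber"]
pieces = ["justin", "bi", "##eber"]
print(align(words, pieces))  # [[0], [1, 2]]
```

The alignment tells you that the word "bieber" corresponds to pieces 1 and 2, which is what lets word-level annotations be projected onto word piece embeddings and back.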

We're hoping to have the stable release of Prodigy v1.11 out next week or the week after – it mostly depends on how the testing goes and whether there are remaining problems we need to fix.

That's really only relevant if you want custom tokenization rules. spaCy also generally expects the tokenization to be non-destructive (i.e. always preserve the original input text). That's not typically the case for wordpiece tokenizers etc., because they'll often lowercase the text and add control characters like ##. Those tokens are also usually not what you want to work with later on. You typically want to work with the actual words, e.g. "Justin Bieber" → ["Justin", "Bieber"] instead of ["justin", "bi", "##eber"].
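To illustrate the non-destructive invariant, a small sketch – the token/whitespace pairs below are hand-written stand-ins for what spaCy exposes as token.text and token.whitespace_:

```python
text = "Justin Bieber sings."

# spaCy-style tokenization is non-destructive: token texts plus their
# trailing whitespace always reconstruct the exact input.
tokens = [("Justin", " "), ("Bieber", " "), ("sings", ""), (".", "")]
assert "".join(t + ws for t, ws in tokens) == text

# A lowercased wordpiece tokenization of the same text is destructive:
# casing and spacing are gone, so the original can't be recovered.
pieces = ["justin", "bi", "##eber", "sings", "."]
assert "".join(p.lstrip("#") for p in pieces) != text
```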