Using xlm-roberta model for tokenization

Hi Ines, thank you for your suggestions :D.

To give you a little more context: we have tasks similar to NER, where annotators highlight phrases in a sentence and assign a topic label to each phrase. What we're trying is the ner.manual recipe with a blank:en model. One requirement is to use the xlm-roberta model for tokenization, so that annotation uses the same tokenizer we use to train our model.

The first solution seems appealing to me, as it handles tokenization alignment automatically. I wonder if we can do this in Prodigy 1.10, since we prefer to stay on a stable release. Correct me if I'm wrong, but spaCy 3 will only be supported in the upcoming Prodigy v1.11. Do you have a rough estimate of when v1.11 will become a stable release?

I also stumbled on the Custom Tokenizer thread, where you suggested saving out the model and packaging it with spacy_package. Do you think that solution would work for my problem?