In your article only English models are cited. Could you please advise me, as my task involves Greek?
These are the models we prepackaged as spaCy model packages – but you can use any other models, assuming they're loadable via the transformers package. Check out the spacy-transformers documentation: https://github.com/explosion/spacy-transformers#setting-up-the-pipeline
In general, Prodigy doesn't really care what model you use or how you set up your stream – you can always write a custom recipe to load your model during annotation. It just needs to be loadable in Python.
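For illustration, here's a minimal sketch of what such a custom recipe could look like. The recipe name ner.custom-model and the add_entity_suggestions helper are made up for this example; the only real requirements are that the model is loadable in Python and that the stream yields dicts with a "text" key. The sketch assumes a spaCy pipeline with an entity recognizer, but you could swap in any loading logic:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe(
    "ner.custom-model",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file of {'text': ...} examples", "positional", None, str),
    model=("Any model loadable in Python, here a spaCy pipeline", "positional", None, str),
)
def custom_model_recipe(dataset, source, model):
    nlp = spacy.load(model)  # swap in whatever loading logic your model needs

    def add_entity_suggestions(stream):
        # Pre-highlight the model's entity predictions so you only correct them
        for eg in stream:
            doc = nlp(eg["text"])
            eg["spans"] = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ]
            yield eg

    stream = add_tokens(nlp, JSONL(source))  # ner_manual needs token boundaries
    return {
        "dataset": dataset,
        "stream": add_entity_suggestions(stream),
        "view_id": "ner_manual",
    }

You'd then run it with something like prodigy ner.custom-model my_dataset texts.jsonl my_model -F recipe.py.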
Thank you. I ran the code I found at the link you provided:
from spacy_transformers import TransformersLanguage, TransformersWordPiecer, TransformersTok2Vec
name = "bert-base-multilingual-cased"
nlp = TransformersLanguage(trf_name=name, meta={"lang": "multi"})
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(TransformersWordPiecer.from_pretrained(nlp.vocab, name))
nlp.add_pipe(TransformersTok2Vec.from_pretrained(nlp.vocab, name))
print(nlp.pipe_names)  # ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']
When I passed a Greek sentence to the model, I got a Doc whose tokens had vectors but lacked the linguistic attributes found in spaCy, like pos_ and tag_.
In contrast, and as expected, spacy_stanza offers a solution that provides me with these attributes, as shown below:
import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="el")
nlp_el = StanzaLanguage(snlp)

doc1 = nlp_el("Αύριο είναι Πέμπτη στην Ελλάδα.")
for token in doc1:
    print(f"{token.text}, {token.pos_}, {token.tag_}")
Αύριο, ADV, ADV
είναι, AUX, AUX
Πέμπτη, PROPN, PROPN
σ, ADP, AsPpSp
την, DET, AtDf
Ελλάδα, PROPN, PROPN
., PUNCT, PUNCT
Going forward, I must do the following:
1. Train both models in an unsupervised manner on voluminous Greek text with specialized language and usage.
2. Find a way (if this is practical) to complement the BERT model imported through spacy_transformers with the linguistic attributes that spaCy offers.
3. Train both models to recognize the entities that are of interest to my application.
4. Finally, compare their performance on my task.
Could you point me to code/tutorials/articles that will help me with tasks (1), (2), and (3) above? Thank you.
Yes, it's expected that there are no POS tags or other linguistic attributes, because you're creating a new model from scratch using only the multilingual BERT representations. The model has no trained components, so it's not going to predict anything. BERT weights don't magically predict anything on their own – you can use them to train new models with better representations (compared to just basic word vectors). But you still need to train a model on your data.
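To make that concrete, here's a rough sketch of putting a trainable component on top of those BERT representations, loosely following the text classification example in the spacy-transformers (v0.x) docs. The labels and the one-example TRAIN_DATA are placeholders for your own annotated Greek data, not a recommendation for your task:

import random
from spacy.util import minibatch
from spacy_transformers import TransformersLanguage, TransformersWordPiecer, TransformersTok2Vec

# Rebuild the pipeline from the earlier snippet (multilingual BERT, no trained components)
name = "bert-base-multilingual-cased"
nlp = TransformersLanguage(trf_name=name, meta={"lang": "multi"})
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(TransformersWordPiecer.from_pretrained(nlp.vocab, name))
nlp.add_pipe(TransformersTok2Vec.from_pretrained(nlp.vocab, name))

# Placeholder training data – replace with your annotated Greek examples
TRAIN_DATA = [
    ("Αύριο είναι Πέμπτη στην Ελλάδα.", {"cats": {"RELEVANT": 1.0, "OTHER": 0.0}}),
]

# Add a trainable text classifier on top of the BERT representations
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
textcat.add_label("RELEVANT")
textcat.add_label("OTHER")
nlp.add_pipe(textcat)

optimizer = nlp.resume_training()
for epoch in range(4):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(losses)

The same idea applies to other components: the BERT weights supply the representations, and a new component trained on your annotations does the actual predicting.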
spacy-stanza, on the other hand, wraps pretrained Stanza models that include a trained part-of-speech tagger, so you'll get a Doc object with the annotations predicted by the model.
I'd suggest starting out a bit simpler: plan out your model components and the data you need, and create that data first. You can always experiment with different representations and try out different strategies for making your model more accurate on your data later on.