In your article only English models are cited. Could you please advise me, as my task involves Greek?
These are the models we prepackaged as spaCy model packages – but you can use any other models, assuming they're loadable via the transformers package. Check out the spacy-transformers documentation: https://github.com/explosion/spacy-transformers#setting-up-the-pipeline
In general, Prodigy doesn't really care what model you use or how you set up your stream – you can always write a custom recipe to load your model during annotation. It just needs to be loadable in Python.
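For illustration, here's a minimal sketch of what such a custom recipe could look like. The recipe name ner.custom-model and the add_entity_suggestions helper are made up for this example; the only real requirements are that the model is loadable in Python and that the stream yields dicts with a "text" key. The sketch assumes a spaCy pipeline with an entity recognizer, but you could swap in any loading logic:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe(
    "ner.custom-model",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file of {'text': ...} examples", "positional", None, str),
    model=("Any model loadable in Python, here a spaCy pipeline", "positional", None, str),
)
def custom_model_recipe(dataset, source, model):
    nlp = spacy.load(model)  # swap in whatever loading logic your model needs

    def add_entity_suggestions(stream):
        # Pre-highlight the model's entity predictions so you only correct them
        for eg in stream:
            doc = nlp(eg["text"])
            eg["spans"] = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ]
            yield eg

    stream = add_tokens(nlp, JSONL(source))  # ner_manual needs token boundaries
    return {
        "dataset": dataset,
        "stream": add_entity_suggestions(stream),
        "view_id": "ner_manual",
    }

You'd then run it with something like prodigy ner.custom-model my_dataset texts.jsonl my_model -F recipe.py.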
Thank you. I ran the code I found at the link you provided:
from spacy_transformers import TransformersLanguage, TransformersWordPiecer, TransformersTok2Vec
name = "bert-base-multilingual-cased"
nlp = TransformersLanguage(trf_name=name, meta={"lang": "multi"})
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(TransformersWordPiecer.from_pretrained(nlp.vocab, name))
nlp.add_pipe(TransformersTok2Vec.from_pretrained(nlp.vocab, name))
print(nlp.pipe_names)  # ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']
When I passed a Greek sentence to the model, I got a Doc whose tokens had vectors but lacked the linguistic attributes found in spaCy, like pos_ and tag_.
In contrast, and as expected, spacy_stanza offers a solution that provides me with these attributes, as shown below:
import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="el")
nlp_el = StanzaLanguage(snlp)

doc1 = nlp_el("Αύριο είναι Πέμπτη στην Ελλάδα.")
for token in doc1:
    print(f"{token.text}, {token.pos_}, {token.tag_}")
Αύριο, ADV, ADV
είναι, AUX, AUX
Πέμπτη, PROPN, PROPN
σ, ADP, AsPpSp
την, DET, AtDf
Ελλάδα, PROPN, PROPN
., PUNCT, PUNCT
Going forward, I must do the following:
1. Train both models in an unsupervised manner on voluminous Greek text with specialized language and usage.
2. Find a way (if this is practical) to complement the BERT model imported through spacy_transformers with the linguistic attributes that spaCy offers.
3. Train both models to recognize the entities that are of interest to my application.
4. Finally, compare their performance on my task.
Could you point me to code/tutorials/articles that will help me with tasks (1), (2), and (3) above? Thank you.
Yes, it's expected that there are no POS tags or other linguistic attributes, because you're creating a new model from scratch using only the multilingual BERT representations. The model has no trained components, so it's not going to predict anything. BERT weights don't magically predict anything on their own – you can use them to train new models with better representations (compared to just basic word vectors). But you still need to train a model on your data.
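To make that concrete, here's a rough sketch of putting a trainable component on top of those BERT representations, loosely following the text classification example in the spacy-transformers (v0.x) docs. The labels and the one-example TRAIN_DATA are placeholders for your own annotated Greek data, not a recommendation for your task:

import random
from spacy.util import minibatch
from spacy_transformers import TransformersLanguage, TransformersWordPiecer, TransformersTok2Vec

# Rebuild the pipeline from the earlier snippet (multilingual BERT, no trained components)
name = "bert-base-multilingual-cased"
nlp = TransformersLanguage(trf_name=name, meta={"lang": "multi"})
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(TransformersWordPiecer.from_pretrained(nlp.vocab, name))
nlp.add_pipe(TransformersTok2Vec.from_pretrained(nlp.vocab, name))

# Placeholder training data – replace with your annotated Greek examples
TRAIN_DATA = [
    ("Αύριο είναι Πέμπτη στην Ελλάδα.", {"cats": {"RELEVANT": 1.0, "OTHER": 0.0}}),
]

# Add a trainable text classifier on top of the BERT representations
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
textcat.add_label("RELEVANT")
textcat.add_label("OTHER")
nlp.add_pipe(textcat)

optimizer = nlp.resume_training()
for epoch in range(4):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(losses)

The same idea applies to other components: the BERT weights supply the representations, and a new component trained on your annotations does the actual predicting.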
spacy-stanza, on the other hand, wraps pretrained Stanza models that include a trained part-of-speech tagger, so you'll get a Doc object with the annotations predicted by the model.
I'd suggest starting out a bit simpler: plan out your model components and the data you need, and create that data first. You can always experiment with different representations and try out different strategies for making your model more accurate on your data later on.