Dependency Parser and POS tagger for unsupported language

Hi there, thanks for the great library!

I'm looking to create a dependency parser and POS tagger for Afrikaans (and later, other South African languages). I have a labelled treebank corpus but am not certain how to proceed. I've looked at this link and I gather that I require a pretrained language model? Will pretrained fastText embeddings suffice? Any help on how to get started training the dependency parser and POS tagger would be much appreciated.

As an aside, I used the pretrained English model in the recipe linked above and the results, for simple sentences, aren't that bad :slight_smile: . Because Dutch is very similar to Afrikaans, I tried using that instead, but it seems as though there isn't a pretrained parser for Dutch to start from.

Hi and thanks! :slightly_smiling_face:

You don't have to start off with pretrained embeddings – but if you do have some for your language, this will likely result in better overall accuracy. See here for an example of how to initialize a base model with fastText embeddings (which you can then use for training): https://spacy.io/usage/vectors-similarity#converting
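To make that concrete, here's a rough sketch of initializing a base model from fastText vectors, assuming spaCy v2's CLI and the official Common Crawl fastText vectors for Afrikaans; the output directory name is just a placeholder, and it assumes your spaCy version ships basic Afrikaans language data (`spacy.lang.af`):

```shell
# Download the pretrained Afrikaans fastText vectors (Common Crawl, 300d)
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.af.300.vec.gz

# Create a blank Afrikaans model that contains those vectors
# (this is the base model you'd then train on top of)
python -m spacy init-model af ./af_vectors_model --vectors-loc cc.af.300.vec.gz
```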

Also see this example for how to use the spacy convert and spacy train commands to train a model from a CoNLL-U-formatted treebank: https://spacy.io/usage/training#spacy-train-cli
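The convert-then-train workflow from that example looks roughly like this, assuming spaCy v2's CLI; the treebank file names and output paths are placeholders for your own data:

```shell
# Convert CoNLL-U treebank files to spaCy's JSON training format
python -m spacy convert af-train.conllu ./converted --converter conllu
python -m spacy convert af-dev.conllu ./converted --converter conllu

# Train a tagger and parser from the converted data; --vectors is optional
# and points at a model directory that contains pretrained word vectors
python -m spacy train af ./output ./converted/af-train.json ./converted/af-dev.json \
    --pipeline tagger,parser --vectors ./af_vectors_model
```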

If you have a lot of raw text and word vectors, you could also try out the new (and still somewhat experimental) spacy pretrain, which lets you pretrain the token-to-vector layer with a language modelling objective. We've seen pretty good results with it on downstream tasks, and it's easy to train yourself and has virtually no impact on runtime speed :smiley: https://spacy.io/api/cli#pretrain
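A minimal sketch of that, assuming spaCy v2.1's (experimental) `pretrain` command: the raw text goes in a JSONL file with one `{"text": ...}` object per line, and the file and model paths here are placeholders:

```shell
# Pretrain the token-to-vector layer on raw text, using a model
# with word vectors as the target
python -m spacy pretrain raw.jsonl ./af_vectors_model ./pretrained

# Then initialize training from the pretrained weights
# (pretrain writes one modelN.bin per epoch; pick the last one)
python -m spacy train af ./output train.json dev.json \
    --pipeline tagger,parser --init-tok2vec ./pretrained/model9.bin
```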

The nl_core_news_sm model was trained on the Dutch Lassy corpus and has a tagger, parser and entity recognizer: https://spacy.io/models/nl So you could also experiment with fine-tuning that for Afrikaans, or with mixing both corpora so you have more data. This can be tricky and may not be worth it, though, since different corpora often use different label schemes etc.
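If you want to try the fine-tuning route, a sketch might look like this, assuming a spaCy v2 version whose `train` command supports the `--base-model` option; the Afrikaans treebank files are placeholders (and would need to be converted to spaCy's JSON format first):

```shell
# Fetch the pretrained Dutch model to start from
python -m spacy download nl_core_news_sm

# Continue training its tagger and parser on the Afrikaans data
python -m spacy train nl ./output af-train.json af-dev.json \
    --pipeline tagger,parser --base-model nl_core_news_sm
```

Note that this keeps the Dutch language code and tokenization rules, which is part of why mixing the two can get messy.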

Brilliant. Thank you @ines. I'll move forward with your suggestions and report back.
