Dependency Parser and POS tagger for unsupported language

Hi there, thanks for the great library!

I'm looking to create a dependency parser and POS tagger for Afrikaans (and later, other South African languages). I have a labelled treebank corpus but am not certain how to proceed. I've looked at this link and I gather that I require a pretrained language model? Will pretrained fastText embeddings suffice? Any help on how to get started training the dependency parser and POS tagger would be much appreciated.

As an aside, I used the pretrained English model in the recipe linked above and the results, for simple sentences, aren't that bad :slight_smile: . Because Dutch is very similar to Afrikaans, I tried using that instead, but it seems as though there isn't a pretrained parser for Dutch to start from.

Hi and thanks! :slightly_smiling_face:

You don't have to start off with pretrained embeddings – but if you do have some for your language, this will likely result in better overall accuracy. See here for an example of how to initialize a base model with fastText embeddings (which you can then use for training): https://spacy.io/usage/vectors-similarity#converting
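To make that concrete, here's a rough sketch of initializing a base model from fastText vectors, assuming spaCy v2's CLI and the official Common Crawl fastText vectors for Afrikaans; the output directory name is just a placeholder, and it assumes your spaCy version ships basic Afrikaans language data (`spacy.lang.af`):

```shell
# Download the pretrained Afrikaans fastText vectors (Common Crawl, 300d)
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.af.300.vec.gz

# Create a blank Afrikaans model that contains those vectors
# (this is the base model you'd then train on top of)
python -m spacy init-model af ./af_vectors_model --vectors-loc cc.af.300.vec.gz
```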

Also see this example for how to use the spacy convert and spacy train commands to train a model from a CoNLL-U-formatted treebank: https://spacy.io/usage/training#spacy-train-cli
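The convert-then-train workflow from that example looks roughly like this, assuming spaCy v2's CLI; the treebank file names and output paths are placeholders for your own data:

```shell
# Convert CoNLL-U treebank files to spaCy's JSON training format
python -m spacy convert af-train.conllu ./converted --converter conllu
python -m spacy convert af-dev.conllu ./converted --converter conllu

# Train a tagger and parser from the converted data; --vectors is optional
# and points at a model directory that contains pretrained word vectors
python -m spacy train af ./output ./converted/af-train.json ./converted/af-dev.json \
    --pipeline tagger,parser --vectors ./af_vectors_model
```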

If you have a lot of raw text and word vectors, you could also try out the new (and still somewhat experimental) spacy pretrain, which lets you pretrain the token-to-vector layer with a language modelling objective. We've seen pretty good results with it on downstream tasks, and it's easy to train yourself and has virtually no impact on runtime speed :smiley: https://spacy.io/api/cli#pretrain
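A minimal sketch of that, assuming spaCy v2.1's (experimental) `pretrain` command: the raw text goes in a JSONL file with one `{"text": ...}` object per line, and the file and model paths here are placeholders:

```shell
# Pretrain the token-to-vector layer on raw text, using a model
# with word vectors as the target
python -m spacy pretrain raw.jsonl ./af_vectors_model ./pretrained

# Then initialize training from the pretrained weights
# (pretrain writes one modelN.bin per epoch; pick the last one)
python -m spacy train af ./output train.json dev.json \
    --pipeline tagger,parser --init-tok2vec ./pretrained/model9.bin
```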

The nl_core_news_sm model was trained on the Dutch Lassy corpus and has a tagger, parser and entity recognizer: https://spacy.io/models/nl So you could also experiment with fine-tuning that for Afrikaans, or with mixing both corpora so you have more data. This can be tricky and may not be worth it, though, since different corpora often use different label schemes etc.
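If you want to try the fine-tuning route, a sketch might look like this, assuming a spaCy v2 version whose `train` command supports the `--base-model` option; the Afrikaans treebank files are placeholders (and would need to be converted to spaCy's JSON format first):

```shell
# Fetch the pretrained Dutch model to start from
python -m spacy download nl_core_news_sm

# Continue training its tagger and parser on the Afrikaans data
python -m spacy train nl ./output af-train.json af-dev.json \
    --pipeline tagger,parser --base-model nl_core_news_sm
```

Note that this keeps the Dutch language code and tokenization rules, which is part of why mixing the two can get messy.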

Brilliant. Thank you @ines. I'll move forward with your suggestions and report back.
