I'm looking to create a dependency parser and POS tagger for Afrikaans (and later, other South African languages). I have a labelled treebank corpus but am not certain how to proceed. I've looked at this link and I gather that I require a pretrained language model? Will pretrained fastText embeddings suffice? Any help on how to get started training the dependency parser and POS tagger would be a great help.
As an aside, I used the pretrained English model in the recipe linked above and the results, for simple sentences, aren't that bad. Because Dutch is very similar to Afrikaans, I tried using that instead, but it seems as though there isn't a parser to start with for Dutch.
You don't have to start off with pretrained embeddings – but if you do have some for your language, this will likely result in better overall accuracy. See here for an example of how to initialize a base model with fastText embeddings (which you can then use for training): Linguistic Features · spaCy Usage Documentation
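To make the vector format concrete: fastText's `.vec` text files start with a `count dim` header line, followed by one whitespace-separated `word f1 f2 ... fdim` row per vocabulary entry. A minimal parsing sketch (the toy words and vectors below are illustrative, not from spaCy or fastText):

```python
# Sketch: reading fastText's plain-text .vec format, the format that
# spaCy v2's init-model command consumes via --vectors-loc.

def read_vec(lines):
    """Parse .vec lines: a 'count dim' header, then one
    'word f1 f2 ... fdim' row per vocabulary entry."""
    it = iter(lines)
    n_words, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        assert len(values) == dim, "row width must match header dim"
        vectors[word] = values
    return n_words, dim, vectors

sample = [
    "2 3",               # 2 words, 3 dimensions
    "hond 0.1 0.2 0.3",
    "kat 0.4 0.5 0.6",
]
n, d, vecs = read_vec(sample)
print(n, d, vecs["hond"])  # → 2 3 [0.1, 0.2, 0.3]
```

In practice you wouldn't parse the file yourself: with spaCy v2 you'd point the CLI at it directly, e.g. `python -m spacy init-model af /path/to/af_vectors_model --vectors-loc cc.af.300.vec.gz` (output path is a placeholder; check the fastText downloads page for the exact Afrikaans vectors file name).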
If you have a lot of raw text and word vectors, you could also try out the new (and still somewhat experimental) spacy pretrain, which lets you pretrain the token-to-vector layer with a language modelling objective. We've seen pretty good results with it on downstream tasks, it's easy to train yourself, and it has virtually no impact on runtime speed: Command Line Interface · spaCy API Documentation
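As a rough sketch of how the two steps fit together (spaCy v2 CLI syntax; all paths and file names below are placeholders, and it assumes your raw Afrikaans text is in JSONL form):

```shell
# Pretrain the token-to-vector layer on raw text with a
# language-modelling objective. texts.jsonl holds one JSON object
# per line, e.g. {"text": "..."}.
python -m spacy pretrain texts.jsonl /path/to/af_vectors_model /path/to/pretrain_out

# Then initialize tagger/parser training from the pretrained weights.
# model999.bin is just an example epoch file from the pretrain output.
python -m spacy train af /path/to/model_out train.json dev.json \
    --pipeline tagger,parser \
    --init-tok2vec /path/to/pretrain_out/model999.bin
```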
The nl_core_news_sm model was trained on the Dutch Lassy corpus and has a tagger, parser and entity recognizer: Dutch · spaCy Models Documentation. So you could also experiment with fine-tuning that for Afrikaans, or mixing both corpora so you have more data. It's potentially tricky and may not be worth it, though, because different corpora use different label schemes, etc.
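If you want to try the fine-tuning route, a rough sketch with spaCy v2's CLI (paths are placeholders; this assumes your treebank is in CoNLL-U format and that your spaCy version supports `--base-model`):

```shell
# Convert the CoNLL-U treebank to spaCy's JSON training format.
python -m spacy convert af_treebank.conllu /path/to/converted --converter conllu

# Fine-tune the Dutch model's tagger and parser on the Afrikaans data.
python -m spacy train af /path/to/model_out \
    /path/to/converted/train.json /path/to/converted/dev.json \
    --base-model nl_core_news_sm --pipeline tagger,parser
```

Keep in mind the label-scheme caveat above: if the Afrikaans treebank's tags and dependency labels don't match what nl_core_news_sm was trained on, starting from the Dutch model may help less than training from scratch.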