Train POS on new Language


I’m training a model for a new language, starting from Word2Vec vectors. What is the best way to train POS? Should I be doing this in spaCy, or can I do it in Prodigy?


Once you have the training data, I would probably recommend using spaCy to do the model training, simply because you’ll have more control. Prodigy would be calling into spaCy anyway, and in general the POS tagging recipes are a bit less mature in Prodigy than the NER recipes.

For the annotation, you can probably find a better workflow than annotating every word in a sentence sequentially. Probably a good first step will be to make yourself a “tag dictionary”: a dictionary of valid tags for each word.

Take a sample of text and get a word frequency count. Then go down the frequency list, annotating for each word whether a given tag is valid. You might want to do this as a binary task for each tag: annotate each of the top, say, 2,000 words according to whether it can be a noun, then annotate whether those words can be verbs, etc.
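A minimal sketch of that workflow in plain Python (the sample texts and hard-coded annotation answers are placeholders; in practice the answers come from the annotator, e.g. via Prodigy's binary interface):

```python
from collections import Counter

# Placeholder sample corpus; in practice, load your own raw text.
texts = ["the cat sat on the mat", "the dog can bark"]

# Word frequency count over the sample.
counts = Counter(word for text in texts for word in text.split())

# Go down the frequency list: the top-N words are annotated first.
top_words = [word for word, _ in counts.most_common(2000)]

# One binary pass per tag: for each word, record whether that tag is valid.
# Hard-coded here for illustration only.
tag_dictionary = {word: set() for word in top_words}
for word in ["cat", "dog", "mat"]:
    tag_dictionary[word].add("NOUN")
for word in ["sat", "bark", "can"]:
    tag_dictionary[word].add("VERB")
```

The result is a mapping from each frequent word to the set of tags it can take, which is what the next step consumes.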

I think it’ll probably be faster to go tag-by-tag instead of doing all the tags at once. It’s less clicking, and it’ll likely be more accurate, because it’s hard to remember that some word can act as an adjective unless you ask yourself that explicitly.

Once you have a tag dictionary, you could use it to automatically annotate the sentences, and then make corrections.
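One way this pre-annotation step could look (the `pre_annotate` helper and the example dictionary are hypothetical, not a Prodigy or spaCy API): if the dictionary lists exactly one valid tag for a word, pre-fill it; otherwise leave it blank for the annotator to correct.

```python
# Hypothetical tag dictionary built in the earlier annotation passes.
tag_dictionary = {
    "the": {"DET"},
    "cat": {"NOUN"},
    "can": {"NOUN", "VERB", "AUX"},  # ambiguous: needs manual correction
    "sleep": {"NOUN", "VERB"},
}

def pre_annotate(tokens, tag_dictionary):
    """Propose a tag per token: if the dictionary lists exactly one
    valid tag, pre-fill it; otherwise leave None for manual review."""
    proposals = []
    for token in tokens:
        tags = tag_dictionary.get(token, set())
        proposals.append((token, next(iter(tags)) if len(tags) == 1 else None))
    return proposals

# Unambiguous words get pre-tagged; "can" and "sleep" are left for review.
print(pre_annotate("the cat can sleep".split(), tag_dictionary))
```

This turns full sequential tagging into a much lighter correction task, since most tokens in running text are unambiguous.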

We don’t have recipes for this workflow yet, as fewer users are training POS models from scratch. If there’s a Universal Dependencies corpus with an appropriate license for your language, you would be better off starting by training a model from that.
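spaCy's `convert` command can turn a UD `.conllu` file into spaCy's training format; roughly (the paths here are placeholders for your own corpus and output directory):

```shell
# Convert a Universal Dependencies treebank to spaCy's training format.
# Check `python -m spacy convert --help` for the options in your version.
python -m spacy convert ./UD_MyLanguage/train.conllu ./corpus/ --converter conllu
```

You can then point `spacy train` at the converted corpus, together with your Word2Vec vectors.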

Thanks Matthew. Will try to convert the UD corpus.