Train a POS tagger on a new language

Hi,

I’m training a model for a new language, starting from Word2Vec vectors. What is the best way to train a POS tagger? Should I be doing this in spaCy, or can I do it in Prodigy?

Thanks

Once you have the training data, I would probably recommend using spaCy to do the model training, simply because you’ll have more control. Prodigy would be calling into spaCy anyway, and in general the POS tagging recipes are a bit less mature in Prodigy than the NER recipes.
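If you do go the spaCy route, the training loop is fairly compact. Here’s a minimal sketch following spaCy v2’s API (the `TAG_MAP` and `TRAIN_DATA` below are made-up placeholders, modelled on spaCy’s train_tagger example):

```python
import random
import spacy

# Placeholder tag map and training data, just to show the shapes.
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
TRAIN_DATA = [
    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
]

nlp = spacy.blank("xx")  # substitute your language code
tagger = nlp.create_pipe("tagger")
for tag, values in TAG_MAP.items():
    tagger.add_label(tag, values)
nlp.add_pipe(tagger)

optimizer = nlp.begin_training()
for i in range(25):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(i, losses)
```

Since you’re starting from Word2Vec vectors, you’d want to load those into a base model first (in spaCy v2, the `init-model` command can do this) and start from that model instead of `spacy.blank`, so the tagger can use the pretrained vectors.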

For the annotation, you can probably find a better workflow than annotating every word in a sentence sequentially. A good first step will probably be to make yourself a “tag dictionary”: a mapping from each word to its set of valid tags.

Take a sample of text and get a word frequency count. Then go down the frequency list, annotating for each word whether a given tag is valid. You might want to do this as a binary task per tag: annotate every word in the top, say, 2,000 according to whether it can be a noun, then annotate whether each of those words can be a verb, and so on.
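To make the tag-by-tag pass concrete, here’s a rough sketch in plain Python (not a Prodigy recipe; `tokenized_sents` is a stand-in for whatever tokenized sample you’re working from):

```python
from collections import Counter

# Frequency count over a tokenized text sample.
freq = Counter(token.lower() for sent in tokenized_sents for token in sent)
top_words = [word for word, _ in freq.most_common(2000)]

# One binary pass per tag: for each word, record whether the tag is valid.
tag_dict = {word: set() for word in top_words}
for tag in ["NOUN", "VERB", "ADJ"]:  # ...and the rest of your tag set
    for word in top_words:
        if input(f"Can {word!r} be a {tag}? [y/n] ").strip().lower() == "y":
            tag_dict[word].add(tag)
```

In practice you’d present these as accept/reject questions in the Prodigy UI rather than via `input()`, but the loop structure is the same.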

I think it’ll probably be faster to go tag-by-tag instead of doing all the tags at once. It’s less clicking, and it’ll likely be more accurate, because it’s easy to forget that a word can act as, say, an adjective unless you ask yourself that question explicitly.

Once you have a tag dictionary, you could use it to automatically annotate the sentences, and then make corrections.
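For instance, a simple policy is to pre-fill a tag wherever the dictionary has exactly one option, and leave everything else blank for manual review (a sketch, using the `tag_dict` built above):

```python
def pre_annotate(tokens, tag_dict):
    """Pre-fill a tag where the dictionary has exactly one option;
    leave ambiguous or unknown tokens untagged for manual correction."""
    annotated = []
    for token in tokens:
        tags = tag_dict.get(token.lower(), set())
        tag = next(iter(tags)) if len(tags) == 1 else None
        annotated.append((token, tag))
    return annotated

pre_annotate(["Eat", "blue", "ham"], {"eat": {"VERB"}, "blue": {"ADJ", "NOUN"}})
# -> [('Eat', 'VERB'), ('blue', None), ('ham', None)]
```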

We don’t have recipes for this workflow yet, since relatively few users train POS models from scratch. If there’s a Universal Dependencies corpus with an appropriate license for your language, you’d be better off starting by training a model from that: https://github.com/UniversalDependencies
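For reference, spaCy’s CLI can convert UD’s .conllu files into its training format; something along these lines should work with spaCy v2 (exact flags may vary by version, and the file names here are placeholders):

```
python -m spacy convert xx-ud-train.conllu ./corpus --converter conllu
python -m spacy convert xx-ud-dev.conllu ./corpus --converter conllu
```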

Thanks Matthew. Will try to convert the UD corpus.