Training POS Tagger for Indonesian Language

Hi, I'm currently trying to train my own spaCy model for POS tagging in Indonesian, which doesn't have a pretrained model like "en_core_web". Which steps should I follow to train a new POS model? Is it the same as training a new entity type?

Hello, I want to ask how to do POS tagging for Indonesian with Prodigy, since spaCy doesn't have an Indonesian language model for POS tagging.

Sincere thanks!

Hi Desy,

Creating a part-of-speech annotated corpus can be a surprisingly challenging task, as there will be a lot of linguistic subtleties to the annotation scheme. For instance, in English, there are many words where it's unclear whether the correct tag should be adjective or verb. I'm sure the problems will be different for Indonesian, but there will certainly be some.

I would suggest trying to find a previous part-of-speech annotated corpus you can use to train an initial model. You might also find another tool that is able to predict Indonesian parts of speech -- for instance, perhaps the StanfordNLP library has an Indonesian model?

If you're able to use a different tool, you can always train a model with spaCy on its output. You can also use Prodigy to help you correct some of the errors. However, the part-of-speech tagging support in Prodigy is somewhat limited, as it's a less common task to need to annotate than text classification or named entity recognition.

Hi Matthew, thanks for your input. I'd like to ask a follow-up question: if I use another tool like the NLTK library to do the POS tagging, how can I convert its output into a spaCy model? Thanks in advance.

I think what @honnibal is suggesting is that you take an existing pretrained model and run it over a lot of text to automatically annotate the data. So instead of labelling the data yourself, you let the model label it. At the end of it, you have labelled data that you can train a spaCy model with.
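To make that a bit more concrete, here's a rough sketch of the "convert the other tool's output" step. It assumes your tagger gives you sentences as lists of `(word, tag)` pairs and that the tags are already Universal POS tags (if they aren't, you'd need to map them first). The `to_conllu` helper and the sample sentence are made up for illustration; they're not part of spaCy, Prodigy or NLTK.

```python
from typing import Iterable, List, Tuple

# One sentence = a list of (word, tag) pairs, as produced by some other tagger.
TaggedSentence = List[Tuple[str, str]]


def to_conllu(sentences: Iterable[TaggedSentence]) -> str:
    """Render tagged sentences as CoNLL-U text.

    Only the ID, FORM and UPOS columns are filled in. Every other column
    gets the "_" placeholder, since the tagger output has no lemmas,
    morphology or dependency information.
    """
    blocks = []
    for sentence in sentences:
        lines = []
        for index, (word, tag) in enumerate(sentence, start=1):
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            columns = [str(index), word, "_", tag, "_", "_", "_", "_", "_", "_"]
            lines.append("\t".join(columns))
        blocks.append("\n".join(lines))
    if not blocks:
        return ""
    # CoNLL-U expects a blank line after every sentence, including the last.
    return "\n\n".join(blocks) + "\n\n"


# Hypothetical tagger output for one Indonesian sentence.
sample = [[("Saya", "PRON"), ("makan", "VERB"), ("nasi", "NOUN")]]
conllu_text = to_conllu(sample)
```

You could then write `conllu_text` to a `.conllu` file and pass it to spaCy's `python -m spacy convert` command to get training data in spaCy's own format. I haven't checked how the converter treats a file with only the tag column filled in, so treat that part as something to verify against your spaCy version.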

Hi Ines, thank you so much for the kind response. I have one more question: if there isn't any pretrained POS tagging model for Indonesian, what model should we use to train these POS tags?