Turkish: a language spaCy doesn’t yet provide pre-trained models for

Do you have a model for the Turkish language?

I saw that

What is the best roadmap for creating a new model?
What is the minimum information I need to create a new model?

Thanks for your advice

If you want to train a model from scratch and you also want to train a tagger and dependency parser, you probably want to start off with an existing treebank. More details here:

Using Prodigy, you can then create your own named entity annotations. This is, by the way, how the named entity recognizer of the new Greek model (currently available for spacy-nightly) was trained. It’s a slightly more involved process, because you want to make sure you have enough examples (at least a few thousand, fully annotated) and also a good evaluation set. Because you need gold-standard annotations (all labels annotated and no missing values), you probably want to use the ner.manual recipe and label the data by hand.

To train a named entity recognizer in this particular context (a language like Turkish or Greek without a pre-trained model), do we need to have a dependency parser? Or is it enough to annotate only named entities? What is the recommended number of sentences for, say, 5 different data types?

spaCy's components can be trained independently, so you don't need a parser to train an entity recognizer. (The xx_ent_wiki_sm model is an example of a model that was only trained on NER annotations and that only has an NER component.)
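As a rough illustration of an NER-only pipeline, here is a minimal sketch that starts from a blank Turkish language class and adds just an entity recognizer. It assumes spaCy is installed; the label names are hypothetical placeholders, and the branch on the version number is there only because the call for adding a component differs between spaCy v2 and v3. The component still has to be trained on your annotations before it predicts anything useful.

```python
import spacy

# Blank Turkish pipeline: tokenizer and language data only, no trained components.
nlp = spacy.blank("tr")

# Adding a component works differently in spaCy v2 and v3.
if int(spacy.__version__.split(".")[0]) >= 3:
    ner = nlp.add_pipe("ner")
else:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)

# Hypothetical label scheme, for illustration only.
for label in ("PERSON", "ORG", "LOC"):
    ner.add_label(label)

# The pipeline contains the entity recognizer and nothing else (no tagger, no parser).
print(nlp.pipe_names)
```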

That really depends on your data, entity frequencies etc. We typically recommend annotating at least a few hundred to a few thousand examples for meaningful results that you can draw conclusions from. If you're using Prodigy, you can periodically run the train-curve recipe to check whether your model is improving with more data and to see if you're on the right track.
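Since entity frequencies matter, it can help to count how many spans you have per label before training. The following is a minimal sketch using only the standard library. It assumes the annotations were exported as JSONL where each line has a "text", an optional "answer", and a list of "spans" with "start", "end" and "label" keys; the function name and the sample data are made up for illustration.

```python
import json
from collections import Counter


def count_entity_labels(jsonl_lines):
    """Return (number of accepted examples, Counter of spans per label)."""
    label_counts = Counter()
    n_examples = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        # Assumption: examples without an "answer" key count as accepted.
        if task.get("answer", "accept") != "accept":
            continue
        n_examples += 1
        for span in task.get("spans", []):
            label_counts[span["label"]] += 1
    return n_examples, label_counts


# Made-up sample data standing in for an exported annotation file.
sample_tasks = [
    {"text": "Ayşe lives in Ankara.", "answer": "accept",
     "spans": [{"start": 0, "end": 4, "label": "PERSON"},
               {"start": 14, "end": 20, "label": "LOC"}]},
    {"text": "This one was rejected.", "answer": "reject", "spans": []},
    {"text": "Turkish Airlines flies to Izmir.", "answer": "accept",
     "spans": [{"start": 0, "end": 16, "label": "ORG"},
               {"start": 26, "end": 31, "label": "LOC"}]},
]
sample_lines = [json.dumps(task) for task in sample_tasks]

n_examples, label_counts = count_entity_labels(sample_lines)
print(n_examples, dict(label_counts))
```

Labels with very few spans are the ones most likely to need more annotation before the results tell you much.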
