Do you have a model for the Turkish language?
I saw that
What is the best roadmap for creating a new model?
What is the minimum information I need to create a new model?
Thanks for your advice
If you want to train a model from scratch and you also want to train a tagger and dependency parser, you probably want to start off with an existing treebank. More details here:
Using Prodigy, you can then create your own named entity annotations. This is, by the way, how the named entity recognizer of the new Greek model (currently available for spacy-nightly) was trained. It's a slightly more involved process, because you want to make sure you have enough examples (at minimum a few thousand fully annotated) and also a good evaluation. Because you need gold-standard annotations (all labels annotated and no missing values), you probably want to use the ner.manual recipe and label the data by hand.
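To make the "gold-standard annotations and a good evaluation" point a bit more concrete, here is a minimal sketch in plain Python. It assumes your annotations are exported as records with a "text" and a list of "spans" carrying "start", "end" and "label" (the shape Prodigy uses for NER tasks). The helper names, the label set and the sample sentences are made up for illustration and are not part of Prodigy or spaCy.

```python
import random
from collections import Counter

# Hypothetical label scheme and records, for illustration only.
LABELS = {"PERSON", "LOC"}
EXAMPLES = [
    {"text": "Ahmet Ankara şehrinde yaşıyor.",
     "spans": [{"start": 0, "end": 5, "label": "PERSON"},
               {"start": 6, "end": 12, "label": "LOC"}]},
    {"text": "Ayşe dün İzmir şehrine gitti.",
     "spans": [{"start": 0, "end": 4, "label": "PERSON"},
               {"start": 9, "end": 14, "label": "LOC"}]},
    {"text": "Bugün hava güzel.", "spans": []},
]


def check_spans(example, labels):
    """Return a list of problems found in one annotated example."""
    problems = []
    text_length = len(example["text"])
    prev_end = 0
    for span in sorted(example.get("spans", []), key=lambda s: s["start"]):
        start, end, label = span["start"], span["end"], span["label"]
        if label not in labels:
            problems.append(f"unknown label {label!r}")
        if not (0 <= start < end <= text_length):
            problems.append(f"span {start}-{end} is outside the text")
            continue
        if start < prev_end:
            problems.append(f"span {start}-{end} overlaps an earlier span")
        prev_end = max(prev_end, end)
    return problems


def split_train_eval(examples, eval_fraction=0.2, seed=0):
    """Shuffle reproducibly and hold out a fixed share for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def label_counts(examples):
    """Count how often each entity label occurs across all examples."""
    return Counter(s["label"] for eg in examples for s in eg.get("spans", []))


all_problems = [p for eg in EXAMPLES for p in check_spans(eg, LABELS)]
train_set, eval_set = split_train_eval(EXAMPLES)
counts = label_counts(EXAMPLES)
```

Holding the evaluation examples out once, with a fixed seed, keeps your scores comparable between training runs, and the label counts give you a quick view of which entity types are still underrepresented.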
To train an entity recognizer in this particular context (a language like Turkish or Greek without a pre-defined model), do we need a dependency parser? Or is it enough to annotate only named entities? What is the recommended number of sentences for, say, 5 different datatypes?
spaCy's components can be trained independently, so you don't need a parser to train an entity recognizer. (The xx_ent_wiki_sm model is an example of a model that was only trained on NER annotations and that only has an NER component.)
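As a rough illustration of an NER-only pipeline, the sketch below starts from a blank Turkish pipeline and adds just an entity recognizer. It assumes spaCy v3 (the API differs in v2, which was current when this thread was written), and the two training sentences and their labels are invented toy data, far too little for a usable model.

```python
import random

import spacy
from spacy.training import Example

# Toy NER-only annotations as character offsets (hypothetical data).
TRAIN_DATA = [
    ("Ahmet Ankara şehrinde yaşıyor.",
     {"entities": [(0, 5, "PERSON"), (6, 12, "LOC")]}),
    ("Ayşe dün İzmir şehrine gitti.",
     {"entities": [(0, 4, "PERSON"), (9, 14, "LOC")]}),
]

random.seed(0)

# Blank Turkish pipeline: tokenizer only, no tagger or parser.
nlp = spacy.blank("tr")
ner = nlp.add_pipe("ner")  # the only trainable component

# Register every label that occurs in the annotations.
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# Initialize the weights, then run a few update passes over the toy data.
optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

print(nlp.pipe_names)  # ['ner']
```

Nothing in this pipeline depends on part-of-speech tags or dependency labels, which is why entity annotations alone are enough to get started.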
That really depends on your data, entity frequencies etc. We typically recommend annotating at least a few hundred to a few thousand examples for meaningful results that you can draw conclusions from. If you're using Prodigy, you can periodically run the train-curve recipe to check whether your model is improving with more data and whether you're on the right track.
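The idea behind a training curve can be sketched in a few lines of plain Python: train on growing portions of the data and score each model on the same held-out set. This is not how Prodigy's train-curve recipe is implemented. The function names are hypothetical, and the "model" is a toy stand-in that simply memorizes entity strings, so that the example runs without any dependencies.

```python
def make_example(text, start, end, label):
    """Build one annotated record with a single entity span."""
    return {"text": text,
            "spans": [{"start": start, "end": end, "label": label}]}


def entities(example):
    """Return the (entity text, label) pairs annotated in an example."""
    return [(example["text"][s["start"]:s["end"]], s["label"])
            for s in example["spans"]]


def memorize(train_examples):
    """Toy stand-in for training: remember every entity string seen."""
    return {pair for eg in train_examples for pair in entities(eg)}


def recall(model, eval_examples):
    """Share of gold entities in the evaluation set the 'model' knows."""
    gold = [pair for eg in eval_examples for pair in entities(eg)]
    if not gold:
        return 0.0
    return sum(1 for pair in gold if pair in model) / len(gold)


def train_curve(train_examples, eval_examples, train_fn, score_fn,
                fractions=(0.25, 0.5, 0.75, 1.0)):
    """Train on growing portions of the data, score on a fixed eval set."""
    results = []
    for fraction in fractions:
        n = max(1, int(len(train_examples) * fraction))
        model = train_fn(train_examples[:n])
        results.append((fraction, n, score_fn(model, eval_examples)))
    return results


TRAIN = [
    make_example("Ahmet geldi", 0, 5, "PERSON"),
    make_example("Ankara güzel", 0, 6, "LOC"),
    make_example("Ayşe geldi", 0, 4, "PERSON"),
    make_example("İzmir güzel", 0, 5, "LOC"),
]
EVAL = [
    make_example("Ahmet gitti", 0, 5, "PERSON"),
    make_example("Ankara büyük", 0, 6, "LOC"),
    make_example("Ayşe gitti", 0, 4, "PERSON"),
    make_example("İzmir büyük", 0, 5, "LOC"),
]

curve = train_curve(TRAIN, EVAL, memorize, recall)
for fraction, n, score in curve:
    print(f"{fraction:.0%} of data ({n} examples): score {score:.2f}")
```

If the score is still climbing in the last step, more annotations of the same kind are likely to help. If it flattens out early, collecting more of the same data will probably not change much, and it may be worth revisiting the label scheme or the data itself.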