German NER model

Hi @ines,
I have gathered a 15k-example dataset in German. I ran the prodigy train command with different base models. I couldn't find a vectors-only model for German, so I used the large German pipeline. I was expecting higher accuracy from de_core_news_lg than from en_vectors_web_lg because of the language match. Why is there a 4% difference?

en_vectors_web_lg - 92%
de_core_news_lg - 88%

Thanks,

Hi @mystuff,

I think the problem might be that you're "resuming" the weights from de_core_news_lg, instead of starting from new weights and just using the vectors. Can you paste the commands you used for each model?

Hi @honnibal,
Thanks for the reply. Here are the commands I used:

python -m prodigy train ner de_20000 de_core_news_lg --output de_20000_core_lg
python -m prodigy train ner de_20000 en_vectors_web_lg --output de_20000_en_vectors

Hi @mystuff,

Yes, I do think the issue is the resumed weights. You can create a model that keeps the vectors but drops the trained pipeline components like this:

import spacy

# Load the full German pipeline, then serialize it with all trained
# components (tagger, parser, NER) disabled, so only the tokenizer,
# vocab and word vectors are written to disk.
nlp = spacy.load("de_core_news_lg")
with nlp.disable_pipes(*nlp.pipe_names):
    nlp.to_disk("./de_vectors_news_lg")
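The stripped model can then be passed to Prodigy in place of the full pipeline. A sketch reusing the dataset name from your commands above (the output directory name is just illustrative):

```shell
# Train from the vectors-only model, so NER weights start fresh
python -m prodigy train ner de_20000 ./de_vectors_news_lg --output de_20000_de_vectors
```

This way the training starts from new weights initialized with the German vectors, rather than continuing from de_core_news_lg's existing NER weights.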

Alternatively, if you want to build a new model with your own vectors, you can do that with the spacy init-model command, as described here: https://spacy.io/usage/vectors-similarity#converting. You can convert vectors from tools like FastText, so you could use the pretrained vectors from fasttext.cc.
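For the init-model route, a minimal sketch, assuming the pretrained German fastText vectors from fasttext.cc (cc.de.300.vec.gz is their standard distribution name) and spaCy v2's init-model CLI:

```shell
# Download the pretrained German fastText vectors (large file, ~1.2 GB)
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz

# Create a fresh, blank German model containing only these vectors
python -m spacy init-model de ./de_fasttext_vectors --vectors-loc cc.de.300.vec.gz
```

The resulting ./de_fasttext_vectors directory can then be used as the base model for prodigy train, just like the stripped de_core_news_lg model.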