NER: English training dataset for German language

mystuff · May 6, 2020, 12:54pm

Hi @ines, I have English training data in Prodigy format for 10 entities, and I am getting 90% accuracy for my domain, thanks to both Prodigy and spaCy.
Now I need to do the same NER in German.

Will the existing English training data work for German NER if I use spaCy's multi-language model?
If not, what is the best approach in this situation?
Do I need to collect German training data separately?

ines · May 7, 2020, 11:59am

Glad to hear your model is working well!

Typically, you'd create different training data for different languages, yes. You want the training data to be as close as possible to the data that the model will see at runtime. For a German NER model, that'd be German text with entities. There are also certain language differences that can have an impact: in English text, capitalisation can be a strong indicator for named entities and the model can take advantage of that. In German, that's not the case at all, because all nouns are capitalised. So that should probably be reflected in your training data - otherwise, your model may get very confused.

That said, if the entities you're looking for are similar and you already have annotated data, there's no need to do everything from scratch. For example, you could use your annotated English entities to create match patterns and then use those in ner.manual. This will pre-highlight those entities for you, so you have less work when creating your German data.
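As a rough illustration of that suggestion, here is a minimal Python sketch that turns existing annotations into match patterns. It assumes the English annotations are exported as Prodigy-style records with a "text" field and a "spans" list of character offsets. The sample texts, the ORG and GPE labels, and the function name spans_to_patterns are made up for the example, and splitting on whitespace only approximates spaCy's tokenization, so patterns for entities containing punctuation may need manual adjustment.

```python
import json

# Hypothetical sample of annotated English examples (text + character-offset spans).
examples = [
    {
        "text": "Siemens opened an office in Berlin.",
        "spans": [
            {"start": 0, "end": 7, "label": "ORG"},
            {"start": 28, "end": 34, "label": "GPE"},
        ],
    },
    {
        "text": "Deutsche Bank and Siemens signed a deal.",
        "spans": [
            {"start": 0, "end": 13, "label": "ORG"},
            {"start": 18, "end": 25, "label": "ORG"},
        ],
    },
]


def spans_to_patterns(examples):
    """Build one case-insensitive token pattern per unique (label, entity text) pair."""
    seen = set()
    patterns = []
    for eg in examples:
        text = eg["text"]
        for span in eg.get("spans", []):
            phrase = text[span["start"]:span["end"]]
            key = (span["label"], phrase.lower())
            if key in seen:
                continue  # skip entities we already have a pattern for
            seen.add(key)
            # Assumption: whitespace splitting stands in for real tokenization.
            tokens = [{"lower": tok.lower()} for tok in phrase.split()]
            patterns.append({"label": span["label"], "pattern": tokens})
    return patterns


patterns = spans_to_patterns(examples)

# Emit the patterns as JSONL, one pattern per line.
for pattern in patterns:
    print(json.dumps(pattern))
```

Saved to a file such as patterns.jsonl (a placeholder name), the output could then be passed to ner.manual through its --patterns option, together with your German source texts and the same label set. This only helps for entities whose surface form carries over into German, such as names of companies or products. Translated terms would still need to be annotated by hand.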

mystuff · May 7, 2020, 4:02pm

Thanks for your reply. I will try that approach.
