Is there something wrong in general with the German model?

Hi all,

The main reason I chose spaCy and Prodigy for my NLP tasks was that spaCy seemed to be very well documented and - most importantly - came with German models, while Prodigy provided an excellent tool to train and improve these models.

I need NLP for Named Entity Recognition. My source texts are 20,000 press releases from which I need to extract the organizations, people and locations and, in a later step, the brands they are about.

These entities (plus other significant terms) will then be used as search terms in order to discover online articles that have been published on the basis of these press releases.

I've now had the time to experiment with spaCy and Prodigy for several weeks and am very frustrated with the German model's NER accuracy.

Whether it's the ORG, PER or LOC entity, out of the box the model detects far too many false positives.
For ORG entities, I'd say its predictions are correct in about 30% of cases.
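For anyone who wants to reproduce this, the kind of check I've been running looks roughly like the following (the example sentence is just an illustration; any German press release text will do):

import spacy

# Load the stock German model and print whatever entities it predicts
nlp = spacy.load("de_core_news_sm")
doc = nlp("Die Daimler AG kündigte heute in Stuttgart eine Kooperation mit BMW an.")
for ent in doc.ents:
    print(ent.text, ent.label_)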

For the last three weeks I have been trying to retrain the ORG entity.
In the end I had generated about 20k annotated sentences with roughly equal shares of accept and reject tasks.

After batch training the 'de_core_news_sm' model with these sentences I noticed a slight improvement. The accuracy reported by the training process is now 90%, but from what I can see, the model is actually correct in only about half of the cases.
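For completeness, the training command I ran was along these lines (the dataset name and output path are placeholders):

prodigy ner.batch-train ner_org_teach de_core_news_sm --label ORG --output /tmp/model_org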

So I thought: instead of trying to retrain the small model, why not start from scratch and create a new entity type?

Following honnibal's video tutorial at https://prodi.gy/docs/video-new-entity-type, I did the following:

prodigy dataset ner_org_strict_seedterms "Seed terms for ORG_STRICT"
prodigy terms.teach ner_org_strict_seedterms de_core_news_md --seeds "Daimler, Volkswagen, Apple, Microsoft, BMW, AEG, IBM"

When I opened the Prodigy app, I was presented with the following suggestions:
empty, elephant, unknown, apple, tree, soda, greedy, dip, jumping, berry, slam, lemon ...

Well, these words don't exactly look German. And being German myself, I guess I should know.

Guys, are you sure that the German model that comes with spaCy has been trained on the right Wikipedia corpus?

The models we distribute for spaCy are limited by what training data is available. We've paid licensing fees to get better data for English parsing and NER, and somewhat better data for German dependencies. We distribute these models for free, just as we've made the spaCy library free.

However, no resources are available for us to license for German NER --- so we haven't even had the option to buy better data for German. The same is true for the NER data for most of the other languages.

In order to provide some sort of free NER model for German, we've had to use annotations derived from Wikipedia text semi-automatically. We've tried to note that this is unideal in the docs:

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than it would be on correct human annotations.

So, I hope you can understand why the free German models we distribute are unideal. That said, you should definitely be able to train an improved model on your data using Prodigy.

I think the most likely problem you're facing is that if the initial model is not sufficiently accurate, the ner.teach workflow doesn't work very well. The model needs to be quite good in order to learn effectively from the binary supervision. When you click "reject", the model is only given partial information to learn from. You're also relying on the model to suggest examples, so you can get stuck in situations where the confidence of some entities that are correct is too low, and they're not suggested. We've tried to tune the system so that this happens less often, but if the initial model is inaccurate this can still occur.

I suggest you try the ner.manual workflow, and simply highlight the entities to ensure you're getting gold-standard annotations. You can also use the ner.make-gold recipe if you do want to use the current model to suggest annotations. Additionally, you can import your existing "accept" entities into the data, so that you don't have to label those again.
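For example, something along these lines (the dataset name and source file are placeholders; check the recipe docs for the exact arguments in your Prodigy version):

prodigy ner.manual ner_org_gold de_core_news_sm ./press_releases.jsonl --label ORG,PER,LOC
prodigy ner.make-gold ner_org_gold de_core_news_sm ./press_releases.jsonl --label ORG
prodigy db-in ner_org_gold ./accepted_entities.jsonl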

Once you've created some of these annotations, you can train with the --no-missing argument to ner.batch-train. This will tell the model that if an entity isn't in your annotations, it's definitely not correct, which makes it much easier for the model to learn from the annotations.
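For example (again, the names and paths are just placeholders):

prodigy ner.batch-train ner_org_gold de_core_news_sm --label ORG --no-missing --output /tmp/model_org_gold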

I don't think you should need the terms.teach workflow, which would also be limited to single-term entities using the de_core_news_md vectors. It was definitely surprising that the model came back with so many English words. There seem to be a number of English words in that model's vocab, because there was some amount of code-mixed text in the data used to calculate the vocabulary frequencies. These words end up close to the entities, since they sometimes occur in the context of those entity words, but never occur in the context of most of the other words in the vocab (which are German). I had a look through the vocab to verify this.
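If you want to check this yourself, a rough sketch like the following should show the nearest vector neighbours of one of your seed terms (this assumes a spaCy version that provides Vectors.most_similar; the exact neighbours will depend on the model version):

import spacy

# Load the md model, whose vectors terms.teach uses
nlp = spacy.load("de_core_news_md")

# Look up the vector for a seed term and query its 20 nearest neighbours
query = nlp.vocab["Daimler"].vector.reshape(1, -1)
keys, best_rows, scores = nlp.vocab.vectors.most_similar(query, n=20)
print([nlp.vocab.strings[int(key)] for key in keys[0]])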

Dear @honnibal,

thank you very much for the quick reply.

However, no resources are available for us to license for German NER --- so we haven't even had the option to buy better data for German. ... So, I hope you can understand why the free German models we distribute are unideal.

I DO understand, and I'm sorry if my post led to the impression that I didn't.

I originally had planned to solve my NLP problems using NLTK and spent considerable time looking for instructions on how to use NLTK with German.

While searching I came across a few German corpora like the TIGER corpus, the Huge German Corpus and finally across the GermEval 2014 corpus, the latter being the only one (that I could find) that contained NER annotations.

When I finally found spaCy, with its extensive documentation website, German NER out of the box, and Prodigy as a tool to improve the shipped German model should it not suffice, I believed that the quote "NER today is regarded as a solved problem in NLP", which I had come across during my web research, must be true.

While this seems to be true for English, it apparently is not for other languages, at least not without investing considerable time in training and creating a language-specific model. Prodigy comes in here as a very handy tool.

Just a thought:

Germany in general seems to be lagging behind in many areas these days, like mobile network coverage, internet bandwidth and availability, and apparently especially when it comes to AI and ML.

Not sure how many German spaCy users are out there, but perhaps it would be a good idea to initiate some kind of German spaCy user group that aims to create an improved, all-purpose, community-driven German model?

After all, contributing is what OSS is all about, and it would help us all to catch up.

Anyhow:

Once you've created some of these annotations, you can learn with the --no-missing argument to ner.batch-train.

It might well be that the --no-missing argument to ner.batch-train is what I've been missing so far. I'll keep trying and will let you know how it goes.

Thank you once more for the very fast reply and enjoy the rest of the weekend.

Cheers,

kamiwa

I do think you'll be able to solve your problems by creating training data with Prodigy. One of the advantages is exactly that you can make data specific to your problem --- that's generally much more directly useful than relying on more generic resources.

This is one reason why the situation with linguistic resources is the way it is. The academic community's mission is to have enough benchmarks to allow the algorithms to be meaningfully ranked. From the perspective of the academic community, it doesn't really matter whether the models trained are useful. If we can expect that articles from the 1984 Wall Street Journal will produce the same relative ordering of systems as some more useful data (a reasonable assumption that's so far been true), there's no reason to go out and create datasets that result in useful models.

So far no efforts to "crowd source" corpora in the way you suggest have been very successful. The resources that become widely used tend to be ones that were created much more carefully, usually with relatively few annotators. The reason is that it's quite important that all of the annotators have the same idea of the annotation scheme. This means that some effort needs to be taken to engage with the annotators and resolve disagreements.

Thanks again; your arguments against a crowd-sourced model seem reasonable.
As I said, it was only a thought.