Hi all,
The main reason I chose spaCy and Prodigy for my NLP tasks was that they seemed to be very well documented and, most importantly, came with German models, with Prodigy as an excellent tool to train and improve these models.
I need NLP for Named Entity Recognition. My source texts are 20,000 press releases from which I need to extract the organizations, people and locations and in a later step the brands they are about.
These entities (plus other significant terms) will then be used as search terms to discover online articles that have been published based on these press releases.
I've now had time to experiment with spaCy and Prodigy for several weeks and am very frustrated with the German model's NER accuracy.
Be it ORG, PERSON or LOC, out of the box it produces far too many false positives.
For ORG entities, I'd say its predictions are correct in about 30% of the cases.
I have spent the last three weeks trying to retrain the ORG entity.
In the end I had generated about 20k annotated sentences with an equal percentage of accept and reject tasks.
After batch training the 'de_core_news_sm' model with these sentences I noticed a slight improvement. The accuracy reported by the training process is now 90%. But judging from the predictions I have looked at, I'd say it is now correct in about half of the cases.
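To be fair, my "about half" is an eyeball estimate, and it may not be measuring the same thing as the 90% figure. Here is a minimal sketch of how I could put a number on it: take a sample of sentences, check the predicted ORG spans by hand, and compute precision. The function name, the span format and the sample data are made up for illustration; none of this is spaCy or Prodigy API.

```python
from typing import Iterable, Set, Tuple

# (start_char, end_char, label): an illustrative span format, not a spaCy type
Span = Tuple[int, int, str]


def precision(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Share of predicted spans that exactly match a hand-checked gold span."""
    predicted_set: Set[Span] = set(predicted)
    if not predicted_set:
        # No predictions at all: report 0.0 instead of dividing by zero.
        return 0.0
    return len(predicted_set & set(gold)) / len(predicted_set)


# Made-up example: the model proposes four ORG spans, two of them are correct.
predicted = [(0, 7, "ORG"), (12, 22, "ORG"), (30, 35, "ORG"), (40, 43, "ORG")]
gold = [(0, 7, "ORG"), (40, 43, "ORG")]
print(precision(predicted, gold))  # 0.5
```

Only exact span matches count here, so a prediction with a slightly wrong boundary is treated as a false positive, which is how I have been judging the output by eye.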
So I thought, instead of trying to retrain the small model why not start from scratch and create a new entity type.
Following honnibal's video tutorial at https://prodi.gy/docs/video-new-entity-type I did the following:
prodigy dataset ner_org_strict_seedterms "Seed terms for ORG_STRICT"
prodigy terms.teach ner_org_strict_seedterms de_core_news_md --seeds "Daimler, Volkswagen, Apple, Microsoft, BMW, AEG, IBM"
When I opened the Prodigy app, I was presented with the following suggestions:
empty, elephant, unknown, apple, tree, soda, greedy, dip, jumping, berry, slam, lemon .....
Well, these words don't exactly look German. And being German, I guess I should know.
Guys, are you sure that the German model that comes with spaCy has been trained on the right Wikipedia corpus?