Ner.teach just makes random guesses

I used ner.manual to tag a keyword in 600 sentences, but the entity recognizer doesn’t recognize the keyword at all in ner.teach; it just makes random guesses like the one below:

[screenshot of a ner.teach suggestion]

Am I missing a step?

Hi @kevincgrant, welcome!

It’s probably not possible for anyone to give you good advice in this case, unless you add more details. What kind of keywords are you tagging? Can you give a few examples?

You may also want to check out this awesome NER flowchart that @ines put together: https://prodi.gy/docs/pdfs/prodigy_flowchart_ner.pdf


Wow, that’s an awesome flowchart, thanks!

In this case, the defined term I’m trying to tag is Confidential Information (with CONF as the entity label). The documents I’m working with use dozens of different terms for this (Evaluation Material, Proprietary Information, etc.) that I want it to recognize, so I can automatically insert boilerplate language into the document with the defined terms filled in.

Thanks! :smiley:

What’s the command you ran and what’s in your patterns? If you start out with a new category, it’s important that the model gets to see enough positive examples of the label early on so it can learn from them – otherwise, it’ll just make random guesses and it can take a very long time to converge.
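
For example, a minimal patterns file for a CONF label could look something like the sketch below (the file name and exact token patterns are made up here; you’d pass the resulting JSONL file to ner.teach via --patterns):

```python
import srsly

# Hypothetical token patterns for a CONF label; each entry becomes one line
# in a JSONL patterns file that ner.teach can use via --patterns
patterns = [
    {"label": "CONF", "pattern": [{"lower": "confidential"}, {"lower": "information"}]},
    {"label": "CONF", "pattern": [{"lower": "evaluation"}, {"lower": "material"}]},
    {"label": "CONF", "pattern": [{"lower": "proprietary"}, {"lower": "information"}]},
]
srsly.write_jsonl("conf_patterns.jsonl", patterns)
```

That way the recipe has guaranteed positive candidates to show you early on, instead of relying only on the model’s initially random suggestions.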

Also, I noticed something in your screenshot: According to the meta in the bottom right corner, the suggestion you saw there comes from your patterns file, specifically the pattern on line 569. Is it possible that your patterns are noisy and include things like single newlines?
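
If you want to double-check, a quick script like this can flag suspicious entries (a rough sketch only; the file name is made up and it assumes the JSONL patterns format shown above):

```python
import json

# Flag pattern entries that contain empty or whitespace-only text
# (e.g. stray newlines). The file name here is hypothetical.
with open("conf_patterns.jsonl", encoding="utf8") as f:
    for line_no, line in enumerate(f, start=1):
        entry = json.loads(line)
        pattern = entry["pattern"]
        # String patterns are exact phrases; token patterns are lists of dicts
        tokens = [{"orth": pattern}] if isinstance(pattern, str) else pattern
        for token in tokens:
            for key in ("lower", "orth", "text"):
                value = token.get(key)
                if isinstance(value, str) and not value.strip():
                    print(f"Suspicious pattern on line {line_no}: {entry}")
```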

Finally, it’s possible that your label scheme and “confidential information” as an entity type just aren’t a good fit. Named entity recognition models are usually optimised to learn “categories of things”, which typically have very clearly defined boundaries and are often noun phrases, etc. If your definition of “confidential information” ends up including long phrases, half-sentences and blurry phrase boundaries, the model may struggle to learn it. In those cases, it’s often better to focus on training a model to predict key terms, and then use rules and the dependency parse to resolve longer phrases.
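
To illustrate that last idea (purely a sketch, with an invented example sentence and spaCy’s v3 matcher API – in practice the short key terms would come from your trained model):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Short key terms stand in for whatever a trained CONF model would predict
terms = ["Confidential Information", "Evaluation Material", "Proprietary Information"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CONF", [nlp.make_doc(term) for term in terms])

doc = nlp("The Receiving Party shall not disclose any Confidential Information of the Disclosing Party.")

for match_id, start, end in matcher(doc):
    key_term = doc[start:end]
    # Expand from the key term's head token to its full subtree in the
    # dependency parse, which recovers the longer surrounding phrase
    subtree = list(key_term.root.subtree)
    expanded = doc[subtree[0].i : subtree[-1].i + 1]
    print(key_term.text, "->", expanded.text)
```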

You might also want to check out @honnibal’s talk on designing label schemes for NLP problems, specifically this part:


I was confused about this. I thought I had to process the ner.manual annotations with ner.teach before training the model. I just ran ner.batch-train on my ner.manual annotations and got 98.6% accuracy!

Follow-up question: I want to train for five other keywords on this corpus. Should I merge my annotation datasets before running ner.batch-train, or combine the models afterward?

Btw, thanks a ton for the great support on here.


Once you're ready to train your final model with all labels, you can use the db-merge command to create a new dataset with all annotations. You can then train from that dataset.
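
If you’d rather do the merge step in Python, something roughly equivalent is possible with Prodigy’s database API (a sketch only; the dataset names here are made up):

```python
from prodigy.components.db import connect

# Combine several annotation datasets into one (dataset names are hypothetical)
db = connect()
merged = []
for name in ["conf_annotations", "other_term_annotations"]:
    merged.extend(db.get_dataset(name))

db.add_dataset("all_labels")
db.add_examples(merged, datasets=["all_labels"])
print(f"Merged {len(merged)} annotations into 'all_labels'")
```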

In theory, you could also just add all your annotations for all labels to the same dataset straight away. But we usually recommend using separate datasets, because it makes it much easier to iterate and start over (e.g. if you need to change your label scheme slightly or want to re-annotate a given label).