Ner.teach just makes random guesses

I used ner.manual to tag a keyword in 600 sentences, but the entity recognizer doesn’t recognize the keyword at all in ner.teach; it just makes random guesses like the one below:

[screenshot of a ner.teach suggestion]

Am I missing a step?

Hi @kevincgrant, welcome!

It’s probably not possible for anyone to give you good advice in this case, unless you add more details. What kind of keywords are you tagging? Can you give a few examples?

You may also want to check out this awesome NER flowchart that @ines put together: https://prodi.gy/docs/pdfs/prodigy_flowchart_ner.pdf


Wow, that’s an awesome flowchart, thanks!

In this case, the defined term I’m trying to tag is Confidential Information (with CONF as the entity label). The documents I’m working with use dozens of different terms for this (Evaluation Material, Proprietary Information, etc.) that I want it to recognize, so I can automatically insert boilerplate language into the document with the defined terms filled in.

Thanks! :smiley:

What’s the command you ran and what’s in your patterns? If you start out with a new category, it’s important that the model gets to see enough positive examples of the label early on so it can learn from them – otherwise, it’ll just make random guesses and it can take a very long time to converge.
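
For example, a minimal patterns file for a CONF label could look something like the sketch below (the file name and exact token patterns are made up here; you’d pass the resulting JSONL file to ner.teach via --patterns):

```python
import srsly

# Hypothetical token patterns for a CONF label; each entry becomes one line
# in a JSONL patterns file that ner.teach can use via --patterns
patterns = [
    {"label": "CONF", "pattern": [{"lower": "confidential"}, {"lower": "information"}]},
    {"label": "CONF", "pattern": [{"lower": "evaluation"}, {"lower": "material"}]},
    {"label": "CONF", "pattern": [{"lower": "proprietary"}, {"lower": "information"}]},
]
srsly.write_jsonl("conf_patterns.jsonl", patterns)
```

That way the recipe has guaranteed positive candidates to show you early on, instead of relying only on the model’s initially random suggestions.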

Also, I noticed something in your screenshot: According to the meta in the bottom right corner, the suggestion you saw there comes from your patterns file, specifically the pattern on line 569. Is it possible that your patterns are noisy and include things like single newlines?
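
If you want to double-check, a quick script like this can flag suspicious entries (a rough sketch only; the file name is made up and it assumes the JSONL patterns format shown above):

```python
import json

# Flag pattern entries that contain empty or whitespace-only text
# (e.g. stray newlines). The file name here is hypothetical.
with open("conf_patterns.jsonl", encoding="utf8") as f:
    for line_no, line in enumerate(f, start=1):
        entry = json.loads(line)
        pattern = entry["pattern"]
        # String patterns are exact phrases; token patterns are lists of dicts
        tokens = [{"orth": pattern}] if isinstance(pattern, str) else pattern
        for token in tokens:
            for key in ("lower", "orth", "text"):
                value = token.get(key)
                if isinstance(value, str) and not value.strip():
                    print(f"Suspicious pattern on line {line_no}: {entry}")
```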

Finally, it’s possible that your label scheme and “confidential information” as an entity type just aren’t a good fit. Named entity recognition models are usually optimised to learn “categories of things”, which typically have very clearly defined boundaries and are often noun phrases, etc. If your definition of “confidential information” ends up including long phrases, half-sentences and blurry phrase boundaries, the model may struggle to learn it. In those cases, it’s often better to focus on training a model to predict key terms, and then use rules and the dependency parse to resolve longer phrases.
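
To illustrate that last idea (purely a sketch, with an invented example sentence and spaCy’s v3 matcher API – in practice the short key terms would come from your trained model):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Short key terms stand in for whatever a trained CONF model would predict
terms = ["Confidential Information", "Evaluation Material", "Proprietary Information"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CONF", [nlp.make_doc(term) for term in terms])

doc = nlp("The Receiving Party shall not disclose any Confidential Information of the Disclosing Party.")

for match_id, start, end in matcher(doc):
    key_term = doc[start:end]
    # Expand from the key term's head token to its full subtree in the
    # dependency parse, which recovers the longer surrounding phrase
    subtree = list(key_term.root.subtree)
    expanded = doc[subtree[0].i : subtree[-1].i + 1]
    print(key_term.text, "->", expanded.text)
```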

You might also want to check out @honnibal’s talk on designing label schemes for NLP problems, specifically this part:


I was confused about this. I thought I had to process the ner.manual annotations with ner.teach before training the model. I just ran ner.batch-train on my ner.manual annotations and got 98.6% accuracy!

Follow-up question: I want to train for five other keywords on this corpus. Should I merge my annotation datasets before running ner.batch-train, or combine the models afterward?

Btw, thanks a ton for the great support on here.


Once you're ready to train your final model with all labels, you can use the db-merge command to create a new dataset with all annotations. You can then train from that dataset.
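
If you’d rather do the merge step in Python, something roughly equivalent is possible with Prodigy’s database API (a sketch only; the dataset names here are made up):

```python
from prodigy.components.db import connect

# Combine several annotation datasets into one (dataset names are hypothetical)
db = connect()
merged = []
for name in ["conf_annotations", "other_term_annotations"]:
    merged.extend(db.get_dataset(name))

db.add_dataset("all_labels")
db.add_examples(merged, datasets=["all_labels"])
print(f"Merged {len(merged)} annotations into 'all_labels'")
```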

In theory, you could also just add all your annotations for all labels to the same dataset straight away. But we usually recommend using separate datasets, because it makes it much easier to iterate and start over (e.g. if you need to change your label scheme slightly or want to re-annotate a given label).