Text normalization / conversion with Prodigy / spaCy

I am working on a project where I need to extract certain Named Entities from crawled data.

The data usually contains incomplete sentences that describe the speaker(s) of a talk or the author(s) of a paper, in different formats.

Example:
[person name] from [company name]
[person name] of [company name]
[person name] - [company name]
[person name]@[company name]
[person name 1], and [person name 2] from [company name]
[person name 1], [person name 2], and [person name 3] from [company name]
[person name 1], [person name 2] from [company name 1], [person name 3] from [company name 2]
[person name 1], [person name 2] from [company name 1], [company name 2]

And many more forms.

What I need is to pair each person name with the corresponding company name, so the result is a list of tuples ([person name n], [company name n]).

Initially, I thought Prodigy’s NER training capability was exactly what I needed.
However, after annotating a few hundred samples with both ner.make-gold and ner.manual, the model still struggles with it.

Some of the reasons may include incomplete sentences, special characters, and nonstandard text formats. Company names can contain spaces (multiple words), ampersands, apostrophes, and hyphens, or even use CamelCase instead of spaces. NER misses many ORGs, and sometimes tags only a portion of the name as ORG, for example with “Amazon Germany”.

I now think this may be better solved by text normalization followed by extracting the desired data with regular expressions.
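To make the goal concrete, here is a rough sketch of what I have in mind; the patterns only cover a couple of the separator forms above and would obviously need many more rules:

```python
import re

# Rough sketch: pair person names with company names after normalization.
# Only handles the "X from/of/-/@ Y" forms; real data needs more rules.
PAIR_PATTERN = re.compile(
    r"(?P<person>[A-Z][\w.'-]+(?: [A-Z][\w.'-]+)+)"      # person name (two+ capitalized words)
    r"\s*(?:from|of|-|@)\s*"                              # separator
    r"(?P<company>[A-Z][\w&.'-]+(?: [A-Z&][\w&.'-]*)*)"   # company name, incl. "&" words
)

def extract_pairs(text):
    """Return a list of (person, company) tuples found in the text."""
    return [(m.group("person"), m.group("company"))
            for m in PAIR_PATTERN.finditer(text)]

print(extract_pairs("Jane Doe from Amazon Germany"))
# [('Jane Doe', 'Amazon Germany')]
```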

Is NER really suitable in this use case?
Can Prodigy be used to train a text normalization / conversion model (e.g. Seq2Seq)?

Thanks.

The NER model does rely on the tokens being good features for the tagging decisions. Mostly it's trying to learn the tag for a given word based on the surrounding words. If the surrounding words aren't very informative, the model might not perform as well as a rule-based solution. I do think text normalisation might be helpful, perhaps as a pre-process to the NER. If the camel-casing is common, then fixing that will definitely be helpful.
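For the camel-casing specifically, even something as simple as this can go a long way as a pre-process (a rough sketch, not a complete rule set):

```python
import re

def split_camel_case(text):
    """Insert a space wherever a lower-case letter is followed by an
    upper-case letter, e.g. 'AmazonGermany' -> 'Amazon Germany'.
    Purely illustrative; acronyms like 'eBay' or 'iOS' need extra handling."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_camel_case("Jane Doe from AmazonGermany"))
# Jane Doe from Amazon Germany
```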

You might also try starting from a blank model, instead of learning on top of the existing spaCy models. Your text is likely to be significantly different from the original training data, so the initial model might not be helpful.
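Creating a blank model to pass to the recipes could look roughly like this (assuming spaCy v2 here; the output path is just an example):

```python
import spacy

# Create a blank English pipeline with an empty NER component,
# then save it so it can be passed to Prodigy instead of a pretrained model.
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
nlp.to_disk("/tmp/blank_en_ner")
```

You can then pass the saved path as the model argument to the recipes instead of one of the pretrained English models.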

You would have to write your own recipe for that. You might find the diff interface useful (see the Web Application page in the Prodigy docs).
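A very rough skeleton of such a recipe might look like the following. The stream just loads pre-generated (original, normalized) pairs from a JSONL file with assumed "text" and "normalized" keys, and the task structure is only a placeholder, so check the diff interface docs for the exact fields it expects:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("normalize.diff")
def normalize_diff(dataset, source):
    """Collect accept/reject decisions on candidate text normalizations."""
    def get_stream():
        for eg in JSONL(source):
            # Placeholder task structure: see the diff interface docs
            # for the exact fields it expects.
            yield {"input": {"text": eg["text"]},
                   "accept": {"text": eg["normalized"]}}

    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": get_stream(),  # stream of annotation tasks
        "view_id": "diff",       # use the diff annotation interface
    }
```

You would run it with the `-F` flag pointing to the recipe file, e.g. `prodigy normalize.diff my_norm_dataset ./candidates.jsonl -F recipe.py`.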

After many experiments, I found the problem!

I was using ner.make-gold in the wrong way.
I thought I needed to reject the model’s incorrect NER predictions even after I had corrected them.

After viewing the contents of the Prodigy SQLite database, it turns out Prodigy doesn’t record the model’s original prediction, only the corrected final result.

So basically my dataset records show that I rejected ~80% of the correct answers and only accepted the ~20% where the model’s prediction was already correct. The model struggles because it sees contradictory training data.

I deleted the entire SQLite database, added a text preprocessing step, and started over with ner.manual; it now reaches 90% accuracy after 100 training iterations.
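In case it helps anyone else: instead of opening the SQLite file directly, reading the dataset through Prodigy’s database API makes the accept/reject imbalance easy to spot (the dataset name below is just an example):

```python
from collections import Counter
from prodigy.components.db import connect

# Connect to Prodigy's database and load everything saved in a dataset.
db = connect()
examples = db.get_dataset("my_ner_dataset")

# Count the answer distribution: a heavy skew towards "reject" was the
# red flag in my case.
print(Counter(eg["answer"] for eg in examples))
# e.g. Counter({'reject': 412, 'accept': 103})
```

It also turned out I could have used `prodigy drop <dataset>` to remove a single dataset instead of deleting the whole SQLite file.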

It seems the reject button really serves no purpose in ner.make-gold or ner.manual; maybe you could consider disabling it to avoid confusion.

@Silent Sorry for the confusion there! The reject button can still be a useful convention, because sometimes you can’t correct the example, e.g. if there’s no satisfying answer, or if the tokenisation is wrong. It can be useful to have a distinction between “ignore” and “reject” in that case.

Maybe we could find somewhere to add an additional note, possibly in the help text for the recipe.