Two word entities

KavyaGujjala · October 23, 2018, 5:27am

Hi,
Does Prodigy catch/train two-word entities (new label), like the skincare brands ‘Estee Lauder’, ‘Bobbi Brown’, ‘SK-II’?
Right now it's catching ‘estee’, ‘lauder’, ‘SK’, ‘II’ separately.

ines · October 23, 2018, 11:51am

Sure! In fact, multi-token phrases are pretty essential to named entity recognition and super common. Many of the entity types recognized by spaCy’s pre-trained models often apply to multiple tokens (e.g. person names).

You just need to show the model enough examples of those entities in context so it can start learning about them - either by providing patterns or by labelling them manually and then pre-training the model.

(If you start training a new category from scratch and the model doesn’t get to see enough positive examples, it’s possible that it first gravitates towards single-token entities. This usually means that the model hasn’t seen enough examples.)

KavyaGujjala · October 26, 2018, 4:45am

Thanks for the reply.
Even though two-word entities are given in the seed terms, when Prodigy later shows similar named entities, it's not suggesting any two-word entities.
For example, ‘Estee Lauder’ was given in the seed terms, but it neither suggests similar two-word entities (during entity training) nor catches the same words in the reviews (in review-level labeling).
What could be the reason?

ines · October 26, 2018, 10:46am

Could you share more details on your workflow? Which recipes did you use? And what do your patterns look like? Did you confirm that your patterns actually match (see our matcher demo for an example) and that they properly describe the entities you’re looking for?

KavyaGujjala · October 26, 2018, 11:45am

First created a dataset for skincare brands
prodigy dataset skincare_brands

Gave a few brands as seed terms so that it would catch related terms
prodigy terms.teach skincare_brands en_core_web_lg --seeds "estee lauder,bobbi brown,lancome"

Then, using the link, accepted/rejected the brands that came up and built a dataset of 400 brands
(it was only suggesting one-word brands like sephora and lancome, and also estee and lauder separately)

We want two-word entities like ‘estee lauder’ to be caught together.

Exported the dataset to a JSONL patterns file
prodigy terms.to-patterns skincare_brands skincare_brands.jsonl --label SKINCARE

Added a review file to tag the brands, started tagging the entities in the reviews
prodigy ner.teach skincare_ner en_core_web_lg path-to-review-file --label SKINCARE --patterns skincare_brands.jsonl
Even in the review level tagging no two word entities are being caught!

Can you please help us through this?

ines · October 26, 2018, 3:07pm

Thanks for sharing your workflow!

Ahh, this makes more sense now. The problem here is that you're using the en_core_web_lg model for word vectors, which only includes vectors for single words. So it will only be able to suggest single tokens. All of your patterns will then include examples for single tokens, and the model would only get to see single-token entities. As a result, it doesn't learn anything about multi-token entities while you train it in the loop.

If you want to use terms.teach to bootstrap a terminology list for multi-token entities, you probably want to use your own vectors that were trained on phrases instead of single tokens.

A simpler solution would be to just add more patterns for multi-token entities manually, for example:

{"label": "SKINCARE", "pattern": [{"lower": "estee"}, {"lower": "lauder"}]}
{"label": "SKINCARE", "pattern": [{"lower": "bobbi"}, {"lower": "brown"}]}

KavyaGujjala · October 29, 2018, 7:41am

Thanks for the detailed reply.

So in our use case, is it better to use another model, say ‘en_core_web_md’ or ‘en_core_web_sm’, or to do it the way you mentioned above (adding two-word entities manually)?

Also, how can we make sure that every entity in a review gets tagged? For example, take the review
‘SK-II immediately open eye cream large, easy-hwan bright charming eyes, you can.’.
In this review, hwan is caught but SK-II didn’t come up. But SK-II was caught in other reviews!

calebchiam · June 18, 2019, 9:09am

@ines Been playing with Prodigy the past few days and I’m loving it! Ran into this same issue and was wondering too if using ‘en_core_web_md’ and ‘en_core_web_sm’ as starting models would face the same issue as in ‘en_core_web_lg’. @KavyaGujjala - did you ever figure this out?

honnibal · June 20, 2019, 9:39pm

If you’re merging two word entities, you’ll need to create your own word vectors model that includes those terms as keys. Otherwise you won’t get word vectors for them.
