Hi,
Can Prodigy catch/train two-word entities (for a new label), such as the skincare brands ‘Estee Lauder’, ‘Bobbi Brown’ and ‘SK-II’?
Right now it's catching ‘estee’, ‘lauder’, ‘SK’ and ‘II’ separately.
Sure! In fact, multi-token phrases are pretty essential to named entity recognition and super common. Many of the entity types recognized by spaCy’s pre-trained models often apply to multiple tokens (e.g. person names).
You just need to show the model enough examples of those entities in context so it can start learning about them - either by providing patterns or by labelling them manually and then pre-training the model.
(If you start training a new category from scratch and the model doesn’t get to see enough positive examples, it’s possible that it first gravitates towards single-token entities. This usually means that the model hasn’t seen enough examples.)
Thanks for the reply.
Even though two-word entities are given in the seed terms, Prodigy doesn't suggest two-word entities later when it shows similar named entities.
For example, ‘Estee Lauder’ was given in the seed terms, but Prodigy neither suggests similar two-word entities (during entity training) nor catches the same words in the reviews (during review-level labeling).
What could be the reason?
Could you share more details on your workflow? Which recipes did you use? And what do your patterns look like? Did you confirm that your patterns actually match (see our matcher demo for an example) and that they properly describe the entities you’re looking for?
First created a dataset for skincare brands
prodigy dataset skincare_brands
Gave a few brands as seed terms so that it would suggest related terms
prodigy terms.teach skincare_brands en_core_web_lg --seeds "estee lauder,bobbi brown,lancome"
Then, using the link, accepted/rejected the brands that came up and built a dataset of 400 brands
(it was only suggesting one-word brands such as sephora and lancome, and also estee and lauder separately)
We want two-word entities like ‘estee lauder’ to be caught together.
Exported the dataset to a JSONL patterns file
prodigy terms.to-patterns skincare_brands skincare_brands.jsonl --label SKINCARE
Added a review file to tag the brands, started tagging the entities in the reviews
prodigy ner.teach skincare_ner en_core_web_lg path-to-review-file --label SKINCARE --patterns skincare_brands.jsonl
Even in the review level tagging no two word entities are being caught!
Can you please help us through this?
Thanks for sharing your workflow!
Ahh, this makes more sense now. The problem here is that you're using the en_core_web_lg model for word vectors, which only includes vectors for single words. So it will only be able to suggest single tokens. All of your patterns will then include examples for single tokens, and the model will only get to see single-token entities. As a result, it doesn't learn anything about multi-token entities while you train it in the loop.
If you want to use terms.teach to bootstrap a terminology list for multi-token entities, you probably want to use your own vectors that were trained on phrases instead of single tokens.
A simpler solution would be to just add more patterns for multi-token entities manually, for example:
{"label": "SKINCARE", "pattern": [{"lower": "estee"}, {"lower": "lauder"}]}
{"label": "SKINCARE", "pattern": [{"lower": "bobbi"}, {"lower": "brown"}]}
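If the list of brands is long, writing these lines by hand gets tedious. Here's a minimal Python sketch that generates them. The function names (brand_to_pattern, patterns_to_jsonl) and the regex-based token split are assumptions for illustration and not part of Prodigy. The split only approximates spaCy's tokenizer (for instance, it turns "SK-II" into "sk", "-", "ii"), so it's worth checking the output against how your model actually tokenizes the names.

```python
import json
import re


def brand_to_pattern(brand, label="SKINCARE"):
    """Build one token-based match pattern for a (possibly multi-word) brand.

    Assumption: runs of word characters and single punctuation marks are
    treated as tokens. This is only an approximation of spaCy's tokenizer.
    """
    tokens = re.findall(r"\w+|[^\w\s]", brand.lower())
    return {"label": label, "pattern": [{"lower": token} for token in tokens]}


def patterns_to_jsonl(brands, label="SKINCARE"):
    """Serialize one pattern per line, i.e. the JSONL layout shown above."""
    return "\n".join(json.dumps(brand_to_pattern(b, label)) for b in brands)


if __name__ == "__main__":
    # Example brand names taken from this thread
    print(patterns_to_jsonl(["Estee Lauder", "Bobbi Brown", "SK-II"]))
```

The printed lines can be added to skincare_brands.jsonl and passed to ner.teach via --patterns as before.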
Thanks for the detailed reply.
So in our use case, is it better to use another model, say ‘en_core_web_md’ or ‘en_core_web_sm’, or to do it the way you mentioned above (adding two-word entities manually)?
Also, how can we make sure that every entity in a review gets tagged? For example, take this review:
‘SK-II immediately open eye cream large, easy-hwan bright charming eyes, you can.’.
In this review, ‘hwan’ is caught but SK-II didn’t come up, even though SK-II was caught in other reviews!
@ines Been playing with Prodigy the past few days and I’m loving it! Ran into this same issue and was wondering too if using ‘en_core_web_md’ and ‘en_core_web_sm’ as starting models would face the same issue as in ‘en_core_web_lg’. @KavyaGujjala - did you ever figure this out?
If you’re merging two word entities, you’ll need to create your own word vectors model that includes those terms as keys. Otherwise you won’t get word vectors for them.
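As a rough illustration of the merging step, the Python sketch below joins known phrases into single tokens (e.g. estee_lauder) in raw text before vectors are trained on it. The helper name merge_phrases and the underscore convention are assumptions for this example. Training the vectors and packaging them as a model are not shown here.

```python
import re


def merge_phrases(text, phrases, joiner="_"):
    """Replace each known multi-word phrase in `text` with one joined token.

    Matching is case-insensitive and tolerant of repeated whitespace.
    The merged token is lowercased; the rest of the text is left untouched.
    """
    # Handle longer phrases first so they win over shorter ones they contain
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.lower().split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        merged = joiner.join(words)
        text = re.sub(pattern, lambda match: merged, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    review = "I love Estee Lauder and Bobbi Brown products."
    print(merge_phrases(review, ["estee lauder", "bobbi brown"]))
```

Vectors trained on text preprocessed this way would contain keys like estee_lauder, which is what's needed for the merged terms to have vectors at all.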