Using Custom Entities

theoldhat · December 19, 2017, 12:15am

Hi,

I apologize, but I don't think I quite understand how to use Prodigy with custom entity types. I have been able to teach using the out-of-the-box entities (e.g. Person, Org etc.), but can't figure out how to use the Prodigy UI to train on entities that I define in my annotations (e.g. Dog, Cat, Bird). I followed the steps to load my own data and annotations per the steps listed here:

but I am stuck after that. Could anyone offer any assistance?

Thanks!

ines · December 19, 2017, 12:36am

Thanks for the question – this is actually perfect timing! We just released a video that shows a Prodigy workflow dealing with this exact topic. We’ve spent a lot of time tuning this workflow, and we’re really happy to see it working pretty well now! For more info and a quick summary, see this thread and this docs page. You can find more details on the recipes used on this page, or in your PRODIGY_README.html.

TL;DR

Note that this workflow requires Prodigy v1.1.0.

1. Use the terms.teach recipe with word vectors (e.g. a large spaCy model) to create a terminology list of examples of your new entity type. For a DOG entity, you could for example start off with the seed terms “labrador”, “golden retriever” and “poodle”. Based on the vectors, Prodigy will suggest similar words to add to your list – for example, “corgi”.
2. Convert your terminology list to match patterns that can be loaded by spaCy’s Matcher using the terms.to-patterns recipe. This will give you a JSONL file with entries like {"label": "DOG", "pattern": [{"lower": "golden"}, {"lower": "retriever"}]}.
3. Collect annotations for the new entity type using ner.teach with your patterns file as the --patterns argument. The patterns are used to suggest entities found in your data – this helps you collect a bunch of relevant examples first, to get over the “cold start problem”. As the model in the loop improves, it will also start suggesting entities based on what it’s learned so far. You’ll probably want to collect a few hundred annotations before running the first training experiments.
4. Train your model using ner.batch-train and export it. Hopefully, you’ll now see a nice, initial accuracy score! You can also run ner.train-curve to see how accuracy improves with more data.
5. Test your exported model on real data (and make sure to use texts the model hasn’t seen during training). You can either use the ner.print-stream recipe to get some nicely formatted command line output, or load it into spaCy using spacy.load('/path/to/model') and check out the doc.ents.
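
To make the role of the patterns file more concrete, here is a minimal sketch of how token patterns with "lower" keys pick out candidate entities in a text. It is only an illustration: the find_matches helper is hypothetical, and the whitespace tokenizer is a simplified stand-in for spaCy's tokenizer and Matcher, which Prodigy uses in practice.

```python
import json

def find_matches(text, patterns):
    """Return (label, start_token, end_token, matched_text) for each pattern hit.

    Hypothetical helper: a naive stand-in for spaCy's Matcher that only
    understands {"lower": ...} token patterns.
    """
    # Naive tokenization: split on whitespace and strip trailing punctuation.
    words = [t.strip(".,!?") for t in text.split()]
    lowered = [w.lower() for w in words]
    hits = []
    for entry in patterns:
        wanted = [tok["lower"] for tok in entry["pattern"]]
        n = len(wanted)
        # Slide a window of n tokens over the text and compare lowercase forms.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == wanted:
                hits.append((entry["label"], i, i + n, " ".join(words[i:i + n])))
    return hits

# Two entries in the same JSONL format that terms.to-patterns produces.
patterns_jsonl = """{"label": "DOG", "pattern": [{"lower": "golden"}, {"lower": "retriever"}]}
{"label": "DOG", "pattern": [{"lower": "corgi"}]}"""
patterns = [json.loads(line) for line in patterns_jsonl.splitlines()]

matches = find_matches("My Golden Retriever chased a corgi.", patterns)
for label, start, end, matched in matches:
    print(label, start, end, matched)
```

Each hit corresponds to a span that ner.teach could present for an accept/reject decision before the model has learned anything about the new label.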

theoldhat · December 19, 2017, 7:54pm

Ah excellent! This helps immensely. I will give this a try!

Thanks Ines!

JayMan · May 10, 2018, 5:09pm

Hello,

My goal is to create a new NER entity called brands that should recognize famous brands like Campbells, Oshkosh Bigosh etc.

I first ran terms.teach, extracted the jsonl file from that exercise. It came out correctly as
{"pattern":[{"lower":"tomtom"}],"label":"Brand"}

Then ran ner.teach with my own custom dataset like so

python load_brand_data.py | python -m prodigy ner.teach finprod_ner_v1 en_core_web_lg --label Brand --patterns finprod_v1_patterns.jsonl

I then extracted the new jsonl and would have expected to find spans with the Brand labels. Instead I got this - a sample row out of the new jsonl

{"label":"Brand","pattern":[{"lower":"We own two of the most highly recognized and most trusted brand names in the children’s apparel industry, Carter’s and OshKosh B’gosh (or “OshKosh”), and a leading baby and young child lifestyle brand, Skip Hop."}]}

What am I doing wrong please?
Thank you in advance!

ines · May 10, 2018, 5:14pm

This looks all correct so far!

How are you extracting the JSONL? Are you calling db-out? For example:

prodigy db-out finprod_ner_v1 > finprod_ner_v1.jsonl

Or are you running terms.to-patterns somewhere instead? The output you posted looks exactly like the output of the terms.to-patterns recipe – here, Prodigy takes a dataset, only gets the "text" value and creates patterns based on the terms.
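
As a rough illustration of why this happens, the sketch below mimics the behaviour described above: it reads only the "text" value of each record and wraps it in a single-token pattern. The terms_to_patterns function is a hypothetical stand-in written for this example, not Prodigy's actual source code.

```python
def terms_to_patterns(examples, label):
    """Hypothetical stand-in for the recipe: one pattern per record, built from "text"."""
    return [{"label": label, "pattern": [{"lower": eg["text"]}]} for eg in examples]

# A terms dataset holds single words, so the resulting patterns look sensible.
term_dataset = [{"text": "tomtom"}]

# An NER dataset holds whole sentences, so each sentence ends up as one "token".
annotation_dataset = [
    {
        "text": "We own two of the most highly recognized brand names.",
        "spans": [],
        "answer": "accept",
    }
]

term_patterns = terms_to_patterns(term_dataset, "Brand")
sentence_patterns = terms_to_patterns(annotation_dataset, "Brand")
print(term_patterns)
print(sentence_patterns)
```

The spans and answers in the annotation dataset are ignored entirely here, which is why the exported file contains full sentences instead of labelled entities.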

JayMan · May 10, 2018, 5:18pm

You are correct. I am using the terms.to-patterns like so
python -m prodigy terms.to-patterns finprod_ner_v1 finprod_v2_patterns.jsonl --label Brand

Is that wrong?

ines · May 10, 2018, 5:22pm

In this case yes, because you want to export the annotations and not create match patterns, right?

The terms.to-patterns recipe is for converting a dataset of words to patterns. You can then use those patterns to collect better annotations with ner.teach and other recipes – just like you did.

If you just want to export JSONL data from the database, you can use the db-out command:

prodigy db-out finprod_ner_v1 > finprod_ner_v1.jsonl

JayMan · May 10, 2018, 5:28pm

Ah, got it. Makes sense now that I am looking at the db-out output. One more question: none of my annotations have two words, e.g. “Oshkosh Bigosh”, or, better, multiple brands in the same sentence, e.g. “We sell OshKosh Bigosh, Top Hat and Skippy brand products.” Here I would have 3 brands, with 2 of them being multiple words.

How do I overcome that? Use the ner_manual annotation interface to manually annotate?
Note that my initial seed terms had a few multi-word terms, e.g. (I have shortened the terms below for brevity):

python -m prodigy terms.teach finprod_terms_v1 en_core_web_lg --seed "Peerless Funds, Mark Architectural Lighting"

ines · May 10, 2018, 5:42pm

Great, glad you got it working!

In your case, it sounds like the word vectors you used only included single tokens. This is why your patterns also only include single words. One solution for matching multi-word patterns is to simply edit the patterns manually, or create them based on a list or dictionary. The patterns format is pretty simple and you can either specify token patterns like "pattern": [{"lower": "top"}, {"lower": "hat"}] or exact strings like "pattern": "Top Hat".
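
For creating patterns from a list, a small script along these lines might be enough. This is a minimal sketch: the phrase_to_pattern helper and the brand list are made up for illustration, and splitting on whitespace is an assumption that only approximates spaCy's tokenization (names containing apostrophes or hyphens may be split differently, in which case an exact string pattern could be the safer choice).

```python
import json

def phrase_to_pattern(phrase, label):
    """Build one token pattern from a phrase, one {"lower": ...} dict per word."""
    tokens = phrase.lower().split()  # assumption: whitespace roughly matches spaCy's tokens
    return {"label": label, "pattern": [{"lower": token} for token in tokens]}

# Hypothetical brand list; in practice this could come from a dictionary or spreadsheet.
brands = ["Top Hat", "Skippy", "Peerless Funds"]

# One JSON object per line, i.e. the JSONL format expected by --patterns.
lines = [json.dumps(phrase_to_pattern(brand, "Brand")) for brand in brands]
print("\n".join(lines))
```

Saving the printed lines to a .jsonl file gives you a patterns file that can be passed to ner.teach via --patterns.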

At the moment, we don't have a built-in solution for phrase vectors – but we're working on integrating sense2vec, which should be really cool. This would let you train context vectors on your data, and then use terms.teach to find phrases that are similar to the ORG Top Hat (i.e. multi-word tokens and meaning in context).

Annotating manually with the ner_manual interface is also an option. The ner.manual recipe also respects pre-defined entities. And the annotations you export from Prodigy are in the same format as the stream you load into Prodigy. So you can always export your dataset and "post-annotate" it manually to fix mistakes. You probably want to filter only the accepted answers, though!
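
Filtering for accepted answers can be done with a few lines of Python on the db-out export. The sketch below assumes each exported record carries an "answer" key with the value "accept", "reject" or "ignore"; the accepted_only helper and the sample records are invented for illustration.

```python
import json

def accepted_only(jsonl_text):
    """Parse JSONL text and keep only the records whose answer is "accept"."""
    kept = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("answer") == "accept":
            kept.append(record)
    return kept

# Made-up sample standing in for the contents of a db-out export file.
sample = "\n".join([
    json.dumps({"text": "We sell Top Hat products.",
                "spans": [{"start": 8, "end": 15, "label": "Brand"}],
                "answer": "accept"}),
    json.dumps({"text": "The top hat fell off.",
                "spans": [{"start": 4, "end": 11, "label": "Brand"}],
                "answer": "reject"}),
    json.dumps({"text": "Skippy is popular.",
                "spans": [{"start": 0, "end": 6, "label": "Brand"}],
                "answer": "ignore"}),
])

accepted = accepted_only(sample)
for record in accepted:
    print(json.dumps(record))
```

The printed records can be written back to a JSONL file and loaded into ner.manual for correction.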
