Using Custom Entities

Hi,

I apologize, but I don't think I quite understand how to use Prodigy with custom entity types. I have been able to teach using the out-of-the-box entities (e.g. Person, Org etc.), but can't figure out how to use the Prodigy UI to train on entities that I define in my annotations (e.g. Dog, Cat, Bird). I followed the steps listed here to load my own data and annotations:

but I am stuck after that. Could anyone offer any assistance?

Thanks!

Thanks for the question – this is actually perfect timing! We just released a video that shows a Prodigy workflow dealing with this exact topic. We’ve spent a lot of time tuning this workflow, and we’re really happy to see it working pretty well now! For more info and a quick summary, see this thread and this docs page. You can find more details on the recipes used on this page, or in your PRODIGY_README.html.

TL;DR

Note that this workflow requires Prodigy v1.1.0.

  1. Use the terms.teach recipe with word vectors (e.g. a large spaCy model) to create a terminology list of examples of your new entity type. For a DOG entity, you could for example start off with the seed terms “labrador”, “golden retriever” and “poodle”. Based on the vectors, Prodigy will suggest you similar words to add to your list – for example, “corgi”.

  2. Convert your terminology list to match patterns that can be loaded by spaCy’s Matcher using the terms.to-patterns recipe. This will give you a JSONL file with entries like {"label": "DOG", "pattern": [{"lower": "golden"}, {"lower": "retriever"}]}.

  3. Collect annotations for the new entity type using ner.teach with your patterns file as the --patterns argument. The patterns are used to suggest entities found in your data – this helps you collect a bunch of relevant examples first, to get over the “cold start problem”. As the model in the loop improves, it will also start suggesting entities based on what it’s learned so far. You’ll probably want to collect a few hundred annotations before running the first training experiments.

  4. Train your model using ner.batch-train and export it. Hopefully, you’ll now see a nice, initial accuracy score! You can also run ner.train-curve to see how accuracy improves with more data.

  5. Test your exported model on real data (and make sure to use texts the model hasn’t seen during training). You can either use the ner.print-stream to get some nicely formatted command line output, or load it into spaCy using spacy.load('/path/to/model') and check out the doc.ents.

  6. :tada:

Ah excellent! This helps immensely. I will give this a try!

Thanks Ines!

Hello,

My goal is to create a new NER entity called brands that should recognize famous brands like Campbells, Oshkosh Bigosh etc.

I first ran terms.teach, extracted the jsonl file from that exercise. It came out correctly as
{"pattern":[{"lower":"tomtom"}],"label":"Brand"}

Then ran ner.teach with my own custom dataset like so

python load_brand_data.py | python -m prodigy ner.teach finprod_ner_v1 en_core_web_lg --label Brand --patterns finprod_v1_patterns.jsonl

I then extracted the new jsonl and would have expected to find spans with the Brand labels. Instead I got this - a sample row out of the new jsonl

{"label":"Brand","pattern":[{"lower":"We own two of the most highly recognized and most trusted brand names in the children’s apparel industry, Carter’s and OshKosh B’gosh (or \"OshKosh\"), and a leading baby and young child lifestyle brand, Skip Hop."}]}

What am I doing wrong please?
Thank you in advance!

This looks all correct so far!

How are you extracting the JSONL? Are you calling db-out? For example:

prodigy db-out finprod_ner_v1 > finprod_ner_v1.jsonl

Or are you running terms.to-patterns somewhere instead? The output you posted looks exactly like the output of the terms.to-patterns recipe – here, Prodigy takes a dataset, only gets the "text" value and creates patterns based on the terms.
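To see why the file ends up looking like that, here is a rough illustration of the conversion in plain Python. It is only a sketch of the behaviour described above, not Prodigy's actual implementation: the function name `texts_to_patterns` and the sample records are made up for the example.

```python
import json

def texts_to_patterns(examples, label):
    """Turn each record's "text" value into a single-token match pattern."""
    patterns = []
    for eg in examples:
        # This sketch reads only the "text" field; "spans" and "answer" are not used.
        patterns.append({"label": label, "pattern": [{"lower": eg["text"]}]})
    return patterns

# A terms dataset holds single words, so the result is a usable pattern.
terms = [{"text": "tomtom", "answer": "accept"}]

# An ner.teach dataset holds whole sentences, so the sentence becomes the "term".
sentences = [
    {
        "text": "We sell Skip Hop products.",
        "spans": [{"start": 8, "end": 16, "label": "Brand"}],
        "answer": "accept",
    }
]

for entry in texts_to_patterns(terms + sentences, "Brand"):
    print(json.dumps(entry))
```

The second printed line has the same shape as the row you posted: the whole sentence sits inside one "lower" token, and the span you annotated is gone.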

You are correct. I am using the terms.to-patterns like so
python -m prodigy terms.to-patterns finprod_ner_v1 finprod_v2_patterns.jsonl --label Brand

Is that wrong?

In this case yes, because you want to export the annotations and not create match patterns, right?

The terms.to-patterns recipe is for converting a dataset of words to patterns. You can then use those patterns to collect better annotations with ner.teach and other recipes – just like you did.

If you just want to export JSONL data from the database, you can use the db-out command:

prodigy db-out finprod_ner_v1 > finprod_ner_v1.jsonl

Ah, got it. It makes sense now that I am looking at the db-out output. One more question: none of my annotations have two words (e.g. “Oshkosh Bigosh”), and none cover multiple brands in the same sentence, e.g. “We sell Oshkosh Bigosh, Top Hat and Skippy brand products.” Here I would have 3 brands, 2 of them with multiple words.

How do I overcome that? Should I use the ner_manual annotation interface to annotate manually?
Note that my initial seed terms included a few multi-word terms, e.g. (I have shortened the terms below for brevity):

python -m prodigy terms.teach finprod_terms_v1 en_core_web_lg --seed "Peerless Funds, Mark Architectural Lighting"

Great, glad you got it working! :+1:

In your case, it sounds like the word vectors you used only included single tokens. This is why your patterns also only include single words. One solution for matching multi-word patterns is to simply edit the patterns manually, or create them based on a list or dictionary. The patterns format is pretty simple and you can either specify token patterns like "pattern": [{"lower": "top"}, {"lower": "hat"}] or exact strings like "pattern": "Top Hat".
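If you'd rather generate the multi-word patterns from a list than write them by hand, something like the following could work. It is a minimal sketch: the helper name `phrase_to_pattern` and the brand list are hypothetical, and it assumes that splitting on whitespace is close enough to how spaCy tokenizes your terms. Names containing punctuation (like an apostrophe) may be tokenized differently, so for those an exact string pattern might be the safer choice.

```python
import json

def phrase_to_pattern(phrase, label):
    """Build a token pattern with one lowercase entry per whitespace-separated word."""
    # Assumption: whitespace splitting approximates spaCy's tokenization here.
    tokens = [{"lower": word.lower()} for word in phrase.split()]
    return {"label": label, "pattern": tokens}

# Hypothetical list of brand names; replace it with your own terms.
brands = ["Top Hat", "Skip Hop", "TomTom"]

# Print one JSON object per line, i.e. JSONL.
for brand in brands:
    print(json.dumps(phrase_to_pattern(brand, "Brand")))
```

Redirecting the output to a file gives you a patterns file in the same format as the one you already pass to ner.teach via --patterns.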

At the moment, we don’t have a built-in solution for phrase vectors – but we’re working on integrating sense2vec, which should be really cool. This would let you train context vectors on your data, and then use terms.teach to find phrases that are similar to the ORG Top Hat (i.e. multi-word tokens and meaning in context).

Yes, annotating manually is also an option. The ner.manual recipe also respects pre-defined entities. And the annotations you export from Prodigy are in the same format as the stream you load into Prodigy. So you can always export your dataset and “post-annotate” it manually to fix mistakes. You probably want to filter only the accepted answers, though!
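Filtering for accepted answers can be done with a few lines of Python on the db-out export. This is just a sketch: the function name `accepted_only` and the two sample records are made up, and it relies on each exported record carrying an "answer" key with the value "accept", "reject" or "ignore".

```python
import json

def accepted_only(lines):
    """Yield parsed records whose "answer" is "accept", skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("answer") == "accept":
            yield record

# Stand-in for the lines of an exported JSONL file.
sample_export = [
    '{"text": "We sell Top Hat products.", "answer": "accept"}',
    '{"text": "We sell hats.", "answer": "reject"}',
]

for record in accepted_only(sample_export):
    print(json.dumps(record))
```

With a real export, you could pass the open file handle of finprod_ner_v1.jsonl instead of the sample list, write the result to a new JSONL file and load that file into ner.manual.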