Using Custom Entities

Thanks for the question – this is actually perfect timing! We just released a video that shows a Prodigy workflow dealing with this exact topic. We’ve spent a lot of time tuning this workflow, and we’re really happy to see it working pretty well now! For more info and a quick summary, see this thread and this docs page. You can find more details on the recipes used on this page, or in your PRODIGY_README.html.

TL;DR

Note that this workflow requires Prodigy v1.1.0.

  1. Use the terms.teach recipe with word vectors (e.g. a large spaCy model) to create a terminology list of examples of your new entity type. For a DOG entity, you could for example start off with the seed terms “labrador”, “golden retriever” and “poodle”. Based on the vectors, Prodigy will suggest similar words to add to your list – for example, “corgi”.

  2. Convert your terminology list to match patterns that can be loaded by spaCy’s Matcher using the terms.to-patterns recipe. This will give you a JSONL file with entries like {"label": "DOG", "pattern": [{"lower": "golden"}, {"lower": "retriever"}]}.

  3. Collect annotations for the new entity type using ner.teach with your patterns file as the --patterns argument. The patterns are used to suggest entities found in your data – this helps you collect a bunch of relevant examples first, to get over the “cold start problem”. As the model in the loop improves, it will also start suggesting entities based on what it’s learned so far. You’ll probably want to collect a few hundred annotations before running the first training experiments.

  4. Train your model using ner.batch-train and export it. Hopefully, you’ll now see a nice, initial accuracy score! You can also run ner.train-curve to see how accuracy improves with more data.

  5. Test your exported model on real data (and make sure to use texts the model hasn’t seen during training). You can either use the ner.print-stream recipe to get some nicely formatted command line output, or load the model into spaCy using spacy.load('/path/to/model') and check out its doc.ents.

  6. :tada:
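
To make step 2 a bit more concrete, here's a rough sketch of what the conversion from a terminology list to match patterns boils down to. This is not Prodigy's actual implementation of terms.to-patterns, just an illustration of the output format shown above. The function names terms_to_patterns and to_jsonl are made up for this example, and the plain whitespace split is an assumption that stands in for real tokenization.

```python
import json


def terms_to_patterns(terms, label):
    """Turn plain-text terms into token-based match patterns, one per term.

    Hypothetical helper for illustration only, not part of Prodigy or spaCy.
    """
    patterns = []
    for term in terms:
        # Assumption: a naive whitespace split. A real tokenizer may split
        # differently on punctuation and hyphens.
        tokens = [{"lower": word.lower()} for word in term.split()]
        if tokens:  # skip empty or whitespace-only terms
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(pattern) for pattern in patterns)


terms = ["labrador", "Golden Retriever", "poodle", "corgi"]
patterns = terms_to_patterns(terms, "DOG")
print(to_jsonl(patterns))
```

Each term becomes one pattern with one dict per token, so a multi-word term like “golden retriever” turns into a two-token pattern. Matching on the "lower" attribute means the pattern will also match “Golden Retriever” or “GOLDEN RETRIEVER” in your data.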