✨ Video: Training a new entity type with Prodigy

We just released a new video that shows the workflow for adding a new entity type from scratch. We’ve been fine-tuning this workflow for some time, so we’re super excited to have it working so well now :tada:

The video shows you how to create a terminology list with terms.teach, starting from just three seed terms. After converting the word list to a patterns file using terms.to-patterns, we use that file to bootstrap a new entity type with ner.teach. The neural network starts out with no examples of the new type, but you get suggested matches from the patterns file built from the terms.teach word list. The suggestions you accept then become positive examples for the neural network. This is enough to get the model to start suggesting phrases too, and its suggestions are mixed in with those from the pattern matcher. Before long, the statistical model takes over, and the normal active learning process can continue.
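As a rough illustration of the pattern-matching step described above, here is a minimal sketch of how token patterns from a patterns file could flag candidate spans in text. It is plain Python with a whitespace tokenizer, not Prodigy's or spaCy's actual matcher; the sample patterns, the helper names `load_patterns` and `find_matches`, and the example sentence are all made up for illustration.

```python
import json

# Hypothetical patterns file contents, one JSON object per line (JSONL).
PATTERNS_JSONL = """\
{"label": "DRUG", "pattern": [{"lower": "heroin"}]}
{"label": "DRUG", "pattern": [{"lower": "benzos"}]}
{"label": "DRUG", "pattern": [{"lower": "weed"}]}
"""


def load_patterns(jsonl_text):
    """Parse JSONL text into a list of pattern dicts."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def find_matches(text, patterns):
    """Return (start_token, end_token, label) for every pattern match.

    Simplification: tokens come from str.split(), and only the "lower"
    token attribute is supported.
    """
    tokens = text.split()
    matches = []
    for entry in patterns:
        wanted = [tok["lower"] for tok in entry["pattern"]]
        n = len(wanted)
        for i in range(len(tokens) - n + 1):
            window = [t.lower() for t in tokens[i:i + n]]
            if window == wanted:
                matches.append((i, i + n, entry["label"]))
    return matches


patterns = load_patterns(PATTERNS_JSONL)
suggestions = find_matches("Switched from Heroin to weed last year", patterns)
print(suggestions)
```

In the real workflow, matches like these are what you accept or reject in the browser interface, and the accepted ones become training examples for the model.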

As an example of this bootstrapping process, we’ve trained a new entity recognition model to detect references to drugs in social media text. I’m hoping to use the model in a small data science project, using text from a large online community of opiate users. I want to look at how often different substances have been mentioned in these discussions over time, to see how the popularity of different substances such as synthetic opioids might relate to health outcomes such as overdose rates.


This video is great. I have a question.
I followed the video as below:

  1. I ran python -m prodigy terms.teach test_dataset en_core_web_lg --seeds "heroin, benzos, weed"
  2. I labeled some of the terms using the browser interface.
  3. I ran python -m prodigy terms.to-patterns test_dataset test_drug_terms.jsonl --label DRUG

However, my output is different from the video after step 3. My pattern does not include "LOWER" (see below) like in the video.
The video result shows {"label":"DRUG","pattern":[{"lower":"heroin"}]}
My result shows {"label":"DRUG","pattern":"heroin"}

Why is this?

I am using
Prodigy 1.10.8
Python 3.8.3
SQLite (in prodigy database)

Glad you liked the video! It's a bit older already and Prodigy got a bunch of new features since, and the terms.to-patterns recipe now has more options: https://prodi.gy/docs/recipes#terms-to-patterns

One of these options is that you can now create multi-word patterns and explicitly provide a spaCy model to use to tokenize the patterns. In that case, you get token-based patterns with descriptions of the tokens and their attributes, e.g. {"lower": "heroin"}. So you can set --spacy-model blank:en to use the default English tokenizer. If you don't provide a model for tokenization, the pattern will be an exact string match.
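Based on the command from step 3 above, re-running it with the extra flag would presumably look like `python -m prodigy terms.to-patterns test_dataset test_drug_terms.jsonl --label DRUG --spacy-model blank:en`. To make the difference between the two output formats concrete, here is a small sketch that turns exact-string entries into token-based ones. It only approximates the idea: whitespace splitting stands in for the spaCy tokenizer, and `to_token_pattern` and the second sample entry are made up for illustration, not part of Prodigy.

```python
import json


def to_token_pattern(entry):
    """Convert an exact-string pattern entry into a token-based one.

    Assumption: str.split() stands in for a real tokenizer, so punctuation
    and other edge cases are not handled the way spaCy would handle them.
    """
    pattern = entry["pattern"]
    if isinstance(pattern, str):
        # One {"lower": ...} dict per token, matching case-insensitively.
        pattern = [{"lower": tok.lower()} for tok in pattern.split()]
    return {"label": entry["label"], "pattern": pattern}


# Entries in the exact-string format (no tokenizer provided).
string_entries = [
    {"label": "DRUG", "pattern": "heroin"},
    {"label": "DRUG", "pattern": "Synthetic opioids"},  # hypothetical multi-word term
]

token_entries = [to_token_pattern(e) for e in string_entries]
for entry in token_entries:
    print(json.dumps(entry))
```

The string form only matches the exact text, while the token form matches each token on its lowercase form, so "Heroin" and "heroin" would both be covered.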
