ner.teach not filtering by label when using patterns file

Hi, I'm trying to use the ner.teach recipe with a patterns file and would like to annotate examples for only one label at a time. However, the --label parameter seems to be ignored: pattern matches for the other labels still show up. I can get the desired behavior by using a separate patterns file that contains only the label I'm currently annotating, but I was hoping to avoid this workaround.

I am using Prodigy v1.10. Am I missing something obvious here?

test.jsonl

{"text": "spam is bad"}
{"text": "ham is good"}
{"text": "this ham is also good"}
{"text": "spam ham is confusing"}

test_patterns.jsonl

{"pattern": "spam", "label": "Spam"}
{"pattern": "ham", "label": "Ham"}

Command and logging output

12:19PM ~/work/ad-hoc/> PRODIGY_LOGGING=basic prodigy ner.teach test_spam en_core_web_lg ./test.jsonl --patterns test_patterns.jsonl --label "Ham"
12:19:07: INIT: Setting all logging levels to 20
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
12:19:08: RECIPE: Calling recipe 'ner.teach'
Using 1 label(s): Ham
12:19:08: RECIPE: Starting recipe ner.teach
12:19:08: LOADER: Using file extension 'jsonl' to find loader
12:19:08: LOADER: Loading stream from jsonl
12:19:08: LOADER: Rehashing stream
12:19:12: RECIPE: Creating EntityRecognizer using model en_core_web_lg
12:19:21: MODEL: Added sentence boundary detector to model pipeline
12:19:21: MODEL: Loading match patterns from disk
12:19:21: MODEL: Adding 2 patterns
12:19:21: MODEL: Ensure pattern labels are added to EntityRecognizer
12:19:21: RECIPE: Created PatternMatcher and loaded in patterns
12:19:21: SORTER: Resort stream to prefer uncertain scores (bias 0.0)
12:19:21: VALIDATE: Validating components returned by recipe
12:19:21: CONTROLLER: Initialising from recipe
12:19:21: VALIDATE: Creating validator for view ID 'ner'
12:19:21: VALIDATE: Validating Prodigy and recipe config
12:19:21: DB: Initializing database SQLite
12:19:21: DB: Connecting to database SQLite
12:19:21: DB: Creating dataset '2020-06-18_12-19-21'
12:19:21: CONTROLLER: Initialising from recipe
12:19:21: CONTROLLER: Validating the first batch for session: None
12:19:21: PREPROCESS: Splitting sentences
12:19:21: FILTER: Filtering duplicates from stream
12:19:21: FILTER: Filtering out empty examples for key 'text'
12:19:21: MODEL: Predicting spans for batch (batch size 64)
12:19:21: MODEL: Sorting batch by entity type (batch size 32)
12:19:21: CORS: initialized with wildcard "*" CORS origins

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

INFO:     ::1:49482 - "GET / HTTP/1.1" 200 OK
INFO:     ::1:49482 - "GET /bundle.js HTTP/1.1" 200 OK
12:19:26: GET: /project
INFO:     ::1:49482 - "GET /project HTTP/1.1" 200 OK
12:19:26: POST: /get_session_questions
12:19:26: FEED: Finding next batch of questions in stream
12:19:26: RESPONSE: /get_session_questions (5 examples)
INFO:     ::1:49482 - "POST /get_session_questions HTTP/1.1" 200 OK
INFO:     ::1:49482 - "GET /favicon.ico HTTP/1.1" 200 OK


Thanks for the report! That's definitely how it should behave, and I'm surprised it doesn't already :thinking: I've just adjusted this under the hood and will include the update in the next release (also for all other recipes that should filter by label).

In the meantime, you can easily change it yourself by editing the recipe and setting the filter_labels argument on the PatternMatcher to the labels of the recipe:

matcher = PatternMatcher(model.nlp, filter_labels=label).from_disk(patterns)

You can find the ner.teach recipe in recipes/ner.py in your Prodigy installation. To find the location of your Prodigy installation, you can run prodigy stats.
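If you'd rather not edit the installed recipe, the workaround from the question (a patterns file containing only the current label) can be scripted instead of maintained by hand. This is only a rough sketch using the standard library, not part of Prodigy: the function name filter_patterns and the output filename used below are made up for illustration, and it assumes every line of the patterns file is a JSON object with a "label" key, as in test_patterns.jsonl above.

```python
import json
from pathlib import Path


def filter_patterns(src, dst, labels):
    """Copy only the patterns whose "label" is in `labels` from src to dst.

    Hypothetical helper, not a Prodigy API. Returns the kept entries.
    """
    wanted = set(labels)
    kept = []
    with Path(src).open(encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines in the JSONL file
            entry = json.loads(line)
            if entry.get("label") in wanted:
                kept.append(entry)
    # Write the surviving patterns back out, one JSON object per line
    with Path(dst).open("w", encoding="utf8") as f:
        for entry in kept:
            f.write(json.dumps(entry) + "\n")
    return kept
```

For example, calling filter_patterns("test_patterns.jsonl", "ham_patterns.jsonl", ["Ham"]) would write a file containing only the "ham" pattern, which you could then pass via --patterns ham_patterns.jsonl together with --label "Ham".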


Just released v1.10.1, which should include the fix for this out of the box :slightly_smiling_face: