Having trouble training a blank model with new entity types

I’m attempting to train a model with a couple of custom entities to recognize exercises and quantities.
Here’s an example phrase: “Four sets of twenty reps of standing calf raises and five sets of nine reps deadlifts”.
The model should recognize both the rep and set counts, as well as the exercise being performed. Most of the entities span multiple tokens.
My training dataset has about 100 of these types of phrases with an additional 250 phrases that talk about fitness subjects in general.

I’ve attempted to train a proof-of-concept model by following this video:
https://prodi.gy/docs/video-new-entity-type

I’ve created a jsonl file containing the training data formatted as such:

{"text":"incline chest press one set of six repetitions"}
{"text":"i did twenty burpees and thirty two push ups"}
{"text":"four hyperextensions and sixteen sit ups"}

I’ve created a patterns jsonl file formatted as such:

{"label":"EXERCISE","pattern":[{"lower":"squat"}]}
{"label":"EXERCISE","pattern":[{"lower":"leg"},{"lower":"press"}]}
{"label":"EXERCISE","pattern":[{"lower":"lunge"}]}
{"label":"EXERCISE","pattern":[{"lower":"deadlift"}]}
{"label":"EXERCISE","pattern":[{"lower":"leg"},{"lower":"extension"}]}
{"label":"EXERCISE","pattern":[{"lower":"leg"},{"lower":"curl"}]}
{"label":"QUANTITY","pattern":[{"lower":"six"}]}
{"label":"QUANTITY","pattern":[{"lower":"twenty"},{"lower":"one"}]}
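Since the quantities in these phrases are spelled out and often span two tokens ("twenty one", "thirty two"), the QUANTITY entries could be generated rather than written by hand. The following is a minimal sketch, assuming the numbers of interest are 1 to 99 and that each number word is a separate whitespace-delimited token; `number_phrases` is a hypothetical helper, not part of spaCy or Prodigy.

```python
import json

ONES = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_phrases():
    """Return the spelled-out numbers 1-99 as lists of lowercase words."""
    phrases = [[word] for word in ONES + TEENS]
    for ten in TENS:
        phrases.append([ten])                            # e.g. ["twenty"]
        phrases.extend([ten, one] for one in ONES)       # e.g. ["twenty", "one"]
    return phrases


# One token pattern per number, in the same shape as the patterns file above.
quantity_patterns = [
    {"label": "QUANTITY", "pattern": [{"lower": word} for word in words]}
    for words in number_phrases()
]

# Print a few entries as JSONL; in practice each one would be written to the patterns file.
for entry in quantity_patterns[:3]:
    print(json.dumps(entry))
```

Hyphenated spellings such as "twenty-one" would be tokenized differently and are not covered by this sketch.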

I’ve decided to train the named entity recognizer from scratch as the default entities are not relevant to me. I’ve created and saved a blank 'en' model:

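import spacy

# Create a blank English pipeline and add an untrained NER component
nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('ner'))
# Initialize the model weights so the pipeline can be saved and loaded
nlp.begin_training()
nlp.to_disk('/path/to/model')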

I've collected annotations using:

prodigy ner.teach workouts model training_data.jsonl --patterns patterns.jsonl --label "EXERCISE, QUANTITY"

I then trained and exported the model using:

prodigy ner.batch-train workouts model --output workouts-model --label "EXERCISE, QUANTITY" --eval-split 0.2 --n-iter 25 --batch-size 8

The resulting model is entirely inaccurate. I’ve noticed that none of the patterns ever seem to be matched during annotation. I’m not sure if this is because my phrases often contain plural forms of the exercises. Should I be using "lemma" instead of "lower"? Will the lemma property even work with a blank model? I'm also getting very strange entity suggestions, and the model has difficulty recognizing quantities that span multiple tokens.
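One way to cover plural forms while staying with "lower" is to list both the singular and the plural form of each exercise as separate patterns. The following is a minimal sketch under stated assumptions: `pluralize` and `make_patterns` are hypothetical helpers (not part of spaCy or Prodigy), the plural rule is deliberately naive, and phrases are split on whitespace, which may not match the tokenizer for hyphenated names.

```python
import json


def pluralize(word):
    """Naive English plural rule, for illustration only."""
    if word.endswith(("s", "x", "ch", "sh")):
        return word + "es"
    return word + "s"


def make_patterns(label, phrases):
    """Yield a pattern for each phrase plus a variant with its last word pluralized."""
    for phrase in phrases:
        tokens = phrase.lower().split()
        variants = [tokens, tokens[:-1] + [pluralize(tokens[-1])]]
        for variant in variants:
            yield {"label": label, "pattern": [{"lower": tok} for tok in variant]}


exercises = ["squat", "leg press", "lunge", "deadlift"]
patterns = list(make_patterns("EXERCISE", exercises))

# Each printed line has the same shape as an entry in the patterns file.
for entry in patterns:
    print(json.dumps(entry))
```

This only addresses plural coverage in the patterns file; it does not say anything about whether "lemma" is usable with a blank model.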

I've also tried to use the ner.manual recipe, but I get "No tasks available" after tagging the first 9 phrases.

I'm not quite sure what to try next.

FYI, I originally created a similar model to identify foods, quantities, and units using spaCy alone, with a training dataset of similar size, and got quite good results for such a limited dataset. I'm giving Prodigy a try because manually tagging entities for every phrase was too time-consuming, but the workflow looks to be very different.


Are you using v1.5.1? We’ve just released a fix for the problem I think you’re having. If you haven’t updated yet, try v1.6.1, and please let us know if the problem still persists.

Updating to 1.6.1 has fixed the issues. Thank you!
