> This means it’s trying to show you roughly one suggestion from the model for each suggestion from the pattern matcher. The true matches from the pattern matcher are added as training examples for the model, and the model also learns when you click yes or no to the suggestions.
Just to clarify: is the model supposed to learn live, during the `ner.teach` process, or only after I close the annotation window and run `ner.batch-train`? I'm asking because, whilst using a patterns file, `ner.teach` seems to behave exactly like `ner.match`; that is, it merely yields matches to the patterns and does not appear to interleave novel suggestions from the model. Additionally, even when the matched strings are exact duplicates of one another, the annotation window displays these duplicates over and over, rather than preferring more uncertain queries.
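For reference, here is my rough mental model of what `ner.teach` should be doing, pieced together from the custom-recipe docs and example recipes. The component names below are from the docs; the wiring is my paraphrase, not the actual recipe source:

```python
from prodigy.models.matcher import PatternMatcher
from prodigy.models.ner import EntityRecognizer
from prodigy.components.sorters import prefer_uncertain
from prodigy.util import combine_models

def teach_stream(nlp, stream, patterns_path, label):
    # Pattern matcher seeded from the patterns file.
    matcher = PatternMatcher(nlp).from_disk(patterns_path)
    # Annotation model wrapping the spaCy NER component.
    model = EntityRecognizer(nlp, label=label)
    # Interleave suggestions from both sources roughly 1:1, and let the
    # sorter prefer examples the model is most uncertain about.
    predict, update = combine_models(model, matcher)
    stream = prefer_uncertain(predict(stream))
    # `update` is the callback invoked with each batch of answers, so
    # the model *should* be learning in the loop, not just at batch-train.
    return stream, update
```

If that's roughly right, then even with a patterns file I'd expect about half the questions to come from the model, which is not what I'm seeing.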
I'm pretty certain that something is broken, and my gut feeling is that, in a `ner.teach` command, the model's output is only considered when the label is passed as `--label lowercase`, while the pattern matcher's output is only considered when it is passed as `--label UPPERCASE`.
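One quick way to check this (assuming the output of `ner.batch-train` is an ordinary spaCy model directory, which it appears to be) would be to see which casing of the label the trained model actually contains:

```python
import spacy

# Load the model produced by ner.batch-train and list the NER labels
# it knows about (spaCy 2.x API).
nlp = spacy.load("models/addr_model_v01")
print(nlp.get_pipe("ner").labels)
```

If that prints only the lowercase label, an uppercase `--label` would never line up with the model's predictions, which would explain the constant score of 0 I describe below.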
I've done some tests to narrow this down. After performing my initial set of ~1000 annotations, I ran `ner.batch-train` and then ran `ner.teach` on the new model I trained. Whenever the label is spelled in lowercase (i.e. the correct casing), the suggestions are free to differ from the available patterns. But whenever the label is spelled in uppercase, I only get perfect pattern matches (no different from before `ner.batch-train` was run), and the score in the bottom corner of the annotation card is constantly 0. That should not be the case: after `ner.batch-train` is run, the model should have some idea of what constitutes an address entity.
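For completeness, the training step was roughly the following (the base model argument here is illustrative, not the exact command I ran):

```
python -m prodigy ner.batch-train addr_db_v01 en_core_web_sm --output models/addr_model_v01 --label addr
```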
Furthermore,

```
python -m prodigy ner.teach addr_db_v01 models/addr_model_v01 data/source.jsonl --label addr --patterns data/address_seeds.jsonl --exclude addr_db_v01
```

returns the same output as

```
python -m prodigy ner.teach addr_db_v01 models/addr_model_v01 data/source.jsonl --label addr --exclude addr_db_v01
```

which supports my suspicion: with the lowercase label, adding `--patterns` makes no difference at all, as if the pattern matcher were never consulted.
These bugs are breaking for me, because they disrupt the core purpose that Prodigy was meant to fulfill in my workflow. As of now, the pattern matching is basically just glorified regex with no model input, and it's frustrating to have spent the past week working around bugs in this software (see the other threads on `ner.batch-train` failing, etc.). What's the timeline on the updates to fix this?