Label mismatch in Pattern file and textcat.teach command

Hi there,

Thanks for a great product.

I have mistakenly done this:

  • Created a patterns file containing seed words using the “terms.to-patterns” recipe with the label LABEL_A
  • Started a textcat.teach annotation session using this patterns file, but with “--label” specified as LABEL_B

In the Prodigy web UI it shows the label “LABEL_A” in the box above the text being annotated.

I’m worried that I have a label mismatch. What would be the implication for the resulting model, i.e. would the classifier use LABEL_A or LABEL_B?

My questions:

  1. Do I need to redo the textcat.teach session using a corrected patterns file?
  2. Is the label in the patterns file just “for info” in the UI, so that the model will still use LABEL_B (as provided in the textcat.teach command)?

Thanks for your help.

If you're using textcat.teach with patterns, you're essentially using two models: the text classifier, which will suggest examples based on the model's predictions for the label(s) specified on the command line, and the pattern matcher, which will suggest examples based on your patterns file.

So in your case, you've told the text classifier that you want to annotate LABEL_B, but it knows nothing about it yet, so the suggestions you're seeing are all based on your patterns, which describe LABEL_A. Normally, the text classifier would eventually "kick in" and suggest examples for the label(s) as well – but this won't happen here, since you're only annotating LABEL_A and it never learns anything about LABEL_B.
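To make that concrete, the setup you've described corresponds to something like this – the dataset, model and file names are just placeholders, and the exact recipe arguments can differ slightly between Prodigy versions:

  # patterns created from a dataset of seed terms, labelled LABEL_A
  prodigy terms.to-patterns seed_terms patterns.jsonl --label LABEL_A

  # teach session started with a different label on the command line
  prodigy textcat.teach textcat_data en_core_web_sm news.jsonl --label LABEL_B --patterns patterns.jsonl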

Btw, to tell where the examples you're annotating are coming from, you can check out the bottom right corner of the annotation card. You'll either see a score (predicted by the model) or a pattern number (corresponding to a line in the current patterns file), indicating which pattern was used to produce the match.

To answer your questions more explicitly:

Your annotations aren't "wrong" – but they're also not as useful as they could be, because you've only annotated pattern matches for LABEL_A and didn't really get to work with the model. So I'd definitely suggest rerunning textcat.teach. Maybe start with a new dataset, so you can run separate experiments later on: one with only the new set, and one with the new set and old set combined (to see if it makes a difference).

The --label on the command line won't override anything in the patterns. The patterns are just a list of examples for the individual labels, to tell Prodigy "If you come across a match for this, it might be label X".
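For illustration, the entries terms.to-patterns writes out look roughly like this (the seed words here are invented):

  {"label": "LABEL_A", "pattern": [{"lower": "awful"}]}
  {"label": "LABEL_A", "pattern": [{"lower": "dreadful"}]}

That label field is what you see in the UI when a pattern match is shown – it's attached to the suggestion, independently of the --label you pass to textcat.teach.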

Prodigy could also handle cases like this better. Since we're parsing all patterns upfront anyway, the pattern matcher could at least warn you if one or more of the input labels aren't present in your patterns. This would also let us filter the patterns by --label by default.
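In the meantime, if you ever want to restrict a patterns file with several labels to just the one you're annotating, a small script like this would do it (a minimal sketch, assuming a standard JSONL patterns file; the file names and label are placeholders):

  import json

  # keep only the patterns whose label matches the one we're annotating
  with open("patterns.jsonl", encoding="utf8") as f_in, \
       open("patterns_label_b.jsonl", "w", encoding="utf8") as f_out:
      for line in f_in:
          entry = json.loads(line)
          if entry["label"] == "LABEL_B":
              f_out.write(json.dumps(entry) + "\n")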

Thanks, Ines. Really appreciate the quick response.

In my case, I was deliberately using the same set of seeds for each classifier. The two classifiers are to classify POSITIVE and NEGATIVE text. The seeds were the same since they contained words that are ambiguous in isolation (e.g. even a word like “low” is context sensitive; low IQ = NEGATIVE but low arrogance = POSITIVE).

So, if I understand you correctly, I should be fine.

Just a final point of clarification: Am I right that the label used in the final model is the one provided by the --label option of the textcat.teach command e.g. my positive model will use the POSITIVE label (as provided at the command line) even if the labels in the pattern file were NEGATIVE? (I’ve created separate datasets to store the positive and negative annotations).


Yes – at least for the model you're training in the loop with textcat.teach. For the final model, you'll probably batch train from your collected annotations, which will take into account all labels present in the data you're training from.
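For example, once the annotations are collected, the batch training step might look roughly like this (dataset names and output paths are placeholders, and recipe names differ between Prodigy versions – newer releases use a single train recipe instead of textcat.batch-train):

  prodigy textcat.batch-train positive_annotations en_core_web_sm --output /tmp/model-positive --eval-split 0.2
  prodigy textcat.batch-train negative_annotations en_core_web_sm --output /tmp/model-negative --eval-split 0.2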

Thanks. Got it.

Right or wrong, at the moment I’ve been using a protocol of “one dataset per annotation task”, but I think I can consolidate later by using db-out (on all datasets) and then db-in to a single “merged” dataset, from which I can generate a model that does all annotations. Sounds OK, or unnecessary?


Yes, “one dataset per annotation task” is definitely the strategy we’d recommend. This is also very much in line with what we envision for larger workflows in the future (and the task management in the upcoming Annotation Manager).

You can always merge smaller datasets into one later, but separating existing large sets is obviously more difficult. Keeping a set for every smaller piece also allows you to run experiments with different combinations of annotations – for example, to see if an approach with different patterns improves the model or not.
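The merge you describe could look something like this on the command line (dataset and file names are placeholders; newer Prodigy versions also offer a db-merge recipe that combines datasets in one step):

  # export each task-specific dataset to JSONL
  prodigy db-out positive_annotations > positive.jsonl
  prodigy db-out negative_annotations > negative.jsonl

  # import everything into a single merged dataset
  prodigy db-in merged_annotations positive.jsonl
  prodigy db-in merged_annotations negative.jsonl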

Brilliant. Thanks again for your support. I’m very impressed!