Pattern files for textcat.teach

Hello,

I don’t think I’m using the pattern file for textcat.teach correctly. I used terms.teach to generate a seed terms dataset and then converted that to a pattern file using terms.to-patterns. However, when using textcat.teach, not a single relevant example has been presented from my source file (after 400 examples). The examples do seem qualitatively close together, with scores in the range of 0.35-0.45.

Eyeballing the source file I can find 34 positive examples from just one seed term.

Any suggestions?

Thanks

This is strange indeed – your workflow definitely sounds correct! Could you post an example of the patterns and maybe an example from your stream? And which model/vectors did you use in terms.teach?

I am seeing what I believe to be similar behavior, so I thought I’d chime in.

I’ve reduced this to a really simple example that I think should be surfacing examples from the Reddit corpus that contain the word “schedule”, but I’m not getting that.

reddit_patterns.jsonl:

{"label":null,"pattern":[{"lower":"schedule"}]}

Command line:

prodigy dataset test
prodigy textcat.teach test en_core_web_sm ./RC_2017-01.bz2 --loader reddit --label Scheduling --patterns ./reddit_patterns.jsonl

This ‘works’ in the sense that Prodigy starts and gives me examples to annotate, but across multiple tries, refreshes, and annotations, not a single one has contained the word ‘schedule’, nor anything I can imagine having anything to do with the word.

"label": null

This is likely the problem here. If no "label" is assigned to the pattern, Prodigy has no way of knowing which category the pattern refers to. Or rather, it does, but that category is null, i.e. None. So in your case, Prodigy will only present you with examples of the category "Scheduling" for annotation (since this is what you've specified on the command line). But since there are no patterns for "label": "Scheduling", you'll only see the model's predictions and never any pattern matches.

If you've used terms.to-patterns, you can set --label Scheduling to assign a label to your patterns. Or you can do it manually by editing the JSONL file.
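For example, the corrected line in reddit_patterns.jsonl would look like this:

{"label":"Scheduling","pattern":[{"lower":"schedule"}]}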

The terms.to-patterns recipe should probably output a better warning if you don't set a --label argument, because patterns with no label should never be the intended output. Maybe the text classifier could output a warning, too, if it comes across a None label.

Thanks for the quick reply. I should’ve mentioned that I tried this as well (changing to "label": "scheduling"), but it still seems to be giving me examples that I can’t relate to that label.

I’m thinking about this in the context of your Insults Classifier example, where using these patterns should surface at least some examples that contain these words. Is it that my example isn’t working correctly, or am I misunderstanding what it should be presenting to me with the patterns?

Just to confirm (sorry if this was obvious or just a typo here): the labels are case-sensitive, so if your patterns file includes "label": "scheduling" and you set --label Scheduling on the command line when you run textcat.teach, you might have the same problem.

I'm surprised you don't see any matches at all, but then again, if you're using a random portion of the Reddit corpus, it's not completely impossible that the word "schedule" simply doesn't occur. As a sanity check, did you try it with a more common word that should definitely be found? Like, "you" or something similar?

Prodigy's pattern matching is pretty straightforward: each text in the batch is tokenized, and if Prodigy comes across a token that matches your pattern (in this case, the token "schedule"), the match is highlighted and the example is presented as a potential candidate for the respective label. The candidates produced by the patterns are then mixed in with the model's suggestions. The idea is that you start off by annotating mostly pattern matches, so the model gets a better idea of what you're looking for and is able to suggest better candidates more quickly.
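If you want to double-check a pattern outside of Prodigy, here's a rough sketch of the same token-level matching using spaCy's Matcher directly (spaCy v2-style API; the match key and example sentence are just illustrative):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
# Token-level pattern, equivalent to {"lower": "schedule"} in the patterns file
matcher.add('Scheduling', None, [{'LOWER': 'schedule'}])

doc = nlp('Can we schedule a call for Monday?')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> schedule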

In the beginning, the model knows absolutely nothing about your label – which also explains the completely random examples you were seeing. Every text is just as likely to be about "scheduling" and getting over the cold start problem this way is pretty difficult. So the patterns are pretty important here.

If you're working with more specific categories, it might also make sense to pre-select the texts – e.g. only focus on one subreddit or use a different data source that contains less noise. This wasn't really necessary for the insults example, because insults are pretty much everywhere. But even for our NER tutorial on training a new label DRUG we only extracted data from r/opiates.

You're right, this was a typo here, but it was correct in my code.

I tried using the word "you", and you're right, this pulls up a ton of examples with the word "you" highlighted, as I would expect. This also works for words like "and" or "the". But once I switch to other words that I see in examples but that are less simple (e.g., "weekend", "organization", or "enjoy"), they aren't surfaced as examples (in fact, I get no highlighted words in my examples at all). Even when those words do show up in the examples served, they aren't highlighted (e.g., an example input that ends with "Enjoy" isn't highlighted when my pattern is for "enjoy").

Again, I still might be missing something here, but it seems like this pattern matching on my end (using RC_2017-01.bz2) is only working for extremely short and common words, and not for words that show up hundreds to thousands of times in that same file, but are more complex and/or relatively less common.

This is very strange! It’s possible there’s a bug in the matcher (possibly in spaCy) that only occurs if the word wasn’t in the model’s initial vocabulary. I’m not sure this is likely though – if that were the case, it seems like a lot of other things wouldn’t be working!

For sanity, could you try working with the en_vectors_web_lg model instead of en_core_web_sm? That has a large vocab, so we’ll find out whether the issue comes down to that. Every word you’re likely to try will probably be in the vocabulary already.

You can also try debugging your patterns with our new demo: https://explosion.ai/demos/matcher . This helps you check that your patterns match the text you think they should, so you can pick up problems like casing etc.

If problems persist, it’s a lot easier and less frustrating to debug the process through Python, instead of by clicking through examples. Then you can use tests etc as well. You can find the code for the textcat.teach recipe in prodigy/recipes/textcat.py. It should be pretty easy to create a minimal test-case, as things mostly take iterables of dictionaries as input, and produce iterables of dictionaries as output.
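For example, a minimal sketch using Prodigy's PatternMatcher on a hand-made stream could look roughly like this (the file path and example texts are placeholders):

import spacy
from prodigy.models.matcher import PatternMatcher

nlp = spacy.blank('en')
matcher = PatternMatcher(nlp).from_disk('./reddit_patterns.jsonl')
# A tiny hand-made stream of task dictionaries
stream = [{'text': 'I need to schedule a dentist appointment.'},
          {'text': 'Nothing relevant in this one.'}]
for score, eg in matcher(stream):
    # Pattern matches should come through with the matched span attached
    print(score, eg['text'], eg.get('spans'))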

I tested this with en_vectors_web_lg:

prodigy dataset test
prodigy textcat.teach test en_vectors_web_lg ./RC_2017-01.bz2 --loader reddit --label Test --patterns ./patterns.jsonl

patterns.jsonl:

{"label":"Test","pattern":[{"lower":"funny"}]}

The word “funny” is relatively prevalent in RC_2017-01.bz2, but examples aren’t surfaced.

I’ll look into doing some direct testing, but it might be better for me to skip the Prodigy loader altogether and retrieve reddit data in another way.

I came back to this problem after some time. Unfortunately I have been unable to solve it, and it persists with Prodigy 1.4.2. The problem is independent of the spaCy model used – whether it’s the large or small one or my own custom domain-specific model. Prodigy is just not presenting sufficient examples based on the pattern file. For one term it presents three examples, though there are 77 in the training data; for another term it also presents three examples, though there are 134. Unfortunately I can’t post samples publicly.

Prodigy seems like a great idea, but despite putting hours into getting it to work, I’ve had no success and will have to stick to a more labour-intensive but more effective annotation workflow.

@lswright If some part of the logic isn’t working well on your data, you can always strip that part back and just use the labelling front-end, the DB integration, etc. You might also be able to find what’s wrong.

Here’s a simpler custom recipe for Prodigy that will just feed you examples that match your patterns, without the active learning.


import spacy
from prodigy.models.matcher import PatternMatcher
from pathlib import Path
import json
from prodigy import recipe
from prodigy.components.db import connect


@recipe('textcat.simple-teach',
    dataset=("Dataset ID", "positional", None, str),
    source_file=("File path or stdin", "positional", None, Path),
    patterns=("Path to match patterns file", "positional", None, Path),
    label=("Label to annotate", "option", "L", str)
)
def simple_teach(dataset, source_file, patterns, label="LABEL"):
    DB = connect()
    nlp = spacy.blank('en')
    matcher = PatternMatcher(nlp).from_disk(patterns)
    # For this example, I assume the source file is already formatted as jsonl
    stream = (json.loads(line) for line in open(source_file))
    stream = (eg for score, eg in matcher(stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'update': None,
        'config': {'lang': 'en', 'labels': [label]}
    }
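
To run it, save the recipe to a file (the file name is up to you, e.g. simple_teach.py) and point Prodigy at it with the -F flag, something like:

prodigy textcat.simple-teach my_dataset ./my_data.jsonl ./patterns.jsonl --label Scheduling -F simple_teach.py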

Thank you very much, Matthew. This has made a huge difference – all the examples are now based on the patterns. My only other query is that although the ‘view_id’ is set to classification ( 'view_id': 'classification' ), the interface appears as an NER annotation task, which slows down reading because the labels appear highlighted with the pattern terms in the text. Any suggestions?

@lswright I think @honnibal forgot to change the default settings that define the type of tasks produced by the pattern matcher. Try the following:

matcher = PatternMatcher(nlp, label_span=False, label_task=True).from_disk(patterns)

label_span specifies whether the matched label is added to the span, and label_task whether it’s added as a “global” label to the whole text.
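To illustrate the difference (the exact task fields can vary slightly between versions), label_span=True produces a task roughly like

{"text": "Can we schedule a call?", "spans": [{"start": 7, "end": 15, "label": "Scheduling"}]}

whereas label_task=True produces something closer to

{"text": "Can we schedule a call?", "label": "Scheduling", "spans": [{"start": 7, "end": 15}]}

which is what the classification interface expects.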

Thank you, that fixed it!


I’m still experiencing weird behavior with this, even outside the Reddit data or anything intensive to search through. As an example:

I’ve created a plain-text input file with 60 rows of single-word data. There are ten instances each of the words the, dog, cat, turtle, horse, and sky.

My patterns.jsonl file is as follows:

{"label":"testlabel","pattern":[{"lower":"sky"}]}

I invoke prodigy as such:

prodigy dataset testdataset
prodigy textcat.teach testdataset en_vectors_web_lg data.txt --label testlabel --patterns patterns.jsonl

Yet, the ‘sky’ case shows up in a random order with the other words, is never highlighted, and never shows a “pattern” match in the metadata.

I’ve tried many permutations of this, and I just cannot figure it out. Does this work? Am I making a simple fundamental error here?

@cody Thanks a lot for the report! This definitely sounds like it should work exactly as you expected. Let me try to reproduce this and see if I can figure out what the problem might be :slightly_smiling_face:

I’m experiencing the same issue.
My patterns are served randomly with the data, and when they appear, sometimes they are not highlighted.
Since my label is pretty rare, most of the annotations are ‘reject’ and the model converges to a score of 0 for all examples :frowning:

As an update to my original post (the first one in this thread): I’ve tried again with the default textcat.teach command using a pattern file and Prodigy 1.5.1. Unfortunately, it’s still not functioning correctly, with no relevant examples being presented.

Could prodigy be explicitly ignoring the terms in the pattern file?

I just ran your example with only one small modification: instead of the vectors, I used the en_core_web_sm model. I also only used a single instance of each of the words, because Prodigy will filter out duplicates anyway. The predictions were obviously random, because the model doesn't know testlabel yet – but I saw all the terms, followed by a pattern match of "sky" :thinking:

I spent a lot of time reproducing this and trying to get to the bottom of what could be happening here – it's kinda tricky, because as you can see from the code, the implementation is pretty much identical to the one in ner.teach.

The most likely explanation imo is that depending on the model state and the exponential moving average of the score in the prefer_uncertain sorter, the pattern matches are filtered out. This would also explain why this behaviour has been difficult to reproduce and only occurs sometimes in certain situations.

So for cases like that, we could offer an option to only partially apply the sorter to the stream, or, more generally, come up with an API that would allow examples in the stream to not be sorted or filtered, regardless of their score.
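Just to sketch the idea (this isn't an existing Prodigy API, only an illustration): you could send the model's suggestions through prefer_uncertain as usual, but interleave the raw pattern matches so they never get filtered out.

from itertools import tee
from prodigy.components.sorters import prefer_uncertain

def partially_sorted(matcher, model, stream):
    # Duplicate the stream so the matcher and the model each see every example
    stream1, stream2 = tee(stream)
    matches = (eg for score, eg in matcher(stream1))    # pattern matches, never filtered
    suggestions = prefer_uncertain(model(stream2))      # model suggestions, sorted as usual
    # Naive round-robin: alternate between matches and suggestions
    for match, suggestion in zip(matches, suggestions):
        yield match
        yield suggestion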

By default, Prodigy will show you the model's predictions and the pattern matches (and won't prioritise one over the other). So it's possible that you first see the model's suggestion and then the same text again, because a pattern was matched on that text.

If you're dealing with rare labels and a large corpus, it might make sense to divide the bootstrapping process into two steps: first use a simple matcher recipe with no model in the loop to select enough positive examples for the label (or pre-select all matches from your stream, export them to a file and then load that into Prodigy). This way, you'll only see the matches and can work through them quickly. You can then use that data to pre-train the model, and use textcat.teach to improve it in the loop. This also makes the process more predictable: if the textcat.teach session doesn't produce good results, you can go back to the previous step, add more initial training examples via patterns and then repeat the process.

Thanks for the thorough reply! Hopefully this is the final set of clarifications to be able to move on without patterns.

I tried to reproduce this and wasn't able to get the pattern match to work, but since the tasks come up differently each time, I do believe that with enough runs I might wind up with the results you saw.

This makes sense to me, and I've seen this suggested elsewhere in the support boards. Can I clarify that the steps below are what you are imagining?

prodigy dataset insult_bootstrap
grep -i -E 'list|of|insult|words' inputfile.jsonl | prodigy mark insult_bootstrap --label INSULT --view-id classification
prodigy textcat.batch-train insult_bootstrap en_core_web_lg --output insult_bootstrap_model

Now I have a model that's pre-trained on the insult label words. This will be used for a regular training round:

prodigy dataset insult
prodigy textcat.teach insult ./insult_bootstrap_model --label INSULT
prodigy textcat.batch-train insult ./insult_bootstrap_model --output insult_model

Does this look right? A few clarifications:

  1. Is the idea to filter down to more likely correct cases, annotate them under a dedicated label, and then export a trained model? Or is this really a simpler process that I can do entirely within one dataset?
  2. Should I have imported the insult_bootstrap dataset into the insult dataset to train the actual/final model on those annotations also?
  3. Should both the textcat.teach and textcat.batch-train steps use the exported insult_bootstrap_model as their starting model?

Thanks again for your help with this!