textcat.teach - Patterns not filtering Label

Hi again! :slight_smile:

I've been trying to use a patterns file with textcat.teach using the following command:

prodigy textcat.teach categories en_core_web_lg data_path --label LABEL_A --patterns patterns_file.jsonl

When I add this --patterns argument, Prodigy starts suggesting other labels for me to accept/reject. I do have patterns for other labels in the patterns_file.jsonl file, but I was expecting that only LABEL_A patterns would be matched. Is this expected behaviour? If so, is the solution to alter the recipe to filter the patterns file when loading it?

My patterns_file.jsonl file:

{"label": "LABEL_A", "pattern": [{"lower": "news"}]}
{"label": "LABEL_A", "pattern": [{"lower": "hello"}]}
{"label": "LABEL_B", "pattern": [{"lower": "hello"}]}
....

Thanks.

That would probably make sense for consistency, yes. Or, more specifically, the PatternMatcher that takes care of suggesting examples based on the pattern should also be able to take a labels keyword argument to filter the labels.

Yes, that sounds like a reasonable solution in the meantime. Alternatively, you could also write a small stream wrapper that only yields out the examples if "label" is in your list of labels. For example:

def filter_stream(stream, labels):
    for eg in stream:
        if eg['label'] in labels:
            yield eg

This approach would also let you incorporate any other custom rules if necessary :slightly_smiling_face:
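To see what the wrapper does outside of Prodigy, here is a minimal sketch that runs it over a few made-up task dicts. The example texts are invented for illustration, and the use of `.get()` is my own variation so that tasks without a "label" key are skipped rather than raising a KeyError. It assumes each task in the stream is a plain dict with a top-level "label" key, as in the snippet above.

```python
def filter_stream(stream, labels):
    """Yield only the tasks whose "label" is one of the given labels."""
    allowed = set(labels)
    for eg in stream:
        # .get() skips tasks that carry no "label" key instead of raising KeyError
        if eg.get("label") in allowed:
            yield eg


# Made-up tasks standing in for what the recipe would stream out (assumption)
examples = [
    {"text": "hello world", "label": "LABEL_A"},
    {"text": "hello there", "label": "LABEL_B"},
    {"text": "latest news", "label": "LABEL_A"},
]

filtered = list(filter_stream(examples, ["LABEL_A"]))
print([eg["text"] for eg in filtered])
```

Because the function is a generator, it stays lazy and can wrap an endless stream without loading it into memory.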

Thanks Ines. Could you help me out and tell me where and how I should call filter_stream within the textcat.teach recipe?

Sure! :slightly_smiling_face: To find the location of your Prodigy installation, you can run the following command:

python -c "import prodigy; print(prodigy.__file__)"

The textcat.teach recipe will be in recipes/textcat.py. Just before the recipe returns its components, you’ll find this line, which scores the stream and sorts it by the examples the model is most uncertain about:

stream = prefer_uncertain(predict(stream))

Below, you can then add your filter:

stream = prefer_uncertain(predict(stream))
stream = filter_stream(stream, ['LABEL_ONE', 'LABEL_TWO'])

The above approach will still get the pattern matches for all patterns and only filter out the ones you’re not interested in. So maybe your initial idea of filtering the patterns before they’re added actually makes more sense. The patterns argument of the teach function will be a list of the patterns loaded in from your patterns_file.jsonl. So to only use the patterns of a certain label, you could also add the following right before matcher = PatternMatcher:

patterns = [p for p in patterns if p["label"] in ["LABEL_ONE", "LABEL_TWO"]]
matcher = PatternMatcher(... # etc

Thanks for the feedback. I tried the second option.

I implemented your suggestion by replacing ['LABEL_ONE', 'LABEL_TWO'] with label:

patterns = [p for p in patterns if p["label"] in label]

However, this gets me the following error:

    File "textcat2.py", line 63, in teach
      patterns = [p for p in patterns if p["label"] in label]
TypeError: 'PosixPath' object is not iterable

I thought label would be a list of the labels but I guess not and it’s not what I should use here. What should I use instead?

Omg, sorry, this was my mistake :woman_facepalming: The patterns argument is indeed the path to the patterns file, not actually the loaded patterns. So you first have to open the file, for example:

from prodigy.util import read_jsonl

patterns = read_jsonl(patterns)

Thanks, but this breaks in the next line because patterns is now a list and it’s trying to read it from disk:

    patterns = read_jsonl(patterns)
    patterns = [p for p in patterns if p["label"] in label]
    matcher = PatternMatcher(model.nlp, prior_correct=5.,
                             prior_incorrect=5., label_span=False,
                             label_task=True)
    matcher = matcher.from_disk(patterns)

Getting the following error:

File "cython_src/prodigy/models/matcher.pyx", line 187, in prodigy.models.matcher.PatternMatcher.from_disk
File "/usr/lib/python3.6/pathlib.py", line 1001, in __new__
  self = cls._from_parts(args, init=False)
File "/usr/lib/python3.6/pathlib.py", line 656, in _from_parts
  drv, root, parts = self._parse_args(args)
File "/usr/lib/python3.6/pathlib.py", line 640, in _parse_args
  a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not list

Could you give me a working example of how to add the patterns list to the matcher instead of reading from disk or another solution?

Okay, I think the first approach is just simpler – sorry for the confusion. You could make this work by creating a temporary file, but that’s probably overkill.
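For completeness, the temporary-file route could look roughly like the sketch below. It uses only the standard library; `write_filtered_patterns` is a hypothetical helper name, not part of Prodigy. It assumes the patterns file has one JSON object per line with a "label" key, and that `from_disk` accepts the returned path the same way it accepts the original patterns path in the recipe.

```python
import json
import tempfile
from pathlib import Path


def write_filtered_patterns(patterns_path, labels):
    """Copy the patterns whose label is in `labels` to a temp JSONL file.

    Hypothetical helper (not a Prodigy function). Returns the new file's path.
    """
    allowed = set(labels)
    with Path(patterns_path).open("r", encoding="utf8") as f:
        # One JSON object per non-empty line
        patterns = [json.loads(line) for line in f if line.strip()]
    kept = [p for p in patterns if p.get("label") in allowed]
    # delete=False keeps the file on disk after closing so it can be re-read
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf8"
    )
    with tmp:
        for p in kept:
            tmp.write(json.dumps(p) + "\n")
    return Path(tmp.name)


# Inside the recipe, this would replace the original path (assumption):
# filtered_path = write_filtered_patterns(patterns, ['LABEL_ONE', 'LABEL_TWO'])
# matcher = matcher.from_disk(filtered_path)
```

The temporary file is not cleaned up automatically, which is one reason this is more hassle than filtering the stream.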

I think it’s much easier if you just filter the whole stream by label, like this:

def filter_stream(stream, labels):
    for eg in stream:
        if eg['label'] in labels:
            yield eg

# within the recipe
stream = prefer_uncertain(predict(stream))
stream = filter_stream(stream, ['LABEL_ONE', 'LABEL_TWO'])

Hi Ines. This solution is working! Thank you.
