textcat.teach - Patterns not filtering Label

nuno · January 2, 2019, 5:32pm

Hi again!

I've been trying to use a patterns file with textcat.teach using the following command:

prodigy textcat.teach categories en_core_web_lg data_path --label LABEL_A --patterns patterns_file.jsonl

When I add this --patterns argument, Prodigy starts suggesting other labels for me to accept/reject. I do have patterns for other labels in the patterns_file.jsonl file, but I was expecting that only LABEL_A patterns would be matched. Is this expected behaviour? If so, is the solution to alter the recipe to filter the patterns file when loading it?

My patterns_file.jsonl file:

{"label": "LABEL_A", "pattern": [{"lower": "news"}]}
{"label": "LABEL_A", "pattern": [{"lower": "hello"}]}
{"label": "LABEL_B", "pattern": [{"lower": "hello"}]}
....

Thanks.

ines · January 4, 2019, 9:55pm

That would probably make sense for consistency, yes. Or, more specifically, the PatternMatcher that takes care of suggesting examples based on the pattern should also be able to take a labels keyword argument to filter the labels.

Yes, that sounds like a reasonable solution in the meantime. Alternatively, you could also write a small stream wrapper that only yields out the examples if "label" is in your list of labels. For example:

def filter_stream(stream, labels):
    for eg in stream:
        if eg['label'] in labels:
            yield eg

This approach would also let you incorporate any other custom rules if necessary.
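As a rough illustration of adding a custom rule, here is a variant of the wrapper run on a few made-up examples. The `min_chars` parameter and the example dicts are hypothetical and only stand in for whatever the recipe actually streams; this is a sketch, not part of Prodigy itself.

```python
def filter_stream(stream, labels, min_chars=0):
    """Yield only examples with a wanted label and a long enough text."""
    wanted = set(labels)
    for eg in stream:
        # .get() skips examples without a "label" key instead of raising KeyError
        if eg.get("label") not in wanted:
            continue
        # Hypothetical extra rule: drop very short texts
        if len(eg.get("text", "")) < min_chars:
            continue
        yield eg


# Made-up examples standing in for the tasks a recipe would stream
examples = [
    {"text": "Breaking news from the capital", "label": "LABEL_A"},
    {"text": "hello", "label": "LABEL_B"},
    {"text": "hello", "label": "LABEL_A"},
    {"text": "no label here"},
]
kept = list(filter_stream(examples, ["LABEL_A"], min_chars=10))
print([eg["text"] for eg in kept])  # ['Breaking news from the capital']
```

Because the wrapper is a generator, it filters lazily and can be chained after any other stream step.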

nuno · January 5, 2019, 11:03am

Thanks Ines. Could you help me out and tell me where and how I should call filter_stream within the textcat.teach recipe?

ines · January 5, 2019, 11:37am

Sure! To find the location of your Prodigy installation, you can run the following command:

python -c "import prodigy; print(prodigy.__file__)"

The textcat.teach recipe will be in recipes/textcat.py. Just before the recipe returns its components, you’ll find this line, which scores the stream and sorts it by the examples the model is most uncertain about:

stream = prefer_uncertain(predict(stream))

Below, you can then add your filter:

stream = prefer_uncertain(predict(stream))
stream = filter_stream(stream, ['LABEL_ONE', 'LABEL_TWO'])

The above approach will still get the pattern matches for all patterns and only filter out the ones you’re not interested in. So maybe your initial idea of filtering the patterns before they’re added actually makes more sense. The patterns argument of the teach function will be a list of the patterns loaded in from your patterns_file.jsonl. So to only use the patterns of a certain label, you could also add the following right before matcher = PatternMatcher:

patterns = [p for p in patterns if p["label"] in ["LABEL_ONE", "LABEL_TWO"]]
matcher = PatternMatcher(... # etc

nuno · January 7, 2019, 6:10pm

Thanks for the feedback. I tried the second option.

I implemented your suggestion in the following way by replacing ['LABEL_ONE', 'LABEL_TWO'] by label:

patterns = [p for p in patterns if p["label"] in label]

However, this gets me the following error:

    File "textcat2.py", line 63, in teach
      patterns = [p for p in patterns if p["label"] in label]
TypeError: 'PosixPath' object is not iterable

I thought label would be a list of the labels but I guess not and it’s not what I should use here. What should I use instead?

ines · January 9, 2019, 8:44pm

Omg, sorry, this was my mistake. The patterns argument is indeed the path to the patterns file, not actually the loaded patterns. So you first have to open the file, for example:

from prodigy.util import read_jsonl

patterns = read_jsonl(patterns)

nuno · January 10, 2019, 11:01am

Thanks, but this breaks in the next line because patterns is now a list and it’s trying to read it from disk:

    patterns = read_jsonl(patterns)
    patterns = [p for p in patterns if p["label"] in label]
    matcher = PatternMatcher(model.nlp, prior_correct=5.,
                             prior_incorrect=5., label_span=False,
                             label_task=True)
    matcher = matcher.from_disk(patterns)

Getting the following error:

File "cython_src/prodigy/models/matcher.pyx", line 187, in prodigy.models.matcher.PatternMatcher.from_disk
File "/usr/lib/python3.6/pathlib.py", line 1001, in __new__
  self = cls._from_parts(args, init=False)
File "/usr/lib/python3.6/pathlib.py", line 656, in _from_parts
  drv, root, parts = self._parse_args(args)
File "/usr/lib/python3.6/pathlib.py", line 640, in _parse_args
  a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not list

Could you give me a working example of how to add the patterns list to the matcher instead of reading from disk or another solution?

ines · January 10, 2019, 10:28pm

Okay, I think the first approach is just simpler – sorry for the confusion. You could make this work by creating a temporary file, but that’s probably overkill.
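For completeness, the temporary-file route could look roughly like the sketch below. It uses only the standard library; the helper name `write_filtered_patterns` is made up for illustration and is not a Prodigy function.

```python
import json
import tempfile
from pathlib import Path


def write_filtered_patterns(patterns_path, labels):
    """Copy the patterns whose label is in `labels` to a temporary JSONL file."""
    wanted = set(labels)
    with Path(patterns_path).open(encoding="utf8") as f_in:
        # Ignore blank lines so json.loads() only sees real entries
        lines = [line for line in f_in if line.strip()]
    kept = [line for line in lines if json.loads(line).get("label") in wanted]
    # delete=False keeps the file on disk after it is closed, so it can be
    # read again by path; remove it yourself once it is no longer needed
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf8"
    ) as f_out:
        for line in kept:
            f_out.write(line.rstrip("\n") + "\n")
    return Path(f_out.name)
```

Inside the recipe, the returned path would then take the place of the original patterns path in the existing matcher.from_disk call, which expects a path rather than a list.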

I think it’s much easier if you just filter the whole stream by label, like this:

def filter_stream(stream, labels):
    for eg in stream:
        if eg['label'] in labels:
            yield eg

# within the recipe
stream = prefer_uncertain(predict(stream))
stream = filter_stream(stream, ['LABEL_ONE', 'LABEL_TWO'])

nuno · January 11, 2019, 11:27am

Hi Ines. This solution is working! Thank you.
