When I add this --patterns argument, Prodigy starts suggesting other labels for me to accept/reject. I do have patterns for other labels in the patterns_file.jsonl file, but I was expecting that only LABEL_A patterns would be matched. Is this expected behaviour? If so, is the solution to alter the recipe to filter the patterns file when loading it?
That would probably make sense for consistency, yes. Or, more specifically, the PatternMatcher that takes care of suggesting examples based on the pattern should also be able to take a labels keyword argument to filter the labels.
Yes, that sounds like a reasonable solution in the meantime. Alternatively, you could also write a small stream wrapper that only yields out the examples if "label" is in your list of labels. For example:
def filter_stream(stream, labels):
    for eg in stream:
        if eg['label'] in labels:
            yield eg
This approach would also let you incorporate any other custom rules if necessary
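To make the "other custom rules" idea concrete, here is a minimal sketch of the same wrapper with an optional extra check. The `extra_rule` parameter, the `long_enough` helper and the toy examples are made up for illustration and are not part of Prodigy:

```python
def filter_stream(stream, labels, extra_rule=None):
    """Yield only examples whose "label" is in labels and that pass extra_rule."""
    for eg in stream:
        if eg.get("label") not in labels:
            continue  # skip suggestions for labels we are not annotating
        if extra_rule is not None and not extra_rule(eg):
            continue  # skip examples rejected by the (hypothetical) custom rule
        yield eg


def long_enough(eg):
    # Hypothetical custom rule: ignore very short texts.
    return len(eg["text"]) > 10


# Toy stream standing in for the tasks the recipe would produce.
examples = [
    {"text": "first example text", "label": "LABEL_ONE"},
    {"text": "short", "label": "LABEL_TWO"},
    {"text": "another example", "label": "LABEL_THREE"},
]

filtered = list(
    filter_stream(examples, ["LABEL_ONE", "LABEL_TWO"], extra_rule=long_enough)
)
```

Since this is a generator, it can wrap the stream lazily without loading everything into memory first.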
Sure! To find the location of your Prodigy installation, you can run the following command:
python -c "import prodigy; print(prodigy.__file__)"
The textcat.teach recipe will be in recipes/textcat.py. Just before the recipe returns its components, you’ll find this line, which scores the stream and sorts it by the examples the model is most uncertain about:
stream = prefer_uncertain(predict(stream))
The above approach will still get the pattern matches for all patterns and only filter out the ones you’re not interested in. So maybe your initial idea of filtering the patterns before they’re added actually makes more sense. The patterns argument of the teach function will be a list of the patterns loaded in from your patterns_file.jsonl. So to only use the patterns of a certain label, you could also add the following right before matcher = PatternMatcher:
patterns = [p for p in patterns if p["label"] in ["LABEL_ONE", "LABEL_TWO"]]
matcher = PatternMatcher(... # etc
Omg, sorry, this was my mistake! The patterns argument is indeed the path to the patterns file, not the loaded patterns. So you first have to open the file, for example:
from prodigy.util import read_jsonl
patterns = read_jsonl(patterns)
Thanks, but this breaks in the next line because patterns is now a list and it’s trying to read it from disk:
patterns = read_jsonl(patterns)
patterns = [p for p in patterns if p["label"] in label]
matcher = PatternMatcher(model.nlp, prior_correct=5.,
                         prior_incorrect=5., label_span=False,
                         label_task=True)
matcher = matcher.from_disk(patterns)
Getting the following error:
  File "cython_src/prodigy/models/matcher.pyx", line 187, in prodigy.models.matcher.PatternMatcher.from_disk
  File "/usr/lib/python3.6/pathlib.py", line 1001, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/lib/python3.6/pathlib.py", line 656, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.6/pathlib.py", line 640, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not list
Could you give me a working example of how to add the patterns list to the matcher instead of reading from disk or another solution?
Okay, I think the first approach is just simpler – sorry for the confusion. You could make this work by creating a temporary file, but that’s probably overkill.
I think it’s much easier if you just filter the whole stream by label, like this:
def filter_stream(stream, labels):
    for eg in stream:
        if eg['label'] in labels:
            yield eg

# within the recipe
stream = prefer_uncertain(predict(stream))
stream = filter_stream(stream, ['LABEL_ONE', 'LABEL_TWO'])
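For completeness, the temporary-file route mentioned above could look roughly like the sketch below. It uses only the standard library; the helper name `write_filtered_patterns` and the sample patterns are assumptions for illustration, and I haven't verified this against Prodigy itself:

```python
import json
import tempfile
from pathlib import Path


def write_filtered_patterns(patterns_path, labels):
    """Copy only the patterns whose label is in `labels` to a temporary JSONL file."""
    wanted = set(labels)
    with open(patterns_path, encoding="utf8") as f:
        patterns = [json.loads(line) for line in f if line.strip()]
    kept = [p for p in patterns if p.get("label") in wanted]
    # delete=False keeps the file on disk after closing; the caller removes it.
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf8"
    )
    with tmp:
        for p in kept:
            tmp.write(json.dumps(p) + "\n")
    return Path(tmp.name)


# Demo with a throwaway patterns file (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "patterns_file.jsonl"
    source.write_text(
        '{"label": "LABEL_A", "pattern": [{"lower": "foo"}]}\n'
        '{"label": "LABEL_B", "pattern": [{"lower": "bar"}]}\n',
        encoding="utf8",
    )
    filtered_path = write_filtered_patterns(source, ["LABEL_A"])

kept_lines = filtered_path.read_text(encoding="utf8").splitlines()
filtered_path.unlink()  # clean up the temporary file
```

Judging by the traceback, from_disk hands its argument to pathlib, so inside the recipe the returned path (rather than a list) is presumably what you would pass to matcher.from_disk. Filtering the stream is still less code.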