PatternMatcher with label_task and all_examples gives error

When using PatternMatcher for a text classification task (so label_task is True, but label_span is False), if all_examples is set to True, it gives the following error:

Traceback (most recent call last):
  File "C:\Users\Roland\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Roland\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\Work\staa\.venv\lib\site-packages\prodigy\__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "D:\Work\staa\.venv\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "D:\Work\staa\.venv\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "D:\Work\staa\prodigy_models\.\min_pattern_bug.py", line 38, in manual
    stream = list(matcher(stream))
  File "cython_src\prodigy\models\matcher.pyx", line 215, in __call__
IndexError: list index out of range

If label_task is set to False or all_examples is set to False, it works.

An example recipe to reproduce the issue can be found here: Reproduction for PatternMatcher bug · GitHub

I am using prodigy==1.11.7

Hi Roland,

The error seems to occur on this line:

stream = list(matcher(stream))

Just to check before diving deeper, are you 100% sure that the stream isn't empty?

If that's not it, could you share a single pattern from your original pattern file so that we may reproduce the bug locally?

Yes, the stream is not empty.

I am running the following command:

python -m prodigy textcat.manual_patterns de_languages test_data.jsonl blank:en --label "Label1,Label2" --patterns test_patterns.jsonl -F min_pattern_bug.py

test_data.jsonl:

{"text":  "no match"}
{"text":  "something went wrong"}
{"text":  "test this thing"}

test_patterns.jsonl:

{"pattern": "test", "label": "Label1"}
{"pattern": "something", "label": "Label2"}

I may have found a bug in Prodigy, but I may also have found an issue with your code. In particular, the PatternMatcher outputs a sequence of (score, example) tuples. The "stream" key in the recipe expects a sequence of examples instead. You can use a sorter, or write your own filter to circumvent this.
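To make the mismatch concrete, here is a minimal sketch of what "write your own filter" can look like: a tiny generator that unwraps the `(score, example)` tuples into plain example dicts. Dummy data stands in for a real Prodigy stream here, so this runs without Prodigy installed.

```python
# Sketch: the PatternMatcher yields (score, example) tuples, but the
# recipe's "stream" key expects plain example dicts. A small generator
# can unwrap them. The dummy data below stands in for a real stream.

def unwrap(scored_stream):
    for score, example in scored_stream:
        yield example

scored = [
    (0.9, {"text": "test this thing"}),
    (0.5, {"text": "something went wrong"}),
]
print(list(unwrap(scored)))
# [{'text': 'test this thing'}, {'text': 'something went wrong'}]
```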

Here's my variant of your recipe.

from typing import Union, Iterable, Optional, List

import spacy
from prodigy import recipe, get_stream
from prodigy.models.matcher import PatternMatcher
from prodigy.types import RecipeSettingsType
from prodigy.util import get_labels


@recipe(
    "textcat.manual_patterns",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline or blank:lang (e.g. blank:en)", "positional", None, str),
    labels=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    patterns=("Path to match patterns file", "option", "pt", str),
)
def manual(
    dataset: str,
    source: Union[str, Iterable[dict]],
    spacy_model: str,
    labels: Optional[List[str]] = None,
    patterns: Optional[str] = None,
) -> RecipeSettingsType:
    stream = get_stream(source, rehash=True, dedup=True, input_key="text")
    nlp = spacy.load(spacy_model)

    matcher = PatternMatcher(
        nlp,
        label_span=False,
        label_task=False,
        combine_matches=True,
        all_examples=True
    )

    matcher = matcher.from_disk(patterns)
    
    # You can print this stream. It's a sequence of (score, example) tuples.
    stream = list(matcher(stream))

    def filter_stream(stream):
        """Keep only examples that have at least one matched span."""
        for score, example in stream:
            if example.get("spans"):
                yield example

    def add_options(stream):
        """We also need to have options for the `choice` view_id"""
        for example in stream:
            example["options"] = [{"label": _, "id": _} for _ in ["a", "b", "c"]]
            yield example

    return {
        "view_id": "choice",
        "dataset": dataset,
        "stream": add_options(filter_stream(stream)),
        "config": {
            "labels": labels,
            "choice_style": "multiple",
            "choice_auto_accept": False,
            "exclude_by": "task",
            "auto_count_stream": True,
        },
    }

This works when I set label_task=False in the PatternMatcher.

Related to this setting: I think you may indeed have found a bug in Prodigy with your original recipe. I'll need to check with my colleagues what the expected behaviour should be. The issue arises when no patterns apply to the current example: because you've configured the matcher to return all examples, Prodigy still needs to assign an appropriate label to that example, and it's not obvious what the best behaviour is in that situation. I'll come back later to report on that.

In the meantime, does this recipe work for you?

The recipe was just a slimmed-down version of what I was actually using, so I'm okay with the recipe.

What I wanted to achieve is to label all the documents, and if the PatternMatcher finds any matches, they should be applied to those documents. That's what I was expecting from the combination of label_task=True and all_examples=True. Is there another way to do this?

What would be the appropriate behaviour if two matches with different labels match?

As a side note, when you look at the filter function that I wrote:

    def filter_stream(stream):
        """Keep only examples that have at least one matched span."""
        for score, example in stream:
            if example.get("spans"):
                yield example

Such a filter function is very flexible: not only can it filter out data, it can also add metadata to the example dictionaries. On my personal projects, however, I usually prefer to do all of this upfront. I typically have a Jupyter notebook with a Python snippet that creates an interesting subset (usually called `interesting-subset.jsonl`) that I then feed to Prodigy from the command line. This approach is very flexible and can make it easier to iterate on annotation approaches early in a project. I'm mentioning it because it might be an approach that works well for you at this point.
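As an illustration, such an upfront snippet might look like the sketch below. It uses only the standard library; the inline data and the filter condition are just placeholders for whatever you care about.

```python
import json

# Sketch: pre-filter raw data into a subset file that you then feed to
# Prodigy from the command line. The data and the condition below are
# placeholders for your own.
raw = [
    {"text": "no match"},
    {"text": "something went wrong"},
    {"text": "test this thing"},
]

with open("interesting-subset.jsonl", "w", encoding="utf8") as f:
    for example in raw:
        if "something" in example["text"]:  # any condition you care about
            f.write(json.dumps(example) + "\n")
```

You'd then point Prodigy at `interesting-subset.jsonl` as the source argument.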

Hmmmm. I guess it depends on the type of task. If it's multilabel, then all matched labels would be applied. If it's not multilabel, I'd want to surface that there are multiple matches, mostly so that I can adjust the patterns.
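That policy can be made concrete with a small helper: collect the labels of all matches on an example and, in the single-label case, flag a conflict when the patterns disagree. This is an illustrative sketch, not part of the Prodigy API; the function name and return shape are made up.

```python
# Sketch: decide what to do with the labels of all pattern matches on
# one example, for multilabel vs. single-label tasks. Illustrative
# only; not a Prodigy API.

def resolve_labels(matched_labels, multilabel):
    labels = sorted(set(matched_labels))
    if multilabel or len(labels) <= 1:
        return {"accept": labels, "conflict": False}
    # Single-label task with disagreeing patterns: surface the
    # conflict so the patterns can be inspected and adjusted.
    return {"accept": [], "conflict": True}

print(resolve_labels(["Label1", "Label2"], multilabel=True))
# {'accept': ['Label1', 'Label2'], 'conflict': False}
```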