Can't get phrase matching to work

Hello everyone. I'm very new to the tool and Python in general, so bear with me if I did something really stupid in the code :). I tried multiple things and none of them seem to work, so let me explain what I'm trying to do.

Currently I'm writing a custom recipe (mostly following the chatbot video tutorial) that uses spans_manual and text classification. Here's the code:

import spacy
import prodigy
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import JSONL, TXT
from prodigy.models.matcher import PatternMatcher, PhraseMatcher
import codecs
import json


@prodigy.recipe(
    "smart-prompt-training",
    dataset=("Dataset to save annotations into", "positional", None, str),
    lang=("Language to use", "positional", None, str),
    file_in=("Path to example prompt file", "positional", None, str),
    label_file=("Path to labels file", "positional", None, str),
    patterns_file=("Path to patterns file", "positional", None, str),
    intents_file=("Path to intents file", "positional", None, str)
)
def custom_recipe(dataset, lang, file_in, label_file, patterns_file, intents_file):
    with open(label_file, 'r') as file:
        span_labels = file.readlines()
    with open(intents_file, 'r') as file:
        intent_labels = file.readlines()

    def add_options(stream):
        for ex in stream:
            ex['options'] = [
                {"id": lab, "text": lab} for lab in intent_labels
            ]
            yield ex

    nlp = spacy.load(lang)
    stream = TXT(file_in)
    stream = add_tokens(nlp, stream, use_chars=None)
    stream = add_options(stream)

    def write_without_bom(file_path):
        BOM = codecs.BOM_UTF8.decode('utf-8')
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        if text.startswith(BOM):
            text = text[len(BOM):]

        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(text)

    if patterns_file is not None:
        write_without_bom(patterns_file)
        pattern_matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
        pattern_matcher = pattern_matcher.from_disk(patterns_file)
        stream = (eg for _, eg in pattern_matcher(stream))

    blocks = [
        {"view_id": "spans_manual"},
        {"view_id": "choice", "text": None}
    ]

    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "lang": nlp.lang,
            "labels": span_labels,
            "blocks": blocks,
            "choice_style": "single"
        }
    }

I'm exporting my labels (entities) and choices (intents) from somewhere else and placing them in files. My patterns file currently looks like this:

{"label":"Management","pattern":"littell properties"}
{"label":"Management","pattern":"pallas realty advisors"}
{"label":"Management","pattern":"abbey residential"}

So, what I want to achieve is to add a phrase matcher to the custom recipe that matches phrases against the patterns provided in the file. If I have a line "Give me ... managed by abbey residential", I'd like Prodigy to automatically highlight "abbey residential" and match it against the "Management" label.

EDIT: I am using the following command line to run it:

python -m prodigy smart-prompt-training smart-prompts en_core_web_sm prompt_data.txt entities.txt patterns.jsonl intents.txt -F smartRecipe.py

EDIT2: Just to clarify, since I did not mention it above: I am currently using PatternMatcher in the code, and I tried multiple ways to replace it with PhraseMatcher but couldn't get any of them to work.

Thanks for being awesome!

Hi there!

There's nothing stupid about mistakes in code; I prefer to look at them as happy accidents.

Could you link the video that you're referring to? We have a bunch of videos at this point :sweat_smile:.

The PatternMatcher is explained in more detail in the Prodigy docs, but in short, it wraps spaCy's Matcher and PhraseMatcher into a single object. So you should already be able to match against phrase patterns that re-use the tokeniser in the nlp pipeline.
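
For example, re-using the label and one of the phrases from your pattern file, both of these lines are valid entries in the same patterns file: the first is a string pattern handled by the PhraseMatcher, the second a token-based pattern handled by the Matcher (just an illustrative sketch, not something you need to change):

{"label":"Management","pattern":"abbey residential"}
{"label":"Management","pattern":[{"lower":"abbey"},{"lower":"residential"}]}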

Is there a specific pattern that it doesn't catch but that you hoped it would? If so, could you share a reproducible example of a pattern line together with a text that it should match but doesn't? Then I can try to reproduce it locally.
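
As a quick sanity check outside of the recipe, something along these lines should print the spans the matcher finds (a rough sketch; the example sentence and the patterns.jsonl path are assumptions on my end):

import spacy
from prodigy.models.matcher import PatternMatcher

nlp = spacy.blank("en")
# Load the patterns into the combined Matcher/PhraseMatcher wrapper
matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True).from_disk("patterns.jsonl")
stream = [{"text": "show me all properties managed by abbey residential"}]
for _, eg in matcher(stream):
    print(eg.get("spans"))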

Hey hey! Thanks for the reply.

Currently it does not catch the pattern, correct. My patterns file is massive, so I will provide one line for each file I use, and you can try to reproduce it with the recipe I've given above:

prompt_data.txt

search for all properties managed by parawest management

entities.txt

Management
NumUnits

patterns.jsonl

{"label":"Management","pattern":"parawest management"}

intents.txt

Search
Export

So I'm expecting the matcher to match "parawest management" and assign the Management label to it.
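
In other words, I'd expect the example that reaches the UI to look roughly like this (I wrote the character offsets out by hand, so treat them as illustrative):

{"text": "search for all properties managed by parawest management", "spans": [{"start": 37, "end": 56, "label": "Management"}]}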

EDIT: Video Link - this is the video I used as a base for the recipe. I copied the pattern-matching code from the ner.manual recipe.

I think the issue was that you called add_tokens before using the PatternMatcher. I can understand that this is confusing, but add_tokens should be called after. It's related to the fact that, technically, the PatternMatcher can detect patterns on text that doesn't have any token information yet. The add_tokens function then makes sure that the appropriate token information is added to the text, and also to the spans the matcher created.
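
In other words, the order of the stream operations should roughly be (a minimal sketch using the variable names from your recipe):

stream = TXT(file_in)
stream = add_options(stream)
stream = (eg for _, eg in pattern_matcher(stream))  # match on the raw text first
stream = add_tokens(nlp, stream, use_chars=None)    # then add token info to the text and spans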

I rewrote your recipe a bit to get it all working. I also updated some of the code to use more modern Python tools (most notably pathlib).

from pathlib import Path 
import spacy
import prodigy
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import JSONL
from prodigy.models.matcher import PatternMatcher
from prodigy.util import msg


@prodigy.recipe(
    "smart-prompt-training",
    dataset=("Dataset to save annotations into", "positional", None, str),
    lang=("Language to use", "positional", None, str),
    examples_path=("Path to example prompt file", "positional", None, str),
    span_labels_file=("Path to labels file", "positional", None, str),
    intent_labels_file=("Path to intents file", "positional", None, str),
    patterns_file=("Path to patterns file", "positional", None, str),
)
def custom_recipe(dataset, lang, examples_path, span_labels_file, intent_labels_file, patterns_file):
    span_labels = Path(span_labels_file).read_text().split("\n")
    intent_labels = Path(intent_labels_file).read_text().split("\n")
    msg.info(f"Using span labels: {span_labels}")
    msg.info(f"Using intent labels: {intent_labels}")

    def add_options(stream):
        for ex in stream:
            ex['options'] = [
                {"id": lab, "text": lab} for lab in intent_labels
            ]
            yield ex
    
    nlp = spacy.blank(lang)

    stream = JSONL(examples_path)
    stream = add_options(stream)

    # Load the patterns and run the matcher on the raw text first ...
    pattern_matcher = PatternMatcher(nlp, allow_overlap=True, combine_matches=True).from_disk(patterns_file)
    stream = (eg for _, eg in pattern_matcher(stream))
    # ... and only then add token information to the texts and their spans
    stream = add_tokens(nlp, stream, use_chars=None)

    blocks = [
        {"view_id": "spans_manual"},
        {"view_id": "choice", "text": None}
    ]

    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "lang": nlp.lang,
            "labels": span_labels,
            "blocks": blocks,
            "choice_style": "single"
        }
    }

These are the files on my disk that I use, with the following contents.

spans.txt

MANAGEMENT
NUMUNITS

intents.txt

SEARCH
EXPORT

examples.jsonl

{"text": "search for all properties managed by parawest management"}
{"text": "parawest management si what id like to know more about yo"}

patterns.jsonl

{"label":"MANAGEMENT","pattern":"parawest management"}

When I then call it all via:

python -m prodigy smart-prompt-training issue-6628 en examples.jsonl spans.txt intents.txt patterns.jsonl -F recipe.py

Then I see this interface, ready for use.