Can't get phrase matching to work

Hello everyone. I'm very new to the tool and Python in general, so bear with me if I did something really stupid in the code :). I tried multiple things and none of them seem to work, so let me explain what I'm trying to do.

Currently I'm writing a custom recipe (mostly following the chatbot video tutorial) that uses spans_manual and text classification. Here's the code:

import spacy
import prodigy
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import JSONL, TXT
from prodigy.models.matcher import PatternMatcher, PhraseMatcher
import codecs
import json


@prodigy.recipe(
    "smart-prompt-training",
    dataset=("Dataset to save annotations into", "positional", None, str),
    lang=("Language to use", "positional", None, str),
    file_in=("Path to example prompt file", "positional", None, str),
    label_file=("Path to labels file", "positional", None, str),
    patterns_file=("Path to patterns file", "positional", None, str),
    intents_file=("Path to intents file", "positional", None, str)
)
def custom_recipe(dataset, lang, file_in, label_file, patterns_file, intents_file):
    with open(label_file, 'r') as file:
        span_labels = file.readlines()
    with open(intents_file, 'r') as file:
        intent_labels = file.readlines()

    def add_options(stream):
        for ex in stream:
            ex['options'] = [
                {"id": lab, "text": lab} for lab in intent_labels
            ]
            yield ex

    nlp = spacy.load(lang)
    stream = TXT(file_in)
    stream = add_tokens(nlp, stream, use_chars=None)
    stream = add_options(stream)

    def write_without_bom(file_path):
        BOM = codecs.BOM_UTF8.decode('utf-8')
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        if text.startswith(BOM):
            text = text[len(BOM):]

        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(text)

    if patterns_file is not None:
        write_without_bom(patterns_file)
        pattern_matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
        pattern_matcher = pattern_matcher.from_disk(patterns_file)
        stream = (eg for _, eg in pattern_matcher(stream))

    blocks = [
        {"view_id": "spans_manual"},
        {"view_id": "choice", "text": None}
    ]

    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "lang": nlp.lang,
            "labels": span_labels,
            "blocks": blocks,
            "choice_style": "single"
        }
    }

I'm exporting my labels (entities) and choices (intents) from somewhere else and placing them in files. My patterns file currently looks like this:

{"label":"Management","pattern":"littell properties"}
{"label":"Management","pattern":"pallas realty advisors"}
{"label":"Management","pattern":"abbey residential"}

So, what I want to achieve is to add a phrase matcher to the custom recipe that matches phrases against the patterns provided in the file. If I have a line "Give me ... managed by abbey residential", I'd like Prodigy to automatically highlight "abbey residential" and match it against the "Management" label.

EDIT: I am using the following command line to run it:

python -m prodigy smart-prompt-training smart-prompts en_core_web_sm prompt_data.txt entities.txt patterns.jsonl intents.txt -F smartRecipe.py

EDIT2: Just to clarify, since I did not mention it above: I am currently using PatternMatcher in the code, and I tried multiple ways to replace it with PhraseMatcher but couldn't get any of them to work.

Thanks for being awesome!

Hi there!

There's nothing stupid about mistakes in code; I prefer to look at them as happy accidents.

Could you link the video that you're referring to? We have a bunch of videos at this point :sweat_smile:.

The PatternMatcher is explained in more detail in the Prodigy docs, but in short, it wraps spaCy's Matcher and PhraseMatcher into a single object. So you should already be able to match against phrase patterns that re-use the tokeniser in the nlp pipeline.
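
For example, re-using the label and one of the phrases from your pattern file, both of these lines are valid entries in the same patterns file: the first is a string pattern handled by the PhraseMatcher, the second a token-based pattern handled by the Matcher (just an illustrative sketch, not something you need to change):

{"label":"Management","pattern":"abbey residential"}
{"label":"Management","pattern":[{"lower":"abbey"},{"lower":"residential"}]}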

Is there a specific pattern that it doesn't catch but that you hoped it would? If so, could you share a reproducible example of a pattern line together with a text that it should match but doesn't? Then I can try to reproduce it locally.
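
As a quick sanity check outside of the recipe, something along these lines should print the spans the matcher finds (a rough sketch; the example sentence and the patterns.jsonl path are assumptions on my end):

import spacy
from prodigy.models.matcher import PatternMatcher

nlp = spacy.blank("en")
# Load the patterns into the combined Matcher/PhraseMatcher wrapper
matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True).from_disk("patterns.jsonl")
stream = [{"text": "show me all properties managed by abbey residential"}]
for _, eg in matcher(stream):
    print(eg.get("spans"))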

Hey hey! Thanks for the reply.

Currently it does not catch the pattern, correct. My patterns file is massive, so I will provide one line for each file I use, and you can try to reproduce it with the recipe I've given above:

prompt_data.txt

search for all properties managed by parawest management

entities.txt

Management
NumUnits

patterns.jsonl

{"label":"Management","pattern":"parawest management"}

intents.txt

Search
Export

So I'm expecting the matcher to match "parawest management" and assign the Management label to it.
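
In other words, I'd expect the example that reaches the UI to look roughly like this (I wrote the character offsets out by hand, so treat them as illustrative):

{"text": "search for all properties managed by parawest management", "spans": [{"start": 37, "end": 56, "label": "Management"}]}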

EDIT: Video Link - this is the video I used as a base for the recipe. I copied the pattern-matching code from the ner.manual recipe.

I think the issue was that you called add_tokens before using the PatternMatcher. I can understand that this is confusing, but add_tokens should be called after. It's related to the fact that, technically, the PatternMatcher can detect patterns on text that doesn't have any token information yet. The add_tokens function then makes sure that the appropriate token information is added to the text, and also to the spans the matcher created.
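
In other words, the order of the stream operations should roughly be (a minimal sketch using the variable names from your recipe):

stream = TXT(file_in)
stream = add_options(stream)
stream = (eg for _, eg in pattern_matcher(stream))  # match on the raw text first
stream = add_tokens(nlp, stream, use_chars=None)    # then add token info to the text and spans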

I rewrote your recipe a bit to get it all working. I also updated some of the code to use more modern Python tools (most notably pathlib).

from pathlib import Path 
import spacy
import prodigy
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import JSONL
from prodigy.models.matcher import PatternMatcher
from prodigy.util import msg


@prodigy.recipe(
    "smart-prompt-training",
    dataset=("Dataset to save annotations into", "positional", None, str),
    lang=("Language to use", "positional", None, str),
    examples_path=("Path to example prompt file", "positional", None, str),
    span_labels_file=("Path to labels file", "positional", None, str),
    intent_labels_file=("Path to intents file", "positional", None, str),
    patterns_file=("Path to patterns file", "positional", None, str),
)
def custom_recipe(dataset, lang, examples_path, span_labels_file, intent_labels_file, patterns_file):
    span_labels = Path(span_labels_file).read_text().split("\n")
    intent_labels = Path(intent_labels_file).read_text().split("\n")
    msg.info(f"Using span labels: {span_labels}")
    msg.info(f"Using intent labels: {intent_labels}")

    def add_options(stream):
        for ex in stream:
            ex['options'] = [
                {"id": lab, "text": lab} for lab in intent_labels
            ]
            yield ex
    
    nlp = spacy.blank(lang)

    stream = JSONL(examples_path)
    stream = add_options(stream)

    # Load the patterns and run the matcher on the raw text first ...
    pattern_matcher = PatternMatcher(nlp, allow_overlap=True, combine_matches=True).from_disk(patterns_file)
    stream = (eg for _, eg in pattern_matcher(stream))
    # ... and only then add token information to the texts and their spans
    stream = add_tokens(nlp, stream, use_chars=None)

    blocks = [
        {"view_id": "spans_manual"},
        {"view_id": "choice", "text": None}
    ]

    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "lang": nlp.lang,
            "labels": span_labels,
            "blocks": blocks,
            "choice_style": "single"
        }
    }

These are the files on my disk that I use, with the following contents.

spans.txt

MANAGEMENT
NUMUNITS

intents.txt

SEARCH
EXPORT

examples.jsonl

{"text": "search for all properties managed by parawest management"}
{"text": "parawest management si what id like to know more about yo"}

patterns.jsonl

{"label":"MANAGEMENT","pattern":"parawest management"}

When I then call it all via:

python -m prodigy smart-prompt-training issue-6628 en examples.jsonl spans.txt intents.txt patterns.jsonl -F recipe.py

Then I see this interface, ready for use.