Pattern matching feature request

Feature Request

Say you’re building seed patterns to detect the names of drugs like you do in the named entity video tutorial. In that tutorial your seeds are all examples of drugs you’d like to match, but there are other a priori textual cues that something might be a drug. For instance, in the sentence “X will get you really high”, X is likely a drug, regardless of what surface form X takes.

In addition to the current patterns, I’d like to be able to write seed patterns equivalent to the following regular expression:

(\w+) will get you really high

where ner.teach suggests whatever is matched by (\w+) as a candidate named entity.
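As a rough illustration of the behaviour being requested, here is a minimal sketch using only Python's standard `re` module. The helper name `find_candidates` and the sample sentence are made up for the example; nothing here is part of Prodigy.

```python
import re

# Context pattern from the post: group 1 marks the candidate entity.
CONTEXT = re.compile(r"(\w+) will get you really high")

def find_candidates(text):
    """Return (start, end, surface) character offsets for each group-1 match."""
    return [(m.start(1), m.end(1), m.group(1)) for m in CONTEXT.finditer(text)]

candidates = find_candidates("They say fentanyl will get you really high.")
```

Each tuple carries the character offsets of the captured word, which is the information a suggested entity span would need.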

Alternative to the Feature Request

Maybe you don’t want to go down the route of making the pattern matching DSL feature-rich because that’s not the core of your product. In that case is there a more “manual” way of creating seeds from more complicated patterns? Maybe instead of passing in a corpus plus a set of seed patterns, I pass in a corpus of parsed documents with candidate named entities already annotated. That way I could write really sophisticated pattern matching in code if I felt like I needed it, without Prodigy having to support that sophistication in its pattern matching DSL.

Note that this open-ended pattern matching support could cover more than just regular expression groupings. For example, in my current project I am using spaCy to extract entities using a combination of patterns and logic that I write in code myself. The logic I write takes into account various kinds of contextual features, position in document, etc. It’s way too bespoke and brittle to be the ultimate solution, but is probably a good starting point for annotation.
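A sketch of what such a pre-annotated corpus could look like, assuming tasks are dictionaries with a "text" and a list of "spans" holding character offsets and a label. The function name `make_tasks`, the label and the sample texts are hypothetical, and whether a given recipe accepts this input as-is is an assumption.

```python
import json
import re

# Hypothetical context rule; any custom logic could produce the spans instead.
CONTEXT = re.compile(r"(\w+) will get you really high")

def make_tasks(texts, label="DRUG"):
    """Yield one task per text that contains at least one candidate span."""
    for text in texts:
        spans = [
            {"start": m.start(1), "end": m.end(1), "label": label}
            for m in CONTEXT.finditer(text)
        ]
        if spans:
            yield {"text": text, "spans": spans}

texts = ["Fentanyl will get you really high.", "Nothing to see here."]
tasks = list(make_tasks(texts))

# Newline-delimited JSON, one pre-annotated task per line.
jsonl = "\n".join(json.dumps(task) for task in tasks)
```

The regular expression is only a stand-in here: the same dictionaries could be produced by any bespoke extraction code.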

If you have a look at the ner.teach recipe, you should be able to see that there’s a spaCy Matcher object being used to drive the patterns part. This makes it pretty easy to implement the logic you’re interested in. Have a look at the matcher docs here:

We think it’s generally better to encourage customisation and scripting of the tool in the recipe function, instead of trying to move the logic into lots of different data formats. One of the ideas in Prodigy is that you already know a good scripting language — Python.
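For the "X will get you really high" case, a token-based version with spaCy's Matcher might look roughly like the sketch below. It assumes spaCy v3 (where `Matcher.add` takes a list of patterns) and uses a blank English pipeline so no trained model is needed. The key `DRUG_CONTEXT`, the helper `candidate_spans` and the choice to keep only the first matched token are illustrative, not what the ner.teach recipe does.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; assumption: spaCy v3 is installed
matcher = Matcher(nlp.vocab)

# First token is the wildcard "X"; the rest is the fixed context.
pattern = [
    {"IS_ALPHA": True},
    {"LOWER": "will"},
    {"LOWER": "get"},
    {"LOWER": "you"},
    {"LOWER": "really"},
    {"LOWER": "high"},
]
matcher.add("DRUG_CONTEXT", [pattern])

def candidate_spans(text, label="DRUG"):
    """Return character-offset spans for the wildcard token of each match."""
    doc = nlp(text)
    spans = []
    for _, start, _ in matcher(doc):
        token = doc[start]  # only the first token of the match is the candidate
        spans.append({"start": token.idx, "end": token.idx + len(token), "label": label})
    return spans

spans = candidate_spans("Fentanyl will get you really high.")
```

In a custom recipe, logic like this would run over the stream and attach the resulting spans to each task before it is sent out for annotation.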

Makes sense.

Is there documentation for PatternMatcher anywhere? I don’t see it in the online docs or PRODIGY_README.html and I can’t read the source.

I’m figuring it out by reading the recipe code and playing with it in the REPL, but I’m wondering if I’m overlooking some documentation.

Sorry, I think the detailed API docs are currently missing because the PatternMatcher was only added recently, but we’ll update the docs for the next release. In the meantime, I’m just putting them here. (@honnibal Feel free to add to this in case I forgot something!)

METHOD PatternMatcher.__init__

Create a new pattern matcher.

Argument   Type             Description
nlp        Language         The nlp object with a loaded spaCy model.
RETURNS    PatternMatcher   The pattern matcher.

METHOD PatternMatcher.__call__

Match patterns on a stream of tasks.

Argument   Type       Description
stream     iterable   The stream of annotation examples, i.e. dictionaries.
YIELDS     tuple      (score, task) tuples. Tasks include a "spans" property with the matched text span as well as a "label", set by the pattern. The "meta" includes the score and the ID of the matched pattern.

Given a pattern like this:

{"label": "DRUG", "pattern": [{"lower": "fentanyl"}]}

… a task could look like this:

{
    "text": "fentanyl is dangerous",
    "spans": [{
        "start": 0,
        "end": 8,
        "label": "DRUG",
        "score": 0.9,
        "priority": 0.9,
        "pattern": 1
    }],
    "meta": {"score": 0.9, "pattern": 1}
}

METHOD PatternMatcher.add_patterns

Add patterns to the pattern matcher.

Argument   Type   Description
patterns   list   The patterns to add.

METHOD PatternMatcher.add_matcher

Add a new Matcher or PhraseMatcher to the pattern matcher model.

Argument   Type                      Description
matcher    Matcher / PhraseMatcher   The matcher to add.

METHOD PatternMatcher.update

Update the pattern matcher model with annotations.

Argument     Type    Description
examples     list    A list of dictionaries of examples.
drop         float   The dropout rate, defaults to 0.
batch_size   int     The batch size, defaults to 8.

METHOD PatternMatcher.from_disk

Load in a list of patterns from a file and add them to the pattern matcher. The file should be newline-delimited JSON (JSONL) with one entry per line. See the patterns file documentation for details.

Argument   Type              Description
path       unicode or Path   Path to the patterns file.
RETURNS    PatternMatcher    The pattern matcher with the loaded patterns.
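To make the file format concrete, here is a small sketch that writes and re-reads a patterns file as newline-delimited JSON using only the standard library. The entries follow the pattern shape shown earlier in this post. The second pattern, the file name `drug_patterns.jsonl` and the helpers `write_patterns` and `read_patterns` are made up for illustration and are not part of Prodigy.

```python
import json
import tempfile
from pathlib import Path

# One dictionary per pattern, in the same shape as the example above.
patterns = [
    {"label": "DRUG", "pattern": [{"lower": "fentanyl"}]},
    {"label": "DRUG", "pattern": [{"lower": "crystal"}, {"lower": "meth"}]},
]

def write_patterns(path, entries):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

def read_patterns(path):
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]

patterns_path = Path(tempfile.mkdtemp()) / "drug_patterns.jsonl"
write_patterns(patterns_path, patterns)
loaded = read_patterns(patterns_path)
```

A file written this way is the kind of path that would then be handed to from_disk.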

We’re also planning to add a has_label method to keep the PatternMatcher consistent with the other models. This hasn’t been relevant so far in the existing recipes, but might be nice in the future so you can do matcher.has_label('DRUG').

Passing every word from a large dictionary through nlp() before adding it to the PhraseMatcher takes a lot of time. I think it would be better to do this once, save the model to disk and reuse it. Even nlp.pipe() takes a long time for a large dictionary added to the PhraseMatcher (when the generator is iterated over). Is there a faster way to add phrases from large dictionaries?

@sandeep118 I think the problem is that nlp.pipe() is applying the whole pipeline. Is it faster if you do docs = (nlp.make_doc(text) for text in texts)?
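Spelled out, the suggestion could look like the sketch below. It assumes spaCy v3, where `PhraseMatcher.add` takes a list of Doc objects, and uses a blank English pipeline. The term list, the "DRUG" key and the sample sentence are placeholders.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: spaCy v3; any loaded pipeline works the same way
terms = ["fentanyl", "crystal meth", "ketamine"]  # stand-in for a large dictionary

matcher = PhraseMatcher(nlp.vocab)
# make_doc only tokenizes, so the tagger, parser and NER are never run on the terms.
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("DRUG", patterns)

doc = nlp.make_doc("He was caught selling crystal meth and ketamine.")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
```

By default the PhraseMatcher compares exact token text, so tokenization alone is enough to build the patterns.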

Right on point. That helped. Thanks a ton.