Say you’re building seed patterns to detect the names of drugs, as in the named entity video tutorial. In that tutorial, your seeds are all examples of drugs you’d like to match, but there are other a priori textual cues that something might be a drug. For instance, in the sentence “X will get you really high”, X is likely a drug, regardless of what surface form X takes.
In addition to the current patterns, I’d like to be able to write seed patterns equivalent to the following regular expression:

`(\w+) will get you really high`

where ner.teach suggests whatever is matched by `(\w+)` as a candidate named entity.
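As a sketch of the intended semantics, here's how the equivalent extraction could be done with Python's `re` module (the cue phrase is just the example from above):

```python
import re

# The capture group is the candidate entity; the rest of the
# pattern is the contextual cue.
CUE = re.compile(r"(\w+) will get you really high")

def candidates(text):
    """Yield (start, end, surface) character spans for the capture group."""
    for m in CUE.finditer(text):
        yield m.start(1), m.end(1), m.group(1)

spans = list(candidates("Acid will get you really high."))
```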
Alternative to the Feature Request
Maybe you don’t want to go down the route of making the pattern-matching DSL feature-rich, because that’s not the core of your product. In that case, is there a more “manual” way of creating seeds from more complicated patterns? Maybe instead of passing in a corpus plus a set of seed patterns, I could pass in a corpus of parsed documents with candidate named entities already annotated. That way I could write really sophisticated pattern matching in code if I felt I needed it, without Prodigy having to support that sophistication in its pattern-matching DSL.
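To illustrate the “manual” idea, here's a hypothetical sketch that builds annotation tasks in code, with candidate entities pre-annotated. The `text`/`spans`/`start`/`end`/`label` keys follow Prodigy's JSON task schema; the extraction logic and the `DRUG` label are just placeholders for whatever bespoke code you'd write:

```python
import re

def make_tasks(texts):
    # Any custom extraction logic can go here; a regex cue is just
    # the simplest stand-in.
    cue = re.compile(r"(\w+) will get you really high")
    for text in texts:
        for m in cue.finditer(text):
            yield {
                "text": text,
                "spans": [
                    {"start": m.start(1), "end": m.end(1), "label": "DRUG"}
                ],
            }

tasks = list(make_tasks(["Acid will get you really high."]))
```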
Note that this open-ended pattern matching support could cover more than just regular expression groupings. For example, in my current project I am using spaCy to extract entities using a combination of patterns and logic that I write in code myself. The logic I write takes into account various kinds of contextual features, position in document, etc. It’s way too bespoke and brittle to be the ultimate solution, but is probably a good starting point for annotation.
If you have a look at the ner.teach recipe, you should be able to see that there’s a spaCy Matcher object being used to drive the patterns part. This makes it pretty easy to implement the logic you’re interested in. Have a look at the matcher docs here: https://spacy.io/usage/linguistic-features#section-rule-based-matching
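To make that concrete, here's a minimal sketch (not taken from the recipe itself) of how the “X will get you really high” cue could be expressed as a token-based `Matcher` pattern; a blank pipeline is used for brevity, but a loaded model works the same way:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One wildcard alphabetic token, followed by the literal cue phrase.
pattern = [
    {"IS_ALPHA": True},
    {"LOWER": "will"},
    {"LOWER": "get"},
    {"LOWER": "you"},
    {"LOWER": "really"},
    {"LOWER": "high"},
]
matcher.add("DRUG_CUE", [pattern])

doc = nlp("Acid will get you really high, they said.")
# The first token of each match is the candidate entity.
candidates = [doc[start].text for _, start, end in matcher(doc)]
```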
We think it’s generally better to encourage customisation and scripting of the tool in the recipe function, instead of trying to move the logic into lots of different data formats. One of the ideas in Prodigy is that you already know a good scripting language — Python.
Sorry, I think the detailed API docs are currently missing because the PatternMatcher was only added recently – we'll update the docs for the next release. In the meantime, I'm just putting them here. (@honnibal Feel free to add to this in case I forgot something!)
METHOD `PatternMatcher.__init__`

Create a new pattern matcher.

| Argument | Type | Description |
| --- | --- | --- |
| `nlp` | `Language` | The `nlp` object with a loaded spaCy model. |
| **RETURNS** | `PatternMatcher` | The pattern matcher. |
METHOD `PatternMatcher.__call__`

Match patterns on a stream of tasks.

| Argument | Type | Description |
| --- | --- | --- |
| `stream` | iterable | The stream of annotation examples, i.e. dictionaries. |
| **YIELDS** | tuple | `(score, task)` tuples. Tasks include a `"span"` property of the matched text as well as a `"label"`, set by the pattern. The `"meta"` includes the score and the ID of the matched pattern. |
Add a new `Matcher` or `PhraseMatcher` to the pattern matcher model.

| Argument | Type | Description |
| --- | --- | --- |
| `matcher` | `Matcher` / `PhraseMatcher` | The matcher to add. |
METHOD `PatternMatcher.update`

Update the pattern matcher model with annotations.

| Argument | Type | Description |
| --- | --- | --- |
| `examples` | list | A list of dictionaries of examples. |
| `drop` | float | The dropout rate, defaults to `0`. |
| `batch_size` | int | The batch size, defaults to `8`. |
| **RETURNS** | int | `0` |
METHOD `PatternMatcher.from_disk`

Load a list of patterns from a file and add them to the pattern matcher. The file should be newline-delimited JSON (JSONL) with one entry per line. See the patterns file documentation for details.

| Argument | Type | Description |
| --- | --- | --- |
| `path` | unicode or `Path` | Path to the patterns file. |
| **RETURNS** | `PatternMatcher` | The pattern matcher with the loaded patterns. |
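For context, a JSONL patterns file of the kind `from_disk` loads typically looks like this – one JSON entry per line, with either a token pattern or an exact phrase (the label and terms here are just illustrative):

```jsonl
{"label": "DRUG", "pattern": [{"lower": "acid"}]}
{"label": "DRUG", "pattern": "magic mushrooms"}
```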
We're also planning to add a `has_label` method to keep the `PatternMatcher` consistent with the other models. This hasn't been relevant so far in the existing recipes, but might be nice in the future so you can do `matcher.has_label('DRUG')`.
Hi,
Adding each word from the dictionary to the `PhraseMatcher` after passing it through `nlp()` takes a lot of time. I think it would be better to do this once, save the model to disk and reuse it. Even `nlp.pipe()` takes a long time for a large dictionary added to the `PhraseMatcher` (when the generator is iterated over). Is there any other way to add phrases from large dictionaries faster?
@sandeep118 I think the problem is that `nlp.pipe()` applies the whole pipeline. Is it faster if you do `docs = (nlp.make_doc(text) for text in texts)`?
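For reference, a minimal sketch of that suggestion, using a blank pipeline for brevity (the terms are just placeholders):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # a loaded model works the same way
terms = ["marijuana", "magic mushrooms", "lsd"]

# nlp.make_doc() only runs the tokenizer, skipping the tagger, parser,
# NER, etc. For PhraseMatcher patterns, tokenization is all you need,
# so this is much faster than running the full pipeline via nlp.pipe().
patterns = [nlp.make_doc(term) for term in terms]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match case-insensitively
matcher.add("DRUG", patterns)

matches = matcher(nlp.make_doc("She took LSD yesterday"))
```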