Say you’re building seed patterns to detect the names of drugs, as in the named entity video tutorial. In that tutorial, your seeds are all examples of drugs you’d like to match, but there are other a priori textual cues that something might be a drug. For instance, in the sentence “X will get you really high”, X is likely a drug, regardless of what surface form X takes.
In addition to the current patterns, I’d like to be able to write seed patterns equivalent to the following regular expression:

`(\w+) will get you really high`

where ner.teach suggests whatever is matched by `(\w+)` as a candidate named entity.
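As a sketch of the intended semantics, here's how the equivalent extraction could be done with Python's `re` module (the cue phrase is just the example from above):

```python
import re

# The capture group is the candidate entity; the rest of the
# pattern is the contextual cue.
CUE = re.compile(r"(\w+) will get you really high")

def candidates(text):
    """Yield (start, end, surface) character spans for the capture group."""
    for m in CUE.finditer(text):
        yield m.start(1), m.end(1), m.group(1)

spans = list(candidates("Acid will get you really high."))
```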
Alternative to the Feature Request
Maybe you don’t want to go down the route of making the pattern-matching DSL feature-rich, because that’s not the core of your product. In that case, is there a more “manual” way of creating seeds from more complicated patterns? Maybe instead of passing in a corpus plus a set of seed patterns, I could pass in a corpus of parsed documents with candidate named entities already annotated. That way I could write really sophisticated pattern matching in code if I felt I needed it, without Prodigy having to support that sophistication in its pattern-matching DSL.
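To illustrate the “manual” idea, here's a hypothetical sketch that builds annotation tasks in code, with candidate entities pre-annotated. The `text`/`spans`/`start`/`end`/`label` keys follow Prodigy's JSON task schema; the extraction logic and the `DRUG` label are just placeholders for whatever bespoke code you'd write:

```python
import re

def make_tasks(texts):
    # Any custom extraction logic can go here; a regex cue is just
    # the simplest stand-in.
    cue = re.compile(r"(\w+) will get you really high")
    for text in texts:
        for m in cue.finditer(text):
            yield {
                "text": text,
                "spans": [
                    {"start": m.start(1), "end": m.end(1), "label": "DRUG"}
                ],
            }

tasks = list(make_tasks(["Acid will get you really high."]))
```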
Note that this open-ended pattern matching support could cover more than just regular expression groupings. For example, in my current project I am using spaCy to extract entities using a combination of patterns and logic that I write in code myself. The logic I write takes into account various kinds of contextual features, position in document, etc. It’s way too bespoke and brittle to be the ultimate solution, but is probably a good starting point for annotation.
If you have a look at the ner.teach recipe, you should be able to see that there’s a spaCy Matcher object being used to drive the patterns part. This makes it pretty easy to implement the logic you’re interested in. Have a look at the matcher docs here: https://spacy.io/usage/linguistic-features#section-rule-based-matching
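To make that concrete, here's a minimal sketch (not taken from the recipe itself) of how the “X will get you really high” cue could be expressed as a token-based `Matcher` pattern; a blank pipeline is used for brevity, but a loaded model works the same way:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One wildcard alphabetic token, followed by the literal cue phrase.
pattern = [
    {"IS_ALPHA": True},
    {"LOWER": "will"},
    {"LOWER": "get"},
    {"LOWER": "you"},
    {"LOWER": "really"},
    {"LOWER": "high"},
]
matcher.add("DRUG_CUE", [pattern])

doc = nlp("Acid will get you really high, they said.")
# The first token of each match is the candidate entity.
candidates = [doc[start].text for _, start, end in matcher(doc)]
```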
We think it’s generally better to encourage customisation and scripting of the tool in the recipe function, instead of trying to move the logic into lots of different data formats. One of the ideas in Prodigy is that you already know a good scripting language — Python.
Sorry, I think the detailed API docs are currently missing because the PatternMatcher was only added recently – we'll update the docs for the next release. In the meantime, I'm just putting them here. (@honnibal Feel free to add to this in case I forgot something!)
METHOD `PatternMatcher.__init__`

Create a new pattern matcher.

| Argument | Type | Description |
| --- | --- | --- |
| `nlp` | `Language` | The `nlp` object with a loaded spaCy model. |
| **RETURNS** | `PatternMatcher` | The pattern matcher. |
METHOD `PatternMatcher.__call__`

Match patterns on a stream of tasks.

| Argument | Type | Description |
| --- | --- | --- |
| `stream` | iterable | The stream of annotation examples, i.e. dictionaries. |
| **YIELDS** | tuple | `(score, task)` tuples. Tasks include a `"span"` property of the matched text as well as a `"label"`, set by the pattern. The `"meta"` includes the score and the ID of the matched pattern. |
Add a new `Matcher` or `PhraseMatcher` to the pattern matcher model.

| Argument | Type | Description |
| --- | --- | --- |
| `matcher` | `Matcher` / `PhraseMatcher` | The matcher to add. |
METHOD `PatternMatcher.update`

Update the pattern matcher model with annotations.

| Argument | Type | Description |
| --- | --- | --- |
| `examples` | list | A list of dictionaries of examples. |
| `drop` | float | The dropout rate, defaults to `0`. |
| `batch_size` | int | The batch size, defaults to `8`. |
| **RETURNS** | int | `0` |
METHOD `PatternMatcher.from_disk`

Load a list of patterns from a file and add them to the pattern matcher. The file should be newline-delimited JSON (JSONL) with one entry per line. See the patterns file documentation for details.

| Argument | Type | Description |
| --- | --- | --- |
| `path` | unicode or `Path` | Path to the patterns file. |
| **RETURNS** | `PatternMatcher` | The pattern matcher with the loaded patterns. |
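For context, a JSONL patterns file of the kind `from_disk` loads typically looks like this – one JSON entry per line, with either a token pattern or an exact phrase (the label and terms here are just illustrative):

```jsonl
{"label": "DRUG", "pattern": [{"lower": "acid"}]}
{"label": "DRUG", "pattern": "magic mushrooms"}
```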
We're also planning to add a `has_label` method to keep the `PatternMatcher` consistent with the other models. This hasn't been relevant so far in the existing recipes, but might be nice in the future so you can do `matcher.has_label('DRUG')`.
Hi,
Adding each word from the dictionary to the `PhraseMatcher` after passing it through `nlp()` takes a lot of time. I think it would be better to do this once, save the model to disk and reuse it. Even `nlp.pipe()` takes a long time for a large dictionary added to the `PhraseMatcher` (when the generator is iterated over). Is there any other way to add phrases from large dictionaries faster?
@sandeep118 I think the problem is that `nlp.pipe()` applies the whole pipeline. Is it faster if you do `docs = (nlp.make_doc(text) for text in texts)`?
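For reference, a minimal sketch of that suggestion, using a blank pipeline for brevity (the terms are just placeholders):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # a loaded model works the same way
terms = ["marijuana", "magic mushrooms", "lsd"]

# nlp.make_doc() only runs the tokenizer, skipping the tagger, parser,
# NER, etc. For PhraseMatcher patterns, tokenization is all you need,
# so this is much faster than running the full pipeline via nlp.pipe().
patterns = [nlp.make_doc(term) for term in terms]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match case-insensitively
matcher.add("DRUG", patterns)

matches = matcher(nlp.make_doc("She took LSD yesterday"))
```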