Say you’re building seed patterns to detect the names of drugs like you do in the named entity video tutorial. In that tutorial your seeds are all examples of drugs you’d like to match, but there are other a priori textual cues that something might be a drug. For instance, in the sentence “X will get you really high”, X is likely a drug, regardless of what surface form X takes.
In addition to the current patterns, I’d like to be able to write seed patterns equivalent to the following regular expression:
(\w+) will get you really high
With such a pattern, ner.teach would suggest whatever the group (\w+) matches as a candidate named entity.
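To make the request concrete, here is a minimal sketch in plain Python of what I mean by "whatever the group matches". It only uses the standard `re` module and is not how Prodigy works internally; the `find_candidates` helper, the "DRUG" label and the made-up drug name "zorbex" are assumptions for illustration.

```python
import re

# Context pattern from above: the capture group is the candidate entity,
# the rest of the pattern is only the textual cue around it.
PATTERN = re.compile(r"(\w+) will get you really high")

def find_candidates(text, label="DRUG"):
    """Return character-offset spans for whatever the capture group matched.

    `label` is a hypothetical entity label chosen for this example.
    """
    return [
        {"start": m.start(1), "end": m.end(1), "text": m.group(1), "label": label}
        for m in PATTERN.finditer(text)
    ]

# "zorbex" is an invented name, used only to show that the surface form
# of the candidate does not matter.
text = "My cousin swears that zorbex will get you really high."
for span in find_candidates(text):
    print(span["text"], span["start"], span["end"])
```

The point is that the match is driven by the context, so the candidate can be a word that no drug list contains.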
Alternative to the Feature Request
Maybe you don’t want to go down the route of making the pattern matching DSL feature-rich because that’s not the core of your product. In that case, is there a more “manual” way of creating seeds from more complicated patterns? Maybe instead of passing in a corpus plus a set of seed patterns, I could pass in a corpus of parsed documents with candidate named entities already annotated. That way I could write really sophisticated pattern matching in code if I felt like I needed it, without Prodigy having to support that sophistication in its pattern matching DSL.
Note that this open-ended pattern matching support could cover more than just regular expression groupings. For example, in my current project I am using spaCy to extract entities using a combination of patterns and logic that I write in code myself. The logic I write takes into account various kinds of contextual features, position in document, etc. It’s way too bespoke and brittle to be the ultimate solution, but is probably a good starting point for annotation.
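As a rough sketch of the "manual" route, this is the kind of pre-annotated input I have in mind: my own code finds the candidates and writes one JSON object per line, each with the text and a list of character-offset spans. I'm assuming a "text" plus "spans" (start, end, label) layout is what would be accepted; whether a recipe can actually consume pre-annotated candidates like this is exactly my question. The two rules, the "DRUG" label and the helper names are all hypothetical stand-ins for my real, more bespoke logic.

```python
import json
import re

# Hypothetical hand-written rules. Each one yields (start, end) character
# offsets of a candidate entity found through its context.
def rule_gets_you_high(text):
    for m in re.finditer(r"(\w+) will get you really high", text):
        yield m.start(1), m.end(1)

def rule_dose_of(text):
    for m in re.finditer(r"\d+\s?mg of (\w+)", text):
        yield m.start(1), m.end(1)

RULES = [rule_gets_you_high, rule_dose_of]

def to_task(text, label="DRUG"):
    """Build one pre-annotated example; `label` is an assumed label name."""
    # A set removes duplicates when two rules find the same span.
    offsets = sorted({span for rule in RULES for span in rule(text)})
    spans = [{"start": s, "end": e, "label": label} for s, e in offsets]
    return {"text": text, "spans": spans}

def to_jsonl(texts):
    """Serialize the examples as JSON Lines, one task per line."""
    return "\n".join(json.dumps(to_task(t)) for t in texts)

corpus = [
    "I took 20 mg of zorbex and zorbex will get you really high",
    "Nothing relevant in this sentence.",
]
print(to_jsonl(corpus))
```

Any logic could sit behind the rule functions (spaCy matches, position in the document, and so on), as long as it ends up producing offsets.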