PhraseMatcher and spaces

kbyatnal · December 17, 2017, 1:39am

I’m trying to create a new entity (e.g. Fruit) for NER.

I’m following the suggestions here but I’m having some issues annotating my text.

It seems to ignore text surrounded by ‘[’ and ‘]’.

matcher.add('FRUIT', None, nlp('Apple'))  # PhraseMatcher patterns are Doc objects, not strings
# [Apple](www.apple.com) is a company -- this sentence will not match

Is there something in the settings to fix this?

kbyatnal · December 17, 2017, 2:30am

On doing some more testing, it looks like this occurs when there isn’t a space at the end of the match.

<p>Apple</p> #This will not match
<p>Apple </p> #This will match

I see that the option to disable this exists for Matcher - can it also be applied for PhraseMatcher?

ines · December 17, 2017, 11:08am

Just like the Matcher, the PhraseMatcher depends on the tokenization, since it operates on Doc objects (and not on raw text). This means that the phrase patterns have to map to whole tokens and not just parts of tokens. This is also what makes it so efficient, and more efficient than running regular expressions over raw text.

By default, spaCy’s tokenizer will split on whitespace characters and then apply additional prefix, suffix and infix rules. I assume that in your examples, “Apple” isn’t split into its own token, so the pattern won’t match. For example:

| String | Tokens | Match |
| --- | --- | --- |
| `'Hello Apple'` | `['Hello', 'Apple']` | match for “Apple” |
| `'<p>Apple</p>'` | `['<p>Apple</p>']` | no match for “Apple” |
| `'<p> Apple'` | `['<p>', 'Apple']` | match for “Apple” |
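
To make the table above concrete, here is a minimal pure-Python sketch of token-level matching. It is only a toy model: `whitespace_tokens` and `find_phrase` are hypothetical helpers, and the tokenizer splits on whitespace alone, whereas spaCy's real tokenizer also applies prefix, suffix and infix rules and may split these strings differently. The point is just that a phrase only matches when it lines up with whole tokens.

```python
def whitespace_tokens(text):
    """Toy tokenizer (assumption): split on whitespace only, no affix rules."""
    return text.split()


def find_phrase(tokens, phrase_tokens):
    """Return the start indices where phrase_tokens occurs as a run of whole tokens."""
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]


examples = ["Hello Apple", "<p>Apple</p>", "<p> Apple"]
results = {text: find_phrase(whitespace_tokens(text), ["Apple"]) for text in examples}

for text, hits in results.items():
    # "Apple" is found only when it is a token of its own
    print(repr(text), whitespace_tokens(text), "match" if hits else "no match")
```

A plain substring search would find “Apple” in all three strings. The token-based lookup does not, because in `'<p>Apple</p>'` the word is fused with the surrounding markup into a single token.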

As a solution, you could either pre-process your raw text to strip out markup before processing it with spaCy, or customise spaCy’s tokenization rules to make sure HTML tags and Markdown formatting characters are always split off.
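
The first option (stripping markup before the text reaches spaCy) could look roughly like the sketch below. It uses only the standard library. The two regular expressions and the `strip_markup` helper are assumptions made for illustration: they cover simple HTML tags and inline Markdown links like the ones in this thread, and are not a general-purpose HTML or Markdown parser.

```python
import re

# Assumed patterns, deliberately simple:
TAG_RE = re.compile(r"<[^>]+>")                    # HTML tags such as <p> or </p>
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")  # Markdown links: [text](target)


def strip_markup(text):
    """Keep the link text of Markdown links and replace HTML tags with spaces."""
    text = MD_LINK_RE.sub(r"\1", text)  # "[Apple](www.apple.com)" -> "Apple"
    text = TAG_RE.sub(" ", text)        # "<p>Apple</p>" -> " Apple "
    return " ".join(text.split())       # collapse leftover whitespace


raw_texts = ["<p>Apple</p>", "[Apple](www.apple.com) is a company"]
cleaned = [strip_markup(t) for t in raw_texts]
print(cleaned)
```

The cleaned strings can then be passed to the `nlp` object as usual. One thing to keep in mind is that character offsets in the cleaned text no longer line up with the original raw text, which matters if annotations need to be mapped back to the source.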
