PhraseMatcher and spaces

I’m trying to create a new entity (e.g. Fruit) for NER.

I’m following the suggestions here but I’m having some issues annotating my text.

It seems to ignore text surrounded by ‘[’ and ‘]’.

matcher.add('FRUIT', None, nlp('Apple'))  # PhraseMatcher patterns are Doc objects, not strings
# "[Apple](www.apple.com) is a company" -- this sentence will not match

Is there something in the settings to fix this?

On doing some more testing, it looks like this occurs when there isn’t a space at the end of the match.

<p>Apple</p> #This will not match
<p>Apple </p> #This will match

I see that the option to disable this exists for Matcher - can it also be applied for PhraseMatcher?

Just like the Matcher, the PhraseMatcher depends on the tokenization, since it operates on Doc objects (and not on raw text). This means that the phrase matcher still expects the phrase patterns to map to whole tokens and not just parts of tokens. That constraint is also why it’s more efficient than running regular expressions over the raw text.

By default, spaCy’s tokenizer will split on whitespace characters and then apply additional prefix, suffix and infix rules. I assume that in your examples, “Apple” isn’t split into its own token, so the pattern won’t match. For example:

String            Tokens                Match
'Hello Apple'     ['Hello', 'Apple']    match for “Apple”
'<p>Apple</p>'    ['<p>Apple</p>']      no match for “Apple”
'<p> Apple'       ['<p>', 'Apple']      match for “Apple”
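
To check how your own text is actually tokenized, you could print the tokens and the matches side by side. This is only a minimal sketch: it assumes spaCy v3 (where `PhraseMatcher.add` takes a list of `Doc` patterns) and a blank English pipeline, and `tokens_and_matches` is a hypothetical helper name, not part of spaCy. The exact tokens for markup-heavy strings depend on your spaCy version and tokenizer rules, so it's worth running this on your real data instead of relying on the table above.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline is enough, since only the tokenizer matters here.
nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
# spaCy v3 signature: patterns are passed as a list of Doc objects, not strings.
matcher.add("FRUIT", [nlp.make_doc("Apple")])


def tokens_and_matches(text):
    """Hypothetical helper: return (token texts, matched span texts) for a string."""
    doc = nlp.make_doc(text)
    tokens = [token.text for token in doc]
    matches = [doc[start:end].text for _, start, end in matcher(doc)]
    return tokens, matches


for text in ["Hello Apple", "<p>Apple</p>", "[Apple](www.apple.com) is a company"]:
    print(repr(text), "->", tokens_and_matches(text))
```

If “Apple” never shows up as its own entry in the token list, the phrase matcher has nothing to match against for that string.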

As a solution, you could either pre-process your raw text to strip out markup before processing it with spaCy, or customise spaCy’s tokenization rules to make sure HTML tags and Markdown formatting characters are always split off.
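
Both options could look roughly like the sketch below. It assumes spaCy v3 and a blank English pipeline. The regular expressions are deliberately naive, and `strip_markup` and `find_fruit` are hypothetical helper names, so treat this as a starting point and not as a complete HTML or Markdown handler.

```python
import re

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

# --- Option 1: strip markup before spaCy sees the text ---
TAG_RE = re.compile(r"<[^>]+>")                    # naive HTML tag pattern
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")  # [label](target) -> label


def strip_markup(text):
    """Hypothetical helper: remove simple HTML tags and Markdown links."""
    text = MD_LINK_RE.sub(r"\1", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())  # collapse leftover whitespace


# --- Option 2: extend the tokenizer rules so markup is split off ---
nlp = spacy.blank("en")
tag = r"</?[^>]+>"
# Custom rules are listed first so they are tried before the default rules.
prefixes = [tag] + list(nlp.Defaults.prefixes)
suffixes = [tag] + list(nlp.Defaults.suffixes)
infixes = [r"\]\("] + list(nlp.Defaults.infixes)  # split "](" inside Markdown links
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

matcher = PhraseMatcher(nlp.vocab)
matcher.add("FRUIT", [nlp.make_doc("Apple")])  # spaCy v3 signature


def find_fruit(text):
    """Hypothetical helper: return the texts of all FRUIT matches."""
    doc = nlp.make_doc(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(strip_markup("[Apple](www.apple.com) is a company"))
print(find_fruit("<p>Apple</p>"))
```

One thing to keep in mind when choosing: stripping markup changes the character offsets, so spans found in the cleaned text no longer line up with the original string. Customising the tokenizer leaves the text untouched, which is usually easier to work with if you want to annotate the original documents.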