Just like the Matcher, the PhraseMatcher also depends on the tokenization, since it operates on Doc objects (and not on raw text). This means that the phrase matcher still expects the phrase patterns to map to complete tokens, not just parts of tokens – but that token-based design is also what makes it so much more efficient than running regular expressions over raw text.
By default, spaCy’s tokenizer will split on whitespace characters and then apply additional prefix, suffix and infix rules. I assume that in your examples, “Apple” isn’t split into its own token, so the pattern won’t match. For example:
| String | Tokens | Match |
| --- | --- | --- |
| `'Hello Apple'` | `['Hello', 'Apple']` | → match for "Apple" |
| `'<p>Apple</p>'` | `['<p>Apple</p>']` | → no match for "Apple" |
| `'<p> Apple'` | `['<p>', 'Apple']` | → match for "Apple" |
As a solution, you could either pre-process your raw text to strip out markup before processing it with spaCy, or customise spaCy’s tokenization rules to make sure HTML tags and Markdown formatting characters are always split off.
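To sketch the second option: spaCy lets you extend the tokenizer's prefix and suffix rules, so a tag pattern can be split off as its own token. The regex `</?\w+>` here is a deliberately naive example (it ignores attributes, so real-world HTML may need a more robust pattern); placing the custom rule first lets it take precedence over shorter default rules:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = spacy.blank("en")

# Deliberately simple tag pattern (no attributes, no comments) – adjust as needed.
tag_pattern = r"</?\w+>"

# Prepend the custom rule so it wins over any shorter default punctuation rules.
prefixes = [tag_pattern] + list(nlp.Defaults.prefixes)
suffixes = [tag_pattern] + list(nlp.Defaults.suffixes)
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

matcher = PhraseMatcher(nlp.vocab)
matcher.add("APPLE", [nlp("Apple")])

doc = nlp("<p>Apple</p>")
print([token.text for token in doc])  # the tags are now split off from "Apple"
print(len(matcher(doc)))
```

With the tags split off into their own tokens, "Apple" becomes matchable again – without having to pre-process the raw text.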