Hi Ines, looking at the NER example here, it would be nice to have more information about pattern files. Using the ner.teach recipe, I am struggling to get multi-word expressions, or anything remotely useful, pulled out. I am not sure whether this is due to a malformed pattern file or whether multi-word expressions are impossible to extract.
@swartchris8 Have you seen the example patterns yet? The country patterns have a lot of multi-word examples in them. Under the hood, the token-based patterns follow the same logic as the patterns for spaCy's rule-based matcher. To test and debug the patterns interactively on your text, you might find our demo useful:
@ines Actually, looking at the pattern explorer, it is still not clear to me how to write lowercase multi-word expressions, or even just how to put longer text into the pattern demo. Out-of-vocabulary pattern words also seem to cause a bit of trouble. My interim solution was just using the “text” key:
{"label": "MINOR_HARM", "pattern": [{"text": "inconvenience"}, {"text":"let down"}, {"text":"upset"}, {"text":"dissatisfied"}, {"text":"dishonest"}, {"text":"no faith"}, {"text":"broken promises"}, {"text":"valuable time"}, {"text":"worried"}, {"text":"not listening"}, {"text":"frustrated"}, {"text":"disappointing"}, {"text":"misled"}, {"text":"time wasting"}, {"text":"time wasted"}, {"text":"does not care"}, {"text":"doesn't care"}, {"text":"fobbing off"}, {"text":"unprofessional"}, {"text":"schock"}, {"text":"frustrated"}, {"text":"annoyed"}, {"text":"loosed time"}, {"text":"loss of time"}, {"text":"irrititating"}, {"text":"ruined"}, {"text":"phone exchange took"}, {"text":"shamed"}, {"text":"emotionally"}, {"text":"calling repeatedly"}]}
@swartchris8 Each dictionary represents one token and each entry in the dictionary represents a token attribute. In the interactive demo, each block on the left represents one token, and each line represents a token attribute.
So if you want to match a phrase like “hello world”, you could write a pattern like this:
{"label": "SOME_LABEL", "pattern": [{"lower": "hello"}, {"lower": "world"}]}
This will match a sequence of two tokens: one whose lowercase form matches “hello” and one whose lowercase form matches “world”. So for example, “hello world”, “HELLO world”, “hElLo WoRlD” and so on. See here for the example in the demo.
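To make the "one dictionary per token" rule concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not spaCy's matcher: the helpers `token_attrs` and `find_matches` are hypothetical names, only the "text" and "lower" attributes are modelled, and whitespace splitting is assumed in place of spaCy's tokenizer. It shows why a single dictionary such as {"text": "let down"} never matches, while two dictionaries do.

```python
def token_attrs(token):
    """Attributes of one token in this toy model (spaCy exposes many more)."""
    return {"text": token, "lower": token.lower()}


def find_matches(pattern, tokens):
    """Return (start, end) token spans where each dict in `pattern`
    matches the token at the same offset."""
    spans = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(
            token_attrs(tok).get(attr) == value
            for spec, tok in zip(pattern, window)
            for attr, value in spec.items()
        ):
            spans.append((start, start + width))
    return spans


# Assumption: whitespace splitting stands in for spaCy's tokenizer.
tokens = "I felt let down and said HELLO world".split()

two_tokens = [{"lower": "hello"}, {"lower": "world"}]
one_token_with_space = [{"text": "let down"}]  # one dict = one token
split_phrase = [{"lower": "let"}, {"lower": "down"}]

print(find_matches(two_tokens, tokens))
print(find_matches(one_token_with_space, tokens))
print(find_matches(split_phrase, tokens))
```

In this toy model no token contains a space, so the single-dictionary pattern finds nothing, while the two-dictionary version finds the phrase. The real tokenizer differs in details such as punctuation and contractions, so it is still worth checking the actual tokens in the demo.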
For more advanced examples using other token attributes, see the spaCy documentation. Token-based patterns are very powerful because they let you use a variety of attributes. For example, you could write {"lemma": "have"}
to match all tokens with the base form “have” (“have”, “had”, “having”, etc.). However, you need to pay close attention to spaCy’s tokenization. If you describe two tokens and spaCy’s tokenizer doesn’t split the text into two tokens, your pattern will never match.
If you already have word lists and are only interested in exact string matches, you could also write string patterns instead:
{"label": "SOME_LABEL", "pattern": "Hello world"}
This will match the exact phrase “Hello world” (case sensitive).
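Applied to a word list like the one earlier in the thread, this suggests writing one pattern entry per phrase, rather than one entry that lists every phrase as a token in a single sequence. The sketch below generates such entries, one JSON object per line. The helper `phrase_to_token_pattern` is a hypothetical name, and splitting on whitespace is only an approximation of spaCy's tokenizer (for instance, contractions like "doesn't" may be split differently), so treat the output as a starting point to verify.

```python
import json


def phrase_to_token_pattern(label, phrase):
    """Build one case-insensitive token pattern for a single phrase.

    Assumption: whitespace splitting approximates spaCy's tokenization.
    """
    tokens = [{"lower": word.lower()} for word in phrase.split()]
    return {"label": label, "pattern": tokens}


phrases = ["let down", "no faith", "upset"]

# One JSON object per line, one pattern per phrase.
lines = [json.dumps(phrase_to_token_pattern("MINOR_HARM", p)) for p in phrases]
print("\n".join(lines))
```

If exact, case-sensitive matches are enough, the same loop could emit string patterns instead by using the phrase itself as the "pattern" value.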