ner.manual: multi-word entities containing "-" are not recognized


My patterns.jsonl looks like this:

{"label": "X", "pattern": [{"lower": "a-b-c"}, {"lower": "d"}]}

Here is how I run Prodigy's ner.manual:

prodigy ner.manual data_1 blank:en data.jsonl --label A,B --patterns patterns.jsonl

The goal is to have the multi-word entity "a-b-c d" pre-highlighted when running the ner.manual command. However, "a-b-c d" is not recognized. I guess the issue is the tokenization of "-". How can I solve this, given that my data contains many multi-word entities with "-"?

Hi! When writing patterns, it often helps to double-check the tokenization of what you're trying to match using the given tokenizer. For example:

import spacy

# Blank English pipeline: the same tokenizer that blank:en uses in the command above
nlp = spacy.blank("en")
doc = nlp("a-b-c d")
print([token.text for token in doc])
# ['a', '-', 'b', '-', 'c', 'd']

So in this example, the string is split on "-" (which seems like a reasonable default). This means that a pattern looking for a single token "a-b-c" will never match, because the tokenizer never produces that token. Instead, you probably want your pattern to look something like this, with one dictionary describing one token:

{"label": "X", "pattern": [{"lower": "a"}, {"lower": "-"}, {"lower": "b"}, {"lower": "-"}, {"lower": "c"}, {"lower": "d"}]}
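
If you have many entities like this, writing the token-level patterns by hand gets tedious. Here is a minimal sketch that builds them from plain strings using the same blank English tokenizer. The helper name make_pattern and the phrases list are made up for illustration, so adapt the phrases and labels to your data:

```python
import json
import spacy

# Same tokenizer as blank:en, so the patterns line up with what ner.manual sees
nlp = spacy.blank("en")

def make_pattern(label, phrase):
    """Build a token-level pattern with one {"lower": ...} dict per token."""
    doc = nlp.make_doc(phrase)  # tokenize only, no pipeline components needed
    return {"label": label, "pattern": [{"lower": token.lower_} for token in doc]}

# Hypothetical list of (label, phrase) pairs; replace with your own entities
phrases = [("X", "a-b-c d")]

# Print one JSON object per line, ready to save as patterns.jsonl
for label, phrase in phrases:
    print(json.dumps(make_pattern(label, phrase)))
```

For "a-b-c d" this should print the same six-token pattern as the one written out above. If you later switch to a different model or a custom tokenizer, regenerate the patterns with that pipeline so the token boundaries still agree.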

Thanks a lot!