The goal is to highlight the multi-word entity "a-b-c d" when running the ner.manual command. However, the multi-word entity "a-b-c d" is not recognized. I guess the issue is in the tokenization of "-". How can I solve this, given that my data contains many multi-word entities containing "-"?
Hi! When writing patterns, it often helps to double-check the tokenization of what you're trying to match using the given tokenizer. For example:
import spacy

nlp = spacy.blank("en")
doc = nlp("a-b-c d")
print([token.text for token in doc])
# ['a', '-', 'b', '-', 'c', 'd']
So in this example, the string is split on - (which seems like a reasonable default). This means that a pattern looking for a single token a-b-c will never match because the tokenizer never produces it. Instead, you probably want your pattern to look something like this, with one dictionary describing one token:
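As a quick way to check this idea, here is a minimal sketch that runs both kinds of pattern through spaCy's Matcher. It assumes spaCy v3 (where `matcher.add` takes a list of patterns). The pattern names `SINGLE_TOKEN` and `PER_TOKEN` and the label `MY_ENTITY` are made up for illustration and are not from the original post.

```python
import json

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("a-b-c d")

# Treats "a-b-c" as one token, which the tokenizer never produces.
single_token = [{"lower": "a-b-c"}, {"lower": "d"}]

# One dictionary per token, mirroring the tokenization shown above.
per_token = [
    {"lower": "a"}, {"orth": "-"},
    {"lower": "b"}, {"orth": "-"},
    {"lower": "c"}, {"lower": "d"},
]

matcher = Matcher(nlp.vocab)
matcher.add("SINGLE_TOKEN", [single_token])  # hypothetical pattern name
matcher.add("PER_TOKEN", [per_token])        # hypothetical pattern name

# Collect (pattern name, matched text) pairs for every match.
matched = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(matched)

# One line of a patterns file, using a hypothetical label.
patterns_line = json.dumps({"label": "MY_ENTITY", "pattern": per_token})
print(patterns_line)
```

Only the per-token pattern should produce a match here. If your entities use different casing or different separators, print the tokens first and adjust the dictionaries accordingly.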