Accept hyphen (-) in pattern shapes

Hi,

I was trying to annotate these texts:

{"text":"Silicon Valley Investors Flexed Their Muscles in Uber C15.3 Fight"}
{"text":"Uber is a Creature of an Industry 2-123 Struggling to Grow Up 1-840x"}
{"text":"\u2018The Internet Is Broken\u2019: @ev Is  P12.23 Trying to Salvage It"}

with the patterns defined as:

{"label":"CODE","pattern":[{"SHAPE":"d-ddd"}], "example": ["7-780"]}
{"label":"CODE","pattern":[{"SHAPE":"Xdd.d"}], "example": ["C15.3"]}
{"label":"CODE","pattern":[{"SHAPE":"d-dddx"}], "example": ["1-840x"]}

Codes like “P12.23” are detected well, but codes containing a hyphen, such as “1-840x”, are not detected as a whole. Sometimes only the “1” is labelled as CODE and the rest is left out.

I would appreciate any suggestions on how to achieve this.

Hi! I think the main reason your patterns don't match is that patterns are token-based: each dict in a pattern describes exactly one token. By default, the English tokenizer will split some of your strings into several tokens:

from spacy.lang.en import English

nlp = English()
doc = nlp("7-780")
print([token.text for token in doc])
# ['7', '-', '780']

This means that your patterns also need to reflect this by defining one dict per token. For example:

[{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "ddd"}]
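To check that a token-based pattern like this matches one of the example texts, here is a minimal sketch using spaCy's `Matcher` directly. It assumes spaCy v3, where `Matcher.add` takes a list of patterns (older v2 releases use a different call signature), and a blank English pipeline with the default tokenizer.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # blank English pipeline: tokenizer only, no trained model
matcher = Matcher(nlp.vocab)

# One dict per token: a single digit, a literal hyphen, then three digits
pattern = [{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "ddd"}]
matcher.add("CODE", [pattern])  # assumes the spaCy v3 signature

doc = nlp("Uber is a Creature of an Industry 2-123 Struggling to Grow Up")
# Collect the text of every matched span
matches = [doc[start:end].text for _match_id, start, end in matcher(doc)]
print(matches)
```

If the list comes back empty, the tokenizer probably split the text differently from what the pattern expects.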

If you haven't seen it already, you might also find our interactive Matcher demo useful. It lets you test your patterns against a text to make sure they match.
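You can also run a similar check locally by printing each token together with its shape, which shows what the `SHAPE` values in a pattern need to look like. This is only a sketch: `token_shapes` is a hypothetical helper name, and the output depends on the tokenizer rules of the spaCy version you have installed.

```python
from spacy.lang.en import English

nlp = English()  # default English tokenizer, no trained components

def token_shapes(text):
    """Return (token text, token shape) pairs for a string."""
    return [(token.text, token.shape_) for token in nlp(text)]

# Inspect how each example code is tokenized before writing its pattern
for example in ["7-780", "C15.3", "1-840x"]:
    print(example, token_shapes(example))
```

Each tuple in the output corresponds to one dict in the pattern.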

Hi, Thank you for the quick help!

I came across the interactive pattern explorer and was able to generate this:

{"label":"CODE", "pattern":[{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "ddd"}], "example": ["7-780"]}
{"label":"CODE", "pattern":[{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "dddx"}], "example": ["1-840x"]}

Now the pattern sometimes matches correctly, and sometimes it only identifies 7 as a code from 7-780…

It’s strange: sometimes a code is present in the text but isn’t picked up, and another word is highlighted as a code instead. Why?

Are you using ner.teach with --patterns? Because if so, the other suggestions you’re seeing are suggestions by the model. Once you’ve updated the model in the loop with examples of your CODE entity, it will try to also make suggestions. In the beginning, those can be random – but as you annotate and accept / reject the examples, the model should ideally adjust to the entity definition.

You might also want to try using a few more patterns with more examples (if possible for your category). After all, you’re starting completely from scratch and the model doesn’t know anything about your entity CODE yet.

Alternatively, if you really only want to label pattern matches in your text without a model in the loop, you could also just use the ner.match recipe.

Sounds good. Will certainly look into that and get back here. :)
