Accept hyphen(-) in patterns shape

smanjil · October 12, 2018, 11:57am

Hi,

I was trying to annotate these texts:

{"text":"Silicon Valley Investors Flexed Their Muscles in Uber C15.3 Fight"}
{"text":"Uber is a Creature of an Industry 2-123 Struggling to Grow Up 1-840x"}
{"text":"\u2018The Internet Is Broken\u2019: @ev Is  P12.23 Trying to Salvage It"}

with the patterns defined as:

{"label":"CODE","pattern":[{"SHAPE":"d-ddd"}], "example": ["7-780"]}
{"label":"CODE","pattern":[{"SHAPE":"Xdd.d"}], "example": ["C15.3"]}
{"label":"CODE","pattern":[{"SHAPE":"d-dddx"}], "example": ["1-840x"]}

Text like “P12.23” is detected well, but codes with a hyphen in them, like “1-840x”, are not. Sometimes only the 1 is detected as CODE instead of the code as a whole.

I would appreciate any suggestions on how to achieve this.

ines · October 12, 2018, 12:15pm

Hi! I think the main reason your patterns don't match is that the patterns are token-based. By default, the English tokenizer will split some of the strings into several tokens:

from spacy.lang.en import English

nlp = English()
doc = nlp("7-780")
print([token.text for token in doc])
# ['7', '-', '780']

This means that your patterns also need to reflect this by defining one dict per token. For example:

[{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "ddd"}]
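To make the "one dict per token" idea concrete, here's a rough stdlib-only sketch of what token-based matching does. It is not spaCy: approx_shape, split_like_tokenizer and matches are hypothetical helpers, the splitting rule is just an assumption that reproduces the output above, and the real tokenizer and shape logic handle more cases.

```python
import re

def approx_shape(text):
    """Rough stand-in for a token shape: letters -> X/x, digits -> d, rest kept."""
    chars = []
    for ch in text:
        if ch.isdigit():
            chars.append("d")
        elif ch.isalpha():
            chars.append("X" if ch.isupper() else "x")
        else:
            chars.append(ch)
    return "".join(chars)

def split_like_tokenizer(text):
    """Assumption: whitespace separates tokens, and a hyphen between
    two digits becomes a token of its own."""
    tokens = []
    for chunk in text.split():
        tokens.extend(p for p in re.split(r"(?<=\d)(-)(?=\d)", chunk) if p)
    return tokens

def matches(tokens, pattern):
    """True if each token satisfies the dict at the same position."""
    if len(tokens) != len(pattern):
        return False
    for token, spec in zip(tokens, pattern):
        if "SHAPE" in spec and approx_shape(token) != spec["SHAPE"]:
            return False
        if "ORTH" in spec and token != spec["ORTH"]:
            return False
    return True

tokens = split_like_tokenizer("7-780")
single = [{"SHAPE": "d-ddd"}]                                   # one dict for the whole string
per_token = [{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "ddd"}]   # one dict per token

print(tokens)
print(matches(tokens, single))
print(matches(tokens, per_token))
```

In this sketch the single-dict pattern fails because no individual token has the shape d-ddd, while the three-dict pattern lines up with the three tokens.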

If you haven't seen it already, you might also find our interactive Matcher demo useful. It lets you test your patterns against a text to make sure they match.

smanjil · October 12, 2018, 12:22pm

Hi, thank you for the quick help!

I came across the interactive pattern explorer and was able to generate this:

{"label":"CODE", "pattern":[{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "ddd"}], "example": ["7-780"]}
{"label":"CODE", "pattern":[{"SHAPE": "d"}, {"ORTH": "-"}, {"SHAPE": "dddx"}], "example": ["1-840x"]}

Now the pattern sometimes matches correctly, but sometimes it identifies only the 7 from 7-780 as a code.

Stranger still, sometimes the code is there in the text but is not picked up, and another word is highlighted as a code instead. Why?

ines · October 12, 2018, 12:27pm

Are you using ner.teach with --patterns? Because if so, the other suggestions you're seeing are suggestions by the model. Once you've updated the model in the loop with examples of your CODE entity, it will try to also make suggestions. In the beginning, those can be random – but as you annotate and accept / reject the examples, the model should ideally adjust to the entity definition.

You might also want to try using a few more patterns with more examples (if possible for your category). After all, you're starting completely from scratch and the model doesn't know anything about your entity CODE yet.
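If you have a list of example codes, one way to get more patterns is to generate them. Here's a minimal sketch, assuming the same hyphen-splitting behaviour as in the tokenizer output earlier in this thread. shape_of and pattern_line are made-up helper names, and the shape logic is an ASCII-only approximation, so it's worth checking the output against the real tokenizer before using it.

```python
import json
import re

def shape_of(piece):
    """Approximate shape: A-Z -> X, a-z -> x, 0-9 -> d.
    Digits are replaced last so the inserted 'd' is not remapped to 'x'."""
    piece = re.sub(r"[A-Z]", "X", piece)
    piece = re.sub(r"[a-z]", "x", piece)
    return re.sub(r"[0-9]", "d", piece)

def pattern_line(example, label="CODE"):
    """Build one JSONL pattern entry from an example string.
    Assumption: a hyphen between two digits is tokenized separately."""
    pieces = [p for p in re.split(r"(?<=\d)(-)(?=\d)", example) if p]
    pattern = [
        {"ORTH": p} if p == "-" else {"SHAPE": shape_of(p)}
        for p in pieces
    ]
    return json.dumps({"label": label, "pattern": pattern})

# Example codes taken from the texts in this thread
for example in ["7-780", "1-840x", "C15.3", "P12.23"]:
    print(pattern_line(example))
```

Each printed line can go into a patterns file. Note that under this approximation P12.23 gets the shape Xdd.dd, which is not the same as the Xdd.d pattern from the original post, so it would need its own entry.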

Alternatively, if you really only want to label pattern matches in your text without a model in the loop, you could also just use the ner.match recipe.

smanjil · October 12, 2018, 12:29pm

Sounds good. Will certainly look into that and get back here.
