Question about PatternMatcher

Hello, I'm using PatternMatcher.from_disk in a custom recipe like this:

model = PatternMatcher(
    spacy.load(spacy_model),
    combine_matches=True,
    all_examples=True,
).from_disk(patterns)

In my patterns file I have these two entries:

{"label": "GPE", "pattern": "Murcia"}
{"label": "GPE", "pattern": "Región de Murcia"}

And in my stream I have a text like this one:

{"text": "[...]En el caso de la retirada y destrucción de bovinos muertos en la explotación, el ámbito de aplicación lo constituyen las explotaciones ubicadas en el territorio de las comunidades autónomas de Andalucía, Aragón, Principado de Asturias, Illes Balears, Canarias, Cantabria, Castilla-La Mancha, Castilla y León, Cataluña, Extremadura, Galicia, La Rioja, Madrid, Región de Murcia, Foral de Navarra y Valenciana.[...]"}

The recognized span is "Región de Murcia", not "Murcia". Is that because it is longer? If not, what is the criterion for selecting "Región de Murcia" over "Murcia"?

Your hunch is correct: when spans overlap, the longest span is preferred over shorter spans.
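To make the rule concrete, here is a minimal sketch of "longest span wins" on plain character offsets. It is only an illustration of the rule, not Prodigy's actual implementation; the function name prefer_longest and the tie-break on the earliest start are assumptions made for the example.

```python
def prefer_longest(spans):
    """Keep non-overlapping spans, preferring longer ones.

    Each span is a (start, end, label) tuple of character offsets,
    with an exclusive end.
    """
    # Longest first; ties broken by earliest start (an assumption).
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # Keep a span only if it does not overlap anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


text = "Madrid, Región de Murcia, Foral de Navarra y Valenciana."
candidates = []
for pattern in ("Murcia", "Región de Murcia"):
    start = text.find(pattern)
    candidates.append((start, start + len(pattern), "GPE"))

result = prefer_longest(candidates)
print([text[s:e] for s, e, _ in result])  # ['Región de Murcia']
```

Both patterns match here, but the two candidates overlap, so only the longer one survives.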


If you want to implement your own custom logic for which spans to actually annotate, you could use spaCy's PhraseMatcher directly and match your patterns on each incoming text in the stream, which gives you all possible matches including overlaps. You can then have your own logic for which one to add to the "spans" if two matches overlap. Just make sure that your logic is consistent, because otherwise, you end up with inconsistent suggestions and are more likely to also get inconsistent data.
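A rough sketch of what that could look like, assuming spaCy v3 (where PhraseMatcher.add takes a list of Doc objects and the matcher accepts as_spans=True). The helper names pick_spans and add_spans, the two example key functions, and the use of a blank Spanish pipeline are all assumptions for illustration, so adapt them to your recipe.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline (tokenizer only) is enough for exact phrase matching.
nlp = spacy.blank("es")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("GPE", [nlp.make_doc(p) for p in ("Murcia", "Región de Murcia")])


def shortest_first(span):
    # Example custom rule: prefer the shorter span, then the earlier one.
    return (span.end_char - span.start_char, span.start)


def longest_first(span):
    # Mirrors the default behaviour described above.
    return (-(span.end_char - span.start_char), span.start)


def pick_spans(doc, prefer_key):
    # as_spans=True returns Span objects for all matches, overlaps included.
    kept = []
    for span in sorted(matcher(doc, as_spans=True), key=prefer_key):
        # Keep a match only if it does not overlap one already chosen.
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def add_spans(stream, prefer_key):
    # Attach the chosen matches to each task as character-offset "spans".
    for eg in stream:
        doc = nlp.make_doc(eg["text"])
        eg["spans"] = [
            {"start": s.start_char, "end": s.end_char, "label": s.label_}
            for s in pick_spans(doc, prefer_key)
        ]
        yield eg


stream = [{"text": "Madrid, Región de Murcia, Foral de Navarra y Valenciana."}]
for eg in add_spans(stream, shortest_first):
    print(eg["spans"])
```

Because the same key function is applied to every text, the same overlap is always resolved the same way, which is what keeps the suggestions consistent.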
