Extended pattern performance question

Hi Ines, hi Matt,

I'm trying to build an NER model that recognizes company address information on their websites. While some of this information needs a statistical model, other entities might benefit from rules.

To create a gold standard, my idea is to take a blank model, add an entity ruler and use its 'predictions' in a custom recipe or the 'make-gold' recipe to speed up my manual annotation process.
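
Concretely, I'm thinking of something like this (a minimal sketch, assuming spaCy v2's EntityRuler API; the language and the CITY label are just placeholders):

import spacy
from spacy.pipeline import EntityRuler

# Blank model: no statistical components, only the rules added below.
nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "CITY", "pattern": [{"LOWER": "berlin"}]}])
nlp.add_pipe(ruler)

doc = nlp("Our office is in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('Berlin', 'CITY')]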

Let's say I have a list of a few hundred or thousand of the most common city names for my CITY entity. I could create a single pattern for the entity ruler of the form

pattern = [{"LOWER": {"IN": ["city1", "city2", ...]}}]

or I could create a list of patterns

patterns = [[{"LOWER": "city1"}], [{"LOWER": "city2"}], ...]

Which version is to be preferred, or does it depend on the number of items? Or is this approach a bad idea in general?

Thank you for your help!

The IN check involves a Python function call, but you’d only be calling the function once per word, and then doing a set membership check. So it’s something like this:


from spacy.strings import get_string_id


class SetPredicate:
    def __init__(self, words):
        # Set of 64-bit integer IDs.
        self.value = set(get_string_id(word) for word in words)

    def __call__(self, word):
        # `word` is a 64-bit hash, e.g. token.lower.
        return word in self.value


predicate = SetPredicate(["city1", "city2", ...])
matches = []
for word in doc:
    if predicate(word.lower):
        matches.append(word)

The second version is basically a nested loop, but it’s fully Cython. It’s something like:

cdef const TokenC* tokens = doc.c
for i in range(doc.length):
    token = tokens[i]
    # Compare the token's lowercase hash against every single-token pattern.
    for j in range(n_patterns):
        if token.lower == patterns[j]:
            matches.append(doc[i])

So the question is really: how big an array of values can you loop over in C, following one or two pointers per value, before you’re slower than one Python function call that makes a set membership check? Off the top of my head, I don’t know. I’d guess maybe 100?

I would probably go with the second way, but you can try them both and see if there’s a big performance difference.
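
If you want to measure it, a rough sketch with timeit (assuming spaCy v2's Matcher.add(key, on_match, *patterns) signature; the city list and document are placeholders):

import timeit
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
cities = ["city%d" % i for i in range(600)]  # placeholder terminology list

# Version 1: one pattern, set-membership check per token.
in_matcher = Matcher(nlp.vocab)
in_matcher.add("CITY", None, [{"LOWER": {"IN": cities}}])

# Version 2: one single-token pattern per city.
single_matcher = Matcher(nlp.vocab)
for city in cities:
    single_matcher.add("CITY", None, [{"LOWER": city}])

doc = nlp("We moved from city1 to city599 last year.")
print(timeit.timeit(lambda: in_matcher(doc), number=1000))
print(timeit.timeit(lambda: single_matcher(doc), number=1000))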

I'm sorry to dig this up again. A short update:

Yes, you were right: I tested this with a city list of roughly 600 entries, and the IN approach is significantly faster than adding each city separately as a single pattern.

Short follow-up question:
Would there be any (dis)advantage to using the PhraseMatcher for this? Context: some city names consist of several tokens. For these, I can't use the IN operator, so at the moment I turn each of them into its own multi-token rule, as sketched below.
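
For illustration, my current transformation looks roughly like this (a sketch; splitting on whitespace is a simplification of real tokenization):

city_list = ["Berlin", "New York", "Frankfurt am Main"]  # placeholder

single = [c for c in city_list if " " not in c]
multi = [c for c in city_list if " " in c]

# One IN pattern covers all single-token names ...
patterns = [{"label": "CITY",
             "pattern": [{"LOWER": {"IN": [c.lower() for c in single]}}]}]
# ... and each multi-token name becomes its own rule,
# e.g. "New York" -> [{"LOWER": "new"}, {"LOWER": "york"}].
for city in multi:
    patterns.append({"label": "CITY",
                     "pattern": [{"LOWER": part.lower()} for part in city.split()]})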

If I understand the PhraseMatcher correctly, I can simply plug these city names in (again, each as a separate rule) and be done with it. Should I expect a difference in performance, or is it the same thing under the hood as doing it with the token matcher?

In other words, I don't understand the essential difference between the PhraseMatcher and the Matcher, except that the former can in some cases offer a more convenient interface.

Thank you in advance for your thoughts and help with that!

The PhraseMatcher takes Doc objects as patterns instead of token descriptions. It calls into the regular Matcher under the hood, but it should be significantly faster, because it doesn't have to check the individual token attributes at runtime. (When you add the patterns, it sets flags on the vocab marking whether a word can be part of a phrase you're looking for. It then only has to match on those.)

So, in summary, the Matcher is good if you need to write abstract token-based descriptions using various token attributes. The PhraseMatcher is good if you're mostly interested in matching exact words and phrases.
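
To make the contrast concrete, here's a minimal sketch of both (assuming spaCy v2's add signatures; the city names are just placeholders):

import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")

# Matcher: abstract token descriptions with attributes and operators.
matcher = Matcher(nlp.vocab)
matcher.add("CITY", None, [{"LOWER": {"IN": ["berlin", "hamburg"]}}])

# PhraseMatcher: exact phrases, passed in as Doc objects.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("CITY", None, *[nlp.make_doc(name) for name in ["Berlin", "New York"]])

doc = nlp("They flew from Hamburg to New York.")
print([doc[start:end].text for _, start, end in matcher(doc)])         # ['Hamburg']
print([doc[start:end].text for _, start, end in phrase_matcher(doc)])  # ['New York']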

Hi Ines,

sadly, I couldn't verify your explanation regarding the PhraseMatcher. Let's say I have a list of 5,000 first names to bootstrap my person entity recognition. A naive rule for the TokenMatcher would be something like

pattern = [{"ORTH": {"IN": forename_list}, "OP": "+"}, {"IS_TITLE": True}]

This works fine, but I also tried the PhraseMatcher approach, adding each name as its own string pattern:

for name in forename_list:
    rules.append({"label": "FORENAME", "pattern": name})

Additionally, one would have to add a second EntityRuler step to merge the FORENAME entity with the surname. A timing comparison over the same document set shows that the latter approach is approximately 15% slower, even without this second step!

Am I missing something here, or is there a different way to pass several items to the PhraseMatcher?

I actually wrote a very detailed post on this exact topic on Stack Overflow the other day that explains the token matcher vs. phrase matcher vs. entity ruler and what to consider when profiling them:

In your example, it also looks like what you're really after are abstract token patterns – matching with operators and token attributes is exactly what the token matcher is for, and you can't easily replicate that with plain string/Doc matching. The EntityRuler will also always be slower, because it writes to doc.ents. (Finally, if you're profiling things, always make sure you exclude the setup time or make that a separate step, as sketched below. The PhraseMatcher will always be slower to set up, because creating a Doc object is slower than instantiating a list of dicts.)
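
For example, here's a sketch of keeping the setup outside the timed part (the names and the list here are placeholders):

import timeit
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
forename_list = ["anna", "ben", "clara"]  # placeholder

# Setup: creating the pattern Docs is the expensive part, so do it once,
# outside the measurement.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("FORENAME", None, *[nlp.make_doc(name) for name in forename_list])

doc = nlp("Anna Smith met Ben Miller.")
# Only the matching itself is timed.
print(timeit.timeit(lambda: phrase_matcher(doc), number=1000))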

Thank you very much for your ultra-fast reply, Ines!
Sorry I missed your post on Stack Overflow; I had only checked the forum here.

I generate the toy model separately beforehand. Afterwards, I loop over approximately 25k of my documents, calling model(text) on each to get my entities. My timing measurements only cover this loop.

The EntityRuler is slower, but internally it uses both the PhraseMatcher and the TokenMatcher, depending on the patterns I feed the ruler. I wanted to compare the performance of these two as they're called internally.
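
As far as I understand it, the routing depends on the pattern type, roughly like this (a sketch, assuming spaCy v2's EntityRuler):

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
ruler.add_patterns([
    # A plain string is added as a phrase pattern (internal PhraseMatcher) ...
    {"label": "CITY", "pattern": "New York"},
    # ... while a list of token dicts becomes a token pattern (internal Matcher).
    {"label": "CITY", "pattern": [{"LOWER": "berlin"}]},
])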

Summary as a rule of thumb:
For larger (100+) terminology lists of single words, one will benefit from using a TokenMatcher pattern with the IN operator. This is especially true if the list needs to be embedded in an abstract pattern (like my person-name example).

For shorter lists, one should prefer adding the entries as individual rules. If the entries consist of multi-token phrases (and nothing abstract needs to be added), the PhraseMatcher should be considered.

...somewhat right?