Prodigy patterns not behaving like spaCy patterns

I’m trying to recognize a pattern with a Person followed by age-indicative language. For example, in the following sentence:

Henry Joseph Fitzsimons, aged 48, of no fixed abode, but originally from the Andersonstown area of west Belfast, applied to have his daily reporting conditions to police lifted.

It would find “Henry Joseph Fitzsimons, aged 48”. I’m planning to use this to train text classification. When I test the following spaCy pattern using the Matcher Explorer, it works:

pattern = [{'ENT_TYPE': 'PERSON'},
           {'IS_PUNCT': True, 'OP': '?'},
           {'LEMMA': 'age', 'OP': '?'},
           {'IS_DIGIT': True, 'OP': '+'}]

However, when I try using the equivalent pattern in the patterns file for textcat.teach, it doesn’t recognize the same phrase:

[{"ENT_TYPE": "PERSON"}, {"IS_PUNCT": true, "OP": "?"}, {"LEMMA": "age", "OP": "?"}, {"IS_DIGIT": true, "OP": "+"}]

Hi! That’s definitely a bit strange – because under the hood, Prodigy directly calls into spaCy’s matcher.

Could you share the full command you ran? And could you test your patterns and texts with ner.match, just to double-check? The thing with textcat.teach is that it’s an active learning recipe, and the scores of the examples decide whether you see a suggestion or not. This also includes the pattern suggestions – so it’s theoretically possible that Prodigy decides to skip a pattern match if it’s scored low based on your previous decisions, and focuses on a different example instead.

Wow, you’re fast! Sure – here’s the command:

$ prodigy textcat.teach textcat_age en_core_web_lg ~/data/training_data/prodigy/AcquireMedia_archive_20170410.jsonl --patterns ~/data/training_data/prodigy/patterns/textcat_age.jsonl --label PERSON_AGE

Curious thing is, when I run ner.teach with the same arguments, the matcher throws an exception:

File "cython_src/prodigy/models/matcher.pyx", line 211, in prodigy.models.matcher.PatternMatcher.from_disk
File "cython_src/prodigy/models/matcher.pyx", line 136, in prodigy.models.matcher.PatternMatcher.add_patterns
File "cython_src/prodigy/models/matcher.pyx", line 60, in prodigy.models.matcher.create_matchers
File "cython_src/prodigy/models/matcher.pyx", line 29, in prodigy.models.matcher.parse_patterns
ValueError: Invalid pattern: [{'ENT_TYPE': 'PERSON'}, {'IS_PUNCT': True, 'OP': '?'}, {'LEMMA': 'age', 'OP': '?'}, {'IS_DIGIT': True, 'OP': '+'}]

How does your textcat_age.jsonl file look? Are you passing in one dict with "label" and "pattern" per line? That error is usually raised if there’s no "pattern" or no "label" key in the pattern entry. So a line in the patterns file should look like this:

{"label": "PERSON_AGE", "pattern": [{"ENT_TYPE": "PERSON"}, {"IS_PUNCT": true, "OP": "?"}, {"LEMMA": "age", "OP": "?"}, {"IS_DIGIT": true, "OP": "+"}]}

However, if that was the problem, I’m confused why textcat.teach didn’t raise the same error :thinking:

Btw, to test the patterns without any active learning, you want to be using the ner.match recipe! This will just stream in the texts and show you all matches in order.
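By the way, to rule out formatting problems early, a quick sanity check over the patterns file can also help. This is just a sketch in plain Python – the example lines are invented, and the exact error Prodigy raises may differ:

```python
import json

def validate_patterns(lines):
    """Check that each JSONL line has the shape Prodigy expects:
    one JSON dict per line, with a "label" and a "pattern" key."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            entry = json.loads(line)
        except ValueError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if not isinstance(entry, dict) or "label" not in entry or "pattern" not in entry:
            errors.append(f"line {i}: missing 'label' or 'pattern' key")
    return errors

good = '{"label": "PERSON_AGE", "pattern": [{"ENT_TYPE": "PERSON"}]}'
bad = '[{"ENT_TYPE": "PERSON"}]'  # a bare pattern list, without label/pattern keys
errors = validate_patterns([good, bad])
```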

Magic! This is what it used to look like:

[{"ENT_TYPE": "PERSON"}, {"IS_PUNCT": true, "OP": "?"}, {"LEMMA": "age", "OP": "?"}, {"IS_DIGIT": true, "OP": "+"}]

And this is what it looks like now:

{"label": "PERSON_AGE", "pattern": [{"ENT_TYPE": "PERSON"}, {"IS_PUNCT": true, "OP": "?"}, {"LEMMA": "age", "OP": "?"}, {"IS_DIGIT": true, "OP": "+"}]}

(You had left out the square brackets around the list of dicts for the pattern).

The matcher is only picking up the last name of the PERSON entity – not sure why, but I can debug it on my own. See attached screenshot.

Glad it worked – and yes, that was a typo. I edited my example accordingly.

Well, this is probably because you’re only looking for one token with the entity type PERSON. So even if the model correctly predicts the multi-token span, your pattern will only capture the last token. And I guess there’ll always be some error margin, because you’re relying on statistical predictions of the entity type – even if your model gets 90% of the relevant person entities correct, every 10th one will be a missed match.
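To illustrate, here’s a minimal sketch using the current spaCy API with a blank pipeline – the PERSON entity is set manually to stand in for the model’s prediction:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # no model needed for this demo
doc = nlp("Henry Joseph Fitzsimons, aged 48")
# Stand in for the model's prediction by setting the entity manually
doc.ents = [Span(doc, 0, 3, label="PERSON")]  # "Henry Joseph Fitzsimons"

matcher = Matcher(nlp.vocab)
# One pattern token corresponds to exactly one document token, so this
# can only ever cover a single token of the multi-token PERSON span
matcher.add("ONE_TOKEN", [[{"ENT_TYPE": "PERSON"}]])
# With the + operator, the pattern token can expand over the whole entity
matcher.add("ONE_OR_MORE", [[{"ENT_TYPE": "PERSON", "OP": "+"}]])

# Keep the longest match text per pattern name
longest = {}
for match_id, start, end in matcher(doc):
    name = nlp.vocab.strings[match_id]
    text = doc[start:end].text
    if len(text) > len(longest.get(name, "")):
        longest[name] = text
```

The single-token pattern can only ever highlight one token of the name, while the version with `"OP": "+"` covers the full entity span.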

I don’t think so… it seems to be some weirdness in the way patterns are processed. I was playing around with Matcher Explorer, and noticed that while the original pattern only captures the last name with the PERSON entity:

When I remove the final token – {"IS_DIGIT": true, "OP": "+"} – or replace the "+" with a "?", it captures the entire name:

However, when I use the same approach in the patterns file textcat_age.jsonl:

{"label": "PERSON_AGE", "pattern": [{"ENT_TYPE": "PERSON"}, {"IS_PUNCT": true, "OP": "?"}, {"LEMMA": "age", "OP": "?"}, {"IS_DIGIT": true, "OP": "?"}]}

It appears to grab the first name, and not the last name or the age:

This is very confusing… maybe I’m missing something, but I don’t understand why the second pattern would match more than one token of the PERSON entity. I kinda suspect that it might be related to some bugs in the matcher engine (which we fixed for v2.1) and the way operators behaved.

The expected result is definitely one single token with the entity type PERSON. I guess you could add the operator "OP": "+" to match one or more tokens like that.

Good morning. I’ve been experimenting with variations in the patterns file, rather than using the Matcher Explorer, since they seem to behave differently in some cases. I’ve uncovered a number of issues. I’m wondering if there’s a better place to log bugs than here.

At any rate:

  1. Looking for a literal comma by including {"TEXT": ","} in a pattern causes an exception:

  2. CARDINAL is not processed correctly in some cases, only finding textual representations of numbers rather than the numbers themselves. This pattern:

Matches this:


But not this:


This pattern seems to work fairly well:

I’ll keep experimenting. BTW, is version 2.1 available? Based on the name of my wheel file (which I downloaded only a week or two ago), I’m on 1.7.1.

A bit more weirdness when using the pattern above. It sometimes (but not always) fails to match the pattern. In the screenshot below, note that “Hailton Vazcontreiras, 26” is matched, but not “Wilker Amaro, 22”:


And in this one, “John Mitchell, 21” is matched, but it subsequently skips over “Macauley Lawless, 21”:


And it sometimes matches whitespace to one of the PERSON entries:

And doesn’t always grab an entire name:


Thanks for the detailed reports! I think for some of these cases, it’d be easier if you could test them with spaCy directly and verify that the matching works there. Prodigy is currently using spaCy v2.0.x, which did have some bugs in complex cases with operators (as mentioned in my previous post). So if spaCy doesn’t match your pattern, then Prodigy can’t match it either.
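On the literal-comma exception: if I remember correctly, the TEXT attribute was only introduced in spaCy v2.1, so under v2.0 (which Prodigy currently bundles) ORTH is the spelling to use. A minimal sketch with the current spaCy API:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# ORTH matches the exact token text and works in spaCy v2.0;
# TEXT is the equivalent attribute that only arrived in v2.1
matcher.add("COMMA_AGE", [[{"ORTH": ","}, {"IS_DIGIT": True}]])

doc = nlp("Wilker Amaro, 22")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
```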

For the patterns that rely on ENT_TYPE: Did you verify that the model actually predicts those entities correctly? For instance, if your pattern checks for ORDINAL and the model doesn’t actually predict that label, the pattern won’t match either.

Also, keep in mind that ner.match will only ever show you one match at a time. It’s a binary recipe, so you’ll only ever be giving feedback on one span at a time. So it’s expected that each question only has one highlight, and you’ll be stepping through the matches.

We had a similar problem annotating legal texts: spans containing paragraph signs, numbers, letters and punctuation returned results under spaCy 2.1.
We wanted to improve this with Prodigy, but with spaCy 2.0 the results are unfortunately poor, so we’re now waiting for the Prodigy upgrade. Hence my question: is there a release date in the near future?

Yes, the Matcher engine for spaCy v2.1 has been completely rewritten to support additional logic and fix a range of inconsistencies with the operators. So it’s expected that the results can potentially be quite different. This is also part of the reason we’re not taking the Prodigy update to spaCy v2.1 lightly – every user will have to retrain all of their models and potentially review match patterns and other custom logic. We’re hoping to have the stable version out by the end of the week – if you want to test the beta wheel, you can also send us an email :slightly_smiling_face:

What recipes are you using to annotate in Prodigy? If you’re not relying on a very tight coupling of the patterns to a model in the loop, you can also always pre-process your text with spaCy v2.1 (tokenize the text and add spans for the patterns) and then load that into a recipe like ner.manual?
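A pre-processing script could look roughly like this – the pattern here is a model-free stand-in (with a v2.1 model you could match on ENT_TYPE the same way), and the output is Prodigy’s task format with "text" and "spans":

```python
import json

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Model-free stand-in pattern: word, comma, number
matcher.add("PERSON_AGE", [[{"IS_ALPHA": True}, {"ORTH": ","}, {"IS_DIGIT": True}]])

def preannotate(texts, label="PERSON_AGE"):
    """Turn raw texts into Prodigy-style tasks with pre-highlighted spans."""
    for doc in nlp.pipe(texts):
        spans = [{"start": doc[start:end].start_char,
                  "end": doc[start:end].end_char,
                  "label": label}
                 for _, start, end in matcher(doc)]
        yield {"text": doc.text, "spans": spans}

# One JSON dict per line, ready to write out as a .jsonl file
lines = [json.dumps(task) for task in preannotate(["Wilker Amaro, 22 was arrested."])]
```

Each line of the resulting JSONL could then be loaded into a recipe like ner.manual for correction.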

Hi Ines,
Nice to hear that it’s moving along! I don’t want to start with a beta if a working version is expected by the end of this week or close to it.

I trust in your serious and solid work, and I know it’s hard to develop a product when there are customers with real requirements on the other side. But that’s the way it is.

In this context, a question about corpus management (the annotation manager / Scale): it’s been a little quiet on that front. Is there any progress with the beta?

@ronaldoviber Thanks, I appreciate that! We actually briefly considered delaying the stable spaCy v2.1 release until Prodigy was ready, but we quickly discarded that plan because it seemed pointless and counterproductive. The stable version has now been out for a little under two months, which gave us enough time to fix remaining bugs and make sure all new features work as expected. (Yeah, we had lots of alphas before and some amazing alpha testers, but I can also understand that many serious production users prefer to wait for the stable release. I mean, I don’t always test alphas either :sweat_smile:)

You can read more about Prodigy Scale and the beta in this thread. The corpus manager is kind of a separate project and will be open-source – but it currently has a lower priority for us (because our main focus is still on Prodigy, Scale and spaCy).

I know the Scale thread, thank you – that’s why I asked. Matthew had written that he uses Scale himself, so I thought there might already be something to use.
My problem, besides access rights, is managing the files in the backend – not an issue in a development environment, but it needs an answer for users. Our file management under Python is currently handled via Django, so if something comes from you in the near future, there’s no need to build custom file handling here. That was the reason for the question.
I know Andy Halterman’s scripts – can the current corpus manager use anything from them? If you open-source it, we could get involved or fork it, and other users would benefit too, because this concerns everyone who doesn’t work alone but in an application stack.

Yes, I did test that the ENT_TYPEs were matching correctly – particularly for ORDINAL. It works fine all by itself, but not when used within the pattern.

Yes, I know – I wasn’t expecting the matcher to show more than one match at a time. The point I was making is that some expected matches (“Wilker Amaro, 22”, “Macauley Lawless, 21”) never match – the matcher skips over them entirely.

BTW, I checked, and spaCy recognizes the numeric ages as dates for some reason. So in “Wilker Amaro, 22”, “Wilker Amaro” is tagged correctly as a PERSON, but “22” is tagged as a DATE. That’s why using CARDINAL never worked. Using IS_DIGIT does work, but still shows the skipping problem described above.

I’m now getting errors using terms.teach, but I’ll start a separate thread for that.

This is one of the downsides of relying on statistical predictions in match patterns – you kinda lose the reliability of rule-based matching, because the results now depend on what the model happens to predict for a particular example. So for things like numbers, IS_DIGIT or LIKE_NUM is definitely the more reliable option. In spaCy v2.1+, you can also use the IN operator (for set membership) or REGEX to describe the token using a regular expression.
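For instance (a sketch with a blank English pipeline, no model involved):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# LIKE_NUM also covers spelled-out numbers such as "twenty";
# IS_DIGIT only matches tokens consisting of digit characters
matcher.add("AGE_NUM", [[{"LIKE_NUM": True}]])
# The IN operator (spaCy v2.1+) checks set membership on an attribute
matcher.add("AGE_WORD", [[{"LOWER": {"IN": ["aged", "age"]}}]])

doc = nlp("Hamilton, aged twenty, and Amaro, 22")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
```

Both "twenty" and "22" are picked up by LIKE_NUM, without relying on what the model predicts for the span.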

Do you have a small reproducible example that shows the "skipping" behaviour you mean?

Have you upgraded to v1.8.2? If you were seeing warnings about empty vectors that slowed down the startup, that might be the problem. See here for details: