Use patterns.jsonl to automatically annotate entire dataset

Hi there,

I am currently using spans.manual to bootstrap a huge patterns.jsonl file, which later allows me to label my entire corpus of texts. I am wondering now: how do I use the patterns.jsonl to derive text labels without clicking through my 30k+ examples in Prodigy? I am thinking that I could somehow use the pattern file in spaCy, but my forum and Stack Overflow searches haven't yielded a satisfactory result/approach so far...

I am grateful for any kind of advice! :slight_smile:

Hi @simonschoe,

thank you for your question.
You can use Prodigy's PatternMatcher to load your patterns.jsonl and match your texts against your provided patterns (see: https://prodi.gy/docs/api-components#patternmatcher).
One way to do this is with the following example code:

import spacy
from prodigy.components.loaders import JSONL
from prodigy.models.matcher import PatternMatcher

nlp = spacy.load("en_core_web_lg")
# load the patterns from disk into the matcher
matcher = PatternMatcher(nlp).from_disk("./patterns.jsonl")
# stream in the raw texts and run the matcher over them
stream = JSONL("./path_to_your_texts_saved_as_jsonl")
stream = matcher(stream)

However, please be aware that matcher(stream) is a generator that yields (score, example) tuples, and only for those texts where the matcher was able to match a pattern.
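To make the shape of that output concrete, here is a minimal, self-contained sketch of how such a stream can be consumed. It uses a dummy generator standing in for matcher(stream) (the texts, labels, and offsets are invented for illustration); the real matcher yields the same (score, example) tuples:

```python
# Dummy stand-in for matcher(stream): the real PatternMatcher yields
# (score, example) tuples, where example is a dict with "text" and "spans".
def fake_matched_stream():
    yield (0.5, {"text": "our cfo resigned",
                 "spans": [{"start": 4, "end": 7, "label": "ROLE"}]})
    yield (0.5, {"text": "the chief executive officer",
                 "spans": [{"start": 4, "end": 27, "label": "ROLE"}]})

# Collect the matched label(s) per text by unpacking each tuple.
labeled = []
for score, example in fake_matched_stream():
    labels = sorted({span["label"] for span in example.get("spans", [])})
    labeled.append((example["text"], labels))

print(labeled)
```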

I hope this answers your question, please let me know if you have any further questions.

1 Like

Works like a charm, thanks for pointing me to the component section!

However, please be aware that matcher(stream) is a generator that yields (score, example) tuples, and only for those texts where the matcher was able to match a pattern.

Fortunately, thanks to the new all_examples option, I was also able to retrieve examples without any match:

PatternMatcher(nlp, combine_matches=True, all_examples=True)
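For anyone landing here later: with all_examples=True, texts without a match come through as well, just without suggested spans, so matched and unmatched examples can be separated downstream. A small sketch of that split (again with a dummy generator and invented examples standing in for the matcher output):

```python
# Dummy stand-in for PatternMatcher(..., all_examples=True)(stream):
# unmatched examples are passed through without suggested spans.
def fake_all_examples_stream():
    yield (0.5, {"text": "the cfo spoke",
                 "spans": [{"start": 4, "end": 7, "label": "ROLE"}]})
    yield (0.0, {"text": "nothing to see here", "spans": []})

matched, unmatched = [], []
for score, example in fake_all_examples_stream():
    (matched if example.get("spans") else unmatched).append(example["text"])

print(matched, unmatched)
```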

One (last) quick follow-up: is there any clever way to a) remove punctuation before matching within Prodigy and b) make whitespace between tokens optional? With regard to the former, I expect you'd recommend preprocessing a priori?

Thanks for your support! :slight_smile:

Hello @simonschoe,

I had forgotten about the all_examples option. Sorry about that, and I am glad that you found it :slight_smile:

Regarding your two questions:

remove punctuation before matching within prodigy

What kind of punctuation do you want to remove: all of it, or just some specific characters? In general, I'd be careful about removing punctuation after you have created your patterns. You should always run the same normalization logic in all of your steps.
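If you do strip punctuation, the safest place is while building the input JSONL, applying exactly the same normalization you used when writing the patterns. A minimal sketch (the normalize function and the character choice are assumptions for illustration, not a Prodigy feature):

```python
import json

def normalize(text):
    # assumed normalization: drop hyphens, collapse repeated whitespace
    return " ".join(text.replace("-", " ").split())

texts = ["-cfo", "chief executive -officer"]
# one JSONL line per text, normalized the same way the patterns were
lines = [json.dumps({"text": normalize(t)}) for t in texts]
print(lines)
```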

make whitespaces between tokens optional

You can make whitespace in your patterns optional using quantifiers: {"is_space": true, "op": "*"}. A complete pattern using this could look like:

{"label": "LABEL", "pattern": [{"lower": "text1"}, {"is_space": true, "op": "*"}, {"lower": "text2"}]} 

This would match "text1 text2" as well as "text1   text2" (with extra spaces in between). However, since the PatternMatcher works on tokens, this pattern won't catch "text1text2". But I am unsure whether you need to cover those cases too?
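You can check this behavior with spaCy's token Matcher directly, without Prodigy; a blank English pipeline is enough, since only tokenization matters here:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # blank pipeline: only the tokenizer is needed
matcher = Matcher(nlp.vocab)
# same idea as the pattern above: optional whitespace tokens between the words
matcher.add("LABEL", [[{"LOWER": "text1"},
                       {"IS_SPACE": True, "OP": "*"},
                       {"LOWER": "text2"}]])

for text in ["text1 text2", "text1   text2", "text1text2"]:
    doc = nlp(text)
    print(text, "->", len(matcher(doc)), "match(es)")
```

The runs of extra spaces become their own whitespace tokens, which the `{"IS_SPACE": True, "OP": "*"}` element absorbs, while "text1text2" stays a single token no pattern element matches.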

1 Like

What kind of punctuation do you want to remove, all or just some specific characters?

Currently, I am facing lots of cases like these:

-cfo
chief executive -officer

where the hyphen is considered part of the tokens -cfo and -officer, so that my string patterns won't match. Of course, I could add an optional hyphen to all my patterns (or rather tokens), but I find this overly complicated. In particular, none of the patterns I designed accounts for a hyphen in any other way, so I thought I could just as easily get rid of all hyphens beforehand to avoid the issue. And of course I could do that before creating my .jsonl file as input for Prodigy, but I thought maybe there is a more elegant way for Prodigy to handle that automatically. I hope my use case became a little clearer. :slight_smile:

This would match "text1 text2" as well as "text1   text2". However, since the PatternMatcher works on tokens, this pattern won't catch "text1text2". But I am unsure whether you need to cover those cases too?

In fact, I was particularly interested in the second case, i.e., being able to match "text1 text2" as well as "text1text2". But yes, given that the matcher operates on individual tokens, I suspected it would not be feasible to cover this use case.

Again, thanks for your help, I believe I found a satisfactory solution!

1 Like

Hi @simonschoe,

happy to hear that you found a solution.

Of course, I could add an optional hyphen to all my patterns (respectively tokens), but I find this overly complicated.

Indeed, I consider it overly complicated too.
If you want preceding hyphens to be recognized as separate tokens, you could write your own tokenizer or adjust spaCy's tokenizer slightly. You can load this tokenizer into your nlp pipeline, see: https://spacy.io/usage/linguistic-features#native-tokenizers.
Using my code above and such a tokenizer (slightly changed compared to the example in the spaCy docs), I am able to successfully match texts where a word has a preceding hyphen:

import spacy
from prodigy.components.loaders import JSONL
from prodigy.models.matcher import PatternMatcher
from spacy.tokenizer import Tokenizer
import re

# treat a leading hyphen as a token prefix, so it is split off
prefix_re = re.compile(r'''^[-]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)

matcher = PatternMatcher(nlp).from_disk("./patterns.jsonl")
stream = JSONL("path_to_your_jsonl")
stream = matcher(stream)

Maybe this is a good solution for your problem? However, you may still have to adjust some patterns in your patterns.jsonl.
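As a quick sanity check of the adjusted tokenizer (independent of Prodigy; a blank pipeline suffices because only tokenization is being tested):

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
prefix_re = re.compile(r"^[-]")

# leading hyphens are now split off as their own tokens
nlp.tokenizer = Tokenizer(nlp.vocab, prefix_search=prefix_re.search)

print([t.text for t in nlp("-cfo")])
print([t.text for t in nlp("chief executive -officer")])
```

Note that replacing nlp.tokenizer with a bare Tokenizer like this also drops spaCy's default prefix/suffix/infix rules and exceptions, which is fine for a quick check but worth keeping in mind for real pipelines.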

In fact, I was particularly interested in the second case

Sorry to hear that. Sadly, I currently cannot come up with a satisfactory solution. Of course, you could add extra patterns to catch those cases too, but depending on your patterns and how many you use, this might be too much overhead. Or you could preprocess these words to be separated, which might also be overhead, depending on your input data.

@Jette16 Thanks for the tokenizer tip, that is indeed a very elegant way to solve this! Regarding the optional whitespace issue, I guess I'll just accept that some typos may slip through. Since I presume them to be rare, the cost of accounting for all possible variations is simply not worth it.

Thanks for supporting me throughout!

Best
Simon

1 Like