Match Patterns and Efficient Bootstrapping

rory-hurley-gds · December 9, 2022, 10:08am

Hello,

My team and I are looking at labelling several thousand sentences using ner.manual.

In this annotation round, the labels occur quite infrequently in our dataset, so it will take some time to come across sentences that contain the entities we are labelling. We might have to go through several thousand sentences to label even a few hundred examples, which is obviously quite inefficient.

We have discussed the possibility of using match patterns to pre-label some of the entities we know are of interest to us. These would be a non-exhaustive list, but fairly comprehensive for some entity categories.

My question is: is it possible to use match patterns to "surface" sentences in a corpus during annotation, so that sentences containing a match are presented to the annotator(s) sooner?

Many thanks for any advice on this.

koaning · December 9, 2022, 2:16pm

If I were in your shoes I might do this offline with a Python script or a Jupyter notebook. Maybe a script called find_interest_candidates.py that can take a big dataset and turn it into a smaller set of interesting candidates. I'd then start by feeding this set of interesting examples to Prodigy.

You can use the spaCy matchers in such a script, which is indeed something I've done a lot in the past. It's also worth noting that you can use any trick you like at this stage, so feel free to use your own domain knowledge or a pre-trained model as well. If you're dealing with a large set (10K+) of interesting tokens that need to be matched then you might enjoy using flashtext.
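As a rough illustration of what such a script could look like, here is a minimal sketch using spaCy's PhraseMatcher on a blank English pipeline. The term lists, the function names and the "meta" field are all made up for the example; the only thing Prodigy needs from each record is the "text" key.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical terms of interest; swap in your own domain vocabulary.
TERMS = {
    "ORG": ["Cabinet Office", "HM Treasury"],
    "BENEFIT": ["Universal Credit"],
}


def build_matcher(nlp, terms):
    """Build a case-insensitive PhraseMatcher from a {label: [phrases]} dict."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    for label, phrases in terms.items():
        matcher.add(label, [nlp.make_doc(phrase) for phrase in phrases])
    return matcher


def find_candidates(texts, nlp, matcher):
    """Yield a task dict for every text that contains at least one match."""
    for doc in nlp.pipe(texts):
        matches = matcher(doc)
        if matches:
            # Record which labels fired, purely for later inspection.
            labels = sorted({nlp.vocab.strings[match_id] for match_id, _, _ in matches})
            yield {"text": doc.text, "meta": {"matched": labels}}


nlp = spacy.blank("en")  # tokenizer only, no trained model required
matcher = build_matcher(nlp, TERMS)

texts = [
    "I applied for universal credit last week.",
    "The weather was nice on Tuesday.",
    "HM Treasury published new guidance.",
]
candidates = list(find_candidates(texts, nlp, matcher))
```

Writing each candidate out as one JSON object per line (for example with json.dumps) gives a JSONL file that can be loaded as the source for ner.manual.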

One thing to be careful with, now that I think of it, is that you are technically feeding a biased dataset to Prodigy when you're doing this. So, for the sake of balance, I would also annotate some of the sentences that don't contain the entities.
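One way to do that, sketched below with the standard library only, is to mix a random sample of the unmatched sentences back into the matched set before annotating. The function name and the 20% ratio are arbitrary assumptions for illustration, not a recommendation.

```python
import random


def mix_in_unmatched(matched, unmatched, ratio=0.2, seed=0):
    """Return the matched items plus a random sample of unmatched ones.

    `ratio` is the number of unmatched items to add per matched item,
    capped at however many unmatched items are available.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_extra = min(len(unmatched), round(len(matched) * ratio))
    mixed = list(matched) + rng.sample(list(unmatched), n_extra)
    rng.shuffle(mixed)  # avoid showing all matched sentences first
    return mixed


matched = [f"matched sentence {i}" for i in range(10)]
unmatched = [f"unmatched sentence {i}" for i in range(50)]
mixed = mix_in_unmatched(matched, unmatched, ratio=0.2)
```

Shuffling means the annotator sees matched and unmatched sentences interleaved rather than in two blocks.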

rory-hurley-gds · December 13, 2022, 11:24am

Hi Vincent,

Thank you very much for this advice. I will give it a go - FlashText looks like a very useful package for this.

Thanks,
Rory
