Match Patterns and Efficient Bootstrapping


My team and I are looking at labelling several thousand sentences using ner.manual.

In this annotation round, the labels occur quite infrequently in our dataset, so it will take some time to come across sentences that contain the entities we are labelling. We might have to go through several thousand sentences to label even a few hundred examples, which is obviously quite inefficient.

We have discussed the possibility of using match patterns to pre-label some of the entities we know are of interest to us. These would be a non-exhaustive list, but fairly comprehensive for some entity categories.

My question is: is it possible to use match patterns to "surface" sentences in a corpus during annotation, so that sentences that contain match patterns are presented to the annotator(s) sooner?

Many thanks for any advice on this.

If I were in your shoes I might do this offline with a Python script or a Jupyter notebook. Maybe a script that can take a big dataset and turn it into a smaller set of interesting candidates. I'd then start by feeding this set of interesting examples to Prodigy.

You can use the spaCy matchers in such a script, which is indeed something I've done a lot in the past. It's also worth noting that you can use any trick you like here, so feel free to use your own domain knowledge or a pre-trained model as well. If you're dealing with a large set (10K+) of interesting tokens that need to be matched then you might enjoy using flashtext.
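As a rough sketch of what such a filtering script could look like, here is one way to do it with spaCy's PhraseMatcher. It assumes spaCy v3, and the label, the terms and the function name find_candidates are made up for illustration, so swap in your own lists.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical label and terms, for illustration only.
TERMS = {"DRUG": ["ibuprofen", "acetylsalicylic acid"]}

# A blank pipeline only tokenizes, so no trained model is required.
nlp = spacy.blank("en")

# attr="LOWER" compares lowercased tokens, which makes matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for label, phrases in TERMS.items():
    matcher.add(label, [nlp.make_doc(phrase) for phrase in phrases])


def find_candidates(texts):
    """Yield a task dict for every text that contains at least one match."""
    for doc in nlp.pipe(texts):
        matches = matcher(doc)
        if matches:
            yield {"text": doc.text, "meta": {"n_matches": len(matches)}}
```

Each yielded dict has a "text" key, so writing them out one JSON object per line gives a JSONL file that can be used as the input source for ner.manual.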

One thing to be careful with, now that I think of it, is that you are technically feeding a biased dataset to Prodigy when you're doing this. So I would also annotate some of the sentences that don't have the entities, just for the sake of balance.
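One simple way to get that balance is to mix a random sample of the non-matching sentences back into the matched set before annotating. This is only a sketch: the function name mix_in_negatives and the 0.2 ratio are arbitrary assumptions, not a recommendation.

```python
import random


def mix_in_negatives(matched, unmatched, negative_fraction=0.2, seed=0):
    """Return the matched items plus a random sample of unmatched ones, shuffled.

    negative_fraction is the number of unmatched items to add, expressed
    relative to the number of matched items (0.2 is an arbitrary choice).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_negatives = min(len(unmatched), round(len(matched) * negative_fraction))
    mixed = list(matched) + rng.sample(unmatched, n_negatives)
    rng.shuffle(mixed)  # avoid showing all matched examples first
    return mixed
```

Both arguments are expected to be lists, for example the sentences that did and did not match your patterns.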

Hi Vincent,

Thank you very much for this advice. I will give it a go - FlashText looks like a very useful package for this.

