Prodigy textcat.teach data collection advice

Hi,

Thanks for Prodigy and spaCy! I wish we had found these awesome tools a long time ago, but better late than never :slight_smile:

We are currently experimenting with Prodigy on a sample use case: detecting security-relevant comments on reddit. Security-relevant comments include phrases/words like "SQL injection", "buffer overflow", "security vulnerability", "privilege escalation", "XSS", "remote code execution" etc.

Question:
What is the general advice on collecting a representative sample of comments to use with the Prodigy textcat.teach recipe for building an annotated dataset? Considering our use case, should we select:

  1. A random sample of reddit comments without considering whether the comments are security-relevant or not?
  2. A sample of reddit comments containing X% security-relevant comments and Y% non-security-relevant comments (e.g. a 50-50 split)?

Background:
I watched this video (by Ines) where she trained a NER model to find mentions of food in reddit comments.
However, I could not figure out how she pre-selected the sample of 10K reddit comments used in the textcat.teach recipe. @ines Can you please shed some light on how you preselected the 10K sample of reddit comments? Do you think your choice of (pre-)selection may have an impact on the quality of the annotated dataset you get from Prodigy?

Needle-in-a-haystack class imbalance is definitely the biggest problem that makes text classification difficult in practice. When the positive class is a rare event like this, you'll always struggle to get a high density of positive cases, so you'll typically want to bias the collection sample as much as possible towards the comments you're interested in. When it comes to an evaluation sample, you'll want to consider the end-to-end system context. You probably won't want to run the model over literally all of reddit all the time, so if you have some sort of pre-filter, you can use that.

If you do want to run it over all of reddit, you can maybe do an evaluation in two parts: recall over the true positives in your evaluation set, and then the number of false positives in a daily snapshot, or something like that.
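To make that two-part evaluation concrete, here's a minimal sketch. It assumes a trained spaCy text classifier saved to `./security_textcat` with a `SECURITY` label and a 0.5 decision threshold; the model path, label name, and placeholder texts are all just examples, not anything from the original setup.

```python
import spacy

# Assumed: a trained spaCy pipeline with a textcat component, saved to this
# path, that scores a hypothetical "SECURITY" label between 0 and 1.
nlp = spacy.load("./security_textcat")
THRESHOLD = 0.5

def is_positive(text):
    doc = nlp(text)
    return doc.cats.get("SECURITY", 0.0) >= THRESHOLD

# Part 1: recall over the known true positives in your evaluation set.
true_positive_texts = [
    "This looks like a classic SQL injection vulnerability.",
    "The buffer overflow allows remote code execution.",
]  # replace with your annotated security-relevant comments

hits = sum(is_positive(text) for text in true_positive_texts)
recall = hits / len(true_positive_texts)

# Part 2: how many comments get flagged in an unlabeled daily snapshot.
# This bounds the daily false-positive volume you'd have to review.
daily_snapshot = [
    "What's the best pizza place in Berlin?",
    "You should rotate your credentials after that privilege escalation bug.",
]  # replace with all comments from one day

flagged = [text for text in daily_snapshot if is_positive(text)]

print(f"Recall on known positives: {recall:.2f}")
print(f"Comments flagged in daily snapshot: {len(flagged)}")
```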

The logic of trying to collect as many true-positive examples as possible during annotation is that you'll always be able to collect a sample with arbitrarily few true examples in it, just by collecting texts that have a low prior probability of being the interesting case (e.g. by collecting from a different subreddit). So your annotation efficiency will go up a lot if you can somehow get yourselves samples with a higher density of the class you want. You can then train models that work as better filters to select the next round of examples.
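As a rough illustration of biasing the collection sample, here's a minimal keyword pre-filter sketch. The file names, the seed terms, and the assumption that each input line is a JSON object with a "text" field are all hypothetical; the point is just to produce a JSONL file with a higher density of likely positives before annotation.

```python
import json

# Hypothetical seed terms for security-relevant comments.
SEED_TERMS = [
    "sql injection", "buffer overflow", "security vulnerability",
    "privilege escalation", "xss", "remote code execution",
]

def looks_relevant(text):
    lowered = text.lower()
    return any(term in lowered for term in SEED_TERMS)

# Assumed input: one JSON object per line with a "text" field, which is the
# JSONL shape Prodigy reads as a source.
with open("reddit_comments.jsonl", encoding="utf8") as infile, \
     open("security_sample.jsonl", "w", encoding="utf8") as outfile:
    for line in infile:
        record = json.loads(line)
        if looks_relevant(record["text"]):
            outfile.write(json.dumps(record) + "\n")
```

The filtered file can then be used as the source for textcat.teach, and the same seed terms can double as match patterns via the recipe's --patterns option (if that fits your Prodigy version) so likely matches get surfaced first.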