Prodigy textcat.teach data collection advice


Thanks for Prodigy and spaCy! I only regret that we didn't find these awesome tools a long time ago, but better late than never :slight_smile:

We are currently experimenting with Prodigy on a sample use case: detecting security-relevant comments on reddit. Security-relevant comments include phrases/words like "SQL injection", "buffer overflow", "security vulnerability", "privilege escalation", "XSS", "remote code execution", etc.

What is the general advice on collecting a representative data sample of comments to use with the textcat.teach recipe for building an annotated dataset? Considering my use case, should we select:

  1. A random sample of reddit comments without considering whether the comments are security-relevant or not?
  2. A sample of reddit comments containing X% security-relevant comments and Y% non-security-relevant comments (e.g. a 50-50 split)?

I watched this video (by Ines) where she trained a NER model to find mentions of food inside reddit comments.
However, I could not figure out how she pre-selected the sample of 10K reddit comments used in the textcat.teach recipe. @ines Can you please shed some light on how you preselected the 10K sample of reddit comments? Do you think your choice of (pre-)selection may have an impact on the quality of the annotated dataset that you get using Prodigy?

Needle-in-a-haystack class imbalance is definitely the biggest problem that makes text classification difficult in practice. When the positive class is a rare event like this, you'll always struggle to get a high density of positive cases, so you'll typically want to bias the collection sample as much as possible towards the comments you're interested in. When it comes to an evaluation sample, you'll want to consider the end-to-end system context. You probably won't want to run the model over literally all of reddit all the time, so if the system has some sort of pre-filter, you should evaluate with that pre-filter in place.
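For instance, a simple way to bias the collection sample is to pre-filter with seed phrases before feeding comments into textcat.teach. This is just an illustrative sketch — the seed list and example comments are made up, and a real pipeline would read from your reddit dump:

```python
import re

# Hypothetical seed terms for the security-relevant class.
SEED_TERMS = [
    "sql injection", "buffer overflow", "security vulnerability",
    "privilege escalation", "xss", "remote code execution",
]
# One case-insensitive pattern matching any seed term.
SEED_PATTERN = re.compile(
    "|".join(re.escape(term) for term in SEED_TERMS), re.IGNORECASE
)

def likely_relevant(comment: str) -> bool:
    """True if the comment mentions any seed term."""
    return SEED_PATTERN.search(comment) is not None

# Toy comment pool standing in for a real reddit dump.
comments = [
    "Found a SQL injection in the login form",
    "What's your favourite pizza topping?",
    "This leads to remote code execution on the server",
]
filtered = [c for c in comments if likely_relevant(c)]
```

You'd then stream `filtered` into the recipe, giving the annotator a much higher density of positive cases than a random sample would.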

If you do want to run it over all of reddit, you can maybe do an evaluation in two parts: recall over the true positives in your evaluation set, and then the number of false positives in a daily snapshot, or something like that.
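That two-part evaluation could look something like this — `model`, the labelled positives, and the snapshot check are all hypothetical placeholders (in reality the snapshot's false positives would be counted by manual review):

```python
def recall_on_positives(model, true_positive_texts):
    """Part 1: recall over a hand-labelled set of known true positives."""
    hits = sum(1 for text in true_positive_texts if model(text))
    return hits / len(true_positive_texts)

def false_positives_in_snapshot(model, snapshot_texts, is_actually_relevant):
    """Part 2: count false positives the model flags in a daily snapshot.
    `is_actually_relevant` stands in for a manual review of flagged texts."""
    return sum(
        1 for text in snapshot_texts
        if model(text) and not is_actually_relevant(text)
    )

# Toy stand-in model: flags anything mentioning "overflow".
model = lambda text: "overflow" in text.lower()
positives = ["Classic buffer overflow bug", "Stack overflow exploit here"]
print(recall_on_positives(model, positives))  # 1.0
```

The point is that the two numbers answer different questions: the first tells you how many real cases you'd miss, the second tells you how much noise you'd generate per day at full scale.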

The logic of trying to collect as many true-positive examples as possible during annotation is that you'll always be able to collect a sample with arbitrarily few true examples in it, just by collecting texts with a low prior probability of belonging to the interesting class (e.g. by collecting from a different subreddit). So your annotation efficiency will go up a lot if you can somehow get yourselves samples with a higher density of the class you want. You can then train models that work as better filters to select the next round of examples.
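That bootstrapping loop can be sketched as follows. The `score` function here is a toy keyword scorer standing in for whatever model you've trained so far (e.g. a spaCy text classifier's category score); the threshold and keywords are assumptions for illustration:

```python
def score(comment: str) -> float:
    """Toy stand-in for a trained model's P(security-relevant)."""
    keywords = ("injection", "overflow", "xss", "escalation")
    hits = sum(1 for kw in keywords if kw in comment.lower())
    return min(1.0, hits / 2)

def select_next_batch(comments, threshold=0.5):
    """Keep comments the current model scores at or above the threshold,
    producing a denser sample for the next annotation round."""
    return [c for c in comments if score(c) >= threshold]

pool = [
    "XSS plus privilege escalation in one report",
    "My cat knocked over my coffee",
    "SQL injection leading to buffer overflow",
]
batch = select_next_batch(pool)
```

Each round you annotate the selected batch, retrain, and swap the improved model in as the filter — so the density of positives in each new batch keeps going up.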