Hi,
Thanks for Prodigy and spaCy! I regret that we didn't find these awesome tools a long time ago, but better late than never.
We are currently experimenting with Prodigy on a sample use case: detecting security-relevant comments on Reddit. Security-relevant comments include phrases/words like "SQL injection", "buffer overflow", "security vulnerability", "privilege escalation", "XSS", "remote code execution" etc.
Question:
What is the general advice on collecting a representative sample of comments to use with the textcat.teach Prodigy recipe when building an annotated dataset? For our use case, should we select:
- A random sample of Reddit comments, without considering whether the comments are security-relevant or not?
- A sample of Reddit comments containing X% security-relevant comments and Y% non-security-relevant comments (e.g. a 50-50 split)?
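To make the two options concrete, here is a minimal sketch of how I imagine building each sample. Everything in it is an assumption for illustration: the helper names (`is_candidate`, `random_sample`, `mixed_sample`) are hypothetical, the keyword list is just the phrases mentioned above, and a keyword match is only a crude proxy for "security-relevant". It is not part of Prodigy.

```python
import random

# Hypothetical keyword list, taken from the example phrases in this post.
SECURITY_TERMS = [
    "sql injection", "buffer overflow", "security vulnerability",
    "privilege escalation", "xss", "remote code execution",
]

def is_candidate(text):
    """Crude keyword match; only a rough proxy for 'security-relevant'."""
    lowered = text.lower()
    return any(term in lowered for term in SECURITY_TERMS)

def random_sample(comments, n, seed=0):
    """Option 1: uniform random sample, ignoring relevance."""
    rng = random.Random(seed)
    return rng.sample(comments, min(n, len(comments)))

def mixed_sample(comments, n, relevant_share=0.5, seed=0):
    """Option 2: about relevant_share keyword matches, the rest non-matches."""
    rng = random.Random(seed)
    hits = [c for c in comments if is_candidate(c)]
    misses = [c for c in comments if not is_candidate(c)]
    # Cap each part by what is actually available in the pool.
    n_hits = min(int(n * relevant_share), len(hits))
    n_misses = min(n - n_hits, len(misses))
    sample = rng.sample(hits, n_hits) + rng.sample(misses, n_misses)
    rng.shuffle(sample)  # avoid presenting all matches first
    return sample
```

With option 2, the comments the annotator sees would be biased toward whatever the keyword list catches, which is part of why I am asking which approach is preferable.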
Background:
I watched this video (by Ines) where she trained an NER model to find mentions of food in Reddit comments.
However, I could not figure out how she pre-selected the sample of 10K Reddit comments used in the textcat.teach recipe. @ines Can you please shed some light on how you pre-selected the 10K sample of Reddit comments? Do you think your choice of pre-selection may have an impact on the quality of the annotated dataset that you get using Prodigy?