Sorry if this was a little confusing in the video! The fourth argument of the command, after the dataset and the model, is the data source, i.e. the texts you're loading in. So the `train` in the command above is the path to the data we're loading in – for training, we've created a directory `/train` containing the data files. (In the video, you'll see that it's underlined because it points to a directory.) Here's a more explicit version of the command:
```
ner.teach drugs_ner en_core_web_lg /path/to/reddit/data --loader reddit --label DRUG --patterns drug_patterns.jsonl
```
The data loaded in is a portion of the Reddit Comments Corpus, which you can download for free. The built-in Reddit loader in Prodigy is available via `--loader reddit` and can take either a single `.bz2` archive (the format the corpus is shipped in) or a directory containing multiple archives, which are then loaded in order.
If you're following the video example, note that we've pre-processed the Reddit data and divided it into a training set, an evaluation set and a test set (to make sure we can actually evaluate the model properly). We've also extracted only comments from `/r/opiates`. However, if you want to try a similar approach with a broader category (like slang terms, technology companies or whatever else you come up with), you can just as easily stream in texts from all subreddits.
Hope this helps!