Using Loaders

ines · February 25, 2018, 3:15pm

Sorry if this was a little confusing in the video! The argument that comes right after the dataset and the model is the data source, i.e. the texts you’re loading in. So the train in the command above is the path to the data we’re loading in. For training, we’ve created a directory /train containing the data files. (In the video, you’ll see that it’s underlined, because it points to a directory.) Here’s a more explicit version of the command:

prodigy ner.teach drugs_ner en_core_web_lg /path/to/reddit/data --loader reddit --label DRUG --patterns drug_patterns.jsonl

The data loaded in is a portion of the Reddit Comments Corpus, which you can download for free. The built-in Reddit loader in Prodigy is available via --loader reddit and can take either a single .bz2 archive (the format the corpus is shipped in), or a directory containing multiple archives, which are then loaded in order.
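To make the "single archive or directory of archives" behaviour more concrete, here is a rough sketch of what a loader like this does. It is not Prodigy's actual implementation: the function name iter_reddit_comments is made up for illustration, and it assumes each archive holds one JSON object per line with "body" and "subreddit" fields, which is how the comments corpus is usually distributed.

```python
import bz2
import json
from pathlib import Path


def iter_reddit_comments(source):
    """Yield {"text": ...} dicts from one .bz2 archive or a directory of them.

    Hypothetical helper for illustration, not part of Prodigy.
    """
    path = Path(source)
    # A directory is expanded to its archives, visited in sorted (file name) order.
    archives = sorted(path.glob("*.bz2")) if path.is_dir() else [path]
    for archive in archives:
        # Assumption: one JSON-encoded comment per line.
        with bz2.open(archive, "rt", encoding="utf8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                comment = json.loads(line)
                body = comment.get("body")
                # Skip empty and deleted comments, since there is nothing to annotate.
                if body and body != "[deleted]":
                    yield {
                        "text": body,
                        "meta": {"subreddit": comment.get("subreddit")},
                    }
```

Each yielded dict has a "text" key, which is the shape annotation tasks take, and because the function is a generator, the archives are decompressed lazily rather than loaded into memory all at once.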

If you’re following the video example, note that we’ve pre-processed the Reddit data and divided it into a training set, an evaluation set and a test set (to make sure we can actually evaluate the model properly). We’ve also extracted only comments from /r/opiates. However, if you want to try out a similar approach with a broader category (like slang terms, technology companies or whatever else you come up with), you can also easily stream in texts from all subreddits.
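If you want to do a similar pre-processing step yourself, it could look roughly like the sketch below. This is not the script we used for the video: the function names, the 80/10/10 proportions and the use of the comment "id" field for splitting are all assumptions made for illustration.

```python
import hashlib


def filter_subreddit(comments, subreddit=None):
    """Keep only comments from one subreddit; pass None to keep all of them."""
    for comment in comments:
        name = comment.get("subreddit") or ""
        if subreddit is None or name.lower() == subreddit.lower():
            yield comment


def assign_split(comment_id, eval_pct=10, test_pct=10):
    """Map a comment ID to "train", "eval" or "test" (hypothetical proportions).

    Hashing the ID keeps the assignment stable across runs, so a comment
    never moves from one set to another when the data is re-processed.
    """
    digest = hashlib.md5(comment_id.encode("utf8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + eval_pct:
        return "eval"
    return "train"
```

For example, filter_subreddit(comments, "opiates") narrows the stream to one subreddit, while leaving the second argument out streams in everything. Writing each split to its own directory then lets you point the source argument at something like /train.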

Hope this helps!
