ines (Ines Montani)
October 10, 2019, 12:24pm
Hi! The argument after the model name is the text source you want to load in and annotate. In the tutorial video, we've used a directory called train containing the Reddit data. From the error, it looks like you're trying to load data from a path train, but that path doesn't exist.
I've explained this in more detail here:
Sorry if this was confusing in the video – I'm copying over my reply from this thread, which asked the same question:
The fourth argument of the command after the dataset and the model is the data source, i.e. the texts you're loading in. So the train in the command above is the path to the data we're loading in – for training, we've created a directory /train containing the data files. (In the video, you'll see that it's underlined, because it points to a directory.) Here's a more explicit version of the command:
ner.teach drugs_ner en_core_web_lg /path/to/reddit/data --loader reddit --label DRUG --patterns drug_patterns.jsonl
The data loaded in is a portion of the Reddit Comments Corpus, which you can download for free. The built-in Reddit loader in Prodigy is available via --loader reddit and can take either a single .bz2 archive (the format the corpus is shipped in) or a directory containing multiple archives, which are then loaded in order.
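In case it helps to see what loading a .bz2 archive of Reddit comments involves, here's a minimal sketch of what a loader like that might do under the hood. This is just an illustration using Python's standard library – Prodigy's built-in --loader reddit handles all of this (including directories of multiple archives) for you, and the exact field handling here is an assumption based on the corpus format:

```python
import bz2
import json

def stream_reddit_comments(path):
    # Yield one annotation task dict per comment from a .bz2 archive
    # of JSON lines (the format the Reddit Comments Corpus ships in).
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            text = comment.get("body", "")
            # Skip deleted/removed comments, which have no usable text.
            if text and text not in ("[deleted]", "[removed]"):
                yield {"text": text}

# Write a tiny sample archive so the sketch is runnable end-to-end.
sample = [{"body": "Has anyone tried switching medications?"},
          {"body": "[deleted]"}]
with bz2.open("sample.bz2", mode="wt", encoding="utf-8") as f:
    for comment in sample:
        f.write(json.dumps(comment) + "\n")

tasks = list(stream_reddit_comments("sample.bz2"))
```

The point is just that each comment's body becomes a {"text": ...} task – the built-in loader takes care of this, so you only need to point it at the archive or directory.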
If you're following the video example, note that we've pre-processed the Reddit data and divided it into a training set, an evaluation set and a test set (to make sure we can actually evaluate the model properly). We've also extracted only comments from /r/opiates. However, if you want to try out a similar approach with a broader category (like slang terms or technology companies or whatever else you come up with), you can also easily stream in texts from all subreddits.
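If you do pre-process your own data, the train/eval/test split might look something like this. The function name and the 80/10/10 ratio are hypothetical – the exact split we used isn't specified above, so this is just one common way to do it:

```python
import random

def split_comments(comments, seed=0):
    # Shuffle deterministically, then split 80/10/10 into
    # train / evaluation / test sets (hypothetical helper).
    rng = random.Random(seed)
    comments = list(comments)
    rng.shuffle(comments)
    n = len(comments)
    n_train = int(n * 0.8)
    n_eval = int(n * 0.1)
    return (comments[:n_train],
            comments[n_train:n_train + n_eval],
            comments[n_train + n_eval:])

train, eval_set, test = split_comments(f"comment {i}" for i in range(100))
```

Holding out the evaluation and test sets before annotating is what makes the final accuracy numbers trustworthy, since the model never sees those texts during training.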
Hope this helps!