I’m trying to follow the video tutorial (https://prodi.gy/docs/video-new-entity-type), and at around the 13:10 mark Matthew runs the command prodigy ner.teach drugs_ner en_core_web_lg train --loader reddit --label DRUG --patterns drug_patterns.jsonl.
It fails with my setup. It seems to be choking on train in the command, and I can’t find the docs that clarify what it’s for. Maybe the issue is that I don’t have the Reddit corpus installed locally, but again I couldn’t find the relevant documentation: where should I put it to make it available to Prodigy? How do I use --loader at all?
Sorry if this was a little confusing in the video! The fourth argument of the command, after the dataset and the model, is the data source, i.e. the texts you’re loading in. So the train in the command above is the path to the data we’re loading in. For training, we’ve created a directory /train containing the data files. (In the video, you’ll see that it’s underlined, because it points to a directory.) Here’s a more explicit version of the command:
prodigy ner.teach drugs_ner en_core_web_lg /path/to/reddit/data --loader reddit --label DRUG --patterns drug_patterns.jsonl
The data loaded in is a portion of the Reddit Comments Corpus, which you can download for free. The built-in Reddit loader in Prodigy is available via --loader reddit and can take either a single .bz2 archive (the format the corpus is shipped in), or a directory containing multiple archives, which are then loaded in order.
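If you want to check what is inside an archive before pointing Prodigy at it, here is a minimal sketch of reading one yourself. This is not Prodigy's actual loader code: it assumes the archive is newline-delimited JSON with "body" and "subreddit" keys on each comment, and the function name iter_reddit_texts is just made up for illustration.

```python
import bz2
import json


def iter_reddit_texts(path, subreddit=None):
    """Yield {"text": ...} dicts from one .bz2 archive of newline-delimited JSON."""
    # "rt" opens the compressed file in text mode so it can be read line by line
    with bz2.open(path, "rt", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            comment = json.loads(line)
            # Assumption: each record has "body" and "subreddit" keys
            if subreddit and comment.get("subreddit", "").lower() != subreddit.lower():
                continue
            body = comment.get("body", "")
            # Skip empty and deleted comments
            if body and body != "[deleted]":
                yield {"text": body}
```

Calling iter_reddit_texts("/path/to/archive.bz2", subreddit="opiates") would then give you only the comments from one subreddit, and leaving out the subreddit argument streams everything.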
If you’re following the video example, note that we’ve pre-processed the Reddit data and divided it into a training set, an evaluation set and a test set (to make sure we can actually evaluate the model properly). We’ve also extracted only comments from /r/opiates. However, if you want to try out a similar approach with a broader category (like, slang terms or technology companies or whatever else you come up with), you can also easily stream in texts from all subreddits.
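In case it helps, here is one possible way to do such a split yourself. It is only a sketch and not the script we used for the video: the 80/10/10 proportions and the function name assign_split are assumptions. Hashing the text means the same comment always lands in the same set, even if you re-run the script.

```python
import hashlib


def assign_split(text, eval_pct=10, test_pct=10):
    """Deterministically assign a text to "train", "eval" or "test"."""
    # Hash the text and map it to a bucket from 0 to 99
    digest = hashlib.md5(text.encode("utf8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + eval_pct:
        return "eval"
    return "train"


def split_examples(examples, eval_pct=10, test_pct=10):
    """Group {"text": ...} dicts into three lists keyed by split name."""
    splits = {"train": [], "eval": [], "test": []}
    for eg in examples:
        splits[assign_split(eg["text"], eval_pct, test_pct)].append(eg)
    return splits
```

Each list can then be written to its own file or directory and passed to Prodigy as the data source.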
You can find an overview of the data and live API loaders on the website or in your PRODIGY_README.html, available for download with Prodigy. The README also includes more detailed API docs for each individual loader, including how to call it from Python in your own recipe scripts.
Except for the Reddit and Images loaders, Prodigy currently expects the input data to be single files, so maybe that was the problem? Alternatively, you can also check the “Input formats” section of the README, which lists the expected formats of the individual file types supported by Prodigy.
If you don’t specify a --loader on the command line, the appropriate loader will be selected based on the file extension. So if your file is called data.jsonl, Prodigy should use the JSONL loader.
(Btw, there’s currently an open feature request for allowing data loading from directories – if this would be useful to you as well, we could make this happen for a future release.)
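In the meantime, a simple workaround is to merge the files in a directory into one JSONL file yourself. Here is a rough sketch, assuming plain-text files with one text per line. The function name directory_to_jsonl and the "*.txt" default are just for illustration.

```python
import json
from pathlib import Path


def directory_to_jsonl(input_dir, output_path, pattern="*.txt"):
    """Write one {"text": ...} JSON line per non-empty line of every matching file."""
    count = 0
    with open(output_path, "w", encoding="utf8") as out:
        # Sort the paths so the output order is the same on every run
        for file_path in sorted(Path(input_dir).glob(pattern)):
            with file_path.open(encoding="utf8") as f:
                for line in f:
                    text = line.strip()
                    if text:
                        out.write(json.dumps({"text": text}) + "\n")
                        count += 1
    # Return the number of texts written, handy as a sanity check
    return count
```

If the output file is called something like data.jsonl, you can pass it in as the data source without setting --loader, since the loader is then picked from the file extension.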
Sorry to say but the ‘Prodigy tutorial’ IS HIGHLY CONFUSING - I am spending most of my time FIXING stuff and making out how to do stuff rather than actually using the tool itself. I am even thinking of writing my own annotation program to just do some basic annotation for me. This is all I wanted out of Prodigy but it turns out I am still navigating between this and that video tutorial to get things done. Your tutorials are supposed to make the lives of people easier, and not the other way round!
Thanks for your reply. I finally figured out how to use the tool for annotation, but so far it only annotates one word at a time. I need to annotate plant species names, which are tricky: some have a Latin form, and most of them consist of two words. Although I have created a database of plant terms (with the labels consisting of two words), I fail to understand why the tool picks up only one word instead of two consecutive words as a plant label.
It sounds like you might not have defined your patterns file correctly, or alternatively, if you're using terms.teach, you might need to change how the vectors are trained. These threads might be useful:
Finally, if you just want to label text in an uncomplicated way, remember that you can always use the ner.manual mode, which doesn't do anything fancy -- it just lets you highlight text, without having a model suggest things.
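On the patterns file: for a two-word name to be matched as one span, the pattern usually needs one entry per token rather than one entry for the whole name. Here is a minimal sketch of converting a term list into that format. It assumes your terms are plain strings and that splitting on whitespace gives the same tokens as the model's tokenizer, which may not hold for names with hyphens or punctuation. The PLANT label and the function names are only examples.

```python
import json


def terms_to_patterns(terms, label):
    """Turn each term into one token-based pattern with one dict per word."""
    patterns = []
    for term in terms:
        # Assumption: whitespace splitting matches how the model tokenizes the term
        words = term.split()
        if words:
            token_dicts = [{"lower": word.lower()} for word in words]
            patterns.append({"label": label, "pattern": token_dicts})
    return patterns


def write_patterns(patterns, path):
    """Write the patterns to a JSONL file, one pattern per line."""
    with open(path, "w", encoding="utf8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")
```

For a term like "Quercus robur", this produces a single pattern with two token entries, so both words are covered by the same PLANT match instead of being treated separately.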