Using Loaders

Hello!

I’m trying to follow the video tutorial (https://prodi.gy/docs/video-new-entity-type), and at around the 13:10 mark Matthew runs the command prodigy ner.teach drugs_ner en_core_web_lg train --loader reddit --label DRUG --patterns drug_patterns.jsonl.

It fails with my setup. It seems to be choking on train in the command, and I can’t find docs that clarify what it’s for. Maybe the issue is that I don’t have the Reddit corpus installed locally, but again I couldn’t find the relevant documentation: where should I put it to make it available to Prodigy? And how do I use --loader at all?

Thanks.

Sorry if this was a little confusing in the video! The fourth argument of the command, after the dataset and the model, is the data source, i.e. the texts you’re loading in. So the train in the command above is the path to the data we’re loading in – for training, we’ve created a directory /train containing the data files. (In the video, you’ll see that it’s underlined, because it points to a directory.) Here’s a more explicit version of the command:

prodigy ner.teach drugs_ner en_core_web_lg /path/to/reddit/data --loader reddit --label DRUG --patterns drug_patterns.jsonl

The data loaded in is a portion of the Reddit Comments Corpus, which you can download for free. The built-in Reddit loader in Prodigy is available via --loader reddit and can take either a single .bz2 archive (the format the corpus is shipped in), or a directory containing multiple archives, which are then loaded in order.

If you’re following the video example, note that we’ve pre-processed the Reddit data and divided it into a training set, an evaluation set and a test set (to make sure we can actually evaluate the model properly). We’ve also extracted only comments from /r/opiates. However, if you want to try out a similar approach with a broader category (like, slang terms or technology companies or whatever else you come up with), you can also easily stream in texts from all subreddits.
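To give a rough idea of what this kind of pre-processing could look like, here is a minimal sketch in plain Python. It is not Prodigy's built-in Reddit loader. The function name stream_reddit_comments is made up for illustration, and the sketch assumes each line of an archive is a JSON object with "subreddit" and "body" fields.

```python
import bz2
import json
from pathlib import Path
from typing import Dict, Iterator, Optional


def stream_reddit_comments(
    path: str, subreddit: Optional[str] = None
) -> Iterator[Dict[str, str]]:
    """Yield {"text": ...} dicts from one .bz2 archive or a directory of them.

    Hypothetical helper, not part of Prodigy.
    """
    source = Path(path)
    # A directory is read archive by archive, in sorted filename order.
    archives = sorted(source.glob("*.bz2")) if source.is_dir() else [source]
    for archive in archives:
        with bz2.open(archive, "rt", encoding="utf8") as lines:
            for line in lines:
                if not line.strip():
                    continue
                comment = json.loads(line)
                # Assumed field names: "subreddit" and "body".
                name = comment.get("subreddit", "")
                if subreddit and name.lower() != subreddit.lower():
                    continue
                body = comment.get("body", "")
                # Skip empty and deleted comments.
                if body and body != "[deleted]":
                    yield {"text": body}
```

Calling stream_reddit_comments("/path/to/reddit/data", subreddit="opiates") would then yield only comments from that subreddit, while leaving out the subreddit argument streams in everything.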

Hope this helps!

Thank you, it does help a lot!

How can I see a list of all the available loaders?

I have my data as a list of text files in a flat directory, and I’ve also reformatted it as JSONL for the first example in the video. Neither of those works without specifying a loader.

You can find an overview of the data and live API loaders on the website or in your PRODIGY_README.html, available for download with Prodigy. The README also includes more detailed API docs for each individual loader, including how to call it from Python in your own recipe scripts.

Except for the Reddit and Images loaders, Prodigy currently expects the input data to be single files – so maybe that was the problem? Alternatively, you can also check the "Input formats" section of the README, which lists the expected formats of the individual file types supported by Prodigy.

If you don't specify a --loader on the command line, the appropriate loader will be selected based on the file extension. So if your file is called data.jsonl, Prodigy should use the JSONL loader.
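If your texts live in a flat directory of .txt files, one way to get a single file with a recognised extension is to merge them into JSONL first. The following is only a sketch: texts_to_jsonl is a hypothetical helper, it assumes UTF-8 files with one document per file, and it writes one {"text": ...} object per line. The "meta" entry is optional and only there to keep track of where each text came from.

```python
import json
from pathlib import Path


def texts_to_jsonl(input_dir: str, output_path: str) -> int:
    """Write one JSON line per .txt file and return the number of records.

    Hypothetical helper, not part of Prodigy.
    """
    count = 0
    with open(output_path, "w", encoding="utf8") as out:
        # Sorted so the output order is stable between runs.
        for txt_file in sorted(Path(input_dir).glob("*.txt")):
            text = txt_file.read_text(encoding="utf8").strip()
            if not text:
                continue  # skip empty files
            record = {"text": text, "meta": {"source": txt_file.name}}
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

After something like texts_to_jsonl("/path/to/texts", "data.jsonl"), passing data.jsonl as the source should let the extension decide the loader.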

(Btw, there's currently an open feature request for allowing data loading from directories – if this would be useful to you as well, we could make this happen for a future release.)

Sorry to say but the ‘Prodigy tutorial’ IS HIGHLY CONFUSING - I am spending most of my time FIXING stuff and making out how to do stuff rather than actually using the tool itself. I am even thinking of writing my own annotation program to just do some basic annotation for me. This is all I wanted out of Prodigy but it turns out I am still navigating between this and that video tutorial to get things done. Your tutorials are supposed to make the lives of people easier, and not the other way round!

Hi @vatsala,

I’m sorry you’re having a bad experience, but without understanding what you’re struggling with it’s pretty hard to know how to help! What are you trying to do, and which tutorials are you following?

You’re on a free research license, right? For some research topics, writing your own tooling really is the best solution, especially since you can then release it alongside the data.

Best,
Matt

Hi Matt,

Thanks for your reply. I finally figured out how to use the tool for annotation, but so far it annotates only one word at a time. I need to annotate plant species names (which are tricky, and some have a Latin form - most of them consist of 2 words). Although I have created a database of plant terms (with the labels consisting of 2 words), I fail to understand why the tool picks up only one word instead of picking two consecutive words (termed as a plant label).

Best,
Vatsala

It sounds like you might not have defined your patterns file correctly, or alternatively, if you're using terms.teach, you might need to change how the vectors are trained. These threads might be useful:

Finally, if you just want to label text in an uncomplicated way, remember that you can always use the ner.manual mode, which doesn't do anything fancy -- it just lets you highlight text, without having a model suggest things.
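On the patterns file: a multi-word term needs one token description per word within a single pattern, otherwise only single tokens can match. As a rough sketch, a term list could be converted like this. The helper names, the PLANT label and the example terms are all made up for illustration, and splitting on whitespace is an assumption that won't always agree with how the model's tokenizer splits the text.

```python
import json
from typing import Dict, Iterable, List


def make_token_patterns(terms: Iterable[str], label: str) -> List[Dict]:
    """Build one token-based pattern per term, one dict per word."""
    patterns = []
    for term in terms:
        tokens = term.split()  # assumption: whitespace tokenization
        if not tokens:
            continue
        # Matching on the lowercase form makes the pattern case-insensitive.
        pattern = [{"lower": token.lower()} for token in tokens]
        patterns.append({"label": label, "pattern": pattern})
    return patterns


def write_patterns(patterns: List[Dict], path: str) -> None:
    """Write the patterns to a JSONL file, one pattern per line."""
    with open(path, "w", encoding="utf8") as out:
        for entry in patterns:
            out.write(json.dumps(entry) + "\n")
```

For example, make_token_patterns(["Rosa canina"], "PLANT") produces a single pattern with two token entries, so both words are highlighted together as one span.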

Thanks for the pointers, I will go through them.

Vatsala