Strange OSError Using the Reddit Loader

I am trying to use the Reddit loader to load a Reddit corpus with the following command:
prodigy ner.teach writingStyle_ner en_core_web_lg reddit_data/2013/RC_2013-01.bz2 --loader reddit --label writingStyle --patterns data/writingStyle_patterns.jsonl
When I don't use the Reddit loader/corpus, annotation and training work just fine, but when I run the Reddit loader with this command I get this error:
OSError: Invalid data stream.
Any help would be appreciated. Thank you!

I just had a look at what might cause the error, and it seems to be raised within the bz2 module when decompressing the file. Most of the threads I found online report that it was caused by a corrupted file, so just as a sanity check, could you try decompressing it manually and check whether everything looks alright?
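If it helps, the manual check can also be done from Python. This is only a minimal sketch using the standard library's bz2 module: the helper name check_bz2 is made up for illustration, and it simply reads the whole archive in chunks and reports whether decompression fails partway through.

```python
import bz2


def check_bz2(path, chunk_size=1024 * 1024):
    """Decompress the whole file in chunks without keeping it in memory.

    Returns (True, number_of_decompressed_bytes) if the archive reads cleanly,
    or (False, error_message) if decompression fails.
    (Hypothetical helper, not part of Prodigy.)
    """
    total = 0
    try:
        with bz2.open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError) as err:
        # OSError: the data is not a valid bzip2 stream
        # EOFError: the file ends before the end-of-stream marker (truncated download)
        return False, f"{type(err).__name__}: {err}"
    return True, total


if __name__ == "__main__":
    # Path taken from the command in the question.
    print(check_bz2("reddit_data/2013/RC_2013-01.bz2"))
```

If this returns False for the file, the archive itself is the problem rather than the loader, and downloading it again is probably the quickest fix.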

I was having issues unzipping the files, so I redownloaded them from another source and it worked! I apologize if I should open another issue for this, but is there an easy way to use the Reddit loader for just a specific subreddit? With the entire Reddit corpus, very few of the suggestions are ones I end up accepting, so narrowing it down would be useful.

No worries, glad it all worked now! 👍

Yes, that definitely makes sense. The stream produced by the Reddit loader (and all other loaders) is a regular Python generator, so you can always implement your own filtering at runtime by writing a custom loader that calls the Reddit loader directly in your code (see the PRODIGY_README.html for more details and API docs).
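To illustrate the runtime filtering idea, here is a rough sketch of a generator that wraps any stream of tasks and only passes through the ones from selected subreddits. It assumes each task is a dict with a "meta" dict that holds the subreddit name; the key name "section" used as the default here is an assumption, so print one task from the loader first to see which key it actually uses. The function name filter_by_subreddit is also just made up for this example.

```python
def filter_by_subreddit(stream, subreddits, meta_key="section"):
    """Yield only the tasks whose meta[meta_key] matches one of the subreddits.

    `stream` is any iterable of task dicts. `meta_key` is an assumption about
    where the loader stores the subreddit name; adjust it to your data.
    """
    wanted = {name.lower() for name in subreddits}
    for task in stream:
        name = task.get("meta", {}).get(meta_key)
        if isinstance(name, str) and name.lower() in wanted:
            yield task


if __name__ == "__main__":
    # Stand-in for the loader's output, only to show the shape of the data.
    demo_stream = [
        {"text": "First draft done!", "meta": {"section": "writing"}},
        {"text": "What is your favourite food?", "meta": {"section": "AskReddit"}},
    ]
    for task in filter_by_subreddit(demo_stream, ["writing"]):
        print(task["text"])
```

Because this is a generator as well, nothing is loaded up front: tasks are read from the underlying stream and filtered one at a time as they are requested.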

However, it might actually be more efficient to pre-process the data, create a new input file with only the selected subreddit(s) and then load that into Prodigy. That’s also how we did it for our video tutorial.
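A pre-processing script along those lines could look roughly like this. It is a sketch, not the script used for the tutorial: it assumes the dump contains one JSON object per line with "subreddit" and "body" fields, and it writes a JSONL file with one {"text": ...} object per line. The function name and the output file name are placeholders.

```python
import bz2
import json


def extract_subreddits(src_path, dst_path, subreddits):
    """Copy comments from the given subreddits into a JSONL file.

    Assumes one JSON comment per line with "subreddit" and "body" fields.
    Returns the number of comments written.
    """
    wanted = {name.lower() for name in subreddits}
    kept = 0
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            comment = json.loads(line)
            if str(comment.get("subreddit", "")).lower() not in wanted:
                continue
            body = comment.get("body", "")
            # Skip empty and deleted/removed comments.
            if not body or body in ("[deleted]", "[removed]"):
                continue
            task = {"text": body, "meta": {"subreddit": comment["subreddit"]}}
            dst.write(json.dumps(task) + "\n")
            kept += 1
    return kept


if __name__ == "__main__":
    # Input path from the question; output name and subreddit are placeholders.
    n = extract_subreddits("reddit_data/2013/RC_2013-01.bz2", "writing.jsonl", ["writing"])
    print(f"Wrote {n} comments")
```

The resulting file is much smaller than the full dump and can be passed to the recipe as the source instead of the .bz2 archive, in which case the --loader reddit option should no longer be needed.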