I am trying to use the Reddit loader to load a Reddit corpus with the following command:
prodigy ner.teach writingStyle_ner en_core_web_lg reddit_data/2013/RC_2013-01.bz2 --loader reddit --label writingStyle --patterns data/writingStyle_patterns.jsonl
When I'm not using the Reddit loader/corpus, the annotation works just fine, but if I run the Reddit loader with this command I get this error:
OSError: Invalid data stream.
Any help would be appreciated. Thank you!
I just had a look at what might cause the error, and it seems like it’s triggered within the bz2 module when uncompressing the file. Most of the threads I found online report that it was caused by a corrupted file, so just as a sanity check, could you try uncompressing it manually and check if everything looks alright?
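For example, something like this quick Python check (using the path from your command) should reproduce the problem if the download is corrupted, since a truncated archive usually fails with the same "Invalid data stream" error:

```python
import bz2

path = "reddit_data/2013/RC_2013-01.bz2"  # example path from the command above
try:
    # Stream-decompress and peek at the first few lines; a truncated or
    # corrupted archive typically raises OSError("Invalid data stream") here.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= 5:
                break
            print(line[:120])
    print("First lines decompressed fine.")
except OSError as err:
    print("Decompression failed:", err)
```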
I was having issues unzipping the files, so I re-downloaded them from another source and it worked! I apologize if I should open another issue for this, but is there an easy way to use the Reddit loader for just a specific subreddit? I'm not getting many annotations that I can accept when using the entire Reddit corpus, so narrowing it down would be useful.
No worries, glad it all worked now!
Yes, that definitely makes sense. The stream produced by the Reddit loader (and all other loaders) is a regular Python generator, so you can always implement your own filtering at runtime with a custom loader that calls the Reddit loader directly in your code (see the PRODIGY_README.html for more details and API docs).
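As a rough sketch of the runtime approach (the import path and the meta field that holds the subreddit are assumptions here, so please double-check them against the README), a filtering wrapper could look something like this:

```python
from prodigy.components.loaders import Reddit  # assumed import path, see PRODIGY_README.html

def filtered_reddit(source, subreddits):
    """Wrap the Reddit loader and only yield comments from the given subreddits."""
    wanted = {s.lower() for s in subreddits}
    for task in Reddit(source):
        # Where the subreddit name ends up in the task dict is an assumption;
        # print one task first to confirm the exact key before relying on this.
        meta = task.get("meta", {})
        if str(meta.get("subreddit", "")).lower() in wanted:
            yield task

# e.g. stream = filtered_reddit("reddit_data/2013/RC_2013-01.bz2", ["writing"])
```

You could then use that generator as the stream inside a custom recipe.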
However, it might actually be more efficient to pre-process the data, create a new input file with only the selected subreddit(s) and then load that into Prodigy. That's also how we did it for our video tutorial.
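If you go the pre-processing route, a small standalone script along these lines (assuming the dump is newline-delimited JSON with "body" and "subreddit" fields, which is how the public Reddit comment archives are usually structured) would write a JSONL file that Prodigy can read with its default settings:

```python
import bz2
import json

SUBREDDITS = {"writing", "writingprompts"}  # example subreddits to keep

with bz2.open("reddit_data/2013/RC_2013-01.bz2", "rt", encoding="utf-8") as infile, \
     open("reddit_writing_2013-01.jsonl", "w", encoding="utf-8") as outfile:
    for line in infile:
        comment = json.loads(line)
        body = comment.get("body", "")
        # Keep only comments from the selected subreddits and skip deleted/removed bodies.
        if comment.get("subreddit", "").lower() in SUBREDDITS and body not in ("", "[deleted]", "[removed]"):
            task = {"text": body, "meta": {"subreddit": comment["subreddit"]}}
            outfile.write(json.dumps(task) + "\n")
```

The resulting file can then be passed to ner.teach without the --loader reddit flag, since it's already in Prodigy's JSONL format.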