I am trying to use the Reddit loader to load a Reddit corpus with the following command:
prodigy ner.teach writingStyle_ner en_core_web_lg reddit_data/2013/RC_2013-01.bz2 --loader reddit --label writingStyle --patterns data/writingStyle_patterns.jsonl
When I'm not using the Reddit loader/corpus, the annotation works just fine, but if I run the Reddit loader with this command I get this error:
OSError: Invalid data stream.
Any help would be appreciated. Thank you!
I just had a look at what might cause the error, and it seems like it’s triggered within the bz2 module when uncompressing the file. Most of the threads I found online report that it was caused by a corrupted file, so just as a sanity check, could you try uncompressing it manually and check if everything looks alright?
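For example, something like this quick Python check (using the path from your command) should reproduce the problem if the download is corrupted, since a truncated archive usually fails with the same "Invalid data stream" error:

```python
import bz2

path = "reddit_data/2013/RC_2013-01.bz2"  # example path from the command above
try:
    # Stream-decompress and peek at the first few lines; a truncated or
    # corrupted archive typically raises OSError("Invalid data stream") here.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= 5:
                break
            print(line[:120])
    print("First lines decompressed fine.")
except OSError as err:
    print("Decompression failed:", err)
```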
I was having issues unzipping the files, so I re-downloaded them from another source and it worked! I apologize if I should open another issue for this, but is there an easy way to use the Reddit loader for just a specific subreddit? I'm not getting many annotations that I can accept when using the entire Reddit corpus, so narrowing it down would be useful.
No worries, glad it all worked now!
Yes, that definitely makes sense. The stream produced by the Reddit loader (and all other loaders) is a regular Python generator, so you can always implement your own filtering at runtime with a custom loader that calls the Reddit loader directly in your code (see the PRODIGY_README.html for more details and API docs).
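As a rough sketch of the runtime approach (the import path and the meta field that holds the subreddit are assumptions here, so please double-check them against the README), a filtering wrapper could look something like this:

```python
from prodigy.components.loaders import Reddit  # assumed import path, see PRODIGY_README.html

def filtered_reddit(source, subreddits):
    """Wrap the Reddit loader and only yield comments from the given subreddits."""
    wanted = {s.lower() for s in subreddits}
    for task in Reddit(source):
        # Where the subreddit name ends up in the task dict is an assumption;
        # print one task first to confirm the exact key before relying on this.
        meta = task.get("meta", {})
        if str(meta.get("subreddit", "")).lower() in wanted:
            yield task

# e.g. stream = filtered_reddit("reddit_data/2013/RC_2013-01.bz2", ["writing"])
```

You could then use that generator as the stream inside a custom recipe.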
However, it might actually be more efficient to pre-process the data, create a new input file with only the selected subreddit(s) and then load that into Prodigy. That's also how we did it for our video tutorial.
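If you go the pre-processing route, a small standalone script along these lines (assuming the dump is newline-delimited JSON with "body" and "subreddit" fields, which is how the public Reddit comment archives are usually structured) would write a JSONL file that Prodigy can read with its default settings:

```python
import bz2
import json

SUBREDDITS = {"writing", "writingprompts"}  # example subreddits to keep

with bz2.open("reddit_data/2013/RC_2013-01.bz2", "rt", encoding="utf-8") as infile, \
     open("reddit_writing_2013-01.jsonl", "w", encoding="utf-8") as outfile:
    for line in infile:
        comment = json.loads(line)
        body = comment.get("body", "")
        # Keep only comments from the selected subreddits and skip deleted/removed bodies.
        if comment.get("subreddit", "").lower() in SUBREDDITS and body not in ("", "[deleted]", "[removed]"):
            task = {"text": body, "meta": {"subreddit": comment["subreddit"]}}
            outfile.write(json.dumps(task) + "\n")
```

The resulting file can then be passed to ner.teach without the --loader reddit flag, since it's already in Prodigy's JSONL format.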