Hi @ines - would be great if you could update the video for training an insults classifier, or at least put a comment above it indicating it's outdated - I ran into the seeds issue and needed to track down these posts:
in order to figure it out.
It seems that it has been quite a while since the seeds method in the video would have worked (~1 yr)
Incidentally have there been some updates recently to the online docs?
And one last question (sorry!): can the Reddit corpus loader be made to work with .xz files, since I see that they're now compressed that way (https://files.pushshift.io/reddit/comments/ shows the last .bz2 format file being 2017-11)? Plus, any way to enable it to split long comments into sentences would be awesome too. The loader component is compiled (loaders.cpython-35m-x86_64-linux-gnu.so), so I couldn't look into it myself to check what was needed.
Thanks for the feedback and suggestions! I've just added a note to the video description that points to the thread and notes the updated argument name. (Unfortunately, YouTube doesn't seem to let you add text info as annotations anymore, and the cards can only link to sites selected by YouTube.)
Looks like we forgot to replace that link! The overview of files is now on the features page, but for the cookbook, it's probably better to just list them anyway. I'll fix that.
If the compression format has changed, we definitely want to adjust that, yes! We're also using the Reddit corpus a lot for other experiments, so this is good to know. I think we've mostly been using the older months so far, so I hadn't even noticed that.
For the basic Reddit loader script similar to the one we also use in Prodigy, see here:
So if you want to customise it, add your own filtering, preprocess the data etc., you could adapt this and write your own loader.
Yes, the PRODIGY_README.html is definitely the most comprehensive documentation, as it also has the full API docs of the Python library. We wanted the website to give developers enough of an overview to get an idea of how the tool works, what's possible and what to expect from the API. Maybe we should add an infobox or another type of note that makes this more clear?
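As a rough idea of what a custom loader could look like, here's a minimal sketch that reads a monthly dump compressed as either .xz or .bz2 using only the standard library. It is not the built-in loader: the function name `load_reddit` and the `min_length` setting are made up for illustration, and it assumes one JSON object per line with `body` and `subreddit` fields, yielding dicts with a `"text"` and a `"meta"` key.

```python
import bz2
import json
import lzma
import sys
from pathlib import Path

# Pick the right opener based on the file extension; fall back to plain text.
OPENERS = {".xz": lzma.open, ".bz2": bz2.open}


def load_reddit(path, min_length=1):
    """Yield one task dict per usable comment in a compressed dump.

    Assumes one JSON object per line with "body" and "subreddit" fields.
    """
    path = Path(path)
    opener = OPENERS.get(path.suffix, open)
    with opener(path, "rt", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            comment = json.loads(line)
            text = comment.get("body", "")
            # Skip placeholder bodies and comments that are too short.
            if text in ("[deleted]", "[removed]") or len(text) < min_length:
                continue
            yield {"text": text, "meta": {"subreddit": comment.get("subreddit")}}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one JSON object per line so the output can be saved as JSONL.
    for task in load_reddit(sys.argv[1]):
        print(json.dumps(task))
```

Splitting long comments into sentences could then happen inside the loop, for example by running each text through a sentence segmenter and yielding one dict per sentence instead of one per comment.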
I’m actually getting a very low number of even vaguely insulting terms coming out. I guess this is the cold start problem, right?
Was wondering if I could add some patterns that were a bit more like regular insult phrases (rather than just my seed terms) to get things going, but unsure if that’s a good/bad idea and if so what form to make the patterns. Perhaps something with “you” at the start as that’s typical of insults, e.g. “you numpty” etc.
Are you just using a random month of comments? Maybe you could try filtering it by subreddits that are more likely to have heated discussions and insults?
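A subreddit filter like that could be sketched as a small generator wrapped around whatever produces your examples. This assumes each example is a dict with the subreddit name stored under `meta["subreddit"]`; the function name and the subreddit names in the usage comment are placeholders, not recommendations.

```python
def filter_by_subreddit(tasks, subreddits):
    """Yield only the tasks whose meta["subreddit"] is in the given list."""
    wanted = {name.lower() for name in subreddits}
    for task in tasks:
        # Tolerate tasks with no meta or no subreddit recorded.
        subreddit = (task.get("meta") or {}).get("subreddit") or ""
        if subreddit.lower() in wanted:
            yield task


# Hypothetical usage: keep only comments from two made-up subreddits.
# stream = filter_by_subreddit(stream, ["some_heated_sub", "another_sub"])
```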
Otherwise, your idea is definitely worth a try and something you should be able to express with token patterns. For example:
The first one would match “you are” but also “you’re” (which spaCy will split into two tokens), the second one “you” followed by a noun. There’s probably more similar stuff that you can experiment with here.
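For reference, two patterns along the lines described above could be written out as JSONL like this. Treat it as a sketch rather than the exact snippet: the label `INSULT` is a placeholder, and the token attributes (`lower`, `lemma`, `pos`) are one plausible way to express "you" plus a form of "be" and "you" plus a noun. The `lemma` and `pos` attributes rely on the model's tagger, so the matches depend on its predictions.

```python
import json

# Placeholder label; use whatever label your dataset is annotated with.
LABEL = "INSULT"

patterns = [
    # "you" followed by a form of "be": "you are", "you're" (two tokens)
    {"label": LABEL, "pattern": [{"lower": "you"}, {"lemma": "be"}]},
    # "you" followed by any token tagged as a noun: "you numpty"
    {"label": LABEL, "pattern": [{"lower": "you"}, {"pos": "NOUN"}]},
]


def to_jsonl(entries):
    """Serialize each pattern entry as one JSON object per line."""
    return "\n".join(json.dumps(entry) for entry in entries) + "\n"


if __name__ == "__main__":
    # Redirect the output to a file to get a patterns file.
    print(to_jsonl(patterns), end="")
```

The resulting file could then be passed in via the patterns argument, alongside or instead of the single-term seed patterns.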