Training Insults classifier video out of date (--seeds argument) and moved documentation

Hi @ines - would be great if you could update the video for training insults classifier or at least put a comment above it indicating it's out-dated - I ran into the seeds issue and needed to track down these posts:

in order to figure it out.
It seems that it has been quite a while since the seeds method in the video would have worked (~1 yr) :frowning:

Incidentally have there been some updates recently to the online docs?

I see a few details that I cannot find which I'm sure I saw earlier (before I bought Prodigy), for instance details of the loaders and the APIs. On the "Prodigy 101 – everything you need to know" page there's a link to a page that is also titled "Prodigy 101 – everything you need to know", but that no longer has the loader info (although it used to, according to the Wayback Machine: "Overview | Prodigy - Radically efficient machine teaching").
It looks like there may be comments in the README.html that aren't on the online docs - I've since found some detail on the loaders in the README.html, but it wasn't initially obvious things would be out of sync.

And one last question (sorry!): can the Reddit corpus loader be made to work with .xz files, since I see that they're now compressed that way (https://files.pushshift.io/reddit/comments/ shows the last .bz2 format file being 2017-11)? Plus any way to enable it to split long comments into sentences would be awesome too. The loader component is compiled (loaders.cpython-35m-x86_64-linux-gnu.so), so I couldn't look into it myself to check what was needed.

Thanks for the feedback and suggestions! I've just added a note to the video description that points to the thread and notes the updated argument name. (Unfortunately, YouTube doesn't seem to let you add text info as annotations anymore, and the cards can only link to sites selected by YouTube :disappointed:)

Looks like we forgot to replace that link! The overview of files is now on the features page, but for the cookbook, it's probably better to just list them anyways. I'll fix that :+1:

If the compression format has changed, we definitely want to adjust that, yes! We're also using the Reddit corpus a lot for other experiments, so this is good to know. I think we've mostly been using the older months so far, so I hadn't even noticed that.

For the basic Reddit loader script similar to the one we also use in Prodigy, see here:

https://github.com/explosion/spaCy/blob/develop/bin/load_reddit.py

So if you want to customise it, add your own filtering, preprocess the data etc., you could adapt this and write your own loader :slightly_smiling_face:
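As a rough illustration of what such a custom loader could look like, here is a minimal sketch using only the Python standard library. It is not the code from Prodigy's compiled loader or from the linked script. The function name `load_reddit`, the `split_sentences` flag and the regex-based sentence splitting are all assumptions made for this example, and it assumes the dump is newline-delimited JSON with the comment text in a `body` field.

```python
import bz2
import gzip
import json
import lzma
import re
from pathlib import Path

# Map file suffixes to the stdlib opener that can read them.
OPENERS = {".bz2": bz2.open, ".xz": lzma.open, ".gz": gzip.open}

# Naive sentence boundary: ".", "!" or "?" followed by whitespace.
# This is only a placeholder; spaCy's sentence segmentation would
# handle abbreviations, quotes etc. much better.
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def load_reddit(path, split_sentences=False):
    """Yield {"text": ...} dicts from a (possibly compressed) comments dump.

    Hypothetical helper: assumes one JSON object per line with a "body" key.
    """
    path = Path(path)
    # Fall back to the built-in open() for uncompressed files.
    opener = OPENERS.get(path.suffix, open)
    with opener(path, "rt", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            comment = json.loads(line)
            body = comment.get("body", "")
            # Skip empty comments and placeholders for deleted content.
            if not body or body in ("[deleted]", "[removed]"):
                continue
            texts = SENT_BOUNDARY.split(body) if split_sentences else [body]
            for text in texts:
                text = text.strip()
                if text:
                    yield {"text": text}
```

Because the opener is picked from the file suffix, the same function would read both the older .bz2 months and the newer .xz ones. The generator yields one dict per comment (or per sentence), which is the kind of stream a custom recipe could consume.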

Yes, the PRODIGY_README.html is definitely the most comprehensive documentation, as it also has the full API docs of the Python library. We wanted the website to give developers enough of an overview to get an idea of how the tool works, what's possible and what to expect from the API. Maybe we should add an infobox or another type of note that makes this more clear?

Thanks a lot - that’s v helpful, especially the GitHub link (I hadn’t thought to check there!)

I’m actually getting a very low number of even vaguely insulting terms coming out. I guess this is the cold start problem, right?
Was wondering if I could add some patterns that were a bit more like regular insult phrases (rather than just my seed terms) to get things going, but unsure if that’s a good/bad idea and if so what form to make the patterns. Perhaps something with “you” at the start as that’s typical of insults, eg “you numpty” etc etc.

Are you just using a random month of comments? Maybe you could try filtering it by subreddits that are more likely to have heated discussions and insults?
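A subreddit filter like that could be sketched roughly as follows. This is only an illustration: the helper name `filter_by_subreddit` is made up for this example, the subreddit names in the usage snippet are placeholders rather than recommendations, and it assumes each comment is a dict parsed from the dump with a `subreddit` field.

```python
def filter_by_subreddit(comments, subreddits):
    """Yield only the comments posted in one of the given subreddits.

    Hypothetical helper: `comments` is any iterable of dicts parsed from
    the dump, each assumed to carry a "subreddit" key.
    """
    # Compare case-insensitively so "Politics" and "politics" both match.
    wanted = {name.lower() for name in subreddits}
    for comment in comments:
        if comment.get("subreddit", "").lower() in wanted:
            yield comment


# Usage with placeholder data (subreddit names are illustrative only).
sample = [
    {"subreddit": "politics", "body": "first comment"},
    {"subreddit": "aww", "body": "second comment"},
]
heated = list(filter_by_subreddit(sample, ["Politics"]))
```

Since it is a generator, it can sit between reading the raw file and building the annotation tasks without loading the whole month into memory.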

Otherwise, your idea is definitely worth a try and something you should be able to express with token patterns. For example:

{"label": "INSULT", "pattern": [{"lower": "you"}, {"lemma": "be"}]}
{"label": "INSULT", "pattern": [{"lower": "you"}, {"pos": "NOUN"}]}

The first one would match “you are” but also “you’re” (which spaCy will split into two tokens), the second one “you” followed by a noun. There’s probably more similar stuff that you can experiment with here.
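If you want to combine "you" with your existing seed terms, you could also generate the patterns instead of writing them by hand. Here's a small sketch of that idea. The helper names `make_you_patterns` and `write_patterns` are made up for this example, and splitting multi-word terms on whitespace is an assumption that only approximates how spaCy would tokenize them.

```python
import json


def make_you_patterns(seed_terms, label="INSULT"):
    """Build token patterns of the form "you" + seed term.

    Hypothetical helper: each seed term is lowercased and split on
    whitespace, which is only a rough stand-in for spaCy's tokenizer.
    """
    patterns = []
    for term in seed_terms:
        tokens = [{"lower": word} for word in term.lower().split()]
        if tokens:  # skip empty or whitespace-only terms
            patterns.append(
                {"label": label, "pattern": [{"lower": "you"}] + tokens}
            )
    return patterns


def write_patterns(patterns, path):
    """Write one JSON pattern per line (JSONL), like the examples above."""
    with open(path, "w", encoding="utf8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")


patterns = make_you_patterns(["numpty", "total muppet"])
```

Each entry has the same shape as the two patterns above, so the hand-written and generated ones can live in the same file.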
