First and foremost, thanks so much for creating such amazing tools! I've only been using Prodigy for a couple of days and already love it!
My core question is around additional training resources (especially those geared towards those of us new to NLP in general). The videos in the Prodigy docs are wonderful, and I found myself wanting a lot more of them!!
I find I learn best through video + real-world examples, so I'd love to see a lot more video content around using the spaCy + Prodigy ecosystem for solving common problem types, getting around "beginner traps" in NLP projects, etc...
I'd happily join a separate membership site dedicated to spaCy + Prodigy (i.e. something similar to what laracasts.com does for Laravel developers: singular focus on one technology/stack, regular content, practical examples, etc.)
Is this something in the plans for either the creators, or any of you Prodigy prodigies out there?
Hi! Thanks so much for the kind words – glad to hear you've been finding the videos useful.
The question is very timely because we've actually just teamed up with someone from the community (edit: namely, @koaning) to produce a new video series for our YouTube channel. It'll show an end-to-end real-life NLP problem, from the first idea and experiments to data collection, training a model and so on, all from a user's perspective. The first episode is pretty much done, so we're hoping to launch that soon.
Wow, thanks for the responses, and even more so, for the forthcoming content!! That just made my day!
I know one of the challenges is that every project is relatively unique, and that it's hard to pick representative challenges that will be useful for everybody without being too diluted...
Here are 2 specific areas that I'd really like to get some insight into (I've got something suboptimal working after a few days of banging my head against a wall, but I know there are way better / easier / more elegant ways to do these things).
1. When doing web scraping, what are some approaches for dealing with pre-processing & navigation junk data? E.g. I set up a pipeline that uses html2text to give me back just text, but the NER thinks that many menu items are PERSON names. What's the best way to train a model that doesn't make that mistake?
2. How can I update the NER "PERSON" predictions to be more useful & accurate for populating a database (i.e. grab first + last name combos only, or, if it's a "Mr. X" or "Dr. Y", grab a LAST_NAME attribute)? A rough sketch of what I mean by both is below.
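To make these concrete, here's a minimal sketch of the kind of pipeline I have in mind (the junk filter, the title list and the thresholds are all made up, and it assumes spaCy's en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

TITLES = {"mr", "mrs", "ms", "dr", "prof"}

def strip_nav_junk(text):
    # Crude heuristic: lines with very few words are usually menus/nav links.
    # The threshold is a guess to tune, not a recommendation.
    lines = (ln.strip() for ln in text.splitlines())
    return " ".join(ln for ln in lines if len(ln.split()) > 3)

def extract_people(text):
    doc = nlp(strip_nav_junk(text))
    people = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        tokens = [t for t in ent if t.is_alpha]
        if not tokens:
            continue
        prev = doc[ent.start - 1] if ent.start > 0 else None
        has_title = ent[0].lower_.rstrip(".") in TITLES or (
            prev is not None and prev.lower_.rstrip(".") in TITLES)
        if has_title:
            # "Dr. Y" style: only trust the surname
            people.append({"last_name": tokens[-1].text})
        elif len(tokens) >= 2:
            # keep first + last combos only, skip single-token matches
            people.append({"first_name": tokens[0].text,
                           "last_name": tokens[-1].text})
    return people

# exact results are model-dependent, e.g. a lone surname for "Dr. Goodall"
# and a first/last pair for "John Smith"
print(extract_people("Home\nAbout\nContact\nDr. Goodall met John Smith in London yesterday."))
```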
On perhaps a larger scale, it'd be interesting to have a number of "multi-part series" going step-by-step (i.e. super beginner hand-holding level) through a medium-sized / medium-interesting project... It sounds like you've already got some in the pipeline, which is awesome!
Here's one I think would be interesting: "Let's assume you've got a database of content of varying lengths (can probably use Reddit comments) - when you type in a comment, find the N comments in your corpus that are most like the one you typed."
I'd imagine there are some interesting questions that could be tackled with a problem like that... When I built a system a couple of years ago to do something similar, I noticed issues with text length affecting what came back (i.e. short comments only matched to short documents, long comments only matched to long documents). Is it better to try to match based on the sum of predicted category scores, or on semantic similarity? Etc. (A toy sketch of the kind of baseline I mean is below.)
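For concreteness, a tiny baseline of the kind I'd start from might look like this (hypothetical: it assumes a spaCy model with word vectors, e.g. en_core_web_md, and cosine similarity over averaged token vectors):

```python
import numpy as np
import spacy

# needs a model with word vectors: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

comments = [
    "My cat knocked the router off the shelf again.",
    "Anyone else having wifi issues after the latest update?",
    "This is the best pizza place in town, hands down.",
]

# doc.vector is the average of the token vectors; L2-normalise for cosine similarity
matrix = np.array([nlp(c).vector for c in comments])
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

def most_similar(query, n=2):
    q = nlp(query).vector
    q /= np.linalg.norm(q)  # (a query with no known words would need a guard here)
    scores = matrix @ q
    return [(comments[i], float(scores[i])) for i in np.argsort(-scores)[:n]]

print(most_similar("is your internet also down?"))
```

Averaging token vectors at least puts short and long comments on the same scale, though it blurs long comments badly, which I suspect is related to the length effects I mentioned.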
Finally... In my day-to-day software development, I'm one of those annoying "TDD is the one true way" guys… It'd be great to see some content about whether or not it's appropriate to use testing in NLP projects, and where.
@shawn Thanks so much for the feedback and ideas! Just a quick update to let you know that episode 1 of our new series is now live:
I think the format we have in mind for the live trainings wouldn't really make sense as a recording. We want the trainings to take full advantage of the fact that they're "in real life" and in person, and to work closely with the attendees. So you probably wouldn't get that much value from just watching it. If the trainings go well, we're definitely keen to take them on the road. And of course we'll keep publishing free materials for self-study online, like the spaCy course at https://course.spacy.io.
I should say: I imagine the content for the training and the content for the videos will overlap. That said, a training will be a very different experience because of the student-teacher interaction (which will be very hard to mimic on YouTube). Still, my aim with the YouTube videos is that they should be relevant to folks keen on learning.
The way I'm going about the programming problem in the videos (a toy version is sketched below):

1. First make a heuristic that can confirm I'm on to something.
2. If I am, use this heuristic to generate a subset of the data that is easy to label.
3. Label said dataset manually.
4. Feed this to the model.
5. Repeat step 2, but replace the heuristic with the best model, until I'm confident of the approach.
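To make the loop concrete, a self-contained toy version might look like this (everything in it is a stand-in: the "labelling" would really be a Prodigy session and the "training" a proper model update):

```python
examples = [
    "Home | About | Contact",                  # nav junk
    "Sign in  Register  Cart",                 # nav junk
    "Dr. Jane Goodall spoke at the event.",
    "The CEO, John Smith, resigned on Tuesday.",
]

def heuristic(text):
    # step 1: a cheap rule -- very short lines are probably junk
    return 0.9 if len(text.split()) <= 5 else 0.1

def train(labelled):
    # step 4 stand-in: a "model" that memorises words seen in junk examples
    junk_words = {w for text, is_junk in labelled if is_junk
                  for w in text.lower().split()}
    return lambda text: 0.9 if set(text.lower().split()) & junk_words else 0.1

scorer = heuristic
for _ in range(2):                                        # step 5: repeat until confident
    easy = [ex for ex in examples if scorer(ex) > 0.5]    # step 2: easy subset
    labelled = [(ex, True) for ex in easy]                # step 3: pretend we labelled by hand
    scorer = train(labelled)

print([ex for ex in examples if scorer(ex) > 0.5])        # what the "model" now flags
```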
Something similar might work for you: I can imagine that if you label enough menu items as "bad examples", the model should be able to learn from this.
TDD is a good attitude but a bit awkward in ML sometimes, since you should assume that the model is at times inaccurate, and it can be very hard to predict when. My best advice here is to start small and stay small for as long as possible, so that you can spend some time understanding the domain of your problem very well. My experience is that heuristics with domain knowledge usually survive in production, while black-box models need to get replaced the moment they make mistakes that nobody understands.
To conclude: I think video 2, maybe 3, should demonstrate an example using NER. So keep an eye out for that.
Fantastic, thanks for your explanation - it makes a lot of sense, and I'm definitely keeping my eyes peeled for the NER videos.
I know it's an intro to spaCy, but are you going to show the Prodigy workflow in there as well?
Re: TDD, from what I can see, it looks like there might be some value in it for the pipeline process, but not in the ML portion, and even there, probably "tests after" vs. "test-driven", just to ensure that refactorings of the data preparation / collection / cleaning stages work as expected. Would that be on the right track?
Not to worry about Prodigy: there will be plenty of it. What's currently happening in the first video is essentially the pre-work needed to do labelling and NER.
Re: TDD. Yeah, things like pre-processing can be tested, and it's certainly recommended to do so. It's usually also good to test whether an ML system can handle "obvious cases". My point was more that if you're used to TDD in, say, web development, you should expect the experience to be different with ML systems. There are different things that might go wrong. If you're interested in this, PyData London had some good talks on it. I recall these three (the last one is from yours truly).
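To sketch what that splits into in practice (function names are made up; run with pytest, and it assumes en_core_web_sm): a deterministic pre-processing step gets a normal unit test, while the model gets a behavioural smoke test on an obvious case.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def strip_boilerplate(text):
    # deterministic pre-processing -> a normal unit test applies
    return " ".join(text.split())

def test_strip_boilerplate():
    assert strip_boilerplate("hello   \n world") == "hello world"

def test_obvious_person_is_found():
    # behavioural smoke test: only assert an "obvious case", not exact output;
    # it can still fail when the model changes, which is part of the point
    doc = nlp("Barack Obama visited Paris.")
    assert any(ent.label_ == "PERSON" for ent in doc.ents)
```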
Just finished the video - excellent work, and I'm definitely excited to see the rest!
I'm definitely more used to TDD in a web development context, as that's where most of my work has been focused over the past 10 years or so. There's definitely some transition in my thinking required!
Thanks so much for the time and attention on these, and let me know what I can do to support as consistent a stream of them as possible!