Wow, thanks for the responses, and even more so for the forthcoming content! That just made my day!
I know one of the challenges is that every project is relatively unique, and that it's hard to pick representative problems that will be useful for everybody without being too diluted...
Here are 2 specific areas that I'd really like to get some insight into (I've got something suboptimal working after a few days of banging my head against a wall, but I know there are way better / easier / more elegant ways to do them):
When doing web scraping, what are some approaches for pre-processing and dealing with navigation junk data? E.g., I set up a pipeline that uses html2text to give me back just the text, but the NER thinks that many menu items are PERSON names. How can I best train a model that doesn't make that mistake?
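To make the question concrete, here is a rough sketch of the kind of heuristic pre-filter I mean, applied to the plain text before it ever reaches the NER. It is not my actual pipeline: the function names, the `MIN_WORDS` threshold, and the "ends like a sentence" rule are all assumptions for illustration, and it assumes menu entries come through as short lines, possibly with Markdown-style list markers.

```python
import re

# Hypothetical threshold, not tuned on real data.
MIN_WORDS = 5

# A line "ends like a sentence" if it finishes with . ! or ?,
# optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")

# Markdown-style list markers such as "* ", "- " or "+ ".
LIST_MARKER = re.compile(r"^[*\-+]\s+")


def looks_like_prose(line: str) -> bool:
    """Heuristic: keep lines that are long enough and end like a sentence."""
    stripped = LIST_MARKER.sub("", line.strip())
    if not stripped:
        return False
    return len(stripped.split()) >= MIN_WORDS and bool(SENTENCE_END.search(stripped))


def drop_navigation_junk(text: str) -> str:
    """Return only the lines that look like running prose."""
    return "\n".join(line for line in text.splitlines() if looks_like_prose(line))


if __name__ == "__main__":
    sample = (
        "Home\n"
        "* About Us\n"
        "* John Smith Gallery\n"
        "Dr. Jane Doe joined the hospital in 2015.\n"
        "Contact"
    )
    print(drop_navigation_junk(sample))
```

A filter like this will also throw away real content (headlines, short captions), which is exactly why I'd like to know whether training the model to ignore menus is the better route.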
How can I update the NER for "PERSON" to be more useful and accurate for populating a database (i.e., it grabs first + last name combos only; or, if it's a "Mr. X" or "Dr. Y", it can populate a LAST_NAME attribute)?
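For the PERSON question, this is roughly the rule-based post-processing I can imagine bolting on after the NER, as opposed to retraining it. It is only a sketch under assumptions: `parse_person`, `PersonRecord`, and the `HONORIFICS` set are hypothetical names, and it assumes the entity text handed over still includes the title (if the NER excludes "Dr." from the span, the preceding token would need to be checked instead).

```python
from typing import Optional, TypedDict

# Illustrative list only; a real one would need to be much longer.
HONORIFICS = {"mr", "mrs", "ms", "dr", "prof"}


class PersonRecord(TypedDict):
    first_name: Optional[str]
    last_name: str


def parse_person(entity_text: str) -> Optional[PersonRecord]:
    """Turn the text of a PERSON entity into database fields, or reject it.

    Accepted shapes:
      "Jane Doe"        -> first_name="Jane", last_name="Doe"
      "Dr. Watson"      -> first_name=None,   last_name="Watson"
      "Mr. John Smith"  -> first_name="John", last_name="Smith"
    Anything else (single bare token, lowercase words) returns None.
    """
    tokens = entity_text.split()
    has_title = bool(tokens) and tokens[0].rstrip(".").lower() in HONORIFICS
    if has_title:
        tokens = tokens[1:]

    # Every remaining token must start with a capital letter.
    if not tokens or not all(t[0].isupper() for t in tokens):
        return None

    if len(tokens) == 1:
        # A lone surname is only trusted when a title came with it.
        if has_title:
            return {"first_name": None, "last_name": tokens[0]}
        return None

    # Middle names, if any, are ignored in this sketch.
    return {"first_name": tokens[0], "last_name": tokens[-1]}


if __name__ == "__main__":
    for text in ["Jane Doe", "Dr. Watson", "Mr. John Smith", "Madonna", "about us"]:
        print(repr(text), "->", parse_person(text))
```

This obviously breaks on plenty of real names (particles like "van der", non-Latin scripts, single-name people), so I'd treat it as a baseline to compare a trained model against rather than a solution.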
On perhaps a larger scale, it'd be interesting to have a number of "multi-part series" going step-by-step (i.e., super-beginner, hand-holding level) through a medium-sized / medium-interesting project... It sounds like you've already got some in the pipeline, which is awesome!
Here's one I think would be interesting: "Let's assume you've got a database of content of varying lengths (can probably use Reddit comments) - when you type in a comment, find the N comments in your corpus that are most like the one you typed."
I'd imagine there are some interesting questions that could be tackled with a problem like that... When I built a system a couple of years ago to do something similar, I noticed that text length affected what came back (i.e., short comments only matched short documents, and long comments only matched long documents). Is it better to match based on the sum of predicted category scores, or on semantic similarity? Etc.
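As a baseline for that "find the N most similar comments" idea, here is a small pure-Python sketch using TF-IDF vectors scaled to unit length and cosine similarity. This is not the system I built back then; the function names and the smoothed IDF formula are my own assumptions for illustration. Scaling each vector to unit length is one common way to reduce the influence of raw document length, though I'm not claiming it removes the length effect entirely.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def tokenize(text: str) -> List[str]:
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_idf(corpus_tokens: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus_tokens)
    doc_freq: Counter = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Vector:
    """Sublinear TF times IDF, scaled to unit length (terms unknown to idf are dropped)."""
    counts = Counter(tokens)
    vec = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def most_similar(query: str, corpus: List[str], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n corpus entries with the highest cosine similarity to query."""
    corpus_tokens = [tokenize(doc) for doc in corpus]
    idf = build_idf(corpus_tokens)
    doc_vecs = [tfidf_vector(tokens, idf) for tokens in corpus_tokens]
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = sorted(
        ((cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)),
        key=lambda pair: (-pair[0], pair[1]),  # best score first, ties by position
    )
    return [(corpus[i], score) for score, i in scored[:n]]


if __name__ == "__main__":
    comments = [
        "I love my cat",
        "The stock market fell sharply today after the central bank raised rates again",
        "My cat sleeps all day and my dog barks at the mail carrier every single morning",
    ]
    for text, score in most_similar("cat", comments, n=3):
        print(f"{score:.3f}  {text}")
```

Even here, the short comment should outrank the long one that mentions the same word, because the matching term makes up a bigger share of a short vector. That is the kind of length interaction I'd love to see a tutorial dig into.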
Finally... In my day-to-day software development, I’m one of those annoying “TDD is the one true way” guys… It'd be great to see some content about whether or not it's appropriate to use testing in NLP projects, and where.
Thanks for considering these suggestions!!!