I’m pretty new to NLP, and I’m just getting started on a project where I’m trying to parse free-text queries for a specific domain (e.g. books). For example, someone might query `horror books by stephen king that aren't hardcovers`. In my domain’s case, I have `stephen king` and `hardcover` both denoted as keywords (or I guess keyphrases, since an author name is multi-word), and I want all of those to be detected; in particular, `hardcover` should register that it’s been negated. But an equally valid query might simply be `king`, which would mean all books written by Stephen King (or any other author with the surname King). Or `2ed paperback`, which would mean the 2nd edition of any book that’s also a paperback.
The problem I’ve been running into is that multi-word keyword matching is tricky, especially if you want those keyphrases to be recognized as a single entity so you can do things like negation checking (e.g. `books not by stephen king`).

I’ve tried the `PhraseMatcher`, but that doesn’t take lemmas into account, so in my initial example `hardcover` wouldn’t be found, since it’s pluralized in the original query.
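For reference, here’s roughly what my `PhraseMatcher` attempt looked like (a minimal sketch, assuming spaCy v3 and the `en_core_web_sm` model; the keyphrase list is just illustrative):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# PhraseMatcher compares token attributes verbatim (here LOWER), so the
# plural "hardcovers" in the query never matches the "hardcover" pattern.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
keyphrases = ["stephen king", "hardcover", "paperback"]  # illustrative list
matcher.add("KEYPHRASE", [nlp.make_doc(kp) for kp in keyphrases])

doc = nlp("horror books by stephen king that aren't hardcovers")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# prints "stephen king" only -- "hardcovers" is missed
```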
I’ve also tried plugins like `spacy-lookup`, but to make that work I have to lemmatize the whole query first, and then it’s hard to map the matches back to the original tokens to check their dependencies, since the lemmatized query and the original query end up different lengths (I’ve put a sketch of what I’d want instead further down).

So I was looking through the spaCy docs and found training the parser for custom semantics, which seemed perfect: I can annotate a bunch of sample queries with custom dependency labels, train on them, and off we go. But I wasn’t sure if that approach would work for ultra-short queries like the ones I’ve mentioned. And I’d also still want keyphrase recognition on top of it, since I have a complete list of terms that should be caught in a query.
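For the parser idea, this is the kind of annotation I had in mind, in the `(heads, deps)` format from spaCy’s `train_intent_parser` example script (the labels `GENRE`/`AUTHOR`/`NEGATION` are made-up, domain-specific ones, and I know spaCy v3 would want this converted to `Example` objects):

```python
# Hypothetical training annotations: each entry in "heads" is the index
# of the token that the token at that position attaches to; "-" marks
# tokens we don't care about. All labels are invented for this domain.
TRAIN_DATA = [
    (
        "horror books by stephen king",
        {
            "heads": [1, 1, 4, 4, 1],
            "deps": ["GENRE", "ROOT", "-", "-", "AUTHOR"],
        },
    ),
    (
        "books not by stephen king",
        {
            "heads": [0, 4, 4, 4, 0],
            "deps": ["ROOT", "NEGATION", "-", "-", "AUTHOR"],
        },
    ),
]
```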
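And to make the earlier alignment problem concrete, here’s a sketch of what I’d want from keyphrase matching: match by lemma directly on the original doc, so the match offsets still index into the original tokens and I can walk the dependency tree for negation. This uses the token-based `Matcher` with `LEMMA` patterns plus a very naive negation check; the patterns are just illustrative, and I haven’t worked out whether this scales to my full keyword list:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Token-level patterns on LEMMA: "hardcovers" matches via its lemma
# "hardcover", and (start, end) still index into the original doc.
matcher.add("BINDING", [[{"LEMMA": "hardcover"}], [{"LEMMA": "paperback"}]])
matcher.add("AUTHOR", [[{"LOWER": "stephen"}, {"LOWER": "king"}]])

doc = nlp("horror books by stephen king that aren't hardcovers")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # Naive negation check: look for a "neg" dependency attached to any
    # ancestor of the matched span's root (catches "aren't hardcovers").
    negated = any(child.dep_ == "neg"
                  for tok in span.root.ancestors
                  for child in tok.children)
    print(nlp.vocab.strings[match_id], "->", span.text, "| negated:", negated)
```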
Given all that, what’s the best approach to take with this problem? (Also, sorry if this is the wrong forum to ask about this; I wasn’t sure where the right place to ask was.)