Best methods for creating a domain-specific query parser?

Hi,
I’m pretty new to NLP stuff, and I’m just getting started on a project where I’m trying to parse free-text queries for a specific domain (e.g. books). For example, someone might query “horror books by stephen king that aren’t hardcovers”. In my domain’s case, I have “horror”, “stephen king” and “hardcover” all denoted as keywords (or I guess keyphrases, since the author name is multi-word), and I want all of those to be detected, and in particular “hardcover” should register that it’s been negated. But an equally valid query might simply be “king”, which would mean all books written by Stephen King (or any other author with the surname King). Or “2ed paperback”, which would mean the 2nd edition of any book that’s also a paperback.

The problem I’ve been running into is that multi-word keyword matching is tricky, especially if you want those keyphrases to be recognized as a single entity so you can do things like negation checking (e.g. “books not by stephen king”). I’ve tried the PhraseMatcher, but that doesn’t take lemmas into account (so in my initial example, “hardcover” wouldn’t be found, since it’s plural in the original query). I’ve also tried plugins like spacy-lookup, but to make that work I have to lemmatize my whole query, and it’s then difficult to map back to the original tokens and their dependencies (since the lemmatized query and the original query have different lengths). So I was looking through the spaCy docs and found the example on training a parser for custom semantics, which seemed perfect: I can annotate a bunch of sample queries with custom semantics, train it, and off we go. But I wasn’t sure whether that approach would work for something like the ultra-short queries I’ve mentioned. And I also still want keyphrase recognition, since I have a complete list of things that should be caught in a query.
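To make the PhraseMatcher issue concrete, here’s roughly the kind of setup I mean (the keyphrase list is just a toy stand-in for my real vocabulary, and it assumes a recent spaCy with the small English model installed):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Toy keyphrase list standing in for the real domain vocabulary
keyphrases = ["horror", "stephen king", "hardcover"]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("KEYPHRASE", [nlp(text) for text in keyphrases])

doc = nlp("horror books by stephen king that aren't hardcovers")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# Prints "horror" and "stephen king", but misses "hardcovers" because the
# matching is on surface text rather than lemmas.
```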

Given all that, what’s the best approach to take with this problem? (Also, sorry if this is the wrong forum to ask about this, I wasn’t sure where the right place was to ask.)

Query text is definitely hard. It’s probably worth noting that normal search and information retrieval systems don’t usually do this. Systems like Lucene, Solr and Elasticsearch are far from perfect, but they’ve been honed by a lot of experience on search problems.

Have you considered a hybrid approach? You could use a vanilla information retrieval system as a fallback, and classify queries according to whether they have internal structure that you want to parse. If someone just writes “king”, you can let the normal IR system handle it. If they write “books not written by stephen king”, maybe it’s worth doing the special processing. One good thing about this approach is that it should at least be easy to make sure you’re not doing worse than a normal IR system: you can make the query selector pretty conservative, and only do the special processing when it’s quite sure it’s really better.
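Just to make the routing idea concrete, here’s a rough sketch. The cue list is deliberately minimal, and the ir_search / special_parse callables are placeholders for whatever backend and parser you end up with, not real APIs:

```python
# Conservative query router: only send a query to the special parser when it
# shows signs of internal structure (here, just a negation cue as an example).
NEGATION_CUES = {"not", "n't", "without", "except"}

def needs_special_processing(doc):
    # Very short queries like "king" or "2ed paperback" go straight to IR
    if len(doc) < 3:
        return False
    return any(tok.lower_ in NEGATION_CUES for tok in doc)

def handle_query(nlp, query, ir_search, special_parse):
    doc = nlp(query)
    if needs_special_processing(doc):
        return special_parse(doc)
    return ir_search(query)  # hand off to Lucene / Solr / Elasticsearch etc.
```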

As for how to actually do the special processing, I can think of two main ways to go about it. One is to use rules to transform the query, for instance mapping negations into reworded queries: “horror books by stephen king that aren’t hardcovers” might become “softcover stephen king horror books”. The other is to try to map the natural-language query into an explicit SQL query. Neither approach will be that easy, and the specifics will depend on the data. You probably want to try using the pretrained dependency parse, as it’s the best way to get at the relationships you’ll be interested in. Have a look at the demo here to get a feel for it: https://explosion.ai/demos/displacy
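For the negation side of the rules, something along these lines is a reasonable starting point. It’s only a sketch: the span is picked by hand for the example, and exactly how the negation attaches in the parse will vary with the model and the query:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def span_is_negated(span):
    # Heuristic: look for a "neg" dependency on the span's root or any of
    # its ancestors in the parse tree.
    for tok in [span.root, *span.root.ancestors]:
        if any(child.dep_ == "neg" for child in tok.children):
            return True
    return False

doc = nlp("horror books by stephen king that aren't hardcovers")
hardcovers = doc[-1:]  # picked by hand here; normally this span comes from your matcher
if span_is_negated(hardcovers):
    # e.g. rewrite "hardcover" -> "softcover" in the transformed query
    print("negated:", hardcovers.text)
```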