Hi, we are total newbies in NLP, just starting to learn on a real, live project. We need to ingest hundreds of thousands of survey question-response pairs. The set contains surprisingly many different formulations of the same core questions.
An example would be:
(a) How satisfied are you with access to water quality data for the northeastern part of the City?
(b) Indicate your level of agreement with the following statement: I have easy access to water quality data for the northeastern part of the City.
(c) Rate the accessibility of water quality information for the northeastern part of the City.
These are essentially the same question.
We need to boil down the questions into their generic equivalents.
So far we are doing this by hand, like this:
We are hoping to use Prodigy's tagging or something similar for our training set. Instead of the POS labels shown in the picture below (adjective, noun, etc.), we would like to use our own categories, which we are still defining.
Is Prodigy's interface capable of this, or are we on the wrong path? Any help would be greatly appreciated, thanks!
I do think you'll be able to make good progress on this, either with Prodigy or even just with spaCy alone.
One way to do this is as a text classification task: if you only have a limited number of the core questions, you can make those the categories. Another way is to identify keywords. You might find that if a question asks about "water quality", there's only one fundamental question it could be asking about.
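To make the keyword idea concrete, here is a minimal sketch in plain Python (no spaCy or Prodigy needed). The generic question labels and the keyword phrases in GENERIC_QUESTIONS are made up for illustration, as are the helper names normalize and match_generic_question; you would replace them with your own categories once you have defined them.

```python
import re

# Hypothetical mapping from a generic question to the keyword phrases that
# signal it. Both the labels and the phrases are invented for this example.
GENERIC_QUESTIONS = {
    "Rate access to water quality data": ["water quality"],
    "Rate trash collection service": ["trash collection", "garbage pickup"],
}


def normalize(text):
    """Lowercase the text and keep only alphanumeric tokens, joined by spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def match_generic_question(question):
    """Return the first generic question whose keyword phrase occurs, else None."""
    # Pad with spaces so phrases only match on whole-word boundaries.
    padded = f" {normalize(question)} "
    for generic, phrases in GENERIC_QUESTIONS.items():
        if any(f" {normalize(phrase)} " in padded for phrase in phrases):
            return generic
    return None


variants = [
    "How satisfied are you with access to water quality data for the "
    "northeastern part of the City?",
    "Indicate your level of agreement with the following statement: I have "
    "easy access to water quality data for the northeastern part of the City.",
    "Rate the accessibility of water quality information for the "
    "northeastern part of the City.",
]

for variant in variants:
    print(match_generic_question(variant), "<-", variant)
```

All three formulations from the original post land on the same generic question here because each contains "water quality". Questions that hit no rule come back as None, and those are the ones worth annotating by hand or sending to a text classifier.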
I would avoid tagging at the word level initially, as I don't think it's the most effective approach. You'll probably get more out of either a rule-based phrase identification approach, or learning categories at the sentence level.
Another option for you to consider is spaCy's dependency parser. If you load your questions up in spaCy, you can do stuff like doc.noun_chunks, which will probably be useful to you. You can also get the head word of a word or phrase, or its children. You can play around with the dependency parser demo here: https://explosion.ai/demos/displacy
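As a rough sketch of what that looks like in code: the snippet below assumes spaCy is installed along with the small English model en_core_web_sm (any English pipeline with a parser should work). The helpers clean_chunk, noun_phrases and demo are hypothetical names for this example, not part of spaCy; the spaCy attributes used are doc.noun_chunks, chunk.root and token.head.

```python
def clean_chunk(chunk_text):
    """Lowercase a chunk and drop a leading determiner or possessive."""
    words = chunk_text.lower().split()
    if words and words[0] in {"a", "an", "the", "your", "my", "our"}:
        words = words[1:]
    return " ".join(words)


def noun_phrases(doc):
    """Return (cleaned chunk text, head of the chunk's root word) pairs."""
    return [
        (clean_chunk(chunk.text), chunk.root.head.text)
        for chunk in doc.noun_chunks
    ]


def demo():
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(
        "Rate the accessibility of water quality information for the "
        "northeastern part of the City."
    )
    for phrase, head in noun_phrases(doc):
        print(f"{phrase!r} attaches to {head!r}")
```

Calling demo() prints each noun chunk next to the word it attaches to in the parse. The exact chunks depend on the model version, so it is worth checking the output on a sample of your own questions before building rules on top of it.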
Thank you, @honnibal, for the tips. It sounds like one decent way to start could be to associate a generic phrase like "Rate water quality" with an original sentence "Satisfaction with access to water quality data for NE part of the City" based on a keyword-based evaluation of the entire original sentence. Then we could run the same original sentence through doc.noun_chunks to get the other relevant parts/words. Or we could begin with doc.noun_chunks. Hope I didn't confuse myself here, still researching the details of your advice.