Sentence Decomposition or Multi-class Classification from Newbies?

njones · November 16, 2019, 7:57pm

Hi, we are total newbies in NLP, just starting to learn on a real, live project. We need to ingest hundreds of thousands of survey question-response pairs. The set contains surprisingly many different formulations of the same core questions.

An example would be:
(a) How satisfied are you with access to water quality data for the northeastern part of the City?
(b) Indicate your level of agreement with the following statement: I have easy access to water quality data for the northeastern part of the City.
(c) Rate the accessibility of water quality information for the northeastern part of the City.

These are essentially the same question.

We need to boil down the questions into their generic equivalents.
So far we are doing this by hand, like this:

We are hoping to use Prodigy's tagging or something similar for our training set. Instead of the POS elements shown in the picture below (adjective, noun, etc.), we would like to use our own categories, which we are still defining.

Is Prodigy's interface capable of this, or are we on the wrong path? Any help would be greatly appreciated, thanks!

-Nick

honnibal · November 18, 2019, 4:56pm

Hi @njones,

I do think you'll be able to make good progress on this, either with Prodigy or even just with spaCy alone.

One way to do this is as a text classification task: if you only have a limited number of the core questions, you can make those the categories. Another way is to identify keywords. You might find that if a question asks about "water quality", there's only one fundamental question it could be asking about.
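To make the keyword idea concrete, here is a minimal sketch in plain Python. It assumes a small hand-written table that maps a generic question label to keyword phrases; the label `water_quality_data_access`, the phrases, and the helper names are all made up for illustration and would need to be replaced with your own categories.

```python
import re

# Hypothetical table: generic question label -> keyword phrases.
# Both the label and the phrases are illustrative assumptions.
CORE_QUESTIONS = {
    "water_quality_data_access": [
        "water quality data",
        "water quality information",
    ],
}


def normalize(text):
    """Lowercase the text and keep only alphanumeric words, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def match_core_questions(question):
    """Return the sorted labels whose keyword phrases occur in the question."""
    # Pad with spaces so phrases only match on whole-word boundaries.
    padded = f" {normalize(question)} "
    return sorted(
        label
        for label, phrases in CORE_QUESTIONS.items()
        if any(f" {normalize(phrase)} " in padded for phrase in phrases)
    )
```

With this table, all three example formulations from the original post map to the same label. Questions that match no phrase return an empty list, which is a useful signal that the table needs another entry or that the question is better handled by a trained classifier.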

I would avoid tagging at the word level initially, as I don't think it's the most effective approach. You'll probably get more out of either a rule-based phrase identification approach or learning categories at the sentence level.

Another option for you to consider is spaCy's dependency parser. If you load your questions up in spaCy, you can do stuff like doc.noun_chunks, which will probably be useful to you. You can also get the head word of a word or phrase, or its children. You can play around with the dependency parser demo here: https://explosion.ai/demos/displacy
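Here is a rough sketch of what that exploration could look like. The helper names `describe_noun_chunks` and `demo` are made up for this example, and `demo` assumes spaCy and the `en_core_web_sm` model are installed; the parse you get will depend on the model version, so treat the printed rows as something to inspect rather than a fixed result.

```python
def describe_noun_chunks(doc):
    """Collect, for each noun chunk, its text, root word, dependency label,
    the root's head word, and the root's direct children."""
    rows = []
    for chunk in doc.noun_chunks:
        root = chunk.root  # the token that heads the noun chunk
        rows.append({
            "chunk": chunk.text,
            "root": root.text,
            "dep": root.dep_,         # dependency label linking root to its head
            "head": root.head.text,   # the word the chunk attaches to
            "children": [child.text for child in root.children],
        })
    return rows


def demo():
    """Parse one of the example questions and print its noun chunks.

    Assumes spaCy and the en_core_web_sm model are installed.
    """
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(
        "Rate the accessibility of water quality information "
        "for the northeastern part of the City."
    )
    for row in describe_noun_chunks(doc):
        print(row)
```

Calling `demo()` prints one dictionary per noun chunk. Looking at the `head` column across different formulations of the same question is one way to see which phrases carry the core topic and which are just framing.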

njones · November 22, 2019, 2:53am

Thank you, @honnibal, for the tips. It sounds like one decent way to start could be to associate a generic phrase like "Rate water quality" with an original sentence "Satisfaction with access to water quality data for NE part of the City" based on a keyword-based evaluation of the entire original sentence. Then we could run the same original sentence through doc.noun_chunks to get the other relevant parts/words. Or we could begin with doc.noun_chunks. Hope I didn't confuse myself here, still researching the details of your advice.
