What is the best practice with Prodigy for aspect-level sentiment analysis (opinion term extraction)?
Example 1:
Delivery was very fast. But they wont reply to my questions.
delivery = positive, customer service = negative
Example 2:
Although the price is high they do what they promise. Got an answer in 5 minutes after contacting support, couldn’t be better.
price = negative, customer service = positive
I think this task presents a number of challenges for annotation, because it’s difficult to design the schema precisely. I would probably encourage you to limit the number of target topics to a smallish set, e.g. delivery, customer service, price, quality etc. Then you can do the annotation as a text classification task, with multiple labels possible on the text.
Modelling the problem as text classification has the advantage that you’re not running into problems identifying which words in a text constitute the target. As we see in your second example, the phrase “customer service” never occurs. Even the noun phrase which does anchor the reference (support), doesn’t have a direct syntactic relationship with the word which ascribes the sentiment (better).
The disadvantage of the text classification approach is that the set of sentiment targets is fixed, and in fact has to be designed into the annotation scheme. If you use this kind of system, you can’t discover negative sentiment about something new, e.g. the online booking system.
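For a concrete starting point, the annotation command could look something like this. It's only a sketch: the dataset name and input file are placeholders, and you should check prodigy textcat.manual --help in your version for the exact flags.

```
prodigy textcat.manual reviews_aspects ./reviews.jsonl --label DELIVERY,CUSTOMER_SERVICE,PRICE,QUALITY
```

One design decision to make early is whether to fold the polarity into the labels (e.g. DELIVERY_POSITIVE, DELIVERY_NEGATIVE) or to annotate aspects and sentiment in two separate passes; two simple passes are often faster and more consistent than one complex one.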
I’d say that the questions around how to do this best with Prodigy are the same questions you’ll face doing the task with other tools. Prodigy makes you confront some of these questions more explicitly, which we see as a big advantage — but no matter what tooling you use, the same issues will be there.
I got interested in this thread as I’m looking at similar scenarios myself. Reading it gave me an idea for an approach. I don’t have a code solution, just a theory.
If you map the text “Although the price is high they do what they promise. Got an answer in 5 minutes after contacting support, couldn’t be better.” in displaCy, you get this graph
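(For anyone who wants to reproduce the parse rather than squint at a screenshot, this is roughly all it takes; the model name is just the small English pipeline, and any model with a parser will do.)

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Although the price is high they do what they promise. "
    "Got an answer in 5 minutes after contacting support, couldn't be better."
)
# Serves the dependency parse visualisation on http://localhost:5000
displacy.serve(doc, style="dep")
```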
The idea I got is this:
In both your examples, splitting the text into sentences somewhat works to keep the sentiment and its subject together and to separate them from the other statement (there’s a small sketch of this after the examples).
Example 1:
Delivery was very fast.
But they wont reply to my questions.
Example 2:
Although the price is high they do what they promise.
Got an answer in 5 minutes after contacting support, couldn’t be better.
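A minimal sketch of that split, just using spaCy's default sentence segmentation (nothing Prodigy-specific here):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

texts = [
    "Delivery was very fast. But they wont reply to my questions.",
    "Although the price is high they do what they promise. "
    "Got an answer in 5 minutes after contacting support, couldn't be better.",
]

for text in texts:
    doc = nlp(text)
    for sent in doc.sents:
        print(sent.text)
```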
But in reality, it’s probably not going to be that cleanly defined. I can really see people writing the first example as one sentence, which gives this graph
So I came up with this theoretical method:
Break this into two units of text, one per sentence.
Break each sentence into verb groups (there’s a rough sketch of this after the list).
Although the price is high
they do what they promise
Got an answer in 5 minutes after contacting support
couldn’t be better
Classify each by subject: support, price, etc.
Analyze sentiment on each.
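I don't have real code for this, but here's a rough sketch of steps 1 and 2 using the dependency parse. The CLAUSE_DEPS set and the whole "verb group" heuristic are just my guess at how to operationalise it, and the extracted spans can overlap, since a parent verb's subtree contains its children's:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Guess at which dependency relations mark clause-level verbs.
CLAUSE_DEPS = {"ROOT", "advcl", "conj", "ccomp", "xcomp"}

def clause_spans(sent):
    """Crude 'verb group' heuristic: one span per clausal verb's subtree."""
    for token in sent:
        if token.pos_ in ("VERB", "AUX") and token.dep_ in CLAUSE_DEPS:
            subtree = list(token.subtree)
            yield sent.doc[subtree[0].i : subtree[-1].i + 1]

doc = nlp(
    "Although the price is high they do what they promise. "
    "Got an answer in 5 minutes after contacting support, couldn't be better."
)
for sent in doc.sents:
    for span in clause_spans(sent):
        print(repr(span.text))
```

Each extracted span could then go through a subject classifier and a sentiment model, per steps 3 and 4.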
This also extracts one new sentiment: the phrase “they do what they promise” could be considered positive feedback on your service overall.
It’s not perfect, and this “complex” statement is probably a typical feedback text, unfortunately.
Of course, we don't want to generalise from two examples :). A good way to check this sort of thing is to spin up a small Prodigy task, so you can annotate how often there's a one-to-one relationship between sentences and sentiments in your data. This sort of analysis is very useful when making modelling decisions. You should definitely spend at least an afternoon with the data collecting these sorts of statistics before making assumptions.
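To prepare that task, you could pre-split the raw feedback into one example per sentence with something like the snippet below (file names are just placeholders), and then feed sentences.jsonl into whatever recipe you use:

```python
# Hypothetical prep script: one Prodigy task per sentence.
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("feedback.txt", encoding="utf8") as f_in, \
     open("sentences.jsonl", "w", encoding="utf8") as f_out:
    for i, line in enumerate(f_in):
        doc = nlp(line.strip())
        for sent in doc.sents:
            task = {"text": sent.text, "meta": {"source": i}}
            f_out.write(json.dumps(task) + "\n")
```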
Give it a try! But rather than focussing on an accurate model for now, run an auxiliary test to see how well your assumptions hold. A good trick for this type of situation is what's called an oracle experiment. Let's say you model the problem with some pipeline, where you first segment the text and then apply a label to each segment. The pipeline approach makes some simplifying assumptions about how the data works, which won't always hold. When the assumptions don't hold, there's no correct answer. The oracle experiment assumes you have a perfect classifier for one stage of the process, and then asks, "If I do get that part right, how often can I be correct? What's my upper bound?" If the upper bound is too low, you know the approach isn't really workable: the simplifying assumption simplifies too much.
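As a toy illustration of the upper-bound idea (the data format and the scoring rule here are entirely made up for the example): if the segmenter were perfect and each segment could receive exactly one aspect/sentiment label, the best you could ever do is recover min(number of gold labels, number of segments) labels per text.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical gold data: text plus the aspect/sentiment pairs it should yield.
gold = [
    ("Delivery was very fast. But they wont reply to my questions.",
     [("delivery", "positive"), ("customer service", "negative")]),
    ("Although the price is high they do what they promise. "
     "Got an answer in 5 minutes after contacting support, couldn't be better.",
     [("price", "negative"), ("customer service", "positive")]),
]

recoverable = total = 0
for text, labels in gold:
    n_segments = len(list(nlp(text).sents))
    total += len(labels)
    # With one label per segment, at most n_segments labels can be recovered.
    recoverable += min(len(labels), n_segments)

print(f"Oracle upper bound: {recoverable}/{total} = {recoverable / total:.0%}")
```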