My dataset consists of apparel images and their corresponding product descriptions, like the one below:
Meet your simplest summer outfit. Designed with a relaxed straight-leg fit, the Super-Soft Summer Jean Coverall features a buttoned front, side pockets, and a short-sleeve silhouette that’s great for warm weather.
Let's say we focus on one particular aspect of apparel, say 'Season'. Using text classification or image classification, I am able to classify the dress as belonging to the 'Summer' class. The goal is to find the words/phrases that are related to 'Summer', for example, "great for warm weather". So, across the entire dataset, for every class within the 'Season' attribute, we would have a list of associated phrases/words. Something like this:
Do you actually need a list of phrases as output, or do you just need a correctly predicted label at the end of the pipeline?
In general, I would recommend a text classification approach for this challenge, as that will automatically pick up on phrases or words that are important for predicting the correct label. But it might not be trivial to actually get those phrases out of the model weights. The ML model will be more complex than just consulting a list of phrases, which is good for accuracy but less so for interpretability. There might also be multiple clues, and combinations of clues, in the description that point to "Summer" (or any other season).
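If a readable list of phrases per class is what you need, one simpler alternative to digging through model weights is to count n-grams per label and rank them by how strongly they are associated with each class. To be clear, this is a different and much cruder technique than a trained text classifier, not a way of reading phrases out of one. The sketch below uses only the standard library; the function names, the toy descriptions and the smoothing choice are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def ngrams(text, max_n=3):
    """Yield all word n-grams of length 1..max_n from the lowercased text."""
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def top_phrases(labelled_texts, top_k=3, min_count=2):
    """Rank n-grams per label by a smoothed log-odds ratio (label vs. all other labels)."""
    doc_counts = defaultdict(Counter)  # label -> n-gram -> number of texts containing it
    n_docs = Counter()                 # label -> number of texts
    for text, label in labelled_texts:
        n_docs[label] += 1
        doc_counts[label].update(set(ngrams(text)))
    total_docs = sum(n_docs.values())

    result = {}
    for label, counts in doc_counts.items():
        other_docs = total_docs - n_docs[label]
        scored = []
        for gram, in_label in counts.items():
            if in_label < min_count:
                continue  # skip n-grams seen too rarely to be meaningful
            in_other = sum(doc_counts[o][gram] for o in doc_counts if o != label)
            # Add-one smoothing keeps both probabilities strictly between 0 and 1.
            p_in = (in_label + 1) / (n_docs[label] + 2)
            p_out = (in_other + 1) / (other_docs + 2)
            score = math.log(p_in / (1 - p_in)) - math.log(p_out / (1 - p_out))
            scored.append((score, gram))
        # Highest score first; on ties, prefer longer phrases, then alphabetical order.
        scored.sort(key=lambda item: (-item[0], -len(item[1].split()), item[1]))
        result[label] = [gram for _, gram in scored[:top_k]]
    return result

# Made-up toy data, only to show the shape of the input and output.
data = [
    ("Great for warm weather and beach days.", "SUMMER"),
    ("A breezy dress made for warm weather.", "SUMMER"),
    ("Insulated parka built for cold weather.", "WINTER"),
    ("Wool lining keeps you cozy in cold weather.", "WINTER"),
]
phrases = top_phrases(data)
for label, grams in phrases.items():
    print(label, grams)
```

On this toy data, phrases containing "warm weather" rank highest for SUMMER and "cold weather" for WINTER, while the shared word "weather" scores low because it appears in both classes. It has no notion of negation or context, so "not appropriate for warm weather" would still count towards whatever label that text carries.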
With NER, you really would be trying to determine the individual phrases, but I think this approach will be a challenge. The phrases might not be contiguous or even well-defined (is it "warm weather" or "great for warm weather", and what if the text says "not appropriate for warm weather"?). What if it says "Now that winter's gone - get ready for summer!" - what will you annotate?
Thank you for your reply! Yes, I need the actual phrases, and I was wondering whether the attention scores could be used to point out the phrase(s) in a given sentence that led to it being classified as summer, winter, etc. Is that a common approach?
I agree with @SofieVL that framing this as a pure span extraction task will likely be inefficient because the boundaries of the spans are very ambiguous and potentially disjoint, and it'll be very difficult to create consistent data for it. The information about the season may also be spread across multiple keywords and indicators, including how the sentence is phrased, which a text classifier will be much better suited for.
This thread might be relevant to you, and it actually deals with a very similar task (extracting season compatibility for garments from reviews):
I actually just did a talk where I walked through a similar problem (span extraction vs. labels over the whole text), which might be interesting as well. The example starts around 13:00: https://www.youtube.com/watch?v=mJqFI7vhqdA
So for your use case, you could start by classifying the texts by season, which should be pretty straightforward to label and predict. You can still use keywords to pre-select and pre-label the examples, e.g. if a text contains "warm weather", it'll likely be SUMMER. If you have a text classifier that gives you the attention scores for the individual tokens, this would give you insights into what the strongest signals were that led to the prediction.
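As a rough illustration of the keyword pre-selection idea, here is a minimal sketch in plain Python. The keyword lists and the `prelabel` helper are hypothetical and only meant to show the principle; they are not part of any particular tool's API.

```python
import re

# Hypothetical seed keywords per season (illustrative assumptions, not a vetted list).
SEASON_KEYWORDS = {
    "SUMMER": ["warm weather", "summer", "short-sleeve"],
    "WINTER": ["cold weather", "winter", "insulated"],
}

def prelabel(text):
    """Return the set of season labels whose seed keywords occur in the text."""
    lowered = text.lower()
    labels = set()
    for label, keywords in SEASON_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries avoid matching a keyword inside a longer word.
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                labels.add(label)
                break
    return labels

examples = [
    "A short-sleeve silhouette that's great for warm weather.",
    "Not appropriate for warm weather.",
    "Now that winter's gone - get ready for summer!",
]
for text in examples:
    print(sorted(prelabel(text)), "<-", text)
```

The second and third examples show why this is only useful for pre-selecting and pre-labelling: the negated sentence is still suggested as SUMMER, and the last one matches both seasons. The suggestions still need to be accepted or rejected by a person before they are used to train the classifier.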
However, what you do next depends on your end goal and what you're actually looking to do with those tokens/phrases. If those will be consumed by some other downstream application, you might want to consider how useful those will actually be: you'll likely end up with arbitrary tokens that don't necessarily have anything in common or represent the same type of syntactic constituents. So it might be difficult to actually do anything useful with them.