NLP - Span Categorization - compound words or wordsstucktogetherlikethis

My intention is to convert loose messages about product leads and convert them into their categories. Let's say I have 3 categories: buy_intent, product, price.

Sometimes, messages turn into things like this: ltbbanana10usd

where 'ltb' = buy_intent, 'banana' = product, '10usd' = price

Should I tag this whole thing as all 3 categories? I was thinking if something was classified as more than one thing at a high accuracy, I could then process it after the fact and 'split' it.

Another thought I had was creating a 'compound_word' category and then tagging that as only that, then dealing with those categories in some other process.

Does span cat work in these situations where words are stuck together but belong to multiple categories?

Should I be making a big effort to sort of pre-process this? I'm afraid of pre-processing as I might split some things that don't need to be split, tainting my data-set.

Hi @jrouss and welcome to the forum :slight_smile:

Should I tag this whole thing as all 3 categories? I was thinking if something was classified as more than one thing at a high accuracy, I could then process it after the fact and 'split' it.

This annotation strategy probably wouldn't translate into high accuracy because:

  1. you'd be mixing up the "well-formed" instances of the categories with the agglutinations
  2. there would likely be many (infinite?) combinations of such triplets, so it would probably be challenging to provide enough training samples for each

Furthermore, the resulting labels would not give you any information about where to split the subtokens.

For these reasons, your second idea is much more likely to succeed. You could add an extra label to the span categorizer and try to predict the agglutinations. Once your pipeline predicts the categories (3 + MIX) successfully, you could add another component to the pipeline that handles the agglutinations.
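To make the labeling scheme concrete, here's a minimal sketch of what the annotations could look like in Prodigy's span annotation format. The texts, label names and character offsets are just made-up examples, not something from your data:

```python
# Hypothetical examples in Prodigy's span annotation format.
# A "well-formed" message gets the three regular labels...
well_formed = {
    "text": "ltb banana 10usd",
    "spans": [
        {"start": 0, "end": 3, "label": "BUY_INTENT"},
        {"start": 4, "end": 10, "label": "PRODUCT"},
        {"start": 11, "end": 16, "label": "PRICE"},
    ],
}

# ...while an agglutinated message gets only the extra MIX label,
# to be split by a later component.
agglutinated = {
    "text": "ltbbanana10usd",
    "spans": [
        {"start": 0, "end": 14, "label": "MIX"},
    ],
}
```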

For that you could try character-level annotations in Prodigy and train a classifier similar to a tokenizer for these agglutinations.
Manual annotation of agglutinations would also give you an idea of how prevalent the problem is, how much variation there is and, ultimately, whether it makes sense to write regexes to handle them rather than train a classifier.
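If the agglutinations turn out to follow a small number of surface patterns, a handful of regexes might already cover most cases. A minimal sketch, where the pattern and the keyword alternatives ("ltb", "wtb") are just assumptions based on your one example:

```python
import re

# Hypothetical pattern: a known buy-intent keyword, followed by a product
# name, followed by a price like "10usd". Adjust the keyword alternation
# to what you actually observe in your data.
AGGLUTINATION = re.compile(
    r"^(?P<buy_intent>ltb|wtb)(?P<product>[a-z]+?)(?P<price>\d+usd)$"
)

def split_agglutination(text: str):
    """Return the three sub-spans if the text matches, else None."""
    match = AGGLUTINATION.match(text)
    if match is None:
        return None
    return match.groupdict()

print(split_agglutination("ltbbanana10usd"))
# {'buy_intent': 'ltb', 'product': 'banana', 'price': '10usd'}
```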

On a general note, it probably makes sense to split the task into two phases: 1) detecting the well-formed categories plus the agglutinations, and 2) handling the agglutinations, either via patterns or by training a tokenizer-like classifier for them if there are enough instances to provide quality training samples or too much variation to handle reliably with patterns.

Should I be making a big effort to sort of pre-process this? I'm afraid of pre-processing as I might split some things that don't need to be split, tainting my data-set.

That's the reason it's probably best to first identify the agglutinations and only apply the splitting procedure (be it via patterns or a model) to what likely needs splitting.
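For example, assuming the span categorizer writes its predictions to spaCy's default `doc.spans["sc"]` group and uses the MIX label discussed above, the splitting step can be restricted to just those spans (`split_agglutination` is the hypothetical helper sketched earlier):

```python
def postprocess(doc):
    # Only spans the model flagged as MIX are passed to the splitter;
    # well-formed spans are kept exactly as predicted.
    results = []
    for span in doc.spans["sc"]:
        if span.label_ == "MIX":
            parts = split_agglutination(span.text)
            if parts is not None:
                results.append((span.text, parts))
        else:
            results.append((span.text, {span.label_: span.text}))
    return results
```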
