Adding metadata to a text classification model

I’m currently working on an intelligent news reader. A user can subscribe to a certain topic, and I pull in articles from many different sources via http://eventregistry.org. However, not all of those articles are relevant to my users, and I want my system to learn over time what’s relevant to each user, i.e. text classification.

Given that the data I receive from EventRegistry is already annotated with categories, labels, locations, etc., I would like to use this information. So I would like to train a model that uses the text of the articles, but also all the other features provided.

Is there a way to achieve this with Prodigy / spaCy? Or should I use spaCy only to get a vector representation of the text, and then train a general classification model on that representation concatenated with the other features?

Thanks in advance for your help,

Stephan

I think you’ll probably be better off using a linear model for this task. spaCy’s CNN model is better for some tasks, but bag-of-words models tend to work very well for topic classification of news. The reason is that the signal comes almost entirely from the presence or absence of domain words which are common enough to have feature weights, but rare enough to be useful indicators. These features are fundamentally discrete in nature, so the continuous transformation performed by the neural network just gets in the way. You can convince the neural network to make decisions similar to those of a bigram bag-of-words model — all the information is there, after all — but it’s much slower, and the resulting model will be harder to reason about.

The best linear-model text classification package is Vowpal Wabbit. scikit-learn is also very good for this.
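To make the bag-of-words suggestion concrete, here’s a minimal scikit-learn sketch of a linear classifier over unigram and bigram presence features. The articles and topic labels are toy placeholders, not EventRegistry data:

```python
# Sketch: linear bag-of-words topic classifier with scikit-learn.
# The texts and labels below are made-up placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Central bank raises interest rates amid inflation fears",
    "Star striker scores twice in cup final",
    "New fiscal policy targets government spending",
    "Coach praises defence after tense playoff win",
]
labels = ["finance", "sports", "finance", "sports"]

# binary=True gives presence/absence features — the discrete signal
# the reply above says topic classification thrives on.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["Markets rally as interest rates hold steady"]))
```

With real data you’d fit on thousands of articles; the pipeline shape stays the same, and the learned coefficients stay easy to inspect per n-gram.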

If you do want to use spaCy’s text classifier, probably the best approach is to train the CNN on only the text, and then have a model on top which uses the CNN’s scores as features alongside your metadata. You can do this in Vowpal Wabbit or scikit-learn. If you go this route, XGBoost is another good package to consider — it’s well suited to this type of ensembling.
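A rough sketch of that stacking idea, using scikit-learn for the model on top. The score and metadata values here are invented placeholders; in practice the first column would come from the spaCy text classifier (e.g. `doc.cats`) run over articles it was not trained on, so the stacker doesn’t just learn to trust overfit scores:

```python
# Sketch: second-stage model over [textcat_score, metadata...] features.
# All numbers below are placeholders, not real model output.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per article: [text classifier score, is_business_category, is_local_news]
X = np.array([
    [0.91, 1, 0],
    [0.15, 0, 1],
    [0.78, 1, 1],
    [0.05, 0, 0],
])
y = np.array([1, 0, 1, 0])  # 1 = article was relevant to this user

stacker = LogisticRegression().fit(X, y)

# Probability that a new article (score 0.6, business, not local) is relevant.
print(stacker.predict_proba([[0.6, 1, 0]])[0, 1])
```

Swapping `LogisticRegression` for an XGBoost or gradient-boosting model is a one-line change once the feature matrix is in this shape.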

There are also simple hacks to feed the features into spaCy’s text classifier. The easiest is to just prepend the labels to the text. This gives the information to the text classifier, so the labels can be used as features. It’s dumb, but it’s likely to work about as well as some more principled way of separately embedding the labels.
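The prepending hack can be as simple as this. The field names (`category`, `location`) echo the EventRegistry-style metadata mentioned earlier but are illustrative, and the marker format is just one arbitrary choice that keeps the metadata tokens distinct from ordinary words:

```python
# Sketch: fold metadata labels into the text so a plain text classifier
# picks them up as ordinary token features. Field names are illustrative.
def with_metadata(article):
    prefix = " ".join(
        f"__{key}_{value}__"
        for key in ("category", "location")   # assumed metadata fields
        for value in article.get(key, [])
    )
    return f"{prefix} {article['text']}".strip()

article = {
    "text": "The council approved the new stadium plan.",
    "category": ["politics"],
    "location": ["berlin"],
}
print(with_metadata(article))
# → "__category_politics__ __location_berlin__ The council approved the new stadium plan."
```

You’d apply this transform to every article before feeding it to the annotation and training workflow, so train-time and run-time inputs stay consistent.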

Hey Honnibal,

thank you very much for that detailed reply, it helped a lot. And sorry for the late reply: I read this right before vacation and then forgot about it.

Thanks a lot, I really love how great the Prodigy support is!

stephan