Including document-level, non-textual metadata in model training

Hi again!

I'm using Prodigy to train classifiers, but my dataset also includes metadata about each document that I think in principle could significantly improve model performance. Is there any way of including this in the model training process? I'm not that hopeful after reading this post by Matt:

I can think of a very goofy way of doing this: transform the metadata into natural language. So something like {'document_country_of_origin': 'US'} would mean adding the phrase "This document was published in the US" to both the training data and the data we make predictions on. But really that seems too ugly to countenance, and it wouldn't work for continuous variables anyway.

Is there a better way? :)

We don't really have a good way to include such features in spaCy models at the moment. Try just prepending the identifiers to the text -- it's not actually that bad a solution, as they'll at least be included in the bag of words. I wouldn't make it a sentence, just adding US_origin as a token would do it. It's not very satisfying, but it might work.
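
To make the suggestion concrete, here's a rough sketch of what prepending could look like as a preprocessing step. The function names and the example dict layout (a "text" key plus a "meta" dict of categorical values) are assumptions for illustration, not part of spaCy or Prodigy. It's also worth checking that your tokenizer keeps something like US_origin as a single token.

```python
def metadata_tokens(meta):
    """Build one pseudo-word per metadata field, e.g. {'origin': 'US'} -> ['US_origin'].

    Hypothetical helper: assumes every value is categorical.
    """
    tokens = []
    for key in sorted(meta):  # sorted so the token order is stable across examples
        value = str(meta[key]).strip().replace(" ", "_")
        tokens.append(f"{value}_{key}")
    return tokens


def prepend_metadata(example):
    """Return a copy of the example with the metadata tokens prepended to its text."""
    tokens = metadata_tokens(example.get("meta", {}))
    if not tokens:
        return dict(example)
    return {**example, "text": " ".join(tokens) + " " + example["text"]}


example = {"text": "Quarterly results were strong.", "meta": {"origin": "US"}}
print(prepend_metadata(example)["text"])
# US_origin Quarterly results were strong.
```

The same transformation has to be applied to the texts you predict on, otherwise the model never sees the tokens it learned from.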