Including extra features/metadata in text classification

I have extra numeric data that comes along with the text I am trying to classify. I suspect these features will help the model learn. Is there any way to include them in the default textcat? Or perhaps I can extract the features from the last layer of the net and concatenate them with my features …
Any ideas?


There’s not really a nice API for modifying the textcat model at the moment, unfortunately. The best quick-and-dirty solution is to pre-process the features into boolean IDs according to your preferences, and then just toss them into the text (probably at the end). The textcat model stacks a CNN with a linear bag-of-words classifier, so you’ll at least get the features in the linear model. The CNN might do weird things with them, because it will pay attention to their ordering and to some of their substrings; this may or may not be good. Hopefully, if a feature is irrelevant, the model at least ignores it.
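As a rough illustration of the "boolean IDs appended to the text" idea, here is a minimal sketch. The `FEAT_` token format, the bucket edges, and the helper names `bucket` and `append_feature_tokens` are all hypothetical choices, not part of spaCy or Prodigy, and it is worth checking how your tokenizer splits such tokens.

```python
def bucket(name, value, edges):
    """Turn a numeric value into one boolean ID token, e.g. FEAT_amount_B2."""
    # Count how many of the ascending edges the value has reached;
    # that count is used as the bucket index.
    index = sum(1 for edge in edges if value >= edge)
    return f"FEAT_{name}_B{index}"


def append_feature_tokens(text, features, edges_by_name):
    """Append one bucket token per numeric feature to the end of the text."""
    tokens = [
        bucket(name, features[name], edges_by_name[name])
        for name in sorted(features)  # fixed order, since the CNN sees ordering
    ]
    return " ".join([text] + tokens)


# Hypothetical bucket edges; pick them to suit your own data.
edges = {"amount": [-1000, 0, 1000], "n_pages": [2, 10]}
example = append_feature_tokens(
    "Invoice for consulting services",
    {"amount": 250.0, "n_pages": 1},
    edges,
)
print(example)
# Invoice for consulting services FEAT_amount_B2 FEAT_n_pages_B0
```

Keeping the tokens in a fixed order and using coarse buckets should at least give the bag-of-words part of the model a consistent signal to pick up.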

Hopefully this approach is enough to exploit the features during the textcat.teach process. After you’ve got your annotations, you should be able to export the data and then you can build a classifier with a standard ML package, which will give you full support for arbitrary features. I’ll think about how to add a feature extraction API to spaCy’s TextCategorizer class to make this cleaner. Suggestions would be welcome — if you have ideas, you might open an issue on the spaCy tracker?
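For the "export and use a standard ML package" step, something like the following sketch could turn exported annotations into feature dicts. It assumes one JSON object per line with `text`, `label`, `answer`, and the numeric features stored under `meta`; those field names are assumptions about your export and should be checked against your own data. The helpers `to_features` and `load_examples` are hypothetical.

```python
import json
from collections import Counter

# Stand-in for an exported annotations file. The field names ("text", "label",
# "answer", "meta") are assumptions; match them to your own export.
EXPORTED = """\
{"text": "refund issued to customer", "label": "BILLING", "answer": "accept", "meta": {"amount": -20.0}}
{"text": "server down again", "label": "BILLING", "answer": "reject", "meta": {"amount": 0.0}}
"""


def to_features(record):
    """Merge bag-of-words counts with the numeric metadata into one dict."""
    counts = Counter(record["text"].lower().split())
    features = {f"word={word}": float(count) for word, count in counts.items()}
    for name, value in record.get("meta", {}).items():
        features[f"meta={name}"] = float(value)
    return features


def load_examples(jsonl_text):
    """Yield (features, label, is_positive) for accepted or rejected records."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("answer") not in ("accept", "reject"):
            continue  # skip ignored or unanswered examples
        yield to_features(record), record["label"], record["answer"] == "accept"


rows = list(load_examples(EXPORTED))
for features, label, is_positive in rows:
    print(label, is_positive, features)
```

Dicts in this shape can then be vectorized by a general-purpose ML library (scikit-learn's `DictVectorizer` is one option) and passed to whichever classifier you prefer.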

I have a similar issue: I am working on bank statements, where I want to use both the text itself (the bank operation) and the metadata (positive or negative amount, date of the operation).

I wonder whether a better approach would be to use NER or POS (not sure which is best) to find all the interesting bits in the text, then build a "normal" classification matrix and process that with e.g. scikit-learn.

If I illustrate with an example, consider the following bank statement, a wire transfer to pay an employee:
{"text": "VIR SALAIRE LOUIS", "date": "2021-01-31", "amount": -1234.5, "currency": "€"}

I could simply create a text from it such as:
VIR SALAIRE LOUIS -1234.5€ 2021-01-31
...and then use the full range of Prodigy/spaCy features (NER/textcat) to process it, but I feel I'd be losing a lot of information that I already have.

Instead, I wonder if it wouldn't be better to use NER to obtain something like this:
{"operation": "VIR", "category": "SALAIRE", "recipient": "LOUIS", "amount": -1234.5, "currency": "€"}
...and then use a regular classification algorithm to find the accounting category which corresponds to that (for example)
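To make the "regular classification" route concrete, here is a minimal sketch of flattening the statement above into a feature row. The feature names and the `statement_to_row` helper are hypothetical, and plain whitespace tokens stand in for the NER output described above.

```python
from datetime import date, timedelta


def statement_to_row(statement):
    """Flatten one bank statement into a dict of plain features."""
    amount = statement["amount"]
    day = date.fromisoformat(statement["date"])
    row = {
        "is_debit": amount < 0,
        "abs_amount": abs(amount),
        "currency": statement["currency"],
        "month": day.month,
        "weekday": day.weekday(),  # Monday is 0, Sunday is 6
        # True when the next calendar day falls in a different month.
        "is_month_end": (day + timedelta(days=1)).month != day.month,
    }
    # Whitespace tokens as stand-ins for NER spans (an assumption of this sketch).
    for token in statement["text"].split():
        row[f"has_token={token}"] = True
    return row


row = statement_to_row(
    {"text": "VIR SALAIRE LOUIS", "date": "2021-01-31", "amount": -1234.5, "currency": "€"}
)
print(row)
```

Keeping the amount and date as real numeric and calendar features, rather than as characters in a string, is what preserves the information that the text-only version would lose.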

What do you think, @honnibal?

Also, a side question more specific to my use case: would POS possibly be better than NER for bank statements? There are definitely some relations between the 3 words in my text above, and maybe those relations yield more accurate information than NER.

I definitely think you should look at completely custom classification logic for your data, rather than using the default NLP pipelines. I'm actually not much of an expert on this type of machine learning, but I understand XGBoost is one of the standard packages people use.

There are actually a lot of assumptions built into spaCy's models about how linguistic inputs behave. Those assumptions don't necessarily hold for your data, which doesn't have normal grammatical structure, and where you can have all sorts of other information the model should capture.