Labeling & Training a Textcat with Contextual / Anchor Data

I have a use case that I'm wondering if you have any advice on.

To improve the textcat model in spaCy, it appears I need to include context. Basically, I want a dictionary of features with current values to be included in the textcat model and in labeling.

For instance, when teaching the textcat, I'd like to have a section below it showing the believed important context, primarily because it's sometimes the only way to actually differentiate between two categories.

Do you have any suggestions?

Hi Kevin,

I think I need a bit of clarification on your question, to make sure I understand it properly.

Do you want to show the extra context only during annotation, or are you trying to improve what information the machine learning model has access to?

There are a number of ways you can change what's displayed to the user, but as far as the machine learning model is concerned, you can pretty much just concatenate all the textual information together and let the model sort it out.

I was hoping to do both. Reference context while labeling and use it during prediction.

For instance, I index all the text messages by buckets of date/time and do the same with the contextual data.

text = "10s30s 25.375 bid legs"
context = {
    "USDLIB.Swap-3M/SB|2Y": 0.23901581516804385,
    "USDLIB.Swap-3M/SB|3Y": 0.27860792067121454,
    "USDLIB.Swap-3M/SB|4Y": 0.3428405232294781,
    "USDLIB.Swap-3M/SB|5Y": 0.42459642400000003,
    "USDLIB.Swap-3M/SB|6Y": 0.5161279134682054,
    "USDLIB.Swap-3M/SB|7Y": 0.6070392007765967,
    "USDLIB.Swap-3M/SB|8Y": 0.6919168190821608,
}

textcat label options = SB or TS
human selects TS

I think the meta field will probably be the nicest way to display that to the user. You can set it on the example dict, and it'll be displayed as a subscript in the card. We use that to display things like the subreddit for Reddit data.
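As a concrete sketch, a task dict for your example might look like this (the key names inside "meta" come from your own context data; "text", "label", and "meta" are the standard Prodigy task fields):

```python
# Sketch of a Prodigy task dict: whatever you put in "meta" is shown as a
# subscript in the annotation card, so the annotator can see the curve
# values without them becoming part of the text being classified.
task = {
    "text": "10s30s 25.375 bid legs",
    "label": "TS",
    "meta": {
        "USDLIB.Swap-3M/SB|2Y": 0.239,
        "USDLIB.Swap-3M/SB|5Y": 0.425,
    },
}
```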

To include this as a feature, the best way in spaCy v2 is to include it as an additional token. If you want to make a custom model with a library like PyTorch, a common recipe for this sort of thing is to embed the extra features with a separate embeddings table, and then sum the word embeddings with the extra feature embeddings before you pass the data into the contextual encoding. This puts the information into the model early, so that it can condition on it easily. I actually found this a surprising solution, but it's what Devlin recommends in the BERT paper, and I must say that transformers challenged my intuition about modelling in general (I still find it really unintuitive that the positional embedding works, for instance).
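To make the summing recipe concrete, here's a minimal NumPy sketch of the idea (NumPy rather than PyTorch, just to show the shapes; weighting each feature's embedding row by its real value is one design choice among several, e.g. you could also bucket the values into discrete symbols first):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_context_feats, dim = 100, 8, 16

# Two separate embedding tables: one for words, one for the extra features.
word_emb = rng.normal(size=(vocab_size, dim))
ctx_emb = rng.normal(size=(n_context_feats, dim))

def embed(token_ids, ctx_values):
    """Sum word embeddings with a context vector before the encoder.

    ctx_values holds one scalar per context feature (e.g. the curve
    levels), used here to weight that feature's embedding row.
    """
    words = word_emb[token_ids]  # (n_tokens, dim)
    ctx = (np.asarray(ctx_values)[:, None] * ctx_emb).sum(axis=0)  # (dim,)
    # Broadcast the pooled context vector across all token positions,
    # so the contextual encoder sees the information at every step.
    return words + ctx

out = embed([3, 17, 42], [0.239, 0.279, 0.343, 0.425, 0.516, 0.607, 0.692, 0.0])
```

The output has one row per token, each already carrying the context signal, so the downstream encoder can condition on it from the first layer.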

The next major release of Prodigy will use spaCy v3, which will make it much easier to use a custom model with Prodigy. For now, I would add the extra information to the annotation tool, and then do the experiments about the features separately from Prodigy, so that you have one less level of software to work with. You can export the annotations from Prodigy, and just run the experiment with your favourite combination of tooling. Once you've figured out what works best, you can get that working with Prodigy if you want the active learning model to do the same thing.

If you just want to try the easiest-to-implement thing first, do try just prepending the text with the contextual markers. Another easy-to-implement solution is to insert the contextual markers in between every token of the text. This looks weird and redundant, but it might be the best way to help the CNN exploit the information.
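The prepending approach needs nothing more than a bit of string formatting before the text reaches the model. A sketch (the marker format `key=value` is just an illustration; any consistent serialization works):

```python
def add_context_markers(text, context):
    """Serialize the context dict into marker tokens and prepend them,
    so the textcat model sees the context as ordinary input text."""
    markers = " ".join(
        f"{key}={value:.3f}" for key, value in sorted(context.items())
    )
    return f"{markers} {text}"

combined = add_context_markers(
    "10s30s 25.375 bid legs",
    {"USDLIB.Swap-3M/SB|2Y": 0.2390, "USDLIB.Swap-3M/SB|5Y": 0.4246},
)
# combined == "USDLIB.Swap-3M/SB|2Y=0.239 USDLIB.Swap-3M/SB|5Y=0.425 10s30s 25.375 bid legs"
```

One caveat worth testing: raw floating-point values produce many rare tokens, so rounding or bucketing the values (as the `:.3f` above crudely does) may help the model generalize.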