Labeling & Training a Textcat with Contextual / Anchor Data

I have a use case that I'm wondering if you have any advice on.

To improve the textcat model in spaCy, it appears I need to include context. Basically, I want a dictionary of features with current values to be included in both the textcat model and the labeling.

For instance, when teaching the textcat, I'd like to have a section below it showing the context I believe is important, primarily because it's sometimes the only way to actually differentiate between two categories.

Do you have any suggestions?

Hi Kevin,

I think I need a bit of clarification on your question, to make sure I understand it properly.

Do you want to show the extra context only during annotation, or are you trying to improve what information the machine learning model has access to?

There are a number of ways you can change what's displayed to the user, but as far as the machine learning model is concerned, you can pretty much just concatenate all the textual information together and let the model sort it out.

I was hoping to do both: reference the context while labeling, and use it during prediction.

For instance: I index all the text messages into date/time buckets, and do the same with the contextual data.

```python
text = "10s30s 25.375 bid legs"
context = {
    "USDLIB.Swap-3M/SB|2Y": 0.23901581516804385,
    "USDLIB.Swap-3M/SB|3Y": 0.27860792067121454,
    "USDLIB.Swap-3M/SB|4Y": 0.3428405232294781,
    "USDLIB.Swap-3M/SB|5Y": 0.42459642400000003,
    "USDLIB.Swap-3M/SB|6Y": 0.5161279134682054,
    "USDLIB.Swap-3M/SB|7Y": 0.6070392007765967,
    "USDLIB.Swap-3M/SB|8Y": 0.6919168190821608,
    ...
}
```

The textcat label options are `SB` or `TS`, and the human selects `TS`.

I think the meta field will probably be the nicest way to display that to the user. You can set it on the example dict, and it'll be displayed as a subscript in the card. We use that to display things like the subreddit for Reddit data.
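For example, a task for the choice interface could look roughly like this (this is just a sketch built from your example above; the rounding is only to keep the card readable):

```python
# A minimal sketch of a Prodigy task dict with the context attached as "meta".
# The "meta" values appear at the bottom of the annotation card.
task = {
    "text": "10s30s 25.375 bid legs",
    "options": [{"id": "SB", "text": "SB"}, {"id": "TS", "text": "TS"}],
    "meta": {
        # Rounded here only so the card stays readable.
        "USDLIB.Swap-3M/SB|2Y": 0.239,
        "USDLIB.Swap-3M/SB|3Y": 0.2786,
    },
}
```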

To include this as a feature, the best way in spaCy v2 is to include it as additional tokens. If you want to make a custom model with a library like PyTorch, a common recipe for this sort of thing is to embed the extra features with a separate embeddings table, and then sum the word embeddings with the extra feature embeddings before you pass the data into the contextual encoder. This puts the information into the model early, so that it can condition on it easily. I actually found this solution surprising, but it's what Devlin et al. recommend in the BERT paper, and I must say that transformers have challenged my intuition about modelling in general (I still find it really unintuitive that positional embeddings work, for instance).
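As a rough sketch of that PyTorch recipe (the class, the dimensions, and the idea of discretising your continuous context values into feature ids are all my placeholders, not anything Prodigy or spaCy provides):

```python
import torch
import torch.nn as nn

class TokenPlusContextEmbedder(nn.Module):
    """Sum word embeddings with embeddings of extra context features,
    before the sequence reaches the contextual encoder. Assumes the
    continuous context values have been binned into integer feature ids."""

    def __init__(self, vocab_size, n_features, dim):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.feat_embed = nn.Embedding(n_features, dim)

    def forward(self, token_ids, feature_ids):
        # token_ids: (batch, seq_len); feature_ids: (batch, n_active_features)
        words = self.word_embed(token_ids)               # (batch, seq, dim)
        feats = self.feat_embed(feature_ids).sum(dim=1)  # (batch, dim)
        # Broadcast the pooled feature vector over every token position,
        # the same way BERT adds segment embeddings to each token.
        return words + feats.unsqueeze(1)

embedder = TokenPlusContextEmbedder(vocab_size=20000, n_features=500, dim=96)
token_ids = torch.randint(0, 20000, (1, 4))   # one message of 4 tokens
feature_ids = torch.randint(0, 500, (1, 7))   # 7 active context features
vectors = embedder(token_ids, feature_ids)    # shape: (1, 4, 96)
```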

The next major release of Prodigy will use spaCy v3, which will make it much easier to use a custom model with Prodigy. For now, I would add the extra information to the annotation tool, and then run the feature experiments separately from Prodigy, so that you have one less layer of software to work with. You can export the annotations from Prodigy and run the experiments with your favourite combination of tooling. Once you've figured out what works best, you can wire that into Prodigy if you want the active learning model to do the same thing.
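For example, `db-out` writes a dataset as JSONL that any tooling can consume (the dataset and file names here are placeholders):

```bash
prodigy db-out your_dataset > annotations.jsonl
```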

If you just want to try the easiest thing to implement first, do try prepending the contextual markers to the text. Another easy solution is to insert the contextual markers between every token of the text. This looks weird and redundant, but it might be the best way to help the CNN exploit the information. A minimal sketch of both transforms (the scheme for discretising the context values into marker strings is my assumption, not something Prodigy or spaCy does for you):
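```python
def context_markers(context, n_bins=10):
    """Turn the numeric context dict into discrete marker tokens,
    e.g. 'USDLIB.Swap-3M/SB|2Y=bin2'. The binning scheme is an assumption."""
    return [f"{key}=bin{int(value * n_bins)}" for key, value in context.items()]

def prepend(text, context):
    # Easiest option: put the markers once, in front of the message.
    return " ".join(context_markers(context)) + " " + text

def interleave(text, context):
    # Redundant-looking option: repeat the markers between every token,
    # so the CNN's narrow window always sees them next to each word.
    markers = " ".join(context_markers(context))
    return f" {markers} ".join(text.split())
```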