Textcat model with multiple classes

We have a number of Prodigy-trained models in production at the moment and are happy with our results. So far, everything we have done has been NER-related. Now we are attempting some text classification, and I'm bumping into some issues.

Some dimensions to our particular scenario:

  • Texts are generally less than 250 characters
  • I have 27 different categories, and each text can belong to only one.
  • I have a complex spaCy pipeline that tags texts with these categories with some accuracy, about 85% or so, but it needs to be better, so why not train a textcat model with the output?

So, given the above, my current flow is like so:

First, export some raw data for later use:

import json

# df and date_suffix are defined earlier (not shown)
examples = df[['id', 'text']].reset_index()
with open('./data/raw_data_{}.jsonl'.format(date_suffix), 'w') as outfile:
    for each in examples.itertuples():
        # One JSON object per line (JSONL)
        json.dump({'text': each.text, 'meta': {'id': each.id}}, outfile)
        outfile.write('\n')

Second, exporting some labeled examples from the spaCy pipeline for use with prodigy mark (Code not shown: creating a balanced sample of each category tag):

with open('./data/mark_examples_{}.jsonl'.format(date_suffix), 'w') as outfile:
    for each in examples_for_mark.itertuples():
        # Prepend the category so it is visible in the text view
        json.dump({'text': each.category + ' \n ' + each.text,
                   'meta': {'menu_id': str(each.id),
                            'category': str(each.category)}
                   }, outfile)
        outfile.write('\n')

Then, running Prodigy like so:

prodigy mark textcat_menu_data_2 ./data/mark_examples_2019_10_29.jsonl --view-id text  --memorize

After annotating, I export my annotations:

prodigy db-out textcat_menu_data_2 ./data/annotations_2019_10_30

Then, I load and parse the annotations in order to reformat the jsonl file for textcat:

# Load annotations
annotations_file = './data/annotations_2019_10_30/textcat_menu_data_2.jsonl'
data = []
with open(annotations_file, 'r') as f:
    for line in f:
        line = json.loads(line)
        data.append([line['meta']['menu_id'], line['meta']['category'], line['answer'], line['text']] )

df_training_data = pd.DataFrame(data)
df_training_data.columns = ['menu_id', 'category', 'annotation', 'text']   

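# Drop the 'category \n ' prefix that was added for prodigy mark.
# Splitting once on the full ' \n ' separator avoids a leading space
# and keeps any newlines inside the original text.
df_training_data['text'] = df_training_data['text'].str.split(' \n ', n=1).str[1]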

Next, we export this as training data:

# I can't figure out how to give accept and reject answers at the moment
accept_only = df_training_data[df_training_data['annotation'] == 'accept'].reset_index()

with open('./data/training_data_2019_10_30.jsonl', 'w') as outfile:
    for each in accept_only.itertuples():
        json.dump({'text': each.text,
                   'label': each.category,
                   'meta': {'menu_id': str(each.menu_id)},
                   # 'answer': each.annotation,
                   }, outfile)
        outfile.write('\n')

Next, we create a new dataset and load the training data:

prodigy dataset textcat_categorization_6
prodigy db-in textcat_categorization_6 ./data/training_data_2019_10_30.jsonl

Finally, we train the model:

prodigy textcat.batch-train textcat_categorization_6 en_core_web_lg --output ./models/text_cat_2019_10_30_v2 --eval-split 0.2 --n-iter 5

The first sign that something went wrong is that it achieves 100% accuracy on the 2nd iteration:

Loaded model en_core_web_lg
Using 20% of examples (176) for evaluation
Using 100% of remaining examples (704) for training
Dropout: 0.2  Batch size: 10  Iterations: 5  

#            LOSS         F-SCORE      ACCURACY  
01           0.119        0.994        0.989                                                                                          
02           0.006        1.000        1.000  

But let's try running textcat.teach (creating a new dataset to put these annotations), on just a few of the labels at a time (three, these are fake category names but otherwise my exact code) and see where we're at anyway:

prodigy textcat.teach textcat_categorization_7 ./models/text_cat_2019_10_30_v2 ./data/raw_data_2019_10_28.jsonl --label "CATEGORY_1,CATEGORY_2,CATEGORY_3"

When using textcat.teach here, the model seems to be very confident about all the wrong examples and has apparently learned nothing.

Questions:

  1. Is there a better flow to achieve the above? It seems like quite a lot of steps for what would seem to me to be a common approach, but I'm entirely fine with it now that I've sketched it out (although it doesn't work as is):
    • tag some data with NLP rules,
    • accept/reject the tags with Prodigy
    • use the results to train a new model
  2. I'm a little confused on how datasets work in general, as you can see I'm creating multiple different datasets but I assume this is not the right way to be thinking about this. Can I keep my annotations and my raw data in the same dataset? Etc.
  3. How can I fix my above approach to function properly?

Thanks again for all your brilliant work and on-going support.

Hi! Glad to hear things have been going well so far :smiley:

Your current approach does indeed sound a bit complicated for what it is, and I'm sure there's an easier way to achieve the same result. Have you had a look at the textcat.manual recipe yet? It shows the text and options in a multiple-choice interface and the format you get out is directly compatible with textcat.teach.

A single annotation task in the choice format could look like this:

{
    "text": "Some text",
    "options": [
        {"id": "LABEL1", "text": "Label 1"},
        {"id": "LABEL2", "text": "Label 2"}
    ]
}

When you select an option, a key "accept" is added to the task and it holds a list of the selected IDs. For example: "accept": ["LABEL1", "LABEL2"]. You can also provide those when you load in the data to pre-select certain categories – e.g. based on your rules – and then correct them if needed.
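To make the pre-selection idea concrete, here's a rough sketch of how the input tasks could be generated from rule-based tags. The label names and the helper `make_choice_task` are made up for illustration, and whether a given recipe keeps the "options" you pass in or rebuilds them from `--label` is something to check against your Prodigy version:

```python
import json

# Hypothetical label set; the real project has 27 categories.
LABELS = ["CATEGORY_1", "CATEGORY_2", "CATEGORY_3"]


def make_choice_task(text, predicted_label, labels=LABELS, meta=None):
    """Build one choice-format task with the rule-based label pre-selected."""
    task = {
        "text": text,
        "options": [{"id": label, "text": label} for label in labels],
        "meta": meta or {},
    }
    # Only pre-select labels that are actually among the options.
    if predicted_label in labels:
        task["accept"] = [predicted_label]
    return task


def to_jsonl(tasks):
    """Serialize tasks as JSONL: one JSON object per line."""
    return "".join(json.dumps(task) + "\n" for task in tasks)
```

Writing the result of `to_jsonl(...)` to a file gives you something you can load as the source, with the rule-based category already ticked for each text.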

For training, you might also consider training with spaCy directly – this gives you more flexibility and you get to tweak more settings, experiment with different architectures etc. See here for an example script.
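If you go the spaCy route, the choice-format annotations need to be turned into per-label scores first. Here's a minimal sketch, assuming the `(text, {"cats": {...}})` shape used in spaCy's v2-era textcat training examples and assuming each text has exactly one category. The function names are hypothetical, so double-check the target format against the spaCy version you're on:

```python
def choice_to_cats(annotation, labels):
    """Convert one choice-format annotation to a (text, {"cats": ...}) pair."""
    selected = set(annotation.get("accept", []))
    cats = {label: 1.0 if label in selected else 0.0 for label in labels}
    return annotation["text"], {"cats": cats}


def to_training_data(annotations, labels):
    """Keep accepted tasks with exactly one selected label and convert them."""
    data = []
    for eg in annotations:
        # Skip rejected/ignored tasks and tasks without a single clear label.
        if eg.get("answer") != "accept" or len(eg.get("accept", [])) != 1:
            continue
        data.append(choice_to_cats(eg, labels))
    return data
```

Setting every non-selected label to 0.0 is what tells the model the categories are mutually exclusive for that text.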

Datasets in Prodigy hold the annotations you collect. There's typically no need to import raw data before you annotate – this can all be done on the command line when you start the recipe.

Datasets are append-only so you'll never lose any state or data. So if you want to manually edit examples in an existing dataset, you should export it, edit it and then import it to a new dataset. This creates more data overall – but it means you'll always be able to recover the previous dataset. We recommend creating a new dataset for every annotation experiment, annotation type etc. Merging datasets later is easy – there's a db-merge command and each example has hashes that let you find all annotations on the same input text. You can also think of a dataset as one unit of data you'd run a particular experiment with.
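To illustrate what the hashes let you do, here's a small sketch that groups exported records by input text. It assumes you've already parsed the exported JSONL into dicts and that each record carries an `_input_hash` key. The helper name is made up:

```python
from collections import defaultdict


def group_by_input_hash(records):
    """Group annotation dicts that refer to the same input text."""
    grouped = defaultdict(list)
    for eg in records:
        # Records without a hash end up together under the key None.
        grouped[eg.get("_input_hash")].append(eg)
    return dict(grouped)
```

Each value in the result is the list of all annotations collected on one input, regardless of which dataset they came from.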

I have used textcat.manual, but given that I already have so many tagged examples, doesn't it make the most sense just to filter them down with the binary annotation method (e.g. using mark to determine if the NLP rules were correct) and then use that as training data? Using textcat.manual would basically be like starting over (not using output from the NLP rules as training data at all), right?

Ah, so my suggestion was to use your rules to pre-populate the selected categories in textcat.manual and then review them (and correct them if needed). So what you get out at the end is corrected gold-standard data to train your text classifier. Or am I misunderstanding your use case?

Ohhhhh yes, this is exactly what I'm trying to do -- create gold standard data from the tags produced by the NLP pipeline. I certainly did not know it was this simple, I must have missed that in the docs.

So, it's possible to create a dataset for textcat.manual where the tags generated by my NLP pipeline are pre-selected in the Prodigy interface?

Would it look like this:

[image 1]

or like this:

[image 2]

??

Certainly, the latter is preferred, and is basically what I'm doing with mark but in a hacky way!

What does the .jsonl file need to look like to boot up textcat.manual in this fashion?

And then how would I batch-train and subsequently run textcat.teach on new data from the output of textcat.manual?

If you run textcat.manual with more than one label, it will look like image 1. If you run it with only one label, it'll look like image 2.

I'm still not 100% sure what all the different import/export conversion steps are for – I think you should be able to avoid those and make mark produce data in the format the textcat training recipe expects. That recipe needs a "text" and a "label". So if you add "label": str(each.category) to each task, that should work, right? And instead of --view-id text, you can use --view-id classification. The dataset created with this should work out-of-the-box with textcat.batch-train.
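Roughly, the export for mark could then look something like the sketch below. The helper names are just for illustration, and the field names mirror the ones from your snippets. The category goes into "label" instead of being glued onto the text, so no parsing is needed afterwards:

```python
import json


def make_classification_task(text, category, menu_id):
    """One task for the classification view: text, label and some metadata."""
    return {
        "text": text,
        "label": str(category),
        "meta": {"menu_id": str(menu_id)},
    }


def to_jsonl(tasks):
    """Serialize tasks as JSONL: one JSON object per line."""
    return "".join(json.dumps(task) + "\n" for task in tasks)
```

Once that's written to a file, you'd start mark on it with --view-id classification and accept or reject each suggested label.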