We have a number of Prodigy-trained models in production at the moment and are happy with our results. So far, everything we have done has been NER-related. Now we're attempting some text classification, and I'm bumping into some issues.
Some dimensions to our particular scenario:
- Texts are generally less than 250 characters
- There are 27 categories, and each text belongs to exactly one.
- I have a complex spaCy pipeline that tags texts with these categories at about 85% accuracy, but it needs to be better, so why not train a `textcat` model on its output?
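For context, the rule-based pipeline is roughly of this shape (a heavily simplified, hypothetical sketch; the real rules, labels, and phrases are all different):

```python
# Hypothetical sketch of the rule-based tagging; the actual pipeline
# is far more complex, and these labels/phrases are made up.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_lg')
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
matcher.add('CATEGORY_1', None, *[nlp.make_doc(t) for t in ['cheeseburger', 'fries']])
matcher.add('CATEGORY_2', None, *[nlp.make_doc(t) for t in ['latte', 'espresso']])

def categorize(text):
    doc = nlp.make_doc(text)
    matches = matcher(doc)
    # Return the label of the first matching rule, if any
    return nlp.vocab.strings[matches[0][0]] if matches else None
```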
So, given the above, my current flow is like so:
First, export some raw data for later use:
```python
import json

# df and date_suffix are defined earlier
examples = df[['id', 'text']].reset_index()
with open('./data/raw_data_{}.jsonl'.format(date_suffix), 'w') as outfile:
    for each in examples.itertuples():
        json.dump({'text': each.text,
                   'meta': {'id': each.id}}, outfile)
        outfile.write('\n')
```
Second, I export some labeled examples from the spaCy pipeline for use with `prodigy mark`.
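The step that builds a balanced sample of each category tag isn't shown, but it's roughly the following (a simplified sketch; the real column names and per-category cap are placeholders):

```python
# Hypothetical sketch of the balanced-sampling step (not the real code):
# cap each category at 50 examples so no label dominates the annotation queue.
examples_for_mark = (df.groupby('category', group_keys=False)
                       .apply(lambda g: g.sample(min(len(g), 50), random_state=0)))
```

With that sample in hand, the export itself: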
```python
with open('./data/mark_examples_{}.jsonl'.format(date_suffix), 'w') as outfile:
    for each in examples_for_mark.itertuples():
        # Prepend the rule-based category to the text so it's visible
        # during annotation; it gets stripped back out later.
        json.dump({'text': each.category + ' \n ' + each.text,
                   'meta': {'menu_id': str(each.id),
                            'category': str(each.category)}}, outfile)
        outfile.write('\n')
```
Then, running Prodigy like so:
```
prodigy mark textcat_menu_data_2 ./data/mark_examples_2019_10_29.jsonl --view-id text --memorize
```
After annotating, I export my annotations:
```
prodigy db-out textcat_menu_data_2 ./data/annotations_2019_10_30
```
Then I load and parse the annotations in order to reformat the JSONL file for `textcat`:
```python
# Load annotations
annotations_file = './data/annotations_2019_10_30/textcat_menu_data_2.jsonl'
data = []
with open(annotations_file, 'r') as f:
    for line in f:
        line = json.loads(line)
        data.append([line['meta']['menu_id'], line['meta']['category'],
                     line['answer'], line['text']])

df_training_data = pd.DataFrame(data, columns=['menu_id', 'category', 'annotation', 'text'])
# Strip the prepended category off, keeping only the original text
df_training_data['text'] = df_training_data['text'].str.split('\n').str[1]
```
Next, we export this as training data:
```python
# I can't figure out how to give accept and reject answers at the moment,
# so only the accepted examples are kept.
accept_only = df_training_data[df_training_data['annotation'] == 'accept'].reset_index()

with open('./data/training_data_2019_10_30.jsonl', 'w') as outfile:
    for each in accept_only.itertuples():
        json.dump({'text': each.text,
                   'label': each.category,
                   'meta': {'menu_id': str(each.menu_id)},
                   # 'answer': each.annotation,
                   }, outfile)
        outfile.write('\n')
```
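My best guess for keeping both answers would be to preserve Prodigy's top-level `answer` key, something like the sketch below, though I haven't verified that `db-in` and `textcat.batch-train` actually treat the rejects as negative examples:

```python
# Guess: keep rejected annotations too, carrying the 'answer' key through
# instead of filtering the DataFrame down to accepts only.
with open('./data/training_data_2019_10_30.jsonl', 'w') as outfile:
    for each in df_training_data.itertuples():
        json.dump({'text': each.text,
                   'label': each.category,
                   'answer': each.annotation,  # 'accept' or 'reject'
                   'meta': {'menu_id': str(each.menu_id)}}, outfile)
        outfile.write('\n')
```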
Next, we create a new dataset and load the training data:
```
prodigy dataset textcat_categorization_6
prodigy db-in textcat_categorization_6 ./data/training_data_2019_10_30.jsonl
```
Finally, we train the model:
```
prodigy textcat.batch-train textcat_categorization_6 en_core_web_lg --output ./models/text_cat_2019_10_30_v2 --eval-split 0.2 --n-iter 5
```
The first sign that something went wrong is that it achieves 100% accuracy on the 2nd iteration:
```
Loaded model en_core_web_lg
Using 20% of examples (176) for evaluation
Using 100% of remaining examples (704) for training
Dropout: 0.2  Batch size: 10  Iterations: 5

#    LOSS     F-SCORE  ACCURACY
01   0.119    0.994    0.989
02   0.006    1.000    1.000
```
But let's try running `textcat.teach` anyway (creating a new dataset for these annotations), on just a few labels at a time (three here; the category names are fake, but this is otherwise my exact code) and see where we're at:
```
prodigy textcat.teach textcat_categorization_7 ./models/text_cat_2019_10_30_v2 ./data/raw_data_2019_10_28.jsonl --label "CATEGORY_1,CATEGORY_2,CATEGORY_3"
```
When using `textcat.teach` here, the model seems to be very confident about all the wrong examples and has apparently learned nothing.
Questions:
- Is there a better flow to achieve the above? It seems like a lot of steps for what I'd assume is a common approach, though I'm happy enough with it now that I've sketched it out (even if it doesn't work as is):
  - tag some data with NLP rules,
  - accept/reject the tags with Prodigy,
  - use the results to train a new model.
- I'm a little confused about how datasets work in general. As you can see, I'm creating several different datasets, but I assume this isn't the right way to think about it. Can I keep my annotations and my raw data in the same dataset, etc.?
- How can I fix my approach above so it functions properly?
Thanks again for all your brilliant work and on-going support.