Hello, I have some questions regarding textcat/spaCy. I found some answers here and on the GitHub page, but I would really appreciate your help.
I use this code:
```python
from spacy.util import minibatch, compounding

# nlp, textcat, train_data, dev_data, train_texts, train_cats, dev_texts, dev_cats,
# n_iter, drop_rate, evaluate() and class_report() are defined earlier in my script

def no_update(weights, gradient, key=None, **kwargs):
    pass  # no-op optimizer: discard gradients so the dev loss is computed without training

pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'Prec', 'Recall', 'Fscore'))
    tmp = []   # training loss per epoch
    tmp2 = []  # dev loss per epoch
    acc = []   # training accuracy per epoch
    acc2 = []  # dev accuracy per epoch
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=drop_rate, losses=losses)
        tmp.append(losses['textcat'])  # save the training loss
        acc.append(class_report(nlp.tokenizer, textcat, train_texts, train_cats)['accuracy'])  # train-set accuracy
        losses_dev = {}
        batches2 = minibatch(dev_data, size=compounding(4., 32., 1.001))
        for batch2 in batches2:
            texts2, annotations2 = zip(*batch2)
            # NB: sgd=None makes nlp.update fall back to the default optimizer in spaCy v2,
            # so I pass a no-op explicitly to really skip the weight update on dev data
            nlp.update(texts2, annotations2, sgd=no_update, losses=losses_dev)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
              .format(losses['textcat'], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))
        tmp2.append(losses_dev['textcat'])
        acc2.append(class_report(nlp.tokenizer, textcat, dev_texts, dev_cats)['accuracy'])
```
-
Initially I used the example from https://spacy.io/usage/training#textcat. Do I understand correctly that in that example the losses calculated on the dev_texts do not bring any updates to the model parameters, and are only used for printing? I am asking because I am wondering whether I need to split into train/test or into train/test/final evaluation.
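If a separate final evaluation set is needed, I would split it off like this; a minimal sketch using scikit-learn's train_test_split (the `data` variable and the 80/10/10 ratios are just my assumptions):

```python
from sklearn.model_selection import train_test_split

# 80/10/10 split: tune against dev_data, touch test_data only once at the very end
train_data, rest = train_test_split(data, test_size=0.2, random_state=0)
dev_data, test_data = train_test_split(rest, test_size=0.5, random_state=0)
```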
-
With the code above and the 'bow' architecture, I got the following losses and plots:
| LOSS | Prec | Recall | Fscore |
|-------|-------|--------|--------|
| 3.215 | 1.000 | 0.766 | 0.867 |
| 2.314 | 1.000 | 0.844 | 0.915 |
| 1.494 | 1.000 | 0.937 | 0.968 |
| 1.138 | 1.000 | 0.969 | 0.984 |
| 0.844 | 1.000 | 0.984 | 0.992 |
| 0.607 | 1.000 | 0.984 | 0.992 |
| 0.422 | 1.000 | 1.000 | 1.000 |
| 0.408 | 1.000 | 1.000 | 1.000 |
| 0.301 | 1.000 | 1.000 | 1.000 |
| 0.275 | 1.000 | 1.000 | 1.000 |
| 0.257 | 1.000 | 1.000 | 1.000 |
| 0.181 | 1.000 | 1.000 | 1.000 |
| 0.215 | 1.000 | 1.000 | 1.000 |
| 0.115 | 1.000 | 1.000 | 1.000 |
| 0.138 | 1.000 | 1.000 | 1.000 |
| 0.132 | 1.000 | 1.000 | 1.000 |

...and the losses are still going down after that.
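The plots come from the saved lists, roughly like this (a sketch of my plotting code, assuming matplotlib):

```python
import matplotlib.pyplot as plt

# tmp/tmp2 hold the train/dev losses, acc/acc2 the accuracies saved per epoch
epochs = range(1, n_iter + 1)
plt.plot(epochs, tmp, label="train loss")
plt.plot(epochs, tmp2, label="dev loss")
plt.plot(epochs, acc, label="train accuracy")
plt.plot(epochs, acc2, label="dev accuracy")
plt.xlabel("epoch")
plt.legend()
plt.show()
```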
It looks strange to have such high accuracy. What might be the reason for it? It looks wrong to me.
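To rule out duplicated texts between the two splits, I can run a quick leakage check (a sketch; train_texts and dev_texts are the lists from the snippet above):

```python
# texts that appear in both splits would make the dev scores look too good
overlap = set(train_texts) & set(dev_texts)
print("{} texts appear in both train and dev".format(len(overlap)))
```
-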
I checked the thread "Help needed to get started with text classification" here, which was a discussion about bow vs cnn. Since I have compared several architecture types, I have the feeling that 'default' works slower than 'ensemble'. I do not understand why; shouldn't they be the same? It might be related to my previous question about the suspiciously high accuracy. But in general, yes, 'bow' seems to give slightly higher accuracy and the fastest training time, just as reported in that discussion.
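For reference, this is how I create the pipe and select the architecture (config keys as in the v2 TextCategorizer docs; the two labels are placeholders for mine):

```python
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "bow"},  # or "simple_cnn" / "ensemble"
)
nlp.add_pipe(textcat, last=True)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
```
-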
Is it possible somehow to link textcat with any explainer? Ideally I would like to extract the kernel to see which features are the most meaningful.
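For example, would wiring the model into LIME be a reasonable approach? A sketch of what I have in mind (this uses the third-party lime package; the predict_proba wrapper and the labels list are my assumptions):

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

labels = list(textcat.labels)

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) array of class probabilities
    return np.array([[doc.cats[label] for label in labels] for doc in nlp.pipe(texts)])

explainer = LimeTextExplainer(class_names=labels)
exp = explainer.explain_instance(dev_texts[0], predict_proba, num_features=10)
print(exp.as_list())  # tokens with the strongest influence on the prediction
```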
-
I am not sure whether I need to do any additional text preprocessing (I have done almost none so far). I found the tip "Tip: merge phrases and entities", but it was not about textcat, so I am not sure what the best strategy is here.
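As far as I understand, that tip refers to the built-in v2 pipe factories, e.g. (a sketch with a full pretrained pipeline, since these components need NER and the parser; whether this helps textcat is exactly what I am asking):

```python
import spacy

nlp_full = spacy.load("en_core_web_sm")  # NER and parser are required here
nlp_full.add_pipe(nlp_full.create_pipe("merge_entities"))
nlp_full.add_pipe(nlp_full.create_pipe("merge_noun_chunks"))
doc = nlp_full("Apple is looking at buying U.K. startup for $1 billion")
print([t.text for t in doc])  # entities and noun chunks are now single tokens
```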
Thank you.