Text Classifier annotations

Hi Prodigy team,

I have a question about the text classifier. I am working on a domain-specific dataset (oil and gas industry) and training on a single category only.

I trained the model using 1700 annotations and 600 evaluation examples (the data is not skewed).

My queries are:

  1. I have got the following results with 25 iterations.

    #    LOSS    F-SCORE   ACCURACY
    01   7.337   0.811     0.840
    02   3.945   0.894     0.903
    03   3.479   0.854     0.872
    04   2.830   0.837     0.855
    05   2.398   0.832     0.853
    06   1.567   0.836     0.855
    07   1.395   0.830     0.852
    08   1.418   0.847     0.865
    09   1.185   0.858     0.872
    10   1.831   0.839     0.858
    11   1.568   0.847     0.865
    12   1.119   0.845     0.863
    13   0.692   0.847     0.865
    14   1.308   0.852     0.868
    15   0.877   0.858     0.872
    16   0.994   0.851     0.867
    17   0.623   0.809     0.835
    18   0.739   0.815     0.840
    19   0.296   0.810     0.837
    20   0.679   0.783     0.817
    21   1.262   0.783     0.818
    22   1.205   0.777     0.812
    23   0.989   0.787     0.818
    24   0.683   0.788     0.820
    25   0.940   0.792     0.823

    Baseline    0.50
    Precision   0.99
    Recall      0.82
    F-score     0.89
    Accuracy    0.90

The model gives quite decent results after only a few iterations, which confuses me, since NN models usually need many more iterations. Is it normal to expect such results with so few iterations in spaCy, or am I missing something?

  2. I have gone through the Prodigy documentation, which states that the best model will be stored. What is the criterion for the best model? Let’s say in the above example:

Will model no. 19, with the lowest loss, or model no. 2, with the highest accuracy, be stored?

  3. The trained classifier is evaluated on the basis of ‘accept’ and ‘reject’ (we can also see the score by calling nlp on a text and reading .cats). But how does the model assign a score to a sentence, and what is the threshold for assigning ‘accept’ or ‘reject’? Is it 0.50? Is there any way to change the threshold?

  4. I have gone through the following link:
    https://support.prodi.gy/t/best-practices-for-text-classifier-annotations/135/4
    but are there any other useful tips I should keep in mind while training this kind of model?

Some of the above questions may already be covered in the documentation and I was not able to find them. Apologies for that. :slight_smile:

Thanks

@khushal17ad Glad to see the results look promising — I hope this translates to useful accuracy! Of course, that’s not to be taken for granted…I always find 0.99 precision (or 0.99 recall) suspicious.

You might want to look at the model’s outputs carefully to make sure it hasn’t learned something fairly trivial. It could be that there’s some subset of the true examples (80% or so, for instance) that can be trivially identified with a few key words. The model might have learned that and nothing else. Then as iterations continue, it tries to learn a better hypothesis, but it can’t hill-climb to something better than its simple initial solution.

The 0.99 precision, in combination with the early peak, is suspicious — but the early peak alone isn’t necessarily. I sometimes joke: “If accuracy is still improving after 20 epochs, I’m unhappy: why is training so slow? But then if it stops improving after 2 epochs, I’m also unhappy: why’s it stop learning?”

I designed spaCy’s text classifier for Prodigy, and tried my best to get it to learn quickly. If the model takes too long to respond, the active learning is much less effective. A key trick is to make the model an ensemble between a unigram bag-of-words and the CNN model. The unigram bag-of-words features converge very quickly. It may be that the model is mostly relying on them. If that’s the case, it’s possible you’ll see better accuracy from using scikit-learn or Vowpal Wabbit, so it might be worth trying. It would be nice to have some extensions for Prodigy to make those experiments easier.

The criterion is accuracy on the held-out set. The code is available in your Prodigy installation; do python -c "import prodigy.recipes.textcat; print(prodigy.recipes.textcat.__file__)" to get the path.
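To make the selection rule concrete, here is a minimal sketch applied to a few rows copied from the table in the question. This is not Prodigy's actual code; the `results` list and the variable names are made up for illustration.

```python
# Illustrative only: a handful of (epoch, loss, f_score, accuracy) rows
# copied from the training output quoted in the question.
results = [
    (1, 7.337, 0.811, 0.840),
    (2, 3.945, 0.894, 0.903),
    (9, 1.185, 0.858, 0.872),
    (19, 0.296, 0.810, 0.837),
    (25, 0.940, 0.792, 0.823),
]

# Selection by highest accuracy on the held-out set.
best_by_accuracy = max(results, key=lambda row: row[3])

# For comparison: the epoch with the lowest training loss.
lowest_loss = min(results, key=lambda row: row[1])

print("best by accuracy: epoch", best_by_accuracy[0])
print("lowest loss: epoch", lowest_loss[0])
```

So for the run above, the model from iteration 2 is the one that gets kept, not the one from iteration 19.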

I’m not sure what shape of answer you’re looking for here. I mean, the model assigns the score by doing a bunch of linear algebra — but that’s obviously not a useful explanation!

For the purposes of evaluation, the model regards 0.5 as the cutoff for “accept”. It’s trained on a cross-entropy objective though, so it’s trying to assign 1.0 to the positive classes and 0.0 to the negative ones. So, there’s nothing special about 0.5. You can make your own decisions based on the score in doc.cats.
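As a rough sketch of making your own decision from the scores: `doc.cats` is a dict mapping each label to its score, so you can apply whatever cutoff suits your use case. The `decide` helper and the example scores below are hypothetical, and I'm assuming that a score exactly at the threshold counts as "accept".

```python
def decide(cats, label, threshold=0.5):
    """Return 'accept' if the score for `label` reaches `threshold`, else 'reject'."""
    return "accept" if cats[label] >= threshold else "reject"


# Hypothetical scores, shaped like doc.cats (a dict mapping label -> score).
cats = {"LABEL": 0.62}

default_decision = decide(cats, "LABEL")               # uses the 0.5 cutoff
strict_decision = decide(cats, "LABEL", threshold=0.8)  # stricter cutoff

print(default_decision, strict_decision)
```

A higher threshold will generally trade recall for precision, and a lower one the reverse.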

There’s no way to change the threshold in the evaluation function currently. As a quick hack, you could add a pipeline component that boosted the scores, e.g.


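def increase_scores(doc):
    # Shift the score for one label so more examples clear the 0.5 cutoff.
    # 'LABEL' is a placeholder for your own category name.
    doc.cats['LABEL'] += 0.3
    return doc

# Run the component right after the text classifier.
nlp.add_pipe(increase_scores, after='textcat')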

Hi Honnibal,

Thanks for the detailed answer.

I am looking at the model's output and trying to figure out what the potential issue could be. I need some more time to sort this out.

Looking at this, I think I should give more attention to the precision and recall parameters rather than caring more about the iterations.

I will try to look into the possibility for this kind of model.

By this, do you mean that the model is not taking context into account?

Yes, that will be a great value addition :slight_smile:

I got it :slight_smile:

That helps a bit. I guess I need to read the articles linked from the spaCy API page about the NN architecture to dig deeper into it.

Yes, that solves my doubt. Thanks for the hacking tip.

Thanks again

Just to be clear, Prodigy is already doing this. It might be why it converges quickly.

To be a bit more precise (since “model” is frustratingly ambiguous in ML…): the algorithm that learns the weights can find a solution which does take context into account. However, I think it’s settled on a solution that doesn’t take context into account for your problem.

Try following this guide: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html . It shows you how to build a simple model that only has access to each word independently. scikit-learn has nice feature and frequency visualisation stuff, so you’ll be able to see whether a couple of words dominate. This should help you figure out what’s going on.
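Before setting up scikit-learn, a quick dependency-free check along the same lines might look like the sketch below. It is a stand-in, not the approach from the tutorial: for every word it scores the trivial rule "text contains this word, so accept". The example texts are made up, so substitute your own annotations, and `keyword_stats` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up toy data: (text, is_accept). Replace with your own annotations.
examples = [
    ("pump failure reported on rig", True),
    ("pump pressure dropped overnight", True),
    ("valve leak near the pump", True),
    ("corrosion found on pipeline", True),
    ("quarterly budget meeting notes", False),
    ("staff travel schedule updated", False),
]


def keyword_stats(examples):
    """For each word, score the rule 'text contains word -> accept'."""
    hits = defaultdict(lambda: [0, 0])  # word -> [accept_count, total_count]
    n_accept = sum(1 for _, label in examples if label)
    for text, label in examples:
        # Count each word once per text, ignoring case.
        for word in set(text.lower().split()):
            hits[word][1] += 1
            if label:
                hits[word][0] += 1
    return {
        word: {"precision": acc / total, "recall": acc / n_accept}
        for word, (acc, total) in hits.items()
    }


stats = keyword_stats(examples)

# Show the words that cover the largest share of accepted examples.
for word, s in sorted(stats.items(), key=lambda kv: -kv[1]["recall"])[:5]:
    print(word, round(s["precision"], 2), round(s["recall"], 2))
```

If a single word or a small handful of words shows very high precision and covers most of the accepted examples, that would be consistent with the classifier having settled on a keyword-level solution.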