Best practices for text classifier annotations

I’m working on a text classification model similar to the insults classifier example, but I am training it on twitter posts. My model is intended to identify harassment aimed at people and groups in online posts.

My ultimate goal is then to have a system that can identify harassment and differentiate between harassment aimed at a person and aimed at a group of people. To keep the problem constrained while I find my feet I’m only considering harassment aimed at a specific person.

I’ve gone through about 400 examples, my model has not been able to beat ~80% accuracy during training, and it misses some obvious things in the real world, which makes me wonder how well it has generalized.

I have a few, possibly dumb, questions about existing best practices for creating text classification datasets for training with Prodigy.

  1. Is there a good rule of thumb for the ratio of accepted to rejected examples? I wonder if the poor performance on real-world data is because I rejected too many texts and the system learned to reject everything?

  2. Is there a way with Prodigy to review a dataset’s breakdown of accepted/rejected/ignored annotations?

  3. My intuition tells me that trimming garbage data from annotation texts might help the learning process. For instance someone posts “!!_$? You are a horrible person”, does it make sense to trim the example to “You are a horrible person” and accept that?

  4. Are there any good examples/discussions of how to debug model performance problems (with Prodigy?) to which you could direct me?



Thanks for the questions and sharing your use case! What you’re trying to do definitely sounds feasible, so here are some answers and ideas:

In the beginning, you usually want a higher number of accepted examples – there are many things you might not want your model to learn, so it’s always good to start off with some examples of what you do want. A good way to do this is to begin with a list of seed terms that are very likely to occur in texts your label applies to. You can see an example of an end-to-end workflow with seed terms in my insults classifier tutorial.

All Prodigy recipes are included as Python files, so you can edit the module and tweak the recipe, or take inspiration from it to write your own. By default, Prodigy uses the prefer_uncertain sorter, which should ideally lead to a roughly 50/50 distribution of accept and reject, since it will ask about the examples it’s most uncertain about, i.e. the ones with a prediction closest to 50/50. You can also try tweaking the bias keyword argument, which recenters the preference away from 0.5. Alternatively, you could also swap the prefer_uncertain sorter for prefer_high_scores to start off with high-scoring examples. You can find more info and the API reference in the PRODIGY_README.html.
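To make the sorting idea concrete, here’s a toy illustration of what prefer_uncertain is going for – this is not Prodigy’s actual implementation (the real sorter works on an infinite stream and is smarter about it), just the underlying intuition:

```python
# Toy illustration only, NOT Prodigy's actual implementation: surface the
# examples whose scores are closest to the bias point (0.5 by default),
# i.e. the ones the model is least sure about.
def toy_prefer_uncertain(scored_examples, bias=0.5):
    """scored_examples: iterable of (score, example) tuples."""
    ranked = sorted(scored_examples, key=lambda pair: abs(pair[0] - bias))
    return [example for score, example in ranked]

queue = [(0.93, 'clear harassment'), (0.48, 'borderline'), (0.05, 'clearly fine')]
```

With the default bias, 'borderline' (0.48) would be queued first; raising the bias towards 1.0 shifts the preference to high-scoring examples, which is roughly the effect of switching to prefer_high_scores.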

Yes, you can run prodigy stats [dataset_name], and it will print the dataset meta, number of annotations and a breakdown by answer.

You could definitely try that – however, whether it’s a good idea depends on the data your application will see at runtime. If you’re training on nicely cleaned text, but your model will have to process live tweets as they come in, you might see significantly worse performance. So as a general rule of thumb, we’d always recommend training on data that’s as close to the expected runtime input as possible. Sometimes, this can even mean messing up your clean data on purpose to make the model more robust.
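As a toy example of that last point, you could perturb clean texts before training – the specific corruptions below are made up for illustration, not a recommendation from any docs:

```python
import random

# Sketch: deliberately roughen up clean training text so it looks more
# like live tweets. Which corruptions make sense depends entirely on
# your data; these two are just illustrations.
def add_noise(text, seed=0):
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    tokens = text.split()
    if rng.random() < 0.5:
        tokens.insert(0, rng.choice(['!!', 'lol', '@someone']))  # leading junk
    if rng.random() < 0.3 and tokens:
        i = rng.randrange(len(tokens))
        tokens[i] = tokens[i].rstrip('aeiou')  # crude typo: strip trailing vowels
    return ' '.join(tokens)
```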

Check out the “Debugging & logging” section in the PRODIGY_README. You can set the environment variable PRODIGY_LOGGING=basic or PRODIGY_LOGGING=verbose to log what’s going on. The verbose mode will also print the individual annotation examples passing through Prodigy. In your own code, you can use the prodigy.log helper to add entries to the log.

To get a better idea of how the annotations affect the training, you can also run the textcat.train-curve command, which will train on different portions of the data (by default, 25%, 50%, 75% and 100%). As a rule of thumb, the last 25% are usually the most relevant – if you see an increase in accuracy here, it’s likely that collecting more annotations similar to the ones in your set will improve the model. If not, it can indicate that the types of annotations you collect need to be adjusted.

Once you’re getting more “serious” about training and evaluation, it’s also often a good idea to create a separate, dedicated evaluation set (if you haven’t done so already). By default, the batch-train command will select the evaluation examples randomly – if the distribution of accepted and rejected examples is very uneven, this can also lead to a suboptimal distribution among the training and evaluation set. You can use the textcat.eval recipe to create an evaluation set from a stream. Setting the --exclude argument lets you exclude examples from your training set, to make sure no training examples are present in the evaluation set. When you exit textcat.eval, Prodigy will also print evaluation stats based on the collected examples.


Thanks for the comprehensive reply!

I had roughly 80 accept / 120 reject examples. Removing some rejects and adding a few more accepts flipped the 100% output on textcat.train-curve from negative to positive, thanks! :clap:

> prodigy stats harassment && prodigy textcat.train-curve harassment en_core_web_sm


  Dataset            harassment         
  Annotations        573                
  Accept             109                
  Reject             104                
  Ignore             360                 

Starting with model en_core_web_sm
Dropout: 0.2  Batch size: 10  Iterations: 5  Samples: 4

%          ACCURACY  
25%        0.52       +0.52                                                                                                                                                                                             
50%        0.49       -0.04                                                                                                                                                                                             
75%        0.53       +0.05                                                                                                                                                                                             
100%       0.67       +0.13       

Just so I’m clear, when you say the last 25% you mean the 100% reading, correct?

Great, I like the idea of leaving the data as-is. It does make me think I should mention that the Twitter loader is truncating tweets: I see a lot of posts cut off with a trailing link to the full message.

The problem description is here and here, but the gist is that when Twitter updated the tweet length to 280, their API started returning compatibility mode tweets that link to the full-text message. The search API calls need to have tweet_mode=extended set to receive complete messages.
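In code terms the change is small – roughly this (endpoint per the v1.1 search API docs; auth and paging omitted):

```python
# Sketch: the key change is one extra request parameter plus reading a
# different field from the response.
SEARCH_URL = 'https://api.twitter.com/1.1/search/tweets.json'

def search_params(query):
    # tweet_mode=extended asks the API for full 280-character tweets
    return {'q': query, 'tweet_mode': 'extended'}

def texts_from_response(response):
    # extended tweets carry `full_text`; compatibility-mode ones only `text`
    for status in response['statuses']:
        yield {'text': status.get('full_text') or status.get('text')}
```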

I was wondering about this after looking at the console output during batch-train. Can you tell me about how the split_evals function works? My dataset has more ignore examples than accept and reject combined. The textcat.batch-train output indicates that it’s splitting up all of my examples including the ignore ones. Removing the ignore examples speeds up my training sessions. What role do the ignore examples play in the training process?

Yes, exactly, the last step between 75% and 100%. In your case, this actually looks very good – 13% improvement is pretty significant. So there’s a high chance that you will keep seeing improvements if you collect more annotations similar to the ones you already have. It also confirms that your approach is very feasible, and that it makes sense to invest more and explore it further :blush:

(Really glad to see this is working well so far btw – this type of analysis and being able to verify an idea quickly is one of the central use cases we’ve had in mind for Prodigy!)

Thanks for the info, this is really helpful. I wrote the Twitter loader before Twitter introduced the 280 limit, and we actually launched Prodigy shortly after they announced the new API offerings. I’ve been meaning to look into this and see if it’ll finally make the Twitter API less frustrating to work with. I’d love to drop the external dependency and use a more straightforward implementation.

Loaders are pretty simple, though, so you can also just write your own. All it needs to do is query the API and reformat the response to match Prodigy’s JSON format:

import requests

def custom_loader():
    page = 0   # if the API is paged, keep a counter
    while True:
        r = requests.get('http://some-api', params={'page': page})
        response = r.json()
        results = response.get('results', [])  # or however it's structured
        if not results:   # stop once the API runs out of pages
            break
        for item in results:
            yield {'text': item['text']}  # plus any other relevant fields
        page += 1  # after a page is exhausted, move on to the next

Once you move past the experimental stage, you might also want to consider scraping the tweets manually instead of streaming them in via the live API. This way, you’re also less dependent on arbitrary limitations and what Twitter decides to show you or not show you – assuming you’re using the free API. (After all, the docs explicitly state that the API is focused on “relevance and not completeness”.)

Sorry – this should probably be documented better. split_evals takes an already shuffled list of annotated examples and an optional eval_split (set via the command line). If no eval_split is set, it defaults to 0.5 for datasets of under 1000 examples, and 0.2 for larger sets. The function then splits the examples accordingly, and returns a (train_examples, eval_examples, eval_split) triple.
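In code, the behaviour is roughly this (a simplification of the real function, for illustration):

```python
# Simplified sketch of the split logic described above, not the actual
# Prodigy source: small datasets hold out half for evaluation, larger
# ones hold out 20%.
def split_evals(examples, eval_split=None):
    # `examples` is assumed to be shuffled already
    if eval_split is None:
        eval_split = 0.5 if len(examples) < 1000 else 0.2
    n_eval = int(len(examples) * eval_split)
    eval_examples = examples[:n_eval]
    train_examples = examples[n_eval:]
    return train_examples, eval_examples, eval_split
```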

The ignore examples have no further purpose except for being ignored. They’re still stored in the database in case you want to use or analyse them later. For example, if you’re working with external annotators, you might want to look at the examples they ended up ignoring to find out what was most unclear, and whether there were any problems with your data. In some cases, you might also want to go back and re-annotate them later on.

Now that I think about it, Prodigy should probably filter out the ignore examples before doing the shuffling and splitting. Will fix this in the next release :+1: Maybe we could also add an option to the prodigy drop command that lets you delete examples with a certain answer, if that’s helpful? For example, you could do prodigy drop my_dataset --answer ignore and it’d remove all ignores from the set.


Thanks again for your thoughtful responses!

You were right, that was pretty easy! I put together a few loader functions for experimenting with ways to query the Twitter API. I used a library called twitter on pip. I created a gist in case any of it is useful.

Yeah, I’m still not 100% sure about how I want to interact with their API… :persevere: I’ve been experimenting with both the Search and Stream APIs. I’m thinking that Search will be best for manual scraping during model training, and the Stream APIs with tracking will work for users in production who want the system to watch their feeds.

At this point, my model is around 73% accurate with a little over 1000 accept/reject examples, and I have a few more (possibly dumb) questions:

Can you intentionally teach a model about specific things?

I noticed that my model seemed to be learning interesting things when I found that it gave “nice tits” a higher probability of being harassment than “nice dress.” It was encouraging until I fed it “hearty soup” and it reported 90% probability of being harassment. :sob:

The problem appeared to be that the model thinks all “[adj] [noun]” examples are bad. That felt like a problem I could get a handle on, and one that might offer an opportunity to “engineer” some of the features the model learns.

What I would like to try is to generate “[adj] [noun]” example pairs from a set of input terms (similar to terms.teach) using the vectors model. So for example, I would give it the input “nice shoes,” and it could use the vectors model to suggest similar combinations that might be near the same meaning, e.g., “killer boots” or “sweet loafers.” The same could then be done for positive example inputs like “nice tits” or “smile more.” The desired outcome would be that the model learns about which combinations of “[adj] [noun]” are harassing, and which are not.

I suppose my question is, does the approach of generating extra input examples from a “template” seem reasonable? Do people try to intentionally engineer the features that are learned by models using similar techniques? I’ll probably try it anyway just to get familiar with how the word vectors work, but I’m interested in your thoughts/guidance as well.
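To make the idea concrete, here’s the kind of thing I mean, with hand-made toy vectors standing in for real word vectors (a real run would use spaCy’s vocabulary vectors, and these similarity values are meaningless):

```python
from math import sqrt

# Toy 3-d vectors; the values are invented purely for illustration.
vectors = {
    'nice':   [1.0, 0.1, 0.0],
    'sweet':  [0.9, 0.2, 0.1],
    'killer': [0.8, 0.3, 0.0],
    'shoes':  [0.0, 1.0, 0.2],
    'boots':  [0.1, 0.9, 0.3],
    'soup':   [0.0, 0.2, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def most_similar(word, n=1):
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:n]]

def similar_pairs(adj, noun, n=1):
    # expand one "[adj] [noun]" seed into nearby candidate pairs
    return [(a, b) for a in most_similar(adj, n) for b in most_similar(noun, n)]
```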

Is it natural for the train-curve to dip when learning about new things?

I’ve been finding examples of harassment by search term for a while. For instance, I would spend some period of time sorting through all the harassing stuff I could find using the search term “asshole”. What I have seen a few times is that the textcat.train-curve output takes a dip when I start adding new entries from a different set of search terms.

I know it’s not something you can answer definitively, but if you saw something like what was described above, would you think it means “your first few accept/reject examples from this new term suck” or “I don’t know much about these new terms, so it’s tanking the overall evaluation”?

Accept/Reject ratio after initial bootstrapping?

I’ve been keeping an eye on the number of accept/reject examples that I have, with reject now being about 100 more than accept. I’ve actually been skipping over other legitimate reject examples to keep them in balance. Is there a point after which you can stop paying close attention to the ratio of accept/reject? For example if I’ve got 2000 accept examples, would it be okay to have 3500 rejections?

Ship It!

I was able to deploy my proof of concept, thanks to all your help until this point:

It still gets very confused about some things. It seems to think any leading “You” pronoun is harassing, and it has difficulty with “[adj] [noun]” phrases, as discussed above.

I’m impressed with Prodigy, excellent work!

@justindujardin Really awesome stuff — thanks for sharing so much about what you’re doing. I’ve been away on holiday, but it’s great to see so much progress on a cool application :smiley:

To answer some of your questions: Your idea about generating adjective + noun pairs and classifying them is interesting. One thing I would note is that the frequency distribution for phrases is very imbalanced. Common phrases are common, and most “possible” combinations will occur zero or one times in your data. It’s therefore excellent value to focus on phrases which are common in your data. If you extract whole noun phrases and sort them by frequency, you could then classify these phrases, and then project the classifications onto the tweets, and use the tweets as positive or negative examples. This will probably work pretty well.
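A minimal sketch of that counting step – with a stub standing in for a real extractor, which in practice would be something like spaCy’s doc.noun_chunks:

```python
from collections import Counter

def top_noun_phrases(tweets, extract, n=10):
    # count every extracted noun phrase across the corpus and rank them,
    # so annotation effort goes to the phrases that actually occur often
    counts = Counter()
    for tweet in tweets:
        counts.update(extract(tweet))
    return counts.most_common(n)

# Stub extractor for illustration; with spaCy it would be roughly
# [chunk.text.lower() for chunk in nlp(tweet).noun_chunks]
def fake_extract(text):
    return [p for p in ('nice dress', 'hearty soup') if p in text.lower()]
```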

You could also try using the merge options on terms.train-vectors, which lets you do the “sense2vec” trick described in the sense2vec blog post. This will let you use longer phrases in terms.teach. The Reddit model is already pretty good at suggesting synonyms of the terms you’re interested in.

I’m not so surprised by the behaviour of the loss function you’re seeing. Remember that in a fully connected neural network, everything is connected to everything else. This means that when the model learns some pattern like adj. + noun, and then that pattern gets falsified, a lot of weights will be changing. This messes up the recognition of other patterns temporarily, until it learns how to move everything around again.


sense2vec is great, many thanks! I noticed that the reddit_vectors model doesn’t appear to be available for download. Is that because it was a spaCy 1.x model?

I generated 500 examples of adj/noun pairs with synonyms (sorted by count) from sense2vec to test the idea out. My model accuracy increased about ~2% on its evaluation set. Further, the model appears to have become sensitive to adj/noun chunks at many positions in an input.

That’s a really helpful pointer, thanks! I’ve got about 4 million example tweets at this point, so I’ll try extracting noun phrases and sorting them as you suggest.

Hi, I'm starting out so have some basic questions about best practices for classification:

I have 10 labels and discussions with our annotators showed it would probably be easier for them to do annotation with sub-sets of the classes at a time.

My understanding is that I would therefore create different datasets from the training data for each subset of labels (they are not hierarchical). Once the data has been annotated for each subset, I would then merge the datasets via the example command prodigy db-merge dataset1,dataset2 combineddataset. I could then obtain the combined annotations for all labels by saving the combineddataset to the annotations.json.

Have I got this right?

I was also wondering if I could bootstrap the annotations with some patterns, but not all the classes are associated with clear vocabulary, in fact for some of them I think it's more in the semantics of words and syntactic constructions in which they occur.

If I were to use patterns, is it possible to provide patterns for several classes or does it then become a question of one pattern file per class and training a single model per class?

If it is ok to provide patterns for different classes in one patterns.jsonl file, is it also ok not to provide patterns for one class/label?
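Concretely, what I’m imagining is a single patterns file that mixes labels, with some labels absent entirely – something like this (format guessed from the docs, and these labels are just placeholders):

```jsonl
{"label": "THREAT", "pattern": [{"lower": "hurt"}, {"lower": "you"}]}
{"label": "THREAT", "pattern": [{"lower": "destroy"}]}
{"label": "MOCKERY", "pattern": [{"lower": "loser"}]}
```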

thanks for your advice,

[UPDATE] I tried this approach (“I would therefore create different datasets from the training data for each subset of labels…”) and the merged annotations JSON contains separate entries for each subset of labels, but for my next step I’d ideally have one set of annotations per ‘text’ entry.

[UPDATE2] Apologies, only found more reading on this after posting, so I now understand I can merge the entries by using the _input_hash. Going to dig around to see if someone has posted a nice snippet already :wink:
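In case it helps anyone else, here’s a first rough attempt while I look – this assumes the relevant exported fields are _input_hash, text, label and answer, which is worth double-checking against real db-out output:

```python
from collections import OrderedDict

def merge_by_input_hash(examples):
    # collapse annotations that share an input into one record per text,
    # keeping every accepted label; field names assumed from db-out output
    merged = OrderedDict()
    for eg in examples:
        entry = merged.setdefault(eg['_input_hash'],
                                  {'text': eg['text'], 'labels': []})
        if eg.get('answer') == 'accept':
            entry['labels'].append(eg['label'])
    return list(merged.values())
```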
