Best practices for text classifier annotations

Thanks for the questions and for sharing your use case! What you're trying to do definitely sounds feasible, so here are some answers and ideas:

In the beginning, you usually want a higher number of accepted examples – there are many things you might not want your model to learn, so it's always good to start off with some examples of what you do want. A good way to do this is to start off with a list of seed terms that are very likely to occur in texts your label applies to. You can see an example of an end-to-end workflow with seed terms in my insults classifier tutorial.
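Just to illustrate the idea, here's a minimal, version-agnostic sketch of pre-filtering a JSONL stream by seed terms before annotating – the file name, the `"text"` field and the seed terms are placeholders, and this is independent of whatever seed/pattern handling the built-in recipes give you:

```python
import json

# Hypothetical seed terms for an insults classifier – pick terms that are very
# likely to occur in texts your label applies to.
SEED_TERMS = ["idiot", "moron", "loser"]

def seed_filtered_stream(path):
    """Yield only examples whose text contains at least one seed term."""
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            text = eg["text"].lower()
            if any(term in text for term in SEED_TERMS):
                yield eg

# Placeholder file name – point this at your own raw texts:
# for eg in seed_filtered_stream("raw_texts.jsonl"):
#     print(eg["text"])
```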

All Prodigy recipes are included as Python files, so you can edit the prodigy.recipes.textcat module and tweak the recipe, or take inspiration from it to write your own. By default, Prodigy uses the prefer_uncertain sorter, which should ideally lead to a roughly 50/50 distribution of accept and reject, since it will ask about the examples it's most uncertain about, i.e. the ones with a prediction closest to 50/50. You can also try tweaking the bias keyword argument, which recenters the preference away from 0.5. Alternatively, you could also swap the prefer_uncertain sorter for prefer_high_scores to start off with high-scoring examples. You can find more info and the API reference in the PRODIGY_README.html.
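For illustration, here's a small sketch of how the sorters behave on a scored stream – the texts and scores below are made up, and in the real textcat.teach recipe the `(score, example)` tuples come from the model scoring your input stream:

```python
from prodigy.components.sorters import prefer_high_scores, prefer_uncertain

# Made-up (score, example) tuples – in textcat.teach, these come from the model.
scored_stream = [
    (0.51, {"text": "example close to the decision boundary"}),
    (0.93, {"text": "example the model is confident about"}),
    (0.07, {"text": "example the model considers very unlikely"}),
]

# Default behaviour: prefer examples with scores close to 0.5.
stream = prefer_uncertain(scored_stream)

# Alternatives: recenter the preference away from 0.5 via bias, or start with
# high-scoring examples instead.
# stream = prefer_uncertain(scored_stream, bias=0.8)
# stream = prefer_high_scores(scored_stream)

# Note: the sorters filter probabilistically, so with a tiny toy stream like
# this they won't necessarily emit every example.
for eg in stream:
    print(eg["text"])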

Yes, you can run prodigy stats [dataset_name], and it will print the dataset meta, number of annotations and a breakdown by answer.
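If you'd rather check this from Python (for example in a script), something like the following should give you a similar breakdown via the database API – the dataset name "my_dataset" is a placeholder:

```python
from collections import Counter

from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
examples = db.get_dataset("my_dataset")  # placeholder dataset name

print("Total annotations:", len(examples))
print("By answer:", Counter(eg["answer"] for eg in examples))
```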

You could definitely try that – however, whether it's a good idea depends on the data your application will see at runtime. If you're training on nicely cleaned text, but your model will have to process live tweets as they come in, you might see significantly worse performance. So as a general rule of thumb, we'd always recommend training on data that's as close to the expected runtime input as possible. Sometimes, this can even mean messing up your clean data on purpose to make the model more robust.
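If you do want to experiment with "messing up" clean text on purpose, a simple noise function is usually enough – the particular corruptions below (lowercasing, dropping characters, stripping punctuation) are just examples, not a recommendation for every use case:

```python
import random
import string

def add_noise(text, drop_char_prob=0.02, lowercase_prob=0.5):
    """Corrupt clean text a little so it looks more like messy runtime input."""
    if random.random() < lowercase_prob:
        text = text.lower()
    # Randomly drop a small fraction of characters (typos, missing letters).
    chars = [c for c in text if random.random() > drop_char_prob]
    # Occasionally strip punctuation, as informal text often does.
    if random.random() < 0.3:
        chars = [c for c in chars if c not in string.punctuation]
    return "".join(chars)

print(add_noise("This is a nicely cleaned sentence, with punctuation!"))
```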

Check out the "Debugging & logging" section in the PRODIGY_README. You can set the environment variable PRODIGY_LOGGING=basic or PRODIGY_LOGGING=verbose to log what's going on. The verbose mode will also print the individual annotation examples passing through Prodigy. In your own code, you can use the prodigy.log helper to add entries to the log.
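For example – assuming you're starting Prodigy from your own Python script or recipe file rather than exporting the variable in your shell – you could set the logging level and add your own entries like this:

```python
import os

# Assumption: set the logging level before Prodigy starts up, so do it before
# importing prodigy (or export PRODIGY_LOGGING in your shell instead).
os.environ["PRODIGY_LOGGING"] = "basic"  # or "verbose"

import prodigy

# Add your own entries to the same log, e.g. from a custom recipe or loader.
prodigy.log("RECIPE: Loaded 500 examples from custom stream")
```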

To get a better idea of how the annotations affect the training, you can also run the textcat.train-curve command, which will train on different portions of the data (by default, 25%, 50%, 75% and 100%). As a rule of thumb, the last 25% are usually the most relevant – if you see an increase in accuracy here, it's likely that collecting more annotations similar to the ones in your set will improve the model. If not, it can indicate that the types of annotations you collect need to be adjusted.

Once you're getting more "serious" about training and evaluation, it's also often a good idea to create a separate, dedicated evaluation set (if you haven't done so already). By default, the batch-train command will select the evaluation examples randomly – if the distribution of accepted and rejected examples is very uneven, this can also lead to a suboptimal distribution across the training and evaluation sets. You can use the textcat.eval recipe to create an evaluation set from a stream. Setting the --exclude argument lets you exclude examples that are already in your training set, to make sure no training examples end up in the evaluation set. When you exit textcat.eval, Prodigy will also print evaluation stats based on the collected examples.
