Data augmentation and workflow

Just some thoughts on augmentation, workflows, and some nice-to-haves for future features.
It is my understanding that if we are doing multilabel classification it is best practice to augment our data so that there is approximately the same number of instances of each class. Furthermore, we want to augment our data so that we have about the same number of ‘accepts’ and ‘rejects’ from our annotation process.
Depending on the class, it may or may not be appropriate to use an additive or subtractive approach. For example, if I am detecting phone numbers, then it is reasonable to replace any phone number with any other to create more accepted annotations.
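To make the phone-number case concrete, here's a minimal sketch of what I have in mind; the regex and the `random_phone` helper are made up for illustration, and if annotations store character offsets they would need adjusting when the replacement length differs:

```python
import random
import re

# Toy pattern for illustration; real phone numbers need a more robust regex.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def random_phone() -> str:
    # Generate a random number in the same 3-3-4 format.
    return "%03d-%03d-%04d" % (
        random.randint(100, 999),
        random.randint(100, 999),
        random.randint(0, 9999),
    )

def augment_phone_example(text: str) -> str:
    # Swap every detected phone number for a fresh random one,
    # keeping the surrounding context intact.
    return PHONE_RE.sub(lambda match: random_phone(), text)

print(augment_phone_example("Call me at 555-123-4567 tomorrow."))
# e.g. "Call me at 847-219-0386 tomorrow."
```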
On the other hand, if I am detecting products, I might not want to replace any product with any other product, because I would lose the context of the surrounding text in any downstream task. However, if we can detect that the contexts are close (spaCy has similarity methods), or that the products are close (using word vectors), maybe substitution is appropriate.
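For the product case, a similarity gate might look something like the sketch below; the model name and the 0.7 threshold are assumptions, and it needs a spaCy model that ships with word vectors:

```python
import spacy

# Needs a model with word vectors, e.g.:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def similar_enough(product_a: str, product_b: str, threshold: float = 0.7) -> bool:
    # Only allow substituting one product mention for another if their
    # word vectors are close in vector space.
    a, b = nlp(product_a), nlp(product_b)
    if not a.vector_norm or not b.vector_norm:
        return False  # one of the phrases has no vector; don't substitute
    return a.similarity(b) >= threshold

print(similar_enough("laptop", "notebook computer"))  # likely True
print(similar_enough("laptop", "banana"))             # likely False
```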
Or maybe we don’t mind throwing away a bunch of annotations, in which case we want to discard the fewest possible annotations that will still give us an approximately balanced dataset.
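By "discard the fewest possible", I mean something like this sketch. It assumes one label per annotation; with multiple labels per example this becomes a harder optimisation problem:

```python
import random
from collections import defaultdict

def balance_by_downsampling(examples, get_label, seed=0):
    # Group annotations by label, then keep only as many per label as the
    # rarest label has, so we discard the minimum needed for exact balance.
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[get_label(example)].append(example)
    n_keep = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_keep))
    rng.shuffle(balanced)
    return balanced
```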

It would be nice to have these cases covered when you implement some nice augmentation functionality.

Let's say I have some augmentation process that I am happy with. How should it integrate with the annotation process? I was thinking (roughly sketched in code after the list):

  1. Annotate a new label using the last trained model.
  2. Augment the new annotations from (1) plus all other annotations (the un-augmented ones).
  3. Train a new model on (2) with all labels made so far.
  4. Go back to (1) if you have more labels to annotate.
  5. Evaluate per-label accuracy; if you're happy, you're done; if not, go back to (1) with the label you are most unhappy with.
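In rough Python, where `annotate`, `augment`, `train`, and `evaluate_per_label` are placeholder callbacks for whatever annotation tool and training harness you use, not a real API:

```python
def annotation_loop(labels, annotate, augment, train, evaluate_per_label,
                    target=0.85):
    model = None
    raw_annotations = []   # un-augmented pool; we never augment augmented data
    todo = list(labels)
    while todo:
        label = todo.pop(0)
        # (1) annotate the new label, suggesting with the last trained model
        raw_annotations += annotate(label, suggest_with=model)
        # (2) augment from the raw pool only
        train_data = augment(raw_annotations)
        # (3) train a fresh model on all labels so far
        model = train(train_data)
        # (4)/(5) once no labels are queued, re-queue the weakest one
        scores = evaluate_per_label(model)
        worst = min(scores, key=scores.get)
        if not todo and scores[worst] < target:
            todo.append(worst)
    return model
```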

Any thoughts on this process? I have not heard of using augmentation in an active learning setting before.

Do people mostly just augment at the end of the annotation process?

In step (2), does it make sense to augment your already-augmented dataset?

I had to think about this a while before deciding whether I agree. I think I agree that it's probably not important to represent the prior on the classes in the dataset. We can always adjust the bias later. But if we're optimising for micro-averaged F-score (i.e. accuracy over label instances, not accuracy over label types), we want the loss function to reflect the fact that some label types are much more important than others. For instance, if some label type is very rare, we might get to a higher F-score solution by ignoring it. If we change the data distribution during training, we're changing the loss function, so this higher F-score solution will be higher loss than a lower F-score solution.

This mismatch between the evaluation we care about and the loss function could be corrected by weighting the examples and multiplying the gradients by the instance weights before the update. However, this takes us back to some of the problems we were trying to correct by changing the data distribution.
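To make the weighting idea concrete, here's a toy version in plain NumPy; it's an illustration of per-example loss weights, not a recipe:

```python
import numpy as np

def weighted_logistic_loss(logits, targets, weights):
    # Per-example binary cross-entropy, rescaled by instance weights.
    # Weighting each example by the inverse of how much it was over- or
    # under-sampled restores the original objective, reintroducing the
    # imbalance the resampling was meant to remove.
    probs = 1.0 / (1.0 + np.exp(-logits))
    losses = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return np.mean(weights * losses)

logits = np.array([2.0, -1.0, 0.5])
targets = np.array([1.0, 0.0, 1.0])
weights = np.array([1.0, 1.0, 5.0])  # e.g. upweight a rare class
print(weighted_logistic_loss(logits, targets, weights))
```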

In summary, I think it's important to have a clear vision of what metric we care about. We then want to make sure the model's objective represents our interests. It's usually not a good idea to craft an objective that isn't what we want, and compensate by trying to prevent the model from optimizing for the stated objective. We don't want to be ambivalent: we want the lowest loss solution to be the one we most prefer, so we can optimize ruthlessly towards 0 loss.

If we care about macro-averaged F1, i.e. the average of the F1 scores per label type, then we definitely need to avoid having a per-instance loss --- because that won't optimize our true objective. If we care about micro-averaged F1, we should be careful about changing the data distribution to address imbalanced classes, to avoid changing the objective in a way we don't like.
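A toy example of how far the two metrics can diverge, using scikit-learn's `f1_score` (the data is made up; column 2 is a rare label the model ignores):

```python
import numpy as np
from sklearn.metrics import f1_score

# Multilabel toy data: columns are label types, column 2 is rare.
y_true = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Micro-F1 pools all label instances, so a rare label barely moves it;
# macro-F1 averages per-label F1, so ignoring the rare label hurts a lot.
print("micro:", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("macro:", f1_score(y_true, y_pred, average="macro", zero_division=0))
# micro ~0.89, macro ~0.67
```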

It's usually better to upsample the rare class instead of discarding data from the common class. Think of doubling the positive examples as making two passes over the data. If we're making two epochs, we may as well have different negative examples in each epoch.
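As a sketch, an epoch generator along those lines (assuming there are at least as many negatives as positives):

```python
import random

def epochs_with_fresh_negatives(positives, negatives, n_epochs=2, seed=0):
    # Repeat all positives every epoch (the upsampling), but draw a
    # different subsample of negatives each time, so extra passes over
    # the rare class don't mean identical passes over the common class.
    rng = random.Random(seed)
    for _ in range(n_epochs):
        epoch = positives + rng.sample(negatives, len(positives))
        rng.shuffle(epoch)
        yield epoch
```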

Yes, this can definitely be a problem. Much of what we want the model to learn is in these ordering effects, where the product is more likely given the previous context. Whether this substitution can help the model learn anything useful is very problem dependent.

There's a very long literature in NLP about building word lists and finding semantic relations using pattern templates. The techniques are now obsolete for that task because of word vectors, but I'd like to go back and understand the findings there better, because I think it has a new relevance for neural networks, as there are now new possibilities for bootstrap training.

Thank you for your thoughtful response!
Would you mind commenting on the workflow for combining augmentation and active learning (multilabel, NER)?