Just some thoughts on augmentation, workflows, and some nice-to-haves for future features.
It is my understanding that if we are doing multi-label classification, it is best practice to augment our data so that each class has approximately the same number of instances. Furthermore, we want to augment our data so that we have about the same number of ‘accepts’ and ‘rejects’ from our annotation process.
Depending on the class, an additive or a subtractive approach may or may not be appropriate. For example, if I am detecting phone numbers, it is reasonable to replace any phone number with any other to create more accepted annotations.
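As a rough sketch of the additive case, phone numbers could be swapped for randomly generated ones to mint extra accepted examples. The regex, phone format, and function names here are all illustrative assumptions, not a real augmentation API:

```python
import random
import re

# Illustrative US-style pattern; a real setup would match whatever
# formats appear in the data.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def random_phone():
    """Generate a random phone number in the same format the regex matches."""
    return "{:03d}-{:03d}-{:04d}".format(
        random.randint(200, 999), random.randint(0, 999), random.randint(0, 9999)
    )

def augment_phone(text, n=3):
    """Return up to n augmented copies of `text` with phone numbers swapped.

    Texts without a phone number yield no augmented copies.
    """
    if not PHONE_RE.search(text):
        return []
    return [PHONE_RE.sub(lambda m: random_phone(), text) for _ in range(n)]

examples = augment_phone("Call me at 555-123-4567 tomorrow.")
```

Each augmented copy keeps the surrounding context intact, so the new accepted annotations stay realistic.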
On the other hand, if I am detecting products, I might not want to replace any product with any other product, because I would lose the context of the surrounding text in any downstream task. However, if we can detect that the context is close (spaCy has similarity), or that the products are close (using word vectors), maybe substitution is appropriate.
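A minimal sketch of a similarity gate for substitution, using cosine similarity over toy vectors. The vectors, vocabulary, and threshold below are stand-in assumptions; in practice you would pull real word vectors (e.g. from a spaCy model with vectors) and tune the threshold:

```python
import numpy as np

# Toy word vectors standing in for real ones; values are made up for
# illustration only.
VECTORS = {
    "laptop":   np.array([0.9, 0.1, 0.0]),
    "notebook": np.array([0.85, 0.15, 0.05]),
    "banana":   np.array([0.0, 0.1, 0.95]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def substitution_ok(original, candidate, threshold=0.9):
    """Only allow a product swap when the word vectors are close enough."""
    return cosine(VECTORS[original], VECTORS[candidate]) >= threshold

substitution_ok("laptop", "notebook")  # similar products: allowed
substitution_ok("laptop", "banana")    # unrelated: rejected
```

The same gating idea works for context similarity: compare the sentence (or window) embeddings of the original and candidate contexts instead of the product vectors.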
Or maybe we don’t mind throwing away a bunch of annotations, in which case we want to discard the fewest annotations possible while still ending up with an approximately balanced dataset.
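For the subtractive case, discarding the fewest annotations amounts to downsampling every class to the size of the smallest one. A quick sketch (the annotation dict shape and `label` key are assumptions):

```python
import random
from collections import defaultdict

def balance_by_downsampling(annotations, key=lambda a: a["label"], seed=0):
    """Discard the fewest annotations such that every class has the same
    count: downsample every class to the size of the smallest one."""
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[key(ann)].append(ann)
    target = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, target))
    return balanced

anns = [{"label": "PHONE"}] * 10 + [{"label": "PRODUCT"}] * 4
balanced = balance_by_downsampling(anns)  # 4 of each class
```

The same `key` function could select on accept/reject instead of label to balance annotation decisions rather than classes.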
It would be nice to have these cases covered when you implement some nice augmentation functionality.
Let’s say I have some augmentation process that I am happy with - how do I integrate it with the annotation process? I was thinking:
1. Annotate on a new label using the last trained model.
2. Augment the new annotations from step 1, plus all the other (un-augmented) annotations.
3. Train a new model on the output of step 2, with all labels made so far.
4. If you have more labels to annotate, go to step 1.
5. Evaluate per-label accuracy. If you are happy, you are done; if not, go back to step 1 using the label you are most unhappy with.
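The loop above could be sketched roughly as follows. Every function here (`annotate`, `augment`, `train`, `evaluate`) is a placeholder stub, not a real API; the point is only the control flow:

```python
# Stubbed-out steps standing in for the real annotation/training tools.

def annotate(label, model):
    """Step 1: collect annotations for one label, assisted by the last model."""
    return [{"label": label, "accept": True}]

def augment(new_anns, existing_anns):
    """Step 2: a real version would add augmented copies of the new annotations."""
    return new_anns + existing_anns

def train(dataset):
    """Step 3: train a fresh model on everything so far."""
    return "model-trained-on-%d-examples" % len(dataset)

def evaluate(model, labels):
    """Step 5: per-label accuracy (dummy values here)."""
    return {label: 1.0 for label in labels}

labels_to_do = ["PHONE", "PRODUCT"]
annotations, model, done = [], None, []
while labels_to_do:                              # step 4: loop over labels
    label = labels_to_do.pop(0)
    new_anns = annotate(label, model)            # step 1
    dataset = augment(new_anns, annotations)     # step 2
    annotations = new_anns + annotations         # keep un-augmented originals
    model = train(dataset)                       # step 3
    done.append(label)
scores = evaluate(model, done)                   # step 5
```

Keeping the un-augmented originals separate, as above, also answers part of the question below: each round augments only original annotations, never already-augmented ones.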
Any thoughts on this process? I have not heard of augmentation being used in an active-learning setting before.
Do people mostly just augment at the end of the annotation process?
In step 2, does it make sense to augment your already-augmented dataset?