TLDR - Does using both sent.correct and sent.teach for the same dataset reduce training performance?
I have 100 annotations in a job postings dataset that explicitly delineate the correct sentence start via sent.correct. Training on these with the default config and a 70/30 train/eval split yields ~80% accuracy.
However, I noticed a significant decrease in training scores after I created 100+ additional annotations using sent.teach.
It appears that the rejected answers from sent.teach affected training performance. I was able to copy the examples from the db, filter them, drop the dataset, and import them back in.
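In case it's useful to anyone else, here's a rough sketch of that export/filter/re-import step using Prodigy's Python database API instead of db-out/db-in. The dataset names are made up, and it assumes the sent.teach examples carry the usual "answer" field ("accept"/"reject"/"ignore"):

```python
from prodigy.components.db import connect

# Hypothetical dataset names; adjust to your own.
SOURCE = "job_postings_sents"
CLEANED = "job_postings_sents_accept"

db = connect()  # uses the database settings from prodigy.json

# Pull every annotation stored for the source dataset.
examples = db.get_dataset(SOURCE)

# Keep only accepted answers; rejected/ignored sent.teach answers are dropped
# so they can't skew training.
accepted = [eg for eg in examples if eg.get("answer") == "accept"]

# Write the filtered examples into a fresh dataset for training.
if CLEANED not in db:
    db.add_dataset(CLEANED)
db.add_examples(accepted, datasets=[CLEANED])

print(f"Kept {len(accepted)} of {len(examples)} examples")
```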
How many examples with labels do you have? I usually prefer to have at least ~500 examples in a validation set before I take performance numbers very seriously. The reason is that there's a risk of overfitting on a subset of the data that isn't representative of the task.
That said, if you have a relatively small dataset to train/score on, it's possible that the active learning approach (the one learning on newly labelled subsets) overfits a bit on those subsets.
My gut feeling is that this issue will go away once you have more labels. That's just a gut feeling, though; if the issue persists once you have a much larger labelled dataset, it would certainly be interesting to dig into further.
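For a quick way to answer the "how many labelled examples" question, a small sketch like this (dataset name hypothetical) prints the answer breakdown for a dataset:

```python
from collections import Counter

from prodigy.components.db import connect

db = connect()
# Hypothetical dataset name; replace with your own.
examples = db.get_dataset("job_postings_sents")

# Count accepted vs. rejected vs. ignored answers to see how many usable labels there are.
print(len(examples), Counter(eg.get("answer") for eg in examples))
```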