Can I use the model training data again as the source data for textcat.teach?

Hello!
I want to use textcat.teach to update a text classification model that does polarity checking. The model was trained on annotations created with textcat.manual.
So I have one question.
As the title says, are there any adverse effects of using the same .jsonl data that was used for training as the source data when improving the model with active learning in textcat.teach?
I'm going to use a .jsonl file without annotations as the source, not a Prodigy dataset.
What I'm worried about is that reusing the training data may leak the answers and have a negative impact on active learning.

Hi! If your goal is to update and improve the model you already have, updating it on the same data wouldn't really work, because there's nothing new for the model to learn from. If you use the same examples again during textcat.teach, the annotations you create in that process are basically a subset of what you've already collected manually.

If you have more raw unseen text, it's better to use that instead. textcat.teach will pre-select examples based on their scores and ideally guide you towards annotating the more relevant examples so you won't have to label everything.
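
If your new raw text might overlap with what you've already annotated, one option is to filter it first. Here's a minimal sketch in plain Python (not part of Prodigy itself). The helper names `normalize`, `read_jsonl`, `filter_unseen` and `write_jsonl` are made up for illustration, and it assumes every line in your .jsonl files is a JSON object with a "text" key:

```python
import json
from pathlib import Path


def normalize(text):
    # Collapse whitespace and lowercase so trivial variations count as duplicates.
    return " ".join(text.split()).lower()


def read_jsonl(path):
    # Yield one dict per non-empty line of a JSONL file.
    with Path(path).open(encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, records):
    # Write one JSON object per line.
    with Path(path).open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def filter_unseen(seen_records, candidate_records):
    # Keep only candidates whose text is not in the already-annotated data,
    # and drop repeats within the candidates themselves.
    seen = {normalize(record["text"]) for record in seen_records}
    unseen = []
    for record in candidate_records:
        key = normalize(record["text"])
        if key not in seen:
            seen.add(key)
            unseen.append(record)
    return unseen
```

You could then read your training file and your new file with `read_jsonl`, pass both to `filter_unseen`, save the result with `write_jsonl`, and use that file as the source for textcat.teach. Matching on normalized text is a simple choice that only catches exact or near-exact repeats, so adjust it if your data has other kinds of duplicates.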

Thank you for your answer! :smile:
I understand that the model can't learn anything new from the same data. I will work on updating the model with new data.
Prodigy is a great tool. I'm always looking forward to your updates.

Thanks, that's nice to hear! :blush: