I have text from research projects in different fields and I want to classify them by research field. I have around 50k projects and I have defined a label for 8k of them. For the rest, I would like to apply the model in order to assign them a label.
First of all, as I have limited training data, I used spacy pretrain on the text of the research projects so that the model is initialized via transfer learning.
After that, I added the labelled data to a Prodigy dataset with the same number of accept and reject examples (8k accepted, and the same 8k texts with a different label, rejected). I then used textcat.batch-train to train the model that will classify the research projects:
prodigy textcat.batch-train textcat_test_reject en_vectors_web_lg -t2v "./pretrained-model/model22.bin" --eval-split 0.2 --output /tmp/model
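For reference, the accept/reject pairs described above can be sketched roughly like this. It is only a minimal illustration, not my exact script: the function name build_examples and the two sample texts are made up, and I am assuming the usual Prodigy task fields text, label and answer.

```python
import json
import random


def build_examples(projects, labels, seed=0):
    """Turn (text, true_label) pairs into accept/reject annotation dicts.

    Each project yields one accepted example with its real label and one
    rejected example with a randomly chosen different label.
    """
    rng = random.Random(seed)
    examples = []
    for text, true_label in projects:
        examples.append({"text": text, "label": true_label, "answer": "accept"})
        # Pick any label other than the true one for the rejected copy.
        wrong_label = rng.choice([l for l in labels if l != true_label])
        examples.append({"text": text, "label": wrong_label, "answer": "reject"})
    return examples


# Hypothetical sample data; the real set has 8k labelled projects.
LABELS = ["ENERGY", "HEALTH"]
projects = [
    ("Grid-scale battery storage for wind farms", "ENERGY"),
    ("Randomized trial of a new vaccine", "HEALTH"),
]

examples = build_examples(projects, LABELS)
# One JSON object per line, which is the JSONL layout Prodigy reads.
jsonl = "\n".join(json.dumps(e) for e in examples)
```

With only two labels the rejected label is fully determined; with more labels it is sampled at random, so the seed keeps the output reproducible.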
The problem is that I think I didn't understand the textcat recipes properly. I thought that if I used textcat.teach with all 50k projects as the source, I would be able to accept or reject labels suggested for all the projects, and not just for those that already have a label (the ones in the dataset):
prodigy textcat.teach textcat_test_reject en_vectors_web_lg textcat_all_projects.jsonl --label ENERGY, HEALTH,...
Why can I only assign a label to the projects that are in the dataset and not to all projects in the source? I am confused, and I hope the question is not as confusing as I am confused.
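To make the goal concrete, this is roughly what I would like to end up with for the unlabelled projects. It is a sketch under assumptions: assign_labels and fake_scores are hypothetical names, and fake_scores only stands in for the trained model. With spaCy I assume the scoring function could be something like `lambda text: nlp(text).cats` after loading the model from /tmp/model.

```python
def assign_labels(texts, score_fn):
    """Give each text the label with the highest score.

    score_fn maps a text to a dict of {label: score}, which is the shape
    of doc.cats on a spaCy Doc processed by a text classifier.
    """
    results = []
    for text in texts:
        scores = score_fn(text)
        best = max(scores, key=scores.get)
        results.append({"text": text, "label": best, "score": scores[best]})
    return results


def fake_scores(text):
    # Stand-in for the trained model so the sketch runs on its own;
    # it scores by a single keyword and is NOT a real classifier.
    is_energy = "solar" in text.lower()
    return {"ENERGY": 0.9 if is_energy else 0.1,
            "HEALTH": 0.1 if is_energy else 0.9}


# Hypothetical unlabelled projects.
unlabelled = ["Solar panel efficiency study", "Hospital infection control"]
predictions = assign_labels(unlabelled, fake_scores)
```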