I want to train a binary classifier at the sentence level. My preprocessing consists of parsing the training texts into sentences, removing duplicate sentences with a pandas DataFrame, and saving the result to a CSV file of almost 500,000 lines.
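For reference, the deduplication step can be sketched like this (the column name `text` and the inline example data are assumptions; adjust them to your own files):

```python
import pandas as pd

# In practice you would load your own file, e.g.:
# df = pd.read_csv("sentences.csv")
# Here, a tiny inline stand-in with one exact duplicate:
df = pd.DataFrame({"text": [
    "The contract was signed.",
    "Delivery is due in May.",
    "The contract was signed.",  # exact duplicate
]})

# Drop exact duplicate sentences, keeping the first occurrence.
df = df.drop_duplicates(subset="text").reset_index(drop=True)
print(len(df))  # → 2

# df.to_csv("sentences_dedup.csv", index=False)
```

Note that `drop_duplicates` only catches exact string matches; near-duplicate sentences would need fuzzy matching on top of this.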
I created a seed JSONL file using terms.teach and loaded the CSV file with textcat.teach in order to annotate the sentences with the provided label. The seed file contains only around 20 single-token seeds.
My problem is that the type of sentence I'm labeling is very specific: it occurs only once or twice in each text file. I'm guessing that around 1% (possibly less) of my dataset will receive the accept label.
When Prodigy loads my CSV dataset, it makes suggestions based on the similarity between the seeds and the sentences, but it walks through the examples in the order of the original training files from which the sentences were extracted and segmented when the CSV was created.
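One workaround I'm considering for the ordering issue is shuffling the CSV once before loading it, so the stream no longer follows source-file order. A minimal sketch, assuming a `text` column (the inline data stands in for the real file):

```python
import pandas as pd

# In practice: df = pd.read_csv("sentences_dedup.csv")
# Tiny inline stand-in for the deduplicated CSV:
df = pd.DataFrame({"text": [f"sentence {i}" for i in range(10)]})

# Shuffle all rows reproducibly, then reset the index.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# df.to_csv("sentences_shuffled.csv", index=False)
```

This only randomizes the order; it does not change which sentences the similarity scoring will prefer.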
My questions are:
- Since it doesn't sort the whole CSV file to suggest sentences by similarity, I guess I'll end up annotating tens of thousands of sentences in order to get only a few hundred accepted ones. Is my guess right?
- If I end up with such an imbalanced dataset, what can I do after the annotation process to balance it and train a classification model that is as fair as possible?
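For the second question, one option I've seen is undersampling the majority (rejected) class after annotation. A hedged sketch, assuming the export has an `answer` column with `accept`/`reject` values (the inline frame and counts are made up for illustration):

```python
import pandas as pd

# Hypothetical annotation export: 2 accepted vs 8 rejected sentences.
df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(10)],
    "answer": ["accept"] * 2 + ["reject"] * 8,
})

accepted = df[df["answer"] == "accept"]
rejected = df[df["answer"] == "reject"]

# Undersample the majority class down to the minority count,
# then shuffle the combined frame.
rejected_down = rejected.sample(n=len(accepted), random_state=42)
balanced = pd.concat([accepted, rejected_down]).sample(frac=1, random_state=42)

print(balanced["answer"].value_counts().to_dict())
```

Alternatives that keep all the data, such as class weights in the training loss or oversampling the minority class, might waste fewer of the hard-won rejected examples.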