Sentence Classification

Hi there!

I want to train a binary classifier at the sentence level. To do this, my preprocessing consists of splitting the training texts into sentences, removing duplicate sentences with a pandas DataFrame, and saving the result to a CSV file with almost 500,000 lines.
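A minimal sketch of this kind of preprocessing, not the exact code used: the regex splitter, the `doc_id` and `text` column names, and the two sample texts are assumptions made up for illustration.

```python
import re
import pandas as pd

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def build_sentence_frame(texts):
    # One row per sentence, remembering which document it came from.
    rows = [
        {"doc_id": doc_id, "text": sentence}
        for doc_id, doc in enumerate(texts)
        for sentence in split_sentences(doc)
    ]
    df = pd.DataFrame(rows, columns=["doc_id", "text"])
    # Keep only the first occurrence of each distinct sentence.
    return df.drop_duplicates(subset="text").reset_index(drop=True)

# Made-up sample documents, for illustration only.
texts = [
    "O juiz deferiu o pedido. Intime-se.",
    "Intime-se. O pedido foi negado.",
]
df = build_sentence_frame(texts)

# Passing a file path to to_csv would write the file instead of returning a string.
csv_text = df.to_csv(index=False)
```

A real pipeline would likely need a proper sentence segmenter, since legal text is full of abbreviations that a splitter this simple would break on.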

I created a seed JSONL file using terms.teach; it contains only around 20 single-token seeds. I then loaded the CSV file with textcat.teach in order to annotate the sentences with the provided label.

My problem is that the type of sentence I’m labeling is very specific: it should occur only once or twice in every text file. My guess is that around 1% (possibly less) of my dataset will be accepted for the label.

When Prodigy loads my CSV dataset, it makes suggestions based on the similarity between the seeds and the sentences, but it goes through the sentences in the order of the training files from which they were extracted and segmented when the CSV file was created.

My questions are:

  1. Since it’s not sorting the whole CSV file to suggest sentences by similarity, I guess I’ll end up annotating tens of thousands of sentences in order to get only a few hundred accepted ones. Am I right?
  2. If I end up with such an imbalanced dataset, what can I do after the annotation process to balance it and train a classification model that is as fair as possible?

Thanks!

Could you use some sort of rule-based processing that discards irrelevant sentences with high confidence? Or perhaps a separate classifier that you train for this with scikit-learn? I think the model will really struggle to fit the problem as you've described it.

When the data set is very imbalanced, it takes the model a long time to learn things about the problem which might be very obvious to you. In these stages a rule-based classifier can be much more effective than a machine-learned solution. For instance, if you only have 2 positive examples, rules that you create from those examples will probably generalise better than a model trying to optimise on them, especially one designed for more general-purpose situations (as Prodigy is).
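As a rough sketch of what such a rule-based pre-filter could look like: the stems below are hypothetical placeholders, not a vetted list, and the idea is simply to drop sentences that contain none of them before annotating the rest.

```python
import re

# Hypothetical stems for illustration only; a real list would come from a domain expert.
DECISION_STEMS = ["defir", "defer", "acolh", "rejeit", "homolog"]

# Match any word starting with one of the stems, optionally prefixed by "in".
PATTERN = re.compile(
    r"\b(?:in)?(?:" + "|".join(DECISION_STEMS) + r")\w*",
    re.IGNORECASE,
)

def is_candidate(sentence):
    # True if the sentence contains at least one stem and is worth annotating.
    return PATTERN.search(sentence) is not None

def filter_candidates(sentences):
    # Keep only the sentences the rules cannot rule out.
    return [s for s in sentences if is_candidate(s)]

# Made-up sample sentences.
sentences = [
    "Defiro o pedido de liminar.",
    "Intime-se a parte autora.",
    "Indefiro o pedido.",
]
candidates = filter_candidates(sentences)
```

Stems that are too loose let irrelevant sentences through, which only costs annotation time; stems that are too strict silently drop real positives, so it is probably safer to err on the loose side.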

I see.
I think I might come up with some sort of rule-based solution.

I’m trying to identify sentences in judicial documents in which the judge makes some sort of decision. Judges might use many different verbs (mostly accept/defer and synonyms) to declare the decision, and most of the time they negate the same set of verbs to declare a denial.

Although there are not that many different verbs indicating these decisions, Portuguese has many different inflections for each word, and people in the legal domain still have a notorious preference for “speaking pompously”, using a very elaborate and, very often, confusing vocabulary. I’m even considering training word embeddings specifically on a large corpus of legal documents that we have here, to try to pick up these contexts more accurately.

I was trying to come up with many different examples that indicate a positive or a negative decision, in order to build a binary classifier for each of them. So I would end up with positive decisions, negative decisions, and sentences which are not decisions at all (which would be the case over 90% of the time).
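To make the three-way split concrete, here is a very rough sketch of how rules for it might look. The verb stems, the negation markers and the two-word window are all assumptions for illustration; real legal text would need far more coverage.

```python
import re

# Hypothetical verb stems; \w* absorbs the inflected endings.
VERB = r"(?:defir|defer|acolh)\w*"

# Negative: a negation marker up to two words before the verb, or the "in-" prefix.
NEGATED = re.compile(
    r"\b(?:não|nao|deixo de)\s+(?:\w+\s+){0,2}" + VERB + r"|\bin" + VERB,
    re.IGNORECASE,
)
# Positive: the verb appears and no negation was found.
AFFIRMED = re.compile(r"\b" + VERB, re.IGNORECASE)

def decision_label(sentence):
    # Check negation first, otherwise "não acolho" would count as positive.
    if NEGATED.search(sentence):
        return "negative"
    if AFFIRMED.search(sentence):
        return "positive"
    return "none"

# Made-up sample sentences.
samples = [
    "Defiro o pedido de liminar.",
    "Não acolho os embargos.",
    "Indefiro o pedido.",
    "O pedido não foi deferido.",
    "Intime-se a parte autora.",
]
labels = [decision_label(s) for s in samples]
```

A fixed window like this will miss negations that sit further from the verb, or that scope over a different clause, which is exactly the kind of variation I'm worried about.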

I still need a legal expert annotator to make the annotations for me, so that we can assemble as many different examples as possible of ways that judges can indicate their decisions. Do you still think that a rule-based approach might be a better fit? My concern is that we’re likely to have a few dozen possibilities for each binary classifier.

Thanks!