Finding rare positive examples for textcat

I’m making a classifier for a relatively rare category (less than 1% of documents). After using seed terms to find a batch of examples and doing an initial batch train of the model, I’ve been having a hard time finding enough positive examples in the stream once I switch to the model, even with prefer_high_scores. I’m guessing this has to do with the limit on the number of examples the model looks at in the stream at once. I hacked something together that scores an entire JSONL file (in my case, around 10,000 sentences) and writes the top n highest-scoring examples to a new JSONL for use in Prodigy.

Hopefully this helps someone else with a similar problem, or maybe someone can tell me a better way of doing this.

import plac
import spacy
import jsonlines
import operator
from tqdm import tqdm
from random import shuffle
import re

@plac.annotations(
    input_file=("File to get high-probability text from.", "option", "i", str),
    model=("Path to the Prodigy text classification model to use.", "option", "m", str),
    label=("Label to score against.", "option", "l", str),
    # type must be int, not str, or slicing by max_examples fails when -n is passed
    max_examples=("Max number of examples to export.", "option", "n", int))
def main(input_file, model, label, max_examples=200):
    """
    When you're looking for rare positive examples, run the model over your entire file and pull out the highest scoring n examples.
    """
    nlp = spacy.load(model)
    print("Calculating scores...")
    with jsonlines.open(input_file) as reader:
        docs = list(reader)
    # nlp.pipe batches the texts through the pipeline, which is much faster
    # than calling nlp() on each one; zip reassembles the scores with their
    # original task dicts
    texts = [obj['text'] for obj in docs]
    for obj, doc in zip(docs, tqdm(nlp.pipe(texts), total=len(docs))):
        obj['score'] = doc.cats[label]
    docs.sort(key=operator.itemgetter('score'), reverse=True)
    filtered_docs = docs[0:max_examples]
    # keep things interesting, don't go best to worst
    shuffle(filtered_docs)

    # raw string avoids an invalid escape warning; $ anchors to the extension
    outfile = re.sub(r"\.jsonl$", "_high_scores.jsonl", input_file)
    with jsonlines.open(outfile, mode='w') as writer:
        writer.write_all(filtered_docs)
    print("Wrote {0} tasks to {1}".format(len(filtered_docs), outfile))

if __name__ == "__main__":
    plac.call(main)
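
For reference, assuming the script above is saved as export_high_scores.py (the filename is arbitrary), you’d invoke it along these lines:

python export_high_scores.py -i sentences.jsonl -m /path/to/textcat-model -l MY_LABEL -n 200

This writes the 200 highest-scoring tasks to sentences_high_scores.jsonl, which you can then pass to textcat.teach (or whichever recipe you’re using) as the source.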

Thanks for the code! We really appreciate how much users are helping each other — it makes a big difference :smile:

The problem you’re seeing is definitely in the sorter, not the model itself. Prodigy’s textcat class is a pretty thin wrapper around the spaCy pipeline component.

I’m pretty sure the problem is that the default sorting algorithm tries to adapt to different scales of output scores. This way you don’t get stuck with no questions when the scores are all on the order of 10**-7. Similarly, if the scores are all above 0.99, we still want the sorter to discriminate between them.

The default sorter handles this by tracking a running mean and emitting examples whose scores are above it. This learns to ignore the absolute scale of the scores, which in your case isn’t what you want. You should therefore try passing algorithm="probability". This sorter emits an example if random.random() < score. If that isn’t quite what you want, you can also write your own sorting algorithm: it should take a sequence of (score, example) tuples and generate a sequence of example dicts, as in the sketch below.
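
For example, here’s a minimal sketch of a custom sorter along those lines (the function name and recipe wiring are illustrative, not part of Prodigy’s API):

import random

def prefer_high_probability(scored_stream):
    # scored_stream yields (score, example) tuples, as produced by the
    # model in a recipe. Emit each example with probability equal to its
    # score, so the absolute scale of the scores is respected instead of
    # being normalized away by a running mean.
    for score, example in scored_stream:
        if random.random() < score:
            yield example

In a recipe you’d then use it where you’d otherwise call prefer_high_scores, e.g. stream = prefer_high_probability(model(stream)), assuming model(stream) yields (score, example) pairs as in the built-in recipes.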
