Control whether to show results from PatternMatcher or Model

I am having problems with a highly imbalanced text classification exercise for sentences. The standard textcat.teach recipe doesn't produce any sentences belonging to the label in this binary classification problem.

To amend this, I have written a custom textcat recipe following the guidelines in the docs (step 5 in

My idea was to customise the combine_models() function to force it to suggest sentences from the PatternMatcher whenever the number of accepts is low, in order to keep the labelled dataset balanced (the pattern matcher results in a more than 90% accept rate).

I managed to get the following working. However, the result of the check_annotation_balance function is only updated once answers are saved to the database.

Question: Is there a way to access the controller to see the accept reject rates of the current session (i.e. those that have not been saved to the database)?

from collections import Counter

from toolz import interleave, partition_all
from prodigy.components.db import connect
from prodigy.util import log

def combine_models(one, two, batch_size=10):
    def check_annotation_balance():
        db = connect()
        answers = [row["answer"] for row in db.get_dataset("custom-textcat-climate")]
        count = Counter(answers)
        log(f"annotation balance: {count}")
        return count["accept"] - count["reject"]

    def predict(stream):
        for batch in partition_all(batch_size, stream):
            batch = list(batch)
            stream1 = one(batch)  # PatternMatcher suggestions
            stream2 = two(batch)  # model suggestions
            balance = check_annotation_balance()
            if balance - len(batch) < 0:
                # Too few accepts so far: show only the PatternMatcher results
                yield from stream1
            else:
                yield from interleave((stream1, stream2))

    def update(examples):
        loss = one.update(examples) + two.update(examples)
        return loss

    return predict, update
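For reference, a quick sanity check of the two toolz helpers the recipe relies on: partition_all chunks the incoming stream into batches, and interleave alternates between the two suggestion streams.

```python
from toolz import interleave, partition_all

# partition_all chunks a stream into tuples of at most n items
batches = list(partition_all(2, [1, 2, 3, 4, 5]))
print(batches)  # [(1, 2), (3, 4), (5,)]

# interleave alternates between the given streams until all are exhausted
mixed = list(interleave(([1, 3, 5], [2, 4])))
print(mixed)  # [1, 2, 3, 4, 5]
```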

Hi! The approach definitely makes sense. I think it mostly comes down to this:

Answers that haven't been submitted yet aren't available on the controller or the Python API layer at all, because they're only kept on the client. This allows you to undo in the UI without having to reconcile multiple duplicate answers on the back-end. Maybe you just want to experiment with a lower batch size, so you're receiving examples back from the app more quickly? If you're sending back batches of 3, you'll get the answers much sooner.
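For example, a sketch of lowering the batch size: "batch_size" can be set in your prodigy.json, or in the "config" dict returned by a custom recipe (the other keys here follow the components dict from the thread; the exact values are illustrative).

```python
# Sketch of the components dict a custom recipe could return; setting
# "batch_size" in "config" here overrides prodigy.json for this session.
components = {
    "dataset": "custom-textcat-climate",
    "stream": stream,              # the predict(...) generator
    "update": update,              # called with each submitted batch of answers
    "view_id": "classification",
    "config": {"batch_size": 3},   # send answers back after every 3 annotations
}
```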

You could also just keep a record of the answer counts in the update function, which receives all answers you get back from the web app. That saves you the round trip via the database, which can easily get more expensive as your dataset grows.

Thanks! Could you give an example of how I would go about keeping a record in the update function? I am not sure I see when this function is called. I tried placing a simple log statement inside it, but nothing is being output... Also, if I keep track of this in the update function, how would I best communicate with the predict function in order to steer which stream to display?

Sure! I was thinking you could keep the counter as a global/nonlocal variable, update it in the update callback and then refer to it in the predict function. As you receive answers from the web app, the counts are updated, and since the stream is a generator and consumed in batches, its behaviour can respond to the changing counts:

count = Counter()  # shared between the two callbacks

def predict(stream):
    for batch in partition_all(batch_size, stream):
        # count is mutated in update() as answers come back, so the
        # balance can change between batches
        balance = count["accept"] - count["reject"]
        # ...

def update(answers):
    for eg in answers:
        count[eg["answer"]] += 1
    # ...

Keep in mind that the update callback is only executed once a full batch of size batch_size (defined in your prodigy.json or recipe config) is available in the app, on top of the answers that are kept on the client and displayed in the history in the sidebar (those aren't submitted to the back-end yet, so you can easily undo). So with a batch size of 10, you'll need to annotate 20 examples before the first batch of 10 is sent back and your update callback runs.

Also make sure that you pass it in as the "update" returned by your recipe.
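Putting it together, here's a minimal self-contained sketch of the shared-counter pattern, independent of Prodigy: the hand-built list of fake answers just simulates what the web app would send back to update, and the example shows that the balance read by predict changes between batches.

```python
from collections import Counter

def make_callbacks():
    count = Counter()  # shared state, closed over by both callbacks

    def predict(stream):
        for text in stream:
            # Read the balance at yield time; update() may have changed it
            # since the previous example was served.
            balance = count["accept"] - count["reject"]
            yield {"text": text, "balance": balance}

    def update(answers):
        for eg in answers:
            count[eg["answer"]] += 1

    return predict, update

predict, update = make_callbacks()
stream = predict(iter(["sent one", "sent two"]))

first = next(stream)["balance"]   # no answers received yet
update([{"answer": "accept"}, {"answer": "accept"}, {"answer": "reject"}])
second = next(stream)["balance"]  # reflects the submitted batch
print(first, second)  # 0 1
```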
