Control whether to show results from PatternMatcher or Model

I am having problems with a highly imbalanced text classification exercise for sentences. The standard textcat.teach doesn't produce any sentences belonging to the label in this binary classification problem.

To address this, I have written a custom textcat recipe following the guidelines in the docs (step 5 in https://prodi.gy/docs/text-classification#active-learning).

My idea was to customise the combine_models() function to force it to suggest sentences using the PatternMatcher if the number of accepts is low, in order to keep the labeled dataset balanced (the pattern matcher results in a more than 90% accept rate).

I managed to do the following, which works fine. However, the result of the check_annotation_balance function is only updated once answers are saved to the database.

Question: Is there a way to access the controller to see the accept reject rates of the current session (i.e. those that have not been saved to the database)?

from collections import Counter
from prodigy.components.db import connect
from prodigy.util import log
from toolz.itertoolz import partition_all, interleave

def combine_models(one, two, batch_size=32):
    def check_annotation_balance():
        # Only reflects answers already saved to the database
        db = connect()
        answers = [row["answer"] for row in db.get_dataset("custom-textcat-climate")]
        count = Counter(answers)
        log(f"annotation balance: {count}")
        return count["accept"] - count["reject"]

    def predict(stream):
        for batch in partition_all(batch_size, stream):
            batch = list(batch)
            stream1 = one(batch)
            stream2 = two(batch)
            balance = check_annotation_balance()
            if balance - len(batch) < 0:
                # Too few accepts so far: only suggest pattern matches
                yield from stream1
            else:
                yield from interleave((stream1, stream2))

    def update(examples):
        loss = one.update(examples) + two.update(examples)
        return loss

    return predict, update

Hi! The approach definitely makes sense. I think it mostly comes down to this:

Answers that are not submitted yet won't be available on the controller or the Python API layer at all, because they're only kept on the client. This allows you to undo in the UI without having to reconcile multiple duplicate answers on the back-end. Maybe you just want to experiment with a lower batch size, so you're receiving answers back from the app more quickly? With batches of 3, for instance, the first answers come back much sooner.
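To illustrate, here's a minimal recipe skeleton that sets a smaller batch size via the "config" dict returned by the recipe (the recipe name, label and placeholder stream are just for illustration):

import prodigy

@prodigy.recipe("custom-textcat")
def custom_textcat(dataset):
    # Placeholder stream; in your recipe this would come from
    # the predict function returned by combine_models
    stream = ({"text": f"example sentence {i}", "label": "CLIMATE"} for i in range(100))
    return {
        "dataset": dataset,
        "view_id": "classification",
        "stream": stream,
        "config": {"batch_size": 3},  # smaller batches = answers come back sooner
    }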

You could also just keep a record of the answer counts in the update function, which will receive all answers you get back from the web app. That saves you the roundtrip via the database, which can easily get more expensive as your dataset grows.

Thanks! Could you give an example of how I would go about keeping a record in the update? I am not sure I see when this function is being called. I tried placing a simple log call inside it, but nothing is being output... Also, if I keep track of this in the update function, how would I best communicate with the predict function in order to steer which stream to display?

Sure! I was thinking you could keep the counter as a shared variable, either global or defined in the enclosing recipe function, update it in the update callback and then refer to it in the predict function. As you receive answers from the web app, the counts are updated, and since the stream is a generator and consumed in batches, its behaviour can respond to the changing counts:

count = Counter()

def predict(stream):
    for batch in partition_all(batch_size, stream):
        # Reading the shared Counter needs no global/nonlocal declaration,
        # since the name is never rebound
        balance = count["accept"] - count["reject"]
        # ...

def update(answers):
    for eg in answers:
        # Mutating the Counter in place also needs no declaration
        count[eg["answer"]] += 1
    # ...
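Putting this together with your combine_models, the counter can live in the enclosing function so that both closures share it. A rough sketch (keeping your pattern matcher as one, so pattern matches are favoured while accepts lag behind):

from collections import Counter
from toolz.itertoolz import partition_all, interleave

def combine_models(one, two, batch_size=32):
    count = Counter()  # shared by predict and update through the closure

    def predict(stream):
        for batch in partition_all(batch_size, stream):
            batch = list(batch)
            balance = count["accept"] - count["reject"]
            if balance - len(batch) < 0:
                yield from one(batch)  # favour pattern matches while accepts lag
            else:
                yield from interleave((one(batch), two(batch)))

    def update(answers):
        for eg in answers:
            count[eg["answer"]] += 1  # runs each time a batch is sent back
        return one.update(answers) + two.update(answers)

    return predict, update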

Keep in mind that the update callback is only executed once a full batch of size batch_size (defined in your prodigy.json or recipe config) has been sent back from the app. On top of that, one batch of answers is kept on the client and displayed in the history in the sidebar (those aren't submitted to the back-end yet, so you can easily undo). So with a batch size of 10, you'll need to annotate 20 examples before the first batch of 10 is sent back and your update callback runs.

Also make sure that you pass it in as the "update" callback returned by your recipe.
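For example, inside the recipe the wiring could look like this (pattern_matcher, model and stream are stand-ins for whatever your recipe actually builds):

predict, update = combine_models(pattern_matcher, model)
return {
    "dataset": dataset,
    "view_id": "classification",
    "stream": predict(stream),
    "update": update,  # without this, the callback never runs and the counts stay at zero
}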
