combine_models - the effect of batch_size?

Hello team!

I have been trying to get the prefer_high_scores sorter to look over a larger window of examples, so that the bootstrapped (although relatively rare) texts are presented first. I came across this thread, where
the function combine_models is discussed.

Would setting a higher batch_size on combine_models achieve just that? I have found nothing about it in the docs, so maybe it has already been deprecated. The goal is simply to get the examples containing the patterns to come up first (currently I have to run the rule-based matcher in a separate script to pre-assign the scores, which is not very convenient).
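For reference, here is roughly what my separate pre-scoring script looks like (the file names and score values are just placeholders for my actual setup):

```python
import spacy
import srsly
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Assuming token-based patterns in Prodigy's JSONL format, e.g.
# {"label": "LABEL", "pattern": [{"lower": "example"}]}
for entry in srsly.read_jsonl("my_patterns.jsonl"):
    matcher.add(entry["label"], [entry["pattern"]])

def pre_score(stream):
    # Give pattern matches a high score so that prefer_high_scores
    # surfaces them first
    for eg in stream:
        doc = nlp(eg["text"])
        eg["score"] = 0.9 if matcher(doc) else 0.1
        yield eg

srsly.write_jsonl("pre_scored.jsonl", pre_score(srsly.read_jsonl("raw_texts.jsonl")))
```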

Thank you in advance!
Best wishes,
Jan

Hi! There are two batch sizes here: first, the batch size used to partition the two generators and interleave them, and second, the batch size that Prodigy uses to divide the final stream into batches.
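To make the first one concrete, the idea is roughly this (a simplified sketch, not Prodigy's actual source):

```python
from itertools import islice, tee

def combined_predict(model_one, model_two, batch_size=32):
    # Simplified sketch: duplicate the stream, let each model score its
    # copy, then interleave the two (score, example) generators batch-wise.
    def predict(stream):
        stream_one, stream_two = tee(stream)
        generators = [model_one(stream_one), model_two(stream_two)]
        while generators:
            for gen in list(generators):
                batch = list(islice(gen, batch_size))
                if not batch:
                    generators.remove(gen)
                    continue
                yield from batch  # (score, example) tuples
    return predict
```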

The batch_size on combine_models is less relevant here and is mostly used for efficiency. In the end, the predict function still just yields (score, example) tuples. The batch_size setting in Prodigy is what decides how many examples are fetched from the stream at once, how many are sent to the web app, and how many are sent back to the server.
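For example, a custom recipe along the lines of textcat.teach that combines a text classifier with a pattern matcher and sorts with prefer_high_scores could look like this (a rough sketch based on the documented API, so details may differ slightly between versions; the recipe name is a placeholder):

```python
import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_high_scores
from prodigy.models.matcher import PatternMatcher
from prodigy.models.textcat import TextClassifier
from prodigy.util import combine_models

@prodigy.recipe("textcat.teach-high-scores")
def teach_high_scores(dataset, spacy_model, source, label, patterns):
    nlp = spacy.load(spacy_model)
    model = TextClassifier(nlp, label.split(","))
    matcher = PatternMatcher(nlp).from_disk(patterns)
    # predict yields (score, example) tuples produced by both models
    predict, update = combine_models(model, matcher)
    stream = prefer_high_scores(predict(JSONL(source)))
    return {
        "dataset": dataset,
        "stream": stream,
        "update": update,
        "view_id": "classification",
        # this batch_size controls how many examples are fetched from
        # the stream and sent to the web app at once
        "config": {"batch_size": 50},
    }
```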

See my reply here for more ideas on how you could solve your problem.

Oh, I see!

So far, I have compared the bootstrapped textcat.teach stream with the textcat.manual stream, and since the patterns were not very common, the streams were almost identical. I will try using a higher batch_size setting.

Thank you so much!
