Custom model Requirements

yusun · June 20, 2018, 7:58pm

I would like to use my own custom model (not a spacy model. e.g. pytorch, tensorflow, keras) with prodigy interface in active learning. What are the requirements of the model and how to integrate into custom recipe?

ines · June 21, 2018, 8:31am

To make any model integrate with Prodigy’s active learning workflow, you mainly need to expose two functions:

a predict function that takes an iterable stream of examples in Prodigy’s JSON format, scores them and yields (score, example) tuples
an update callback that takes a list of annotated examples and updates the model accordingly

Here’s a pseudocode example of how this could look in a custom text classification recipe. How you implement the individual components of course depends on the specifics of your model.

import copy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe('custom')
def custom_recipe(dataset, source):
    stream = JSONL(source)
    model = load_your_model()

    def predict(stream):
        for eg in stream:
            predictions = get_predictions_from_model(eg)
            for label, score in predictions:
                example = copy.deepcopy(eg)
                example['label'] = label
                yield (score, example)

    def update(answers):
        for eg in answers:
            if eg['answer'] == 'accept':
                update_model_with_accept(eg)
            elif eg['answer'] == 'reject':
                update_model_with_reject(eg)
        loss = get_loss()
        return loss

    return {
        'dataset': dataset,
        'view_id': 'classification',
        'stream': prefer_uncertain(predict(stream)),
        'update': update
    }

You can also find more details on the expected formats and component APIs in your PRODIGY_README.html or in the custom recipes workflow.
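
To give a rough intuition for what sorting by uncertainty means in the recipe above, here is a minimal sketch. It is not Prodigy's implementation of `prefer_uncertain`: the helper names `uncertainty` and `prefer_uncertain_sketch` and the fixed `threshold` are assumptions made for illustration. The only idea it shows is that scores near 0.5 are treated as the most uncertain, and that the sorter consumes `(score, example)` tuples and yields examples.

```python
def uncertainty(score):
    """Map a probability in [0, 1] to an uncertainty in [0, 1].

    A score of 0.5 is maximally uncertain (1.0); 0.0 and 1.0 are certain (0.0).
    """
    return 1.0 - abs(score - 0.5) * 2.0


def prefer_uncertain_sketch(scored_stream, threshold=0.5):
    """Hypothetical helper: yield only examples that are uncertain enough.

    The fixed threshold is an assumption for this sketch, not Prodigy's logic.
    """
    for score, example in scored_stream:
        if uncertainty(score) >= threshold:
            yield example


# (score, example) tuples, as yielded by the predict function in the recipe.
scored = [
    (0.98, {"text": "Loved it", "label": "POSITIVE"}),
    (0.52, {"text": "It was a film", "label": "POSITIVE"}),
    (0.05, {"text": "Terrible", "label": "POSITIVE"}),
    (0.40, {"text": "Not bad, not great", "label": "POSITIVE"}),
]

selected = list(prefer_uncertain_sketch(scored))
print([eg["text"] for eg in selected])
```

With a filter like this, a model whose scores are all close to 0 or 1 would send almost nothing out for annotation, while a model whose scores hover around 0.5 would send almost everything, so the spread of the scores matters as much as their accuracy.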

yusun · July 23, 2018, 3:28pm

Hi,

Thanks for your instructions. I used this method to plug in my custom PyTorch model, whose loss converges when I do batch training. But when I use it to drive annotation with active learning, the result is close to, or even worse than, training on samples chosen at random from the whole dataset. The experiment I ran is predicting whether the sentiment of an IMDB review is positive or negative. The experimental group uses the annotations generated by Prodigy’s active learning process, while the baseline group uses examples drawn in random order from the whole dataset. For both groups, I trained on each successive batch of data, starting from the model trained on the previous samples.

I’m very confused by the result: it seems that an algorithm that works well for supervised learning does not work well for active learning. Is there any requirement on the custom model’s output? Currently my score is the probability that the review is positive. Could you explain the logic of the `prefer_uncertain` function? I think it’s important to understand it well in order to build a more suitable model.

honnibal · March 7, 2019, 10:15am

Just so I’m understanding your experiment correctly: what’s step in the graph above? Is it the number of data samples? If so, is this training from only one epoch? What happens if you train for multiple epochs?

My more general answer: it’s true that active learning isn’t a good fit for every problem. The IMDB sentiment corpus was designed to investigate particular text classification techniques, so the dataset has several characteristics that make models converge well on the data. Specifically:

The texts are quite long.
The texts are of similarish length.
Exactly two classes.
Perfect class balance.
Few boundary cases.
Low annotation noise.

These problem characteristics make the dataset a relatively bad example for active learning, I think. The class composition is especially relevant. In many datasets you’ll have a lot of labels, with one label making up a lot of the examples, some classes that are rare but easy to predict (because the examples are all very similar), and some other classes that are easily confused.

Another thing to consider is that when doing active learning, it can matter a lot how fast your model responds to new examples. The default text classification model for Prodigy actually has a few features designed to make it learn better under an active learning regime. The most important feature is that it’s an ensemble of a unigram bag-of-words model and a CNN. During the first 10-20 weight updates, the CNN is still performing at close to chance, but the unigram bag-of-words model can already have learned a lot, because it starts off with such a useful inductive bias about the problem.

So, it’s possible that your model architecture learns a bit too slowly, and that’s one reason why your active learning might not perform well. But it might not be the decisive reason — I do think IMDB is a very tough example for active learning, so I wouldn’t be surprised if Prodigy’s default configuration actually doesn’t beat the baseline on it either. I haven’t run that experiment; I’d be interested to find out the result.

akshitasood63 · March 20, 2019, 10:05am

Hi @ines @honnibal
I wanted to know: if there are multiple predictions for one example, will we get that example multiple times for annotation? If not, how will the score be calculated for that particular example?

Also, in my use case, I am updating the model in the loop after every ‘n’ sentences. I want the predict method to be called whenever my model is updated. Is there any way to trigger the predict method externally?
From what I understood from the documentation, the stream should be updated after every chunk of the given batch_size.

Thanks

honnibal · March 22, 2019, 2:55pm

Hi @akshitasood63,

If you have a custom training loop, you can control the flow of questions that Prodigy asks you exactly how you want them. Specifically, your recipe just needs to return a dictionary, and one of the items will be the "stream". This can be a function implementing a generator, and you can yield out whatever examples you want from it. So, if you have a model that predicts scores for multiple classes, and you want to ask a different question for each class, you can definitely do that. You would just yield multiple tasks from your generator for each item in your input.
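
A minimal sketch of that pattern, yielding one task per label for each incoming example. The function `predict_scores` is a hypothetical stand-in for a real multi-class model, the label names are made up, and storing the score under `"meta"` is an assumption about how you might want to keep it alongside the task.

```python
import copy


def predict_scores(text):
    """Hypothetical stand-in for a real multi-class model."""
    return {"SPORTS": 0.7, "POLITICS": 0.2, "TECH": 0.1}


def one_task_per_label(stream):
    """Yield a separate annotation task for every (example, label) pair."""
    for eg in stream:
        for label, score in predict_scores(eg["text"]).items():
            # Copy the example so each task carries its own label.
            task = copy.deepcopy(eg)
            task["label"] = label
            task.setdefault("meta", {})["score"] = score
            yield task


examples = [{"text": "The match went to extra time."}]
tasks = list(one_task_per_label(examples))
for task in tasks:
    print(task["label"], task["meta"]["score"])
```

Each input text then shows up once per label, and each of those questions can be accepted or rejected independently.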

You can also control the logic that gets executed on update inside the "update" callback. If you wish to make predictions, you can run the model then. Note that if you just want the updates to be reflected in the model that’s running in the questions loop, you don’t normally need to do anything special. The update callback can just update the model in place, and then as the data is streaming through your model, your model will be using the updated weights.
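
The in-place update can be illustrated without Prodigy at all. The sketch below uses a hypothetical `ToyModel` whose only "weight" is a single number. Because the scoring generator is lazy, examples pulled from it after `update` has run are scored with the new state.

```python
class ToyModel:
    """Hypothetical stand-in for a real model; its only 'weight' is a bias."""

    def __init__(self):
        self.bias = 0.5

    def predict(self, eg):
        # A real model would look at eg["text"]; this one ignores it.
        return self.bias

    def update(self, answers):
        # Mutate the model in place, as an "update" callback would.
        for eg in answers:
            if eg["answer"] == "accept":
                self.bias = min(1.0, self.bias + 0.1)
            elif eg["answer"] == "reject":
                self.bias = max(0.0, self.bias - 0.1)


def score_stream(model, examples):
    """Lazily score examples; each one is scored when it is requested."""
    for eg in examples:
        yield (model.predict(eg), eg)


model = ToyModel()
examples = [{"text": "example {}".format(i)} for i in range(4)]
scored = score_stream(model, examples)

score_before, _ = next(scored)  # scored with the initial weights
model.update([{"text": "example 0", "answer": "accept"}])
score_after, _ = next(scored)   # scored with the updated weights
print(score_before, score_after)
```

One caveat, stated as an assumption about typical batching behaviour rather than a documented guarantee: examples that were already pulled from the generator before the update ran keep the scores they were given at that time, so the effect of an update only becomes visible in later batches.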

akshitasood63 · March 25, 2019, 5:17am

Thanks @honnibal
I now have a pretty clear idea of the multi-class problem.
Now I want the predictions from the updated model to be reflected in the questions that are fed to the Prodigy UI, so I was editing the stream in the predict function and then yielding it. It seems the predict function is not called for every batch of questions. Can you give me some insight into how it works?

akshitasood63 · March 25, 2019, 7:07am

So, should I yield the updated stream in the update function instead of the predict function?

ines · March 25, 2019, 10:53am

No, the data should always be yielded out in the generator you pass in as the "stream". However, since it’s a generator, it can respond to state changes – for example, if you update your model with each batch of answers you receive, that model will score the stream differently when you predict the incoming new examples. So what’s sent out for annotation will change as the model changes.

Here’s an example that shows this idea with a “dummy model”:

https://github.com/explosion/prodigy-recipes/blob/master/textcat/textcat_custom_model.py

# coding: utf8
from __future__ import unicode_literals

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.util import split_string
import random


class DummyModel(object):
    # This is a dummy model to help illustrate how to use Prodigy with a model
    # in the loop. It currently "predicts" random numbers – but you can swap
    # it out for any model of your choice, for example a text classification
    # model implementation using PyTorch, TensorFlow or scikit-learn.

    def __init__(self, labels=None):
        # The model can keep arbitrary state – let's use a simple random float
        # to represent the current weights
        self.weights = random.random()

This file has been truncated.
