Simulating a Prodigy user

This may be multiple questions in one

Hello, I'm trying to run a grid search on an active learning pipeline, and I want to do this automatically, without actually having a human in the loop. I need to somehow simulate Prodigy and connect to its database without launching the server.

Setup

Version          1.10.8
Location         C:\Users\fhijazi\AppData\Roaming\Python\Python37\site-packages\prodigy
Prodigy Home     C:\Users\fhijazi\.prodigy
Platform         Windows-10-10.0.19041-SP0
Python Version   3.7.13

My active learning setup follows the instructions from the man himself (Robert Munro), using his PyTorch active learning repository, specifically the file active_learning_basics.py in rmunro/pytorch_active_learning on GitHub.

Now, with active learning there are many parameters, like which type of uncertainty sampling to use, etc. The trivial way to test which hyperparameter setting is better is to manually run Prodigy with the model in the loop, label, and repeat that for every single parameter.

However, I already have the full dataset labeled, so I'd like to automate this without the Prodigy server. That means I need to simulate Prodigy by accessing the stream and the update functions directly.
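
In other words, I'm imagining something like the loop below, where simulate_answer and update are just placeholders for my own annotator simulation and the recipe's update callback:

def run_simulation(stream, update, simulate_answer, batch_size=10):
    """Drive a recipe's stream and update callback without the Prodigy server."""
    batch = []
    for eg in stream:
        eg["answer"] = simulate_answer(eg)  # pretend a human accepted/rejected the task
        batch.append(eg)
        if len(batch) == batch_size:
            update(batch)                   # what the server would normally do with answered tasks
            batch = []
    if batch:
        update(batch)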

The issue I'm having is that I need to somehow connect to the database.

What I need

Here's the list of what I need to do:

  • get access to the current annotations
  • get all previous annotations (I can do this by running db-out and then parsing the output in Python, but I'm sure there's a better way)
  • write the current session's annotations to the Prodigy DB (only needed for the simulation, to mimic a user labeling)

Current solution

I'm currently using connect() and db.get_dataset(dataset).
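
Roughly like this (the dataset name is just a placeholder):

from prodigy.components.db import connect

db = connect()                             # uses the settings from prodigy.json
data_prev = db.get_dataset("my_dataset")   # list of task dicts with all saved annotations
print(len(data_prev), "previous annotations")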

I implemented it all as a generator that wraps around the stream iterator, and then later I use stream = al_generator(stream).

Code for al_generator

def al_generator(stream):
    data_prev = db.get_dataset(dataset)  # all annotations from previous sessions
    try:
        while True:
            training_data = data_prev + db.get_sessions_examples(db.sessions)
            training_count = len(training_data)
            # make sure we have enough eval data
            if len(data_val) < minimum_evaluation_items:
                #Keep adding to evaluation data first
                print("Creating evaluation data:\n")

                needed = minimum_evaluation_items - len(data_val)
                print(str(needed) + " more annotations needed")
                eg = yield from iter_limited(itershuffle(stream), needed)  # get annotations

            if training_count < minimum_training_items:
                # lets create our first training data! 
                print("Creating initial training data:\n")

                needed = minimum_training_items - training_count
                print(str(needed) + " more annotations needed")
                eg = yield from iter_limited(itershuffle(stream), needed)  # get annotations
            else:
                print("Sampling via Active Learning:\n")
                #TODO: remove this 200
                data = list(iter_limited(stream, select_per_epoch))
                # ...

                eg = yield from iter_limited(sampled_data)  # get annotations
    except StopIteration:
        pass

Currently it seems like I can use db.get_sessions_examples(db.sessions) to access the current session's answers, and db.add_examples([example], [dataset]) to write to the DB. However, I'm not sure how I should be generating the _task_hash and _input_hash.
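
I'm guessing set_hashes is meant for this, so I'm imagining something like the following (the task and dataset name are just made-up examples), but I'm not sure if that's the intended way:

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()
eg = {"text": "an example sentence", "label": "POSITIVE", "answer": "accept"}
eg = set_hashes(eg)                             # should add _input_hash and _task_hash
db.add_examples([eg], datasets=["my_dataset"])  # dataset is assumed to already exist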

I may be overcomplicating things, but the reason the setup is complex is that Robert's code assumes the code is blocking, so the entire pipeline just stops until the user labels. He also keeps the data in global variables, which makes things easier to code in his case.

Hi @FarisHijazi!

Thanks for your message and welcome to the Prodigy community :wave:

Interesting project!

Curious: is your goal to run experiments to determine the best active learning strategies for improved accuracy? Interesting stuff! I haven't seen Robert's code, but I'd be interested to learn more. I can get back to you later.

One interesting thing for "simulating" active learning is to also add noise to the annotations (i.e., purposely make x% of the annotations incorrect). If you run a simulation that assumes the annotator is correct every time, it doesn't reflect the reality that annotators make mistakes. I've devised AL experiments in the past and found that an important "hyperparameter" is the assumed accuracy of the annotators. Just another factor to consider in your experiments.
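
Something like this sketch is the general idea; the label names and error rate are placeholders you'd adapt to your task:

import random

def simulate_annotator(gold_label, error_rate=0.1, labels=("POSITIVE", "NEGATIVE")):
    """Return the gold label most of the time, but a wrong one with some probability."""
    if random.random() < error_rate:
        return random.choice([label for label in labels if label != gold_label])
    return gold_label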

Have you heard/used Prodigy's entry points?

Entry points let you expose parts of a Python package you write to other Python packages. This lets one application easily customize the behavior of another by exposing an entry point in its setup.py or setup.cfg. For a quick and fun intro to entry points in Python, check out this excellent blog post. Prodigy can load custom functions from several different entry points, for example custom recipe functions. To see this in action, check out the sense2vec package, which provides several custom Prodigy recipes. The recipes are registered automatically if you install the package in the same environment as Prodigy. The following entry point groups are supported:

  • prodigy_recipes: Entry points for recipe functions.
  • prodigy_db: Entry points for custom Database classes.
  • prodigy_loaders: Entry points for custom loader functions.

I haven't used these yet but they may do the trick. Here's where they were used in sense2vec.
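
For example, a package could register a custom recipe roughly like this in its setup.py; all the names here are just placeholders:

from setuptools import setup

setup(
    name="my_al_experiments",
    packages=["my_al_experiments"],
    entry_points={
        "prodigy_recipes": [
            # "<recipe name> = <module>:<recipe function>"
            "al.simulate = my_al_experiments.recipes:al_simulate"
        ]
    },
)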

Let me think more in general -- I may have some suggestions.

In the meantime, I found a relevant post (which you may have read already):


Yes, exactly: finding the best strategy. Instead of having humans label while I try the different configs with them, I just run all the configs and simulate a human, since I already have the labels.
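
Since the full dataset is already labeled, the "human" can basically just be a lookup into the gold data, something like:

def make_simulated_human(gold_data):
    """gold_data: the fully labeled examples, e.g. dicts with "text" and "label" keys."""
    gold_lookup = {eg["text"]: eg["label"] for eg in gold_data}

    def simulated_human(task):
        task["label"] = gold_lookup[task["text"]]  # answer with the known gold label
        task["answer"] = "accept"
        return task

    return simulated_human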

Have you heard/used Prodigy's entry points?

Well, I know about command line entry points, but what I need is something more like an SDK, something that can emulate the browser sending data, except that I don't want the server to be running when I do this.

To be honest, there is a simpler solution: I could simulate the AL pipeline WITHOUT any Prodigy or DB components, and then, once I find the best parameters, just set those values in the Prodigy version of the code. But this would be hard to maintain: I'd have to maintain two versions of the code, and any change would have to be reflected in both.

Currently I'm using the DB's connect(), and it gives me a way to get the past data from previous sessions.

Now I'm working on submitting samples to the DB.

Will update you on what happens.