This may be multiple questions in one.

Hello, I'm trying to run a grid search over an active learning pipeline, and I want to do this automatically, without an actual human in the loop. To do that, I need to somehow simulate Prodigy and connect to its database without launching the server.
Version: 1.10.8
Location: C:\Users\fhijazi\AppData\Roaming\Python\Python37\site-packages\prodigy
Prodigy Home: C:\Users\fhijazi\.prodigy
Platform: Windows-10-10.0.19041-SP0
Python Version: 3.7.13
My active learning setup follows the instructions from the man himself (Robert Munro), using his PyTorch active learning repository, specifically this file: pytorch_active_learning/active_learning_basics.py at master · rmunro/pytorch_active_learning · GitHub
Now, with active learning there are many parameters, like which type of uncertainty sampling to use, and so on. The trivial way to test which hyperparameters are better is to manually run Prodigy with the model in the loop and label, and to do this for every single parameter setting.
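To make that concrete, the kind of grid I want to search looks roughly like this (the names and values here are just illustrative, mirroring the uncertainty sampling strategies and thresholds in Robert's code, not anything from Prodigy itself):

```python
# Hypothetical search grid; none of these names come from Prodigy
param_grid = {
    "sampling_strategy": ["least_confidence", "margin_confidence", "entropy"],
    "select_per_epoch": [100, 200],
    "minimum_training_items": [100, 400],
}
```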
However, I already have the full dataset labeled, so I'd like to automate this without the Prodigy server. This means I will need to simulate Prodigy by accessing the stream and the update functions.
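As far as I understand, a recipe is just a function that returns its components as a dict, so in principle I could call my recipe directly and drive the stream/update pair myself, without the server. A rough sketch of that idea (my_recipe and lookup_gold are hypothetical placeholders for my own recipe and for looking up an answer in the already-labeled dataset):

```python
# Sketch: drive a recipe without the server. Recipes return their components
# as a plain dict; "my_recipe" and "lookup_gold" are hypothetical names.
components = my_recipe("sim_dataset", "data.jsonl")
stream = components["stream"]      # what the web app would normally consume
update = components.get("update")  # what the server would call with answers

for eg in stream:
    eg["answer"] = lookup_gold(eg)  # simulate the human clicking accept/reject
    if update:
        update([eg])                # let the model in the loop update
```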
The issue I'm having is that I need to somehow connect to the database.
Here's the list of what I need to do:

- get access to the current session's annotations
- get all previous annotations (I can do this by running db-out and then parsing the output string in Python, but I'm sure there's a better way; see the sketch right after this list)
- write the current session's annotations to the Prodigy DB (only needed to simulate a user labeling)
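From what I can tell from the database API docs, connect(), get_dataset() and add_examples() should cover the last two items without parsing db-out; a minimal sketch (the dataset names are placeholders):

```python
from prodigy.components.db import connect

db = connect()  # uses the settings from prodigy.json; no server needed

# all previous annotations, instead of shelling out to db-out
prev_annotations = db.get_dataset("my_dataset")  # list of example dicts

# write simulated annotations back, the way a real session would
answered_examples = [{"text": "great product", "label": "POSITIVE", "answer": "accept"}]
if "sim_session" not in db.datasets:
    db.add_dataset("sim_session")
db.add_examples(answered_examples, datasets=["sim_session"])
```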
I implemented it all as a generator that wraps around the stream iterator, and later use it as stream = al_generator(stream). Here's what I'm currently using:
```python
def al_generator(stream):
    data_prev = ...  # somehow access all annotations from previous sessions
    try:
        while True:
            training_data = data_prev + db.get_sessions_examples(db.sessions)
            training_count = len(training_data)
            # make sure we have enough eval data
            if len(data_val) < minimum_evaluation_items:
                # keep adding to evaluation data first
                print("Creating evaluation data:\n")
                needed = minimum_evaluation_items - len(data_val)
                print(str(needed) + " more annotations needed")
                eg = yield from iter_limited(itershuffle(stream), needed)  # get annotations
            if training_count < minimum_training_items:
                # let's create our first training data!
                print("Creating initial training data:\n")
                needed = minimum_training_items - training_count
                print(str(needed) + " more annotations needed")
                eg = yield from iter_limited(itershuffle(stream), needed)  # get annotations
            else:
                print("Sampling via Active Learning:\n")
                # TODO: remove this 200
                data = list(iter_limited(stream, select_per_epoch))
                # ... uncertainty sampling over data to produce sampled_data
                eg = yield from iter_limited(sampled_data)  # get annotations
    except StopIteration:
        pass
```
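To drive that generator in the simulation, my plan is roughly the following (simulate_annotator and gold_lookup are hypothetical names, and this assumes iter_limited forwards values sent into it, which is what my yield from setup relies on):

```python
# Hypothetical driver standing in for the human annotator: pull a task from
# the wrapped stream, attach the gold answer, and send it back in, the way
# the web app plus REST API normally would.
def simulate_annotator(wrapped_stream, gold_lookup, update):
    answered = []
    try:
        eg = next(wrapped_stream)
        while True:
            eg["answer"] = gold_lookup[eg["text"]]  # "accept"/"reject" from the gold set
            answered.append(eg)
            update([eg])                   # recipe's update(), retrains the model
            eg = wrapped_stream.send(eg)   # hand the answered task back to al_generator
    except StopIteration:
        pass
    return answered
```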
Currently it seems like I can use db.get_sessions_examples(db.sessions) to access the current session's answers, and db.add_examples([example], [dataset]) to write to the DB. However, I'm not sure how I should be generating the annotated examples themselves.
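What I imagine an annotation should look like, based on what db-out produces, is the task dict from the stream plus an "answer" field; something like this (answer_example and gold_labels are hypothetical names of mine):

```python
# Sketch: turn a task from the stream plus a gold label into the answered
# example Prodigy would store. gold_labels is a hypothetical {text: bool}
# mapping built from my fully labeled dataset.
def answer_example(eg, gold_labels):
    eg = dict(eg)  # don't mutate the task coming out of the stream
    eg["answer"] = "accept" if gold_labels[eg["text"]] else "reject"
    return eg
```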
I may be overcomplicating things, but the reason the setup is complex is that Robert's code assumes labeling is blocking, so the entire pipeline just stops until the user labels; he also keeps the data in global variables, which makes things easier to code in his case.