This may be multiple questions in one.

Hello, I'm trying to run a grid search over an active learning pipeline, and I want to do this automatically, without an actual human in the loop. To do that, I need to somehow simulate Prodigy and connect to its database without launching the server.
Version: 1.10.8
Location: C:\Users\fhijazi\AppData\Roaming\Python\Python37\site-packages\prodigy
Prodigy Home: C:\Users\fhijazi\.prodigy
Platform: Windows-10-10.0.19041-SP0
Python Version: 3.7.13
My active learning setup follows the instructions from the man himself (Robert Munro), using his PyTorch active learning repository, specifically this file: pytorch_active_learning/active_learning_basics.py at master · rmunro/pytorch_active_learning · GitHub
Now, with active learning there are many parameters, like which type of uncertainty sampling to use, and so on. The trivial way to test which hyperparameters are better is to manually run Prodigy with the model in the loop and label, and to do this for every single parameter setting.
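To make that concrete, the kind of grid I want to search looks roughly like this (the names and values here are just illustrative, mirroring the uncertainty sampling strategies and thresholds in Robert's code, not anything from Prodigy itself):

```python
# Hypothetical search grid; none of these names come from Prodigy
param_grid = {
    "sampling_strategy": ["least_confidence", "margin_confidence", "entropy"],
    "select_per_epoch": [100, 200],
    "minimum_training_items": [100, 400],
}
```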
However, I already have the full dataset labeled, so I'd like to automate this without the Prodigy server. This means I will need to simulate Prodigy by accessing the stream and the update functions.
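As far as I understand, a recipe is just a function that returns its components as a dict, so in principle I could call my recipe directly and drive the stream/update pair myself, without the server. A rough sketch of that idea (my_recipe and lookup_gold are hypothetical placeholders for my own recipe and for looking up an answer in the already-labeled dataset):

```python
# Sketch: drive a recipe without the server. Recipes return their components
# as a plain dict; "my_recipe" and "lookup_gold" are hypothetical names.
components = my_recipe("sim_dataset", "data.jsonl")
stream = components["stream"]      # what the web app would normally consume
update = components.get("update")  # what the server would call with answers

for eg in stream:
    eg["answer"] = lookup_gold(eg)  # simulate the human clicking accept/reject
    if update:
        update([eg])                # let the model in the loop update
```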
The issue I'm having is that I need to somehow connect to the database.
Here's the list of what I need to do:

- get access to the current session's annotations
- get all previous annotations (I can do this by running db-out and then parsing the output string in Python, but I'm sure there's a better way; see the sketch right after this list)
- write the current session's annotations to the Prodigy DB (only needed to simulate a user labeling)
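From what I can tell from the database API docs, connect(), get_dataset() and add_examples() should cover the last two items without parsing db-out; a minimal sketch (the dataset names are placeholders):

```python
from prodigy.components.db import connect

db = connect()  # uses the settings from prodigy.json; no server needed

# all previous annotations, instead of shelling out to db-out
prev_annotations = db.get_dataset("my_dataset")  # list of example dicts

# write simulated annotations back, the way a real session would
answered_examples = [{"text": "great product", "label": "POSITIVE", "answer": "accept"}]
if "sim_session" not in db.datasets:
    db.add_dataset("sim_session")
db.add_examples(answered_examples, datasets=["sim_session"])
```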
I implemented it all as a generator that wraps around the stream iterator, and later use it as stream = al_generator(stream). Here's what I'm currently using:
```python
def al_generator(stream):
    data_prev = ...  # somehow access all annotations from previous sessions
    try:
        while True:
            training_data = data_prev + db.get_sessions_examples(db.sessions)
            training_count = len(training_data)
            # make sure we have enough eval data
            if len(data_val) < minimum_evaluation_items:
                # keep adding to evaluation data first
                print("Creating evaluation data:\n")
                needed = minimum_evaluation_items - len(data_val)
                print(str(needed) + " more annotations needed")
                eg = yield from iter_limited(itershuffle(stream), needed)  # get annotations
            if training_count < minimum_training_items:
                # let's create our first training data!
                print("Creating initial training data:\n")
                needed = minimum_training_items - training_count
                print(str(needed) + " more annotations needed")
                eg = yield from iter_limited(itershuffle(stream), needed)  # get annotations
            else:
                print("Sampling via Active Learning:\n")
                # TODO: remove this 200
                data = list(iter_limited(stream, select_per_epoch))
                # ... uncertainty sampling over data to produce sampled_data
                eg = yield from iter_limited(sampled_data)  # get annotations
    except StopIteration:
        pass
```
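To drive that generator in the simulation, my plan is roughly the following (simulate_annotator and gold_lookup are hypothetical names, and this assumes iter_limited forwards values sent into it, which is what my yield from setup relies on):

```python
# Hypothetical driver standing in for the human annotator: pull a task from
# the wrapped stream, attach the gold answer, and send it back in, the way
# the web app plus REST API normally would.
def simulate_annotator(wrapped_stream, gold_lookup, update):
    answered = []
    try:
        eg = next(wrapped_stream)
        while True:
            eg["answer"] = gold_lookup[eg["text"]]  # "accept"/"reject" from the gold set
            answered.append(eg)
            update([eg])                   # recipe's update(), retrains the model
            eg = wrapped_stream.send(eg)   # hand the answered task back to al_generator
    except StopIteration:
        pass
    return answered
```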
Currently it seems like I can use db.get_sessions_examples(db.sessions) to access the current session's answers, and db.add_examples([example], [dataset]) to write to the DB. However, I'm not sure how I should be generating the annotated examples themselves.
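What I imagine an annotation should look like, based on what db-out produces, is the task dict from the stream plus an "answer" field; something like this (answer_example and gold_labels are hypothetical names of mine):

```python
# Sketch: turn a task from the stream plus a gold label into the answered
# example Prodigy would store. gold_labels is a hypothetical {text: bool}
# mapping built from my fully labeled dataset.
def answer_example(eg, gold_labels):
    eg = dict(eg)  # don't mutate the task coming out of the stream
    eg["answer"] = "accept" if gold_labels[eg["text"]] else "reject"
    return eg
```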
I may be overcomplicating things, but the reason the setup is complex is that Robert's code assumes labeling is blocking, so the entire pipeline just stops until the user labels; he also keeps the data in global variables, which makes things easier to code in his case.