How can I override Controller.get_questions

Hi,
In the doc it mentions that we can override the Controller.

I found this in the code (in __main__.py):

recipe = get_recipe(command)
controller = recipe(*args, use_plac=True)

I couldn't find the implementation of the @recipe decorator.

My question is: What is the best way to override the get_questions method?

Or if it's easier: How can I create a Controller that would behave similar to the one created via the @recipe decorator?

I'd like to leverage the syntax and docs from @recipe and only override what I need (in this case, the get_questions method).

And the follow up question: Once I have a Controller created, how can I link it easily with the prodigy bin? Is calling the set_controller from app.py the correct way? Or is there a cleaner way?

Thank you for your help
Thibault

Hi! Could you provide some more details on what exactly you want to achieve by overriding the get_questions method? I wouldn't necessarily recommend this for most use cases because it's fairly complex and you'd have to make sure you handle all possible scenarios correctly and manage the already annotated examples that should be excluded etc.

That said, a recipe can also return an instance of the Controller instead of a dictionary of components. So this would be the correct and most elegant way to do it: construct the controller in your recipe and return it.

Hi,
For my use case, I'd like to implement my own logic for deciding whether a user should get more questions or not (basically, each user should get exactly N questions).

So I could simply reuse any existing controller and add some custom logic that decides whether to return the result of Prodigy's get_questions, or nothing if I know the user (identified by session_id) has already processed N questions.

At the moment I'm thinking of overriding __main__.py to use the controller and then monkey-patching the get_questions method to add my own logic, but that sounds super ugly...

Could you show me an example of a recipe which creates its own controller? Ideally reusing the logic behind @recipe? As I said, I'm happy with all the heavy lifting handled by Prodigy. I just want to implement my own "job dispatching" (get_questions) logic.

Hope that clarifies!

Ah, so if you already have a monkey-patched solution that works, then this should be pretty straightforward to move to the custom recipe. Basically, all you have to do is construct the Controller in the recipe with the given arguments, instead of returning just a dictionary. For example:

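# Before: the recipe returns a dictionary of components
@recipe("foo")
def some_function(dataset):
    ...
    return {"dataset": dataset, "stream": stream}  # etc.

# After: the recipe constructs and returns the Controller itself
@recipe("foo")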
def some_function(dataset):
    ...
    # define whatever you need and set the rest to None
    ctrl = Controller(
        dataset, view_id, stream, update, store, progress, on_load, 
        on_exit, before_db, get_session_id, exclude, config, None)
    # monkey-patch your controller here
    return ctrl
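
For the "each user gets exactly N questions" case, the monkey-patching step could look roughly like the sketch below. It's only an illustration: StandInController is a hypothetical stand-in so the snippet runs without Prodigy, limit_questions_per_session is a made-up helper name, and it assumes get_questions accepts a session_id and returns a list of tasks. Note that it counts tasks sent out, not answers received.

```python
from collections import Counter


class StandInController:
    """Hypothetical stand-in for the real Controller, only here to keep the sketch runnable."""

    def __init__(self, stream):
        self._stream = iter(stream)

    def get_questions(self, session_id=None):
        # Hand out up to two tasks per call, like a tiny batch.
        batch = []
        for task in self._stream:
            batch.append(task)
            if len(batch) == 2:
                break
        return batch


def limit_questions_per_session(ctrl, max_per_session):
    """Wrap ctrl.get_questions so each session is sent at most max_per_session tasks."""
    original = ctrl.get_questions
    sent = Counter()  # session_id -> number of tasks already sent

    def get_questions(session_id=None, *args, **kwargs):
        remaining = max_per_session - sent[session_id]
        if remaining <= 0:
            return []  # this session has reached its quota
        # Caveat: tasks cut off by the slice were already pulled from the
        # stream and are dropped here; a real version may want to re-queue them.
        batch = original(session_id, *args, **kwargs)[:remaining]
        sent[session_id] += len(batch)
        return batch

    # Monkey-patch the instance, as suggested in the recipe above.
    ctrl.get_questions = get_questions
    return ctrl
```

In the recipe, you'd then call something like limit_questions_per_session(ctrl, 10) right before "return ctrl".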

Thanks for the reply.

I'm trying to test it with the ner.manual recipe for now without any changes (except I want to return the Controller instead of a dict)

This is the relevant part:

@recipe("custom_recipe",
....)
def manual(...):
    ...
    view_id = "ner_manual"
    update = None
    db = None
    progress = None
    on_load = None
    on_exit = None
    before_db = None
    get_session_id = None
    config = {
        "lang": nlp.lang,
        "labels": labels,
        "exclude_by": "input",
        "ner_manual_highlight_chars": highlight_chars,
        "auto_count_stream": True,
    }
    ctrl = Controller(
        dataset, view_id, stream, update, db, progress, on_load,
        on_exit, before_db, get_session_id, exclude, config, None)

    return ctrl

When I try to run it, I get the following error:

prodigy custom_recipe test1 blank:en ./news_headlines_short.jsonl --label PERSON,ORG,PRODUCT,LOCATION -F ./recipe.py
Using 4 label(s): PERSON, ORG, PRODUCT, LOCATION
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/thibanir/.virtualenvs/prodigy/lib/python3.9/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/thibanir/.virtualenvs/prodigy/lib/python3.9/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/thibanir/.virtualenvs/prodigy/lib/python3.9/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/data/Project/hexagone/prodigy/./recipe.py", line 97, in manual
    ctrl = Controller(
  File "cython_src/prodigy/core.pyx", line 45, in prodigy.core.Controller.__init__
TypeError: __init__() takes exactly 15 positional arguments (14 given)

Could you tell me, what is the missing argument? I've checked the documentation and it only shows 13...
This is the version of prodigy I'm using: prodigy-1.11.4-cp39-cp39-linux_x86_64.whl

And another question: for all the arguments I'm passing as None, does Prodigy use the default implementation, or does it disable the feature? For example, if get_session_id is None, does that mean I won't be able to use session IDs?

Thanks
Thibault

Ah, sorry, looks like we forgot to update that. Just pushed an update to the site – the expected signature is this:

controller = Controller(dataset, view_id, stream, update, db,
                        progress, on_load, on_exit, before_db,
                        validate_answer, get_session_id, exclude,
                        config, None)

No, passing None doesn't disable anything. The arguments here correspond to what you would (or wouldn't) return from your recipe as a dictionary. So passing in None for update, before_db or get_session_id is the equivalent of not returning that setting from your recipe. Prodigy will then fall back to the default.
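
To illustrate that equivalence, here's a small sketch that turns a recipe-style dictionary into the positional arguments in the order shown above. The helper name components_to_args is made up for this example, and it assumes the dictionary keys match the parameter names; anything the dictionary doesn't define simply ends up as None.

```python
# Positional order taken from the Controller signature shown above.
CONTROLLER_ARG_ORDER = (
    "dataset", "view_id", "stream", "update", "db", "progress", "on_load",
    "on_exit", "before_db", "validate_answer", "get_session_id", "exclude",
    "config",
)


def components_to_args(components):
    """Build the positional argument list from a recipe-style dictionary.

    Hypothetical helper: keys missing from the dictionary become None,
    which is the same as not returning that component from the recipe.
    """
    args = [components.get(name) for name in CONTROLLER_ARG_ORDER]
    args.append(None)  # final positional argument, passed as None above
    return args
```

With that, Controller(*components_to_args({"dataset": dataset, "view_id": view_id, "stream": stream, "config": config})) would be called with 14 positional arguments, and update, before_db, get_session_id and so on would all be None.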

Thank you for the updates! It does work like I was expecting!

Just to check my understanding of get_questions: it mostly calls the stream iterator batch_size times (in the case of feed_overlap=True) and skips all annotations that are already done, using the exclude parameter.

This raises two questions:

  • Are the hashes for a given question the same for all sessions? (The answer seems to be yes, but I wanted to confirm.)
  • If I override the get_questions method, is there any benefit to defining a stream plus extra logic in get_questions? Or could I just override get_questions to retrieve my data and forget about the stream? I wonder if this stream is used in other places. As it seems to be just an iterator that could be calling any API, I'd be tempted to say that get_questions is the only place that calls it. Is that a correct assumption?
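
To make the mental model above concrete, here it is as a minimal plain-Python sketch. This is only my understanding, not Prodigy's actual implementation: take_batch is a made-up name, and I'm assuming tasks are dictionaries that carry an "_input_hash" key (matching the "exclude_by": "input" setting from my config).

```python
from itertools import islice


def take_batch(stream, batch_size, exclude_hashes, hash_key="_input_hash"):
    """Pull up to batch_size tasks from the stream, skipping excluded hashes."""
    # Lazily filter out tasks whose hash was already annotated.
    fresh = (task for task in stream if task.get(hash_key) not in exclude_hashes)
    # islice stops consuming the stream once the batch is full.
    return list(islice(fresh, batch_size))
```

The stream has to be a real iterator (for example a generator) for this to work, so that each call continues where the previous one stopped.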

Thanks for the quick replies
Thibault