get_hashes method in v1.12

Hello, I am using a custom database to retrieve text and store annotations. I've set the configuration "allow_work_stealing" to "false" in the Prodigy config file. However, I encountered an error: AttributeError: 'AnnotationDatabase' object has no attribute 'get_hashes'. I have searched through the documentation but could not find any information regarding this method. I am curious about what should be included in this method.

That is indeed interesting. Could you share the full traceback?

Also, just to be clear, if you turn work stealing back to true then this issue does not appear?

I just checked the source code and can confirm that this is a documentation mistake on our end. We added a get_hashes method to our main Database class which the core Controller app does use on startup.

I will try to get these items added to the documentation right away and I'll check and see if there are perhaps other methods missing as well.

Thanks for the notification, and sorry about this headache!

Here's the API doc for the get_hashes method inline:

Database.get_hashes

Argument Type Description
*names list The dataset names.
kind str The kind of hash to check, can be "input" or "task" (default)
RETURNS Set[int] Set of the hashes in the provided dataset names.

I also found another method that you probably should implement.

Database.get_dataset_sessions

Get all session datasets associated with a parent dataset. Finds all the session
datasets that have examples also associated with the parent dataset. Can be an
expensive query for large datasets.

Argument Type Description
dataset_name str The parent dataset name
RETURNS List[str] The list of session dataset names
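For a custom database, the description above could be sketched roughly as follows. This is a toy in-memory version, not Prodigy's implementation: the ToyDatabase class, its link helper, and the dict-based storage are all made up for illustration, and a real backend would run the equivalent query against its own tables or collections.

```python
from typing import Dict, Iterable, List, Set


class ToyDatabase:
    """Hypothetical in-memory store, used only to illustrate the lookup."""

    def __init__(self) -> None:
        # dataset name -> IDs of the examples linked to that dataset
        self.links: Dict[str, Set[int]] = {}
        # names of the datasets that are session datasets
        self.sessions: Set[str] = set()

    def link(self, dataset: str, example_ids: Iterable[int], session: bool = False) -> None:
        """Hypothetical helper: attach examples to a dataset."""
        self.links.setdefault(dataset, set()).update(example_ids)
        if session:
            self.sessions.add(dataset)

    def get_dataset_sessions(self, dataset_name: str) -> List[str]:
        """Session datasets sharing at least one example with the parent."""
        parent_ids = self.links.get(dataset_name, set())
        return sorted(
            name
            for name in self.sessions
            if name != dataset_name and self.links.get(name, set()) & parent_ids
        )


db = ToyDatabase()
db.link("news", [1, 2, 3])
db.link("news-alice", [1, 2], session=True)
db.link("news-bob", [3], session=True)
db.link("other-carol", [9], session=True)
print(db.get_dataset_sessions("news"))  # ['news-alice', 'news-bob']
```

The set intersection is what makes this expensive on a real database: every session dataset has to be compared against the parent's examples.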

Hello, thank you for your reply. In the previous version, there are two methods related to "hashes": get_input_hashes and get_task_hashes. I have already implemented them correctly to get Prodigy to run on the customized database. How do those methods differ from "get_hashes"?

@Michelle-Ming96 Prodigy v1.12 does change quite a bit about the Database so it would be great if you could share a bit more about your use case. What Database are you trying to connect to?

The get_hashes method is just a convenience method we added to reduce some verbosity in our Controller implementation. It simply defers to get_task_hashes and get_input_hashes.

Implementation

def get_hashes(
    self, *names: str, kind: Literal["task", "input"] = "task"
) -> Set[int]:
    """
    *names (str): The dataset names.
    kind (str): The kind of hash. Can be "input" or "task"
    RETURNS (set): Set of the hashes in the provided dataset names.
    """
    if kind not in ["input", "task"]:
        raise ValueError("Can only use `task` or `input` kinds of hashes.")
    if kind == "task":
        hashes = self.get_task_hashes(*names)
    elif kind == "input":
        hashes = self.get_input_hashes(*names)
    return hashes
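If your custom class already has get_task_hashes and get_input_hashes, adding the same delegation there should be enough to get past the AttributeError. Here is a minimal sketch, assuming a dict-backed storage that is purely hypothetical; only the three method names come from Prodigy, and your real class would keep its existing MongoDB queries.

```python
from typing import Dict, List, Literal, Set, Tuple


class AnnotationDatabase:
    """Hypothetical custom database; the dict storage is illustrative only."""

    def __init__(self) -> None:
        # dataset name -> list of (input_hash, task_hash) pairs
        self.hashes: Dict[str, List[Tuple[int, int]]] = {}

    def get_input_hashes(self, *names: str) -> Set[int]:
        return {inp for name in names for inp, _ in self.hashes.get(name, [])}

    def get_task_hashes(self, *names: str) -> Set[int]:
        return {task for name in names for _, task in self.hashes.get(name, [])}

    def get_hashes(
        self, *names: str, kind: Literal["task", "input"] = "task"
    ) -> Set[int]:
        # Same delegation as the built-in Database class shown above.
        if kind == "task":
            return self.get_task_hashes(*names)
        if kind == "input":
            return self.get_input_hashes(*names)
        raise ValueError("Can only use `task` or `input` kinds of hashes.")


db = AnnotationDatabase()
db.hashes["batch_1"] = [(10, 100), (10, 101), (20, 200)]
print(db.get_hashes("batch_1"))                # task hashes
print(db.get_hashes("batch_1", kind="input"))  # input hashes
```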

For your reference, the db component is a part of Prodigy that is not compiled with Cython, so you're free to view the source implementation yourself.

Just find Prodigy in your installed site-packages for your Python environment and navigate to the prodigy/components/db.py file.

Hello, thank you for your reply. I am using the Python package Mondigy to connect MongoDB with Prodigy. Data is retrieved from MongoDB, and annotations are saved there as well. My project has multiple annotators, and we want to ensure that each document is allocated to only one annotator, with all annotators receiving a similar number of passages to annotate. To achieve this, I opened multiple sessions and set "allow_work_stealing" to "false". However, after upgrading to v1.12, the custom loader is no longer supported. I encountered an error stating "could not resolve loader". It appears that the "prodigy_loader" entry points are no longer supported in the latest version. What should I do to enable a custom loader?

Could you share the full stacktrace?

I might also advise using annotations_per_task: 1 in this case. That way, even if one annotator is much faster than the rest the task router will still distribute the tasks evenly. It won't allocate more tasks to this one speedy annotator, assuming the annotators are known upfront and set via the PRODIGY_ALLOWED_SESSIONS environment variable.

I'm going to explore this, just to double-check that we didn't accidentally break anything, and will report back once I've found something. One question though: is there anything else you can share about your setup? What kind of data are you annotating? NER? Audio?

In the meantime, you might still be able to generate the stream by leveraging a custom recipe and handling the fetching via Python. The loader should get fixed, but the custom recipe might be able to unblock you in the short term.

The section of the docs mentions how you can pass the database class directly at the end of the recipe.

@Michelle-Ming96 We're working on setting up a new loaders registry for the new prodigy.components.stream.get_stream utility and are hoping to have that ready for v1.13. For now the legacy loaders registry that mondigy is using won't work.

There's nothing stopping you from using the old prodigy.components.loaders.get_stream util in a custom recipe so if you still want to load data from your MongoDB instance that would be my recommendation.

Alternatively, you could just export data from the MongoDB instance to a JSONL file and use that as the source argument for your recipe.

I haven't tested this but I believe the Prodigy db-out command should still work with your custom Database.

prodigy db-out my_mongodb_source_dataset ./output

and then you can start one of the default Prodigy recipes (like ner.manual) pointing to this JSONL file.

prodigy ner.manual new_annotations ./output/my_mongodb_source_dataset.jsonl ...
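If db-out turns out not to work with the custom database, the export could also be done by hand in a few lines of Python. The sketch below is untested against a real MongoDB instance: the documents list stands in for whatever cursor your driver returns, and the "body" and "_id" field names are assumptions about your collection. The only Prodigy-specific part is that each line is a JSON object with a "text" key, which is what ner.manual reads.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable, Mapping


def export_jsonl(
    docs: Iterable[Mapping[str, Any]], path: Path, text_field: str = "body"
) -> int:
    """Write one task per line and return how many were written."""
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for doc in docs:
            # str() keeps non-JSON ID types (such as database object IDs) serializable.
            task = {"text": doc[text_field], "meta": {"source_id": str(doc["_id"])}}
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in for a MongoDB cursor; field names here are hypothetical.
documents = [
    {"_id": "a1", "body": "First passage."},
    {"_id": "a2", "body": "Second passage."},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "source.jsonl"
    written = export_jsonl(documents, out_path)
    lines = out_path.read_text(encoding="utf-8").splitlines()

print(written, lines[0])
```

Keeping the original ID under "meta" makes it possible to match the annotations back to the source documents later.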