Hello, I am using a custom database to retrieve text and store annotations. I've set the configuration "allow_work_stealing" to "false" in the Prodigy config file. However, I encountered an error: AttributeError: 'AnnotationDatabase' object has no attribute 'get_hashes'. I have searched through the documentation but could not find any information regarding this method. I am curious about what should be included in this method.
That is indeed interesting. Could you share the full traceback?
Also, just to be clear: if you turn work stealing back to `true`, does this issue go away?
I just checked the source code and can confirm that this is a documentation mistake on our end. We added a `get_hashes` method to our main `Database` class, which the core `Controller` app does use on startup. I will try to get these items added to the documentation right away, and I'll check whether there are other methods missing as well.
Thanks for the notification, and sorry about this headache!
Here's the API doc for the `get_hashes` method inline:

| Argument | Type | Description |
| --- | --- | --- |
| `*names` | str | The dataset names. |
| `kind` | str | The kind of hash to check, can be "input" or "task" (default). |
| **RETURNS** | Set[int] | Set of the hashes in the provided dataset names. |
I also found another method that you probably should implement. Its docstring:

> Get all session datasets associated with a parent dataset. Finds all the session datasets that have examples also associated with the parent dataset. Can be an expensive query for large datasets.

| Argument | Type | Description |
| --- | --- | --- |
| | str | The parent dataset name. |
| **RETURNS** | List[str] | The list of session dataset names. |
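To make the described lookup concrete, here is a minimal sketch of that logic in plain Python, using in-memory dicts in place of a real database backend (the function name, dataset names, and hashes are all made up for illustration):

```python
# Hypothetical sketch: find session datasets that share examples with a
# parent dataset. In-memory dicts stand in for a real database backend.
from typing import Dict, List, Set


def find_session_datasets(examples: Dict[str, Set[int]], parent: str) -> List[str]:
    """Return names of session datasets sharing at least one example with `parent`."""
    parent_hashes = examples[parent]
    return sorted(
        name
        for name, hashes in examples.items()
        if name != parent and hashes & parent_hashes
    )


examples = {
    "news_ner": {1, 2, 3, 4},        # the parent dataset
    "news_ner-alice": {1, 2},        # session datasets share example hashes
    "news_ner-bob": {3},
    "other_project-carol": {99},     # unrelated session dataset
}
print(find_session_datasets(examples, "news_ner"))
# ['news_ner-alice', 'news_ner-bob']
```

A real implementation would run this as a database query rather than scanning everything in memory, which is why the docstring warns it can be expensive for large datasets.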
Hello, thank you for your reply. In the previous version, there were two methods related to hashes: `get_input_hashes` and `get_task_hashes`. I have already implemented them correctly to get Prodigy running on the customized database. How do those methods differ from `get_hashes`?
@Michelle-Ming96 Prodigy v1.12 does change quite a bit about the Database so it would be great if you could share a bit more about your use case. What Database are you trying to connect to?
`get_hashes` is just a convenience method we added to reduce some verbosity in our `Controller` implementation. It just defers to the existing `get_task_hashes` and `get_input_hashes` methods:

```python
def get_hashes(
    self, *names: str, kind: Literal["task", "input"] = "task"
) -> Set[int]:
    """
    *names (str): The dataset names.
    kind (str): The kind of hash. Can be "input" or "task".
    RETURNS (set): Set of the hashes in the provided dataset names.
    """
    if kind not in ["input", "task"]:
        raise ValueError("Can only use `task` or `input` kinds of hashes.")
    if kind == "task":
        hashes = self.get_task_hashes(*names)
    elif kind == "input":
        hashes = self.get_input_hashes(*names)
    return hashes
```
For your reference, the `db` component is a part of Prodigy that is not compiled with Cython, so you're free to view the source implementation yourself. Just find Prodigy in your installed `site-packages` for your Python environment and navigate to the `db` module.
Hello, thank you for your reply. I am using the Python package Mondigy to connect MongoDB with Prodigy. Data is retrieved from MongoDB, and annotations are saved there as well. In my project there are multiple annotators, and we want to ensure that each document is allocated to only one annotator, with all annotators receiving a similar number of passages to annotate. To achieve this, I opened multiple sessions and set "allow_work_stealing" to "false". However, after upgrading to version 12, the custom loader is no longer supported: I encountered an error stating "could not resolve loader". It appears that the "prodigy_loader" entry points are no longer supported in the latest version. What should I do to enable a custom loader?
Could you share the full stacktrace?
I might also advise using `annotations_per_task: 1` in this case. That way, even if one annotator is much faster than the rest, the task router will still distribute the tasks evenly and won't allocate more tasks to the one speedy annotator, assuming the annotators are known upfront and set via the `PRODIGY_ALLOWED_SESSIONS` environment variable.
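For illustration, the combination described above might look like this in your `prodigy.json` (a config sketch using only the settings discussed in this thread):

```json
{
  "allow_work_stealing": false,
  "annotations_per_task": 1
}
```

You would then declare the known annotators upfront when starting the server, e.g. `PRODIGY_ALLOWED_SESSIONS=alice,bob prodigy ...` (the session names here are placeholders).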
I'm going to explore this and will report back once I've found something, just to double-check that we didn't accidentally break anything. One question, though: is there anything else you can share about your setup? What kind of data are you annotating? NER? Audio?
In the meantime, you might still be able to generate the stream by leveraging a custom recipe and handling the fetching via Python. The loader should get fixed, but the custom recipe might be able to unblock you in the short term.
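As a rough sketch of what the Python side of such a recipe could do (everything here is hypothetical: the `mongo_docs_to_tasks` helper and the field names are made up, and the pymongo fetching is only indicated in a comment, not a tested call):

```python
# Hypothetical helper for a custom recipe: shape MongoDB documents into
# Prodigy-style task dicts with a "text" key. In a real recipe you would
# fetch the documents with pymongo, e.g.:
#   from pymongo import MongoClient
#   docs = MongoClient()["mydb"]["passages"].find({})
# (the database, collection, and field names are placeholders)
from typing import Dict, Iterable, Iterator


def mongo_docs_to_tasks(docs: Iterable[Dict]) -> Iterator[Dict]:
    """Yield task dicts Prodigy can consume, keeping the Mongo id as meta."""
    for doc in docs:
        yield {
            "text": doc["text"],
            "meta": {"mongo_id": str(doc.get("_id", ""))},
        }


# Stand-in documents so the sketch runs on its own:
docs = [{"_id": 1, "text": "First passage."}, {"_id": 2, "text": "Second passage."}]
tasks = list(mongo_docs_to_tasks(docs))
print(tasks[0])
# {'text': 'First passage.', 'meta': {'mongo_id': '1'}}
```

In a custom recipe, you'd return this generator as the `"stream"` entry of the components dictionary the recipe returns.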
This section of the docs mentions how you can pass the database class directly at the end of the recipe.
@Michelle-Ming96 We're working on setting up a new loaders registry for the new `prodigy.components.stream.get_stream` utility and are hoping to have that ready for v1.13. For now, the legacy loaders registry that `mondigy` is using won't work.
There's nothing stopping you from using the old `prodigy.components.loaders.get_stream` util in a custom recipe, so if you still want to load data from your MongoDB instance, that would be my recommendation.
Alternatively, you could just export data from the MongoDB instance to a JSONL file and use that as the source argument for your recipe.
I haven't tested this, but I believe the Prodigy `db-out` command should still work with your custom Database:

```
prodigy db-out my_mongodb_source_dataset ./output
```
You can then start one of the default Prodigy recipes (like `ner.manual`) pointing to this JSONL file:

```
prodigy ner.manual new_annotations ./output/my_mongodb_source_dataset.jsonl ...
```