Hello, I am using a custom database to retrieve text and store annotations. I've set the configuration "allow_work_stealing" to "false" in the Prodigy config file. However, I encountered an error: AttributeError: 'AnnotationDatabase' object has no attribute 'get_hashes'. I have searched through the documentation but could not find any information regarding this method. I am curious about what should be included in this method.
That is indeed interesting. Could you share the full traceback? Also, just to be clear: if you turn work stealing back on (`allow_work_stealing: true`), does the issue go away?
I just checked the source code and can confirm that this is a documentation mistake on our end. We added a `get_hashes` method to our main `Database` class, which the core `Controller` app does use on startup.
I will try to get these items added to the documentation right away and I'll check and see if there are perhaps other methods missing as well.
Thanks for the notification, and sorry about this headache!
Here's the API doc for the `get_hashes` method inline:
`Database.get_hashes`

| Argument | Type | Description |
| --- | --- | --- |
| `*names` | str | The dataset names. |
| `kind` | str | The kind of hash to check. Can be `"input"` or `"task"` (default). |
| **RETURNS** | `Set[int]` | Set of the hashes in the provided dataset names. |
I also found another method that you should probably implement.
`Database.get_dataset_sessions`

Get all session datasets associated with a parent dataset. Finds all the session datasets that have examples also associated with the parent dataset. Can be an expensive query for large datasets.

| Argument | Type | Description |
| --- | --- | --- |
| `dataset_name` | str | The parent dataset name. |
| **RETURNS** | `List[str]` | The list of session dataset names. |
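If it helps while the docs are being updated, here is a rough sketch of the matching logic this method is described as performing. The dict-based storage is a stand-in for your real database queries, and the function name mirrors the API purely for illustration; this is not the actual Prodigy implementation.

```python
# Illustrative sketch: a session dataset "belongs" to a parent dataset if the
# two share examples. Here datasets are modeled as sets of example hashes;
# your custom Database would answer this with queries instead.
from typing import Dict, List, Set


def get_dataset_sessions(
    examples_by_dataset: Dict[str, Set[int]], dataset_name: str
) -> List[str]:
    """Return names of datasets sharing examples with the parent dataset."""
    parent = examples_by_dataset.get(dataset_name, set())
    return sorted(
        name
        for name, hashes in examples_by_dataset.items()
        if name != dataset_name and parent & hashes
    )
```

This is why the real query can be expensive on large datasets: it has to compare example membership across every candidate session dataset.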
Hello, thank you for your reply. In the previous version, there are two methods related to "hashes": get_input_hashes and get_task_hashes. I have already implemented them correctly to get Prodigy to run on the customized database. How do those methods differ from "get_hashes"?
@Michelle-Ming96 Prodigy v1.12 does change quite a bit about the Database so it would be great if you could share a bit more about your use case. What Database are you trying to connect to?
The `get_hashes` method is just a convenience we added to reduce some verbosity in our `Controller` implementation. It simply defers to `get_task_hashes` and `get_input_hashes`.

Implementation:
```python
from typing import Literal, Set


def get_hashes(
    self, *names: str, kind: Literal["task", "input"] = "task"
) -> Set[int]:
    """
    *names (str): The dataset names.
    kind (str): The kind of hash. Can be "input" or "task".
    RETURNS (set): Set of the hashes in the provided dataset names.
    """
    if kind not in ["input", "task"]:
        raise ValueError("Can only use `task` or `input` kinds of hashes.")
    if kind == "task":
        hashes = self.get_task_hashes(*names)
    elif kind == "input":
        hashes = self.get_input_hashes(*names)
    return hashes
```
For your reference, the `db` component is a part of Prodigy that is not compiled with Cython, so you're free to view the source implementation yourself. Just find Prodigy in the installed `site-packages` for your Python environment and navigate to the `prodigy/components/db.py` file.
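One quick way to find that path without browsing `site-packages` by hand is `importlib.util.find_spec`. The snippet below demonstrates the idea with the stdlib `json` package; substitute `"prodigy"` in your own environment.

```python
# Locate an installed package's source file on disk. Shown with the stdlib
# "json" package; pass "prodigy" instead in an environment where it is
# installed to find prodigy/__init__.py, then look in components/db.py.
import importlib.util


def package_path(name: str) -> str:
    """Return the filesystem path of an importable module or package."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"Cannot locate source for {name!r}")
    return spec.origin


print(package_path("json"))
```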
Hello, thank you for your reply. I am using the Python package Mondigy to connect MongoDB with Prodigy. Data will be retrieved from MongoDB, and annotations will be saved there as well. In my project there are multiple annotators, and we want to ensure that each document is allocated to only one annotator, with all annotators receiving a similar number of passages to annotate. To achieve this, I opened multiple sessions and set "allow_work_stealing" to "false". However, after upgrading to version 12, the custom loader is no longer supported. I encountered an error stating "could not resolve loader". It appears that the "prodigy_loader" entry points are no longer supported in the latest version. What should I do to enable a custom loader?
Could you share the full stacktrace?
I might also advise setting `annotations_per_task: 1` in this case. That way, even if one annotator is much faster than the rest, the task router will still distribute the tasks evenly; it won't allocate more tasks to the one speedy annotator, assuming the annotators are known upfront and set via the `PRODIGY_ALLOWED_SESSIONS` environment variable.
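To make that concrete, a `prodigy.json` combining the two settings discussed in this thread might look like the sketch below (values shown here are for illustration; check the configuration docs for your version):

```json
{
  "allow_work_stealing": false,
  "annotations_per_task": 1
}
```

You would then declare the known annotators up front, e.g. `PRODIGY_ALLOWED_SESSIONS=alice,bob prodigy ...`, so the task router can split work evenly between them.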
I'm going to explore this and will report back once I've found something, just to double-check that we didn't accidentally break anything. One question, though: is there anything else you can share about your setup? What kind of data are you annotating? NER? Audio?
In the meantime, you might still be able to generate the stream by leveraging a custom recipe and handling the fetching via Python. The loader should get fixed, but the custom recipe might be able to unblock you in the short term.
This section of the docs mentions how you can pass the database class directly at the end of the recipe.
@Michelle-Ming96 We're working on setting up a new loaders registry for the new `prodigy.components.stream.get_stream` utility and are hoping to have that ready for v1.13. For now, the legacy loaders registry that `mondigy` is using won't work.
There's nothing stopping you from using the old `prodigy.components.loaders.get_stream` util in a custom recipe, so if you still want to load data from your MongoDB instance, that would be my recommendation.
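As a rough sketch of that approach, the core of the custom recipe is a generator that turns MongoDB documents into Prodigy task dicts. The collection and field names below are assumptions about your schema, and the commented recipe wrapper is illustrative, not a tested Prodigy recipe:

```python
# Sketch: stream MongoDB documents as Prodigy task dicts from a custom recipe.
# The "text" field and "mongo_id" meta key are assumed schema details.

def mongo_stream(cursor):
    """Yield Prodigy-style task dicts from MongoDB documents."""
    for doc in cursor:
        yield {"text": doc["text"], "meta": {"mongo_id": str(doc["_id"])}}

# Inside a custom recipe you would plug this generator into the returned
# components, along the lines of (untested, names are placeholders):
#
# import prodigy
# from pymongo import MongoClient
#
# @prodigy.recipe("ner.manual.mongo", dataset=("Dataset", "positional", None, str))
# def ner_manual_mongo(dataset: str):
#     cursor = MongoClient()["mydb"]["passages"].find({})
#     return {
#         "dataset": dataset,
#         "stream": mongo_stream(cursor),
#         "view_id": "ner_manual",
#     }
```

Since `mongo_stream` only needs an iterable of dicts, you can test it without a live MongoDB instance by passing in a plain list.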
Alternatively, you could just export data from the MongoDB instance to a JSONL file and use that as the source argument for your recipe.
I haven't tested this, but I believe the Prodigy `db-out` command should still work with your custom Database:

```
prodigy db-out my_mongodb_source_dataset ./output
```

You can then start one of the default Prodigy recipes (like `ner.manual`) pointing to this JSONL file:

```
prodigy ner.manual new_annotations ./output/my_mongodb_source_dataset.jsonl ...
```