Hello,
First of all, thanks for developing Prodigy, it's a great tool.
I'm currently trying to use MongoDB as a source for annotations and a place to deposit annotated text. After looking through the docs, it looks like a custom recipe is necessary for storing annotations. I am trying to set up very basic functionality for now. I have a Mongo collection that stores the text to be annotated and a Mongo collection that stores the annotated text. These are referred to in my code as source_collection and dest_collection, respectively. My understanding of a dataset is that it's just a collection of annotated text, and the way we want to implement datasets is up to us.
Here's my simple custom recipe.
import prodigy
import spacy
from prodigy.components.preprocess import add_tokens

@prodigy.recipe(
    'ner-custom',
    dataset=("Dataset to save answers to", "positional", None, str),
)
def custom_recipe(dataset):
    view_id = 'ner_manual'
    mongo_client = CustomMongoClient(MONGO_INSTANCE, PORT, DB_NAME, SOURCE_COLLECTION_NAME, DEST_COLLECTION_NAME)

    def get_inputs_from_mongo():
        # Stream every document from the source collection, minus Mongo's '_id'
        for doc in mongo_client.source_collection.find():
            del doc['_id']
            yield doc

    nlp = spacy.blank('en')
    stream = get_inputs_from_mongo()
    stream = add_tokens(nlp, stream)  # ner_manual needs pre-tokenized text
    return {
        'view_id': view_id,
        'db': mongo_client,
        'stream': stream,
        'dataset': dataset,
        'config': {
            'labels': ['ITEM']
        }
    }
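The streaming part of the recipe above can be tried without a running MongoDB instance. The following is a minimal sketch, assuming a plain list of dicts stands in for what source_collection.find() returns; strip_ids and fake_docs are hypothetical names, not part of Prodigy or pymongo. Unlike the recipe, it yields copies with the '_id' key removed, so the original documents are left unmodified.

```python
def strip_ids(docs):
    """Yield a copy of each document without its '_id' key."""
    for doc in docs:
        # Build a new dict rather than deleting in place, so the caller's
        # documents are not mutated.
        yield {key: value for key, value in doc.items() if key != "_id"}


# Hypothetical stand-in for the documents in the source collection.
fake_docs = [
    {"_id": 1, "text": "Two apples and a pear"},
    {"_id": 2, "text": "One loaf of bread"},
]

tasks = list(strip_ids(fake_docs))
```

Each item in tasks then carries only the fields meant for annotation, which is the shape the recipe passes on to add_tokens.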
Here is my custom MongoDB client.
from pymongo import MongoClient

class CustomMongoClient:
    def __init__(self, mongo_instance, port, db_name, source_collection_name, dest_collection_name):
        self.client = MongoClient(mongo_instance, port)
        self.db = self.client[db_name]
        self.source_collection = self.db[source_collection_name]
        self.dest_collection = self.db[dest_collection_name]

    def get_dataset(self, name, default=None):
        return list(self.dest_collection.find())

    def get_examples(self, ids, by="task_hash"):
        return list(self.dest_collection.find({"task_id": {"$in": ids}}))

    def add_dataset(self, name, meta={}, session=False):
        pass

    def add_examples(self, examples, datasets):
        self.dest_collection.insert_many(examples)

    def get_sessions_examples(self, session_ids=None):
        return self.get_dataset("temp")
As shown above, I've implemented some of the DB functions as outlined in the API. For the dataset functions, I'm basically just pulling everything from my dest_collection. When a dataset is "added", I don't do anything.
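To check the logic of these methods in isolation, the same method names can be backed by a plain dictionary. This is only a sketch under stated assumptions: InMemoryDB is a hypothetical class, the signatures are copied from my client above rather than from Prodigy itself, and the example key "task_hash" is an assumption about how examples are identified. Unlike my Mongo version, get_examples here actually honors the by argument.

```python
class InMemoryDB:
    """Hypothetical in-memory stand-in using the method names from the post."""

    def __init__(self):
        self._datasets = {}  # dataset name -> list of example dicts

    def add_dataset(self, name, meta=None, session=False):
        # Create the dataset only if it does not exist yet.
        self._datasets.setdefault(name, [])

    def add_examples(self, examples, datasets):
        # Append the examples to every named dataset.
        for name in datasets:
            self._datasets.setdefault(name, []).extend(examples)

    def get_dataset(self, name, default=None):
        return self._datasets.get(name, default)

    def get_examples(self, ids, by="task_hash"):
        # Look up examples by whichever key `by` names (assumed key name).
        wanted = set(ids)
        return [
            eg
            for examples in self._datasets.values()
            for eg in examples
            if eg.get(by) in wanted
        ]


db = InMemoryDB()
db.add_dataset("temp")
db.add_examples(
    [{"task_hash": 1, "text": "a"}, {"task_hash": 2, "text": "b"}],
    ["temp"],
)
found = db.get_examples([1])
```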
When I try to run prodigy ner-custom temp-dataset -F custom_recipe.py, I get an odd error regarding my custom DB class.
15:47:22: CLI: Importing file custom_recipe.py
15:47:22: RECIPE: Calling recipe 'ner-custom'
15:47:22: CONFIG: Using config from global prodigy.json
15:47:22: VALIDATE: Validating components returned by recipe
15:47:22: CONTROLLER: Initialising from recipe
15:47:22: VALIDATE: Creating validator for view ID 'ner_manual'
15:47:22: VALIDATE: Validating Prodigy and recipe config
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 63, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 143, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/core.pyx", line 342, in prodigy.core.Controller.connect_db
TypeError: argument of type 'CustomMongoClient' is not iterable
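For context, Python raises this kind of TypeError when the in operator is applied to an object that defines none of __contains__, __iter__, or __getitem__. The minimal sketch below reproduces that without Prodigy or MongoDB. The class names are hypothetical, and whether Controller.connect_db actually performs such a membership test on the db object is an assumption on my part, since that code is compiled and I can't read it.

```python
class NoMembership:
    """Hypothetical DB-like class with no membership support."""


class WithMembership:
    """Hypothetical DB-like class that supports `name in db`."""

    def __init__(self, names):
        self._names = set(names)

    def __contains__(self, name):
        # Called by the `in` operator.
        return name in self._names

    def __len__(self):
        return len(self._names)


def check(db, name):
    """Return the result of `name in db`, or the error message if it fails."""
    try:
        return name in db
    except TypeError as err:
        return str(err)


msg = check(NoMembership(), "temp-dataset")          # an error message string
ok = check(WithMembership(["temp-dataset"]), "temp-dataset")  # True
```

If that assumption holds, adding a __contains__ method to CustomMongoClient might be enough to get past this particular error, though other methods could still be required.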
Do I need to implement all DB methods for the custom DB to work? Or am I doing something wrong with my setup? I've also looked at jdagdelen/mondigy on GitHub (a small component for using MongoDB databases with Prodigy annotation applications), but I'm still not sure exactly what's happening here. Any help is appreciated. Thanks!