Custom Recipe with MongoDB

Hello,

First of all, thanks for developing Prodigy, it's a great tool.

I'm currently trying to use MongoDB as a source for annotations and a place to deposit annotated text. After looking through the docs, it looks like a custom recipe is necessary for storing annotations. I am trying to set up very basic functionality for now. I have a Mongo collection that stores the text to be annotated and a Mongo collection that stores the annotated text. These are referred to in my code as source_collection and dest_collection, respectively. My understanding of a dataset is that it's just a collection of annotated text, and the way we want to implement datasets is up to us.

Here's my simple custom recipe.

import prodigy
import spacy
from prodigy.components.preprocess import add_tokens

@prodigy.recipe(
    'ner-custom',
    dataset=("Dataset to save answers to", "positional", None, str),
)
def custom_recipe(dataset):
    view_id = 'ner_manual'
    mongo_client = CustomMongoClient(MONGO_INSTANCE, PORT, DB_NAME, SOURCE_COLLECTION_NAME, DEST_COLLECTION_NAME)

    def get_inputs_from_mongo():
        for doc in mongo_client.source_collection.find():
            del doc['_id']
            yield doc

    nlp = spacy.blank('en')
    stream = get_inputs_from_mongo()
    stream = add_tokens(nlp, stream)
    
    return {
        'view_id': view_id,
        'db': mongo_client,
        'stream': stream,
        'dataset': dataset,
        'config': {
            'labels': ['ITEM']
        }
    }

Here is my custom MongoDB client.

from pymongo import MongoClient

class CustomMongoClient:

    def __init__(self, mongo_instance, port, db_name, source_collection_name, dest_collection_name):
        self.client = MongoClient(mongo_instance, port)
        self.db = self.client[db_name]
        self.source_collection = self.db[source_collection_name]
        self.dest_collection = self.db[dest_collection_name]

    def get_dataset(self, name, default=None):
        return list(self.dest_collection.find())

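    def get_examples(self, ids, by="task_hash"):
        # Prodigy stores hashes on each task as "_task_hash" / "_input_hash",
        # so query the field named by `by` instead of a fixed "task_id" key
        return list(self.dest_collection.find({"_" + by: {"$in": ids}}))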

    def add_dataset(self, name, meta={}, session=False):
        pass

    def add_examples(self, examples, datasets):
        self.dest_collection.insert_many(examples)

    def get_sessions_examples(self, session_ids=None):
        return self.get_dataset("temp")

As shown above, I've implemented some of the DB functions as outlined in the API. For the dataset functions, I'm basically just pulling everything from my dest_collection. When a dataset is "added", I don't do anything.

When I try to run prodigy ner-custom temp-dataset -F custom_recipe.py, I get an odd error regarding my custom DB class.

15:47:22: CLI: Importing file custom_recipe.py
15:47:22: RECIPE: Calling recipe 'ner-custom'
15:47:22: CONFIG: Using config from global prodigy.json
15:47:22: VALIDATE: Validating components returned by recipe
15:47:22: CONTROLLER: Initialising from recipe
15:47:22: VALIDATE: Creating validator for view ID 'ner_manual'
15:47:22: VALIDATE: Validating Prodigy and recipe config
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 63, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 143, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/core.pyx", line 342, in prodigy.core.Controller.connect_db
TypeError: argument of type 'CustomMongoClient' is not iterable

Do I need to implement all DB methods for the custom DB to work, or am I doing something wrong with my setup? I've also looked at mondigy (jdagdelen/mondigy on GitHub, a small component for using MongoDB databases with Prodigy annotation applications), but I'm still not sure what's happening here. Any help is appreciated. Thanks!

Hi there!

Prodigy only supports SQLite, MySQL and Postgres natively. Any other database will require a custom implementation. This section of our docs gives more details on how you could do that:

I've not used mondigy myself, so I can't comment on its utility. Alternatively, you can export the MongoDB data into a JSONL file and use that as input data for Prodigy. Similarly, you can use the db-out recipe to get JSONL data out of Prodigy and then use a small Python script to load it into Mongo again. It's a bit of scripting this way, but it might be more convenient than writing the database class.
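
A minimal sketch of that round trip, assuming pymongo and the collection names from the question. The helper names write_jsonl and read_jsonl and the file names are made up for illustration; db-out is the only Prodigy command involved.

```python
import json


def write_jsonl(docs, path):
    """Write an iterable of dicts to a JSONL file, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf8") as f:
        for doc in docs:
            # Drop Mongo's "_id" (an ObjectId), which json.dumps cannot serialise.
            doc = {k: v for k, v in doc.items() if k != "_id"}
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical usage (needs pymongo and a running MongoDB instance):
#   from pymongo import MongoClient
#   db = MongoClient(MONGO_INSTANCE, PORT)[DB_NAME]
#   write_jsonl(db[SOURCE_COLLECTION_NAME].find(), "source.jsonl")
#   ... annotate, then run: prodigy db-out temp-dataset > annotated.jsonl
#   db[DEST_COLLECTION_NAME].insert_many(list(read_jsonl("annotated.jsonl")))
```

With the exported file in place, a built-in recipe should be enough for the annotation step, for example prodigy ner.manual temp-dataset blank:en source.jsonl --label ITEM, so no custom recipe or database class is needed on this route.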

Thanks, I'll try that. As for implementing a custom DB class, is it required to have all methods defined?

I would say that it is recommended. It's possible that you'll only use a subset of methods initially, but especially if you're going to be using it for the long term, you'll likely want to ensure that everything works as expected.
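
On the traceback itself: "argument of type 'CustomMongoClient' is not iterable" is the error Python raises when the in operator is applied to an object that supports neither membership tests nor iteration. Because it is raised inside connect_db, a plausible reading is that the dataset name is checked with something like dataset in db, which the class in the question cannot answer. Below is a rough sketch under that assumption. MongoBackedDB and the "dataset" field are hypothetical names, the collection is only assumed to behave like a pymongo collection (distinct, find, insert_many), and this covers just a few methods rather than the full database API.

```python
class MongoBackedDB:
    """Sketch of a Prodigy-style DB wrapper around a pymongo-like collection.

    Assumption: every stored example carries a "dataset" field naming the
    dataset it belongs to. That field is a convention of this sketch only.
    """

    def __init__(self, dest_collection):
        self.dest = dest_collection
        # Datasets that were created but have no examples yet. Kept in memory
        # only, so this part is lost when the process exits.
        self._empty = set()

    @property
    def datasets(self):
        return sorted(set(self.dest.distinct("dataset")) | self._empty)

    def __contains__(self, name):
        # Enables `name in db`; without it Python raises the TypeError above.
        return name in self.datasets

    def __len__(self):
        return len(self.datasets)

    def add_dataset(self, name, meta=None, session=False):
        self._empty.add(name)

    def get_dataset(self, name, default=None):
        if name not in self:
            return default
        return list(self.dest.find({"dataset": name}))

    def add_examples(self, examples, datasets=tuple()):
        # Store one copy of each example per dataset, tagged with its name.
        docs = [dict(eg, dataset=name) for name in datasets for eg in examples]
        if docs:  # insert_many rejects an empty list
            self.dest.insert_many(docs)
```

Tagging each document with its dataset name keeps everything in one collection while still letting get_dataset return only the examples for the requested dataset, instead of the whole collection as in the original class.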
