Using custom DB (Google Spanner)

Hi,

I'd like to connect Prodigy to Spanner so that it reads the text data from it and writes the annotations back to it.
Our text samples are stored as plain text in a Spanner table (table = "input_data", column name = "text", type = STRING).
The annotations will be stored in another table (table = "annotations").
I'm trying to implement most of the database functions described in the docs (Database · Prodigy · An annotation tool for AI, Machine Learning & NLP).
I'm not sure what structure is required. For example, what should the output of get_examples look like?
If someone has already implemented a connection to a custom database that isn't one of those supported through peewee, it would be great to learn from it.

Thanks

Hi @GalM ,

You can check out this package, which implements most of the database functions for MongoDB. I've used it as a starting point for my own custom annotation setup.

Thanks @kirilov , I'll take a look.

Hi,

I still need your help with this issue.
It's unclear which components are required.
Also, what table structure is needed for Prodigy's internal tables, and which tables should be created?
The package linked in the previous comment is MongoDB-specific and targets an old Prodigy version.
Could you please add documentation listing the exact components needed?

Thanks

If I find a way to connect to Spanner using psql, will that make the task simpler?

You can find the default table structure here: Database · Prodigy · An annotation tool for AI, Machine Learning & NLP. However, if you're using your own database, you can also decide on your own schema. Ultimately, all Prodigy will do is ask for examples or give you examples to store, so if your database can perform these actions, it's up to you how you want to store the data.

If you want to implement a Database class that slots into Prodigy just like the built-in one, you can see the methods that should be implemented in the mondigy implementation here: https://github.com/jdagdelen/mondigy/blob/6c7928121faf1ba7d78fe29367bacd997ecc1f24/mondigy/database.py#L37
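To make the "ask for examples / give you examples to store" contract more concrete, here is a minimal in-memory sketch. It is not Prodigy's implementation: the class name is made up, and the method names and signatures are assumptions modeled on the built-in Database API, so check them against the docs for your Prodigy version. The point is that stored and returned examples are plain lists of JSON-serializable task dicts; a Spanner version would replace the dicts with queries against your "annotations" table.

```python
from collections import defaultdict


class InMemoryDatabase:
    """Hypothetical stand-in for a custom database class (not Prodigy's own)."""

    def __init__(self):
        self._meta = {}                     # dataset name -> metadata dict
        self._examples = defaultdict(list)  # dataset name -> list of task dicts

    def __contains__(self, name):
        return name in self._meta

    def __len__(self):
        return len(self._meta)

    @property
    def datasets(self):
        return sorted(self._meta)

    def add_dataset(self, name, meta=None, session=False):
        # Create the dataset only if it does not exist yet.
        self._meta.setdefault(name, dict(meta or {}, session=session))
        return self._meta[name]

    def add_examples(self, examples, datasets=()):
        # Store every annotated task under each of the given dataset names.
        for name in datasets:
            self.add_dataset(name)
            self._examples[name].extend(examples)

    def get_dataset(self, name, default=None):
        # Return a list of task dicts, or `default` for an unknown dataset.
        if name not in self._meta:
            return default
        return list(self._examples[name])

    def get_task_hashes(self, *names):
        # Collect the task hashes already stored in the given datasets.
        return {
            eg["_task_hash"]
            for name in names
            for eg in self._examples.get(name, [])
            if "_task_hash" in eg
        }


db = InMemoryDatabase()
task = {
    "text": "The tax filer signed the form.",
    "_input_hash": 1,
    "_task_hash": 2,
    "spans": [{"start": 4, "end": 13, "label": "TAX_FILER"}],
    "answer": "accept",
}
db.add_examples([task], datasets=["text_annotation"])
stored = db.get_dataset("text_annotation")
```

Here `stored` is a list containing the same dict that was passed in, which is the shape the rest of this sketch assumes for anything that returns examples.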

If there's a way to automate this so you can connect to Postgres directly, then yes, you should be able to just use the Postgres integration out-of-the-box.

Thanks a lot!

Hi,

I've implemented the necessary parts (I think so :slight_smile: ).
Now I'm running:
prodigy spans.manual text_annotation blank:en - --label FORM,TAX_FILER --loader spanner_loader
And I get ✘ No loader found for 'spanner_loader'.
spanner_loader file:

I'm running the prodigy command from the same folder where the loader file sits.

Thanks

You also need to tell Prodigy where to find your loader by name. One option is to register it as a loader instead of making it a recipe:

from prodigy.util import registry

@registry.loaders.register("spanner_loader")
def spanner_loader(source):
    ...

The loader will always receive whatever you pass in as the source argument on the CLI – for instance, the mondigy package uses this to provide a configuration file.

(The more advanced solution that mondigy uses is to wrap everything in a package and register the loader as an entry point: mondigy/setup.cfg at 6c7928121faf1ba7d78fe29367bacd997ecc1f24 · jdagdelen/mondigy · GitHub)
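As a rough illustration of what the body of such a loader could look like, the sketch below yields one task dict per row, using the "text" key that the annotation recipes read. The `fetch_rows` helper is hypothetical and only stands in for a real Spanner query against the "input_data" table; the "meta" contents are likewise just an example. The registry decorator is left out so the snippet runs without Prodigy installed.

```python
def fetch_rows():
    """Hypothetical stand-in for a Spanner query selecting input_data.text."""
    return [
        ("Form 1040 is due in April.",),
        ("",),  # empty rows are skipped by the loader below
        ("The tax filer signed the form.",),
    ]


def spanner_loader(source=None):
    """Yield one task dict per non-empty row."""
    for (text,) in fetch_rows():
        if text and text.strip():
            # `source` is whatever was passed as the source argument on the CLI.
            yield {"text": text, "meta": {"source": source or "input_data"}}


tasks = list(spanner_loader("input_data"))
```

Because the loader is a generator, rows can be streamed from the database as they are needed instead of being loaded up front.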

Alternatively, you can also make your loader write to standard output, i.e. by calling print(j) instead of yield j. If your loader writes to standard output, you can use it by piping its output forward into the recipe and setting the source to - so it reads from standard input. For example:

prodigy spanner-loader -F spanner_loader.py | prodigy ner.manual dataset blank:en - --label X,Y,Z
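A variation on the same idea, sketched as a plain Python script rather than a Prodigy recipe: it writes one JSON object per line to standard output, which is the format read from standard input when the source is -. The sample texts and the function names are placeholders; in practice the texts would come from the Spanner table.

```python
import json
import sys


def to_jsonl(texts):
    """Turn plain strings into JSON lines with a "text" key."""
    for text in texts:
        yield json.dumps({"text": text}, ensure_ascii=False)


def main(texts, out=sys.stdout):
    # Write one JSON object per line so the consumer can stream the input.
    for line in to_jsonl(texts):
        out.write(line + "\n")


if __name__ == "__main__":
    # Placeholder rows; a real script would query Spanner here.
    main(["Form 1040 is due in April.", "The tax filer signed the form."])
```

Assuming the script is saved under a name of your choice, say spanner_stdout.py, it could be piped forward with python spanner_stdout.py | prodigy ner.manual dataset blank:en - --label X,Y,Z.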

Thanks @ines.
When using the final option, I get an error:

  File "/Users/gal/.pyenv/versions/3.9.4/envs/april-dev-venv/lib/python3.9/site-packages/prodigy/components/db.py", line 84, in connect
    raise ValueError(f"Invalid database id: {db_id}")
ValueError: Invalid database id: spanner

After implementing the whole DB class ourselves, how do we stop Prodigy from trying to use prodigy.components.db?

Also, regarding the second option you suggested:
from prodigy.util import registry fails: "util" is not found.

That's strange :thinking: What's the exact error message you're seeing and is Prodigy installed correctly in the environment?

Prodigy works fine (without Spanner), so I guess it's installed correctly (?).
Sorry, when running it directly through the Python CLI, the import works fine. It's only PyCharm that doesn't recognize it.

When trying to run it with the registry option I get:

File "/Users/gal/.pyenv/versions/3.9.4/envs/april-dev-venv/lib/python3.9/site-packages/prodigy/components/db.py", line 84, in connect
    raise ValueError(f"Invalid database id: {db_id}")
ValueError: Invalid database id: spanner

It looks like it tries to connect using Prodigy's own connect function instead of my custom DB.

Ah, that's likely because of Cython, so it's just the editor not being able to resolve the module.

It looks like this is related to the database itself, not the loader. If you're not using entry points to tell Prodigy where to find the code (e.g. like mondigy does here: https://github.com/jdagdelen/mondigy/blob/6c7928121faf1ba7d78fe29367bacd997ecc1f24/setup.cfg#L26) you can also register it explicitly, e.g.:

from prodigy.util import registry

@registry.databases("spanner")
def spanner_db():
    # Return an instance of your own database class here.
    return YourDatabaseInstance()