Using Google Firestore as database or not

I have all my documents in Google Firestore, so I stream my training examples from Firestore into the Prodigy app. I annotate the data without any preprocessing, but I do some preprocessing when training. This might change over time, though, so the text field saved by Prodigy might need to be updated over time. FYI, I train the model in the loop, like the .teach recipe, on the preprocessed data.

I figured it's easier for me to keep all the annotated data in Firestore as well, but it's not exactly clear to me what I need to implement in order for Prodigy to use Firestore instead of SQLite, PostgreSQL, etc. I've looked here and in PRODIGY_README.html, but I'm still not sure.

Prodigy lets you pass in a custom Database class via the "db" setting returned by a custom recipe, or via an entry point of your own Python package installed in the same environment.

So basically, you can write a class that exposes the same methods and properties as the built-in Database class but writes to your remote Firestore database. For instance, it’d have a method add_examples that takes a list of examples and a list of one or more dataset names and then adds those examples to the given datasets in your custom database. The datasets property returns a list of all dataset names in your custom database, and so on.

It’s also possible that some of the methods won’t even have to do anything in your case – for example, I’m not sure reconnecting is an issue with Firestore, so your reconnect would just do nothing. Similarly, the link and unlink methods are really only used internally within Prodigy’s existing database class. So in your Firestore connection, you can just write to a table directly if you want. (Not 100% sure what the best practices are for Firestore/Firebase.)
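To make this more concrete, here's a minimal sketch of what such a class could look like, assuming the google-cloud-firestore client and a made-up collection layout (a datasets collection with an examples subcollection per dataset). The method names mirror the built-in Database API, but everything else is just one possible design:

from google.cloud import firestore

class FirestoreDatabase:
    def __init__(self):
        # The Firestore client manages its own connections
        self.client = firestore.Client()

    @property
    def datasets(self):
        # List of all dataset names in the database
        return [doc.id for doc in self.client.collection('datasets').stream()]

    def add_examples(self, examples, datasets):
        # Add a list of examples to one or more datasets
        for name in datasets:
            coll = self.client.collection('datasets').document(name).collection('examples')
            for eg in examples:
                coll.add(eg)

    def get_dataset(self, name):
        # Return all examples in a dataset
        coll = self.client.collection('datasets').document(name).collection('examples')
        return [doc.to_dict() for doc in coll.stream()]

    def reconnect(self):
        # Nothing to do for Firestore
        pass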

For details on the API, you can check out the Readme or the source of components/db.py in your Prodigy installation.


Cool, thanks - that's what I figured I needed to do. I was just not sure if I needed to implement everything from the Database class.

How does Prodigy handle duplicates? Is there anything preventing an example from being presented twice? If so, how?

Yes, that's determined using the "_input_hash" and "_task_hash" values. The input hash describes the raw input data – e.g. the text or the image – and the task hash the input data plus pre-defined annotations, if available. This lets you distinguish between questions on the same input and identical questions. For instance, in a workflow like ner.teach, you might have several questions with different highlighted spans on the same text. Prodigy uses the task hash to check whether an example is identical to something that's already in the dataset. Later on during training, we can then use the input hash to find all annotations on the same text and merge them.
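To illustrate, here's how the two hashes behave for two questions on the same text, using Prodigy's set_hashes helper (the tasks themselves are made up):

from prodigy import set_hashes

eg1 = set_hashes({'text': 'Apple is great', 'spans': [{'start': 0, 'end': 5, 'label': 'ORG'}]})
eg2 = set_hashes({'text': 'Apple is great', 'spans': [{'start': 0, 'end': 5, 'label': 'FRUIT'}]})

assert eg1['_input_hash'] == eg2['_input_hash']  # same raw input
assert eg1['_task_hash'] != eg2['_task_hash']    # different questions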

That makes sense. I have to be careful when I use text in the backend for teach and present a corresponding HTML report in the app. Careful because I want to experiment with the HTML-to-text preprocessing, so the text might change although the report and label should remain the same, and therefore the _input_hash should remain the same. Is that correctly understood?

What is the difference between Dataset.id and Dataset.name in the Database class? I just started the implementation, but that part confuses me a bit. name is unique, so why isn't this just the id?

Another thing that confuses me is what's going on in drop_dataset(). The goal is to delete a specified dataset and all linked examples, if those examples are not linked to any other dataset. Is that correct?

Is the relationship between Example and Dataset many-to-many? Also, when is Database.save() fired? I don't think I need to implement that either, but I am not sure.

A final question. I imagine I do something like this:

import prodigy

db = Database('firestore')

@prodigy.recipe('custom-recipe')
def custom_recipe():
    return {'db': db}  # etc.

but that won't affect my CLI commands? How can I fix that?

By the looks of it, I now have a working Database, but I'm not sure whether some features are missing and whether those could turn out to be vital later on.

The id is an automatically generated integer ID that's used internally and the name is the human-readable string name like "my_cool_dataset". Basically, the dataset-related methods need to be able to create and retrieve datasets by their string names – how you implement this under the hood is up to you.

Yes, exactly. In the built-in implementation, the same example can be part of more than one dataset and it'll only be stored once. So when deleting a dataset, we only want to delete the examples that are only present there and not in any other set. Not sure how well this logic translates to Firestore – if you implement a unique record for each example, the drop_dataset method could also just dump everything and be done with it.
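With the made-up layout from the sketch above, where each example document lives under its dataset, drop_dataset could indeed just dump everything. One thing to keep in mind is that Firestore doesn't delete subcollections automatically, so the example documents have to be deleted individually:

def drop_dataset(self, name):
    # Each example is stored once per dataset, so deleting the
    # dataset's documents is all there is to it
    coll = self.client.collection('datasets').document(name).collection('examples')
    for doc in coll.stream():
        doc.reference.delete()
    self.client.collection('datasets').document(name).delete()
    return True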

Yes.

This is fired only if it exists, and at the end of an annotation session when the user exits the server. A better name for the method would probably have been save_and_exit or something like that. If you need to trigger any final actions, like closing connections or confirming stashed changes, that's where you would do it.
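So a minimal implementation can be a no-op, or just release the client, e.g. (assuming a Firestore client version that exposes close()):

def save(self):
    # Called once when the user exits the server – release the client
    self.client.close()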

If you wrap your custom database in a Python package and expose an entry point firestore in the prodigy_db group, Prodigy should recognise it by its string name, and the logs should say something like "DB: Added X connector(s) via entry points". You can then also edit your prodigy.json and add "db": "firestore" there.
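For example, with setuptools, the entry point could be declared like this in the package's setup.py (the package and module names here are hypothetical):

from setuptools import setup

setup(
    name='prodigy-firestore',
    packages=['prodigy_firestore'],
    entry_points={
        'prodigy_db': [
            'firestore = prodigy_firestore.db:FirestoreDatabase',
        ],
    },
)

After installing the package, "db": "firestore" in your prodigy.json (or in a recipe's return value) should then resolve to your custom class.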


Hi @ines, I am working on a project to stream and store annotated data in Snowflake DB. It looks like I have to write a custom DB class to make it work, but I'm confused about where to start, so I have a few questions:

  • Do you have a tutorial on how to create a custom DB class? It would be easier for me to understand.
  • Can you list all the methods in the DB class that I have to implement?
  • If possible, can you share a sample custom db.py for an external DB for reference, or point me to a GitHub project that does that?

You'll be able to find the detailed API reference of the Database class in your PRODIGY_README.html (in the "API" section). It lists all the methods, their arguments and expected return values. This thread also has some notes on the methods you might not have to implement yourself, since they're not actually called internally.

If you prefer to look at an actual implementation, you can check out Prodigy's built-in components/db.py. To find the location of your Prodigy installation, you can run the following:

python -c "import prodigy; print(prodigy.__file__)"

I don't know of any publicly shared custom Database implementation, but I do know that quite a few users have done it. @nix411 If you've made progress on your Firestore integration, maybe you can give some tips and examples? :slightly_smiling_face:
