Loading a dataset from the DB instead of from disk/api?

Is there any way to load an unannotated dataset into the DB for use in annotation tasks, instead of reading from disk or an API? I have remote users doing the annotating, and I'd like to be able to load the data into the database since it provides some level of security. The data I'm working with can't be stored on the users' disks for security reasons. Unfortunately, implementing a custom API is probably outside the scope of the project I'm working on at the moment.

I tried the following and then checked my DB, but it doesn't look like the data is loaded into the DB until it has been annotated?

prodigy textcat.teach test_set en_core_web_sm test.jsonl --label TESTING

Yes, Prodigy only stores the annotated examples in the dataset when they come back from the REST API, not the incoming stream. This also means that Prodigy only fetches one batch at a time from your stream via the /get_questions endpoint – just enough to fill up the queue for the annotator. Streams are generators, so Prodigy has no concept of the full scope of the raw data: it just keeps asking for more batches until the stream is exhausted.
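
To make the "streams are generators" point more concrete, here's a minimal, plain-Python sketch (not Prodigy-specific, just an illustration) showing that a lazy stream only produces the items that are actually requested:

# stream_demo.py – lazy, batched consumption of a generator
from itertools import islice

def stream():
    # Pretend this reads from a file, API or database – nothing is
    # produced until a batch is actually requested.
    for i in range(1_000_000):
        yield {"text": f"Example {i}"}

examples = stream()
first_batch = list(islice(examples, 10))  # only the first 10 items are generated
print(first_batch[0])                     # {'text': 'Example 0'}

Prodigy consumes your stream in this lazy fashion, which is why nothing ends up in the database until an example has actually been annotated.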

So if you run Prodigy on a server controlled by you, and have the annotators access the web app via their browser (e.g. authenticated via your proxy solution), this should meet your security requirements? Or am I misunderstanding the requirements?

In any case, you can definitely read in data from an existing Prodigy dataset, if that makes it more convenient. Importing your raw data is easy – the db-in command can read in data using the file loaders supported by Prodigy. For example:

prodigy db-in raw_dataset /your_data.jsonl  # or any other format

If the source argument of a recipe is not set, it defaults to sys.stdin. This lets you pipe data forward from a previous process, e.g. a custom script. All you have to do is print the dumped JSON of each annotation task, for example:

# dataset_loader.py
from prodigy.components.db import connect()
import json

examples = db.get_dataset('raw_dataset')
for eg in examples:
    print(json.dumps(eg))

You can then run the custom loader and recipe like this:

python dataset_loader.py | prodigy ner.teach ner_dataset en_core_web_sm

Of course, you don't have to use the Prodigy database. You could also plug in any other solution or a different database. Instead of fetching all examples at once and then iterating over them, your loader script could also fetch X examples at a time and make a new request when the previous batch is running out. This should work fine with Prodigy's streaming logic, might reduce loading times for very large datasets, and would mean that the data only has to leave your database when it's needed to fill up the annotator's queue.
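
Here's a rough sketch of that batched approach, assuming your raw texts live in a raw_examples table with a text column in a SQLite database (the table, column and file names are just placeholders for whatever your setup uses):

# batched_loader.py – fetch examples from a database in small batches
# and print them as JSON lines for Prodigy to read from stdin
import json
import sqlite3

BATCH_SIZE = 50

def fetch_batches(conn):
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT text FROM raw_examples LIMIT ? OFFSET ?",
            (BATCH_SIZE, offset),
        ).fetchall()
        if not rows:
            break  # the previous batch was the last one
        for (text,) in rows:
            yield {"text": text}
        offset += BATCH_SIZE

conn = sqlite3.connect("raw_data.db")
for eg in fetch_batches(conn):
    print(json.dumps(eg))

You'd pipe it into a recipe exactly like the loader above, e.g. python batched_loader.py | prodigy ner.teach ner_dataset en_core_web_sm.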


Great tip, thanks!

Forgot to mention that we're currently attempting to run the app inside a Docker container on the user's machine, since we haven't acquired a server to run it on yet.

Looks like this should work for my weird use case.

Thanks!

You might want to check out ngrok:

We haven't tested it in detail yet, but it looks very promising. Their service lets you expose a local service via a public URL, and it also supports password protection and various other features (even on the free plan).
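
For example, exposing a locally running Prodigy instance (which serves on port 8080 by default) should be as simple as something like the following – we haven't verified this ourselves, so treat it as a sketch:

ngrok http 8080

ngrok then prints a public URL that tunnels to the local port.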

Unfortunately, ngrok.com is blocked by our corporate firewall. 🙁

Your dataset_loader.py should be sufficient to get things moving while we wait on a server.

In case anyone runs into this later: there was a typo in the example code above. It should read:

# dataset_loader.py
from prodigy.components.db import connect
import json

db = connect()

examples = db.get_dataset('raw_dataset')
for eg in examples:
    print(json.dumps(eg))

Thanks!