MemoryError for db-in on virtual machine

Hello,

I’m planning to build a classifier using batch-train on a virtual machine. I have the data I want in a jsonl file, but running python -m prodigy db-in dataset_name data.jsonl returns the error message:
[screenshot of a MemoryError traceback]

The full dataset is 75GB, but I get the same error for very small datasets too.

If it’s not clear where I’m going wrong, could I run a json.dump straight into a dataset instead?

Many Thanks,
Love the tool

Hi! How much memory does your machine have?

And just to confirm (also in case others come across this thread in the future): The data you want to load in is annotations you’ve already collected, right? Because if you just want to label your data, you won’t have to import it into a dataset first. Datasets only store the collected annotations.
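
For example, if you just want to start labelling for text classification straight from the file, something along these lines should work (the dataset name, model and label here are just placeholders):

prodigy textcat.teach my_dataset en_core_web_sm data.jsonl --label MY_LABEL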

Another thing to keep in mind is that Prodigy’s training commands are optimised for quick experiments – so if you want to train on 75GB of annotated data, you might want to use spaCy directly instead. This gives you more control over the training loop, the data loading and the hyperparameters.
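
If you do end up training with spaCy directly, here’s a rough sketch of what a spaCy 2.x-style text classification training loop could look like. The label name and the two training examples are just placeholders, not your data:

import random
import spacy
from spacy.util import minibatch

# Blank English pipeline with a text classifier (the label is a placeholder)
nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)

# (text, annotations) tuples in spaCy's training format (replace with your data)
train_data = [
    ("This is great", {"cats": {"POSITIVE": 1.0}}),
    ("This is terrible", {"cats": {"POSITIVE": 0.0}}),
]

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(epoch, losses)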


Some background on the import process, in case you do want to customise it and write your own script: during import, Prodigy will load each line of the JSONL data, set the hashes and add the "answer" key if it’s not present in the data. So my guess is that this step is simply taking up too much memory. If you look at the __main__.py shipped with Prodigy, you can see the source of the db_in function. The two important parts are:

  • Calling set_hashes on each record in the data to assign input and task hashes (to allow Prodigy to identify annotations on the same text). If you add a _task_hash (unique ID of the annotation) and _input_hash (unique ID of the input the annotation was collected on, e.g. the text) value to each record in your data yourself, you can skip this step.
  • Initializing the database and calling db.add_examples to add examples (you can find the detailed API docs in your PRODIGY_README.html). For example:
from prodigy.components.db import connect

db = connect()  # uses settings from prodigy.json
db.add_examples(list_of_examples, datasets=['dataset_name'])
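
And for the hashing step from the first point, a minimal sketch of what it does to a single record (the example dict here is made up):

from prodigy.util import set_hashes

# A hypothetical annotated example (replace with a record from your JSONL file)
eg = {"text": "This is a text", "label": "POSITIVE", "answer": "accept"}
eg = set_hashes(eg)  # adds "_input_hash" and "_task_hash" to the dict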

Using the above code, you could write your own script that batches up the examples to import and adds them to the database as they’re loaded.

Thanks very much for the speedy reply. Yes, these are annotations already collected. We’ve been running smaller experiments with Prodigy and just wanted to see how things would scale before focusing on the details.

In the first instance, I hadn’t moved the .prodigy home directory correctly, which was causing problems. Now db-in runs fine for smaller JSONL files. We’re using a 1TB drive to store everything, but the machine only has 32GB of RAM, so I’m guessing a cache is being overloaded during import?

Thanks again!

I’m curious to hear if you’re able to load such a large dataset. I made a quick batched db_in command – can you give it a shot and let me know if it works for you?

db_in.py

# coding: utf8
import sys
from pathlib import Path
from prodigy.components.db import connect
from prodigy.components.loaders import get_loader
from prodigy.util import get_timestamp_session_id, set_hashes
from prodigy.util import prints
from cytoolz import partition_all


def db_in(set_id, in_file, batch_size=1000):
    DB = connect()
    in_file = Path(in_file)
    if not in_file.exists() or not in_file.is_file():
        prints("Not a valid input file.", in_file, exits=1, error=True)
    if set_id not in DB:
        prints("Creating input dataset.", set_id)
        DB.add_dataset(set_id)
    loader = get_loader(None, file_path=in_file)
    annotations = loader(in_file)

    def set_hashes_stream(stream):
        # Add "_input_hash" and "_task_hash" to each example as it's loaded
        for eg in stream:
            yield set_hashes(eg)

    batches = partition_all(batch_size, set_hashes_stream(annotations))
    session_id = get_timestamp_session_id()
    # Create the session dataset once, before importing the batches
    DB.add_dataset(session_id, session=True)

    for batch in batches:
        # partition_all yields tuples, so pass a list to the database
        DB.add_examples(list(batch), datasets=[set_id, session_id])
        prints(
            "Imported {} annotations for '{}' to database {}".format(
                len(batch), set_id, DB.db_name
            )
        )
    prints("Session ID: {}".format(session_id))


if __name__ == "__main__":
    if len(sys.argv) < 3:
        prints(
            "Script requires two arguments, use like:",
            " > python db_in.py my_dataset my_examples.jsonl",
            exits=1,
            error=True,
        )
    db_in(sys.argv[1], sys.argv[2])


You can set the batch_size argument (default 1000) to control how many examples are added at a time. Save the script and run it like this:

python db_in.py cool_dataset cool_examples.jsonl

We’ve shut down our VM now, so I’m afraid I won’t be able to run it. But we ran a slightly less elegant db_in command which, as far as I can see, behaves the same. After calling set_hashes to add the _input_hash and _task_hash values, our final dataset was closer to 100GB, but it loaded successfully despite the size.
