Hi! Thanks for the detailed report!

Since the OOM problem also occurs in `db-merge`, which doesn't do anything sophisticated and only loads the datasets, I suspect the problem must happen when Prodigy loads the examples from the database and keeps them in memory. Here's a very simple standalone script you can try:
from prodigy.components.db import connect
db = connect()
all_examples = []
for dataset_name in ["your_dataset1", "your_dataset2"]:  # etc.
    all_examples.extend(db.get_dataset(dataset_name))
How large do you think the JSON data in your datasets would be? Even with 6000 text files, it's unlikely to be gigabytes, right? If it does turn out that your datasets don't fit into memory, that's indeed a bit tricky.
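If you're not sure, one rough way to estimate it is to serialize the examples loaded by the script above and measure the size of the result. It only approximates the in-memory footprint, but it gives you an order of magnitude:

```python
import json

# Rough size estimate of the raw JSON data, using the
# all_examples list built by the script above
n_bytes = len(json.dumps(all_examples).encode("utf8"))
print(f"~{n_bytes / 1024 ** 2:.1f} MB of JSON data")
```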
While the `db-merge` recipe just loads all datasets into memory and concatenates them, the `data-to-spacy` and `train` recipes will group the annotations by hashes and then create one `Doc` object per example with all available annotations (e.g. entities with different labels etc.). Prodigy then keeps a second copy of the data by hash, and uses that to merge the examples. So one option would be to just run the `data-to-spacy` conversion in batches and batch up your examples by `_input_hash`, so there are no examples with the same input hash across different batches. This should give you the same merged result and means you have to keep less data in memory.