Prodigy NER train recipe getting killed by OOM

Hi! Thanks for the detailed report :+1:

Since the OOM problem also occurs in db-merge, which doesn't do anything sophisticated and only loads the datasets, I suspect the problem must happen when Prodigy loads the examples from the database and keeps them in memory. Here's a very simple standalone script you can try:

from prodigy.components.db import connect

# Connect to the database Prodigy is configured to use
db = connect()
all_examples = []
for dataset_name in ["your_dataset1", "your_dataset2"]:  # etc.
    # Load all annotated examples of this dataset into memory
    all_examples.extend(db.get_dataset(dataset_name))

print(f"Loaded {len(all_examples)} examples")

How large do you think the JSON data in your datasets would be? Even with 6000 text files, it's unlikely to add up to gigabytes, right? If it does turn out that your datasets don't fit into memory, that's indeed a bit tricky :thinking:
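If it helps, here's a rough sketch for estimating the size of the loaded data, building on the all_examples list from the script above. It measures the serialized JSON, so it's only an approximation of the actual in-memory footprint:

import json

# Total size of the examples when serialized to JSON (rough estimate)
total_bytes = sum(len(json.dumps(eg)) for eg in all_examples)
print(f"~{total_bytes / 1024 ** 2:.1f} MB of JSON across {len(all_examples)} examples")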

While the db-merge recipe just loads all datasets into memory and concatenates them, the data-to-spacy and train recipes group the annotations by hash and then create one Doc object per example with all available annotations (e.g. entities with different labels etc.). Prodigy also keeps a second copy of the data indexed by hash and uses that to merge the examples, which is why those recipes need noticeably more memory than db-merge.

So one option would be to run the data-to-spacy conversion in batches and group your examples by _input_hash, so that no examples with the same input hash end up in different batches. This should give you the same merged result while requiring less data to be kept in memory at once.
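Just to illustrate the idea, here's a minimal sketch of how that batching could look, building on the loading script above. The batch size and the batch_* dataset names are just placeholders – you'd then run data-to-spacy on each batch dataset separately:

from collections import defaultdict

from prodigy.components.db import connect

db = connect()

# Group all examples by their input hash, so annotations on the same
# input text always end up in the same batch and can be merged correctly
by_input = defaultdict(list)
for dataset_name in ["your_dataset1", "your_dataset2"]:  # etc.
    for eg in db.get_dataset(dataset_name):
        by_input[eg["_input_hash"]].append(eg)

# Split the groups into batches of e.g. 500 unique inputs and save each
# batch to its own dataset, so it can be converted with data-to-spacy
BATCH_SIZE = 500
groups = list(by_input.values())
for i in range(0, len(groups), BATCH_SIZE):
    batch = [eg for group in groups[i:i + BATCH_SIZE] for eg in group]
    batch_name = f"batch_{i // BATCH_SIZE}"
    db.add_dataset(batch_name)
    db.add_examples(batch, datasets=[batch_name])

Note that this still loads the raw examples once to do the grouping, but the part that creates and merges the Doc objects should then only happen for one batch at a time.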