Hi! Thanks for the detailed report!

Since the OOM problem also occurs in `db-merge`, which doesn't do anything sophisticated and only loads the datasets, I suspect the problem must happen when Prodigy loads the examples from the database and keeps them in memory. Here's a very simple standalone script you can try:
from prodigy.components.db import connect
db = connect()
all_examples = []
for dataset_name in ["your_dataset1", "your_dataset2"]:  # etc.
    all_examples.extend(db.get_dataset(dataset_name))
How large do you think the JSON data in your datasets would be? Even with 6000 text files, it's unlikely to be gigabytes, right? If it does turn out that your datasets don't fit into memory, that's indeed a bit tricky.
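If you're not sure, one rough way to estimate it is to serialize the examples loaded by the script above and measure the size of the result. It only approximates the in-memory footprint, but it gives you an order of magnitude:

```python
import json

# Rough size estimate of the raw JSON data, using the
# all_examples list built by the script above
n_bytes = len(json.dumps(all_examples).encode("utf8"))
print(f"~{n_bytes / 1024 ** 2:.1f} MB of JSON data")
```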
While the `db-merge` recipe just loads all datasets into memory and concatenates them, the `data-to-spacy` and `train` recipes will group the annotations by hashes and then create one `Doc` object per example with all available annotations (e.g. entities with different labels etc.). Prodigy then keeps a second copy of the data by hash, and uses that to merge the examples. So one option would be to just run the `data-to-spacy` conversion in batches and batch up your examples by `_input_hash`, so there are no examples with the same input hash across different batches. This should give you the same merged result and means you have to keep less data in memory.