Prodigy NER train recipe getting killed by OOM

Hi,

We have a fairly decent NER corpus of 6,000 text files, with about 20 labels. For historical reasons, the corpus was stored in brat format, i.e. one .txt file plus one .ann file grouping all annotations of the 20 types. We were manually converting them (using custom Python) to Gold objects and training directly against the spaCy API, achieving north of an 85 F-score, which is promising. We were using some variant of ner.correct to create the corpus.

Now, we are pretty sure one of our main issues is tagging consistency. With 20 labels to annotate on each sample, taggers (myself included) can do their best, but we are bound to apply our tagging policy inconsistently: too much cognitive load remembering 20 specific policies at once.

So we thought about going down a different path: we split our brat format files into Prodigy ner_manual JSONL format, one JSONL file per label. That gives a total of about 20 JSONL files with about 6,000 lines/tasks each. Each file contains pretty much the same texts; only different labels are annotated (no overlap, of course).

Using that, we can ner.teach and ner.correct and train label by label, which works great.

But! When we try to train the 20 Prodigy datasets in a single recipe, we consistently get killed by the Linux OOM killer before the first training loop starts, at around 10 GB used.

Trying to db-merge or data-to-spacy all sets into one fails the same way.

Now, of course, I could climb to 16, 32 or 64 GB, but that does not seem scalable, knowing also that this is only with today's corpus.

So I wanted some feedback on how to do things properly. I'm thinking that if this does not scale, maybe I'm doing it wrong.

Should I do something to ease the load on Prodigy when "merging" all the datasets?
Maybe something with task_ids?

What workflow could work for us to grow / refine our model, label by label, using the wealth of Prodigy annotations and recipes (e.g. binary or ner_manual), while still being able to train a single model?

Thanks!

Hi! Thanks for the detailed report :+1:

Since the OOM problem also occurs in db-merge, which doesn't do anything sophisticated and only loads the datasets, I suspect the problem must happen when Prodigy loads the examples from the database and keeps them in memory. Here's a very simple standalone script you can try:

from prodigy.components.db import connect

db = connect()  # uses the DB settings from your prodigy.json
all_examples = []
for dataset_name in ["your_dataset1", "your_dataset2"]:  # etc.
    # Load the full dataset into memory, just like db-merge does
    all_examples.extend(db.get_dataset(dataset_name))

How large do you think the JSON data in your datasets would be? Even with 6000 text files, it's unlikely to be gigabytes, right? If it does turn out that your datasets don't fit into memory, that's indeed a bit tricky :thinking:

While the db-merge recipe just loads all datasets into memory and concatenates them, the data-to-spacy and train recipes will group the annotations by hashes and then create one Doc object per example with all available annotations (e.g. entities with different labels etc.). Prodigy then keeps a second copy of the data by hash, and uses that to merge the examples. So one option would be to just run the data-to-spacy conversion in batches and batch up your examples by _input_hash, so there are no examples with the same input hash across different batches. This should give you the same merged result and means you have to keep less data in memory.
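To make the batching idea concrete, here's a minimal sketch in plain Python (function and variable names are hypothetical, and the toy tasks only stand in for real exported annotations). It buckets tasks on `_input_hash` modulo the number of batches, which guarantees that all annotations for the same text land in the same batch:

```python
from collections import defaultdict

def batch_by_input_hash(examples, n_batches=4):
    """Partition tasks so that every example sharing an _input_hash
    ends up in the same batch: bucket on the hash modulo n_batches."""
    buckets = defaultdict(list)
    for eg in examples:
        buckets[eg["_input_hash"] % n_batches].append(eg)
    return [buckets[i] for i in range(n_batches)]

# Toy tasks standing in for exported Prodigy annotations
examples = [
    {"_input_hash": 101, "label": "ORG"},
    {"_input_hash": 102, "label": "ORG"},
    {"_input_hash": 101, "label": "PERSON"},  # same text as the first task
]
batches = batch_by_input_hash(examples, n_batches=2)
```

You could then run the conversion on each batch separately and concatenate the results, since no input hash spans two batches.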

Thanks for the pointers.

I did a small session with your sample code, printing the Python process's RSS size after each db.get_dataset() call (using psutil).

Dataset 1 loaded => 1.2 GB
Dataset 2 loaded => 2.3 GB
Dataset 3 loaded => 3.4 GB
Dataset 4 loaded => 4.5 GB
Dataset 5 loaded => 5.5 GB

So we get about 1 GB of RSS growth per dataset, while the original JSON files amount to ~120 MB per dataset. This seems a little unreasonable (8x growth over the text data), but there's nothing I can do about it, or so it seems.

So I need to switch strategies. It seems I can go in two directions...

  1. keep separate files for separate labels, and batch-merge them by _input_hash
  2. go back to producing "heavy" files (all annotations for a single text in one place)

I have an issue with solution 1, because it is my understanding that the training process needs all texts annotated for all entity types to work. I cannot have one set of texts annotated for type ORG and another set annotated for type PERSON, and reliably train a single model on both annotations at the same time (or can I?).
In the long run, it strikes me as difficult to keep different files for different entities while making sure that the sets of annotated texts in all these files stay identical.

So I'd go with solution number two: keep a "gold"(-ish) corpus. Each text is present a single time, and all annotations for this text come along with it, which solves the out-of-memory issue.

Which would lead me to ask other questions, but those would be off-topic for this thread.

Please feel free to share your thoughts if my understanding is not up to scratch.
Thanks again.

Thanks for checking! 8x does seem pretty surprising... :thinking: The data size itself will definitely grow a bit, especially if you're annotating manually, because Prodigy will store the tokenization with the examples (plus a bit of metadata). So you may end up with ~2.5 times more JSON overall.

Anyway, this at least gives us something to investigate and I'll report back!

For training, you definitely want to have all merged annotations available. But once they're merged, the final dataset should be much smaller, because you'll only have one copy of the text and tokens per example (i.e. per input hash). So if you can find a more efficient way to merge the data in batches (for instance, all examples of a certain range of input hashes first), you can still keep multiple copies of the same example in your database but won't have to load everything into memory at the same time.
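As a rough sketch of what that per-batch merge could look like once a range of examples is loaded (plain Python, hypothetical function name; the real merging in train and data-to-spacy is more involved, this just combines entity spans per input hash):

```python
def merge_by_input_hash(examples):
    """Collapse multiple annotated copies of the same text (same
    _input_hash) into a single task with the combined entity spans."""
    merged = {}
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            # First copy: keep it, with its own list of spans
            merged[key] = dict(eg, spans=list(eg.get("spans", [])))
        else:
            # Later copies: only take over their spans
            merged[key]["spans"].extend(eg.get("spans", []))
    return list(merged.values())

# Two copies of the same text, annotated for different labels
batch = [
    {"_input_hash": 7, "text": "Apple hired Jane.",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"_input_hash": 7, "text": "Apple hired Jane.",
     "spans": [{"start": 12, "end": 16, "label": "PERSON"}]},
]
merged = merge_by_input_hash(batch)
```

After merging, each text and its tokens exist only once, which is where the memory savings come from.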

In general, we do want to encourage workflows where you focus on a subset of labels at a time, because it makes it much easier to iterate. Using the input hashes, you'll always be able to identify what belongs together. (This is actually a big part of what train and data-to-spacy do: both commands merge and combine annotations on the same input data so you end up with one gold-standard example containing all annotations/labels.)

Btw, another Database method that could come in handy: db.get_examples lets you query by task or input hashes. You can also access the SQLite database directly if you want and fetch the JSONL from there.

Thanks.

Going directly to the DB seems pretty efficient for exporting, and I will look into merging by input hash in ranges. It should work out fine.