Memory issue during training for text classification

================================ ✨  Datasets ================================

iso_eval, iso_train

================================ ✨  Sessions ================================

2022-01-03_15-55-00, 2022-01-03_15-57-01, 2022-01-03_15-57-16,
2022-01-05_09-18-46, 2022-01-05_09-35-16, 2022-01-05_09-46-06,
2022-01-05_13-07-38, 2022-01-06_17-17-19, 2022-01-06_17-18-04,
2022-01-06_18-59-43, 2022-01-06_19-44-48, 2022-01-06_20-24-34

============================== ✨  Dataset Stats ==============================

Dataset       iso_train
Created       2022-01-06 20:24:34
Description   None
Author        None
Annotations   8334
Accept        121
Reject        8213
Ignore        0

Hello Everyone,

I recently started working with Prodigy and have not been able to figure out three things. I am looking forward to some help or pointers (in case I missed some core concepts).

  1. As the snippets above show, I have 12 sessions that were created while I was learning the tool. Are they kept in a cache, or should I close a session after use? I am not sure whether they consume any hardware resources, and I never intend to reuse them.

  2. I intend to perform text classification on an annotated dataset, 'iso_train', which has 8213 'reject' answers. How should I interpret the stats shown in the snippet above? I find them confusing in light of the following explanation on the official documentation page:

The REJECT button is less relevant here, because there’s nothing to say no to – however, you can use it to reject examples that have actual problems that need fixing, like noisy preprocessing artifacts, HTML markup that wasn’t cleaned properly, texts in other languages, and so on.

  3. The training uses all available memory, and I do not have a GPU. Is it possible to perform text classification with 8K annotations on a CPU with 16 GB of memory? If yes, how should I go about this task?

Thanks a lot for your effort and time!


Hi @hsekol-hub

There's no need to close a session. What you see are "session datasets," they are created in the background every time you start to annotate.

The quoted REJECT explanation only applies to tasks with multiple categories / multiple choice options. Your case looks like a binary classification task, so it's totally fine. You can interpret REJECT as "No, this particular example doesn't belong to that particular class."
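To make the binary reading concrete, here is a minimal sketch that treats "accept" as the positive class and "reject" as the negative class, and works out the class balance from the stats output above. The helper names `answer_counts` and `positive_fraction` are made up for illustration and are not part of Prodigy; the only assumption about the data is that each annotated example is a dict with an "answer" key.

```python
from collections import Counter


def answer_counts(examples):
    """Count the "answer" values in an iterable of annotation dicts."""
    return Counter(eg.get("answer", "missing") for eg in examples)


def positive_fraction(counts):
    """Share of accepted examples among accept + reject (ignored ones excluded)."""
    decided = counts["accept"] + counts["reject"]
    return counts["accept"] / decided if decided else 0.0


# Counts copied from the stats output for iso_train shown above.
counts = Counter({"accept": 121, "reject": 8213, "ignore": 0})
print(f"positive class share: {positive_fraction(counts):.2%}")

# The same helpers work on a list of annotation dicts (toy data, for illustration).
toy = [{"text": "a", "answer": "accept"}, {"text": "b", "answer": "reject"}]
print(answer_counts(toy))
```

With 121 accepts out of 8334 decided examples, the positive class is a small minority, which is worth keeping in mind when reading the evaluation scores.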

If you're running out of memory, it may be better to try running the training on a cloud VM with more memory, something like 32 GB. It's possible to train without a GPU, but note that training may be CPU- and memory-intensive.

Hi @ljvmiranda921,

Thanks a lot for the answers.
Regarding the last question, is it also possible to reduce the batch size or the embedding dimensions so that training can run?

I tried the following command, but unfortunately it did not work. Is there a mistake in it?

prodigy train ./new_models -tcm iso_train --training.batcher.size.start=8 --training.batcher.size.stop=16

Additional information: training works if I use the evaluation dataset 'iso_eval' as the training set with the default batch size.
What I do not understand is which part of the background logic consumes the memory.

This sounds like your training dataset is just too large to fit into memory (the smaller evaluation set does fit, which is why that run works). So if you can use a machine with slightly more memory, this is probably the easiest solution.

You could also run some profiling in the background to see where the memory is consumed and to give you a better idea of how much you'll need.
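As a rough illustration of what such profiling could look like, here is a minimal sketch using Python's standard-library `tracemalloc` module to record the peak memory allocated while a function runs. The wrapper `peak_memory_mib` and the stand-in workload `load_texts` are hypothetical names invented for this example; in practice you would wrap your own data-loading or training step. Note that `tracemalloc` only sees allocations made through Python's memory allocator, so memory used inside native extensions may be under-reported.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def load_texts(n):
    """Hypothetical stand-in for loading n training texts into memory."""
    return [f"example text number {i} " * 10 for i in range(n)]


texts, peak = peak_memory_mib(load_texts, 10_000)
print(f"loaded {len(texts)} texts, peak traced memory: {peak:.1f} MiB")
```

Running the same measurement on a few increasing input sizes gives a rough idea of how memory grows with the dataset, which helps estimate how much a full run would need.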