ner.batch-train randomly terminating

So after building a collection of annotations in my database, I proceeded to run ner.batch-train in my PyCharm terminal, like so:

python -m prodigy ner.batch-train annotations en_core_web_lg --output addresses --n-iter 5 --label ADDR --batch-size 16 --eval-split 0.2

However, before the first or second iteration completes, the process suddenly terminates, without any exception or error message being displayed in the terminal. The terminal is simply ready to accept new input again, as if no process had ever been running.

Following advice from similar older threads, I tried running it with a batch size of 1. However, this isn't a sustainable solution: training slowed down by roughly a factor of 10, and a single iteration took more than an hour. (For the record, I canceled that run before completion, as it was taking too long and wasn't a prudent use of debugging time; notably, though, the process had not terminated on its own by the end of the first fold.)

Seeing as there are similar unresolved threads on this issue, I'm wondering:

  1. Is there a consensus on why this termination is happening? It seems to work fine if I push the code to my Mac.
  2. What exactly does batch-size do that leads you to believe it's the solution to this problem?
  3. How does one trigger the debugging output on Windows in the PyCharm terminal? Putting export PRODIGY_LOGGING=basic in my venv’s activate file did not seem to do anything (see the snippet after this list for exactly what I tried).
  4. Is there a fix that doesn’t induce an unreasonable performance hit?
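
For reference, this is the exact line I added to the venv’s activate script, along with what I believe are the Windows-shell equivalents of that export (these are just the standard ways of setting an environment variable in each shell; I haven’t confirmed that either of them actually switches on Prodigy’s logging for my runs):

export PRODIGY_LOGGING=basic

In PowerShell the equivalent would be:

$env:PRODIGY_LOGGING = "basic"

and in cmd.exe:

set PRODIGY_LOGGING=basic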

Any help is welcome.

My main suspicion is that this is an out-of-memory error that’s not being caught effectively, which is why setting the batch size to 1 is helpful. You can try a batch size of 4 or 5 to find a better compromise between memory usage and performance.
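
For example, something like this, keeping all the arguments from your original command and only lowering the --batch-size value:

python -m prodigy ner.batch-train annotations en_core_web_lg --output addresses --n-iter 5 --label ADDR --batch-size 4 --eval-split 0.2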

That said, the idea that it’s strictly a memory usage problem might not match up with all the reports, so it’s possible there’s a different bug. Unfortunately it’s a C-level error, which doesn’t raise a Python exception, so we don’t get a proper traceback to work with, and that makes debugging difficult.
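
As a hedged suggestion (this is nothing Prodigy-specific, just standard CPython): you could try re-running the command with Python’s built-in faulthandler enabled, which prints a traceback if the interpreter dies from a fatal signal such as a segfault. If nothing is printed and the process still vanishes, that would point towards a hard kill (e.g. the OS reclaiming memory), which is itself a useful data point.

python -X faulthandler -m prodigy ner.batch-train annotations en_core_web_lg --output addresses --n-iter 5 --label ADDR --batch-size 16 --eval-split 0.2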

Funnily enough, once I fixed the bug in the ner.teach pipeline leading up to batch training (see the issue here: https://support.prodi.gy/t/ner-teach-not-giving-relevant-entities-from-patterns-jsonl/838), ner.batch-train stopped terminating randomly as well. The two issues might be related.