We have a fairly decent NER corpus of 6000 text files, with about 20 labels. For historical reasons, the corpus was stored in brat format, i.e. one .txt file and one .ann file regrouping all annotations of the 20 types. We were manually converting them (using custom Python) to Gold objects and training directly against the spaCy API, achieving north of 85 F-score, which is promising. We had been using some variant of ner.correct to create the corpus.
Now, we are pretty sure one of our main issues is tagging consistency. With 20 labels to annotate on each sample, taggers (myself included) can do their best, but we are bound to apply our tagging policy inconsistently: too much cognitive load remembering 20 specific policies at once.
So we thought about going down a different path: we split our brat files into Prodigy ner_manual JSONL format, one JSONL file per label. That gives a total of (about) 20 JSONL files with (about) 6000 lines/tasks each. Each file contains pretty much the same texts; only the annotated labels differ (no overlap, of course).
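Concretely, the split script does something like this (a simplified sketch; the file naming and the fact that every text appears in every file, even with zero spans, mirror what we do, but field names beyond text/spans/start/end/label are up to us):

```python
import json
from pathlib import Path

def split_by_label(docs, labels, out_dir):
    """Split multi-label docs into one ner_manual-style JSONL file per label.

    docs: iterable of (text, entities) tuples, entities = [(start, end, label)].
    Every text is emitted into every label's file, with only that
    label's spans annotated (possibly none).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {
        lb: (out_dir / f"{lb.lower()}.jsonl").open("w", encoding="utf-8")
        for lb in labels
    }
    for text, entities in docs:
        for lb, fh in handles.items():
            spans = [
                {"start": s, "end": e, "label": l}
                for s, e, l in entities
                if l == lb
            ]
            fh.write(json.dumps({"text": text, "spans": spans}, ensure_ascii=False) + "\n")
    for fh in handles.values():
        fh.close()
```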
Using that, we can train label by label, which works great.
But! When we try to train on the 20 Prodigy datasets in a single recipe, we consistently get killed by the Linux OOM killer before the first training loop starts, at around 10 GB used.
Running data-to-spacy with all the sets merged into one fails the same way.
Now, of course, I could climb to 16, 32 or 64 GB, but that does not seem scalable, knowing also that this is only with today's corpus.
So I wanted some feedback on how to do things properly. I'm thinking that if this does not scale, maybe I'm doing it wrong.
Should I do something to ease the load on Prodigy when "merging" all the datasets? Maybe something with task_ids?
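To make the question concrete, here is what I naively imagine the merge could look like before training: one task per unique text, with the spans of all 20 per-label files folded back together (the md5-of-text key is just my stand-in for whatever Prodigy uses internally as an input hash):

```python
import hashlib
import json

def merge_jsonl_files(paths):
    """Merge per-label ner_manual JSONL files into one multi-label task per text.

    Tasks are keyed on a hash of the text, and their spans are
    concatenated and sorted by offset.
    """
    merged = {}
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                task = json.loads(line)
                key = hashlib.md5(task["text"].encode("utf-8")).hexdigest()
                if key not in merged:
                    merged[key] = {"text": task["text"], "spans": []}
                merged[key]["spans"].extend(task.get("spans", []))
    for task in merged.values():
        task["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())
```

But maybe this kind of merging is exactly what the training recipe already does, and the memory blow-up comes from somewhere else.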
What workflow would let us grow and refine our model, label by label, using the wealth of Prodigy annotations and recipes (e.g. binary or ner.manual), while still being able to train a single model?