Combining two separate datasets into a single trained model

Hi, I've created two datasets, let's call them a (entity label: A) and b (entity label: B). I tried training a single model on both, but only one of the two labels gets predicted:

python -m prodigy train \
--ner a,b \
./output/c \
--label-stats

============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1738 | Evaluation: 438 (20% split)
Training: 907 | Evaluation: 226
Labels: ner (2)
ℹ Pipeline: ['tok2vec', 'ner']
...
✔ Saved pipeline to output directory
output/c/model-last

=============================== NER (per type) ===============================

           P       R       F
A         94.56   95.36   94.96
B         0.0     0.0     0.0

I also tried merging the two datasets first via python -m prodigy db-merge a,b c and then training on the merged dataset via

python -m prodigy train \
--ner c \
./output/c \
--label-stats

But I'm still only getting a single label. Inspecting the .jsonl file of the merged dataset c shows annotations for both labels, A and B.
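For anyone who wants to repeat that check, here is a minimal sketch of how the labels in an exported dataset could be counted. It assumes the usual Prodigy JSONL task layout (a "spans" list whose entries carry a "label", plus an "answer" field) and that only accepted examples matter for training; the helper name count_span_labels and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_span_labels(lines):
    """Count span labels over accepted examples in Prodigy-style JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        # Assumption: examples without an "answer" field count as accepted.
        if example.get("answer", "accept") != "accept":
            continue
        # "spans" may be missing or null on examples without entities.
        for span in example.get("spans") or []:
            counts[span["label"]] += 1
    return counts


# Made-up sample records standing in for an exported dataset.
sample = [
    '{"text": "foo bar", "spans": [{"start": 0, "end": 3, "label": "A"}], "answer": "accept"}',
    '{"text": "baz qux", "spans": [{"start": 4, "end": 7, "label": "B"}], "answer": "accept"}',
    '{"text": "skip me", "spans": [{"start": 0, "end": 4, "label": "B"}], "answer": "reject"}',
]
counts = count_span_labels(sample)
print(dict(counts))
```

To run it on real data, export the dataset to a file (for example with db-out) and pass the open file handle to count_span_labels instead of the sample list. If both A and B show up with sensible counts, the merged data itself is fine and the problem lies in the training step.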

I'd appreciate any pointers in the right direction. Thanks!

Hi there!

I believe this is the same issue as described here:

The thread also has a solution. Let me know if that doesn't apply to your situation though!

Thanks for the quick reply. Yes, the (interim?) solution applies!
