Combining two separate datasets into a single trained model

mv3 · December 6, 2023, 6:39am

Hi, I've created two datasets, let's call them a (entity label: A) and b (entity label: B). I tried training a single model on both, but only one label gets annotated:

python -m prodigy train \
--ner a,b \
./output/c \
--label-stats

============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 1738 | Evaluation: 438 (20% split)
Training: 907 | Evaluation: 226
Labels: ner (2)
ℹ Pipeline: ['tok2vec', 'ner']
...
✔ Saved pipeline to output directory
output/c/model-last

=============================== NER (per type) ===============================

           P       R       F
A         94.56   95.36   94.96
B         0.0     0.0     0.0

I tried combining the two datasets first via python -m prodigy db-merge a,b c then training via

python -m prodigy train \
--ner c \
./output/c \
--label-stats

But I'm still only getting a single label. Inspecting the .jsonl file of the merged dataset c shows annotations for both labels, A and B.
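For anyone who wants to repeat that check, here is a minimal sketch in Python. It assumes the merged dataset was exported with python -m prodigy db-out c > c.jsonl (the file name c.jsonl is just an example) and that each record follows the usual Prodigy NER layout, with a "spans" list whose entries carry a "label" and an "answer" field on the record. The helper names count_span_labels and read_jsonl are made up for this illustration, and the inline sample is invented data, not taken from the real datasets.

```python
import json
from collections import Counter


def count_span_labels(records, answer="accept"):
    """Count span labels across annotation records that have the given answer."""
    counts = Counter()
    for record in records:
        # Skip rejected/ignored examples; records without an "answer" key are kept.
        if record.get("answer", answer) != answer:
            continue
        # "spans" may be missing or None when nothing was highlighted.
        for span in record.get("spans") or []:
            counts[span["label"]] += 1
    return counts


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny inline sample standing in for the exported file (invented data).
sample = [
    {"text": "foo bar", "spans": [{"start": 0, "end": 3, "label": "A"}], "answer": "accept"},
    {"text": "foo bar", "spans": [{"start": 4, "end": 7, "label": "B"}], "answer": "accept"},
    {"text": "baz", "spans": [], "answer": "reject"},
]
print(count_span_labels(sample))
```

On a real export, the call would be count_span_labels(read_jsonl("c.jsonl")). If both A and B show up with non-trivial counts there, the annotations themselves are intact and the problem is more likely in how the examples are combined for training.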

I'd appreciate any pointers in the right direction. Thanks!

koaning · December 6, 2023, 9:39am

Hi there!

I believe this is the same issue as described here:

The thread also has a solution. Let me know if that doesn't apply to your situation though!

mv3 · December 6, 2023, 5:24pm

Thanks for the quick reply. Yes, the (interim?) solution applies!
