hi @sigitpurnomo!
Sorry for the delayed response. We're trying to close out old issues.
Regarding comparing `prodigy train` and `spacy train`, I recommend anyone interested check out the Prodigy sample project. If you clone that repo, you can run two examples to compare `spacy train` and `prodigy train`.
Using the sample fashion data, you can run `spacy train` with `python -m spacy project run all`. This will load the data (`db-in`), export the data and config file (`data-to-spacy`), and run `spacy train` (the `train_spacy` command; see the `project.yml`).
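As a quick sanity check after the `db-in` step, you can query Prodigy's database from Python. Here's a minimal sketch using Prodigy's `connect` helper (the dataset names and counts are taken from the workflow output below):

```python
from prodigy.components.db import connect

# Connect to the Prodigy database (SQLite by default)
db = connect()

# Both datasets should be listed after `db-in` has run
print(db.datasets)

# The counts should match the import log: 1235 training, 500 eval
print(len(db.get_dataset("fashion_brands_training")))
print(len(db.get_dataset("fashion_brands_eval")))
```

Here's the full output of the `all` workflow: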
```
python -m spacy project run all
ℹ Running workflow 'all'
=================================== db-in ===================================
Running command: /opt/homebrew/opt/python@3.10/bin/python3.10 -m prodigy db-in fashion_brands_training assets/fashion_brands_training.jsonl
✔ Created dataset 'fashion_brands_training' in database SQLite
✔ Imported 1235 annotations to 'fashion_brands_training' (session
2023-02-03_15-10-32) in database SQLite
Found and keeping existing "answer" in 1235 examples
Running command: /opt/homebrew/opt/python@3.10/bin/python3.10 -m prodigy db-in fashion_brands_eval assets/fashion_brands_eval.jsonl
✔ Created dataset 'fashion_brands_eval' in database SQLite
✔ Imported 500 annotations to 'fashion_brands_eval' (session
2023-02-03_15-10-33) in database SQLite
Found and keeping existing "answer" in 500 examples
=============================== data-to-spacy ===============================
Running command: /opt/homebrew/opt/python@3.10/bin/python3.10 -m prodigy data-to-spacy corpus/ --ner fashion_brands_training,eval:fashion_brands_eval
ℹ Using language 'en'
============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 1235 | Evaluation: 500 (from datasets)
Training: 1235 | Evaluation: 500
Labels: ner (1)
✔ Saved 1235 training examples
corpus/train.spacy
✔ Saved 500 evaluation examples
corpus/dev.spacy
============================= Generating config =============================
ℹ Auto-generating config with spaCy
✔ Generated training config
======================== Generating cached label data ========================
✔ Saving label data for component 'ner'
corpus/labels/ner.json
============================= Finalizing export =============================
✔ Saved training config
corpus/config.cfg
To use this data for training with spaCy, you can run:
python -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
================================ train_spacy ================================
Running command: /opt/homebrew/opt/python@3.10/bin/python3.10 -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --gpu-id -1
ℹ Saving to output directory: training
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
=========================== Initializing pipeline ===========================
[2023-02-03 15:10:39,792] [INFO] Set up nlp object from config
[2023-02-03 15:10:39,799] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-02-03 15:10:39,801] [INFO] Created vocabulary
[2023-02-03 15:10:39,802] [INFO] Finished initializing nlp object
[2023-02-03 15:10:41,529] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.0
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ -------- ------ ------ ------ ------
0 0 0.00 46.17 0.00 0.00 0.00 0.00
0 200 10.44 14143.08 0.00 0.00 0.00 0.00
0 400 17.36 921.48 0.00 0.00 0.00 0.00
1 600 18.70 517.74 0.00 0.00 0.00 0.00
1 800 22.26 619.64 0.83 50.00 0.42 0.01
2 1000 26.61 656.45 4.84 60.00 2.52 0.05
3 1200 29.64 745.70 9.67 41.94 5.46 0.10
4 1400 37.73 754.50 20.98 47.76 13.45 0.21
6 1600 82.65 884.78 30.59 46.96 22.69 0.31
7 1800 391.86 984.87 36.60 49.64 28.99 0.37
9 2000 354.41 1072.19 39.60 48.19 33.61 0.40
12 2200 107.65 988.55 41.21 51.25 34.45 0.41
15 2400 138.04 1029.82 47.12 55.06 41.18 0.47
19 2600 149.17 955.62 50.24 57.61 44.54 0.50
22 2800 124.06 703.44 50.84 59.22 44.54 0.51
25 3000 121.32 583.64 53.72 62.57 47.06 0.54
29 3200 112.32 431.85 54.55 63.33 47.90 0.55
32 3400 115.82 384.64 55.77 65.17 48.74 0.56
35 3600 122.27 307.42 55.50 64.44 48.74 0.56
38 3800 124.70 295.25 57.84 69.41 49.58 0.58
42 4000 153.26 254.92 57.56 68.60 49.58 0.58
45 4200 183.82 225.83 57.63 68.00 50.00 0.58
48 4400 191.45 206.76 57.62 66.48 50.84 0.58
52 4600 183.82 170.08 57.42 66.67 50.42 0.57
55 4800 104.11 106.09 57.76 66.85 50.84 0.58
58 5000 132.83 96.88 57.97 68.18 50.42 0.58
62 5200 104.80 78.27 59.51 70.93 51.26 0.60
65 5400 94.62 77.89 59.66 71.35 51.26 0.60
68 5600 88.30 58.62 59.51 70.93 51.26 0.60
72 5800 91.84 43.24 60.00 71.51 51.68 0.60
75 6000 132.88 50.87 59.86 68.85 52.94 0.60
78 6200 77.27 42.82 60.59 73.21 51.68 0.61
82 6400 73.68 33.23 60.78 72.94 52.10 0.61
85 6600 79.77 29.21 61.65 72.99 53.36 0.62
88 6800 125.10 44.11 61.69 72.32 53.78 0.62
91 7000 62.31 29.18 61.95 73.84 53.36 0.62
95 7200 44.03 19.51 61.99 73.14 53.78 0.62
98 7400 46.05 15.76 60.98 72.67 52.52 0.61
101 7600 43.38 10.81 62.20 72.22 54.62 0.62
105 7800 25.63 10.48 58.65 72.67 49.16 0.59
108 8000 92.39 25.84 62.35 72.63 54.62 0.62
111 8200 27.62 9.18 62.65 73.45 54.62 0.63
115 8400 40.35 11.85 62.14 73.56 53.78 0.62
118 8600 24.75 8.94 62.05 71.82 54.62 0.62
121 8800 32.70 10.96 61.72 71.67 54.20 0.62
125 9000 23.91 7.12 61.24 71.11 53.78 0.61
128 9200 31.73 10.01 61.24 71.11 53.78 0.61
131 9400 65.21 20.19 61.72 71.67 54.20 0.62
134 9600 11.40 3.41 61.54 71.91 53.78 0.62
138 9800 21.41 6.48 61.69 72.32 53.78 0.62
Epoch 139:   0%|          | 0/200 [00:00<?, ?it/s]
✔ Saved pipeline to output directory
training/model-last
```
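Once the run finishes, you can try the trained pipeline directly. A minimal sketch, assuming you load the `model-best` checkpoint that `spacy train` saves alongside `model-last` (the example sentence is made up for illustration):

```python
import spacy

# Load the best-scoring checkpoint from the training run above
nlp = spacy.load("training/model-best")

# Made-up example text; the project trains a single NER label
doc = nlp("I really love my new Balenciaga bag.")
print([(ent.text, ent.label_) for ent in doc.ents])
```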
Alternatively, you can run `prodigy train` on the same data by running the `all_prodigy` workflow:
```
$ python3 -m spacy project run all_prodigy
ℹ Running workflow 'all_prodigy'
=================================== db-in ===================================
Running command: /opt/homebrew/opt/python@3.10/bin/python3.10 -m prodigy db-in fashion_brands_training assets/fashion_brands_training.jsonl
✔ Imported 1235 annotations to 'fashion_brands_training' (session
2023-02-03_15-19-02) in database SQLite
Found and keeping existing "answer" in 1235 examples
Running command: /opt/homebrew/opt/python@3.10/bin/python3.10 -m prodigy db-in fashion_brands_eval assets/fashion_brands_eval.jsonl
✔ Imported 500 annotations to 'fashion_brands_eval' (session
2023-02-03_15-19-04) in database SQLite
Found and keeping existing "answer" in 500 examples
=============================== train_prodigy ===============================
Running command: /opt/homebrew/opt/python@3.10/bin/python3.10 -m prodigy train training/ --ner fashion_brands_training,eval:fashion_brands_eval --config configs/config.cfg --gpu-id -1
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
========================= Generating Prodigy config =========================
✔ Generated training config
=========================== Initializing pipeline ===========================
[2023-02-03 15:19:05,519] [INFO] Set up nlp object from config
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 2470 | Evaluation: 1000 (from datasets)
Training: 1235 | Evaluation: 500
Labels: ner (1)
[2023-02-03 15:19:05,818] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-02-03 15:19:05,820] [INFO] Created vocabulary
[2023-02-03 15:19:05,821] [INFO] Finished initializing nlp object
[2023-02-03 15:19:06,960] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 2470 | Evaluation: 1000 (from datasets)
Training: 1235 | Evaluation: 500
Labels: ner (1)
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.0
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ -------- ------ ------ ------ ------
0 0 0.00 46.17 0.00 0.00 0.00 0.00
0 200 10.44 14143.08 0.00 0.00 0.00 0.00
0 400 17.36 921.48 0.00 0.00 0.00 0.00
1 600 18.70 517.74 0.00 0.00 0.00 0.00
1 800 22.26 619.64 0.83 50.00 0.42 0.01
2 1000 26.61 656.45 4.84 60.00 2.52 0.05
3 1200 29.64 745.70 9.67 41.94 5.46 0.10
4 1400 37.73 754.50 20.98 47.76 13.45 0.21
6 1600 82.65 884.78 30.59 46.96 22.69 0.31
7 1800 391.86 984.87 36.60 49.64 28.99 0.37
9 2000 354.41 1072.19 39.60 48.19 33.61 0.40
12 2200 107.65 988.55 41.21 51.25 34.45 0.41
15 2400 138.04 1029.82 47.12 55.06 41.18 0.47
19 2600 149.17 955.62 50.24 57.61 44.54 0.50
22 2800 124.06 703.44 50.84 59.22 44.54 0.51
25 3000 121.32 583.64 53.72 62.57 47.06 0.54
29 3200 112.32 431.85 54.55 63.33 47.90 0.55
32 3400 115.82 384.64 55.77 65.17 48.74 0.56
35 3600 122.27 307.42 55.50 64.44 48.74 0.56
38 3800 124.70 295.25 57.84 69.41 49.58 0.58
42 4000 153.26 254.92 57.56 68.60 49.58 0.58
45 4200 183.82 225.83 57.63 68.00 50.00 0.58
48 4400 191.45 206.76 57.62 66.48 50.84 0.58
52 4600 183.82 170.08 57.42 66.67 50.42 0.57
55 4800 104.11 106.09 57.76 66.85 50.84 0.58
58 5000 132.83 96.88 57.97 68.18 50.42 0.58
62 5200 104.80 78.27 59.51 70.93 51.26 0.60
65 5400 94.62 77.89 59.66 71.35 51.26 0.60
68 5600 88.30 58.62 59.51 70.93 51.26 0.60
72 5800 91.84 43.24 60.00 71.51 51.68 0.60
75 6000 132.88 50.87 59.86 68.85 52.94 0.60
78 6200 77.27 42.82 60.59 73.21 51.68 0.61
82 6400 73.68 33.23 60.78 72.94 52.10 0.61
85 6600 79.77 29.21 61.65 72.99 53.36 0.62
88 6800 125.10 44.11 61.69 72.32 53.78 0.62
91 7000 62.31 29.18 61.95 73.84 53.36 0.62
95 7200 44.03 19.51 61.99 73.14 53.78 0.62
98 7400 46.05 15.76 60.98 72.67 52.52 0.61
101 7600 43.38 10.81 62.20 72.22 54.62 0.62
105 7800 25.63 10.48 58.65 72.67 49.16 0.59
108 8000 92.39 25.84 62.35 72.63 54.62 0.62
111 8200 27.62 9.18 62.65 73.45 54.62 0.63
115 8400 40.35 11.85 62.14 73.56 53.78 0.62
118 8600 24.75 8.94 62.05 71.82 54.62 0.62
121 8800 32.70 10.96 61.72 71.67 54.20 0.62
125 9000 23.91 7.12 61.24 71.11 53.78 0.61
128 9200 31.73 10.01 61.24 71.11 53.78 0.61
131 9400 65.21 20.19 61.72 71.67 54.20 0.62
134 9600 11.40 3.41 61.54 71.91 53.78 0.62
138 9800 21.41 6.48 61.69 72.32 53.78 0.62
✔ Saved pipeline to output directory
training/model-last
```
From these two examples, you should get the same results!
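If you want to check that yourself, one option is to evaluate the saved pipeline against the held-out data that `data-to-spacy` exported to `corpus/dev.spacy`. A minimal sketch using spaCy's `DocBin` and `Language.evaluate` (both workflows write to `training/`, so this evaluates whichever run you did last):

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.load("training/model-best")

# Load the gold-standard evaluation docs exported by data-to-spacy
doc_bin = DocBin().from_disk("corpus/dev.spacy")

# Pair an unannotated copy of each text with its gold reference;
# nlp.evaluate() runs the pipeline over the predicted side itself
examples = [
    Example(nlp.make_doc(gold.text), gold)
    for gold in doc_bin.get_docs(nlp.vocab)
]

scores = nlp.evaluate(examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
```

The `ents_f` you get here should line up with the final ENTS_F column in the tables above.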
Here are the versions I used:
```
$ python -m prodigy stats

============================== ✨ Prodigy Stats ==============================

Version          1.11.10
Location         /opt/homebrew/lib/python3.10/site-packages/prodigy
Platform         macOS-13.0.1-arm64-arm-64bit
Python Version   3.10.8
```

```
$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version    3.5.0
Location         /opt/homebrew/lib/python3.10/site-packages/spacy
Platform         macOS-13.0.1-arm64-arm-64bit
Python version   3.10.8
```