I just started using Prodigy for my research and I love it. I am updating the NER component for some labels (mainly products) to better fit my dataset.
To do that, I extracted a few random sentences from my text and annotated them, just to check whether my workflow will ultimately work.
============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 12 | Evaluation: 3 (20% split)
Training: 12 | Evaluation: 3
Labels: ner (3)
ℹ Pipeline: ['transformer', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Frozen components: ['tagger', 'parser', 'attribute_ruler',
'lemmatizer']
ℹ Initial learn rate: 0.0
E # LOSS TRANS... LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------- -------- ------ ------ ------ ------
0 0 85.51 43.87 33.33 33.33 33.33 0.33
After that, nothing happens: the process doesn't stop, but there is also no progress, and my CPU is pretty much maxed out, even though I only have a few additional sentences in here. I'm running it on an M1 MacBook Pro, so it should work, shouldn't it?
This might actually be your machine running out of memory. For a transformer model like en_core_web_trf, I suggest using a GPU; that should speed things up considerably. You might also want to play around with your training.batch_size configuration, just in case.
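If you do have access to a CUDA GPU, you can select it with the --gpu-id flag on the train command. A minimal sketch, assuming your dataset is called product_ner and you want the model saved to ./output (both placeholders):

```
# train on GPU device 0; ./output and product_ner are placeholder names
prodigy train ./output --ner product_ner --base-model en_core_web_trf --gpu-id 0
```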
Hi, thank you for your reply! I believe GPU training does not work on an M1 Mac yet (since spaCy uses CUDA, which is focused on NVIDIA?)...
I tried modifying my config to add the batch size (I set it to 128, though I am not sure which values are reasonable to try here?).
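Concretely, I added something like this to the [training] block of my config.cfg (reconstructing from memory, so the exact line may have differed):

```
[training]
# my attempt at setting a fixed batch size
batch_size = 128
```

But that gave me this error message: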
✘ Config validation error
training -> batch_size extra fields not permitted
Any idea how I could proceed?
Since this is part of my PhD research, I would like to have the most accurate model possible, which is why I'm using en_core_web_trf as the base model.
If I use en_core_web_lg, the training does make progress, but I would really like to get en_core_web_trf to work.
Hmm, yeah, this usually means you need a separate GPU machine. Transformer models are large and may not be a good fit for a laptop (I've even struggled with a gaming laptop). If you can set one up through a cloud service (e.g., a virtual machine with a GPU on GCP/AWS/Azure), you should be able to train there.
I believe you should be able to do this via the Prodigy CLI itself. You can overwrite the values for training.batcher.size.start and training.batcher.size.stop. Something like this:
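```
# output path and dataset name are placeholders; adjust them to your own setup
prodigy train ./output --ner product_ner --base-model en_core_web_trf \
  --training.batcher.size.start 4 --training.batcher.size.stop 16
```

Smaller start/stop values mean smaller batches, which should reduce the memory pressure during training.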