Hi Michael,
I'll try to reproduce the steps here. Here's an examples.jsonl file:
{"text": "this text is very positive"}
{"text": "this text is very negative"}
{"text": "boo and hiss!"}
{"text": "yay and hurrah!"}
{"text": "i am neutral"}
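If it helps to recreate the same file on your side, here is a minimal sketch that writes the five examples above and reads them back to confirm that every line parses as JSON and has a "text" key. The helper names write_jsonl and read_jsonl are my own for illustration and are not part of Prodigy.

```python
import json
from pathlib import Path

# The five examples from the file above.
EXAMPLES = [
    {"text": "this text is very positive"},
    {"text": "this text is very negative"},
    {"text": "boo and hiss!"},
    {"text": "yay and hurrah!"},
    {"text": "i am neutral"},
]


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper)."""
    with Path(path).open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back, skipping blank lines (hypothetical helper)."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    write_jsonl("examples.jsonl", EXAMPLES)
    records = read_jsonl("examples.jsonl")
    # Every record needs a "text" key for text classification annotation.
    assert all("text" in record for record in records)
    print(f"Wrote and validated {len(records)} examples")
```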
These are annotated via:
python -m prodigy textcat.manual issue-5962 examples.jsonl --label pos,neg,neu
Next, I run prodigy train on this data:
python -m prodigy train --textcat-multilabel issue-5962 --base-model en_core_web_md
Then I see this intermediate output appear:
ℹ Using CPU
========================= Generating Prodigy config =========================
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
ℹ Auto-generating config with spaCy
config={'paths': {'train': None, 'dev': None, 'vectors': None, 'init_tok2vec': None}, 'system': {'gpu_allocator': None, 'seed': 0}, 'nlp': {'lang': 'en', 'pipeline': ['textcat_multilabel'], 'batch_size': 1000, 'disabled': [], 'before_creation': None, 'after_creation': None, 'after_pipeline_creation': None, 'tokenizer': {'@tokenizers': 'spacy.Tokenizer.v1'}}, 'components': {'textcat_multilabel': {'factory': 'textcat_multilabel', 'model': {'@architectures': 'spacy.TextCatBOW.v2', 'exclusive_classes': False, 'ngram_size': 1, 'no_output_layer': False, 'nO': None}, 'scorer': {'@scorers': 'spacy.textcat_multilabel_scorer.v1'}, 'threshold': 0.5}}, 'corpora': {'dev': {'@readers': 'spacy.Corpus.v1', 'path': '${paths.dev}', 'max_length': 0, 'gold_preproc': False, 'limit': 0, 'augmenter': None}, 'train': {'@readers': 'spacy.Corpus.v1', 'path': '${paths.train}', 'max_length': 0, 'gold_preproc': False, 'limit': 0, 'augmenter': None}}, 'training': {'dev_corpus': 'corpora.dev', 'train_corpus': 'corpora.train', 'batcher': {'@batchers': 'spacy.batch_by_words.v1', 'discard_oversize': False, 'tolerance': 0.2, 'size': {'@schedules': 'compounding.v1', 'start': 100, 'stop': 1000, 'compound': 1.001, 't': 0.0}, 'get_length': None}, 'optimizer': {'@optimizers': 'Adam.v1', 'beta1': 0.9, 'beta2': 0.999, 'L2_is_weight_decay': True, 'L2': 0.01, 'grad_clip': 1.0, 'use_averages': False, 'eps': 1e-08, 'learn_rate': 0.001}, 'seed': '${system.seed}', 'gpu_allocator': '${system.gpu_allocator}', 'dropout': 0.1, 'accumulate_gradient': 1, 'patience': 1600, 'max_epochs': 0, 'max_steps': 20000, 'eval_frequency': 200, 'score_weights': {'cats_score': 1.0, 'cats_score_desc': None, 'cats_micro_p': None, 'cats_micro_r': None, 'cats_micro_f': None, 'cats_macro_p': None, 'cats_macro_r': None, 'cats_macro_f': None, 'cats_macro_auc': None, 'cats_f_per_type': None, 'cats_macro_auc_per_type': None}, 'frozen_components': [], 'annotating_components': [], 'before_to_disk': None, 'logger': {'@loggers': 
'spacy.ConsoleLogger.v1', 'progress_bar': False}}, 'pretraining': {}, 'initialize': {'vectors': '${paths.vectors}', 'init_tok2vec': '${paths.init_tok2vec}', 'vocab_data': None, 'lookups': None, 'tokenizer': {}, 'components': {}, 'before_init': None, 'after_init': None}}
ℹ Using config from base model
✔ Generated training config
=========================== Initializing pipeline ===========================
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
[2022-09-21 14:28:21,704] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
- [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (3)
[2022-09-21 14:28:21,717] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat_multilabel']
[2022-09-21 14:28:21,717] [INFO] Resuming training for: ['tok2vec']
[2022-09-21 14:28:21,721] [INFO] Created vocabulary
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_web_md' (3.3.0) was trained with spaCy v3.3 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
[2022-09-21 14:28:23,042] [INFO] Added vectors: en_core_web_md
[2022-09-21 14:28:23,137] [INFO] Finished initializing nlp object
[2022-09-21 14:28:23,140] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline
============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
- [textcat_multilabel] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: textcat_multilabel (3)
ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner', 'textcat_multilabel']
ℹ Frozen components: ['tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Initial learn rate: 0.001
E     #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SPEED   SCORE
----  ------  ------------  -------------  ----------  ------  ------
   0       0          0.00           0.25        0.00  385.30    0.00
1000    1000          0.00          58.68        0.00  545.31    0.00
2000    2000          0.00           8.41        0.00  578.05    0.00
3000    3000          0.00           3.04        0.00  560.96    0.00
4000    4000          0.00           1.43        0.00  571.13    0.00
5000    5000          0.00           0.76        0.00  561.90    0.00
You'll notice that I get some warnings because en_core_web_md was trained with an older spaCy version, but besides that it seems to train fine. That means I can't reproduce the error locally. You'll also notice that the train command generates and prints a blob with config settings (the config={...} line). Could you share yours? I'm guessing something is going wrong there on your end.
Also, could you share your Prodigy and spaCy versions? Preferably via:
python -m pip freeze | grep spacy
python -m pip freeze | grep prodigy
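If grep isn't available (for example on Windows), a rough Python equivalent is sketched below. It uses importlib.metadata from the standard library (Python 3.8+), and installed_version is just a helper name I made up. Unlike the grep commands, it only looks up the exact package names, so related packages such as spacy-legacy won't show up.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing.

    Hypothetical helper; it checks the environment of the interpreter
    that runs this script.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("spacy", "prodigy"):
        print(name, installed_version(name) or "not installed")
```

Run it with the same python you use for python -m prodigy, so that it reports on the same environment.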
Could you also verify whether the output remains the same when you run with python -m prodigy instead of prodigy? I just want to make sure it's not a virtualenv thing.
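To make the virtualenv check a bit more concrete, here is a small sketch that prints which interpreter is running and which prodigy script is first on your PATH. The environment_report function is a name I invented, and the "same directory" comparison is only a heuristic that assumes the usual virtualenv layout, where the python executable and the prodigy script sit in the same bin (or Scripts) folder.

```python
import shutil
import sys
from pathlib import Path


def environment_report():
    """Collect details about the active interpreter and the prodigy script.

    Hypothetical helper. The same_environment flag is a heuristic: it only
    checks whether both executables live in the same directory.
    """
    prodigy_script = shutil.which("prodigy")  # None if not on PATH
    same_environment = (
        prodigy_script is not None
        and Path(prodigy_script).parent == Path(sys.executable).parent
    )
    return {
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        "prodigy_script": prodigy_script,
        "same_environment": same_environment,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If same_environment comes out as False, the plain prodigy command may be picking up a different installation than python -m prodigy does, which would be worth mentioning in your reply.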