Errors with pos.teach and pos.batch-train.

Hi,
I am getting some errors when I try and run pos.teach and pos.batch-train. I am not quite sure if I am doing something wrong, because ner.teach, ner.manual and ner.batch-train all worked out pretty well and I could use the trained model.

1. pos.teach

Not able to execute it at all. Command being tried:

prodigy pos.teach network_entities en_core_web_sm ~/Projects/Datasets/Logs/CSCF-Logs/training-logs.jsonl --label "VERB, PROPN"

Error being displayed:

[Abhishek:/tmp/model] [ARMn] 6s $ prodigy pos.teach network_entities en_core_web_sm /Users/Abhishek/Projects/Datasets/Logs/CSCF-Logs/training_logs.jsonl --label "VERB, PROPN"
Using 2 labels: VERB, PROPN
Traceback (most recent call last):
  File "cython_src/prodigy/core.pyx", line 55, in prodigy.core.Controller.__init__
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 60, in prodigy.core.Controller.__init__
ValueError: Error while validating stream: no first batch. This likely means that your stream is empty.

I have checked and re-checked the input file. Tried a .txt file and also a .jsonl file. The same file works well if I am using it with ner.teach or ner.manual, just not pos.teach.

2. pos.batch-train

Somehow got the web-app to work with pos.make-gold and annotated a bunch of POS tags. Tried running batch-train to get this error.

prodigy pos.batch_train network_entities en_core_web_sm --output /tmp/model --n-iter 20 --eval-split 0.2 --dropout 0.2 --label "PROPN, VERB"


Using 2 labels: PROPN, VERB

Loaded model en_core_web_sm
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/recipes/pos.py", line 181, in batch_train
    tagger.add_label(span['label'])
  File "pipeline.pyx", line 550, in spacy.pipeline.Tagger.add_label
ValueError: [T003] Resizing pre-trained Tagger models is not currently supported.

Any pointers are much appreciated.

Thanks for the report! This is very strange – the pos recipes are very similar to the ner recipes, so I don’t immediately see what might be going wrong here, especially not when it comes to the stream :thinking:

If you run pos.teach with the environment variable PRODIGY_LOGGING=basic set, do you see anything in the log that might be suspicious?

If you’re only using labels that are already present in the model (VERB, PROPN), one workaround you could try for the training is to remove the following lines from the pos.batch-train recipe:

for eg in examples:
    for span in eg.get('spans', []):
        if 'label' in span:
            tagger.add_label(span['label'])
for l in label:
    tagger.add_label(l)

Will definitely investigate this further, though!

Thank you for the input Ines. I did the same and nothing untoward in the logs (at least seems like that to me). Here is the output. The errors posted above follow immediately after the end of the log messages.

BASIC LOGGING:

Using 2 labels: VERB, PROPN
10:21:53 - RECIPE: Starting recipe pos.teach
10:21:53 - LOADER: Using file extension 'jsonl' to find loader
10:21:53 - LOADER: Loading stream from jsonl
10:21:53 - LOADER: Rehashing stream
10:21:53 - RECIPE: Creating Tagger using model en_core_web_sm
10:21:54 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
10:21:54 - SORTER: Randomly select questions using their score as the selection probability
10:21:54 - CONTROLLER: Initialising from recipe
10:21:54 - VALIDATE: Creating validator for view ID 'pos'
10:21:54 - DB: Initialising database SQLite
10:21:54 - DB: Connecting to database SQLite
10:21:54 - DB: Loading dataset 'network_entities' (5977 examples)
10:21:54 - DB: Creating dataset '2018-06-11_10-21-54'
10:21:54 - CONTROLLER: Validating the first batch
10:21:54 - CONTROLLER: Iterating over stream
10:21:54 - PREPROCESS: Splitting sentences
10:21:54 - FILTER: Filtering duplicates from stream
10:21:54 - FILTER: Filtering out empty examples for key 'text'

I just checked verbose logging and nothing there either. Seems weird to me.

Using 2 labels: VERB, PROPN
10:25:42 - RECIPE: Starting recipe pos.teach
{'unsegmented': False, 'exclude': None, 'label': ['VERB', 'PROPN'], 'loader': None, 'api': None, 'source': '/Users/Abhishek/Projects/Datasets/Logs/CSCF-Logs/training-logs.jsonl', 'spacy_model': 'en_core_web_sm', 'dataset': 'network_entities'}

10:25:42 - LOADER: Using file extension 'jsonl' to find loader
/Users/Abhishek/Projects/Datasets/Logs/CSCF-Logs/training-logs.jsonl

10:25:42 - LOADER: Loading stream from jsonl
10:25:42 - LOADER: Rehashing stream
10:25:42 - RECIPE: Creating Tagger using model en_core_web_sm
10:25:43 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
10:25:43 - SORTER: Randomly select questions using their score as the selection probability
10:25:43 - CONTROLLER: Initialising from recipe
{'config': {'lang': 'en', 'label': 'VERB, PROPN', 'dataset': 'network_entities'}, 'dataset': 'network_entities', 'db': True, 'exclude': None, 'get_session_id': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x106b3e710>, 'self': <prodigy.core.Controller object at 0x106b457f0>, 'stream': <prodigy.components.sorters.Probability object at 0x106b45550>, 'update': <bound method Tagger.update of <prodigy.models.pos.Tagger object at 0x106b454e0>>, 'view_id': 'pos'}

10:25:43 - VALIDATE: Creating validator for view ID 'pos'
10:25:43 - DB: Initialising database SQLite
10:25:43 - DB: Connecting to database SQLite
10:25:43 - DB: Loading dataset 'network_entities' (5977 examples)
10:25:43 - DB: Creating dataset '2018-06-11_10-25-43'
{'description': 'Entities in Network Logs', 'author': 'Abhishek Dwaraki', 'created': datetime.datetime(2018, 6, 8, 10, 37, 54)}

10:25:43 - CONTROLLER: Validating the first batch
10:25:43 - CONTROLLER: Iterating over stream
10:25:43 - PREPROCESS: Splitting sentences
{'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x106b459e8>, 'stream': <generator object at 0x106a63678>, 'text_key': 'text'}

10:25:43 - FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x106a63558>}

10:25:43 - FILTER: Filtering out empty examples for key 'text'

I will try the workaround that you posted for now and see what happens. Report back in a bit. Thanks a ton once again for the help.

Thanks for sharing the logs! This is weird indeed – I’ll try to reproduce it and see if I can get to the bottom of it.

Btw, another workaround: The “no first batch” error you’re seeing occurs in the stream validation (where Prodigy checks if the tasks have the right format, etc.). In your prodigy.json, you can set "validate": false to disable that. Does this help? And if not, does it maybe raise a different error (which might be the true cause of the problem)?

Removed those lines from pos.batch-train like you suggested. It did not fail immediately, but it did fail eventually. :slight_smile:

Using 2 labels: PROPN, VERB

Loaded model en_core_web_sm
Using 20% of accept/reject examples (920) for evaluation
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/recipes/pos.py", line 210, in batch_train
    baseline = model.evaluate(evals)
  File "cython_src/prodigy/models/pos.pyx", line 122, in prodigy.models.pos.Tagger.evaluate
KeyError: 'token_start'

EDITED POST:

Setting "validate": false rectifies that error, in the sense it no longer crashes with that particular error and it lets the web app go through, but that is about it. The web app opens and says no tasks available.

:cry:

Okay, so it does seem like the validation was “right” in the sense that there are no tasks to display. After some internal testing, I think I also found the problem:

The built-in tagger model currently doesn’t seem to handle the fine-grained vs. coarse-grained tags correctly. So when you filter by label VERB, it doesn’t actually find any matching tokens, because it compares the fine-grained Token.tag_, not the Token.pos_ value. If you use the fine-grained tags, it should work as expected:

--label VB,VBD,VBG,VBN,VBP,VBZ

You can find an overview of the fine-grained tags and their coarse-grained counterparts (i.e. all tags that are VERBs) in the annotation specs.
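To avoid typing the fine-grained tags by hand, something like the following could build the --label value from coarse-grained tags. This is only a sketch: expand_labels and COARSE_TO_FINE are hypothetical names, not part of Prodigy or spaCy. The VERB entry is copied from the command above, and the PROPN entry assumes the Penn Treebank proper-noun tags NNP and NNPS, so check both against the annotation specs before relying on them.

```python
# Hypothetical helper, not a Prodigy or spaCy API: expand coarse-grained
# tags into the fine-grained tags that --label currently expects.
# The mapping is hand-written and is an assumption; verify it against
# the annotation specs for the model you are using.
COARSE_TO_FINE = {
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "PROPN": ["NNP", "NNPS"],
}


def expand_labels(coarse_labels):
    """Return a comma-separated string of fine-grained tags."""
    fine = []
    for label in coarse_labels:
        # Labels without a known expansion are passed through unchanged.
        for tag in COARSE_TO_FINE.get(label, [label]):
            if tag not in fine:
                fine.append(tag)
    return ",".join(fine)


print(expand_labels(["VERB", "PROPN"]))
```

The printed string can then be pasted after --label in the pos.teach command.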

We’ll also work on a fix for this. While the more specific tags also allow more specific annotation, I definitely understand that it’s annoying, because you have to keep all those fine-grained tags in mind and it makes annotation much more difficult than “verb vs. not verb”. Sorry about that – we’ll definitely come up with a better solution for the next release!

Does your dataset contain any annotations collected with different recipes? For some reason, one or more of the POS tags in your training or evaluation examples don’t have a token_start specified (which marks the token the POS tag is referring to). This should be added either in pos.teach (via the model) or pos.make-gold (via the add_tokens pre-processor). You can always use the db-out recipe to inspect the dataset and see how the data looks and if there’s an example somewhere in there that has the wrong format.
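If the export is large, a small script may be quicker than reading it by eye. The sketch below assumes the dataset has been exported to JSONL, one task per line, with the same "spans" layout that shows up in the verbose logs. find_spans_missing_token_start is a hypothetical helper and the two sample lines are made up for illustration (the second imitates a span left over from an NER recipe).

```python
import json


def find_spans_missing_token_start(lines):
    """Yield (line_number, text, span) for every span without 'token_start'."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        example = json.loads(line)
        for span in example.get("spans", []):
            if "token_start" not in span:
                yield line_number, example.get("text", ""), span


# Made-up sample lines standing in for an exported dataset.
sample = [
    '{"text": "Port 19 link down", "spans": [{"start": 8, "end": 12, '
    '"token_start": 2, "token_end": 3, "label": "VBP"}]}',
    '{"text": "Found valid key for user admin", "spans": [{"start": 0, '
    '"end": 5, "label": "ACTION"}]}',
]

problems = list(find_spans_missing_token_start(sample))
for line_number, text, span in problems:
    print(line_number, span.get("label"), repr(text))
```

For a real export, an open file handle can be passed in place of the sample list, since the function only iterates over lines.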

Man, that was quick. You are awesome. :slight_smile:

It’s okay if it is a bit of a pain, it does not block me on anything. I can continue to use it for now. And yes, you are right, the dataset may have annotations from the ner recipes. I will try to do this with a new dataset just for pos annotations. That was silly of me. Should have tried that earlier. Thanks Ines.

pos.batch-train is failing with this new error. This is after commenting out the few lines that you mentioned earlier.

14:04:31 - RECIPE: Calling recipe 'pos.batch-train'
Using 6 labels: VB, VBD, VBG, VBN, VBP, VBZ
14:04:31 - RECIPE: Starting recipe pos.batch-train
{'silent': False, 'unsegmented': False, 'eval_split': 0.2, 'eval_id': None, 'batch_size': 32, 'n_iter': 10, 'dropout': 0.2, 'factor': 1, 'label': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'], 'output_model': PosixPath('/tmp/model'), 'input_model': 'en_core_web_sm', 'dataset': 'network_pos_annotations'}

14:04:31 - DB: Initialising database SQLite
14:04:31 - DB: Connecting to database SQLite

Loaded model en_core_web_sm
14:04:31 - RECIPE: Added sentence boundary detector to model pipeline
['sbd', 'tagger', 'parser', 'ner']

14:04:31 - DB: Loading dataset 'network_pos_annotations' (43 examples)
14:04:31 - MODEL: Merging tag spans of 43 examples
14:04:31 - MODEL: Using 33 examples (without 'ignore')
Using 20% of accept/reject examples (5) for evaluation
14:04:31 - RECIPE: Temporarily disabled other pipes: ['parser', 'ner']
14:04:31 - RECIPE: Initialised Tagger with model en_core_web_sm
{'lang': 'en', 'pipeline': ['sbd', 'tagger'], 'accuracy': {'token_acc': 99.8698372794, 'ents_p': 84.9664503965, 'ents_r': 85.6312524451, 'uas': 91.7237657538, 'tags_acc': 97.0403350292, 'ents_f': 85.2975560875, 'las': 89.800872413}, 'name': 'core_web_sm', 'license': 'CC BY-SA 3.0', 'author': 'Explosion AI', 'url': 'https://explosion.ai', 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}, 'sources': ['OntoNotes 5', 'Common Crawl'], 'version': '2.0.0', 'spacy_version': '>=2.0.0a18', 'parent_package': 'spacy', 'speed': {'gpu': None, 'nwords': 291344, 'cpu': 5122.3040471407}, 'email': 'contact@explosion.ai', 'description': 'English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.'}

14:04:31 - PREPROCESS: Splitting sentences
{'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x112c58f60>, 'stream': [{'text': 'add client 8c:85:90:3d:96:79, client count 9.', 'spans': [{'start': 0, 'end': 3, 'token_start': 0, 'token_end': 1, 'label': 'VBP', 'text': 'add', 'score': 0.055333979400000004, 'answer': 'accept'}], 'meta': {'score': 0.055333979400000004}, '_input_hash': 11590144, '_task_hash': -897644249, 'answer': 'accept'}, {'text': 'Found valid key for user admin', 'spans': [{'start': 0, 'end': 5, 'token_start': 0, 'token_end': 1, 'label': 'VBD', 'text': 'Found', 'score': 0.0978863165, 'answer': 'accept'}, {'start': 0, 'end': 5, 'token_start': 0, 'token_end': 1, 'label': 'VBN', 'text': 'Found', 'score': 0.8839325309, 'answer': 'accept'}], 'meta': {'score': 0.0978863165}, '_input_hash': 880729490, '_task_hash': -249076197, 'answer': 'accept'}, {'text': 'Port 19 link down', 'spans': [{'start': 8, 'end': 12, 'token_start': 2, 'token_end': 3, 'label': 'VBP', 'text': 'link', 'score': 0.6044232845, 'answer': 'reject'}], 'meta': {'score': 0.6044232845}, '_input_hash': -925490151, '_task_hash': 288532677, 'answer': 'reject'}, {'text': 'account admin logout from xml 128.119.247.5', 'spans': [{'start': 0, 'end': 7, 'token_start': 0, 'token_end': 1, 'label': 'VBP', 'text': 'account', 'score': 0.2405723035, 'answer': 'reject'}], 'meta': {'score': 0.2405723035}, '_input_hash': 1420608995, '_task_hash': -989918693, 'answer': 'reject'}, {'text': 'E9:21:FD:F0 port 16 VLANs SmallGroups, authentication Radius', 'spans': [{'start': 9, 'end': 11, 'token_start': 2, 'token_end': 3, 'label': 'VBG', 'text': 'F0', 'score': 0.0048528342, 'answer': 'reject'}], 'meta': {'score': 0.0048528342}, '_input_hash': -362070793, '_task_hash': 616828274, 'answer': 'reject'}, {'text': 'receive station msg, mac-8c:85:90:3d:96:79 bssid-f0:5c:19:25:d1:21 essid-CICS.', 'spans': [{'start': 0, 'end': 7, 'token_start': 0, 'token_end': 1, 'label': 'VB', 'text': 'receive', 'score': 0.5782603025, 'answer': 
'accept'}, {'start': 49, 'end': 66, 'token_start': 7, 'token_end': 8, 'label': 'VBG', 'text': 'f0:5c:19:25:d1:21', 'score': 0.1396117955, 'answer': 'reject'}], 'meta': {'score': 0.5782603025}, '_input_hash': 1948839068, '_task_hash': 1981648084, 'answer': 'accept'}, {'text': 'Failed to login to ALE server.', 'spans': [{'start': 0, 'end': 6, 'token_start': 0, 'token_end': 1, 'label': 'VBN', 'text': 'Failed', 'score': 0.8626689911000001, 'answer': 'accept'}], 'meta': {'score': 0.8626689911000001}, '_input_hash': -386275139, '_task_hash': -872306004, 'answer': 'accept'}, {'text': 'Login passed for user admin through xml 128.119.247.5', 'spans': [{'start': 6, 'end': 12, 'token_start': 1, 'token_end': 2, 'label': 'VBD', 'text': 'passed', 'score': 0.39312195780000003, 'answer': 'accept'}], 'meta': {'score': 0.39312195780000003}, '_input_hash': -323994160, '_task_hash': -733438919, 'answer': 'accept'}, {'text': 'AM f0:5c:19:25:9c:00: ARM Channel Interference Trigger', 'spans': [{'start': 0, 'end': 2, 'token_start': 0, 'token_end': 1, 'label': 'VB', 'text': 'AM', 'score': 0.4830096364, 'answer': 'reject'}], 'meta': {'score': 0.4830096364}, '_input_hash': 1835173649, '_task_hash': 1783173019, 'answer': 'reject'}, {'text': 'Port 16 link UP at speed 1 Gbps and full-duplex', 'spans': [{'start': 8, 'end': 12, 'token_start': 2, 'token_end': 3, 'label': 'VBP', 'text': 'link', 'score': 0.0774894804, 'answer': 'reject'}], 'meta': {'score': 0.0774894804}, '_input_hash': 797919552, '_task_hash': -928820376, 'answer': 'reject'}, {'text': 'Login passed for user admin through ssh 128.119.240.39', 'spans': [{'start': 6, 'end': 12, 'token_start': 1, 'token_end': 2, 'label': 'VBD', 'text': 'passed', 'score': 0.39312195780000003, 'answer': 'accept'}, {'start': 6, 'end': 12, 'token_start': 1, 'token_end': 2, 'label': 'VBN', 'text': 'passed', 'score': 0.6063562036, 'answer': 'accept'}], 'meta': {'score': 0.39312195780000003}, '_input_hash': -72680087, '_task_hash': 207730494, 'answer': 
'accept'}, {'text': 'Network Login MAC user 68B599A71D20 logged in MAC 68:B5:99:A7:1D:20 port 20 VLANs EDLAB, authentication Radius', 'spans': [{'start': 36, 'end': 42, 'token_start': 5, 'token_end': 6, 'label': 'VBN', 'text': 'logged', 'score': 0.0770713314, 'answer': 'accept'}, {'start': 89, 'end': 103, 'token_start': 14, 'token_end': 15, 'label': 'VB', 'text': 'authentication', 'score': 0.1477508098, 'answer': 'reject'}], 'meta': {'score': 0.0770713314}, '_input_hash': 513841368, '_task_hash': 692986997, 'answer': 'accept'}, {'text': 'Login passed for user admin through ssh 128.119.240.169', 'spans': [{'start': 6, 'end': 12, 'token_start': 1, 'token_end': 2, 'label': 'VBD', 'text': 'passed', 'score': 0.39312195780000003, 'answer': 'accept'}, {'start': 6, 'end': 12, 'token_start': 1, 'token_end': 2, 'label': 'VBN', 'text': 'passed', 'score': 0.6063562036, 'answer': 'accept'}], 'meta': {'score': 0.39312195780000003}, '_input_hash': 2062879761, '_task_hash': 1162997097, 'answer': 'accept'}, {'text': 'Protocol resolve:ucast-v4 is violated at fpc 0 for 26 times, started at 2017-09-16 11:54:08 EDT', 'spans': [{'start': 61, 'end': 68, 'token_start': 15, 'token_end': 16, 'label': 'VBN', 'text': 'started', 'score': 0.0699738115, 'answer': 'accept'}], 'meta': {'score': 0.0699738115}, '_input_hash': -2118964065, '_task_hash': -1473159162, 'answer': 'accept'}, {'text': 'Network Login MAC user 38C9863BB891 logged in MAC 38:C9:86:3B:B8:91 port 13 VLANs EDLAB, authentication Radius', 'spans': [{'start': 68, 'end': 72, 'token_start': 11, 'token_end': 12, 'label': 'VBP', 'text': 'port', 'score': 0.0382472016, 'answer': 'reject'}, {'start': 89, 'end': 103, 'token_start': 16, 'token_end': 17, 'label': 'VB', 'text': 'authentication', 'score': 0.1400180161, 'answer': 'reject'}], 'meta': {'score': 0.0382472016}, '_input_hash': 1009617874, '_task_hash': 687254671, 'answer': 'reject'}, {'text': 'Network Login MAC user 38C9862826E6 logged in MAC 38:C9:86:28:26:E6 port 14 VLANs EDLAB, 
authentication Radius', 'spans': [{'start': 89, 'end': 103, 'token_start': 14, 'token_end': 15, 'label': 'VB', 'text': 'authentication', 'score': 0.1453359574, 'answer': 'reject'}], 'meta': {'score': 0.1453359574}, '_input_hash': -57640706, '_task_hash': 2129843431, 'answer': 'reject'}, {'text': 'Login passed for user admin through ssh 128.119.240.50', 'spans': [{'start': 6, 'end': 12, 'token_start': 1, 'token_end': 2, 'label': 'VBD', 'text': 'passed', 'score': 0.39312195780000003, 'answer': 'accept'}, {'start': 6, 'end': 12, 'token_start': 1, 'token_end': 2, 'label': 'VBN', 'text': 'passed', 'score': 0.6063562036, 'answer': 'accept'}], 'meta': {'score': 0.39312195780000003}, '_input_hash': 1755864279, '_task_hash': 953636728, 'answer': 'accept'}, {'text': 'receive station msg, mac-04:54:53:3f:89:3a bssid-f0:5c:19:25:e6:12 essid-CSPublic.', 'spans': [{'start': 0, 'end': 7, 'token_start': 0, 'token_end': 1, 'label': 'VB', 'text': 'receive', 'score': 0.5784712434, 'answer': 'accept'}, {'start': 21, 'end': 42, 'token_start': 4, 'token_end': 5, 'label': 'VB', 'text': 'mac-04:54:53:3f:89:3a', 'score': 0.0654469058, 'answer': 'reject'}], 'meta': {'score': 0.5784712434}, '_input_hash': 677432146, '_task_hash': 57882591, 'answer': 'accept'}, {'text': 'Port 1 link down', 'spans': [{'start': 7, 'end': 11, 'token_start': 2, 'token_end': 3, 'label': 'VBP', 'text': 'link', 'score': 0.6796995401, 'answer': 'reject'}], 'meta': {'score': 0.6796995401}, '_input_hash': 799623361, '_task_hash': 655144301, 'answer': 'reject'}, {'text': 'Port 1 link UP at speed 100 Mbps and full-duplex', 'spans': [{'start': 7, 'end': 11, 'token_start': 2, 'token_end': 3, 'label': 'VBP', 'text': 'link', 'score': 0.3509369791, 'answer': 'reject'}], 'meta': {'score': 0.3509369791}, '_input_hash': 1451080809, '_task_hash': -64224250, 'answer': 'reject'}], 'text_key': 'text'}

14:04:31 - PREPROCESS: Splitting sentences
{'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x112c58f60>, 'stream': [{'text': 'Port 13 link UP at speed 1 Gbps and full-duplex', 'spans': [{'start': 8, 'end': 12, 'token_start': 2, 'token_end': 3, 'label': 'VBP', 'text': 'link', 'score': 0.5909635425, 'answer': 'reject'}], 'meta': {'score': 0.5909635425}, '_input_hash': -1771984083, '_task_hash': -835484383, 'answer': 'reject'}, {'text': 'Cannot connect to aruba.brightcloud.com: Connection timed out', 'spans': [{'start': 0, 'end': 3, 'token_start': 0, 'token_end': 1, 'label': 'VB', 'text': 'Can', 'score': 1.4818000000000002e-06, 'answer': 'accept'}], 'meta': {'score': 1.4818000000000002e-06}, '_input_hash': -143317322, '_task_hash': 1511107371, 'answer': 'accept'}, {'text': 'Protocol resolve:ucast-v4 is violated at fpc 0 for 25 times, started at 2017-09-15 15:52:54 EDT', 'spans': [{'start': 61, 'end': 68, 'token_start': 15, 'token_end': 16, 'label': 'VBN', 'text': 'started', 'score': 0.0902839601, 'answer': 'accept'}], 'meta': {'score': 0.0902839601}, '_input_hash': 1191548301, '_task_hash': 1688884184, 'answer': 'accept'}, {'text': 'Port 20 link UP at speed 1 Gbps and full-duplex', 'spans': [{'start': 8, 'end': 12, 'token_start': 2, 'token_end': 3, 'label': 'VB', 'text': 'link', 'score': 0.033128142400000005, 'answer': 'reject'}], 'meta': {'score': 0.033128142400000005}, '_input_hash': -1982547134, '_task_hash': -237652226, 'answer': 'reject'}, {'text': 'AM f0:5c:19:21:ef:80: ARM - increasing power cov-index 7/0 tx-power 3 new_rra 1/4', 'spans': [{'start': 3, 'end': 20, 'token_start': 1, 'token_end': 2, 'label': 'VB', 'text': 'f0:5c:19:21:ef:80', 'score': 0.0240463354, 'answer': 'reject'}], 'meta': {'score': 0.0240463354}, '_input_hash': 1414870338, '_task_hash': -1772092209, 'answer': 'reject'}], 'text_key': 'text'}

14:04:31 - MODEL: Merging tag spans of 5 examples
14:04:31 - MODEL: Using 5 examples (without 'ignore')
14:04:31 - MODEL: Evaluated 5 examples
{'right': 0.0, 'wrong': 3.0, 'unk': 71.0, 'acc': 0.0}

14:04:31 - RECIPE: Calculated baseline from evaluation examples (accuracy 0.00)
Using 100% of remaining examples (20) for training
Dropout: 0.2  Batch size: 32  Iterations: 10


BEFORE     0.000
Correct    0
Incorrect  3
Unknown    71


#          LOSS       RIGHT      WRONG      ACCURACY
14:04:32 - MODEL: Merging tag spans of 5 examples
14:04:32 - MODEL: Using 5 examples (without 'ignore')
14:04:32 - MODEL: Evaluated 5 examples
{'right': 0.0, 'wrong': 2.0, 'unk': 72.0, 'acc': 0.0}

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/recipes/pos.py", line 229, in batch_train
    model_to_bytes = model.to_bytes()
AttributeError: 'Tagger' object has no attribute 'to_bytes'

Just released v1.5.1, which should fix the underlying problems. Sorry again! The pos.teach and pos.batch-train recipes should now also be able to tell whether you want to use fine-grained or coarse-grained labels (by comparing the labels, so you don’t have to tell it explicitly). For more specific use cases, you can now even pass in a JSON tag map that maps coarse-grained to fine-grained tags.

@ines Works like a charm now, at least without the errors. I am currently training a model and will post back on how the pos recipes work. I am certain they are going to be fine. Thank you so much for the lightning quick responses once again. :slight_smile:

Hi,
I faced exactly the same error, but with ner.teach. I am trying to repeat the TRAINING A NEW ENTITY TYPE demo on YouTube with my own terms. However, I cannot proceed because of this error! Do you have any suggestions for resolving the issue?

@saeedranjbar12 Could you share more details on the error? Which one do you see – the “no first batch” error? If so, what does your data look like?

The error is :
“ValueError: Error while validating stream: no first batch. This likely means that your stream is empty.”
I was trying to replace DRUG in the demo with educational degrees. When I convert the initial dataset to a patterns file, it contains names of degrees like “PhD” and “BSc” with the EDU label.
Do you want me to copy and paste the traceback?

Ah okay – this is more of a general error related to the data you’re loading in. Essentially, it means that Prodigy couldn’t load any texts. This can happen if the format is corrupted, if none of the entries in a JSON file have a "text" key, etc. What data are you loading in and how does it look?
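One way to rule this out for a JSONL source is to count how many lines actually carry a usable text before handing the file to Prodigy. The following is a rough sketch, not Prodigy's own validation: count_usable_texts is a hypothetical helper, the sample lines are invented, and the default "text" key is taken from the 'text_key': 'text' entry in the logs earlier in this thread.

```python
import json


def count_usable_texts(lines, text_key="text"):
    """Return (usable_count, problems) for an iterable of JSONL lines."""
    usable = 0
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((line_number, "invalid JSON: {}".format(err.msg)))
            continue
        text = record.get(text_key) if isinstance(record, dict) else None
        if isinstance(text, str) and text.strip():
            usable += 1
        else:
            problems.append((line_number, "missing or empty {!r}".format(text_key)))
    return usable, problems


# Invented sample lines: one good record and three typical problems.
sample = [
    '{"text": "Port 19 link down"}',
    '{"body": "no text key here"}',
    '{"text": ""}',
    'not json at all',
]

usable, problems = count_usable_texts(sample)
print("usable texts:", usable)
for line_number, reason in problems:
    print("line", line_number, reason)
```

A usable count of zero would match the symptom of an empty stream, although other causes (such as pointing the recipe at the wrong source) would not be caught by this check.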

I am repeating exactly what Matthew did in the demo. I first start from a few terms, then add more terms to the dataset using Prodigy. Later, I convert the prepared dataset to patterns in .jsonl format and use the following command, as shown in the tutorial.
python3 -m prodigy ner.teach skills_ner en_core_web_lg train --loader reddit --label EDU --patterns skill_patterns.jsonl
My jsonl file looks as follows and it includes my chosen terms. Am I missing a part?
first line: {"label":"EDU","pattern":[{"lower":"phd"}]}

The patterns stuff all looks correct. I think the problem is in the data you're annotating, i.e. the Reddit corpus. In the video, we're loading in the pre-extracted data from a directory called train – sorry if this was slightly confusing. See this thread for an explanation:

The fourth argument of your command is the data you're loading in for annotation. So what currently says train needs to be a valid data file. In this case, a Reddit comments archive, because you're using --loader reddit:

python3 -m prodigy ner.teach skills_ner en_core_web_lg /path/to/data.bz2 --loader reddit --label EDU --patterns skill_patterns.jsonl

If you haven't done so already, you can download data from the Reddit comments corpus from this page.

OOOooo! So this is the problem. I thought it was only a directory to save results/models to. Thanks Ines!
Let me try it one more time with the given instruction. I will share the results.

Great, thanks – glad it’s working! (Sorry again about the confusion. Next time, we’ll try to write things like that as ./some-directory in the video tutorials. For more details on the recipes and their arguments, you can check out this page or the PRODIGY_README.html that you can download with Prodigy.)

Btw, when working with Reddit, keep in mind that the datasets are huge and contain lots of texts from all kinds of different subreddits. So depending on what you’re looking for, you might want to filter the data first and only select certain subreddits.
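A pre-filtering step along those lines could look like the sketch below. It assumes the archive is bz2-compressed with one JSON comment per line and that each comment has a "subreddit" field, as in the public Reddit comments dumps. filter_subreddits and filter_archive are hypothetical helpers, and the subreddit names and sample comments are invented.

```python
import bz2
import json


def filter_subreddits(lines, subreddits):
    """Yield only the comments whose 'subreddit' is in the wanted set."""
    wanted = {name.lower() for name in subreddits}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        comment = json.loads(line)
        if str(comment.get("subreddit", "")).lower() in wanted:
            yield comment


def filter_archive(in_path, out_path, subreddits):
    """Write a smaller bz2 archive holding only the selected subreddits."""
    with bz2.open(in_path, "rt", encoding="utf8") as src, \
            bz2.open(out_path, "wt", encoding="utf8") as dst:
        for comment in filter_subreddits(src, subreddits):
            dst.write(json.dumps(comment) + "\n")


# Invented comments standing in for lines of a real archive.
sample = [
    '{"subreddit": "networking", "body": "Port 19 link down again"}',
    '{"subreddit": "aww", "body": "Look at this cat"}',
    '{"subreddit": "Networking", "body": "Cannot connect to the server"}',
]

kept = list(filter_subreddits(sample, ["networking"]))
print(len(kept), "of", len(sample), "comments kept")
```

The output is written as bz2 again on the assumption that the filtered file will be loaded the same way as the original archive; that has not been checked against the reddit loader here.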