Prodigy does not load the server; it seems to hang at the FILTER step. This happens with every dataset I've tried, including the Reddit example. I've run it on two separate computers, both with and without a prodigy.json config file. I've also uninstalled and reinstalled Prodigy and spaCy and run it in fresh virtualenvs. On each computer Prodigy worked once and then started having this problem. See the log below:
```
$ prodigy dataset reddit_example
$ prodigy db-in reddit_example reddit_product.jsonl
$ prodigy ner.teach reddit_example en_core_web_sm --label PRODUCT
06:46:51 - RECIPE: Calling recipe 'ner.teach'
Using 1 labels: PRODUCT
06:46:51 - RECIPE: Starting recipe ner.teach
{'unsegmented': False, 'exclude': None, 'patterns': None, 'label': ['PRODUCT'], 'loader': None, 'api': None, 'source': None, 'spacy_model': 'en_core_web_sm', 'dataset': 'reddit_example'}
06:46:51 - LOADER: Loading stream from jsonl
06:46:51 - LOADER: Reading stream from sys.stdin
06:46:51 - LOADER: Rehashing stream
06:46:51 - RECIPE: Creating EntityRecognizer using model en_core_web_sm
06:46:52 - MODEL: Added sentence boundary detector to model pipeline
['sbd', 'tagger', 'parser', 'ner']
06:46:52 - RECIPE: Making sure all labels are in the model
['PRODUCT']
06:46:52 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
06:46:52 - CONTROLLER: Initialising from recipe
{'config': {'lang': 'en', 'label': 'PRODUCT', 'dataset': 'reddit_example'}, 'dataset': 'reddit_example', 'db': True, 'exclude': None, 'get_session_id': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f2a723be400>, 'self': <prodigy.core.Controller object at 0x7f2a71145780>, 'stream': <prodigy.components.sorters.ExpMovingAverage object at 0x7f2a71145128>, 'update': <bound method EntityRecognizer.update of <prodigy.models.ner.EntityRecognizer object at 0x7f2a723ca278>>, 'view_id': 'ner'}
06:46:52 - VALIDATE: Creating validator for view ID 'ner'
06:46:52 - DB: Initialising database SQLite
06:46:52 - DB: Connecting to database SQLite
06:46:52 - DB: Loading dataset 'reddit_example' (1800 examples)
06:46:52 - DB: Creating dataset '2018-06-10_06-46-52'
{'description': None, 'author': None, 'created': datetime.datetime(2018, 6, 10, 6, 44, 46)}
06:46:52 - CONTROLLER: Validating the first batch
06:46:52 - CONTROLLER: Iterating over stream
06:46:52 - PREPROCESS: Splitting sentences
{'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x7f2a723ca160>, 'stream': <generator object at 0x7f2a7245dee8>, 'text_key': 'text'}
06:46:52 - FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f2a7245da68>}
06:46:52 - FILTER: Filtering out empty examples for key 'text'
```
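In case it helps to rule out the data: `reddit_product.jsonl` is plain JSONL, one object per line with a `"text"` field (the same key the FILTER step checks for in the last log line). Here's a small sketch of that shape, just to confirm the format I'm importing is sane — the example sentences are placeholders, not my actual Reddit data:

```python
import json
import os
import tempfile

# Placeholder rows in the same shape as reddit_product.jsonl:
# one JSON object per line with a "text" key.
examples = [
    {"text": "Has anyone tried the new phone yet?"},
    {"text": "I bought a laptop from that store last week."},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line parses and has a non-empty "text" value,
# so none of them should be dropped by the empty-example filter.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
assert all(row.get("text") for row in rows)
print(len(rows), "examples OK")  # prints: 2 examples OK
```

My real file passes this kind of check, so I don't think the hang is caused by malformed or empty examples.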