Prodigy hangs on "FILTER: Filtering out empty examples for key 'text'"

Prodigy does not start the server; it seems to hang at the FILTER step. This happens with every dataset I’ve tried, including the Reddit example. I’ve run it on two separate computers, with and without a prodigy.json config file. I’ve also uninstalled and reinstalled Prodigy and spaCy and run it in fresh virtualenvs. On each computer Prodigy worked once and then started having this problem. See the log below:

$ prodigy dataset reddit_example
$ prodigy db-in reddit_example reddit_product.jsonl

$ prodigy ner.teach reddit_example en_core_web_sm --label PRODUCT
06:46:51 - RECIPE: Calling recipe 'ner.teach'
Using 1 labels: PRODUCT
06:46:51 - RECIPE: Starting recipe ner.teach
{'unsegmented': False, 'exclude': None, 'patterns': None, 'label': ['PRODUCT'], 'loader': None, 'api': None, 'source': None, 'spacy_model': 'en_core_web_sm', 'dataset': 'reddit_example'}

06:46:51 - LOADER: Loading stream from jsonl
06:46:51 - LOADER: Reading stream from sys.stdin
06:46:51 - LOADER: Rehashing stream
06:46:51 - RECIPE: Creating EntityRecognizer using model en_core_web_sm
06:46:52 - MODEL: Added sentence boundary detector to model pipeline
['sbd', 'tagger', 'parser', 'ner']

06:46:52 - RECIPE: Making sure all labels are in the model
['PRODUCT']

06:46:52 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
06:46:52 - CONTROLLER: Initialising from recipe
{'config': {'lang': 'en', 'label': 'PRODUCT', 'dataset': 'reddit_example'}, 'dataset': 'reddit_example', 'db': True, 'exclude': None, 'get_session_id': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f2a723be400>, 'self': <prodigy.core.Controller object at 0x7f2a71145780>, 'stream': <prodigy.components.sorters.ExpMovingAverage object at 0x7f2a71145128>, 'update': <bound method EntityRecognizer.update of <prodigy.models.ner.EntityRecognizer object at 0x7f2a723ca278>>, 'view_id': 'ner'}

06:46:52 - VALIDATE: Creating validator for view ID 'ner'
06:46:52 - DB: Initialising database SQLite
06:46:52 - DB: Connecting to database SQLite
06:46:52 - DB: Loading dataset 'reddit_example' (1800 examples)
06:46:52 - DB: Creating dataset '2018-06-10_06-46-52'
{'description': None, 'author': None, 'created': datetime.datetime(2018, 6, 10, 6, 44, 46)}

06:46:52 - CONTROLLER: Validating the first batch
06:46:52 - CONTROLLER: Iterating over stream
06:46:52 - PREPROCESS: Splitting sentences
{'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x7f2a723ca160>, 'stream': <generator object at 0x7f2a7245dee8>, 'text_key': 'text'}

06:46:52 - FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f2a7245da68>}

06:46:52 - FILTER: Filtering out empty examples for key 'text'

Hi! I think I know what the problem is:

In Prodigy, the dataset in the database is the location the annotated examples will be saved to. So you don’t need to use db-in – to load in the data stream (e.g. the Reddit data), you can simply provide the file path on the command line as the third argument, right after the model. For example:

prodigy ner.teach reddit_example en_core_web_sm reddit_product.jsonl --label PRODUCT

If you leave the source argument empty (like in your example), Prodigy will default to reading from stdin. This mechanism can be useful because it lets you pipe data forward from a previous process, like a custom loader script. For example: python loader.py | prodigy ner.teach .... You can see that this is happening in the following log:

06:46:51 - LOADER: Reading stream from sys.stdin

So basically, Prodigy was waiting to read the stream from standard input – but nothing came, because there was nothing there. If you just provide the file as the input source like in my example above, everything should work as expected :blush:
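For the piping case mentioned above, here is a rough sketch of what a custom loader.py could look like. It assumes Prodigy expects one JSON object per line on stdin with a "text" key, which matches the "Loading stream from jsonl" and "key 'text'" lines in your log. The function names make_tasks and write_jsonl and the sample texts are made up for illustration, not part of Prodigy:

```python
import json
import sys


def make_tasks(texts):
    """Yield one task dict per non-empty text (hypothetical helper)."""
    for text in texts:
        text = text.strip()
        if text:
            yield {"text": text}


def write_jsonl(tasks, out=sys.stdout):
    """Write each task as one JSON object per line (JSONL)."""
    for task in tasks:
        out.write(json.dumps(task) + "\n")


if __name__ == "__main__":
    # Placeholder data; a real loader would read from a file, an API, etc.
    sample = ["First example text.", "", "Second example text."]
    write_jsonl(make_tasks(sample))
```

You would then leave out the source argument on purpose, so Prodigy reads what the script prints: python loader.py | prodigy ner.teach reddit_example en_core_web_sm --label PRODUCT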

(I’ll also think about a possible way to make Prodigy give more helpful feedback here. It’s difficult because leaving out the source argument is valid, and Prodigy should also be able to wait until a stream comes in via sys.stdin, even if it takes a while. One idea we’ve had was to require the source argument, but treat the value "-" as "read from standard input". This is quite common across other command line interfaces – but it’d be a breaking change.)
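To illustrate the "-" convention in general terms, here is a minimal sketch using Python's argparse. This is not how Prodigy is implemented; the program name example-recipe and the helpers open_source and stdin_hint are hypothetical. The hint relies on isatty(), which is one possible way a tool could notice that nothing is being piped in:

```python
import argparse
import sys


def build_parser():
    """Parser with a required source argument (hypothetical CLI)."""
    parser = argparse.ArgumentParser(prog="example-recipe")
    parser.add_argument("source", help='path to a JSONL file, or "-" for stdin')
    return parser


def open_source(source, stdin=sys.stdin):
    """Return stdin for "-", otherwise open the named file for reading."""
    if source == "-":
        return stdin
    return open(source, encoding="utf8")


def stdin_hint(source, stdin=sys.stdin):
    """Return a hint if stdin was requested but is an interactive terminal."""
    if source == "-" and stdin.isatty():
        return "Waiting for input on stdin; pipe data in or pass a file path."
    return None
```

With something like this, forgetting the source would be a usage error instead of a silent wait, and python loader.py | example-recipe - would still work for piped input.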

It works now, thank you!