I have built my own custom Elasticsearch loader and custom recipe wrapper on top of ner.make_gold. Both of them return a stream as a generator. However when I start the annotation I see that Prodigy iterate over stream generator until it gets exhausted before presenting any samples for annotation.
Here is my wrapper code:
@prodigy.recipe('elastic.ner.make-gold',
dataset=recipe_args['dataset'],
spacy_model=recipe_args['spacy_model'],
source=recipe_args['source'],
label=recipe_args['label_set'],
exclude=recipe_args['exclude'],
unsegmented=recipe_args['unsegmented'],
anonymize=('anonymize content of the samples', 'flag', 'a'))
def elastic_make_gold(dataset, spacy_model, source=None, api=None, loader=None,
label=None, exclude=None, unsegmented=False, anonymize=False):
stream = prodigy.get_stream(source, loader='elastic_loader')
stream = transform_stream(stream, spacy_model, anonymize)
components = make_gold(dataset=dataset, spacy_model=spacy_model,
source=stream, label=label, exclude=exclude, unsegmented=unsegmented)
print('Components:', components)
return components
Here is the full log:
prodigy elastic.ner.make-gold ner-test en_core_web_sm --label ORG,PRODUCT --exclude ner-test --unsegmented
10:06:41 - No API key or APP key was provided for Datadog
10:06:41 - CLI: Added 2 recipe(s) via entry points
10:06:41 - RECIPE: Calling recipe 'elastic.ner.make-gold'
Using 2 labels: ORG, PRODUCT
10:06:41 - LOADER: Added 1 file loader(s) via entry points
10:06:41 - LOADER: Loading stream from elastic_loader
10:06:41 - LOADER: Reading stream from sys.stdin
10:06:41 - RECIPE: Starting recipe ner.make-gold
10:06:41 - RECIPE: Loaded model en_core_web_sm
10:06:41 - RECIPE: Annotating with 2 labels
10:06:41 - LOADER: Using supplied iterable source as stream
10:06:41 - LOADER: Rehashing stream
Components: {'view_id': 'ner_manual', 'dataset': 'ner-test', 'stream': <generator object make_gold.<locals>.make_tasks at 0x7f96dc3cd5c8>, 'exclude': ['ner-test'], 'update': None, 'config': {'lang': 'en', 'labels': ['ORG', 'PRODUCT']}}
10:06:41 - CONTROLLER: Initialising from recipe
10:06:41 - VALIDATE: Creating validator for view ID 'ner_manual'
10:06:41 - DB: Initialising database PostgreSQL
10:06:44 - DB: Connecting to database PostgreSQL
10:06:51 - DB: Loading dataset 'ner-test' (102 examples)
10:06:51 - DB: Creating dataset '2019-03-26_10-06-41'
10:06:52 - DatasetFilter: Getting hashes for excluded examples
10:06:52 - DatasetFilter: Excluding 98 tasks from datasets: ner-test
10:06:52 - CONTROLLER: Initialising from recipe
✨ Starting the web server at http://0.0.0.0:8080 ...
Open the app in your browser and start annotating!
10:06:57 - GET: /project
10:06:58 - GET: /get_questions
10:06:58 - Task queue depth is 1
10:06:58 - Task queue depth is 2
10:06:58 - FEED: Finding next batch of questions in stream
10:06:58 - CONTROLLER: Validating the first batch for session: None
10:06:58 - PREPROCESS: Tokenizing examples
10:06:58 - FILTER: Filtering duplicates from stream
10:06:58 - FILTER: Filtering out empty examples for key 'text'
10:06:59 - GET https://xxxx.es.amazonaws.com:443/emails/_count [status:200 request:0.791s]
10:07:00 - GET https://xxxx.es.amazonaws.com:443/emails/_search?scroll=480m&size=3 [status:200 request:0.485s]
10:07:00 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.222s]
10:07:00 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.177s]
10:07:01 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.225s]
10:07:01 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.158s]
10:07:01 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.192s]
10:07:02 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.359s]
10:07:02 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.282s]
10:07:03 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.369s]
10:07:03 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.567s]
10:07:04 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.290s]
10:07:04 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.350s]
10:07:05 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.455s]
10:07:05 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.202s]
10:07:05 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.326s]
10:07:06 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.282s]
10:07:06 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.155s]
10:07:06 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.492s]
10:07:07 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.184s]
10:07:07 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.303s]
10:07:07 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.201s]
10:07:08 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.232s]
10:07:08 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.322s]
10:07:08 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.159s]
10:07:09 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.179s]
10:07:09 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.171s]
10:07:09 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.184s]
10:07:09 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.152s]
10:07:10 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.170s]
10:07:10 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.156s]
10:07:10 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.164s]
10:07:10 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.287s]
10:07:11 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.184s]
10:07:11 - GET https://xxxx.es.amazonaws.com:443/_search/scroll?scroll=480m [status:200 request:0.153s]
10:07:11 - DELETE https://xxxx.es.amazonaws.com:443/_search/scroll [status:200 request:0.205s]
10:07:17 - RESPONSE: /get_questions (3 examples)
I would like to be able to annotate potentially infinite stream of documents. Is there any workaround? How to fix this behavior?