Hi @rosamond!
Can you try logging? I'm wondering whether your feed_overlap may not be set to true like you think.
Try adding PRODIGY_LOGGING=verbose:
PRODIGY_LOGGING=verbose python -m prodigy textcat_sent_sequence sent_dataset ...
Then look for the CONFIG and FEED lines:
$ PRODIGY_LOGGING=verbose python -m prodigy ner.manual ner_ex1 blank:en nyt_text_dedup.jsonl --label ORG
...
20:54:00: CONFIG: Using config from global prodigy.json
/Users/ryan/.prodigy/prodigy.json
20:54:00: DB: Initializing database SQLite
20:54:00: DB: Connecting to database SQLite
20:54:00: DB: Creating dataset '2023-02-01_20-54-00'
{'created': datetime.datetime(2023, 2, 1, 20, 50, 36)}
20:54:00: FEED: Initializing from controller
{'auto_count_stream': True, 'batch_size': 10, 'dataset': 'ner_ex1', 'db': <prodigy.components.db.Database object at 0x11c3070a0>, 'exclude': ['ner_ex1'], 'exclude_by': 'input', 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.Feed object at 0x11c307c40>, 'stream': <generator object at 0x11c1b8540>, 'target_total_annotated': None, 'timeout_seconds': 3600, 'total_annotated': 0, 'total_annotated_by_session': Counter(), 'validator': <prodigy.components.validate.Validator object at 0x11c306a70>, 'view_id': 'ner_manual'}
...
Two things to notice. First, in the CONFIG line, you can see that Prodigy is using the global prodigy.json. This is a good check on whether your local project prodigy.json is being read in or whether the global one (which is checked first) is being used instead.
Second, look at the FEED line and verify what feed_overlap is (it shows up as 'overlap' in the FEED output above). By default, it's set to False.
I'm wondering if your global prodigy.json is overwriting your local project's prodigy.json.
One way to check this is using overrides (see this post for example):
PRODIGY_LOGGING=verbose PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' python -m prodigy ner.manual ner_ex1 blank:en nyt_text_dedup.jsonl --label ORG
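If you'd like to compare the two files directly, here's a rough sketch that just prints what each prodigy.json says about feed_overlap. It assumes the global file is at ~/.prodigy/prodigy.json (as in the CONFIG log above) and that your local one sits in the directory you run Prodigy from. The read_setting helper is something I made up for illustration, not part of Prodigy, and it doesn't try to reproduce how Prodigy merges the two files.

```python
import json
from pathlib import Path


def read_setting(path, key="feed_overlap"):
    """Return the value of `key` in the JSON file at `path`.

    Returns None if the file does not exist or the key is not present.
    (Hypothetical helper for illustration; not a Prodigy API.)
    """
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf8") as f:
        config = json.load(f)
    return config.get(key)


if __name__ == "__main__":
    # Assumed locations: adjust if your setup differs.
    candidates = {
        "global": Path.home() / ".prodigy" / "prodigy.json",
        "local": Path.cwd() / "prodigy.json",
    }
    for name, path in candidates.items():
        print(f"{name}: {path} -> feed_overlap = {read_setting(path)!r}")
```

A value of None means the file wasn't found or doesn't set feed_overlap at all, in which case the default applies.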
This would make sense if, in your rosamond session, you forgot to save your last annotation. Before closing your browser, make sure you hit save so that any annotations still sitting in your batch (on the client) are written to the DB.
By default, Prodigy will dedup by task_hash. If you're only doing one task, this likely means you have some duplicates in your data.
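For a quick look at possible duplicates, something like the sketch below may help. It assumes your input is JSONL with a "text" key on each line, and it only compares the raw text, which is a rough proxy and not the hash Prodigy actually computes. find_duplicate_texts is a made-up helper name, not something from Prodigy.

```python
import json
from collections import Counter


def find_duplicate_texts(lines, key="text"):
    """Return {value: count} for values of `key` seen more than once.

    `lines` is any iterable of JSONL strings (e.g. an open file).
    Exact text match only; this is NOT Prodigy's own hashing.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if key in record:
            counts[record[key]] += 1
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Tiny made-up sample, just to show the shape of the result.
    sample = [
        '{"text": "Apple buys a startup."}',
        '{"text": "Markets fell on Tuesday."}',
        '{"text": "Apple buys a startup."}',
    ]
    print(find_duplicate_texts(sample))
```

To run it on your own data, open your source file and pass the file object in place of sample.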
Check those items and let us know if you're still having issues.