For our use case, we want to start from scratch by streaming a raw dataset from a Kafka queue. We're set up to label custom entities in text without a model in the loop.
We have a single, shared Postgres instance for the database.
We'd like to be able to go back into a dataset and add labels to it; for example, one labeler might annotate phone numbers, and another labeler might come along six months later and want to add an address label to the same data.
Is there currently support for updating an existing dataset that’s built off a streaming queue?
Prodigy’s stream and JSON output formats are pretty much identical, so you can always go back and load in an existing dataset. For example:
```
prodigy db-out some_dataset > some_dataset.jsonl
prodigy ner.manual other_dataset en_core_web_sm some_dataset.jsonl --label SOME_LABEL
```
The ner.manual recipe respects pre-defined entities, so the annotator will see everything that was labelled before, can correct the existing spans, and can also add new entities.
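For reference, each line of the exported JSONL is one task dict: the raw input lives in `"text"` and previously accepted entities travel in a `"spans"` list of character offsets. A minimal sketch (the text and label here are made up for illustration):

```python
import json

# A sketch of one record as db-out might export it: "spans" carries the
# entities that were accepted in the earlier annotation pass.
task = {
    "text": "Call me at 555-0199 tomorrow.",
    "spans": [
        {"start": 11, "end": 19, "label": "PHONE_NUMBER"},
    ],
    "answer": "accept",
}

# When this line is loaded back into ner.manual, the PHONE_NUMBER span is
# rendered as a pre-highlighted entity the annotator can keep, fix or remove.
line = json.dumps(task)
print(json.loads(line)["spans"][0]["label"])
```

Because the stream format and the output format line up, no conversion step is needed between export and re-import.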
While you can technically add the new annotations to the old dataset, I’d still recommend creating a new one when you re-annotate an existing set later. This gives you a cleaner separation and if something goes wrong, you’ll always have a record of the previous set.
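If you do end up with two separate datasets over the same texts (phone numbers from one pass, addresses from another), you can always merge the exports yourself afterwards. Here's a rough sketch using only the standard library; the `"text"`/`"spans"` fields follow the exported format, but the merge logic itself is my own convenience helper, not a built-in command (real exports also carry an `_input_hash` you could group on instead of the raw text):

```python
import json
from collections import defaultdict

def merge_exports(*jsonl_dumps):
    """Combine spans from several db-out exports, keyed on the input text.

    Hypothetical helper: groups tasks by their "text" field and
    concatenates the "spans" lists from every export.
    """
    merged = defaultdict(list)
    for dump in jsonl_dumps:
        for line in dump.splitlines():
            task = json.loads(line)
            merged[task["text"]].extend(task.get("spans", []))
    return [{"text": text, "spans": spans} for text, spans in merged.items()]

# Two passes over the same sentence, six months apart:
phones = '{"text": "Reach Acme at 555-0199, 1 Main St.", "spans": [{"start": 14, "end": 22, "label": "PHONE_NUMBER"}]}'
addresses = '{"text": "Reach Acme at 555-0199, 1 Main St.", "spans": [{"start": 24, "end": 33, "label": "ADDRESS"}]}'

combined = merge_exports(phones, addresses)
print(combined[0]["spans"])
```

Note that this sketch doesn't check for overlapping spans between the two passes; for production use you'd want to validate that before loading the merged file anywhere.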
Hope this helps!