No, but that's a nice idea! You can easily write your own little converter script for this, though:
```python
from prodigy.components.db import connect

db = connect()                               # connect to the database
examples = db.get_dataset('choice_dataset')  # get the dataset
textcat_examples = []                        # collect reformatted examples here

for eg in examples:
    accepted = eg.get('accept', [])  # get the list of accepted IDs, e.g. ['FINANCIAL']
    for accepted_id in accepted:
        textcat_examples.append({'text': eg['text'], 'label': accepted_id})
```
You can then save out the `textcat_examples` to a JSONL file and add it to a dataset using `db-in`, or add it to your database straight away by creating a new dataset and adding the list of examples to it. You should then be able to use that dataset to train with `textcat.batch-train`.
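To make the "save out to JSONL" step concrete, here's a minimal sketch. The conversion logic mirrors the loop above; the file name and the sample input are made up for illustration:

```python
import json

def convert_examples(examples):
    """Flatten choice annotations into one textcat example per accepted label."""
    textcat_examples = []
    for eg in examples:
        for accepted_id in eg.get('accept', []):
            textcat_examples.append({'text': eg['text'], 'label': accepted_id})
    return textcat_examples

def write_jsonl(path, lines):
    """Write one JSON object per line (the JSONL format db-in expects)."""
    with open(path, 'w', encoding='utf8') as f:
        for line in lines:
            f.write(json.dumps(line) + '\n')

# example input, shaped like what db.get_dataset() returns
annotations = [{'text': 'Stocks rallied today.', 'accept': ['FINANCIAL', 'NEWS']}]
converted = convert_examples(annotations)
write_jsonl('textcat_data.jsonl', converted)
# converted is:
# [{'text': 'Stocks rallied today.', 'label': 'FINANCIAL'},
#  {'text': 'Stocks rallied today.', 'label': 'NEWS'}]
```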
If you want to do this even more elegantly, you could also add an `on_exit` hook to your recipe that is run when you exit the Prodigy server and automatically adds the reformatted tasks to a new dataset. The `on_exit` function takes the controller as its argument, which gives you access to the database and the already annotated examples of the current session. You can find an example of this in the custom recipes workflow.
```python
def on_exit(ctrl):
    # get annotations of the current session
    examples = ctrl.db.get_dataset(ctrl.session_id)
    textcat_examples = convert_examples(examples)  # convert the examples
    # add them to your other dataset (needs to exist in the database)
    ctrl.db.add_examples(textcat_examples, datasets=['textcat_examples'])
```

Note that `datasets` should be a list (or tuple) of dataset names – `('textcat_examples')` without a trailing comma is just a string, not a tuple.
This depends on what exactly you're trying to do – do you want to recreate the seed selection functionality of the `textcat` recipes in your custom choice recipe? You can see how the stream with seeds is composed in `prodigy/recipes/textcat.py`, or use the `PatternMatcher` from the NER recipes to find terms in your incoming stream. A stream of annotation examples is just a simple generator btw – so you can also implement your own, custom matching logic.
By default, Prodigy tries to make as few assumptions about your streams as possible. Within the same session, duplicate tasks will be filtered out – but when you start a new session, Prodigy will not assume any state. However, once this bug is resolved in the upcoming release, you'll be able to specify the `--exclude` argument or return a list of dataset IDs as the `'exclude'` setting returned by your recipe. This tells Prodigy not to ask you questions that were already annotated in those datasets. For example, you can set it to the current dataset name, or use the ID of your evaluation set to make sure that examples don't appear in both your training and evaluation set.
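As a rough sketch of what that could look like in a recipe's return value – the recipe name, stream, and dataset names here are placeholders, and the `'exclude'` setting assumes the upcoming release described above:

```python
def choice_recipe(dataset, eval_dataset):
    stream = [{'text': 'Some incoming example'}]  # placeholder stream
    return {
        'dataset': dataset,       # dataset to save annotations to
        'stream': stream,
        'view_id': 'choice',
        # skip tasks already annotated in these datasets, e.g. the
        # current dataset and a held-out evaluation set
        'exclude': [dataset, eval_dataset],
    }

components = choice_recipe('choice_dataset', 'choice_eval')
# components['exclude'] == ['choice_dataset', 'choice_eval']
```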