Prodigy 1.12.0 is out! 🎉

We are thrilled to announce that Prodigy v1.12.0 is out! This latest version marks a significant milestone for us, as it has been quite some time since our last big release. Thanks to everyone who helped us with testing the alpha versions!

For v1.12.0 we have completely refactored Prodigy internals to make the annotation flow more tractable and more customizable. We have re-built the Controller and added new abstractions for a better representation of the task stream and input source. This allowed us to deliver a number of really exciting features (and there's more to come!). We invite you to check out our documentation for the full changelog and extensive user guides on LLM integrations, task routing and deployment. We have also prepared a video tour of the highlights:

Here's a lineup of our favourite features with links to the brand new user guides:


Prodigy v1.12.0 introduces built-in recipes for jump-starting annotation for Named Entity Recognition (NER) and Text Categorization (Textcat) using the OpenAI LLM service. These workflows let annotators efficiently curate label suggestions from a large language model, which can significantly speed up the annotation process.
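As a rough sketch (the dataset and file names here are made up, and the exact flags may differ slightly between versions, so check the LLM guide), an NER correction workflow can be started like this:

```shell
# Hypothetical invocation: curate OpenAI NER suggestions into a dataset.
# Assumes your OpenAI credentials are configured as described in the
# LLM guide; "annotated_news" and the input file are placeholder names.
prodigy ner.openai.correct annotated_news ./news_headlines.jsonl --labels PERSON,ORG,PRODUCT
```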

In addition to NER and Textcat, Prodigy v1.12.0 offers recipes for generating domain-specific terminologies. After the generated terms are curated, they can be utilized with the PatternMatcher, enabling another form of annotation bootstrapping.

The provided recipes support both zero-shot and few-shot prompts. This means you can supply a few examples of the expected output to guide the model's predictions in the desired direction. While Prodigy includes a default prompt template, it also allows for custom templates tailored to your specific requirements.
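To make the zero-shot vs. few-shot distinction concrete, here is a purely illustrative sketch of how such a prompt might be assembled. This is not Prodigy's actual template (Prodigy renders its own templates internally); the function and example strings are made up:

```python
def build_prompt(text, labels, examples=()):
    """Assemble a zero- or few-shot NER prompt for an LLM.

    With no examples this is a zero-shot prompt; passing a few
    (text, answer) pairs turns it into a few-shot prompt that
    nudges the model toward the expected output format.
    """
    lines = [
        f"Extract entities of the types {', '.join(labels)} "
        "from the text below."
    ]
    for ex_text, ex_answer in examples:  # few-shot: prepend curated examples
        lines.append(f"Text: {ex_text}\nAnswer: {ex_answer}")
    lines.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(lines)
```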

To help you select the most effective prompt for your particular use case, Prodigy v1.12.0 also includes prompt engineering recipes for optimizing prompts and improving annotation outcomes.


You have the option to select A/B style evaluation for two or more prompts. We particularly recommend the ab.openai.tournament recipe that utilizes an algorithm influenced by the Glicko ranking system to arrange the duels and monitor the performance of the various prompts. For further information, please refer to our LLM guide.


We've exposed two brand new recipe components: the task router and the session factory. These components let you control how tasks are assigned to annotators and what should happen when a new annotator joins the server. In addition, we have expanded the annotation overlap settings: apart from full and zero overlap, you can now set partial overlap via the new annotations_per_task config setting.

More importantly, though, you can implement a fully custom task router. For example, you could distribute tasks based on the model score or an annotator's expertise. Please check out our guide to task routing for more details and ideas.
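As an illustration of the expertise idea, here is a minimal sketch following the (ctrl, session_id, item) shape described in the task routing guide. The expertise mapping, the "domain" field on the task, and the session names are all made up for this example:

```python
# Hypothetical annotator pool: session name -> area of expertise.
EXPERTISE = {"alice": "medical", "bob": "legal"}

def route_by_domain(ctrl, session_id, item):
    """Send each task to every annotator whose expertise matches the
    item's (hypothetical) "domain" field; fall back to the requesting
    session if nobody matches, so no task is ever dropped."""
    sessions = [
        session for session, domain in EXPERTISE.items()
        if domain == item.get("domain")
    ]
    return sessions or [session_id]
```

A real router would typically use the controller argument (e.g. to inspect sessions or model scores); this sketch ignores it to stay self-contained.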

In Prodigy v1.12.0 we have re-implemented the internal representations of the task stream and the input source. The stream is now aware of the underlying source and how much of it has been consumed by the annotators.

This allows us to offer more reliable progress tracking for workflows where the target number of annotations is not known upfront. In the UI, you'll notice three different types of progress bar: target progress (based on the set target), source progress (reflecting progress through the source object) and progress (for custom progress estimators). Since the semantics of these progress bars differ, we recommend reading our docs on progress, which explain them in detail.

We have also improved the loaders and provided a refactored get_stream utility that resolves the source type and initializes the Stream accordingly. We also added support for Parquet input files.
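To give a flavour of what "resolving the source type" means, here is a toy illustration (not Prodigy's actual implementation, whose details live in the real get_stream utility): a loader is picked from the source argument, much as the recipes above accept either a file path or a dataset: reference:

```python
import pathlib

# Toy mapping from file extension to loader name.
LOADERS = {
    ".jsonl": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",  # Parquet input is newly supported in v1.12
}

def resolve_loader(source):
    """Infer a loader name from a source string, the way a stream
    utility might: dataset: references read back from the database,
    file paths dispatch on their extension."""
    if source.startswith("dataset:"):
        return "dataset"
    suffix = pathlib.Path(source).suffix.lower()
    if suffix in LOADERS:
        return LOADERS[suffix]
    raise ValueError(f"Can't infer loader for {source!r}")
```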

These are the highlights, but v1.12.0 also comes with a number of smaller features, bug fixes and DX improvements, and it now supports Python 3.11. Please check out the full v1.12.0 changelog for details.

As always, we are looking forward to any feedback you might have! This forum is a great place to share it :slight_smile:

To install:

pip install --upgrade prodigy -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy

I just purchased a few hours ago and I am on 1.11.14. I ran pip install prodigy -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy with my purchase code, but the version is still the same.

Hi @jay007,

If there's already a Prodigy installation in the current environment, the --upgrade flag is necessary.

Could you try:

pip install --upgrade prodigy -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy

In fact, I'll update the instruction to use it by default.

I tried that and also created a new environment and installed again but when running prodigy stats it still says Version 1.11.14?

Just tried and didn't have issues. Are you 100% sure you're looking at the right Python environment? For example, if you run which python or which python3, is it pointing to your venv?

If so, two other options:

  1. Run pip uninstall prodigy then try pip install prodigy -f .... You may want to prefix it with python3 -m pip ... or python -m pip ... to be 100% sure you're pointing to the right environment.
  2. Go to a browser, open https://xxxx-xxxx-xxxx-xxxx@download.prodi.gy/index/prodigy, select Prodigy v1.12.0 and download the wheel file (prodigy-1.12.0-...) that matches your Python version and OS. Alternatively, you can download from SendOwl. Run pip uninstall prodigy again (or start a new venv) and then run pip install -f /path/to/wheels

Hi, I just updated to 1.12.1. The built-in audio.transcribe recipe is throwing an error.
I'm using audio.transcribe in the following way

prodigy audio.transcribe stt_ns_gb_review dataset:stt_ns_gb --fetch-media

This used to work before updating.
After updating to 1.12.1 the error message I'm getting is

error message: FileNotFoundError: [Errno 2] No such file or directory: 'dataset:stt_ns_gb'

but stt_ns_gb is a dataset and not a file. How do I pass in a dataset as an input argument to audio.transcribe ?

The dataset: syntax referenced here does not work for audio.transcribe

Thanks for the help.

hi @ngawangtrinley and @spsither,

Thanks for reporting the issue. We're looking into it and will get back to you.


Thanks @ngawangtrinley for the report, this is an error specifically with the Audio/Video workflows at the moment.

We have a fix in the works that should be out this week but if you're blocked now, I'd recommend you do a prodigy db-out stt_ns_gb ./path/to/output and use the exported JSONL file as the input to your recipe.

prodigy audio.transcribe stt_ns_gb_review ./path/to/output/stt_ns_gb.jsonl --loader jsonl

The audio.transcribe recipe defaults to the audio loader, which expects a directory of audio files, so you'll need to pass the --loader jsonl argument explicitly for this to work.


Thanks for the help @kab
The dataset stt_ns_gb is currently being added to.
I want annotations in stt_ns_gb to be reviewed once and added to stt_ns_gb_review.

If I use db-out and export a JSONL file, the issue I'm having is that the JSONL file contains all the annotations in stt_ns_gb and does not exclude the annotations we have already reviewed ( i.e. added to stt_ns_gb_review).

How do I tell db-out to export only the annotations not yet reviewed ( in stt_ns_gb but not in stt_ns_gb_review)?

The dataset: syntax was used to take care of that issue.

Well, good news: we pushed a fix for this issue in Prodigy v1.12.2! So you shouldn't have to worry about this. But if you're interested:

By default, Prodigy will exclude examples from the input source (a JSONL file in this case) based on the _task_hash attribute. For audio.transcribe specifically, the _task_hash cannot always be consistent, since the field added to the answered task is a dynamic property. We'll be looking to fix this in Prodigy v2, but for now, when using this recipe we recommend excluding by the _input_hash instead of the _task_hash via the exclude_by setting in your prodigy.json.

# prodigy.json
{
    ...
    "exclude_by": "input"
}
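A simplified sketch of why this helps (Prodigy's real hashing lives in its own utilities and covers more keys than this; the function and task dicts below are made up for illustration): the input hash depends only on the raw input, while the task hash also covers annotation-side fields, so a dynamic field makes the task hash unstable between runs.

```python
import hashlib
import json

def input_hash(task):
    """Hash only the raw input (here: the "text" key)."""
    return hashlib.md5(task.get("text", "").encode("utf8")).hexdigest()

def task_hash(task):
    """Hash the whole task, so any added field changes the result."""
    payload = json.dumps(task, sort_keys=True).encode("utf8")
    return hashlib.md5(payload).hexdigest()

# Two copies of the same task whose dynamic field came out differently:
a = {"text": "clip1.wav", "transcript": "hello", "_view_id": "audio"}
b = {"text": "clip1.wav", "transcript": "hello", "_view_id": "blocks"}
```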

Thank you for the quick fix @kab!

When I use "exclude_by": "input" in prodigy.json with a folder as the input source to audio.transcribe, does it exclude based on the file name in the input folder or the file content?

When the input source is a JSONL file, will "exclude_by": "input" exclude based on the audio file name and not on the dynamic property?

{
     "exclude_by": "input"
}

will exclude based on the file name for audio.transcribe when pointing at a directory. If you're using a JSONL file exported from the database via db-out, it will also use the file name.

This is because the text attr is set to the file name and the built-in hashing for Prodigy treats the text key as input. It does not currently handle the base64 encoded audio data at all.
