Prodigy 1.12.0rc2 release candidate available for download!

In the last few weeks we've been collecting your feedback on the v1.12 alphas (thanks for giving them a spin!) and ironing out the outstanding issues. We are excited to announce that the v1.12 release candidate (v1.12.0rc2) is now available for download!

Like the alphas (and the upcoming v1.12 stable), this download is available to all v1.11.x licence holders.

As mentioned in our previous post, for v1.12 we have completely refactored Prodigy internals to make the annotation flow more tractable and more customizable. Adjusting the Controller and adding new components such as Stream and Source let us deliver a number of exciting features. Here are some of the highlights:

  • LLM-assisted workflows

Prodigy v1.12 offers built-in recipes for bootstrapping NER and Textcat annotations with OpenAI's gpt-3.5 model. You can easily set up a workflow in which the annotators curate label suggestions from the LLM, which should significantly speed up the annotation process.

Apart from NER and Textcat, we also provide recipes for terminology generation which, after curation, can be used as patterns in the PatternMatcher for another type of annotation bootstrapping. See our docs for example code snippets!

The recipes support both zero-shot and few-shot prompts, i.e. they allow you to provide some examples of the expected output to steer the model in the right direction. Prodigy provides you with a default prompt template, but of course custom templates are supported as well.
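
To make the zero-shot versus few-shot distinction concrete, here is a minimal sketch in plain Python. The build_prompt helper and the prompt wording are hypothetical and are not Prodigy's default template; the only point is that a few-shot prompt prepends worked examples before the text to annotate.

```python
from typing import List, Optional, Tuple


def build_prompt(
    text: str,
    labels: List[str],
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Build a zero-shot prompt, or a few-shot prompt if examples are given.

    Hypothetical helper for illustration, not part of Prodigy.
    """
    lines = [
        "Extract entities from the text below.",
        "Allowed labels: " + ", ".join(labels),
    ]
    # Few-shot: add (text, expected answer) pairs to steer the model.
    for example_text, example_answer in examples or []:
        lines += ["", "Text: " + example_text, "Answer: " + example_answer]
    # The text to annotate always comes last, with the answer left open.
    lines += ["", "Text: " + text, "Answer:"]
    return "\n".join(lines)


zero_shot = build_prompt("I love anchovy pizza.", ["DISH", "INGREDIENT"])
few_shot = build_prompt(
    "I love anchovy pizza.",
    ["DISH", "INGREDIENT"],
    examples=[("Add basil to the pasta.", "INGREDIENT: basil; DISH: pasta")],
)
```

In practice the examples would come from a small file of hand-checked annotations, so that the same ones are reused for every task.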

To help you make an informed decision about the optimal prompt for your purpose, Prodigy v1.12 comes with a couple of prompt engineering recipes.

You can choose between A/B-style evaluation or set up a tournament between 3 or more prompts. The ab.openai.tournament recipe uses an algorithm inspired by the Glicko ranking system to set up the duels and keep track of the best-performing candidate.
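
As a rough illustration of how a rating-based tournament can work, here is a sketch that uses plain Elo updates. This is a deliberate simplification and not the algorithm used by the recipe: Glicko additionally tracks the uncertainty of each rating. All names, the starting rating of 1500 and the K-factor of 32 are assumptions made for the example.

```python
from typing import Dict, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_duel(
    ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0
) -> None:
    """Update both ratings in place after an annotator picks a winner."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def next_duel(ratings: Dict[str, float]) -> Tuple[str, str]:
    """Pair the two neighbours in the ranking with the smallest rating gap."""
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    pairs = zip(ranked, ranked[1:])
    return min(pairs, key=lambda pair: ratings[pair[0]] - ratings[pair[1]])


ratings = {"prompt_a": 1500.0, "prompt_b": 1500.0, "prompt_c": 1500.0}
first, second = next_duel(ratings)
# Pretend the annotator preferred the output of the first prompt.
record_duel(ratings, winner=first, loser=second)
best = max(ratings, key=ratings.get)
```

Pairing closely rated candidates keeps each duel informative, since the outcome of a duel between a clear leader and a clear laggard tells you little.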

Please check out our LLM guide for more details.

  • Full customization of the annotation flow

We've exposed two brand new recipe components, the task router and the session factory, that let you control how tasks are distributed across annotators and what should happen when a new annotator joins the server.

We have expanded the settings for annotation overlap. Apart from full and zero overlap, you can now set partial overlap via the new annotations_per_task config setting.

More importantly, though, you can implement a fully custom task router. For example, you could distribute tasks based on the model score or the annotator's expertise. Please check out our guide to task routing for more details and ideas.
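
A minimal sketch of a score-based router follows. It assumes that a router is a callable receiving the controller, the current session ID and the task, and returning the list of session IDs that should see the task, and that the controller exposes the known sessions as session_ids; please verify both against the task routing guide. The meta["score"] field, the 0.5 threshold and the SimpleNamespace stand-in for the controller are purely illustrative.

```python
import hashlib
from types import SimpleNamespace
from typing import Dict, List


def route_by_score(ctrl, session_id: str, item: Dict) -> List[str]:
    """Send low-confidence tasks to everyone, confident ones to one annotator."""
    annotators = sorted(ctrl.session_ids)
    if not annotators:
        # No other sessions known yet: keep the task with the requester.
        return [session_id]
    # Assumed field: a model confidence stored under item["meta"]["score"].
    score = item.get("meta", {}).get("score", 0.0)
    if score < 0.5:
        # Uncertain prediction: full overlap so that all annotators review it.
        return annotators
    # Confident prediction: pick one annotator deterministically from the text.
    digest = hashlib.md5(item["text"].encode("utf8")).hexdigest()
    return [annotators[int(digest, 16) % len(annotators)]]


# Stand-in for the controller, used here only to exercise the function.
fake_ctrl = SimpleNamespace(session_ids=["ds-bob", "ds-alice"])
uncertain = {"text": "Is this spam?", "meta": {"score": 0.2}}
confident = {"text": "Buy now!!!", "meta": {"score": 0.95}}
everyone = route_by_score(fake_ctrl, "ds-bob", uncertain)
single = route_by_score(fake_ctrl, "ds-bob", confident)
```

Hashing the text rather than choosing at random means the same task is always routed to the same annotator, even after a server restart.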

  • Source-based progress estimation

In Prodigy v1.12 we have re-implemented the internal representations of the task stream and the input source. The stream is now aware of the underlying source and how much of it has been consumed by the annotators.

This allows us to offer more reliable progress tracking for workflows where the target number of annotations is not known upfront. In the UI, you'll notice 3 different types of progress bar: target progress (based on the set target), source progress (reflecting the progress through the source object) and progress (for custom progress estimators). Since the semantics of these new progress bars differ, we recommend reading our docs on progress, which explain them in detail.
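
To illustrate the idea behind source-based progress, here is a small sketch of a source that counts how many bytes of its input have been consumed. The TrackedSource class is hypothetical and is not Prodigy's Source implementation; it only shows why a stream that knows its source can report progress without a preset annotation target.

```python
import io
from typing import BinaryIO, Iterator


class TrackedSource:
    """Iterate over the lines of a binary file object and track bytes read."""

    def __init__(self, handle: BinaryIO, total_bytes: int) -> None:
        self.handle = handle
        self.total_bytes = total_bytes
        self.bytes_read = 0

    def __iter__(self) -> Iterator[str]:
        for raw_line in self.handle:
            # Count the line before handing it out, so progress reflects
            # everything that has been sent to the annotators so far.
            self.bytes_read += len(raw_line)
            yield raw_line.decode("utf8").rstrip("\n")

    @property
    def progress(self) -> float:
        """Fraction of the source consumed, between 0.0 and 1.0."""
        if not self.total_bytes:
            return 0.0
        return self.bytes_read / self.total_bytes


data = b"first\nsecond\nthird\n"
source = TrackedSource(io.BytesIO(data), total_bytes=len(data))
stream = iter(source)
first_line = next(stream)
progress_after_one = source.progress
remaining = list(stream)
```

Note that this measures how much of the input has been read, not how many answers have been saved, which is one reason the different progress bars can disagree.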

We have also improved the loaders and provided a refactored get_stream utility that resolves the source type and initializes the Stream accordingly. We also added support for Parquet input files as well as a new section in the docs about deploying Prodigy.

These are the highlights; v1.12 also comes with a number of smaller features, bug fixes and DX improvements, and it supports Python 3.11. Please check out the full v1.12rc2 changelog for details.

As always, we are looking forward to any feedback you might have!

To install:

pip install --pre prodigy -f

Hi @magdaaniol, thank you for sharing and the comments. I am wondering if there is a way to use other LLMs locally instead of the online OpenAI ones? Of course, at the expense of their accuracy.

The reason I am asking is that I am working with sensitive data that sits within a secure research environment without internet access. Using an approach like ner.openai.correct there would significantly boost annotation performance.

Perhaps some of the advanced Hugging Face models can be used instead to give a similar experience.

Thank you.

Hi Andrey,

You can definitely achieve similar results with a custom NER correct recipe and the new spacy-llm library. spacy-llm lets you integrate an LLM (hosted or local) as a spaCy component. It will take care of prompt generation and parsing and store the LLM annotation results on the Doc object just like any other spaCy pipeline component.
To use an LLM with spaCy you'll need to start by creating a configuration file that tells spacy-llm how to construct a prompt for your task. Please see the spacy-llm docs for details, but for NER it could look like this:

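[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
# Also set "labels" here to the entity types you want (see the spacy-llm docs)

[components.llm.backend]
@llm_backends = "spacy.Dolly_HF.v1"
# For better performance, use databricks/dolly-v2-12b instead
model = "databricks/dolly-v2-3b"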

Then, from your custom recipe you could assemble the nlp pipeline like so:

from spacy_llm.util import assemble

# Assemble a spaCy pipeline from the config
nlp = assemble("config.cfg")

# Use this pipeline as you would normally
doc = nlp("I know of a great pizza recipe with anchovis.")
print(doc.ents) # (pizza, anchovis)

Once you have processed your examples with the LLM-backed pipeline, you can use the results as input to the ner_manual interface so that annotators can correct the LLM annotations, just as is done in ner.openai.correct.
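
For example, the conversion from a processed Doc to an annotation task could be sketched as follows. The doc_to_task helper is hypothetical; it relies only on standard spaCy attributes (doc.text, doc.ents, token.idx, token.i, token.whitespace_, ent.start_char, ent.end_char, ent.start, ent.end, ent.label_) and assumes the usual task layout with "text", "tokens" and "spans" keys, where token_end is the index of the last token inside the span. Please check the exact format against the docs for the ner_manual interface.

```python
from typing import Any, Dict


def doc_to_task(doc: Any) -> Dict[str, Any]:
    """Convert a processed spaCy Doc into a task dict with pre-set spans."""
    tokens = [
        {
            "text": token.text,
            "start": token.idx,
            "end": token.idx + len(token.text),
            "id": token.i,
            "ws": bool(token.whitespace_),
        }
        for token in doc
    ]
    spans = [
        {
            "start": ent.start_char,
            "end": ent.end_char,
            "label": ent.label_,
            "token_start": ent.start,
            # spaCy's ent.end is exclusive; the task format is assumed
            # to expect the index of the last token in the span.
            "token_end": ent.end - 1,
        }
        for ent in doc.ents
    ]
    return {"text": doc.text, "tokens": tokens, "spans": spans}
```

Inside a custom recipe, the stream could then be built by calling doc_to_task on each Doc returned by nlp.pipe over your input texts.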

In the very near future we are going to release built-in spacy-llm recipes for Prodigy, but for now the same results can be achieved with just a little bit of custom scripting thanks to spacy-llm.
Let us know how it goes and if you need any assistance!

Amazing! Thank you so much @magdaaniol for your quick response. I will double check the suggested spacy-llm library. We are training and developing custom Llama-based models and it will be great to see them integrated with Prodigy.

Thank you!