Yes, exactly! For example, let’s assume your input data looks like this:
texts = ["This is a text about cats", "This is about dogs"]
And let’s assume you also have a model that can predict labels and their scores based on input documents. You can then write a predict function that yields (score, example) tuples. Each example is a dictionary in Prodigy’s format, containing a text and a label, plus optional meta information (which is displayed in the bottom right corner of the web app):
def predict(texts):
    for text in texts:
        # predict labels and scores for the text, for example:
        # [('CAT', 0.91), ('DOG', 0.53)]
        labels = YOUR_MODEL.predict(text)
        for label, score in labels:  # create one task for each label!
            example = {'text': text, 'label': label, 'meta': {'score': score}}
            yield (score, example)  # tuples of score and annotation task
The (score, example) tuples can be consumed by Prodigy’s sorters – for example, prefer_uncertain(), which will resort the stream to prefer examples with a score closest to 0.5. To make annotation more efficient, the web application focuses on one decision at a time – so you can simply create one annotation task for each label.
So in the example above, Prodigy might skip the very confident CAT prediction and only ask you whether the label DOG with a score of 0.53 applies to “This is a text about cats”. You can then click reject, and the annotation is saved to your dataset.
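To make that more concrete, here’s a minimal sketch of the wiring outside of a recipe, assuming the predict() generator from above and an already loaded model – the sorter consumes the (score, example) tuples and yields plain example dicts, and it may skip examples it considers too confident:

from prodigy.components.sorters import prefer_uncertain

texts = ["This is a text about cats", "This is about dogs"]
stream = prefer_uncertain(predict(texts))  # consumes (score, example) tuples
for eg in stream:
    # the sorter yields the example dicts only – the scores are used for sorting
    print(eg['label'], eg['meta']['score'], '->', eg['text'])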
Your update() function takes a list of annotated examples, i.e. the same examples as your input, just with an added "answer" key that’s either "accept", "reject" or "ignore".
def update(examples):
    right = [eg for eg in examples if eg['answer'] == 'accept']
    wrong = [eg for eg in examples if eg['answer'] == 'reject']
    # update your model with the right and wrong examples
    YOUR_MODEL.update(right, wrong)
All annotations will include the text, a label, the score and whether the label was correct or incorrect (according to the human annotator). So you can use this information to update your model.
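For instance, the rejected DOG task from above would come back looking roughly like this (illustrative only – depending on your Prodigy version, the stored example may also contain a few extra keys such as hashes):

{
    'text': 'This is a text about cats',
    'label': 'DOG',
    'meta': {'score': 0.53},
    'answer': 'reject'
}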
Putting this all together, a recipe could then look like this:
import prodigy
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe('custom-textcat')
def custom_textcat(dataset):
    texts = [...]  # your data here
    return {
        'dataset': dataset,
        'stream': prefer_uncertain(predict(texts)),
        'update': update,
        'view_id': 'classification'  # use classification interface
    }
This will stream in your examples, resort them to prioritise the ones with an uncertain score, and show them in the web app, where you will see a label and a text. Every time the app sends back a batch of annotated tasks to the Prodigy server, your update() function will be called, and your model will be updated. All annotations you collect will be stored in the dataset, so you can export it and train from the annotations later on. You can also find more info on the exact data formats and expected API of the recipe components in your PRODIGY_README.html.
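Assuming you’ve saved the recipe in a file – the name recipe.py below is just an example – you can then point Prodigy to it with the -F flag when you start the server, and export the annotations later with db-out:

prodigy custom-textcat my_textcat_dataset -F recipe.py
# later, export the collected annotations (depending on your version,
# db-out prints JSONL to stdout or takes an output directory)
prodigy db-out my_textcat_dataset > annotations.jsonl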