Custom recipes tutorial not working

Hi,

I was going through the custom recipe tutorial on Custom Recipes · Prodigy · An annotation tool for AI, Machine Learning & NLP, but it is not working. Specifically, I'm trying to run the cat facts example (code copy/pasted from the webpage below). It seems like the cat facts API has changed, but even accounting for that I'm still getting the error:

...
stream.apply(add_tokens, nlp=nlp, stream=stream)  # tokenize the stream for ner_manual
AttributeError: 'generator' object has no attribute 'apply'

Code (from webpage):

import prodigy
from prodigy.components.preprocess import add_tokens
import requests
import spacy

@prodigy.recipe("cat-facts")
def cat_facts_ner(dataset, lang="en"):
    # We can use the blocks to override certain config and content, and set
    # "text": None for the choice interface so it doesn't also render the text
    blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "choice", "text": None},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
    ]
    options = [
        {"id": 3, "text": "😺 Fully correct"},
        {"id": 2, "text": "😼 Partially correct"},
        {"id": 1, "text": "😾 Wrong"},
        {"id": 0, "text": "🙀 Don't know"}
    ]

    def get_stream():
        res = requests.get("https://cat-fact.herokuapp.com/facts").json()
        for fact in res["all"]:
            yield {"text": fact["text"], "options": options}

    nlp = spacy.blank(lang)           # blank spaCy pipeline for tokenization
    stream = get_stream()             # set up the stream
    stream.apply(add_tokens, nlp=nlp, stream=stream)  # tokenize the stream for ner_manual

    return {
        "dataset": dataset,          # the dataset to save annotations to
        "view_id": "blocks",         # set the view_id to "blocks"
        "stream": stream,            # the stream of incoming examples
        "config": {
            "labels": ["RELEVANT"],  # the labels for the manual NER interface
            "blocks": blocks         # add the blocks to the config
        }
    }

How can I fix these issues? Also, is there a more in-depth guide for custom recipes?

Thanks

Hi @ale,

Sorry about the outdated example! I've just updated the website so you should be able to recreate it without errors now.
The main things I changed were:

  1. processing the API response (it now returns a list)
  2. updating how add_tokens is applied to the stream (a quick sketch of both changes is below). It looks like we've updated this example to use the newer API (concretely, the [apply](Components and Functions · Prodigy · An annotation tool for AI, Machine Learning & NLP) method of the Stream component), but the source in this example is the old-style generator function, so we can't use the newer API here.
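For reference, a minimal sketch of both changes, reusing the names from the recipe above (old-style generator stream, so add_tokens wraps it directly):

def get_stream():
    res = requests.get("https://cat-fact.herokuapp.com/facts").json()
    for fact in res:  # the API now returns a list instead of {"all": [...]}
        yield {"text": fact["text"], "options": options}

nlp = spacy.blank(lang)           # blank spaCy pipeline for tokenization
stream = get_stream()             # old-style generator stream
stream = add_tokens(nlp, stream)  # wrap the generator instead of calling .apply()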

As for the in-depth guide for custom recipes, we have this section in the docs. I assume you've seen it already so let us know if there's any particular aspect you'd like some more info on.
If you'd like to see some end-to-end examples of custom recipes, I recommend checking this repository, in particular the tutorials and other folders.

Thanks @magdaaniol! It is working now.

I have a few questions about this custom recipe and my use case. First, a little background. My team is doing NER and RE at the same time with rel.manual. We want to categorize tricky cases we find in Prodigy and also allow for annotators to leave a comment regarding what they found difficult for a sentence.

  1. I see that the choice answers are stored in an "accept" attribute in the JSON format. Is it possible to customize the name of this attribute?
  2. Comments in the text input field are saved to an attribute called "user_input" in the JSON format. Can this one also be customized?
  3. We have identified categories of tricky cases. In our case, we would use the "choice" interface to select the type of tricky case when we encounter one. If we add a new category in the future by updating the recipe, would it be an issue if we continue to save to the same database even though the previous examples lack the new categories in the "options" field?
  4. Is it possible to review the annotations of this custom recipe with the review recipe or do we need a custom review recipe too?
  5. Related to the one above, how could we review only the NER and RE annotations for the accepted examples (and exclude the choice and text input answers)?

Thanks!

Hi @magdaaniol,

I continued experimenting with custom recipes and have other questions. Here is the custom recipe I'm building:

import prodigy
from prodigy.core import Arg, recipe
from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens
import spacy

@prodigy.recipe(
    "test-recipe",
    dataset = Arg(help="Dataset to save answers to."),
    file_path=Arg(help="Path to texts")
)
def test_recipe(dataset: str, file_path):
    stream = get_stream(file_path) # load in the JSON file

    blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "choice", "text": None, "options": [{"id": "option_1", "text": "Option 1"}]},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Comments", "field_id": "comments"}
    ]

    return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": stream,
        "config": {
            "labels": ["LABEL"],
            "blocks": blocks,
            "choice_style": "multiple"
        }
    }

My questions are:

  1. When I run the recipe I get the following warning:
⚠ Prodigy automatically assigned an input/task hash because it was
missing. This automatic hashing will be deprecated as of Prodigy v2 because it
can lead to unwanted duplicates in custom recipes if the examples deviate from
the default assumptions. More information can found on the docs:
https://prodi.gy/docs/api-components#set_hashes

What should I do to have a hashing consistent with the default behaviour in Prodigy? I see the documentation suggests:

from prodigy import set_hashes

stream = (set_hashes(eg) for eg in stream)
stream = (set_hashes(eg, input_keys=("text", "custom_text")) for eg in stream)
  2. I have added the options for the choice component in the blocks view_id. It seems to be working fine there. Can I add the options there instead of adding them to each example with an add_options function as shown in the documentation webpage?

  3. The text is not being displayed for the ner_manual task when I run this recipe. Do you know what is going wrong?

If there are any Prodigy best practices that I should incorporate in this recipe it would be very useful to know.

Thanks

Hi @ale,

Answering inline:

  1. I see that the choice answers are stored in an "accept" attribute in the JSON format. Is it possible to customize the name of this attribute?

It's not possible to customize it via recipe settings or arguments. You could modify it programmatically by adding a before_db callback to your recipe, which would essentially overwrite the task dictionary with the new key:

def before_db(examples):
    for eg in examples:
        accepted_options = eg.get("accept")
        if accepted_options:
            eg["my_custom_key"] = accepted_options
            del eg["accept"]  # optional: you could delete the original key, though it's usually recommended to keep it as is
    return examples

This callback should be returned from the recipe under the before_db key:

 return {
        "view_id": "choice",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "before_db": before_db, # custom callback
        "config": {
            ...
        },
    }
  2. Comments in the text input field are saved to an attribute called "user_input" in the JSON format. Can this one also be customized?

Yes. You can customize the name of the attribute from the recipe level by specifying field_id in the view_id definition. Please check here for an example of how field_id should be used.
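For instance, reusing the field_id value from the recipe you posted:

blocks = [
    {"view_id": "text_input", "field_rows": 3, "field_label": "Comments", "field_id": "comments"}
]
# the typed text is then saved under eg["comments"] instead of eg["user_input"]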

  3. If we add a new category in the future by updating the recipe, would it be an issue if we continue to save to the same database even though the previous examples lack the new categories in the "options" field?

No, Prodigy follows an "append only" policy with respect to storing annotation examples. So if you restart the server with a new label set, the examples that have more options will simply be appended to the existing ones. You would need to consider how to use such a hybrid dataset for training, though. If the old examples could potentially be labelled with the new categories (but aren't, because the category didn't exist when the annotation was made), this can be really confusing to the model. This is why it is rarely a good idea to modify the label set during annotation. If possible, it is recommended to do a pilot annotation on a representative sample of data to calibrate the label set. Once you're confident you have all the categories you need, you would proceed to the main annotation step.

Another option, if you do find that you've missed a category, would be to review the existing annotations with the new category included as an option or, even better, in a binary yes/no workflow (which will require some post-processing to compile the final annotation from the first multiple-choice pass and the binary pass). Yet another option would be to correct model mistakes (e.g. with textcat.correct).
In any case, you need to make sure that all final categories are well represented in your dev set so that you can see whether the introduction of the new category is causing trouble.

  4. Is it possible to review the annotations of this custom recipe with the review recipe or do we need a custom review recipe too?

Yes, you will need a custom review recipe. It's impossible to make assumptions about the components of custom recipes, which is why review supports only the built-in UIs. Also, you can only review one view_id at a time, because otherwise the interface could become really illegible.

  5. Related to the one above, how could we review only the NER and RE annotations for the accepted examples (and exclude the choice and text input answers)?

In review you need to specify the view_id that the recipe is supposed to render. Please note that this will be impossible if you modified the names of the keys under which the NER and RE annotations are stored.
So in this case, you should be able to review both NER and REL by specifying relations as the view_id on the CLI and adding relations_span_labels with a list of all NER labels to prodigy.json, as described here. If the only difference is in a span, it should also be rendered as differing versions in review.


Hi @ale:

If you don't need a custom hashing function (which you would need if, for example, you had custom fields that should be used to distinguish the examples), it is fine to just let Prodigy do it. The task and input hashes will be consistent. The warning there is just to inform you that from Prodigy v2 the user will have to take care of it, to make sure they are in full control.
What Prodigy currently does automatically is call set_hashes under the hood with the default task keys. You can consult what the default keys are in the set_hashes documentation.
Also, if your recipe modifies the task with respect to the keys used in the hashing function, e.g. it adds annotations from patterns or a model (which is not the case here), it is recommended to call set_hashes after the modification to reflect these changes.
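For example, a minimal sketch of re-hashing as the last step after such a modification, reusing the Stream/apply pattern from your recipe (the rehash helper is just an illustrative name):

from prodigy import set_hashes

def rehash(stream):
    # recompute both hashes once the extra fields/annotations have been added
    for eg in stream:
        yield set_hashes(eg, overwrite=True)

stream.apply(rehash, stream=stream)  # call this after the modifying steps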

  2. I have added the options for the choice component in the blocks view_id. It seems to be working fine there. Can I add the options there instead of adding them to each example with an add_options function as shown in the documentation webpage?

Yes, you could, but then you'd have no record of what the annotator had to choose from. It's always recommended to store all the information required to recreate the annotation task, and that includes the available options. Also, Prodigy's train recipe uses this options field and wouldn't be able to generate spaCy examples from the annotations if it is missing.

  3. The text is not being displayed for the ner_manual task when I run this recipe. Do you know what is going wrong?

As explained in this example, tokens are required for the ner_manual view_id. To add them you can use the add_tokens helper (which you are already importing). You will also need a spaCy tokenizer; here I'm using the basic spaCy tokenizer for English. Adding the following lines should make the recipe show all the blocks:

nlp = spacy.blank("en")
stream.apply(add_tokens, stream=stream, nlp=nlp)

Nothing crucial occurs to me on top of what I've said already. Storing options on the example is probably one of the more important "good practice" pointers.


Thanks for all your help @magdaaniol.

I have some further questions regarding the hashes in custom recipes and Prodigy in general.

  1. In the documentation, where are task_keys extracted from? The default is ("spans", "label", "options"). Are these from the recipe dictionary or attributes of each example or somewhere else?

  2. Using the custom recipe cat-facts example from above, I ran a small test with two annotators: jane and joe. First, I annotated sentences with "labels": ["RELEVANT"] with jane. Then I changed the recipe's labels to "labels": ["CAT"] and annotated with joe. For both annotators, the same sentences have equal input hashes (expected) but also same task hashes. Shouldn't the task hashes be different because I'm using different labels?


  3. On a separate topic, is it possible to have more than 1 interface of the same interface type? For example, a custom recipe with two choice interfaces, each with different options.

  4. Is it possible to add a static question (or title) above the choice interface in the cat-facts recipe?

  5. Is it possible to add theme options to the recipe so that it is not necessary to specify them in prodigy.json? For example relationHeight and relationHeightWrap from the documentation.

Thanks

Hi @ale,

Some answers inline:

  1. In the documentation, where are task_keys extracted from? The default is ("spans", "label", "options"). Are these from the recipe dictionary or attributes of each example or somewhere else?

These are extracted from the attributes of each example, yes. The built-in recipes create certain task structures (dictionaries) specific to each recipe. Thus, if you want to add a custom task_key for the hashing function to use, it should be a first level key on the task dictionary.

  2. Using the custom recipe cat-facts example from above, I ran a small test with two annotators: jane and joe. First, I annotated sentences with "labels": ["RELEVANT"] with jane. Then I changed the recipe's labels to "labels": ["CAT"] and annotated with joe. For both annotators, the same sentences have equal input hashes (expected) but also same task hashes. Shouldn't the task hashes be different because I'm using different labels?

The default keys used for computing the task_hash are: spans, label, options, arcs. If you look closely, there's no label attribute on the custom task here. The label attribute is stored for binary classification tasks. In this case, the config attribute labels is used for determining the UI, and the labels will be stored under spans if there are any. Thus, for NER, the task hash is affected by pre-existing spans, not by the set of labels available. The idea is to distinguish between the "kinds" of annotation, i.e. what is being annotated, not particular label sets.
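To illustrate (a toy check, not something you need in a recipe): the hashes are computed from the example dictionary itself, and the recipe's "labels" config never appears in that dictionary, so two identical examples get identical hashes no matter which labels the recipe exposes:

from prodigy import set_hashes

options = [{"id": 1, "text": "😾 Wrong"}]
eg_jane = set_hashes({"text": "Cats sleep a lot.", "options": options})
eg_joe = set_hashes({"text": "Cats sleep a lot.", "options": options})
assert eg_jane["_task_hash"] == eg_joe["_task_hash"]  # the label set plays no role here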

  3. On a separate topic, is it possible to have more than 1 interface of the same interface type? For example, a custom recipe with two choice interfaces, each with different options.

Technically, you could define multiple choice blocks. You would need to add the respective options as the value of the "options" key in each block definition:

blocks = [
    {"view_id": "ner_manual"},
    {"view_id": "choice", "text": None, "options": options},
    {"view_id": "choice", "text": None, "options": options2},
    {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
]

Please note that all answers will be written under the same accept key, so in order to be able to mark options from both blocks, you would need to switch to the "multiple" choice style. With the single style, only one answer will be permitted across both blocks. Also, by default, the keyboard shortcuts will be the same for both blocks, so you might want to modify them or disable them completely via custom JavaScript.

If you want more flexibility/control over the final UI you can always use custom HTML and JavaScript and build your own form with multiple checkboxes / radio button groups. window.prodigy.update callback lets you update the current task with any custom data, like information about the checkbox that was selected. Here's a straightforward example of a custom checkbox:

  4. Is it possible to add a static question (or title) above the choice interface in the cat-facts recipe?

Yes, you can achieve that by adding another html block on top of existing choice blocks:

blocks = [
    {"view_id": "ner_manual"},
    {"view_id": "html"},
    {"view_id": "choice", "text": None, "html": None, "options": options},
    {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
]

Note that, similarly to text, html has to be set to None in the choice view_id definition to prevent the text from appearing twice.
The html view_id expects an html field on the task, so that will have to be added while you're creating the tasks:

def get_stream():
    res = requests.get("https://cat-fact.herokuapp.com/facts").json()
    for fact in res:
        yield {"text": fact["text"], "options": options, "html": "<h2>This is my static question</h2>"}

You can also add extra styling, of course. Please check the custom interfaces section on html and css for examples.

  5. Is it possible to add theme options to the recipe so that it is not necessary to specify them in prodigy.json? For example relationHeight and relationHeightWrap from the documentation.

Yes, Prodigy merges the configuration from the global and the local prodigy.json, the CLI overrides and the config key returned from the recipe. So you can return a custom_theme dictionary under the config key of the dictionary returned from the recipe:

  return {
        "dataset": dataset,          # the dataset to save annotations to
        "view_id": "blocks",         # set the view_id to "blocks"
        "stream": stream,            # the stream of incoming examples
        "config": {
            "labels": ["CAT"],  # the labels for the manual NER interface
            "blocks": blocks,  # add the blocks to the config
            "custom_theme": {"buttonSize": 500}  # set custom theme options    
        }
    }

Thanks @magdaaniol.

Some more questions:

  1. In question 4 above, regarding adding a custom section title via HTML on top of the choice interface: Is it ok to use the HTML rendered by the choice view_id instead of using a custom html view_id? Running a test, I kept the blocks variable as the original (code below), and added the "html" field to the task dictionary. I also added some inline CSS formatting and it looks as intended.
blocks = [
    {"view_id": "ner_manual"},
    {"view_id": "choice", "text": None},
    {"view_id": "text_input", "field_rows": 3, "field_label": "Comments", "field_id": "comments"}
]

for task in stream:
    task["options"] = options
    task["html"] = '<h2 style="text-align: left; margin-bottom: 0;">Edge case category</h2>'
    yield task
  2. On a related note to the question above, why does the choice interface render the html field from examples by default but other interfaces like ner_manual don't render the html? What can I do to render different HTML titles for say, each section (ner_manual, choice, and text_input)?

  3. Is there a difference between adding task-formatting code (like adding options or html fields) inside of a custom get_stream() function vs a separate function like add_options()? For example:
    Everything under get_stream():

def get_stream():
    jsonl_stream = JSONL(file_path)

    options = [
        {"id": "option_1", "text": "Option 1"},
        {"id": "option_2", "text": "Option 2"}
    ]

    for task in jsonl_stream:
        task["options"] = options
        task["html"] = '<h2 style="text-align: left; margin-bottom: 0;">Select an option</h2>'
        yield task

Separated into functions:

stream = get_stream(file_path) # load in the JSON file
stream = add_options(stream) # add options to each task
  4. For the text_input interface, is there some setting I can adjust to defocus the text input field when pressing the ESC key after writing some text in the box?

Hi @ale!

Sure, if that's more convenient. I can't think of any side effects to doing it this way except for the recipe code being slightly less readable.

  2. On a related note to the question above, why does the choice interface render the html field from examples by default but other interfaces like ner_manual don't render the html? What can I do to render different HTML titles for say, each section (ner_manual, choice, and text_input)?

I suppose the reason why the other interfaces do not render additional HTML on top of the task is that this is not a typical setup. To define the NER task you shouldn't really need any extra text. Conversely, for a choice-like task, you would normally accompany the options with the actual question.
Our recommended method for annotating data is to do one annotation task at a time and combine UIs only if it's strictly necessary (we talk about it in the docs here if you're interested in more details on the topic). This is a proven way to get more accurate annotations faster. Following this principle, the NER interface in most cases wouldn't require any extra headings.
But then there are, of course, different cases, and custom recipes are precisely there to accommodate special requirements. I understand that in your case, HTML blocks would serve to separate the different UIs on the screen. If it's the same HTML for all tasks, you can achieve that by introducing multiple html blocks. The docs linked above contain an example.
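For example, a sketch with a different static heading per section, set directly in the block definitions (the heading texts are just placeholders):

blocks = [
    {"view_id": "html", "html": "<h2>Entities</h2>"},            # static heading for the NER block
    {"view_id": "ner_manual"},
    {"view_id": "html", "html": "<h2>Edge case category</h2>"},  # static heading for the choice block
    {"view_id": "choice", "text": None, "html": None},
    {"view_id": "text_input", "field_rows": 3, "field_label": "Comments"}
]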

  3. Is there a difference between adding task-formatting code (like adding options or html fields) inside of a custom get_stream() function vs a separate function like add_options()?

Technically, there's not. You can either have it separately or inside one function. That said, it's probably best to separate these two jobs (reading the file and modifying the examples) into separate functions for easier testing, debugging, error catching etc.
Also, please note that Prodigy has the get_stream helper which you can reuse (instead of overriding it and calling the loader yourself). This is another argument for having it separately: you can reuse Prodigy's get_stream and apply all the modifications in separate step(s). In fact, if you're on Prodigy > 1.12, get_stream returns a Stream object which has the convenient apply method that lets you apply stream-modifying functions.
The cat-facts example uses the JSONL loader, which is the older interface for loading files and returns a generator object. Using Stream and apply is the newer and recommended way of handling input files.
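For example, a rough sketch of that shape, using the file_path argument and an add_options helper like the one in your recipe:

from prodigy.components.stream import get_stream

def add_options(stream):
    options = [
        {"id": "option_1", "text": "Option 1"},
        {"id": "option_2", "text": "Option 2"}
    ]
    for eg in stream:
        eg["options"] = options
        yield eg

stream = get_stream(file_path)            # returns a Stream object on Prodigy >= 1.12
stream.apply(add_options, stream=stream)  # apply the modifier to the stream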

  4. For the text_input interface, is there some setting I can adjust to defocus the text input field when pressing the ESC key after writing some text in the box?

There's no setting for this, but you can add this key binding via custom JS:

document.addEventListener('prodigyupdate', event => {
    // listen for updates in the UI
    var input = document.querySelector('.prodigy-text-input');
    input.onkeydown = function(event) {
        // upon pressing ESC
        if (event.keyCode === 27) {
            // remove autofocus from input
            input.blur();
        }
    }
});

Assuming this code is in the custom.js file, in the recipe:

# read the js code
custom_js = Path("custom.js").read_text()

# pass it to the Prodigy Controller in the config
return {
    "dataset": dataset,          # the dataset to save annotations to
    "view_id": "blocks",         # set the view_id to "blocks"
    "stream": stream,            # the stream of incoming examples
    "config": {
        "labels": ["CAT"],       # the labels for the manual NER interface
        "blocks": blocks,
        "javascript": custom_js  # add custom javascript
    }
}

Thanks again @magdaaniol!

Some more questions:

  1. Above you said that I can disable keyboard shortcuts for choice interfaces in Prodigy via custom Javascript to avoid conflict when having two choice interfaces. Currently, I have an interface with a relationships block that also supports NER and a choice block. I see that pressing number keys activates both NER labels and choice options. Can you share a snippet to remove these number key shortcuts?
  2. This question is related to our previous discussion here. You mentioned above that I can introduce the same HTML title in different parts of the interface using multiple html blocks. For a different application, how can I add a different HTML block? Specifically, I want to keep the HTML title above the choice block and add another HTML above the relations block with the same text that will be displayed in relations (but rendered in plain HTML). The reason for this is that some texts we annotate are paragraphs that cannot be split into sentences because we're annotating cross-sentence relationships. It may be useful for the annotators to be able to read the text in simple HTML and then annotate it in the relations block. That is because we find it difficult to read these multiline paragraphs in the relations interface.

Thanks!

Hi @ale,

No problem - always happy to help 😉

Responding inline:

Can you share a snippet to remove these number key shortcuts?

I just double checked and removing the key shortcuts from the UI is just a part of the solution. You still have to remap the numerical key shortcuts because they will apply even though the options are not visible.
This post describes how to go about remapping the default key shortcuts:

Then, if you do choose to remove the visualization of the shortcuts from the UI as well (it shouldn't be necessary, though, if you do the remapping), here's the custom CSS snippet:

div.prodigy-option > span {
    display: none;
}

You can pass it as a string directly to the global_css setting in prodigy.json, or save it in a file (e.g. custom.css) and read it from the recipe:

custom_css = Path("custom.css").read_text()

and pass it as global_css in the recipe return statement e.g.:

 return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": stream,
        "config": {
            "labels": ["LABEL"],
            "blocks": blocks,
            "choice_style": "single",
            "global_css": custom_css
        }
    }

Specifically, I want to keep the HTML title above the choice block and add another HTML above the relations block with the same text that will be displayed in relations (but rendered in plain HTML).

That should be possible by specifying two html blocks, each with a different value. The title above the choice block could just contain a raw html value (it's static text which would be the same for all tasks) and the other html block would display the text of the relations task via an html_template. Templates have access to the task dictionary and you can use dot notation to access nested keys:

blocks = [
    {"view_id": "html", "html": "<p>This is a fixed html</p>"},
    {"view_id": "choice", "text": None},
    {"view_id": "html", "html_template": "{{text}}"},
    {"view_id": "relations"}
]

Thanks @magdaaniol.

I have some questions about hashing. Above, we discussed that Prodigy's default behaviour is to call set_hashes with the following defaults as per the documentation:

default_input_keys = ("text", "image", "html", "input")
default_task_keys = ("spans", "label", "options")
stream = (set_hashes(eg, input_keys = default_input_keys, task_keys = default_task_keys) for eg in stream)

Which as I understand would be the same as:

stream = (set_hashes(eg) for eg in stream)
  1. Let's say I run two different recipes (first ner.manual and then rel.manual) and save to the same database with the same input JSONL. I see that Prodigy doesn't show texts annotated with ner.manual when I later run rel.manual. I understand that this is because in both recipes the texts have the same input hashes. Is that correct?
  2. Later, I run my custom recipe with the same JSONL input and save to the same database as in 1. This time, I see all texts that were already annotated with ner.manual and rel.manual. However, I don't want to annotate those texts again, but rather pick up from where annotation with rel.manual finished. In my custom recipe I pre-process the input to add the fields options and html. However, since the options are just for collecting annotator feedback (they are not data annotations) and the html is just a static title for the choice interface, I create a custom hashing that takes only into account the text field:
# ...
stream.apply(create_hashes, stream)
#....

def create_hashes(stream):
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text"), task_keys=("spans", "arcs"))
        yield eg

I thought I would get the same input hashes as ner.manual and rel.manual since I'm only hashing based on the text field, which is what ner.manual and rel.manual are doing (the html and options field are added only in my custom recipe as mentioned above, and therefore not present when I ran ner.manual or rel.manual). Why am I getting different input hashes than Prodigy's default recipes in this case?

To explain why I want to do this: We have annotated texts with rel.manual for a while. When the annotators found a tricky sentence, they would skip/ignore it and manually add it to an Excel table to comment on why it is tricky to annotate and classify it into a category of tricky cases. This has become time-consuming, so I am creating a custom recipe that has the blocks rel.manual, choice, and text_input so that annotators can input the tricky case category and provide comments within Prodigy without jumping to Excel. I want to run this recipe while saving to the same database we have been using. The goal is for annotators to continue annotating where they left off with rel.manual, hence why I want my custom recipe to generate input hashes compatible with the ones they have already annotated. In addition, the text field continues to be the only relevant field in our annotation efforts, since whatever is annotated on the choice and text_input interfaces will not be used for training any model; it will be for us to come back to those sentences and annotate them properly based on the annotators' feedback.

Thanks!

Hi @ale,

In fact, the default filtering is by "task". Given that the input to both recipes is the same (just text, no spans), the task hashes (as well as the input hashes) will be the same, so that's the reason it is being filtered out. To distinguish between them as you expect, you'd need to recompute the task hash on the NER examples after the annotation and before saving to the DB (e.g. using a before_db callback). Alternatively, you can disable the current dataset from excluding examples by setting auto_exclude_current to false in your prodigy.json.
Finally, it is recommended good practice to store annotations for different tasks in dedicated datasets, as it is much cleaner and easier to maintain. When you've collected all the required annotations and you are ready to create the final dataset for training, Prodigy dataset utils such as db-merge and data-to-spacy make sure that the annotations for the same input are grouped together, and there's no need to modify the default hashing mechanism.
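As a rough sketch of that before_db idea (re-hashing only once the spans/arcs are actually present on the examples):

from prodigy import set_hashes

def before_db(examples):
    # recompute both hashes after annotation, so the stored task hash
    # reflects the spans/arcs that were actually created
    return [set_hashes(eg, overwrite=True) for eg in examples]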

Why am I getting different input hashes than Prodigy's default recipes in this case?

Examples are being excluded when Prodigy first reads in the source, i.e. in the get_stream function, for example. How does your custom recipe read the input dataset?
Try setting "exclude_by": "input" in your prodigy.json to make sure the current dataset is being filtered by input, which should be the same.

Any re-hashing you apply after reading the source will be stored in the DB and taken into account when you read this data in again afterwards.

To my previous point, the most typical workflow for this case, i.e. when you start a new annotation task and want to exclude the examples already annotated in a different dataset, would be to use the --exclude flag to name the datasets to exclude (in the case of the custom recipe this would have to be re-created). That assumes each annotation task is saved to a different dataset. In your case, where you store all tasks in the same dataset, the current dataset is excluded by default, so as long as the input hashes are the same and exclude_by is set to "input", the task should not re-appear. So let's find out first if and why the input hashes are not the same.

Thanks @magdaaniol.

I'll share my custom recipe at the bottom. I have an input JSONL file that contains only texts. Here's an example:

{"text": "This is sentence number 1", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}
{"text": "This is sentence number 2", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}
{"text": "This is sentence number 3", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}

Initially, I run the following Prodigy command:

PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy rel.manual DATABASE blank:en input.jsonl --label LABEL --span-label SLABEL --wrap

I annotate the first sentence and save.

Then, I run my custom recipe with the command:

PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy ner-re-custom DATABASE input.jsonl -F custom_recipe.py

I have also run the same command with PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true, "exclude_by": "input"}'. In both cases, I am asked to annotate Sentence 1 again. I want to continue where I left off with the previous recipe, so the interface would show me Sentence 2.

On the contrary, if I run ner.manual, I can continue where I left off with rel.manual, so Prodigy shows me Sentence 2:

PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy ner.manual DATABASE blank:en input.jsonl --label LABEL

Of course, running rel.manual first and then ner.manual may not make sense; I only do this to show that Prodigy doesn't show examples previously annotated with another recipe, but when using my custom recipe it does show those examples again.

I can also confirm that my custom recipe is generating different input and task hashes than Prodigy's ner.manual and rel.manual. What can I do in my recipe to match the input and task hashes to those from Prodigy? The interface elements that I'm adding in my custom recipe (choice, text input) are for annotator feedback only, but the actual input is the text, so I want Prodigy to know that these are the same tasks as when we were annotating with rel.manual. I also want to save to the same database we have been using, or perhaps to a new one and use the --exclude flag, but I still need the hashes to match for that to work.

Here is my custom recipe:

import prodigy
from prodigy.core import Arg, recipe
from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import JSONL
import spacy
from pathlib import Path
from prodigy import set_hashes

@prodigy.recipe(
    "ner-re-custom",
    dataset = Arg(help="Dataset to save answers to."),
    file_path=Arg(help="Path to texts")
)
def ner_re_custom(dataset: str, file_path):
    
    stream = get_stream(file_path) # load in the JSON file
    nlp = spacy.blank("en") # blank spaCy pipeline for tokenization

    stream.apply(create_hashes, stream)     # set custom hashes on each example
    stream.apply(add_tokens, nlp, stream)   # tokenize the stream for ner_manual
    stream.apply(add_options, stream)  # add options to each example
    stream.apply(add_html, stream)     # add html to each example
    

    blocks = [
        {"view_id": "relations"},
        {"view_id": "choice", "text": None},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Comments", "field_id": "comments"}
    ]

    # read the js code
    custom_js_path = Path(__file__).resolve().parent / "custom.js"
    custom_js = custom_js_path.read_text()

    return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": stream,
        "config": {
            "labels": ["LABEL"],
            "relations_span_labels": ["SLABEL"],
            "blocks": blocks,
            "choice_style": "multiple",
            "wrap_relations": True,
            "javascript": custom_js,
            "custom_theme": {
                "cardMaxWidth": 1500,
                "smallText": 16,
                "relationHeightWrap": 40
            }
        }
    }

def add_options(stream):
    # Helper function to add options to every task in a stream
    options = [
        {"id": "option_1", "text": "Option 1"},
        {"id": "option_2", "text": "Option 2"},
        {"id": "option_3", "text": "Option 3"},
    ]

    for eg in stream:
        eg["options"] = options
        yield eg

def add_html(stream):
    """Adds html field to the stream examples"""
    html_string = '<h3 style="text-align: left; margin-bottom: 0;">Edge case category</h3>'
    for eg in stream:
        eg["html"] = html_string
        yield eg

def create_hashes(stream):
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text"), task_keys=("spans", "arcs"))
        yield eg

Thanks!

Hi @ale,

The reason why you end up with different input hashes is that the input_keys are not passed correctly to the set_hashes function. input_keys must be a tuple. I realize it's a pesky detail, but there's a comma missing in the input_keys value. The correct version of create_hashes is:

def create_hashes(stream):
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text",), task_keys=("spans", "arcs"), overwrite=True)
        yield eg

Without this comma, the input_keys value is being read as a string, and when the set_hashes function iterates over it, it considers each individual character as a key. Since these "keys" are not present in the task dictionary, they do not contribute at all to the hash value.
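In plain Python terms:

input_keys = ("text")   # just the string "text": iterating over it yields 't', 'e', 'x', 't'
input_keys = ("text",)  # a one-element tuple: iterating over it yields "text"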

Another thing is that you should be passing overwrite=True to overwrite the existing hashes.
Also, even though it doesn't matter in this case because none of your stream-modifying functions changes the original hashes, if you have a function that is supposed to compute the final hashes, it's best to call it as the last function, so:

stream.apply(add_tokens, nlp, stream)   # tokenize the stream for ner_manual
stream.apply(add_options, stream)  # add options to each example
stream.apply(add_html, stream)     # add html to each example
stream.apply(create_hashes, stream)     # rehash -> the final modification

Finally, as mentioned before, you'll need {"exclude_by": "input"} in your config, as the default is to exclude by task.
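If you'd rather not rely on prodigy.json, the same setting should also work in the config dictionary returned from the recipe (recipe config is merged with prodigy.json and the CLI overrides, as discussed earlier), e.g.:

    return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": stream,
        "config": {
            "labels": ["LABEL"],
            "relations_span_labels": ["SLABEL"],
            "blocks": blocks,
            "choice_style": "multiple",
            "exclude_by": "input"   # filter already annotated examples by input hash
        }
    }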

Thanks @magdaaniol! It works perfectly


Hi @magdaaniol,

I'm using the custom recipe (with your fixes from above) on a new dataset, and I'm using "exclude_by": "input" and "exclude": "prev_dataset" to exclude annotations made in our previous dataset with rel.manual (as I mentioned above). However, I noticed that when trying to exclude by task, which, as you said, is Prodigy's default behaviour, the interface shows the same examples annotated in our previous dataset again. I understand that this is because the task hashes are different between the two datasets, but the input hashes are the same. Why is this the case? In the dataset from rel.manual, hashes would have been calculated with the default fields mentioned in the documentation (("spans", "label", "options")). The input JSONL (same for both datasets) only has "text" and "meta" fields, so none of the default task hash fields are present. Furthermore, my custom recipe calculates task hashes using the fields task_keys=("spans", "arcs"), removing the "options" field from the task hash since I add options to the examples in my custom recipe. As I mentioned before, the options are for annotator feedback and they should not be used to differentiate annotation tasks, because they are still the same annotation tasks in our case. Shouldn't the task hashes be the same between rel.manual and my custom recipe, given that none of the task fields from either recipe are present in the input data?

I want to follow Prodigy's best practices. If I understand correctly, the rationale for task hashes is to differentiate annotation questions. Given that in our case the annotation question remains the same (annotate relations in a text), I think the task hashes should also be the same between these datasets (the previous one from rel.manual and the new one from my custom recipe). The information from the choice and text_input interfaces that I added to my custom recipe was collected via an external spreadsheet when we were using rel.manual; this info is for improving our annotation guidelines, not data to train a model. To wrap up, my questions are:

  • How can I fix the task hashes being different between both recipes?
  • Given that we have already annotated some texts with the custom recipe, how can I update the task hashes of these annotations to make the fix retroactive?

Thanks!

Hi @ale,

Let me first explain why you're observing different task_hash values for rel.manual and your custom recipe:

The function that computes the hashes takes into account not only the keys but also their values (as well as the input hash, in the case of the task_hash).

Importantly, if the values for the keys are not present, the function just uses the entire serialized task.
In the case of your custom recipe, the computation of the input_hash and the task_hash would look as follows:

# INPUT_HASH
TEXT "This is sentence number 1"
INPUT HASH KEYS ('text',)
VALUES TO USE FOR HASHING:  text="This is sentence number 1"
INPUT HASH 877456689

# task hash
TEXT "This is sentence number 1"
KEYS ('spans', 'arcs')
VALUES None # there are no annotation on the input
DEFAULT VALUES TO USE FOR HASHING {"_input_hash":877456689,"html":"<h3 style=\"text-align: left; margin-bottom: 0;\">Edge case category</h3>","meta":{"sentence_uid":"ID_number"},"options":[{"id":"option_1","text":"Option 1"},{"id":"option_2","text":"Option 2"},{"id":"option_3","text":"Option 3"}],"text":"This is sentence number 1","tokens":[{"end":4,"id":0,"start":0,"text":"This","ws":true},{"end":7,"id":1,"start":5,"text":"is","ws":true},{"end":16,"id":2,"start":8,"text":"sentence","ws":true},{"end":23,"id":3,"start":17,"text":"number","ws":true},{"end":25,"id":4,"start":24,"text":"1","ws":false}]}
TASK HASH -549215909

As you can see, the task_hashes will be different because the keys you have defined are not present (yet) and a different default value is being used. This is why you should be excluding by input if you don't want to see these examples again.
In fact, although Prodigy excludes by task by default, manual and semi-manual workflows (including ner.manual and rel.manual) exclude by input (which I should have mentioned before).

I think the main point of confusion here is that you thought this custom hashing function is applied after the annotations were done (e.g. before saving to the DB). Instead, the task hashes both for rel.manual and your custom recipe are being computed on the input. This results in different defaults, and thus different task hashes.
For this reason, excluding by input is best for manual or semi-manual workflows where you want to make sure you're not annotating the same text twice. Filtering by task is meant for binary annotation workflows, to avoid asking about the same model suggestion multiple times, or for finding examples that received exactly the same annotations.

If you effectively want to create the same kind of annotations (relations) without re-annotating examples that were previously manually annotated with the same kind of task, it's best to exclude by input, as this is what you really want to compare: the inputs, not the existing annotations.

How can I fix the task hashes being different between both recipes?

Following Prodigy best practices, these are not exactly the same tasks, as you're effectively asking a slightly different question (the extra options etc.), but if you want rel.manual and your custom recipe to compute the same input and task hashes, do not call the create_hashes function. The only effect it has when processing the input stream is that it modifies the default values on which the task_hash is computed.
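In other words, the apply chain from your recipe would look roughly like this, keeping everything else as is and just dropping the create_hashes step:

stream = get_stream(file_path)          # default hashes are computed on the raw input, as in rel.manual
stream.apply(add_tokens, nlp, stream)   # tokenize the stream for ner_manual
stream.apply(add_options, stream)       # add options to each example
stream.apply(add_html, stream)          # add html to each example
# no create_hashes call: keep the default hashes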

Given that we have already annotated some texts with the custom recipe, how can I update the task hashes of these annotations to make the fix retroactive?

You can modify the existing task hashes with the following script:

from prodigy.util import set_hashes
from prodigy.components.stream import get_stream
from prodigy.types import StreamType
import srsly
import copy

def rewrite_hashes(stream: StreamType) -> StreamType:
    for eg in stream:
        copied_eg = copy.deepcopy(eg)
        input_text = eg.get("text")
        meta = eg.get("meta")
        only_input = {"text": input_text, "meta": meta}
        only_input_rehashed = set_hashes(only_input, overwrite=True)
        
        copied_eg.update({
            "_input_hash": only_input_rehashed["_input_hash"],
            "_task_hash": only_input_rehashed["_task_hash"]
        })
        
        yield copied_eg

stream = get_stream("test.jsonl", rehash=False)
stream.apply(rewrite_hashes)
srsly.write_jsonl("rehashed_test.jsonl", stream)

On a final note, the task_hash will become more relevant when you're comparing the annotations created for the same input. So rather than reassigning the task hashes as if there were no annotations (which is what the script above would do), it's probably best to continue the annotation excluding by input to avoid re-annotating the same examples, and then run a similar script recomputing the task hashes in the way that is most useful for your project. Please note that the default Prodigy hashing function takes into account the values as well (as stated above), so if you want to take only the keys into account you'll need a custom hashing function.