Prodigy hashing behavior

Thanks for all your help @magdaaniol.

I have some further questions regarding the hashes in custom recipes and Prodigy in general.

  1. In the documentation, where are task_keys extracted from? The default is ("spans", "label", "options"). Are these from the recipe dictionary or attributes of each example or somewhere else?

  2. Using the custom recipe cat-facts example from above, I ran a small test with two annotators: jane and joe. First, I annotated sentences with "labels": ["RELEVANT"] with jane. Then I changed the recipe's labels to "labels": ["CAT"] and annotated with joe. For both annotators, the same sentences have equal input hashes (expected) but also same task hashes. Shouldn't the task hashes be different because I'm using different labels?


  3. On a separate topic, is it possible to have more than 1 interface of the same interface type? For example, a custom recipe with two choice interfaces, each with different options.

  4. Is it possible to add a static question (or title) above the choice interface in the cat-facts recipe?

  5. Is it possible to add theme options to the recipe so that it is not necessary to specify them in prodigy.json? For example, relationHeight and relationHeightWrap from the documentation.

Thanks

Hi @ale,

Some answers inline:

  1. In the documentation, where are task_keys extracted from? The default is ("spans", "label", "options"). Are these from the recipe dictionary or attributes of each example or somewhere else?

These are extracted from the attributes of each example, yes. The built-in recipes create certain task structures (dictionaries) specific to each recipe. Thus, if you want to add a custom task_key for the hashing function to use, it should be a first-level key on the task dictionary.
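
For illustration, here's a minimal sketch of including a custom first-level key in the task hash (the my_priority key is purely hypothetical):

from prodigy import set_hashes

# "my_priority" is a hypothetical first-level key on the task dictionary,
# so it can be passed as an extra task_key to the hashing function
eg = {"text": "Cats sleep for most of the day.", "my_priority": "high"}
eg = set_hashes(eg, input_keys=("text",), task_keys=("spans", "label", "options", "my_priority"))
print(eg["_input_hash"], eg["_task_hash"])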

  2. Using the custom recipe cat-facts example from above, I ran a small test with two annotators: jane and joe. First, I annotated sentences with "labels": ["RELEVANT"] with jane. Then I changed the recipe's labels to "labels": ["CAT"] and annotated with joe. For both annotators, the same sentences have equal input hashes (expected) but also same task hashes. Shouldn't the task hashes be different because I'm using different labels?

The default keys used for computing the task_hash are: spans, label, options, arcs. If you look closely, there's no label attribute on the custom task here. The label attribute is stored for binary classification tasks. In this case, the config attribute labels is used for determining the UI, and the labels will be stored under spans if there are any. Thus, for NER, the task hash is affected by pre-existing spans, not by the set of labels available. The idea is to distinguish between the "kinds" of annotation or what is being annotated, not particular label sets.
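
To make this concrete, here's a small sketch (assuming the default hashing keys) showing that pre-existing spans change the task hash, while the labels configured in the recipe don't enter into it at all:

from prodigy import set_hashes

eg_plain = {"text": "Cats are great."}
eg_with_span = {"text": "Cats are great.", "spans": [{"start": 0, "end": 4, "label": "ANIMAL"}]}

hashed_plain = set_hashes(dict(eg_plain))
hashed_span = set_hashes(dict(eg_with_span))

assert hashed_plain["_input_hash"] == hashed_span["_input_hash"]  # same text -> same input hash
assert hashed_plain["_task_hash"] != hashed_span["_task_hash"]    # pre-existing spans -> different task hash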

On a separate topic, is it possible to have more than 1 interface of the same interface type? For example, a custom recipe with two choice interfaces, each with different options.

Technically, you could define multiple choice blocks. You would need to add the respective options as the value of the "options" key in the block definition:

blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "choice", "text": None, "options": options},
        {"view_id": "choice", "text": None, "options": options2},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
    ]

Please note that all answers will be written under the same accept key, so in order to be able to mark the options from both blocks, you would need to switch to the "multiple" choice style. With the single style, only one answer will be permitted across both blocks. Also, by default, the keyboard shortcuts will be the same for both blocks, so you might want to modify them or disable them completely via custom JavaScript.

If you want more flexibility/control over the final UI, you can always use custom HTML and JavaScript and build your own form with multiple checkboxes / radio button groups. The window.prodigy.update callback lets you update the current task with any custom data, like information about the checkbox that was selected. Here's a straightforward example of a custom checkbox:

  4. Is it possible to add a static question (or title) above the choice interface in the cat-facts recipe?

Yes, you can achieve that by adding another html block on top of the existing choice block:

 blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "html"},
        {"view_id": "choice", "text": None, "html":None, "options": options},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
    ]

Note that, similarly to text, html has to be set to None in the choice view_id definition to prevent the text from appearing twice.
The html view_id expects an html field on the task, so that will have to be added while you're creating the tasks:

 def get_stream():
        res = requests.get("https://cat-fact.herokuapp.com/facts").json()
        for fact in res:
            yield {"text": fact["text"], "options": options, "html":"<h2>This is my static question</h2>"}

You can also add extra styling, of course. Please check the custom interfaces section on html and css for examples.

  5. Is it possible to add theme options to the recipe so that it is not necessary to specify them in prodigy.json? For example, relationHeight and relationHeightWrap from the documentation.

Yes, Prodigy merges the configuration from the global and the local prodigy.json, CLI overrides and the config key returned from the recipe. So you can return a custom_theme dictionary under the config key of the dictionary returned from the recipe:

  return {
        "dataset": dataset,          # the dataset to save annotations to
        "view_id": "blocks",         # set the view_id to "blocks"
        "stream": stream,            # the stream of incoming examples
        "config": {
            "labels": ["CAT"],  # the labels for the manual NER interface
            "blocks": blocks,  # add the blocks to the config
            "custom_theme": {"buttonSize": 500}  # set custom theme options    
        }
    }

Thanks @magdaaniol.

Some more questions:

  1. In question 4 above, regarding adding a custom section title via HTML on top of the choice interface: Is it ok to use the HTML rendered by the choice view_id instead of using a custom html view_id? Running a test, I kept the blocks variable as the original (code below), and added the "html" field to the task dictionary. I also added some inline CSS formatting and it looks as intended.
 blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "choice", "text": None},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Comments", "field_id": "comments"}
    ]
for task in stream:
        task["options"] = options
        task["html"] = '<h2 style="text-align: left; margin-bottom: 0;">Edge case category</h2>'
        yield task
  2. On a related note to the question above, why does the choice interface render the html field from examples by default but other interfaces like ner_manual don't render the html? What can I do to render different HTML titles for say, each section (ner_manual, choice, and text_input)?

  3. Is there a difference between adding task-formatting code (like adding options or html fields) inside of a custom get_stream() function vs a separate function like add_options()? For example:
    Everything under get_stream():

def get_stream():
    jsonl_stream = JSONL(file_path)

    options = [
        {"id": "option_1", "text": "Option 1"},
        {"id": "option_2", "text": "Option 2"}
    ]

    for task in jsonl_stream:
        task["options"] = options
        task["html"] = '<h2 style="text-align: left; margin-bottom: 0;">Select an option</h2>'
        yield task

Separated into functions:

stream = get_stream(file_path) # load in the JSON file
stream = add_options(stream) # add options to each task
  4. For the text_input interface, is there some setting I can adjust to defocus the text input field when pressing the ESC key after writing some text in the box?

Hi @ale!

Sure, if that's more convenient. I can't think of any side effects to doing it this way except for the recipe code being slightly less readable.

  2. On a related note to the question above, why does the choice interface render the html field from examples by default but other interfaces like ner_manual don't render the html? What can I do to render different HTML titles for say, each section (ner_manual, choice, and text_input)?

I suppose the reason why the other interfaces do not render additional HTML on top of the task is that this is not a typical setup. To define the NER task you shouldn't really need any extra text. Conversely, for a choice-like task, you would normally accompany the options with the actual question.
Our recommended method for annotating data is to do one annotation task at a time and combine UIs only if it's strictly necessary (we talk about it in the docs here if you're interested in more details on the topic). This is a proven way to get more accurate annotations faster. Following this principle, the NER interface, in most cases, wouldn't require any extra headings.
But then there are, of course, different cases, and custom recipes are precisely there to accommodate special requirements. I understand that in your case, the HTML blocks would serve to separate the different UIs on the screen. If it's the same HTML for all tasks, you can achieve that by introducing multiple html blocks. The docs linked above contain an example.

  3. Is there a difference between adding task-formatting code (like adding options or html fields) inside of a custom get_stream() function vs a separate function like add_options()?

Technically, there's not. You can either have it separately or inside one function. That said, it's probably best to separate these two jobs (reading the file and modifying the examples) into separate functions for easier testing, debugging, error catching etc.
Also, please note that Prodigy has the get_stream helper which you can reuse (instead of overloading it and calling the loader yourself). This is another argument for having it separately: you can reuse Prodigy's get_stream and apply all the modifications in separate step(s). In fact, if you're on Prodigy > 1.12, get_stream returns a Stream object which has a convenient apply method that lets you apply the stream modification functions.
The cat-facts example uses the JSONL loader, which is the older interface for loading files and returns a generator object. Using Stream and apply is the newer and recommended way of handling input files.
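
As a rough sketch of that pattern (assuming an input.jsonl source and a simple add_options helper; both names are just placeholders):

from prodigy.components.stream import get_stream

def add_options(stream):
    # add the same choice options to every incoming task
    options = [
        {"id": "option_1", "text": "Option 1"},
        {"id": "option_2", "text": "Option 2"},
    ]
    for eg in stream:
        eg["options"] = options
        yield eg

stream = get_stream("input.jsonl")   # returns a Stream object on Prodigy >= 1.12
stream.apply(add_options, stream)    # register the modification on the stream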

  4. For the text_input interface, is there some setting I can adjust to defocus the text input field when pressing the ESC key after writing some text in the box?

There isn't a setting for this, but you can add this key binding via custom JavaScript:

document.addEventListener('prodigyupdate', event => {
    // listen for updates in the UI
    var input = document.querySelector('.prodigy-text-input');
    input.onkeydown = function(event) {
        // upon pressing ESC
        if (event.keyCode === 27) {
            // remove focus from the input
            input.blur();
        }
    }
});

Assuming this code is in the custom.js file, in the recipe:

# read the js code
custom_js = Path("custom.js").read_text()
# pass it to the Prodigy Controller in the config
  return {
        "dataset": dataset,          # the dataset to save annotations to
        "view_id": "blocks",         # set the view_id to "blocks"
        "stream": stream,            # the stream of incoming examples
        "config": {
            "labels": ["CAT"],  # the labels for the manual NER interface
            "blocks": blocks, 
            "javascript": custom_js      # add custom javascript
        }
    }

Thanks again @magdaaniol!

Some more questions:

  1. Above you said that I can disable keyboard shortcuts for choice interfaces in Prodigy via custom Javascript to avoid conflict when having two choice interfaces. Currently, I have an interface with a relationships block that also supports NER and a choice block. I see that pressing number keys activates both NER labels and choice options. Can you share a snippet to remove these number key shortcuts?
  2. This question is related to our previous discussion here. You mentioned above that I can introduce the same HTML title in different parts of the interface using multiple html blocks. For a different application, how can I add a different HTML block? Specifically, I want to keep the HTML title above the choice block and add another HTML above the relations block with the same text that will be displayed in relations (but rendered in plain HTML). The reason for this is that some texts we annotate are paragraphs that cannot be split into sentences because we're annotating cross-sentence relationships. It may be useful for the annotators to be able to read the text in simple HTML and then annotate it in the relations block. That is because we find it difficult to read these multiline paragraphs in the relations interface.

Thanks!

Hi @ale,

No problem - always happy to help :wink:

Responding inline:

Can you share a snippet to remove these number key shortcuts?

I just double-checked, and removing the key shortcuts from the UI is only part of the solution. You still have to remap the numerical key shortcuts because they will apply even though the options are not visible.
There's a separate forum post that describes how to go about remapping the default key shortcuts.
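
As a hedged sketch of what such a remapping can look like (assuming the keymap_by_label config setting and the hypothetical option IDs option_1/option_2/option_3 from earlier), you could move the choice options onto letter keys so they no longer collide with the numeric label shortcuts:

return {
    "dataset": dataset,
    "view_id": "blocks",
    "stream": stream,
    "config": {
        "blocks": blocks,
        # map choice option IDs to letters so the numeric keys stay
        # reserved for the NER/relations labels
        "keymap_by_label": {"option_1": "q", "option_2": "w", "option_3": "e"},
    },
}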

Then, if you do choose to remove the visualization of the shortcuts from the UI as well (it shouldn't be necessary, though, if you do the remapping), here's the custom CSS snippet:

div.prodigy-option > span {
    display: none;
}

You can pass it as a string directly to the global_css setting in prodigy.json, or save it in a file (e.g. custom.css) and read it from the recipe:

custom_css = Path("custom.css").read_text()

and pass it as global_css in the recipe return statement e.g.:

 return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": stream,
        "config": {
            "labels": ["LABEL"],
            "blocks": blocks,
            "choice_style": "single",
            "global_css": custom_css
        }
    }

Specifically, I want to keep the HTML title above the choice block and add another HTML above the relations block with the same text that will be displayed in relations (but rendered in plain HTML).

That should be possible by specifying two html blocks, each with a different value. The title above the choice block could just contain the raw html value (it's a static text which would be the same for all tasks), and the other html block would display the text of the relations task via an html_template. Templates have access to the task dictionary and you can use dot notation to access nested keys:

blocks = [
        {"view_id": "html", "html": "<p>This is a fixed html</p>"},
        {"view_id": "choice", "text": None},
        {"view_id": "html", "html_template":"{{text}}"},
        {"view_id": "relations"}
    ]

Thanks @magdaaniol.

I have some questions about hashing. Above, we discussed that Prodigy's default behaviour is to call set_hashes with the following defaults as per the documentation:

default_input_keys = ("text", "image", "html", "input")
default_task_keys = ("spans", "label", "options")
stream = (set_hashes(eg, input_keys = default_input_keys, task_keys = default_task_keys) for eg in stream)

Which as I understand would be the same as:

stream = (set_hashes(eg) for eg in stream)
  1. Let's say I run two different recipes (first ner.manual and then rel.manual) and save to the same database with the same input JSONL. I see that Prodigy doesn't show texts annotated with ner.manual when I later run rel.manual. I understand that this is because in both recipes the texts have the same input hashes. Is that correct?
  2. Later, I run my custom recipe with the same JSONL input and save to the same database as in 1. This time, I see all texts that were already annotated with ner.manual and rel.manual. However, I don't want to annotate those texts again, but rather pick up from where annotation with rel.manual finished. In my custom recipe I pre-process the input to add the fields options and html. However, since the options are just for collecting annotator feedback (they are not data annotations) and the html is just a static title for the choice interface, I create a custom hashing function that takes into account only the text field:
# ...
stream.apply(create_hashes, stream)
#....

def create_hashes(stream):
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text"), task_keys=("spans", "arcs"))
        yield eg

I thought I would get the same input hashes as ner.manual and rel.manual since I'm only hashing based on the text field, which is what ner.manual and rel.manual are doing (the html and options field are added only in my custom recipe as mentioned above, and therefore not present when I ran ner.manual or rel.manual). Why am I getting different input hashes than Prodigy's default recipes in this case?

To explain why I want to do this: We have annotated texts with rel.manual for a while. When the annotators found a tricky sentence, they would skip/ignore it and manually add it to an Excel table to add comments on why it is tricky to annotate and classify it into a category of tricky cases. This has become time-consuming, so I am creating a custom recipe that has the blocks rel.manual, choice, and text_input so that annotators can input the tricky case category and provide comments within Prodigy without jumping to Excel. I want to run this recipe while saving to the same database we have been using. The goal is for annotators to continue annotating where they left off with rel.manual, hence why I want my custom recipe to generate input hashes compatible with the ones they have already annotated. In addition, the text field continues to be the only relevant field in our annotation efforts, since whatever is annotated in the choice and text_input interfaces will not be used for training any model; it will be for us to come back to those sentences and annotate them properly based on the annotators' feedback.

Thanks!

Hi @ale,

In fact, the default filtering is by "task". Given that the input to both recipes is the same (just text, no spans), the task hashes (as well as the input hashes) will be the same, so that's the reason it is being filtered out. To distinguish between them as you expect, you'd need to recompute the task hash on the NER examples after the annotation and before saving to the DB (e.g. using a before_db callback, sketched below). Alternatively, you can disable the current dataset from excluding examples by setting auto_exclude_current to false in your prodigy.json.
Finally, it is recommended good practice to store annotations for different tasks in dedicated datasets, as it is much cleaner and easier to maintain. When you've collected all the required annotations and you are ready to create the final dataset for training, Prodigy dataset utils such as db-merge and data-to-spacy make sure that the annotations for the same input are grouped together, and there's no need to modify the default hashing mechanism.
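
A minimal sketch of what such a before_db callback could look like (re-applying the default hashing once the annotations exist):

from prodigy import set_hashes

def before_db(examples):
    # re-hash the annotated examples so the task hashes reflect the
    # annotations (e.g. spans) before the answers are written to the DB
    return [set_hashes(eg, overwrite=True) for eg in examples]

# ... and in the recipe's return dictionary:
# return {"dataset": dataset, "stream": stream, "view_id": "ner_manual", "before_db": before_db}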

Why am I getting different input hashes than Prodigy's default recipes in this case?

Examples are being excluded when Prodigy first reads in the source, i.e., for example, in the get_stream function. How does your custom recipe read the input dataset?
Try setting "exclude_by": "input" in your prodigy.json to make sure the current dataset is being filtered by input, which should be the same.

Any re-hashing you apply after reading the source will be stored in the DB and taken into account when you read this data in again afterwards.

To my previous point, the most typical workflow for this case, i.e. when you start a new annotation task and want to exclude the examples already annotated in a different dataset, would be to use the --exclude flag for naming the datasets to exclude (in the case of the custom recipe this would have to be re-created). That is assuming each annotation task is saved to a different dataset. In your case, where you store all tasks in the same dataset, the current dataset is being excluded by default, so as long as the input hashes are the same and exclude_by is set to "input", the task should not re-appear. So let's find out first if and why the input hashes are not the same.

Thanks @magdaaniol.

I'll share my custom recipe at the bottom. I have an input JSONL file that contains only texts. Here's an example:

{"text": "This is sentence number 1", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}
{"text": "This is sentence number 2", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}
{"text": "This is sentence number 3", "meta": {"sentence_uid": "ID_number", "source": "SOURCE"}}

Initially, I run the following Prodigy command:

PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy rel.manual DATABASE blank:en input.jsonl --label LABEL --span-label SLABEL --wrap

I annotate the first sentence and save.

Then, I run my custom recipe with the command:

PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy ner-re-custom DATABASE input.jsonl -F custom_recipe.py

I have also run the same command with PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true, "exclude_by": "input"}'. In both cases, I am asked to annotate Sentence 1 again. I want to continue where I left off with the previous recipe, so the interface would show me Sentence 2.

In contrast, if I run ner.manual, I can continue where I left off with rel.manual, so Prodigy shows me Sentence 2:

PRODIGY_ALLOWED_SESSIONS=alex PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' pgy ner.manual DATABASE blank:en input.jsonl --label LABEL

Of course, running rel.manual first and then ner.manual may not make sense; I only do this to show that Prodigy doesn't show examples previously annotated with another recipe, but when using my custom recipe it does show those examples again.

I can also confirm that my custom recipe is generating different input and task hashes than Prodigy's ner.manual and rel.manual. What can I do in my recipe to match the input and task hashes to those from Prodigy? The interface elements that I'm adding in my custom recipe (choice, text input) are for annotator feedback only, but the actual input is the text, so I want Prodigy to know that these are the same tasks as when we were annotating with rel.manual. I also want to save to the same database we have been using, or perhaps to a new one and use the --exclude flag, but I still need the hashes to match for that to work.

Here is my custom recipe:

import prodigy
from prodigy.core import Arg, recipe
from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.components.loaders import JSONL
import spacy
from pathlib import Path
from prodigy import set_hashes

@prodigy.recipe(
    "ner-re-custom",
    dataset = Arg(help="Dataset to save answers to."),
    file_path=Arg(help="Path to texts")
)
def ner_re_custom(dataset: str, file_path):
    
    stream = get_stream(file_path) # load in the JSON file
    nlp = spacy.blank("en") # blank spaCy pipeline for tokenization

    stream.apply(create_hashes, stream)     # add custom hashes to each example
    stream.apply(add_tokens, nlp, stream)   # tokenize the stream for ner_manual
    stream.apply(add_options, stream)  # add options to each example
    stream.apply(add_html, stream)     # add html to each example
    

    blocks = [
        {"view_id": "relations"},
        {"view_id": "choice", "text": None},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Comments", "field_id": "comments"}
    ]

    # read the js code
    custom_js_path = Path(__file__).resolve().parent / "custom.js"
    custom_js = custom_js_path.read_text()

    return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": stream,
        "config": {
            "labels": ["LABEL"],
            "relations_span_labels": ["SLABEL"],
            "blocks": blocks,
            "choice_style": "multiple",
            "wrap_relations": True,
            "javascript": custom_js,
            "custom_theme": {
                "cardMaxWidth": 1500,
                "smallText": 16,
                "relationHeightWrap": 40
            }
        }
    }

def add_options(stream):
    # Helper function to add options to every task in a stream
    options = [
        {"id": "option_1", "text": "Option 1"},
        {"id": "option_2", "text": "Option 2"},
        {"id": "option_3", "text": "Option 3"},
    ]

    for eg in stream:
        eg["options"] = options
        yield eg

def add_html(stream):
    """Adds html field to the stream examples"""
    html_string = '<h3 style="text-align: left; margin-bottom: 0;">Edge case category</h3>'
    for eg in stream:
        eg["html"] = html_string
        yield eg

def create_hashes(stream):
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text"), task_keys=("spans", "arcs"))
        yield eg

Thanks!

Hi @ale,

The reason why you end up with different input hashes is that the input_keys are not passed correctly to the set_hashes function. input_keys must be a tuple. I realize it's a pesky detail, but there's a comma missing in the input_keys value. The correct version of create_hashes is:

def create_hashes(stream):
    for eg in stream:
        eg = set_hashes(eg, input_keys=("text",), task_keys=("spans", "arcs"), overwrite=True)
        yield eg

Without this comma the input_keys are being read as a string, and when the set_hashes function iterates over it, it considers each individual character as a key. Since these "keys" are not present in the task dictionary, they do not contribute at all to the hash value.
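
A quick way to see the difference in plain Python:

keys_as_string = ("text")    # no trailing comma -> just the string "text"
keys_as_tuple = ("text",)    # trailing comma -> a one-element tuple

print(list(keys_as_string))  # ['t', 'e', 'x', 't'] - each character treated as a "key"
print(list(keys_as_tuple))   # ['text']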

Another thing is that you should be passing overwrite=True to overwrite the existing hashes.
Also, even though it doesn't matter in this case because none of your stream-modifying functions changes the original hashes, if you have a function that is supposed to compute the final hashes, it's best to call it as the last function, so:

stream.apply(add_tokens, nlp, stream)   # tokenize the stream for ner_manual
stream.apply(add_options, stream)  # add options to each example
stream.apply(add_html, stream)     # add html to each example
stream.apply(create_hashes, stream)     # rehash -> the final modification

Finally, as mentioned before, you'll need {"exclude_by": "input"} in your config as the default is by task.

Thanks @magdaaniol! It works perfectly


Hi @magdaaniol,

I'm using the custom recipe (with your fixes from above) on a new dataset, and I'm using "exclude_by": "input" and "exclude": "prev_dataset" to exclude annotations made in our previous dataset with rel.manual (as I mentioned above). However, I noticed that when trying to exclude by task, which as you said is Prodigy's default behaviour, the interface shows the same examples annotated in our previous database again. I understand that this is because task hashes are different between the two databases, but input hashes are the same. Why is this the case? In the database from rel.manual, hashes would have been calculated with the default fields mentioned in the documentation (("spans", "label", "options")). The input JSONL (same for both databases) only has "text" and "meta" fields, so none of the default task hash fields are present. Furthermore, my custom recipe calculates task hashes using the fields task_keys=("spans", "arcs"), removing the "options" field from the task hash since I add options to the examples in my custom recipe. As I mentioned before, the options are for annotator feedback and they should not be used to differentiate annotation tasks, because they are still the same annotation tasks in our case. Shouldn't task hashes be the same between rel.manual and my custom recipe given that none of the task fields from either recipe are present in the input data?

I want to follow Prodigy's best practices. If I understand correctly, the rationale for task hashes is to differentiate annotation questions. Given that in our case the annotation question remains the same (annotate relations in a text), I think the task hashes should also be the same between these datasets (the previous one from rel.manual and the new one from my custom recipe). The information from the choice and text_input interfaces that I added to my custom recipe was collected via an external spreadsheet when we were using rel.manual; this info is for improving our annotation guidelines, not data to train a model. To wrap up, my questions are:

  • How can I fix the task hashes being different between both recipes?
  • Given that we have already annotated some texts with the custom recipe, how can I update the task hashes of these annotations to make the fix retroactive?

Thanks!

Hi @ale,

Let me first explain why you're observing different task_hash values for rel.manual and your custom recipe:

The function that computes the hashes takes into account not only the keys but also their values (as well as the input hash, in the case of the task_hash).

Importantly, if the values for the keys are not present, the function just uses the entire serialized task.
In the case of your custom recipe, the computation of the input_hash and the task_hash would look as follows:

# INPUT_HASH
TEXT "This is sentence number 1"
INPUT HASH KEYS ('text',)
VALUES TO USE FOR HASHING:  text="This is sentence number 1"
INPUT HASH 877456689

# TASK_HASH
TEXT "This is sentence number 1"
KEYS ('spans', 'arcs')
VALUES None # there are no annotations on the input
DEFAULT VALUES TO USE FOR HASHING {"_input_hash":877456689,"html":"<h3 style=\"text-align: left; margin-bottom: 0;\">Edge case category</h3>","meta":{"sentence_uid":"ID_number"},"options":[{"id":"option_1","text":"Option 1"},{"id":"option_2","text":"Option 2"},{"id":"option_3","text":"Option 3"}],"text":"This is sentence number 1","tokens":[{"end":4,"id":0,"start":0,"text":"This","ws":true},{"end":7,"id":1,"start":5,"text":"is","ws":true},{"end":16,"id":2,"start":8,"text":"sentence","ws":true},{"end":23,"id":3,"start":17,"text":"number","ws":true},{"end":25,"id":4,"start":24,"text":"1","ws":false}]}
TASK HASH -549215909

As you can see, the task_hashes will be different because the keys you have defined are not present (yet) and a different default value is being used. This is why you should be excluding by input if you don't want to see these examples again.
In fact, although Prodigy excludes by task by default, manual and semi-manual workflows (including ner.manual and rel.manual) exclude by input (which I should have mentioned before).

I think the main point of confusion here is that you thought this custom hashing function is applied after the annotations were done (e.g. before saving to the DB). Instead, the task hashes both for rel.manual and your custom recipe are being computed on the input. This results in different defaults, and thus different task hashes.
For this reason, excluding by input is best for manual or semi-manual workflows where you want to make sure you're not annotating the same text twice. Filtering by task is meant for binary annotation workflows, to avoid asking about the same model suggestion multiple times, or for finding examples that received exactly the same annotations.

If you effectively want to create the same kind of annotations (relations) without re-annotating examples that were previously manually annotated with the same kind of task, it's best to exclude by input, as this is what you really want to compare: the inputs, and not the existing annotations.

How can I fix the task hashes being different between both recipes?

Following Prodigy best practices, these are not exactly the same tasks, as you're effectively asking a slightly different question (the extra options etc.), but if you want rel.manual and your custom recipe to compute the same input and task hashes, do not call the create_hashes function. The only effect it has when processing the input stream is that it modifies the default values on which the task_hash is being computed.

Given that we have already annotated some texts with the custom recipe, how can I update the task hashes of these annotations to make the fix retroactive?

You can modify the existing task hashes with the following script:

from prodigy.util import set_hashes
from prodigy.components.stream import get_stream
from prodigy.types import StreamType
import srsly
import copy

def rewrite_hashes(stream: StreamType) -> StreamType:
    for eg in stream:
        copied_eg = copy.deepcopy(eg)
        input_text = eg.get("text")
        meta = eg.get("meta")
        only_input = {"text": input_text, "meta": meta}
        only_input_rehashed = set_hashes(only_input, overwrite=True)
        
        copied_eg.update({
            "_input_hash": only_input_rehashed["_input_hash"],
            "_task_hash": only_input_rehashed["_task_hash"]
        })
        
        yield copied_eg

stream = get_stream("test.jsonl", rehash=False)
stream.apply(rewrite_hashes)
srsly.write_jsonl("rehashed_test.jsonl", stream)

On a final note, the task_hash will become more relevant when you're comparing the annotations created for the same input. So rather than reassigning the task hashes as if there were no annotations (which is what the script above would do), it's probably best to continue the annotation excluding by input to avoid re-annotating the same examples, and then run a similar script recomputing the task hashes in the way that is most useful for your project. Please note that the default Prodigy hashing function takes into account the values as well (as stated above), so if you want to take only the keys into account you'll need a custom hashing function.
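
For example, a hedged sketch of a keys-only hashing helper (this uses Python's built-in hashlib purely for illustration and is not Prodigy's internal hashing function):

import hashlib

def keys_only_task_hash(eg, keys=("spans", "arcs", "options")):
    # hash only *which* annotation keys are present, ignoring their values
    present = ",".join(sorted(k for k in keys if k in eg))
    digest = hashlib.md5(present.encode("utf8")).hexdigest()
    return int(digest[:8], 16)  # Prodigy hashes are ints

eg = {"text": "This is sentence number 1", "arcs": [], "options": [{"id": "option_1", "text": "Option 1"}]}
eg["_task_hash"] = keys_only_task_hash(eg)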

