Custom templates with custom DB and exclude logic

I have created a custom recipe using HTML. The point of the recipe is to check whether two data entries are the same (a dictionary displayed as two tables).

I have managed to get Prodigy to loop through the examples and store something in a PostgreSQL database. However, it seems like it does not store the answers. It also does not keep track of which objects have already been compared, so it starts from the beginning every time.

To be clear, I just want to create an annotated dataset, not use any of the models provided by Prodigy. I find the documentation a bit lacking on custom recipes. :frowning:

Here is an image of the interface:

Here is my code:

import prodigy
from prodigy.components.loaders import JSON  # Prodigy's built-in JSON loader

@prodigy.recipe('match_recipe',
                dataset=prodigy.recipe_args['dataset'],
                file_path=("Path to texts", "positional", None, str))
def match_recipe(dataset, file_path):
    json_compare = JSON('/home/compare_test.json')
    stream = format_matches(json_compare)

    return {
        'dataset': dataset,
        'view_id': 'html',
        'stream': stream,
        'db': True,
    }

Thanks for sharing – the HTML template looks really nice! It also looks like you've been doing everything correctly so far.

Does it not store anything in the database, or just not the "answer" key? Can you export your annotations using the db-out command, and could you post an example of an annotation task in your stream? Also, is this specific to PostgreSQL, i.e. does it work as expected using the default SQLite storage?
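For reference, a quick way to export on the command line (the exact output handling depends on your Prodigy version – newer versions write JSONL to stdout, older ones take an output directory):

prodigy db-out your_dataset > annotations.jsonl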

Another thing that's always helpful for debugging is to run your commands with the environment variable PRODIGY_LOGGING=basic. This will log everything that's going on and makes it easier to spot problems – or track down an error.
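For example, assuming your custom recipe lives in a file called recipe.py (adjust the names and paths to your setup):

PRODIGY_LOGGING=basic prodigy match_recipe your_dataset ./compare_test.json -F recipe.py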

By default, Prodigy makes as few assumptions about your stream as possible – but you can tell it to exclude examples that are already present in one or more datasets. Simply add the dataset names to the 'exclude' setting returned by your recipe. In your case, this would just be the current set:

return {
    'dataset': dataset,
    'exclude': [dataset],  # exclude all annotations already in the dataset
    # etc.
}

Yes, the way you've solved this in your custom recipe is absolutely correct :blush: In terms of documentation, did you already see the custom recipes workflow and the custom recipes API docs in the PRODIGY_README.html? And if so, is there anything in particular that you feel is currently missing? We're always keen to add more examples of specific use cases!

Thank you! :slight_smile:

I did not know about the db-out command. Is this in any quick-start guides? I also feel like the interface should link to a table showing the results – that could also simplify exporting answers. I see that there is a metadata column in the database containing byte strings. I assume this is where the answer data is stored?

Anyway, I managed to export the results this way. But I see that the HTML formatting is included in the answers. Is there a way to only include the other information and exclude the HTML?

Is this done through the update function?

def update(answers):
    for answer in answers:
        if answer['answer'] == 'accept':
            my_custom_model.update(answer['text'], answer['label'])

Also, the statistics reset every time I restart the application, meaning I cannot see my progress after stopping and starting the server. Is this normal?

Lastly, the `exclude` option does not seem to work. Do I need to add a custom ID to the data? Or are the objects compared using hashes?

I inspected the mark function in the Prodigy source code and found this:

    def fill_memory(ctrl):
        if memorize:
            examples = ctrl.db.get_dataset(dataset)
            log("RECIPE: Add {} examples from dataset '{}' to memory"
                .format(len(examples), dataset))
            for eg in examples:
                memory[eg[TASK_HASH_ATTR]] = eg['answer']

    def ask_questions(stream):
        for eg in stream:
            if TASK_HASH_ATTR in eg and eg[TASK_HASH_ATTR] in memory:
                answer = memory[eg[TASK_HASH_ATTR]]
                counts[answer] += 1
            else:
                if label:
                    eg['label'] = label
                yield eg

It seems like mark runs through each answer and adds them on load; it also iterates through the stream and adds back old answers. However, I do not understand what TASK_HASH_ATTR is. I cannot find it in the source code.

A good place to start is the recipes overview on the website, which includes all available recipes and commands with visual examples. The first steps guide also shows simple usage of working with the database and datasets. You might also want to check out our video tutorials – even though they're showing different usage examples, they might be helpful to get a better feeling for an end-to-end workflow using Prodigy.

Prodigy uses a simple JSON/dictionary format for the annotation tasks. This keeps things simple, and makes sure you can easily reuse the data in other processes. For example, if your input looks like this:

{"text": "Hello world"}

The annotated task will look like this:

{"text": "Hello world", "answer": "accept"}

You can find more details and examples of this in the "Annotation task format" section of your PRODIGY_README.html.

This indicates that there's likely something going wrong with the database connection – i.e. the connection doesn't work, the tables are not created correctly or the data is not saved. (I'm surprised it didn't throw an error, though!)

Could you try running the same commands with the default SQLite database and see if this works as expected? And do you see any suspicious output on the command line or in the log?

Yes, the database issue above likely also explains why the exclude option is not working – the dataset is empty, so there are no examples to exclude. Under the hood, Prodigy assigns hashes to each incoming task – one for the input data (e.g. "text", "html" or the JSON-dumped task). It also assigns a hash for the input data plus the features you're annotating (e.g. labels or spans – this is less relevant in your case, though, since you're only annotating the incoming HTML data).
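So each task that comes in ends up carrying both hashes, e.g. (the hash values here are made up for illustration):

{"text": "Hello world", "_input_hash": 1617102156, "_task_hash": -1138244441}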

Yes, the easiest way to do this would be to create a custom HTML template. For example, let's say your data looks like this:

{"customer": {"first_name": "John", "last_name": "Doe"}, "master": {"first_name": "John", "last_name": "Doe"}}

You can then access the data as Mustache variables in your HTML template, including nested values. For example, something like this:

<table>
    <tr>
       <td>{{customer.first_name}}</td>
       <td>{{customer.last_name}}</td>
    </tr>
    <tr>
       <td>{{master.first_name}}</td>
       <td>{{master.last_name}}</td>
    </tr>
</table>

You can add the HTML template to the 'config' returned by your recipe, e.g.:

return {
    'config': {'html_template': HTML_TEMPLATE},
    # etc.
}

Edit:

Yes – the same functionality should also be taken care of by the exclude logic. TASK_HASH_ATTR is simply a constant for "_task_hash", which you might have seen in your annotated examples. The main reason we're using the variable here is that it's a little cleaner than hard-coding the string. But as I said, you shouldn't have to worry about this – instead, you can use one of the stream filter functions or the set_hashes() helper if you need to do hashing and filtering yourself (see the docs for the detailed API).
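For example, a minimal sketch of doing the filtering yourself (assuming filter_tasks lives in prodigy.components.filters, as described in the README):

from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.filters import filter_tasks

db = connect()  # connect to the database configured in your prodigy.json
stream = (set_hashes(eg) for eg in stream)  # make sure every task has hashes
stream = filter_tasks(stream, db.get_task_hashes('my_dataset'))  # drop already-annotated tasks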

OK, I tried one of the default examples provided for classification. I see now that the interface is not created to show overall statistics about accepted/rejected answers.

I also tested the database connection, and previous answers are available and loaded. However, the exclude option is not excluding old answers. How is this supposed to work? For me, the exclude option is a black box – I don't know what else to do other than provide the dataset. How does the exclude logic check for similarity between answers in the dataset and examples in the whole set of problems?

Also, you seem to have misunderstood my question regarding excluding the HTML data from the answers data.

I assume you mean that by using the built-in html_template functionality, Prodigy will exclude the template from the answer and only include the data passed in plus the answer? However, this will not work, as Mustache is a bit limited in its functionality. Plus, it seems like the HTML template is not parsed – instead of showing the content, the link to the template is shown as text.

This is how I create my HTML template using Jinja2:

def format_matches(stream):
    template = env.get_template('compare_template.html')
    for item in stream:
        equal_dict = compared(item['customer'], item['master'])
        yield {
            'html': template.render(keys=item['customer'].keys(),
                                    customer=item['customer'],
                                    master=item['master'],
                                    equal=equal_dict),
            'text': item['customer']['first_name']
        }

And here is my HTML template:

<style>
  .left_side {
    width: 50%;
    float: left;
  }
  .right_side {
    width: 50%;
    float: left;
  }
  .label {
    color: gray;
  }
  .table_header {
    color: gray;
  }
  .equal {
    background: rgba(139,235,28,0.48);
  }
</style>
<table>
  <tr class="left_side_table">
    <th colspan="2" class="table_header">Customer</th>
    <th colspan="2" class="table_header">Master</th>
  </tr>
  {% for key, value in customer.items() %}
  <tr class="{{ equal[key] }}">
    <td class="label">{{ key }}</td>
    <td>{{ customer[key] }}</td>
    <td class="label">{{ key }}</td>
    <td>{{ master[key] }}</td>
  </tr>
  {% endfor %}
</table>

Is there any way I can customise what data is stored as answers? For example:

{"customer": {"first_name": "John", "last_name": "Doe"}, "master": {"first_name": "John", "last_name": "Doe"}, "answer": "accept", "_input_hash": "asdfadsf", "_task_hash": "sdfasdfds"}

If you want to show the statistics of a dataset (total answers, accept/reject/ignore counts, metadata etc.), you can use the prodigy stats command:

prodigy stats your_dataset

You can also turn on the "show_stats": true option in your prodigy.json to make the UI show an overview of the accept/reject/ignore distribution of the current annotation session. This will only refer to the current session, though, not the whole dataset:

[Screenshot: the progress/stats sidebar in the web app]
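For reference, the corresponding entry in your prodigy.json is just:

{"show_stats": true}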

Let's say you have a dataset called "my_dataset" and you don't want to see examples that you've already answered and stored in this set. You can then set 'exclude': ['my_dataset'] (or --exclude my_dataset on the command line for the built-in recipes). Prodigy will check the _task_hash that is assigned to the task when it comes in, and if it's the same as one of the already annotated examples, it will skip the example and not show it to you.

If you feel like there might be a bug or some other problem, you can verify that two identical tasks receive the same hash by using the set_hashes() helper:

from prodigy import set_hashes

task = {'customer': 'John Doe'}
task = set_hashes(task)
print(task['_input_hash'], task['_task_hash'])

If examples receive the same hash, are present in the dataset that's excluded and are still shown in the web app, it'd be great if you could post a reproducible example so I can test it out and help debug it.

Yes, that's exactly what I was trying to get across in my answer above, sorry. The Mustache syntax is very similar to Jinja2, so you should be able to convert your template pretty easily. The annotation task is what your format_matches function yields – so instead of making it yield {'html': '...', 'text': '...'}, just make it yield the data directly:

yield {
    'customer': {
        'first_name': item['customer']['first_name'],
        'last_name': item['customer']['last_name']
    },
    # etc.
}

In your HTML template, you can then refer to {{customer.first_name}}. The front-end will look at the incoming annotation task data, and fill in the fields of your template.

I hope you don't feel discouraged, because aside from a few small confusions, you really got everything right, even the more advanced stuff!

Mustache is severely limited when it comes to creating templates, especially since I cannot add custom functions to it. I have to do some serious data manipulation just to cram this data into a form Mustache is able to use. Can you please, in the future, allow users to just adjust the saved data? Then we could freely use the HTML view with whatever template engine we want. Or just add a proper one, like Jinja2.

I bought Prodigy to speed up data annotation, and at this point I'm frustrated and disappointed. Please think more about allowing users to extend/modify functions and classes. And give some transparency – maybe I'm just not familiar enough with CPython, but I can't inspect this code to understand it. And there is not enough documentation.

OK, I finally worked out a solution for deciding what to store in the database. The documentation is a bit unclear, but the update option takes a function that modifies annotations. What was not clear is that this also modifies what is saved to the database, not just what is sent to a model.

@prodigy.recipe('match_recipe',
                dataset=prodigy.recipe_args['dataset'],
                file_path=("Path to texts", "positional", None, str))
def match_recipe(dataset, file_path):
    json_compare = JSON('compare_test.json')
    stream = format_matches(json_compare)
    # stream = format_matches_mustache(json_compare)
    components = mark(dataset=dataset, source=stream, memorize=True, exclude=(dataset,))

    def update(answers):
        for answer in answers:
            del answer['html']

    components['update'] = update

    return components

With a custom update function, I can delete the HTML from the answer that is stored. However, how this happens is a mystery to me, as the update function does not return anything. Is the answers object global? In that case, why do I pass it in here?
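(For anyone else wondering: the update callback receives references to the task dictionaries, so deleting a key mutates the very objects Prodigy goes on to save – no return value needed. A minimal pure-Python illustration, independent of any Prodigy API:)

def update(answers):
    for answer in answers:
        del answer['html']  # mutates each dict in place

tasks = [{'html': '<p>hi</p>', 'answer': 'accept'}]
update(tasks)
print(tasks)  # [{'answer': 'accept'}] – the caller's objects changed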

I also need to add data to the object in the stream:

def format_matches(stream):
    template = env.get_template('compare_template.html')
    for item in stream:
        equal_dict = compared(item['customer'], item['master'])
        yield {
            'html': template.render(customer=item['customer'],
                                    master=item['master'],
                                    equal=equal_dict),
            'text': item['customer']['first_name'],
            'data': item
        }

This solves what I explained above. What remains now is debugging the exclude option. I'll try to build a reproducible example.

Really sorry this has been frustrating for you – I'll do my best to help!

Just to be sure: did you see the PRODIGY_README.html file available for download with the Prodigy installers (via your download link)? All relevant APIs should be fully documented, including the individual classes and methods, the built-in models, the recipe functions, the components, the database etc. and the REST API. It also includes the expected input and output formats.

If there's anything you needed that wasn't included, let me know and we're happy to fix that! You should never have to dig through the compiled code.

I'm not sure I understand – Prodigy allows you to stream in and save any JSON data you like, and you can structure your annotation tasks however you want. The only thing you need to do is tell Prodigy how to render it – either via an "html" key or an HTML template.

The Mustache solution is mostly a quick and simple way to handle the templating on the front-end, especially for users who are not familiar with populating templates themselves in Python. In your case, the solution you came up with using Jinja2 is totally fine and exactly what we've had in mind for HTML tasks. To allow as much flexibility as possible, Prodigy lets you create your stream freely and programmatically in Python – for example, to produce a result like this:

{"html": "<strong>John Doe</strong>", "customer": {"first_name": "John", "last_name": "Doe"}

Okay, I think I understand what you're trying to do here. The examples stored in the database are, by design, intended to be an exact reflection of the annotation tasks. You can think of the Prodigy database as an exact record of what the annotator worked on. So modifying the examples is usually not recommended, because it can easily have unintended side-effects – especially when using Prodigy for both annotation and training. Keeping the exact tasks on record also makes debugging easier, because it allows you to "replay" an annotation session later on.

So if you want to "clean up" the annotations afterwards to store them in a different format, a nicer solution would be to add a hook to your recipe that saves them out to any other format or database of your choice when you exit the server.

For instance, you can do this via an on_exit function returned by your recipe. You can find the API and an example of the on_exit and update functions in the "Custom recipes" section of the PRODIGY_README.html.

def on_exit(ctrl):  # the controller gives you access to the DB, session etc.
    # get all examples of the current session – this will be a simple
    # list of dictionaries of the annotated tasks
    examples = ctrl.db.get_dataset(ctrl.session_id)
    for eg in examples:
        del eg['html']  # reformat the example and do whatever you like
    # do something with the annotations – save them out to a file, DB, etc.
    save_examples_to_custom_storage_solution(examples)

This will ensure you have a record of the annotation session in your Prodigy database, as well as the collected annotations in a custom format of your choice, which you can then reuse for other processes.

Yes, I have downloaded this and used it. However, writing a bit about each function, without examples of what it returns and with only simplified, isolated usage examples, is not enough to understand it.

I have tried to debug why previously answered tasks are not excluded after I restart the server, but I cannot figure it out. Both filter_tasks and filter_inputs return the stream unchanged.

Below is the recipe code at this time. I have commented out the update function to leave the saved data unchanged, as I agree with your philosophy on this:

import prodigy
from prodigy.components.db import connect
from prodigy.components.filters import filter_tasks, filter_inputs
from prodigy.recipes.generic import mark  # pylint: disable=E0611,E0401
from jinja2 import Environment, PackageLoader, select_autoescape

env = Environment(
    loader=PackageLoader('src', 'verification'),
    autoescape=select_autoescape(['html', 'xml'])
)

DB = connect()

# pylint: disable=E1101
@prodigy.recipe('match_recipe',
                dataset=prodigy.recipe_args['dataset'],
                # file_path=("Path to texts", "positional", None, str),
                exclude=prodigy.recipe_args['exclude'])
def match_recipe(dataset, exclude):
    stream = prodigy.get_stream('/Users/ohenrik/Sites/bisnode_rec_match/src/verification/compare_test.json', loader='json')
    # stream = JSON('/Users/ohenrik/Sites/bisnode_rec_match/src/verification/compare_test.json')

    def compared(customer, master):
        equal = {}
        for key, value in customer.items():
            if master[key] == value:
                equal[key] = 'equal'
            else:
                equal[key] = 'not_equal'
        return equal

    def format_matches(stream):
        template = env.get_template('compare_template.html')
        # already_marked = DB.get_task_hashes(dataset)
        for item in stream:
            equal_dict = compared(item['customer'], item['master'])
            result = {
                'html': template.render(customer=item['customer'],
                                               master=item['master'],
                                               equal=equal_dict),
                'text': item['ID'],
                'data': item
            }
            yield result

    # def update(answers):
    #     for answer in answers:
    #         del answer['html']


    stream = format_matches(stream)
    stream = filter_tasks(stream, DB.get_task_hashes(dataset))
    stream = filter_inputs(stream, DB.get_input_hashes(dataset))
    components = mark(dataset=dataset, source=stream, memorize=True, exclude=[dataset])

    # components['update'] = update

    return components

Might it be that the exclude logic cannot deal with my custom formatter (format_matches)?

I haven’t read everything here yet, so I could be wrong — but at first glance it looks problematic that your custom records are missing the input and task hashes. If you add the line result = prodigy.set_hashes(result) before yield result, you’ll update the record with these content IDs. The hashes are necessary to allow Prodigy to compare records by input and annotation content — without the hashes, there’s no way to guess what keys in your record are relevant, so there’s no way for filter_tasks and filter_inputs to work.

The hashes are normally set automatically in the controller, but since you’re doing everything differently here, it looks like that’s the extra step that’s going missing.
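Concretely, the change inside format_matches is just one line before the yield (this is how it ends up looking in the final version further down; the 'html' rendering is omitted here for brevity):

def format_matches(stream):
    for item in stream:
        result = {'text': item['customer']['first_name'], 'data': item}  # plus the rendered 'html' key
        result = prodigy.set_hashes(result)  # attaches _input_hash and _task_hash
        yield result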

Also, is it possible to maybe arrange a screen-sharing session (a Google Hangout, for example) to clarify a few things? Prodigy seems to be built to use directly with models, so when I use it just to annotate (mark) data, I might have some basic misconceptions about how Prodigy is supposed to work that contribute to me not getting it right.

I actually tried this earlier. It caused all the tasks to be filtered out, even when I deleted all previous progress and started from scratch. So I thought that by adding the hashes to the result, Prodigy thought they were already answered.

Edit: I added this back in now and it suddenly worked... Thank you though. Are there any examples of this in the documentation?

It’s probably worth noting that you’re well off the “happy path”, which is why you’ve had to customise so much. I’m actually quite pleased with how flexible the software has been!

We designed Prodigy for a use-case that we felt didn’t really have a “good enough” solution previously: quickly prototyping models for new machine learning-powered features. For situations like training a new named entity recognition model from scratch, or evaluating an image captioning system, I think Prodigy really offers something new — and I’m pleased to say the response has been really great on these things, and other complicated tasks.

I think software is always going to feel a little awkward and uncomfortable if you’re at the edges of the tasks it was designed for, and you’re only using a subset of the functionality. For instance, Microsoft Word isn’t a great product for jotting down a quick note-to-self.

I’m glad to hear you got your example working, and I hope we can put out a video tutorial covering custom recipes and no-model use-cases soon, after we’ve finished the tutorials for the manual NER and computer vision recipes.

I guess so – I just thought creating custom recipes and using Prodigy for annotating data was one of the main use cases.

It seems to be working now, except for one last thing: now all previous answers are excluded even if I start a new dataset. I have to delete the SQLite database to be able to start again. I'm assuming this is not normal? I have set the exclude option to exclude data from the current dataset by default.

components = mark(dataset=dataset, source=stream, memorize=True, exclude=[dataset])

So you’re passing the name of the new, empty dataset in the exclude list, and that’s preventing the tasks from coming in? It’s hard to make sense of that because the exclude mechanism just gets the task hashes from the dataset and runs filter_tasks(), just as you’re doing explicitly above. If the dataset you’re excluding is empty, this should return an empty list of tasks.
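Roughly, the built-in exclude logic boils down to this (a simplified sketch, not the actual source):

task_hashes = ctrl.db.get_task_hashes(*exclude)  # hashes of everything already annotated
stream = filter_tasks(stream, task_hashes)       # skip tasks whose hash is already known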

I got it working! The complete code now looks like this:

"""Custom prodigy recipie for marking matches"""
import os
import prodigy
from prodigy.recipes.generic import mark # pylint: disable=E0611,E0401
from jinja2 import Environment, PackageLoader, select_autoescape

env = Environment(
    loader=PackageLoader('src', 'verification'),
    autoescape=select_autoescape(['html', 'xml'])
)

# pylint: disable=E1101
@prodigy.recipe('match_recipe',
                dataset=prodigy.recipe_args['dataset'],
                exclude=prodigy.recipe_args.get('exclude', None))
def match_recipe(dataset, exclude):
    test_filepath = os.path.join(os.environ.get('PROJECT_DIR'), 'tests/fixtures/compare_test.json')
    stream = prodigy.get_stream(test_filepath, loader='json')

    def compared(customer, master):
        equal = {}
        for key, value in customer.items():
            if master[key] == value:
                equal[key] = 'equal'
            else:
                equal[key] = 'not_equal'
        return equal

    def format_matches(stream):
        template = env.get_template('compare_template.html')
        for item in stream:
            equal_dict = compared(item['customer'], item['master'])
            result = {
                'html': template.render(customer=item['customer'],
                                        master=item['master'],
                                        equal=equal_dict),
                'text': item['customer']['first_name'],
                'data': item,
                'id': item['ID']
            }
            result = prodigy.set_hashes(result)
            yield result

    stream = format_matches(stream)
    components = mark(dataset=dataset, source=stream, memorize=True, exclude=exclude)

    return components
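For completeness, running the recipe looks something like this on the command line (assuming the code above lives in match_recipe.py – adjust the file and dataset names):

prodigy match_recipe my_dataset -F match_recipe.py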

Awesome! Looks very elegant. Thanks for updating with your solution.