Creating a tabular visualisation for a comparison task

Hello!

We would like to annotate a data set.

The annotation task is to compare pairs of records from a source dataset and decide whether they are similar or not. Each record contains N fields, and for two records to match, some of the fields must “match” according to business rules (so it is not always exact matching). To facilitate this, for every pair of records we need a side-by-side view to compare their fields.

Questions:

  1. We want to configure a tabular visualisation layout like the following:
FIELD    CANDIDATE1    CANDIDATE2
field1   value1        value4
field2   value2        value5
field3   value3        value6

where each CANDIDATE is a record from the source dataset, and below that table there will be the annotation buttons (match, no match, don’t know, enter)

How can we configure such a presentation layout?

  2. We would like to simulate the --diff argument.
    It would be useful if, in the above table, we could highlight (some of) the CANDIDATE1 and CANDIDATE2 values when they match (string equality), in order to help annotators. Is this possible?

  3. Can we add some checkboxes to the above model (e.g. a list of the rules that applied in the annotator's decision), like the example of custom recipes with choice?

  4. Should our input data have a special format (JSONL)?

Thank you,
Gerasimos

Thanks for your questions – I always like creative use cases like this, so here are some ideas:

Prodigy supports streaming in custom HTML that you can generate however you like. Since your requirements are a little more specific (different colours depending on the data etc.), you might want to generate the HTML in Python straight away. Prodigy recipes are simple Python functions, so how you construct the stream is up to you.

Let's assume your data is a list of dictionaries that look like this:

{
    "fields": ["field1", "field2", "field3"],
    "candidate1": ["value1", "value2", "value3"],
    "candidate2": ["value4", "value5", "value6"]
}

To illustrate the idea of creating the HTML template programmatically, here's an example of doing it in "vanilla" Python. Of course, you might want to look into using a templating library like Jinja2 to make this easier. You can structure the logic however you like, compare the values, add style attributes to the table cells, or even include other custom CSS for styling.

# assume `eg` is one input record in the format shown above
fields = eg["fields"]
candidate1 = eg["candidate1"]
candidate2 = eg["candidate2"]

rows = ''
for field, c1, c2 in zip(fields, candidate1, candidate2):
    # use a green background if the candidate values are identical
    bgcolor = 'green' if c1 == c2 else 'transparent'
    # create the HTML markup for one table row
    rows += """
        <tr>
            <td>{field}</td>
            <td style="background: {bgcolor}">{c1}</td>
            <td style="background: {bgcolor}">{c2}</td>
        </tr>
    """.format(field=field, c1=c1, c2=c2, bgcolor=bgcolor)

# put everything together wrapped in a table
html = """
    <table>
        <tr><th>Field</th><th>Candidate 1</th><th>Candidate 2</th></tr>
        {rows}
    </table>
""".format(rows=rows) 

Your annotation tasks in JSONL format could then look like this:

{"html": "<table>...</table>", "data": { ... }}

You can store any other arbitrary data with the annotation tasks to relate the annotations back to the original input data (candidate 1 and candidate 2 etc.). This is probably quite important in your case – you don't just want to keep the generated HTML markup, but also the original values.
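
To put this together, the stream you give to Prodigy could be a simple generator – here's a sketch, where make_table_html is a hypothetical helper wrapping the table-building code above:

def get_stream(examples):
    # examples: an iterable of dicts with "fields", "candidate1"
    # and "candidate2", as shown above
    for eg in examples:
        html = make_table_html(eg)  # hypothetical helper, see above
        # keep the original record in "data" so it's stored with the
        # annotation, not just the rendered markup
        yield {"html": html, "data": eg}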

When the user annotates the task, it will be stored in the database with an added "answer" key of the value "accept", "reject" or "ignore".
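
For example, an accepted task exported from the database could look like this:

{"html": "<table>...</table>", "data": {...}, "answer": "accept"}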

Annotation tasks using the "choice" interface can also contain HTML data. So your task could look like this:

{
    "html": "<table>...</table>", 
    "data": {...}, 
    "options": [
         {"id": 1, "text": "Option 1"},
         {"id": 2, "text": "Option 2"}
    ]
}

When the user annotates the task, it will be stored with the "answer", as well as an "accept" key containing a list of all selected option IDs. For example, "accept": [1] if the annotator has selected the option with "id": 1 (IDs don't have to be integers btw – you can also use strings). To allow multiple-choice selection, you can set "choice_style": "multiple" in your config.
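
So a stored multiple-choice annotation could come back looking something like this, with both keys present:

{"html": "<table>...</table>", "answer": "accept", "accept": [1, 2]}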

You can find more details on the format in the "Annotation task format" section of your PRODIGY_README.html. In theory, the "options" could also include an "html" key instead of "text". But I'd recommend not going overboard with this and keeping the task as simple and straightforward as possible. Prodigy's interface is most powerful if the annotator is able to make the decision within a few seconds.

You can find an example of a custom recipe using the "choice" interface in the custom recipes workflow. At a minimum, your recipe could then look something like this:

import prodigy

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset):
    # add your logic that reads in your data and creates a generator
    # of the annotation task dictionaries (see above)
    stream = load_your_data_here()

    return {
        'dataset': dataset,  # save annotations to this dataset
        'stream': stream,    # iterator of examples
        'view_id': 'choice'  # use the choice interface
    }

To call the recipe from the command line, you can run:

prodigy custom-recipe your_dataset -F recipe.py

If you need to convert your data anyway, JSONL is usually the format we'd recommend. It's flexible and can be read in line by line, which can speed up the process for large datasets (as Prodigy won't have to wait for the whole file to be loaded and parsed).

If you're using custom recipes, you have even more flexibility, because you can write your ETL logic in Python. All you need to do is load your data (from a file, a database, a REST API etc.) and return a generator of dictionaries in Prodigy's annotation task format.
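
For instance, the load_your_data_here function from the recipe sketch above could be as simple as this – using only the standard library, with the file name and helper as placeholders:

import json

def load_your_data_here():
    # "records.jsonl" is a placeholder for your converted data file
    with open("records.jsonl", encoding="utf8") as f:
        for line in f:  # read line by line, not all at once
            eg = json.loads(line)
            # make_table_html is the hypothetical helper from above
            yield {"html": make_table_html(eg), "data": eg}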

Great! Thank you Ines.


I would like to ask you one more question: is it possible to display the choice options in two or more columns?