Extending UI to display additional fields for textcat.teach

My goal is to use the textcat.teach recipe to annotate sentences which I have stored in a jsonl file. Further, each sentence is associated with a file, and I want to display the file path alongside the text in the annotation interface.

My approach is to wrap textcat.teach in a custom recipe and use the html interface to display additional info.

So far, I have essentially reproduced the textcat.teach recipe with the html interface and introduced a placeholder for the extra field to display, but I’m not sure of the best way to actually include the extra field.

Custom Recipe

import prodigy
from prodigy.recipes.textcat import teach

@prodigy.recipe('custom.textcat.teach',
    dataset=prodigy.recipe_args['dataset'],
    spacy_model=prodigy.recipe_args['spacy_model'],
    source=prodigy.recipe_args['source'],
    label=prodigy.recipe_args['label_set'])
def custom_textcat_teach(dataset, spacy_model, source, label=None):
    # Reuse the built-in recipe and get back its components dict
    components = teach(dataset=dataset, spacy_model=spacy_model,
                       source=source, label=label)

    # Path is relative to the directory the recipe is run from
    with open('extension/template.html', 'r') as f:
        template = f.read()
    # Render each task with the custom template in the html interface
    components['config']['html_template'] = template
    components['view_id'] = 'html'
    return components

template.html

<strong>{{text}}</strong>
<span style="background: #ffe184">File path will go here.</span>

Input Data File (jsonl)

{"text": "sentence number one"}
{"text": "sentence number two"}

My naive approach was to modify the input data file to something like

{"text": "sentence number one", "file_path": "path/to/example_one"}
{"text": "sentence number two", "file_path": "path/to/example_two"}

and then reference {{file_path}} in the html template, but this throws an error: ValueError: Failed to load task (invalid JSON)..

So my questions are:

I assume the error is because the input jsonl to textcat is not expecting the ‘file_path’ key - correct?

Are there other valid fields that I can include in the jsonl input and then reference in the html template?

Is there another recommended approach to do this? I could create a generator for the modified jsonl format that will extract just the ‘text’ part and pass it along to textcat.teach, but I’m unsure of how I make the ‘file_path’ values referable in the html template.

Yes, your approach sounds good. It's exactly what I would have recommended: using the HTML view with a custom template and additional properties in the task.

ValueError: Failed to load task (invalid JSON)..

This error usually only occurs if a line can't be loaded by json.loads. The example you pasted looks fine, but maybe you could double-check that there's nothing weird in the file you're loading? An accidental unescaped quotation mark in one of the strings? A trailing comma? You could also write a script that opens the file and calls json.loads on each line to see where it fails.
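A minimal sketch of such a check, using only the standard library. The function name find_invalid_lines and the file name in the usage note are made up for illustration; nothing here is part of Prodigy.

```python
import json


def find_invalid_lines(path):
    """Return (line_number, error_message) for each line json.loads rejects."""
    problems = []
    with open(path, encoding="utf-8") as f:
        # Line numbers start at 1 so they match what a text editor shows
        for line_number, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((line_number, str(err)))
    return problems
```

Calling find_invalid_lines("data.jsonl") and printing the result should point to the offending lines; the error message includes the column where parsing stopped, which is usually close to the stray quote or comma.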

(The incoming data will be validated against a JSON schema, too, to make sure it has everything it needs – but I just had a look at the schema again and it allows additional properties. So this shouldn't be an issue. Btw, if you're into JSON schemas, you can check it out via prodigy.get_schema('classification').)

Btw, one quick note, also in case others come across this thread later: When you train the model (assuming you're training with textcat.batch-train and spaCy), it will only get to see the "text". So if you do want the model to take the file path into account, you could generate data that looks like this:

{
    "orig_text": "Some text",
    "file_path": "some/path",
    "text": "Some text some/path"
}

Your template would only use the orig_text and file_path, but the model would see the text. Of course, when using this approach, it's important to make sure that what the model sees really matches what the annotator saw – otherwise, you can end up with weird results.
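As a rough sketch, tasks in that shape could be built from the original records like this. The helper name add_model_text is hypothetical, and joining the sentence and the path with a single space is just the convention from the example above, not something Prodigy requires.

```python
def add_model_text(task):
    """Return a copy of the task whose "text" combines the sentence and its path."""
    combined = dict(task)  # copy so the input record is left untouched
    combined["orig_text"] = task["text"]
    # This combined string is what the model would see during training
    combined["text"] = f'{task["text"]} {task["file_path"]}'
    return combined
```

The template would then reference {{orig_text}} and {{file_path}} instead of {{text}}.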

So I did get an error when looping through the input and calling json.loads() on each line. I’m not able to pinpoint exactly which character was giving issues, but it was fixed by using json.dumps() to create the json instead of piecing it together manually. Seems to be working fine now.
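In case it helps anyone else, this is roughly what writing the file with json.dumps() looks like. It is a sketch only: write_jsonl is a name made up for this example, and the records are assumed to be plain dicts with "text" and "file_path" keys.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line, letting json.dumps handle all escaping."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes quotes, backslashes and newlines inside strings,
            # so each record always stays on a single valid line
            f.write(json.dumps(record) + "\n")
```

Because the escaping is done by the json module, quotation marks or backslashes in the sentences or file paths no longer break the file.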

Thanks for your expertise!
