My goal is to use the textcat.teach recipe to annotate sentences which I have stored in a jsonl file. Further, each sentence is associated with a file, and I want to display the file path alongside the text in the annotation interface.
My approach is to wrap textcat.teach in a custom recipe and use the html interface to display additional info.
So far, I have essentially reproduced the textcat.teach recipe with the html interface and introduced a placeholder for the extra field to display, but I’m not sure of the best way to actually include the extra field.
Custom Recipe
import prodigy
from prodigy.recipes.textcat import teach


@prodigy.recipe('custom.textcat.teach',
                dataset=prodigy.recipe_args['dataset'],
                spacy_model=prodigy.recipe_args['spacy_model'],
                source=prodigy.recipe_args['source'],
                label=prodigy.recipe_args['label_set'])
def custom_textcat_teach(dataset, spacy_model, source, label=None):
    # Reuse the components returned by the built-in textcat.teach recipe
    components = teach(dataset=dataset, spacy_model=spacy_model,
                       source=source, label=label)
    # Load the custom HTML template and switch to the html interface
    with open('extension/template.html', 'r') as f:
        template = f.read()
    components['config']['html_template'] = template
    components['view_id'] = 'html'
    return components
template.html
<strong>{{text}}</strong>
<span style="background: #ffe184">File path will go here.</span>
Input Data File (jsonl)
{"text": "sentence number one"}
{"text": "sentence number two"}
My naive approach was to modify the input data file to something like
{"text": "sentence number one", "file_path": "path/to/example_one"}
{"text": "sentence number two", "file_path": "path/to/example_two"}
and then reference {{file_path}} in the html template, but this throws an error: ValueError: Failed to load task (invalid JSON).
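Since the message says "invalid JSON" rather than naming the extra key, one thing I can check independently of Prodigy is whether every line of the modified file actually parses. Below is a minimal sketch using only the standard library; the function name find_invalid_jsonl_lines is my own and not part of Prodigy, and the sample lines are the ones from above.

```python
import json


def find_invalid_jsonl_lines(lines):
    """Return (line_number, error_message) for each non-empty line that is not valid JSON."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        try:
            json.loads(stripped)
        except json.JSONDecodeError as err:
            problems.append((number, str(err)))
    return problems


# The two modified records from the input file above
sample = [
    '{"text": "sentence number one", "file_path": "path/to/example_one"}',
    '{"text": "sentence number two", "file_path": "path/to/example_two"}',
]
print(find_invalid_jsonl_lines(sample))
```

To check the real file, the same function can be called with an open file handle, since iterating over a file yields its lines.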
So my questions are:
I assume the error is because the input jsonl to textcat is not expecting the ‘file_path’ key - correct?
Are there other valid fields that I can include in the jsonl input and then reference in the html template?
Is there another recommended approach? I could write a generator that reads the modified jsonl format, extracts just the ‘text’ part and passes it along to textcat.teach, but I’m unsure how to make the ‘file_path’ values referable in the html template.
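To make the generator idea concrete, here is a rough sketch of what I have in mind. It is plain Python and does not call Prodigy. The function name add_file_path_meta is hypothetical, and the sketch rests on two assumptions I have not verified: that extra keys on a task dict are passed through to the template, and that a "meta" dict on the task is shown on the annotation card. How to hand this stream to textcat.teach is exactly the part I am unsure about, so it is left open.

```python
import json


def add_file_path_meta(lines):
    """Yield one task dict per JSONL line, keeping 'file_path' and mirroring it under 'meta'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        task = {"text": record["text"]}
        if "file_path" in record:
            # Assumption: extra top-level keys can be referenced in the html template
            task["file_path"] = record["file_path"]
            # Assumption: the 'meta' dict is displayed alongside the text
            task["meta"] = {"file_path": record["file_path"]}
        yield task


records = [
    '{"text": "sentence number one", "file_path": "path/to/example_one"}',
    '{"text": "sentence number two"}',
]
tasks = list(add_file_path_meta(records))
```

Records without a ‘file_path’ key simply come through with only their text, so the original input file would still work unchanged.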