Everything works perfectly and I'm really satisfied with the sleek interface. However, I'd like to know: is it possible to customise the JSONL output?
I am working with a large CSV file containing multiple columns. The column "text" is automatically used by Prodigy (as it should be). Is it possible to parse the "id" column from my original CSV document and add this value to the JSONL output? I mean, is it possible to extract any information from my CSV file and pass it to Prodigy's output?
Hi and thanks! I think you might be at a point where you want to write your own little loader script that does exactly what you need. Loaders are simple functions that yield annotation tasks, i.e. dictionaries containing the task keys like "text" etc. So using Python's built-in csv module, you could do something like this:
import csv

def custom_csv_loader(file_path):
    with open(file_path) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            text = row.get('custom_text_field')  # etc.
            row_id = row.get('id')
            yield {'text': text, 'meta': {'id': row_id}}

stream = custom_csv_loader('/path/to/your_file.csv')
Within the loader, you can also perform any other data transformations. The "meta" field is usually the best place to store custom data if you want it to be displayed on the front-end (in the bottom right corner of the annotation card). Alternatively, Prodigy should also respect any other custom properties like {'text': 'Some text', 'user_id': 123}, pass them through as you annotate the tasks and store them in the database with your annotations. This lets you attach any arbitrary "hidden" meta data (as long as it's JSON-serializable).
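To make that concrete, here's a small sketch of what such a task dict could look like (all field names and values below are made-up placeholders, not anything your CSV has to contain):

```python
import json

# A hypothetical annotation task: "text" is required, "meta" is displayed
# in the bottom right corner of the annotation card, and any other custom
# top-level property (like "user_id" here) is passed through untouched
# and stored in the database alongside the annotation.
task = {
    'text': 'Some text to annotate',
    'meta': {'id': 'A123', 'source': 'survey.csv'},
    'user_id': 123,  # "hidden" custom data, not shown on the card
}

# Everything needs to be JSON-serializable to survive the round trip
serialized = json.dumps(task)
restored = json.loads(serialized)
print(restored['user_id'])
```

The round trip through json.dumps / json.loads is the same constraint your custom fields have to satisfy when Prodigy writes them out as JSONL.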
Sorry if my answer was a little confusing: the library itself doesn't have a custom_csv_loader, but since recipe functions are just Python scripts, you can always write your own logic instead. So in your recipe code, you can replace stream = CSV(file_path) with my code snippet above.
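A quick way to sanity-check a loader like this outside of Prodigy is to run it against a tiny throwaway CSV and look at the dicts it yields (the column names here are made up for the example):

```python
import csv
import os
import tempfile

def custom_csv_loader(file_path):
    # Same shape as the snippet above: one task dict per CSV row
    with open(file_path, encoding='utf-8') as csvfile:
        for row in csv.DictReader(csvfile):
            yield {'text': row.get('text'), 'meta': {'id': row.get('id')}}

# Write a tiny example CSV to run the loader against
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('id,text\n1,Hello world\n2,Another row\n')
    path = f.name

tasks = list(custom_csv_loader(path))
os.unlink(path)
print(tasks[0])  # {'text': 'Hello world', 'meta': {'id': '1'}}
```

If the dicts printed here already have the fields you expect, the same stream will produce those fields in the JSONL output.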
I actually tried the same code snippet, but I am not able to get my "ResponseID" in the JSONL output: the id field says "null". Can you help me sort out this issue?
import csv
import prodigy

@prodigy.recipe('fb_test',
    dataset=prodigy.recipe_args['dataset'],
    file_path=("C:/Users/...../Downloads/workspace/test.csv", "positional", None, str))
def fb_test(dataset, file_path):
    """Annotate the feedbacks using different labels."""
    stream = custom_csv_loader("C:/Users/....../Downloads/workspace/test.csv")
    stream = add_options(stream)  # add options to each task
    return {
        'dataset': dataset,   # save annotations in this dataset
        'view_id': 'choice',  # use the choice interface
        'config': {'choice_style': 'multiple'},
        'stream': stream,
        'on_exit': on_exit
    }

def custom_csv_loader(file_path):
    with open(file_path, encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            ids = row.get('ResponseId')
            text = row.get('Feedback_Explanation')
            yield {'meta': {'id': ids}, 'text': text}  # load in the CSV file

def on_exit(controller):
    # Get all annotations in the dataset, filter out the accepted tasks,
    # count them by the selected options and print the counts.
    examples = controller.db.get_dataset(controller.dataset)
    examples = [eg for eg in examples if eg["answer"] == "accept"]
    for option in ("Billing & Payment", "Registration & Sign-In", "Website Issues"):
        count = len([eg for eg in examples if option in eg["accept"]])
        print(f"Annotated {count} {option} examples")

def add_options(stream):
    # Helper function to add options to every task in a stream
    options = [
        {"id": "Billing & Payment", "text": "Billing & Payment"},
        {"id": "Registration & Sign-In", "text": "Registration & Sign-In"},
        {"id": "Website Issues", "text": "Website Issues"},
    ]
    for task in stream:
        task["options"] = options
        yield task