Custom JSONL output

Hi, I am currently working with the following recipe:

import prodigy
from prodigy.components.loaders import CSV

@prodigy.recipe('my_recipe',
            dataset=prodigy.recipe_args['dataset'],
            file_path=("The path to the CSV file", "positional", None, str))

def my_recipe(dataset, file_path):

    stream = CSV(file_path)
    stream = add_options(stream)

    return {
        'dataset': dataset,
        'exclude': [dataset],
        'view_id': 'choice',
        'stream': stream,
    }

def add_options(stream):
    options = [{'id': 'class1', 'text': 'Class 1'},
               {'id': 'class2', 'text': 'Class 2'},
               {'id': 'class3', 'text': 'Class 3'}]

    for task in stream:
        task['options'] = options
        yield task

Everything works perfectly and I am really satisfied with the sleek interface. However, I would like to know: is it possible to customise the JSONL output?

I am working with a large CSV file containing multiple columns. The column ‘text’ is automatically used by prodigy (as it should be). Is it possible to parse the ‘id’ column from my original CSV document and add this value to the JSONL output? I mean, is it possible to extract any information from my CSV file and pass it to prodigy’s output?

Thanks for the great support!

Hi and thanks! I think you might be at a point where you want to write your own little loader script that does exactly what you need. Loaders are simple functions that yield annotation tasks, i.e. dictionaries containing the task keys like "text" etc. So using the built-in csv module, you could do something like this:

import csv

def custom_csv_loader(file_path):
    with open(file_path) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            text = row.get('custom_text_field')  # etc.
            id = row.get('id')
            yield {'text': text, 'meta': {'id': id}}

stream = custom_csv_loader('/path/to/your_file.csv')

Within the loader, you can also perform any other data transformations. The "meta" field is usually the best place to store custom data if you want it to be displayed on the front-end (in the bottom right corner of the annotation card). Alternatively, Prodigy should also respect any other custom properties like {'text': 'Some text', 'user_id': 123}, pass them through as you annotate the tasks and store them in the database with your annotations. This lets you attach any arbitrary “hidden” meta data (as long as it’s JSON-serializable).
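For example, a task along these lines (the values here are just placeholders) would show the "id" in the bottom right corner of the annotation card, while "user_id" is passed through silently and stored in the database with the annotation:

task = {
    'text': 'Some text',
    'meta': {'id': 'abc123'},  # displayed on the annotation card
    'user_id': 123,            # not displayed, but saved with the annotation
}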


Thanks for the speedy reply @ines! I didn’t know the library had a custom_csv_loader. I’ll give it a try!

Sorry if this was a little confusing in my answer – the library itself doesn’t have a custom_csv_loader, but since recipe functions are just Python scripts, you can always write your own logic instead. So in your recipe code, you can replace stream = CSV(file_path) with my code snippet above.
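So the only change needed in your recipe body would be something like this (using the custom_csv_loader from the snippet above):

stream = custom_csv_loader(file_path)  # instead of: stream = CSV(file_path)
stream = add_options(stream)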

Thanks! It was actually fairly easy and worked straight away! Here is my updated recipe:

import prodigy
import csv

@prodigy.recipe('my_recipe',
            dataset=prodigy.recipe_args['dataset'],
            file_path=("Path to CSV", "positional", None, str))
def my_recipe(dataset, file_path):
    stream = custom_csv_loader(file_path)
    stream = add_options(stream)

    return {
        'dataset': dataset,
        'exclude': [dataset],
        'view_id': 'choice',
        'stream': stream,
    }

def custom_csv_loader(file_path):
    with open(file_path) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            id = row.get('id')
            text = row.get('text')
            category = row.get('category')
            info = row.get('info')
            yield {'text': text, 'meta': {'id': id, 'category': category, 'info': info}}

def add_options(stream):
    options = [{'id': 'class1', 'text': 'Class 1'},
               {'id': 'class2', 'text': 'Class 2'},
               {'id': 'class3', 'text': 'Class 3'}]

    for task in stream:
        task['options'] = options
        yield task

I actually tried the same code snippet, but I am not able to get my "ResponseID" in the JSONL output: the id field just says "null". Can you help me sort out this issue?

import csv
import prodigy

@prodigy.recipe('fb_test',
            dataset=prodigy.recipe_args['dataset'],
            file_path=("C:/Users/...../Downloads/workspace/test.csv", "positional", None, str))

def fb_test(dataset, file_path):
    """Annotate the feedbacks using different labels."""               
    stream = custom_csv_loader("C:/Users/....../Downloads/workspace/test.csv")            
    stream = add_options(stream)  # add options to each task

    return {
        'dataset': dataset,   # save annotations in this dataset
        'view_id': 'choice',  # use the choice interface
        'config': {'choice_style': 'multiple'},
        'stream': stream,
        'on_exit': on_exit
    }

def custom_csv_loader(file_path):
    with open(file_path, encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            ids = row.get('ResponseId')
            text = row.get('Feedback_Explanation')
            yield {'meta': {'id': ids}, 'text': text}  # load in the CSV file

def on_exit(controller):
    # Get all annotations in the dataset, filter out the accepted tasks,
    # count them by the selected options and print the counts.
    examples = controller.db.get_dataset(controller.dataset)
    examples = [eg for eg in examples if eg["answer"] == "accept"]
    for option in ("Billing & Payment", "Registration & Sign-In", "Website Issues"):
        count = len([eg for eg in examples if option in eg["accept"]])
        print(f"Annotated {count} {option} examples")
    
def add_options(stream):
    # Helper function to add options to every task in a stream
    options = [
        {"id": "Billing & Payment", "text": "Billing & Payment"},
        {"id": "Registration & Sign-In", "text": "Registration & Sign-In"},
        {"id": "Website Issues", "text": "Website Issues"},
    ]
    for task in stream:
        task["options"] = options
        yield task

@dinnuv See your other thread: "ID is null although I give an ID value from .CSV file". Cross-posting the same question to multiple threads isn't helpful and actually makes it much harder for us to help people.