jsonl format

Hi all,

I have problems with the .jsonl format (Python 3.6.4). After some serious hacking I managed to write a list of dictionaries in JSONL format (I converted each list entry to a string):

data_list = [{dict1_data}, {dict1_data}, ...]
with open(target_file, "w") as fp:
    for data in data_list:
        fp.write(str(data) + "\n")

I then try to read the file back so that I get the original list again:

with open(source_file, "r") as fp:
    for data in data_list:
        data_list  fp.read())

but no luck so far. Any suggestions (ideally with code)?

best, Andreas

Yes, your general approach is correct. JSONL is newline-delimited JSON, so to export data in that format, call json.dumps() on each entry (rather than str(), which writes a Python repr that is not necessarily valid JSON) and add a \n after each one. To load it back in, read the file line by line and call json.loads() on each line to turn it back into a dictionary.
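
As a rough illustration of that round trip, here is a minimal sketch using only the standard library's json module. The function names dump_records and load_records are made up for this example, and it assumes every entry in the list is JSON-serializable.

```python
import json


def dump_records(path, records):
    """Write each record as one JSON document per line."""
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            # json.dumps produces valid JSON (double quotes, true/false/null),
            # unlike str(), which produces a Python repr.
            fp.write(json.dumps(record) + "\n")


def load_records(path):
    """Read the file back into a list, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]
```

Calling dump_records(target_file, data_list) and then load_records(target_file) should give back a list equal to the original, as long as the data only uses types JSON can represent (for example, tuples come back as lists).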

Alternatively, you can use Prodigy’s internal helper functions util.read_jsonl (returns a generator) and util.write_jsonl. The code looks as follows:

import ujson
from pathlib import Path

def read_jsonl(file_path):
    """Read a .jsonl file and yield its contents line by line.
    file_path (unicode / Path): The file path.
    YIELDS: The loaded JSON contents of each line.
    """
    with Path(file_path).open('r', encoding='utf8') as f:
        for line in f:
            try:  # hack to handle broken jsonl
                yield ujson.loads(line.strip())
            except ValueError:
                continue


def write_jsonl(file_path, lines):
    """Create a .jsonl file and dump contents.
    file_path (unicode / Path): The path to the output file.
    lines (list): The JSON-serializable contents of each line.
    """
    data = [ujson.dumps(line, escape_forward_slashes=False) for line in lines]
    # Use a context manager so the file handle is closed after writing
    with Path(file_path).open('w', encoding='utf-8') as f:
        f.write('\n'.join(data))

Finally, there are also libraries that handle JSONL for you and give you more options, for example jsonlines:

https://jsonlines.readthedocs.io/en/latest/

Hello,
I am new to Prodigy. I have a file in txt format (Spanish), and I need JSONL.
Does anyone have Python code already written for converting the format?
Or maybe some software?
Gus

Hi @gus!

If you don't have any metadata in your .txt file and each item you want to label is on its own line, then Prodigy can load it directly without converting to .jsonl. The "Loaders and Input Data" page of the Prodigy documentation provides an example.

Alternatively, you could load the data using pandas and its to_json() function.

For example:

# test.txt
This is a sentence.
This is another sentence.

import pandas as pd

# `header` is None because test.txt has no header row; `names` labels the single column "text"
df = pd.read_csv("test.txt", header=None, names=["text"])

# Convert to JSONL and export to myfile.jsonl
df.to_json("myfile.jsonl", orient="records", lines=True)
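
If you would rather not depend on pandas (read_csv treats commas as column separators, which can be a problem for ordinary sentences), the same conversion can be sketched with the standard library alone. This is only an illustration: the function name txt_to_jsonl is made up, and it assumes a UTF-8 file with one item per line and "text" as the key you want for each item.

```python
import json


def txt_to_jsonl(txt_path, jsonl_path):
    """Convert a plain-text file (one item per line) into JSONL records."""
    with open(txt_path, "r", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip empty lines
            # ensure_ascii=False keeps accented characters (e.g. Spanish) readable
            dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```

For example, txt_to_jsonl("test.txt", "myfile.jsonl") would write one {"text": ...} object per non-empty line of test.txt.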

Let me know if this solves your problem. Thank you!

Thank you very much. I will try it, read the documentation, and ask if I run into any problems.
Gus

Hi Ryan,
It works. I read and used the "Loaders and Input Data · Prodigy · An annotation tool for AI, Machine..." page, and I could load my files in txt format and run my server.
Thanks very much.
I will continue reading the documentation.
My corpus will consist of historical documents.
Gus