jsonl format

Hi all,

I'm having problems with the .jsonl format (Python 3.6.4). After some serious hacking I managed to write a list of dictionaries in JSONL format by converting each list entry to a string:

data_list = [{dict1_data}, {dict2_data}, ...]
with open(target_file, "w") as fp:
    for data in data_list:
        fp.write(str(data) + "\n")

I'm trying to read the file back so that I get the original list again:

with open(source_file, "r") as fp:
    data_list = fp.read()  # gives one long string, not the list of dicts

but no luck so far. Any suggestions (ideally with code)?

best, Andreas


Yes, your general approach is correct. JSONL is newline-delimited JSON, so to export data in that format, you can call json.dumps() on each entry and append a \n. To load it back in, read the file line by line and call json.loads() on each line to turn it back into a dictionary.
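Here is a minimal sketch of that round trip using only the standard-library json module. The helper names dump_jsonl and load_jsonl, the file name, and the sample records are made up for illustration. One thing to watch out for: str(data) writes the Python repr of a dict (single quotes, True/None), which json.loads() will reject, so the entries need to go through json.dumps() on the way out.

```python
import json
import os
import tempfile


def dump_jsonl(path, records):
    """Write one JSON document per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record) + "\n")


def load_jsonl(path):
    """Return the list of objects stored in a JSONL file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as fp:
        # Skip blank lines, e.g. a trailing empty line at the end of the file.
        return [json.loads(line) for line in fp if line.strip()]


# Made-up sample data standing in for the original data_list.
data_list = [{"text": "first", "label": 1}, {"text": "second", "label": 0}]
target_file = os.path.join(tempfile.mkdtemp(), "example.jsonl")

dump_jsonl(target_file, data_list)
restored = load_jsonl(target_file)
```

The with block closes the file on exit, so no explicit fp.close() is needed.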

Alternatively, you can use Prodigy’s internal helper functions util.read_jsonl (returns a generator) and util.write_jsonl. Their code looks as follows:

import ujson
from pathlib import Path

def read_jsonl(file_path):
    """Read a .jsonl file and yield its contents line by line.
    file_path (unicode / Path): The file path.
    YIELDS: The loaded JSON contents of each line.
    """
    with Path(file_path).open('r', encoding='utf8') as f:
        for line in f:
            try:  # hack to handle broken jsonl
                yield ujson.loads(line.strip())
            except ValueError:
                continue


def write_jsonl(file_path, lines):
    """Create a .jsonl file and dump contents.
    file_path (unicode / Path): The path to the output file.
    lines (list): The JSON-serializable contents of each line.
    """
    data = [ujson.dumps(line, escape_forward_slashes=False) for line in lines]
    # Use a context manager so the file handle is closed (and flushed) reliably.
    with Path(file_path).open('w', encoding='utf8') as f:
        f.write('\n'.join(data))
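Since read_jsonl is a generator, it yields records one at a time rather than returning a list, so you need to iterate over it or wrap it in list() to get the original list back. The sketch below illustrates this with a standard-library stand-in called iter_jsonl (a hypothetical name, using json instead of ujson) and a made-up temporary file, so it runs without Prodigy installed.

```python
import itertools
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed object per non-empty line (stdlib stand-in for read_jsonl)."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Made-up file with 1000 small records.
path = os.path.join(tempfile.mkdtemp(), "big.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for i in range(1000):
        f.write(json.dumps({"id": i}) + "\n")

# Lazy: only the first three records are parsed here.
first_three = list(itertools.islice(iter_jsonl(path), 3))

# Eager: wrap the generator in list() when the whole file is needed in memory.
everything = list(iter_jsonl(path))
```

The lazy form is handy for large files, while list() gives you the same kind of object as the data_list you started with.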

Finally, there are also libraries that handle JSONL for you and give you more options, for example jsonlines:

https://jsonlines.readthedocs.io/en/latest/
