Yes, your general approach is correct. JSONL is newline-delimited JSON, so to export data in that format, you can call json.dumps() on each record and add a \n after it. To load it back in, you can read the file line by line and call json.loads() on each line to turn it back into a dictionary.
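As a minimal sketch of that approach using only the standard library (the function names export_jsonl and load_jsonl are made up for illustration and are not part of Prodigy):

```python
import json
from pathlib import Path


def export_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper name)."""
    with Path(path).open("w", encoding="utf8") as f:
        for record in records:
            # json.dumps() produces a single-line string; the "\n" ends the record
            f.write(json.dumps(record) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dictionaries (hypothetical helper name)."""
    with Path(path).open("r", encoding="utf8") as f:
        # Skip blank lines, e.g. a trailing empty line at the end of the file
        return [json.loads(line) for line in f if line.strip()]
```

Because json.dumps() escapes any newline characters inside string values, each record stays on its own line even if the text itself contains line breaks.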
Alternatively, you can also use Prodigy’s internal helper functions util.read_jsonl (returns a generator) and util.write_jsonl. The code looks as follows:
    import ujson
    from pathlib import Path


    def read_jsonl(file_path):
        """Read a .jsonl file and yield its contents line by line.

        file_path (unicode / Path): The file path.
        YIELDS: The loaded JSON contents of each line.
        """
        with Path(file_path).open('r', encoding='utf8') as f:
            for line in f:
                try:  # hack to handle broken jsonl
                    yield ujson.loads(line.strip())
                except ValueError:
                    # skip lines that are not valid JSON (including blank lines)
                    continue


    def write_jsonl(file_path, lines):
        """Create a .jsonl file and dump contents.

        file_path (unicode / Path): The path to the output file.
        lines (list): The JSON-serializable contents of each line.
        """
        data = [ujson.dumps(line, escape_forward_slashes=False) for line in lines]
        # use a context manager so the file handle is closed after writing
        with Path(file_path).open('w', encoding='utf-8') as f:
            f.write('\n'.join(data))
Finally, there are also libraries that handle JSONL for you and give you more options. This one for example: