I'm having problems with the .jsonl format (Python 3.6.4). After some serious hacking, I managed to write a list of dictionaries in JSONL format (I converted each list entry to a string):
data_list = [{dict1_data}, {dict1_data}, ...]
with open(target_file, "w") as fp:
    for data in data_list:
        fp.write(str(data) + "\n")
    fp.close()
Now I am trying to read the file back so that I get the original list again:
with open(source_file, "r") as fp:
    for data in data_list:
        data_list fp.read())
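Yes, your general approach is correct. JSONL is newline-delimited JSON, so to export any data in that format, you can call json.dumps() on each entry and add a \n after it. The one change I'd make is to use json.dumps(data) instead of str(data), since str() writes a Python repr (single quotes, None/True/False) that is not valid JSON. To load the data back in, you can read the file line by line and call json.loads() on each line to turn it back into a dictionary.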
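To make the round trip concrete, here is a minimal sketch using only the standard library's json module. The helper names dump_jsonl and load_jsonl, the sample records, and the temporary file path are all made up for illustration; they are not part of Prodigy.

```python
import json
import os
import tempfile


def dump_jsonl(path, records):
    """Write each record as one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            # json.dumps emits valid JSON (double quotes, null/true/false);
            # str(record) would emit a Python repr that json.loads rejects.
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read a JSONL file back into a list of dictionaries (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as fp:
        # Skip blank lines so a stray empty line does not raise an error.
        return [json.loads(line) for line in fp if line.strip()]


# Illustrative data only; the field names are assumptions for this example.
data_list = [
    {"text": "first entry", "label": "A"},
    {"text": "segunda entrada", "label": None},
]

target_file = os.path.join(tempfile.mkdtemp(), "data.jsonl")
dump_jsonl(target_file, data_list)
restored = load_jsonl(target_file)
```

The with block closes the file on exit, so no explicit fp.close() is needed. After running this, restored should compare equal to data_list.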
Alternatively, you can also use Prodigy’s internal helper functions util.read_jsonl (returns a generator) and util.write_jsonl. The code looks as follows:
import ujson
from pathlib import Path


def read_jsonl(file_path):
    """Read a .jsonl file and yield its contents line by line.

    file_path (unicode / Path): The file path.
    YIELDS: The loaded JSON contents of each line.
    """
    with Path(file_path).open('r', encoding='utf8') as f:
        for line in f:
            try:  # hack to handle broken jsonl
                yield ujson.loads(line.strip())
            except ValueError:
                continue


def write_jsonl(file_path, lines):
    """Create a .jsonl file and dump contents.

    file_path (unicode / Path): The path to the output file.
    lines (list): The JSON-serializable contents of each line.
    """
    data = [ujson.dumps(line, escape_forward_slashes=False) for line in lines]
    Path(file_path).open('w', encoding='utf-8').write('\n'.join(data))
Finally, there are also libraries that handle JSONL for you and give you more options. This one for example:
Hello
I am new to Prodigy. I have a file in .txt format (in Spanish), and I need it as JSONL.
Does anyone have Python code already written to convert the format?
Or is there maybe some software for this?
Gus
If you don't have any metadata in your .txt file and each item you want to label is on its own line, then Prodigy can load it directly without needing to convert to .jsonl. Here's some documentation that provides an example:
Alternatively, you could load the data using pandas and its to_json() function.
For example:
# test.txt
This is a sentence.
This is another sentence.
import pandas as pd

# `header` is None because test.txt has no header row; `names` labels the single column "text".
# Note: read_csv splits on commas by default, so this assumes no line contains a comma.
df = pd.read_csv("test.txt", header=None, names=["text"])
# Convert to JSONL and export to myfile.jsonl
df.to_json("myfile.jsonl", orient="records", lines=True)
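If you would rather avoid pandas, or your sentences contain commas, a plain-Python version is also possible. This is only a sketch: the function name txt_to_jsonl, the sample Spanish sentences, and the temporary paths are made up for illustration, and it assumes one example per line with the text stored under a "text" key, as in the pandas example above.

```python
import json
import os
import tempfile


def txt_to_jsonl(txt_path, jsonl_path):
    """Convert a text file with one example per line to JSONL (hypothetical helper).

    Returns the number of records written.
    """
    count = 0
    with open(txt_path, "r", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip empty lines
            # ensure_ascii=False keeps accented characters readable in the output.
            dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Build a small sample input in a temporary directory (illustrative data only).
workdir = tempfile.mkdtemp()
txt_path = os.path.join(workdir, "test.txt")
jsonl_path = os.path.join(workdir, "myfile.jsonl")
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("Esta es una oración, con una coma.\n\nEsta es otra oración.\n")

n_written = txt_to_jsonl(txt_path, jsonl_path)
```

Because each line is treated as a whole string rather than parsed as CSV, commas inside a sentence are preserved as part of the text.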
Let me know if this solves your problem. Thank you!
Hi Ryan
That works. I read and followed the "Loaders and Input Data · Prodigy · An annotation tool for AI, Machine..." page, and I was able to load my files in .txt format and run my server.
Thanks very much.
I will continue reading the documentation...
My corpus will consist of historical documents.
Gus