Yes, your general approach is correct. JSONL is newline-delimited JSON, so to export any data in that format, you can call json.dumps() and add a \n to each line. To load it back in, you can read in each line and then call json.loads() on each line to transform them back to a dictionary.
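As a minimal sketch of the approach described above, using only the standard library (the helper names `dumps_jsonl` and `loads_jsonl` are hypothetical, chosen here for illustration):

```python
import json


def dumps_jsonl(records):
    """Serialize an iterable of JSON-serializable objects to a JSONL string."""
    # One json.dumps() call per record, each terminated by a newline.
    return "".join(json.dumps(record) + "\n" for record in records)


def loads_jsonl(text):
    """Parse a JSONL string back into a list of Python objects."""
    # Skip blank lines so a trailing newline does not cause an error.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = [{"text": "This is a sentence."}, {"text": "This is another sentence."}]
jsonl_text = dumps_jsonl(records)
restored = loads_jsonl(jsonl_text)
```

Here `restored` should compare equal to `records`, since each line is parsed back into the dictionary it came from.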
Alternatively, you can also use Prodigy’s internal helper functions util.read_jsonl (returns a generator) and util.write_jsonl. The code looks as follows:
from pathlib import Path
import ujson

def read_jsonl(file_path):
    """Read a .jsonl file and yield its contents line by line.
    file_path (unicode / Path): The file path.
    YIELDS: The loaded JSON contents of each line.
    """
    with Path(file_path).open('r', encoding='utf8') as f:
        for line in f:
            try:  # hack to handle broken jsonl
                yield ujson.loads(line.strip())
            except ValueError:
                continue

def write_jsonl(file_path, lines):
    """Create a .jsonl file and dump contents.
    file_path (unicode / Path): The path to the output file.
    lines (list): The JSON-serializable contents of each line.
    """
    data = [ujson.dumps(line, escape_forward_slashes=False) for line in lines]
    with Path(file_path).open('w', encoding='utf8') as f:
        f.write('\n'.join(data))
Finally, there are also libraries that handle JSONL for you and give you more options. This one for example:
If you don't have any metadata in your .txt file and each item you want to label is on its own line, Prodigy can load the file directly without converting it to .jsonl. Here's some documentation that provides an example:
Alternatively, you could load the data using pandas and export it with its to_json() function. For example, given a test.txt file with one sentence per line:
This is a sentence.
This is another sentence.
import pandas as pd
# `header` is None because test.txt has no header row; `names` labels the single column "text"
# Caveat: read_csv splits on commas by default, so this assumes the lines contain no commas
df = pd.read_csv("test.txt", header=None, names=["text"])
# Convert to JSONL and export to myfile.jsonl
df.to_json("myfile.jsonl", orient="records", lines=True)
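If pandas isn't installed, or if the lines may contain commas (which `read_csv` treats as column separators by default), the same conversion can be sketched with the standard library alone. The function name `txt_to_jsonl` is hypothetical, and the `"text"` key mirrors the column name used above:

```python
import json
from pathlib import Path


def txt_to_jsonl(txt_path, jsonl_path):
    """Write one {"text": ...} record per non-empty line of a text file.

    Returns the number of records written.
    """
    count = 0
    with Path(txt_path).open("r", encoding="utf8") as src, \
            Path(jsonl_path).open("w", encoding="utf8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip blank lines
            # ensure_ascii=False keeps non-ASCII characters readable in the output
            dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `txt_to_jsonl("test.txt", "myfile.jsonl")` would then produce one JSON object per sentence, with commas inside a sentence left untouched.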
Let me know if this solves your problem. Thank you!
It's OK. I read and used the "Loaders and Input Data · Prodigy · An annotation tool for AI, Machine..." page, and I was able to load my files in .txt format and run my server.
Thanks very much
I will continue reading the documentation...
My corpus will consist of historical documents.