Convert CSV to JSONL

Zim1-finest · November 16, 2021, 5:09pm

I want to use Prodigy to annotate my data.

The issue is that my data is in a CSV file, and I have not managed to convert it to the JSONL format that Prodigy requires.

Can you please point me to the simplest way to do this conversion?
Thank you...

ljvmiranda921 · November 17, 2021, 12:26am

Hi @Zim1-finest !

You can try using the pandas library: its to_json method can write JSONL directly if you pass lines=True together with orient="records". You can check this StackOverflow answer for more information.

Zim1-finest · November 18, 2021, 12:30pm

Hi @ljvmiranda921

I tried following the answer you pointed out.
My JSON file ends up looking like this:

When I try to load it into Prodigy to start annotating, I get this:

I think I need to convert my file to follow this example file:

I must be missing a very small step. Unfortunately for me, my knowledge of Python is rather limited. Any assistance is appreciated.
Thank you

ines · November 18, 2021, 12:34pm

JSON and JSONL (newline-delimited JSON) are both fine input formats and I think there's only one tiny problem in your input file, otherwise it looks good. If you look closely at the structure, it currently looks like this:

{"text": [{"text": "blah"}, ...]}

But you want it to be just this:

[{"text": "blah"}, ...]

So if you're saving it out in Python, doing something like data["text"] should give you just the list. And then you can save that to a .json file.
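
As a minimal sketch of that unwrapping step, using only the standard json module: the variable names and sample strings below are made up for illustration, and the real data would come from your own export.

```python
import json

# Hypothetical data in the wrapped shape described above
data = {"text": [{"text": "blah"}, {"text": "more blah"}]}

# Take just the inner list of records
examples = data["text"]

# Serialize the list; the top-level JSON value is now a list, not a dict
json_str = json.dumps(examples)
print(json_str)
```

Writing json_str to a file with a .json extension would then give a file whose top-level value is a list.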

Zim1-finest · November 18, 2021, 12:58pm

Hi Ines,

So I tried two things. First, following your advice, I got this JSON file

but still got the same error.

I then used a different approach to generate a JSON file, which resulted in this format:

and yet again it says [x] Invalid JSON file: expected list, got <class 'dict'>

ines · November 18, 2021, 4:37pm

Can you share the raw JSON it generated (instead of the visual preview)? Maybe pandas ended up actually exporting it as a dict with keys 0 instead of a list, or something like that?

The second version is definitely not correct, because you want a list of dictionaries with the key "text", e.g. [{"text": "..."}].

Zim1-finest · November 18, 2021, 5:14pm

This is the file:
trial.jsonl (1.6 KB)

This is the code I have used for clarity:

import pandas as pd

df = pd.read_csv('usa5.csv', encoding = 'ISO-8859-1')
df["text"] = df.text.apply(lambda x: {"text":x})

df = df['text']
print(df.to_json(orient = 'records', lines = True))
df.to_json(r'C:\Users\b1075161\Documents\Prodigy\prodogy_files\trial.json')

For Prodigy I used:

!python -m prodigy ner.teach test en_core_web_trf trial.json

which still results in:

Invalid JSON file: expected list, got <class 'dict'>

ljvmiranda921 · November 19, 2021, 12:54am

Hi @Zim1-finest !

For the code, you only need to have a text column in your pandas DataFrame that contains exactly your text. Assuming you have a CSV file where each line is a text:

# test.csv
Welcome to Prodigy!
I love playing baseball
My brother went to the library

You can then convert them using this script:

import pandas as pd

# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)

# Convert to JSONL
jsonl_str = df.to_json(orient="records", lines=True)

# Inspect JSONL string
print(jsonl_str)

Zim1-finest · November 19, 2021, 11:30am

Thank you both for your assistance,

How can I save the printed jsonl_str to a file?

I noticed that using jsonl_str.to_json(r'') returns:

AttributeError: 'str' object has no attribute 'to_json'

ljvmiranda921 · November 19, 2021, 11:50am

Hi @Zim1-finest !

To save it as a file, you can pass a path to the to_json function. Something like this:

df.to_json(
    r'C:\Users\b1075161\Documents\Prodigy\prodogy_files\trial.jsonl', 
    orient="records", 
    lines=True
)

Zim1-finest · November 19, 2021, 11:54am

I made some progress:

I saved my file through:

# Create a file path to the intended json file
jsonFilePath = 'trial3.json'

# create new json file and write data on it
with open(jsonFilePath, 'w') as jsonFile:
    # make it more readable and pretty
    jsonFile.write(jsonlst)

Now when I open the JSON file in the web browser, I get the error:

SyntaxError: JSON.parse: unexpected non-whitespace character after JSON data at line 2 column 1 of the JSON data

and when I load it into Prodigy I now get

ValueError: Trailing data

after a long list of other errors
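
Both errors are consistent with JSONL content being parsed as a single JSON document: a JSONL file holds one JSON object per line, so a parser that expects one value stops at line 2. A small sketch with made-up sample text, using only the standard library:

```python
import json

# Hypothetical JSONL content: one JSON object per line
jsonl_str = '{"text": "Welcome to Prodigy!"}\n{"text": "I love playing baseball"}\n'

# Parsing the whole string as a single JSON document fails at the second line
try:
    json.loads(jsonl_str)
    parsed_as_single_document = True
except json.JSONDecodeError:
    parsed_as_single_document = False

# Parsing line by line works
records = [json.loads(line) for line in jsonl_str.splitlines() if line.strip()]
print(parsed_as_single_document, records)
```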

ljvmiranda921 · November 19, 2021, 11:57am

Oh, have you tried the pandas dataframe suggestion above? We might have replied at the same time. Saving with the json module can sometimes be tricky. At least with the dataframe approach, it should be handled already.

Zim1-finest · November 19, 2021, 11:59am

@ljvmiranda921 this worked when I had the df as a dictionary. When I tried it again after following your suggestion, I get: str object has no attribute 'to_json'

ljvmiranda921 · November 19, 2021, 12:00pm

Try this instead:

In this approach, we save the file straight to JSONL without going through its string representation. The output should already be a file in your Windows folder.

The error str object has no attribute to_json happens because to_json is a method of the pandas DataFrame, and jsonl_str is a plain Python string. There is no direct way to turn that string into a JSONL file with to_json; we would need to jump through a few hoops (like importing the json package, etc.).

However, if we call df.to_json and supply the file path as one of the parameters, we skip that hassle and get our file right away.
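
For completeness, if you already have the JSONL string, it can also be written with plain Python file handling. This is only a sketch: the sample text and the temporary output path are placeholders, not the paths from this thread.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical JSONL string, shaped like the output of df.to_json(orient="records", lines=True)
jsonl_str = '{"text":"Welcome to Prodigy!"}\n{"text":"I love playing baseball"}\n'

# Placeholder output location; replace with your own folder
out_path = Path(tempfile.mkdtemp()) / "trial.jsonl"

# Write the string unchanged, keeping one JSON object per line
out_path.write_text(jsonl_str, encoding="utf-8")

# Read it back line by line to confirm every line is valid JSON
lines = out_path.read_text(encoding="utf-8").splitlines()
records = [json.loads(line) for line in lines if line.strip()]
print(records)
```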

Zim1-finest · November 19, 2021, 12:01pm

I'm beginning to think that the problem might also be with my CSV.

Let me try cleaning my CSV and see if that solves the problem.

ljvmiranda921 · November 19, 2021, 12:05pm

Ok sure! Let us know how it goes. After cleaning your CSV, you can try this again:

(Assuming your CSV looks like this)

# test.csv
Welcome to Prodigy!
I love playing baseball
My brother went to the library

You can try this:

import pandas as pd

# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)
df.to_json(
    r'C:\Users\b1075161\Documents\Prodigy\prodogy_files\trial.jsonl',
    orient="records", 
    lines=True
)

Zim1-finest · November 19, 2021, 12:50pm

@ljvmiranda921 I think one of the problems was that I kept saving the file as .json instead of .jsonl.

Now I have another issue.

When I load the JSONL file into Prodigy, it says:

Error while validating stream: no first example
This likely means that your stream is empty.This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.

I am getting this error after running this:

!python -m prodigy ner.teach test en_core_web_trf trial8.jsonl

My CSV is similar to the one you gave. Could the problem be that I am using a pretrained spaCy model?
My main goal is to train new named entities on an empty model.

Oh, and my JSONL file looks like this

ljvmiranda921 · November 19, 2021, 1:39pm

Hi @Zim1-finest ! It's not about the pretrained spaCy models.
The problem is the column names. With header=None, pandas names the only column 0, so the saved records have a "0" key instead of a "text" key.
Notice the rename step below: we rename column 0 to text, so that once it's saved into JSONL, the text key shows up:

import pandas as pd

# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)
df = df.rename(columns={0:"text"})
df.to_json(
    r'C:\Users\b1075161\Documents\Prodigy\prodogy_files\trial.jsonl',
    orient="records", 
    lines=True
)
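
To see what the rename changes, here is a small in-memory sketch. It reads the sample lines from a string instead of test.csv (an assumption made so the snippet is self-contained) and compares the first JSONL line before and after renaming.

```python
import io
import json

import pandas as pd

# Sample lines from the example above, read from memory instead of test.csv
csv_text = "Welcome to Prodigy!\nI love playing baseball\nMy brother went to the library\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Without the rename, the only column is named 0
before = df.to_json(orient="records", lines=True)

# After the rename, each record has a "text" key
after = df.rename(columns={0: "text"}).to_json(orient="records", lines=True)

first_before = json.loads(before.splitlines()[0])
first_after = json.loads(after.splitlines()[0])
print(first_before)
print(first_after)
```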

Zim1-finest · November 19, 2021, 3:17pm

That was indeed the problem...

Thank you so much. You can mark this as solved

kushal_pythonist · May 28, 2022, 1:07pm

Hi! My JOB.JSONL file looks like the one below, and I think it is not in the correct form:

{"0":"Designation","1":"CompanyName","2":"CompanyLocation","3":"JobSummary","4":"PostedDate","5":"Salary"}
{"0":"newSr. Backend Developer","1":"RED TECHNOLOGIES","2":"Charlotte, NC 28203 (Dilworth area)","3":"The Sr. Backend Developer is responsible for work in all stages of the development life cycle - reviewing business requirements, design, construction, testing,\u00e2\u0080\u00a6","4":"PostedToday","5":"None"}

How can I change it to a working JSONL file?
Any ideas?
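
One possible reading of this output: the keys "0" to "5", together with a first line that holds the column names, suggest the CSV was read with header=None, so the header row was treated as data. The sketch below reads a shortened, made-up version of such a CSV with its header row and renames one column to text. Treating JobSummary as the text to annotate is an assumption; pick whichever column holds your text.

```python
import io
import json

import pandas as pd

# Shortened, hypothetical CSV with a header row
csv_text = (
    "Designation,CompanyName,JobSummary\n"
    '"Sr. Backend Developer","RED TECHNOLOGIES","Responsible for all stages of development."\n'
)

# header=0 (the default) uses the first row as column names
df = pd.read_csv(io.StringIO(csv_text), header=0)

# Assumption: JobSummary is the column to annotate
df = df.rename(columns={"JobSummary": "text"})

jsonl_str = df.to_json(orient="records", lines=True)
first = json.loads(jsonl_str.splitlines()[0])
print(first)
```

The sequence \u00e2\u0080\u00a6 in the output above also looks like a UTF-8 ellipsis that was decoded as ISO-8859-1, so reading the CSV with encoding="utf-8" may be worth trying.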
