Convert CSV to JSONL

I want to use Prodigy to annotate my data.

The issue is that my data is in a CSV file, and I have not been able to convert it to the JSONL format that Prodigy requires.

Could you please point me to the simplest way to do this conversion?
Thank you...

Hi @Zim1-finest !

You can try using the pandas library, as its to_json method can write the data out in JSONL format (just make sure you pass True to the lines parameter, together with orient="records"). You can check this StackOverflow answer for more information.
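As a minimal sketch of that suggestion (the inline sample data and the text column name are assumptions standing in for a real CSV file):

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for a real CSV file with a "text" header column.
csv_data = io.StringIO("text\nWelcome to Prodigy!\nI love playing baseball\n")
df = pd.read_csv(csv_data)

# lines=True requires orient="records"; each row becomes one JSON object per line.
jsonl_str = df.to_json(orient="records", lines=True)
print(jsonl_str)

# Each line can be parsed on its own as a JSON object.
records = [json.loads(line) for line in jsonl_str.splitlines() if line.strip()]
```

Without a path argument, to_json returns the JSONL content as a string rather than writing a file.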

Hi @ljvmiranda921

I tried following the answer you pointed out.
My JSON file ends up looking like this:

When I try to load it into Prodigy to start annotating, I get this:

I think I need to convert my file to follow this example file:

I must be missing a very small step. Unfortunately, my knowledge of Python is rather limited. Any assistance is appreciated.
Thank you

JSON and JSONL (newline-delimited JSON) are both fine input formats and I think there's only one tiny problem in your input file, otherwise it looks good. If you look closely at the structure, it currently looks like this:

{"text": [{"text": "blah"}, ...]}

But you want it to be just this:

[{"text": "blah"}, ...]

So if you're saving it out in Python, doing something like data["text"] should give you just the list. And then you can save that to a .json file.
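A minimal sketch of that unwrapping step, assuming the data is already loaded as a Python dictionary (the sample values and the filename examples.json are placeholders):

```python
import json

# Hypothetical data mirroring the nested structure described above.
data = {"text": [{"text": "blah"}, {"text": "more blah"}]}

# Keep only the inner list of {"text": ...} dictionaries.
examples = data["text"]

# Save the list as a regular JSON file.
with open("examples.json", "w", encoding="utf-8") as f:
    json.dump(examples, f)

# Reading it back gives a list at the top level, not a dict.
with open("examples.json", encoding="utf-8") as f:
    reloaded = json.load(f)
```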

Hi Ines,

So I tried two things: first, using your advice, I got this JSON file

but still the same error,

I then used a different approach to generate a json file which resulted in this format:

and yet again it says [x] Invalid JSON file: expected list, got <class 'dict'>

Can you share the raw JSON it generated (instead of the visual preview)? Maybe pandas ended up actually exporting it as a dict with keys 0 instead of a list, or something like that?

The second version is definitely not correct, because you want a list of dictionaries with the key "text", e.g. [{"text": "..."}].

This is the file:
trial.jsonl (1.6 KB)

This is the code I have used for clarity:

import pandas as pd

df = pd.read_csv('usa5.csv', encoding = 'ISO-8859-1')
df["text"] = df.text.apply(lambda x: {"text":x})

df = df['text']
print(df.to_json(orient = 'records', lines = True))

For Prodigy I used:

!python -m prodigy ner.teach test en_core_web_trf trial.json

which still results in:

[x] Invalid JSON file: expected list, got <class 'dict'>

Hi @Zim1-finest !

For the code, you only need to have a text column in your pandas DataFrame that contains exactly your text. Assuming you have a CSV file where each line is a text:

# test.csv
Welcome to Prodigy!
I love playing baseball
My brother went to the library

You can then convert them using this script:

import pandas as pd

# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)

# Convert to JSONL
jsonl_str = df.to_json(orient="records", lines=True)

# Inspect JSONL string
print(jsonl_str)

Thank you both for your assistance,

How can I save the printed jsonl_str to a file?

I noticed that using jsonl_str.to_json(r'') returns:

AttributeError: 'str' object has no attribute 'to_json'

Hi @Zim1-finest !

To save it as a file, you can pass a path to the to_json function. Something like this:
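A minimal sketch of this, assuming a DataFrame with a text column (the inline sample data and the filename output.jsonl are placeholders):

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the CSV; in practice, pass the CSV path to read_csv.
csv_data = io.StringIO("text\nWelcome to Prodigy!\nI love playing baseball\n")
df = pd.read_csv(csv_data)

# Passing a path as the first argument writes the file instead of returning a string.
df.to_json("output.jsonl", orient="records", lines=True)

# Read the file back to confirm there is one JSON object per line.
with open("output.jsonl", encoding="utf-8") as f:
    saved = [json.loads(line) for line in f if line.strip()]
```

Note that to_json is called on the DataFrame (df), not on the string it returned earlier.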


I made some progress:

I saved my file through:

# Create a file path to the intended json file
jsonFilePath = 'trial3.json'

# create new json file and write data on it
with open(jsonFilePath, 'w') as jsonFile:
    # make it more readable and pretty

Now when I open the JSON file in a web browser, I get the error:

SyntaxError: JSON.parse: unexpected non-whitespace character after JSON data at line 2 column 1 of the JSON data

and when I load it into Prodigy, I now get

ValueError: Trailing data

after a long list of other errors

Oh, have you tried the pandas dataframe suggestion above? We might have replied at the same time :sweat_smile: Saving with the json module can sometimes be tricky. At least with the dataframe approach, it should be handled already.
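Both errors above (unexpected data at line 2, and "Trailing data") are what a plain JSON parser typically reports when a file holds one object per line, i.e. JSONL content read as a single JSON document. This is an assumption about the saved file, since its contents aren't shown. A small sketch with made-up records:

```python
import json

# Two JSONL records: one JSON object per line (made-up sample data).
jsonl_text = '{"text": "Welcome to Prodigy!"}\n{"text": "I love playing baseball"}\n'

# Parsing the whole content as one JSON document fails once line 2 starts.
try:
    json.loads(jsonl_text)
    parsed_as_json = True
except json.JSONDecodeError as err:
    parsed_as_json = False
    print("Not valid JSON:", err)

# Parsing line by line works, which is how JSONL is meant to be read.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(records)
```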


@ljvmiranda921 this worked when I had the df as a dictionary. When I tried it again after following your suggestion, I got: str object has no attribute 'to_json'

Try this instead:

In this approach, we already save the file right away into JSONL. We don't need to go through its "string" representation. The output should already be a file in your Windows folder :slight_smile:

The reason we got the error str object has no attribute to_json is that jsonl_str is a plain Python string, and to_json is a method of the DataFrame, not of strings. To turn a string into a JSONL file ourselves, we'd need to jump through a few hoops (like importing the json package, etc.).

However, if we use the df.to_json function and supply the file path as one of the parameters, we skip the hassle and get our file right away :smiley:

I'm beginning to think that the problem might also be with my CSV.

Maybe let me try and clean my csv and see if I can solve this problem :pleading_face:

Ok sure! Let us know how it went :slight_smile: After cleaning your CSV, you can try this again:

(Assuming your CSV looks like this)

# test.csv
Welcome to Prodigy!
I love playing baseball
My brother went to the library

You can try this

import pandas as pd

# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)

@ljvmiranda921 I think the problem was that I kept saving as json instead of jsonl :grinning_face_with_smiling_eyes:. One of the problems, so to say.

Now I have another issue :woman_facepalming:

When I load the jsonl file into Prodigy it says:

[x] Error while validating stream: no first example
This likely means that your stream is empty.This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe

I am getting this error after running this:

!python -m prodigy ner.teach test en_core_web_trf trial8.jsonl

My CSV is similar to the one you gave. Could the problem be that I'm using a pretrained spaCy model?
My main goal is to train new named entities on an empty model.

Oh, and my jsonl file looks like this:

Hi @Zim1-finest ! :slight_smile: It's not about the pretrained spaCy models.
The problem is with the column names. Try the following.
Notice the rename step: we rename column 0 to text so that, once the data is saved as JSONL, each record has a text key:

import pandas as pd

# `header` is None because we don't have a CSV header in the example
df = pd.read_csv("test.csv", encoding="ISO-8859-1", header=None)
df = df.rename(columns={0:"text"})
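To illustrate why the rename matters, here is a self-contained sketch of the whole flow. The inline sample data stands in for test.csv, and the filename renamed.jsonl is a placeholder. With header=None, pandas labels the single column 0, so without the rename each record would be keyed by "0" rather than "text":

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for test.csv: one text per line, no header row.
csv_data = io.StringIO(
    "Welcome to Prodigy!\nI love playing baseball\nMy brother went to the library\n"
)
df = pd.read_csv(csv_data, header=None)

# Without the rename, the only column is labelled 0, so the JSON key would be "0".
keys_before = list(json.loads(df.to_json(orient="records", lines=True).splitlines()[0]))

# Rename column 0 to "text" and write the JSONL file directly.
df = df.rename(columns={0: "text"})
df.to_json("renamed.jsonl", orient="records", lines=True)

# Check that every saved record now has a "text" key.
with open("renamed.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
has_text = all("text" in record for record in records)
```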

That was indeed the problem...

Thank you so much. You can mark this as solved :grinning: :grinning: