Convert pandas dataframe to suitable jsonl file

mitro · August 4, 2020, 9:39pm

Hi all,
I have a dataframe with a text column, which I'd like to annotate, and other columns with metadata.
The strings in my text column have to start and end with \n (I would like to annotate it literally), and I'm having trouble converting the dataframe to a suitable JSONL file.

Is there any code for converting a dataframe to JSONL that can serve as input to Prodigy?

I appreciate any help. All the best

umer · August 4, 2020, 11:05pm

Hi Mitro,
A dataframe can easily be converted to JSON.
For example, if your dataframe is stored in a variable named data, all you need to do is the following:

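import json
# to_json() without a path returns the whole dataframe as one JSON string
file = data.to_json()
# parse that string back into a Python dict
json_file = json.loads(file)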

If you want to save the converted data, json.dumps(json_file, indent=4) gives you a formatted string that you can write to a file.
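For Prodigy input, a JSONL file (one JSON object per line) is usually more convenient than a single JSON document. The following is only a minimal sketch, assuming the column to annotate is named text and that the remaining columns should be stored under a "meta" key. The function name dataframe_to_jsonl and the example columns are hypothetical.

```python
import json

import pandas as pd


def dataframe_to_jsonl(df, text_column="text"):
    """Return a JSONL string with one {"text": ..., "meta": {...}} object per row."""
    # Round-trip through to_json so NumPy values become plain JSON types.
    records = json.loads(df.to_json(orient="records"))
    lines = []
    for record in records:
        text = record.pop(text_column)
        # Every column other than the text column ends up under "meta" (an assumption).
        lines.append(json.dumps({"text": text, "meta": record}, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical example data: the texts start and end with a newline character.
data = pd.DataFrame({"text": ["\nsome text\n"], "source": ["a"]})
jsonl = dataframe_to_jsonl(data)
```

The newline characters inside the text are written as the two-character escape \n in the file and come back as real newlines when the file is parsed, so each record still occupies exactly one line.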

Hope the response is useful

mitro · August 5, 2020, 7:33am

Thanks for your response. Actually, this is what I was trying out. The input in my dataframe's text column is: "\nsome text\n"

I would like to annotate the "\n" literally in Prodigy.

mitro · August 5, 2020, 10:30am

Has anyone faced a similar problem and could help me?

I have many "\n" in my short texts and would like to load the "\n" literally in Prodigy.

ines · August 5, 2020, 10:37am

By "literally", do you mean that you want to have the line break included in the data, or do you want to be able to literally read the text "\n" on the screen?

In the manual annotation interfaces, Prodigy will render newline characters with a little ↵ icon and a line break to make sure they're visible and you know they're in the data, but don't annotate them by accident (see here for an example). In the non-manual interfaces, they're rendered as regular line breaks by default.

If you actually want to see \n on the screen, you'd have to escape the \ in your JSON, e.g. \\n. This way, it's not interpreted as a newline.
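The difference between the two spellings can be checked with Python's standard json module. This is just an illustrative sketch; the variable names and the example record are made up.

```python
import json

# Raw strings keep the backslashes exactly as they would appear in the JSONL file.
# One backslash: the JSON parser turns \n into a real newline character.
as_newline = json.loads(r'{"text": "\nsome text\n"}')["text"]

# Two backslashes: the JSON parser yields a backslash followed by the letter n.
as_literal = json.loads(r'{"text": "\\nsome text\\n"}')["text"]

print(repr(as_newline))
print(repr(as_literal))
```

In the first case the text contains actual line breaks. In the second case it contains the visible characters \ and n.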

mitro · August 5, 2020, 10:40am

Hi Ines,

Thanks for your response. Yes, the arrow is what I actually see. So if I annotate my data with the arrow (line break), will the NER model make correct predictions for entities containing \n? The \n is a valuable feature that my model needs to learn.

ines · August 5, 2020, 12:09pm

Yes, exactly, that's the idea of the arrow symbol! Without it, the newlines can easily be overlooked (even though they can potentially matter a lot to the model).

By default, Prodigy automatically disables newline and whitespace tokens and makes them unselectable. For most use cases, this is reasonable, because you typically don't want newlines or tabs inside your entity spans. But you can change the behaviour by setting "allow_newline_highlight": true in your prodigy.json.
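If you would rather set this from a script than edit the file by hand, a small sketch could look like the following. The helper name enable_newline_highlight and the path passed to it are assumptions; only the "allow_newline_highlight" key comes from the explanation above.

```python
import json
from pathlib import Path


def enable_newline_highlight(config_path):
    """Set "allow_newline_highlight" to true in a prodigy.json, keeping other settings."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["allow_newline_highlight"] = True
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling enable_newline_highlight("prodigy.json") in your working directory would leave any other settings in that file untouched.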

umer · August 5, 2020, 3:58pm

Oh, I understand your question now. Anyway, you got a response from an expert.
