Convert pandas dataframe to suitable jsonl file

Hi all,
I have a dataframe with a text column that I'd like to annotate, plus other columns with metadata.
The strings in my text column have to start and end with \n (I would like to annotate it literally), and I'm having trouble converting the dataframe to a suitable JSONL file.

Is there any code for converting a dataframe to JSONL that can serve as input to Prodigy?

I appreciate any help. All the best :slight_smile:

Hi Mitro,
a dataframe can easily be converted to a JSON file.
For example, if your dataframe is stored in a variable named data, all you need is the following:

import json

# Serialize the dataframe to a JSON string, then parse it back into Python objects
file = data.to_json()
json_file = json.loads(file)

If you want to save the transformed data, json.dumps(json_file, indent=4) gives you a formatted string that you can write to a file.
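
For the JSONL case specifically, here is a minimal sketch of one way to write one JSON object per line, with the text under a "text" key and the remaining columns collected under "meta". The function name, the example rows, the "source" column and the output filename are all made up for illustration. With pandas, the rows could come from data.to_dict(orient="records"); that step is left out so the snippet runs on its own.

```python
import json

def records_to_jsonl(records, path, text_key="text"):
    """Write one JSON object per line: {"text": ..., "meta": {...}}."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # Everything except the text column goes into "meta".
            meta = {k: v for k, v in record.items() if k != text_key}
            task = {"text": record[text_key], "meta": meta}
            # json.dumps escapes a real newline as \n, so each task stays on one line.
            f.write(json.dumps(task, ensure_ascii=False) + "\n")

# Hypothetical rows; with pandas these could come from data.to_dict(orient="records").
records = [
    {"text": "\nsome text\n", "source": "doc1"},
    {"text": "\nmore text\n", "source": "doc2"},
]
records_to_jsonl(records, "annotation_input.jsonl")
```

Values that the json module cannot serialize (timestamps, for example) would need converting to strings first.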

Hope the response is useful

Thanks for your response. Actually, this is what I was trying out. The input in my dataframe's text column is: "\nsome text\n"

I would like to annotate the "\n" literally in Prodigy.

Has anyone faced a similar problem and could help me?

I have many "\n" in my short texts and would like to load the "\n" literally in Prodigy.

By "literally", do you mean that you want to have the line break included in the data, or do you want to be able to literally read the text "\n" on the screen?

In the manual annotation interfaces, Prodigy will render newline characters with a little icon and a line break to make sure they're visible and you know they're in the data, but don't annotate them by accident (see here for an example). In the non-manual interfaces, they're rendered as regular line breaks by default.

If you actually want to see \n on the screen, you'd have to escape the \ in your JSON, e.g. \\n. This way, it's not interpreted as a newline.
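
To make the difference concrete, here is a small sketch using Python's json module. The variable names are just for illustration. It compares a string that contains real line breaks with one that contains a backslash followed by the letter n, and shows how each ends up in the JSON.

```python
import json

real_newline = "\nsome text\n"      # actual line-break characters
literal_text = "\\nsome text\\n"    # a backslash followed by the letter n

# A real newline is written to JSON as \n and is read back as a line break.
print(json.dumps({"text": real_newline}))   # {"text": "\nsome text\n"}

# A literal backslash is written as \\ and is read back as the visible text \n.
print(json.dumps({"text": literal_text}))   # {"text": "\\nsome text\\n"}

# Turning real line breaks into visible \n text before exporting:
converted = real_newline.replace("\n", "\\n")
print(converted == literal_text)            # True
```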

Hi Ines,

thanks for your response. Yes, the arrow is what I actually see. So if I annotate my data with the arrow (line break) included, will the NER model make correct predictions for entities containing \n? The \n is a valuable feature that my model needs to learn.

Yes, exactly, that's the idea of the arrow symbol! Without it, the newlines can easily be overlooked (even though they can potentially matter a lot to the model).

By default, Prodigy automatically disables newline and whitespace tokens and makes them unselectable. For most use cases, this is reasonable, because you typically don't want newlines or tabs inside your entity spans. But you can change the behaviour by setting "allow_newline_highlight": true in your prodigy.json.
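
If you'd rather set this from a script than edit the file by hand, something along these lines should work. It is only a sketch: the helper name is made up, and it assumes the prodigy.json you want to change sits in the current working directory, so adjust the path to wherever your config actually lives.

```python
import json
from pathlib import Path

def enable_newline_highlight(config_path):
    """Set "allow_newline_highlight": true, keeping any other settings in the file."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["allow_newline_highlight"] = True
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config

# Assumed location: a prodigy.json in the current working directory.
enable_newline_highlight("prodigy.json")
```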

Oh, I understand your question now. Anyway, you got a response from an expert :slight_smile: