Hi all,
I have a dataframe with a text column, which I'd like to annotate, and other columns with metadata.
The strings in my text column have to start and end with \n (I would like to annotate it literally), and I'm having trouble converting the dataframe to a suitable JSONL file.
Is there any code for converting a dataframe to JSONL that can serve as input to Prodigy?
Hi Mitro
A dataframe can easily be converted into a JSON file.
For example:
If you saved your dataframe under the name data, all you need to do is the following steps.
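The steps themselves did not survive in this post, so here is a minimal sketch of one way to do it, assuming pandas and a dataframe called data. The column names "text" and "source", the helper names dataframe_to_tasks and write_jsonl, and the output file name are made up for illustration. It puts the text under "text" and the remaining columns under "meta", which is the layout Prodigy tasks usually use for metadata; adjust if your setup expects something else.

```python
import json

import pandas as pd

# Hypothetical example data: "text" and "source" are assumed column names.
data = pd.DataFrame({
    "text": ["\nFirst example\n", "\nSecond example\n"],
    "source": ["a.txt", "b.txt"],
})


def dataframe_to_tasks(df, text_column="text"):
    """Turn each row into a dict with the text under "text" and the rest under "meta"."""
    # Round-trip through to_json so that all values become plain JSON types.
    records = json.loads(df.to_json(orient="records", force_ascii=False))
    tasks = []
    for record in records:
        text = record.pop(text_column)
        tasks.append({"text": text, "meta": record})
    return tasks


def write_jsonl(tasks, path):
    """Write one JSON object per line; json.dumps escapes newlines inside the text as \\n."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")


tasks = dataframe_to_tasks(data)

if __name__ == "__main__":
    write_jsonl(tasks, "data.jsonl")  # hypothetical output file name
```

Because the newline inside each string is escaped by the JSON encoder, every task still sits on exactly one line of the file, and the leading and trailing \n come back as real line breaks when the file is read.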
By "literally", do you mean that you want to have the line break included in the data, or do you want to be able to literally read the text "\n" on the screen?
In the manual annotation interfaces, Prodigy will render newline characters with a little ↵ icon and a line break to make sure they're visible and you know they're in the data, but don't annotate them by accident (see here for an example). In the non-manual interfaces, they're rendered as regular line breaks by default.
If you actually want to see \n on the screen, you'd have to escape the \ in your JSON, e.g. \\n. This way, it's not interpreted as a newline.
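To make the difference concrete, here is a small sketch using only Python's standard json module. The variable names and example strings are just for illustration.

```python
import json

# A real newline character inside the string.
text_with_newline = "line one\nline two"
# A literal backslash followed by the letter n (no line break).
text_with_literal = "line one\\nline two"

encoded_newline = json.dumps({"text": text_with_newline})
encoded_literal = json.dumps({"text": text_with_literal})

print(encoded_newline)  # {"text": "line one\nline two"}
print(encoded_literal)  # {"text": "line one\\nline two"}
```

The first line is what a normal line break looks like in the JSONL file; the second is what you would need if the two characters \ and n should show up on the screen.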
Thanks for your response. Yes, the arrow is what I actually see. So if I annotate my data with the arrow (line break) included, will the NER model make correct predictions for entities containing \n? The \n is a valuable feature that my model needs to learn.
Yes, exactly, that's the idea of the arrow symbol! Without it, the newlines can easily be overlooked (even though they can potentially matter a lot to the model).
By default, Prodigy automatically disables newline and whitespace tokens and makes them unselectable. For most use cases, this is reasonable, because you typically don't want newlines or tabs inside your entity spans. But you can change the behaviour by setting "allow_newline_highlight": true in your prodigy.json.
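If you'd rather not edit the file by hand, a rough sketch like the following adds the setting to the contents of an existing prodigy.json without touching the other entries. The helper name with_newline_highlight and the "theme" entry in the example are only placeholders, and reading and writing the actual file is left out because its location depends on your setup.

```python
import json


def with_newline_highlight(config_text):
    """Return the config as JSON text with allow_newline_highlight set to true."""
    # Treat an empty file as an empty config.
    config = json.loads(config_text) if config_text.strip() else {}
    config["allow_newline_highlight"] = True
    return json.dumps(config, indent=2)


# Hypothetical existing contents of prodigy.json.
existing = '{"theme": "basic"}'
updated = with_newline_highlight(existing)
print(updated)
```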