Jsonl large text decode error

ezeguins · January 26, 2022, 8:30pm

Hi!
I'm trying to train on a large set of scanned legal documents, all of them contained in a JSONL file like this:
{"text": ....."}
{"text": ....."}
...
The scanned documents certainly contain JSON control characters and more, and Prodigy is returning this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 106: invalid continuation byte
Is there a way, like a JSON "strict=False" parameter, to get rid of the error?

I'm using the ner.manual recipe.

ljvmiranda921 · January 27, 2022, 12:55am

Hi @ezeguins, if your documents have JSON control characters, you might want to escape them properly so that Prodigy can read them. Another option you might want to try is passing a text file (assuming that you only have texts in your dataset).
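
As a rough illustration of the escaping suggestion: if the JSONL file is written with Python's standard `json` module rather than by hand, quotes, backslashes and control characters are escaped automatically. This is only a minimal sketch; the helper name `to_jsonl_line` is made up for the example, and it assumes each record only needs a "text" key.

```python
import json


def to_jsonl_line(text: str) -> str:
    """Serialize one document as a single JSONL record (hypothetical helper)."""
    # json.dumps escapes quotes, backslashes and control characters below
    # U+0020, so an embedded newline cannot split the record across lines.
    # ensure_ascii=False keeps accented letters readable in the output file.
    return json.dumps({"text": text}, ensure_ascii=False)


# A string with quotes, a newline and a form feed, as OCR output might contain.
line = to_jsonl_line('He said "hi"\nnext line\x0c')
print(line)
```

Reading the line back with `json.loads` returns the original string unchanged, so nothing is lost by escaping.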

ines · January 27, 2022, 10:00am

Can you double-check that the encoding is set correctly on the file? You could also just load it in with Python directly and save it back out with the right encoding. If you actually have Unicode control characters in the text content, it might also be useful to do some pre-processing on the data to remove them, assuming you can apply the same preprocessing at runtime. Otherwise, you'll be annotating and training with invisible Unicode characters, which is not ideal.
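
A minimal sketch of that load-and-resave idea, using only the standard library. Everything here is an assumption rather than something Prodigy provides: the function names are made up, and the Latin-1 fallback is a guess based on the fact that byte 0xed is "í" in Latin-1 and Windows-1252, which would explain the error if the file is not really UTF-8.

```python
import unicodedata
from pathlib import Path


def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 (an assumed source encoding)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # but the result is only correct if the file really is Latin-1.
        return raw.decode("latin-1")


def strip_control(text: str) -> str:
    """Drop control/format/unassigned characters, keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


def resave_as_utf8(src: str, dst: str) -> None:
    """Read src as bytes, clean it, and write it back out as UTF-8."""
    cleaned = strip_control(decode_lenient(Path(src).read_bytes()))
    with open(dst, "w", encoding="utf-8", newline="\n") as f:
        f.write(cleaned)
```

The same `strip_control` function would need to run on any text passed to the trained model later, so that annotation and runtime see identical input.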

ezeguins · January 27, 2022, 1:46pm

Thank you, that's right, I would escape the control characters...
I can pass a text file, but I want to avoid passing each text by itself because there are many of them.

ezeguins · January 27, 2022, 1:51pm

Hi Ines,
The encoding is UTF-8, but I have a lot of missing characters due to OCR errors (the original documents are very old and were generated with a typewriter).
I'm considering loading the text files via sys.stdin, but I need to figure out first how to do it.
Thank you
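
For the many-text-files case, one possible approach is a small script that walks a folder and writes one JSON record per file to standard output. This is only a sketch under assumptions: the function name and the "*.txt" pattern are made up, and `errors="replace"` is a deliberate choice to substitute U+FFFD for undecodable bytes instead of raising.

```python
import json
import sys
from pathlib import Path
from typing import Optional, TextIO


def stream_texts_as_jsonl(folder: str, out: Optional[TextIO] = None) -> int:
    """Write each .txt file in folder as one {"text": ...} line; return the count."""
    out = out if out is not None else sys.stdout
    count = 0
    for path in sorted(Path(folder).glob("*.txt")):
        # Undecodable bytes become U+FFFD rather than raising UnicodeDecodeError.
        text = path.read_text(encoding="utf-8", errors="replace")
        # json.dumps escapes control characters, so each record stays on one line.
        out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
        count += 1
    return count
```

Calling `stream_texts_as_jsonl("my_folder")` from a script prints the records, which can be redirected to a file or piped into another program. Whether and how Prodigy reads its source from standard input should be checked against its loader documentation.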
