I'm trying to train on a large set of scanned legal documents, all of them contained in a JSONL file like this:
The problem is that the scanned documents certainly contain JSON control characters and more, and Prodigy is returning this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 106: invalid continuation byte
Is there a way, such as a json "strict=False" parameter, to get rid of the error?
I'm using the ner.manual recipe.
Hi @ezeguins, if your documents contain JSON control characters, you might want to escape them properly so that Prodigy can read the file. Another option you might want to try is passing a plain text file (assuming that your dataset contains only text).
Can you double-check that the encoding is set correctly on the file? You could also just load it in with Python directly and save it back out with the right encoding. If you actually have unicode control characters in the text content, it might also be useful to do some pre-processing on the data to remove them, assuming you can apply the same preprocessing at runtime. Otherwise, you'll be annotating and training with invisible unicode characters, which is not ideal.
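For illustration, here is a minimal sketch of that load-and-resave step. It makes a few assumptions that are not confirmed in this thread: each record stores its text under a "text" key, bytes that fail to decode as UTF-8 are really Latin-1 (the 0xed byte in the error would be "í" in that encoding, but that is only a guess), and the helper names are made up for this example. It uses the standard library's json.loads(..., strict=False), which tolerates raw control characters inside JSON strings.

```python
import json
import unicodedata
from pathlib import Path


def decode_bytes(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode as UTF-8; on failure, fall back to an ASSUMED legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category Cc), keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


def clean_jsonl(src: Path, dst: Path) -> None:
    """Read a possibly mis-encoded JSONL file and write a clean UTF-8 copy."""
    with src.open("rb") as fin, dst.open("w", encoding="utf-8") as fout:
        for raw_line in fin:
            line = decode_bytes(raw_line).strip()
            if not line:
                continue  # skip blank lines
            # strict=False lets the parser accept raw control characters in strings
            record = json.loads(line, strict=False)
            # "text" is the assumed key holding the document text
            record["text"] = strip_control_chars(record["text"])
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling clean_jsonl(Path("raw.jsonl"), Path("clean.jsonl")) (both file names hypothetical) would produce a file that can be passed to the recipe instead of the original. The same strip_control_chars function would need to be applied to any text sent to the trained model at runtime.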
Thank you, that's right, I should escape the control characters...
I can pass a text file, but I want to avoid passing each text by itself because there are many of them.
The encoding is UTF-8, but I have a lot of missing characters due to OCR errors (the original documents are very old and were produced on a typewriter).
I'm considering loading the text files via sys.stdin, but I need to figure out first how to do it.
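One way this could look, as a rough sketch only: a small script that reads every .txt file in a folder and prints one JSON record per line to standard output, so that many texts can be fed in at once instead of one by one. The folder layout, the "meta" field and the function names are assumptions made for this example, not something confirmed in the thread.

```python
import json
import sys
from pathlib import Path


def iter_tasks(folder: Path):
    """Yield one record per .txt file in the folder (assumed layout)."""
    for path in sorted(folder.glob("*.txt")):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
        text = path.read_text(encoding="utf-8", errors="replace")
        yield {"text": text, "meta": {"source": path.name}}


def main(folder: str) -> None:
    for task in iter_tasks(Path(folder)):
        # json.dumps escapes non-ASCII by default, so the output is pipe-safe
        sys.stdout.write(json.dumps(task) + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

As far as I understand the Prodigy documentation, a recipe can read from standard input when the source argument is given as "-", so the output of this script could be piped into ner.manual that way. It is worth checking this against the documentation for the installed version.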