greektest2.jsonl is my jsonl file. I receive the following error message:
✘ 'charmap' codec can't decode byte 0x90 in position 12: character maps to <undefined>
It turns out that wherever the character “ἐ” appears, the codec is unable to process it.
If anyone could help out, it would be greatly appreciated. The letter is an epsilon with a breathing mark above it. It is a character from Biblical Greek.
The error you're seeing likely occurs because your system's default text encoding isn't set up to handle Unicode characters like "ἐ" (epsilon with psili). This character requires UTF-8 encoding to display properly, but it seems that your system is defaulting to a different encoding (likely Windows-1252) when reading files.
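To make the mismatch concrete, here is a minimal sketch that reproduces the message outside Prodigy. The JSON line is an assumed sample rather than your actual file, and `try_decode` is just a hypothetical helper for illustration:

```python
# Assumed sample line: one record containing U+1F10 (epsilon with psili).
line = '{"text": "\u1f10"}'
raw = line.encode("utf-8")  # the bytes as they would sit in a UTF-8 file


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


utf8_result = try_decode(raw, "utf-8")
cp1252_result = try_decode(raw, "cp1252")

# ascii() keeps the output printable even on a console that cannot show Greek.
print(ascii(utf8_result))
print(cp1252_result)
```

The character is stored as the three bytes `e1 bc 90`, and `0x90` has no mapping in Python's cp1252 table, so the second decode should fail with the same "byte 0x90 in position 12" message you posted.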
Could you try setting encoding before running the Prodigy command:
If that doesn't work, please also verify that your `greektest2.jsonl` file is saved with UTF-8 encoding. You can check this in most text editors (like Sublime, VS Code, etc.) - it should show "UTF-8" in the status bar or encoding menu.
Since you're working with Biblical Greek, UTF-8 is really the only encoding that will properly handle all the breathing marks, accents, and other diacritics you'll encounter in your texts. This is just a system configuration issue that affects many applications when working with Unicode text.
Let me know if this resolves the issue or if you need any clarification!
I tried set PYTHONIOENCODING=utf-8 python -m prodigy ner.manual my_dataset blank:xx greektest2.jsonl --label LABEL
but I get an error message complaining that the -m argument is not correct. The error message is below:
Set-Variable : A parameter cannot be found that matches parameter name 'm'.
At line:1 char:35
set PYTHONIOENCODING=utf-8 python -m prodigy ner.manual my_dataset bl ...
CategoryInfo : InvalidArgument : (:) [Set-Variable], ParameterBindingException
FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.SetVariableCommand
Some other context:
It turns out there are other characters where I get the same error, e.g. ρ is not recognized by the system. I saved the original text in Notepad with UTF-8 encoding, then used a Python script to convert it to JSONL. The Python script explicitly specified UTF-8 encoding when opening the text file and when writing the JSONL file.
To debug, I used a single word such as δῶρον or Ἐὰν, each of which caused the error to occur. I copied and pasted it from a Notepad document and saved it with UTF-8 encoding, but I still get the error messages. Any help would be greatly appreciated.
I also neglected to mention that I’m using PyCharm and I’ve set the encoding in the environment variables, as well as explicitly setting the encoding to UTF-8 when reading and writing the files.
It looks then that the JSONL file is indeed in utf-8. That leaves us with the system encoding being the problem.
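Let's try to make this command work. It looks like you're using PowerShell (not the regular Windows Command Prompt). There, `set` is an alias for Set-Variable, which is why the `set VAR=value` syntax failed on the -m argument.
First let's check the current value of PYTHONIOENCODING.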
Could you try in your PowerShell:
$env:PYTHONIOENCODING
If it returns anything other than utf-8, then let's try setting it:
$env:PYTHONIOENCODING="utf-8"
If after this the encoding is correct, and your input file is in utf-8, you should be able to run your command in the same shell session:
Thanks, but I’d already tried that. The PYTHONIOENCODING variable is already UTF-8 in PowerShell. So far I’ve found that the following 4 characters are undecodable:
ὐ ρ ύ ὁ
As a simple test, I put the following into a JSONL file.
{"text": "ἱερεὺς"}
Position 16 (which is the rho character) causes the error.
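As a rough sketch of why only some letters trip the codec: Python's cp1252 table leaves five byte values unmapped, and a character fails only if its UTF-8 encoding happens to contain one of them. The helper name `undecodable_chars` is hypothetical, and the code points are written as escapes on the assumption that the text uses precomposed characters (e.g. U+03CD for ύ):

```python
# Byte values with no mapping in Python's cp1252 codec.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def undecodable_chars(text: str) -> list:
    """Return (character, bad byte, byte offset) for every UTF-8 byte
    in `text` that cp1252 cannot decode."""
    hits = []
    offset = 0
    for ch in text:
        encoded = ch.encode("utf-8")
        for i, byte in enumerate(encoded):
            if byte in UNDEFINED_IN_CP1252:
                hits.append((ch, byte, offset + i))
        offset += len(encoded)
    return hits


# The test record from above, spelled with escapes: ἱερεὺς
sample = '{"text": "\u1f31\u03b5\u03c1\u03b5\u1f7a\u03c2"}'
sample_hits = undecodable_chars(sample)

for ch, byte, pos in sample_hits:
    # ascii() avoids printing Greek on a console that cannot encode it.
    print(f"{ascii(ch)}: byte 0x{byte:02x} at position {pos}")
```

Under those assumptions the only hit in the sample is the rho, whose second UTF-8 byte is `0x81` at byte offset 16. The other letters in the word encode to bytes that cp1252 does map, so they would be silently misread rather than rejected.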
Let's double check that everything is set to use utf-8, as sometimes it appears to be the case but there might be some artifacts in files due to copying/pasting, and there might also be some overriding system locale settings in place.
So assuming your test.jsonl contains only
{"text": "ἱερεὺς"}
could you run the following script and share the output?
# check_encoding.py
# Read the file in binary mode so no decoding happens at all.
with open('greektest2.jsonl', 'rb') as f:
    content = f.read()

print("Raw bytes:", content)
print("Hex representation:", content.hex())
"D:\Python Projects.venv\Scripts\python.exe" "D:\Python Projects\debuggrecy2.py"
System default locale: ('en_US', 'cp1252')
Current locale: ('English_United States', '1252')
Preferred encoding: cp1252
System encoding: utf-8
File system encoding: utf-8
D:\Python Projects\debuggrecy2.py:4: DeprecationWarning: 'locale.getdefaultlocale' is deprecated and slated for removal in Python 3.15. Use setlocale(), getencoding() and getlocale() instead.
print("System default locale:", locale.getdefaultlocale())
I tried python -X utf8 -m earlier, but it didn’t work. I kind of threw the kitchen sink at the problem. The combination of explicitly setting all my environment variables in PyCharm, plus setting the locale using locale.setlocale(locale.LC_ALL, ('en_US', 'UTF-8')) and the command python -X utf8 -m, got the JSONL file to be read properly. Unfortunately, I do not know which solution in particular caused the fix, or whether all of the above is required.
Thanks for the debugging output. This confirms that 1. the file is encoded correctly (no weird artifacts) and that 2. the system encoding is still Windows-1252.
Even though Python's internal encoding is UTF-8, when Python reads files without explicit encoding specified, it defaults to the system's preferred encoding, which, as we could see, is cp1252 (Windows-1252). Prodigy is inheriting Python's default file reading behavior, which uses cp1252 instead of UTF-8, causing the Greek characters to be misinterpreted.
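A small sketch of how one might inspect these defaults from Python without the deprecated `locale.getdefaultlocale()` call. The function names and the temporary file are hypothetical, and this only shows Python's general behavior, not how Prodigy opens files internally. As far as I know, PYTHONIOENCODING only affects stdin/stdout/stderr, which would explain why setting it did not change how the file was read:

```python
import locale
import os
import sys
import tempfile


def describe_defaults() -> dict:
    """Collect the settings that decide how text files are decoded by default."""
    return {
        # What open() falls back to when no encoding argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        # True when Python runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Governed by PYTHONIOENCODING; separate from file reading.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


def read_text_utf8(path: str) -> str:
    """Read a file as UTF-8 regardless of the system default."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Round-trip an assumed sample record through a temporary file.
record = '{"text": "\u1f31\u03b5\u03c1\u03b5\u1f7a\u03c2"}\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(record)
    round_tripped = read_text_utf8(path)

defaults = describe_defaults()
print(defaults)
print(ascii(round_tripped))
```

On a machine like yours I would expect `preferred_encoding` to report cp1252 with `utf8_mode` false, while the explicit `encoding="utf-8"` read succeeds either way.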
So you need to force Python to use UTF-8 as the default encoding, and that can be done with the -X utf8 flag you used:
Now the reason it didn't work for you before could be that PyCharm's terminal might have its own environment settings that override or interfere with command-line flags.
In other words, (I think) -X utf8 can only work in PyCharm if PyCharm's Python interpreter is set to use UTF-8 encoding.
I think -X should work in PowerShell on its own. For PyCharm you probably need PYTHONIOENCODING and -X. The locale setting might be redundant. (I'm sorry I don't have a Windows machine available to test it right now.)
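One way to check which setting actually matters is to ask a child interpreter what it sees with and without the flag. This is only a sketch: `probe` and `PROBE` are hypothetical names, and it exercises plain Python rather than Prodigy itself. Setting the environment variable PYTHONUTF8=1 should have the same effect as -X utf8:

```python
import subprocess
import sys

# One-liner run in the child: report the default text encoding and UTF-8 mode.
PROBE = (
    "import locale, sys; "
    "print(locale.getpreferredencoding(False), sys.flags.utf8_mode)"
)


def probe(extra_args: list) -> str:
    """Start a child interpreter with the given flags and return its report."""
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


default_report = probe([])
utf8_report = probe(["-X", "utf8"])

print("default :", default_report)
print("-X utf8 :", utf8_report)
```

If the second line reports a UTF-8 encoding with mode 1 while the first reports cp1252 with mode 0, then the flag alone is doing the work in that shell, and the locale call is likely redundant there. Running the same script from PyCharm's run configuration would show whether its environment behaves differently.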