Prodigy unable to read a Greek character with an accent above it

Hi,

(Disclaimer: I'm completely new to using Prodigy, so my issue may be due to my own “newbie-ness”).

I’m trying to read in a JSONL file from the CLI. I use the following command:

python -m prodigy ner.manual my_dataset blank:xx greektest2.jsonl --label LABEL

greektest2.jsonl is my jsonl file. I receive the following error message:

✘ 'charmap' codec can't decode byte 0x90 in position 12: character maps to <undefined>

It turns out that wherever the character “ἐ” appears, the codec is unable to process it.

If anyone could help out, it would be greatly appreciated. The letter is an epsilon with a breathing mark above it. It is a character from Biblical Greek.

All other characters seem to be read in properly.

Blessings,

Howard

Welcome to the forum @phoenix! :waving_hand:

The error you're seeing likely occurs because your system's default text encoding isn't set up to handle Unicode characters like "ἐ" (epsilon with psili). This character requires UTF-8 encoding to display properly, but it seems that your system is defaulting to a different encoding (likely Windows-1252) when reading files.

Could you try setting encoding before running the Prodigy command:

On Windows:

set PYTHONIOENCODING=utf-8
python -m prodigy ner.manual my_dataset blank:xx greektest2.jsonl --label LABEL

On Mac/Linux:

export PYTHONIOENCODING=utf-8
python -m prodigy ner.manual my_dataset blank:xx greektest2.jsonl --label LABEL

If that doesn't work, please also verify that your `greektest2.jsonl` file is saved with UTF-8 encoding. You can check this in most text editors (like Sublime, VS Code, etc.) - it should show "UTF-8" in the status bar or encoding menu.

Since you're working with Biblical Greek, UTF-8 is really the only encoding that will properly handle all the breathing marks, accents, and other diacritics you'll encounter in your texts. This is just a system configuration issue that affects many applications when working with Unicode text.

Let me know if this resolves the issue or if you need any clarification!

Wow, that was a fast response. Thank you!

I tried set PYTHONIOENCODING=utf-8 python -m prodigy ner.manual my_dataset blank:xx greektest2.jsonl --label LABEL

but I get an error message complaining that the -m argument is not correct. The error message is below:

Set-Variable : A parameter cannot be found that matches parameter name 'm'.
At line:1 char:35
set PYTHONIOENCODING=utf-8 python -m prodigy ner.manual my_dataset bl ...
CategoryInfo          : InvalidArgument : (:) [Set-Variable], ParameterBindingException
FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.SetVariableCommand

Some other context:

It turns out there are other characters where I get the same error, e.g. ρ is not recognized by the system. I saved the original text in Notepad with UTF-8 encoding, then used a Python script to convert it to JSONL. The Python script explicitly specified UTF-8 encoding when opening the text file and when writing the JSONL file.
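For readers following along, the conversion step described above might look roughly like the sketch below. It is an assumption about what such a script does, not the actual script used in this thread: the function name `text_to_jsonl` and the file names are hypothetical. It writes one JSON object per non-empty line and names the encoding explicitly on both ends. The optional `ascii_only` switch relies on `json.dumps(..., ensure_ascii=True)`, which writes non-ASCII characters as `\uXXXX` escapes, so the resulting file contains only ASCII bytes and decodes the same way under cp1252 and UTF-8.

```python
import json
from pathlib import Path


def text_to_jsonl(src, dst, ascii_only=False):
    """Write each non-empty line of `src` as a {"text": ...} record in `dst`.

    Both files are handled with an explicit encoding, so the result does not
    depend on the platform default (cp1252 on many Windows setups).
    """
    # "utf-8-sig" reads plain UTF-8 and also drops a leading BOM if present.
    lines = Path(src).read_text(encoding="utf-8-sig").splitlines()
    with open(dst, "w", encoding="utf-8", newline="\n") as out:
        for line in lines:
            line = line.strip()
            if line:
                # ensure_ascii=True escapes non-ASCII characters as \uXXXX.
                record = json.dumps({"text": line}, ensure_ascii=ascii_only)
                out.write(record + "\n")
    return dst
```

A call such as `text_to_jsonl("greek.txt", "greektest2.jsonl", ascii_only=True)` would sidestep the decoding problem at the cost of a less readable file; with the default `ascii_only=False` the Greek text is stored as raw UTF-8 bytes.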

To debug, I used a single word such as δῶρον or Ἐὰν, which caused the error to occur. I copied and pasted it from a Notepad document and saved it with UTF-8 encoding. I still get the error messages. Any help would be greatly appreciated.

Thanks in advance for your help!

I also neglected to mention that I’m using PyCharm and I’ve set the encoding in the environment variables, as well as explicitly setting the encoding to UTF-8 when reading and writing the files.

Hi @phoenix,

It looks, then, like the JSONL file is indeed in UTF-8. That leaves the system encoding as the likely problem.
Let's try to make this command work. It looks like you're using PowerShell (not the Windows Command Prompt), where `set` is an alias for Set-Variable.
First, let's check the current value of PYTHONIOENCODING.
Could you try this in your PowerShell:

$env:PYTHONIOENCODING

If it returns anything other than utf-8, then let's try setting it:

$env:PYTHONIOENCODING="utf-8"

If after this the encoding is correct, and your input file is in utf-8, you should be able to run your command in the same shell session:

python -m prodigy ner.manual my_dataset blank:xx greektest2.jsonl --label LABEL

If that works for you and you want to make the change permanent, then you could type in PowerShell the following command:

[System.Environment]::SetEnvironmentVariable("PYTHONIOENCODING", "utf-8", "User")

Magda,

Thanks, but I’d already tried that. The PYTHONIOENCODING variable is already UTF-8 in PowerShell. So far I’ve found that the following 4 characters are undecodable:

ὐ ρ ύ ὁ

As a simple test, I put the following into a JSONL file.

{"text": "ἱερεὺς"}

Position 16 (which is the rho character) causes the error.

Any other ideas?

Blessings,
Howard

Thanks, @phoenix.

Let's double-check that everything is set to use UTF-8. Sometimes it appears to be the case, but there might be artifacts in the files from copying and pasting, and there might also be locale settings on the system that override it.

So, assuming your test file (greektest2.jsonl in the script below) contains only

{"text": "ἱερεὺς"}

could you run the following script and share the output?

# check_encoding.py
with open('greektest2.jsonl', 'rb') as f:
    content = f.read()
    print("Raw bytes:", content)
    print("Hex representation:", content.hex())

And then for the system settings:

# locale_check.py
import locale
import sys

print("System default locale:", locale.getdefaultlocale())
print("Current locale:", locale.getlocale())
print("Preferred encoding:", locale.getpreferredencoding())
print("System encoding:", sys.getdefaultencoding())
print("File system encoding:", sys.getfilesystemencoding())

Thanks!

Hi Magda,

Here are the outputs for both:

"D:\Python Projects.venv\Scripts\python.exe" "D:\Python Projects\debuggrecy.py"
Raw bytes: b'{"text": "\xe1\xbc\xb1\xce\xb5\xcf\x81\xce\xb5\xe1\xbd\xba\xcf\x82"}\r\n'
Hex representation: 7b2274657874223a2022e1bcb1ceb5cf81ceb5e1bdbacf82227d0d0a

Process finished with exit code 0

"D:\Python Projects.venv\Scripts\python.exe" "D:\Python Projects\debuggrecy2.py"
System default locale: ('en_US', 'cp1252')
Current locale: ('English_United States', '1252')
Preferred encoding: cp1252
System encoding: utf-8
File system encoding: utf-8
D:\Python Projects\debuggrecy2.py:4: DeprecationWarning: 'locale.getdefaultlocale' is deprecated and slated for removal in Python 3.15. Use setlocale(), getencoding() and getlocale() instead.
print("System default locale:", locale.getdefaultlocale())

Process finished with exit code 0

Thanks in advance again!

Howard

@magdaaniol I tried the following and Prodigy now seems to be working:

python -X utf8 -m prodigy ner.manual my_dataset grc_proiel_sm greektest4.jsonl --label LABEL

I had tried python -X utf8 -m earlier, but it didn’t work, so I kind of threw the kitchen sink at the problem. The combination of explicitly setting all my environment variables in PyCharm, setting the locale with locale.setlocale(locale.LC_ALL, ('en_US', 'UTF-8')), and running the command with python -X utf8 -m got the JSONL file to be read properly. Unfortunately, I do not know which of these in particular caused the fix, or whether all of them are required.

Hi @phoenix,

Thanks for the debugging output. This confirms that 1. the file is encoded correctly as UTF-8 (no weird artifacts) and that 2. the system's preferred encoding is still cp1252 (Windows-1252).
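Both points can be reproduced from the hex dump posted above. The following is a small sketch using only the standard library; the variable names are illustrative.

```python
# Hex dump copied from the check_encoding.py output earlier in the thread.
reported_hex = "7b2274657874223a2022e1bcb1ceb5cf81ceb5e1bdbacf82227d0d0a"
raw = bytes.fromhex(reported_hex)

# 1. The bytes are valid UTF-8 and decode to the expected JSON record.
as_utf8 = raw.decode("utf-8")

# 2. The same bytes cannot be decoded as cp1252.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

print(repr(as_utf8))
print(cp1252_error)
```

The cp1252 attempt fails at byte offset 16, which holds 0x81, the second byte of the UTF-8 encoding of ρ. That matches the position reported for the one-line test file.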

Even though Python's internal encoding is UTF-8, when Python reads files without explicit encoding specified, it defaults to the system's preferred encoding, which, as we could see, is cp1252 (Windows-1252). Prodigy is inheriting Python's default file reading behavior, which uses cp1252 instead of UTF-8, causing the Greek characters to be misinterpreted.
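This would also explain why only some letters triggered the error. cp1252 leaves five byte values unassigned (0x81, 0x8D, 0x8F, 0x90, and 0x9D), so a UTF-8 file read as cp1252 only raises an error when a character's encoding contains one of those bytes; other characters decode without complaint into the wrong text. A minimal sketch, using a hypothetical helper named `breaks_cp1252`:

```python
def breaks_cp1252(char):
    """Return True if the UTF-8 bytes of `char` cannot be decoded as cp1252."""
    try:
        char.encode("utf-8").decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False


# Code points are written as escapes so the example does not depend on how
# this script itself is saved.
samples = {
    "epsilon with psili": "\u1f10",  # UTF-8: e1 bc 90
    "rho": "\u03c1",                 # UTF-8: cf 81
    "omicron with dasia": "\u1f41",  # UTF-8: e1 bd 81
    "alpha": "\u03b1",               # UTF-8: ce b1
}

for name, char in samples.items():
    status = "raises" if breaks_cp1252(char) else "decodes (as wrong text)"
    print(f"{name}: {status}")
```

The first three entries raise because their encodings contain 0x90 or 0x81, while alpha decodes without an error. This is consistent with the observation that most characters seemed to be read in properly.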

So you need to force Python to use UTF-8 as the default encoding, and that can be done with the -X utf8 flag you used:

python -X utf8 -m prodigy ner.manual my_dataset blank:xx greektest2.jsonl --label LABEL

Now, the reason it didn't work for you before could be that PyCharm's terminal has its own environment settings that override or interfere with command-line flags.
In other words, (I think) -X utf8 can only work in PyCharm if PyCharm's Python interpreter is set to use UTF-8 encoding.
I think -X utf8 should work in PowerShell on its own. For PyCharm you probably need both PYTHONIOENCODING and -X utf8. The locale setting might be redundant. (I'm sorry, I don't have a Windows machine available to test it right now.)
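To see which of these settings actually takes effect in a given terminal or IDE, it may help to ask Python directly. The sketch below relies on two documented behaviours: `-X utf8` (or the environment variable `PYTHONUTF8=1`) enables Python's UTF-8 mode, which changes the default encoding used by `open()`, whereas `PYTHONIOENCODING` only affects stdin, stdout, and stderr. The function names are hypothetical, and the script has not been run on the setup discussed in this thread.

```python
import locale
import subprocess
import sys


def encoding_report():
    """Collect the settings that decide how this interpreter decodes text."""
    return {
        # True when Python was started with -X utf8 or PYTHONUTF8=1.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Default used by open() when no encoding argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Controlled by PYTHONIOENCODING; independent of file reading.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


def preferred_encoding_in_utf8_mode():
    """Start a child interpreter with -X utf8 and ask it the same question."""
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
    print("with -X utf8:", preferred_encoding_in_utf8_mode())
```

If `preferred_encoding` still shows cp1252 in the environment where Prodigy is launched, files opened without an explicit encoding will continue to be decoded as cp1252 there, regardless of what `PYTHONIOENCODING` is set to.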