Encoding of umlauts

I'm having some problems with the display of umlauts in Prodigy (see below). I'm pretty sure everything is encoded in UTF-8. I tried different browsers, but that didn't help. Can I set the encoding manually somewhere?
Thank you in advance.

Hi! Could you share an example of the underlying data? It does look like there might be an encoding issue with the file, so maybe double-check that you've definitely set the encoding correctly?

Prodigy will load the input file as UTF-8 and will then stream the records in as-is. You can double-check that the encoding is set correctly by inspecting the loaded file in Python. Prodigy does the equivalent of this:

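import srsly

# read_jsonl returns a generator, so turn it into a list before indexing
data = list(srsly.read_jsonl("/path/to/data.jsonl"))  # for JSON, use srsly.read_json(path) directly
print(data[0])  # etc.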
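
If the printed text looks fine in your terminal but you want to be sure, it can help to look at the actual code points. Here is a rough sketch using only the standard library. The helper name `non_ascii_report` and the sample strings are made up for illustration and are not part of Prodigy or srsly. A correctly encoded "ä" is a single code point (U+00E4), while the typical broken form shows up as two (U+00C3 followed by U+00A4).

```python
import unicodedata


def non_ascii_report(text):
    """List the code point and Unicode name of every non-ASCII character."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in text
        if ord(ch) > 127
    ]


# Hypothetical sample values, written with escapes so the code points are explicit.
good = "K\u00e4se"        # "ä" stored as one code point
bad = "K\u00c3\u00a4se"   # the same word after a wrong decode step

print(non_ascii_report(good))
print(non_ascii_report(bad))
```

You could run the same helper over `data[0]["text"]`, assuming your records keep the text under a "text" key.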

You are of course correct, there were encoding issues with the underlying data. I still haven't found the source of the issues, but I was able to fix them with ftfy. Now everything is displayed nicely.

Thank you for your help.
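
For anyone landing here with the same problem, below is a minimal sketch of the kind of repair involved. It assumes the most common cause, UTF-8 bytes that were decoded as Windows-1252 somewhere along the way. This is not how ftfy works internally, and the function name `fix_mojibake` is hypothetical. ftfy's `fix_text` covers many more cases and is the safer choice for real data.

```python
def fix_mojibake(text):
    """Undo one round of 'UTF-8 bytes read as Windows-1252', if that is what happened.

    Text that does not fit this pattern is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252, or the bytes are not
        # valid UTF-8, so it was most likely fine to begin with.
        return text


broken = "K\u00c3\u00a4se"  # hypothetical sample: "ä" shown as two characters
print(fix_mojibake(broken) == "K\u00e4se")          # repaired to a single "ä"
print(fix_mojibake("K\u00e4se") == "K\u00e4se")      # already correct, left alone
```

The round trip only succeeds when the characters map back to valid UTF-8 bytes, which is why correctly encoded text normally passes through untouched.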