ner.manual and terms.to_patterns save to utf-16

Hello!
I'm working with the German language.
I'm using ner.manual to label text, and then I run db.out and terms.to_patterns. Both files convert German umlauts to UTF-16 symbols. I've tried re-saving the files as UTF-8, both as CSV and as JSONL, but that doesn't work. I also tried changing the locale (on Mac), which doesn't work either.
I work on Windows, and both files are saved with UTF-16 encoding. Why?
Is there any option or keyword in the commands to save those files as UTF-8?

Example:
{"text":"Fl\u00fcssen"}

This is UTF-8 with the non-ASCII characters escaped as ASCII, which is the default behavior of srsly.write_json. srsly is the library that Prodigy and spaCy use for a lot of serialization tasks, and there's no option to disable the escaping within srsly.
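As a quick illustration that the escaped form carries the same data, here is a minimal sketch using only Python's standard json module (not srsly itself). The sample line is the one from the question; the variable names are just for illustration.

```python
import json

# The line from the question, exactly as it appears in the exported file.
# The raw string keeps the backslash sequence literal, as it is on disk.
escaped_line = r'{"text":"Fl\u00fcssen"}'

# Parsing turns the \u00fc escape back into the real character.
record = json.loads(escaped_line)

# The json module escapes non-ASCII characters by default (ensure_ascii=True).
default_dump = json.dumps(record)

# With ensure_ascii=False the umlaut is written as-is.
readable_dump = json.dumps(record, ensure_ascii=False)

# Both serializations parse back to the same data.
same_data = json.loads(default_dump) == json.loads(readable_dump)
```

So the escape sequence is only a different way of writing the same character inside the JSON text, not a different file encoding.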

If you read the data in with the json module in a Python script, you can save it back again without escaping by passing ensure_ascii=False, which makes it easier to read and edit by hand. The file should work fine in Prodigy with or without escaping. You can check that the data is the same by loading it with srsly.read_json("file.json").
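The re-save described above could look roughly like this. It is a sketch that assumes the export is a JSONL file (one JSON object per line); the function name and the file names in the usage comment are hypothetical.

```python
import json


def rewrite_jsonl_unescaped(src_path, dst_path):
    """Copy a JSONL file, writing non-ASCII characters unescaped.

    Returns the number of records written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # ensure_ascii=False keeps characters such as ü readable.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical usage:
# rewrite_jsonl_unescaped("annotations.jsonl", "annotations_readable.jsonl")
```

Opening both files with an explicit encoding="utf-8" matters on Windows, where Python's default text encoding is often not UTF-8.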

Thank you for the answer!

Also see this comment for more details:

Thank you!