In my db-out JSONL file I'm getting escaped Unicode hex values (the `\uXXXX` style escapes). It's kind of a pain because I need the original Unicode text to look it up against other IDs. I know this is kind of a basic Unicode question, but I started using Python 3 precisely so I wouldn't have to figure this out!
near Caf\u00e9 Brazil you can
Update: I tried a couple of things. One was opening the text file in Python using
with open([file_name], encoding='utf-16') as in_file, but that gave errors.
What did work, though it's more of a pragmatic than a programmatic solution, is this website: https://www.branah.com/unicode-converter. Pasting the above text into the UTF-16 box outputs the original ID in the top Unicode box. It will do the trick for now.
I think that’s just the way the data is represented in JSON. If you do something like this:
import json

for line in open(path, encoding="utf8"):
    record = json.loads(line)
You should get the decoded, human-readable version of the text.
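For example, a minimal sketch using the "Café" text from above (the record shape and the "text" key name are assumptions about what your JSONL looks like):

```python
import json

# A JSONL line as it might appear in the db-out file, with an escaped
# Unicode character (\u00e9 is "é" as a JSON string escape).
line = '{"text": "near Caf\\u00e9 Brazil you can"}'

# json.loads decodes the escape back into the actual character.
record = json.loads(line)
print(record["text"])  # → near Café Brazil you can
```

The `\uXXXX` sequences aren't an encoding problem with the file itself; they're just how JSON is allowed to represent non-ASCII characters, and any JSON parser will decode them for you.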
If you want to stay on the command line and browse the data, then you have a few options. Prodigy has three pretty-print utilities that are probably helpful, such as
textcat.print-stream. You can also use the
jq utility to get the text key of each record.
Finally, sometimes I find just making tiny Python scripts that manipulate inputs and print outputs is easiest. I know Python much better than utilities like jq, so sometimes it’s just the quickest way for me.
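A tiny script along those lines might look like this (a sketch only; the "text" key is an assumption about your record shape):

```python
import json
import sys

def iter_texts(lines):
    """Yield the decoded "text" field of each JSONL record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield record.get("text", "")

if __name__ == "__main__":
    # Usage: python print_texts.py < annotations.jsonl
    for text in iter_texts(sys.stdin):
        print(text)
```

This prints the human-readable text of each record, with all the `\uXXXX` escapes already decoded by `json.loads`.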
Brilliant, thanks very much!
encoding='utf8' worked where my own attempts to write a Python utility that opened the file with different encodings did not. Thanks for the help!