How to get db-out to use unicode symbols

database
solved

#1

Hey,

In my db-out JSONL file I'm getting escaped Unicode hex values (\uXXXX sequences) instead of the actual characters. It's kind of a pain because I need the original Unicode text to look back up against other ids. I know this is kind of a basic Unicode question, but I started using Python 3 precisely to avoid having to figure this out!

"id":",priceRange[morethan\u00a330]
near Caf\u00e9 Brazil you can

Update: I tried a couple of things. One was opening the text file in Python using with open([file_name], encoding='utf-16') as in_file, but that raised errors.

What did work, though it's more of a pragmatic than a programmatic solution, is this website: https://www.branah.com/unicode-converter. Pasting the text above into the UTF-16 box outputs the original id in the top Unicode box. It will do the trick for now.
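(For the record, those \uXXXX sequences are just JSON's standard string escapes, so plain Python can decode them without any special encoding handling. A minimal sketch, using one of the strings above:)

import json

# \u00e9 is JSON's escaped form of "é"; json.loads decodes it
s = json.loads('"near Caf\\u00e9 Brazil you can"')
print(s)  # near Café Brazil you can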



(Matthew Honnibal) #2

I think that’s just the way the data is represented in JSON. If you do something like this:


import json

for line in open(path, encoding="utf8"):
    record = json.loads(line)
    print(record["text"])

You should get the decoded, human-readable version of the text.

If you want to stay on the command line and browse the data, then you have a few options. Prodigy has three pretty-print utilities that are probably helpful: ner.print-dataset, ner.print-stream and textcat.print-stream. You can also use the jq utility to get the text key of each record.

Finally, sometimes I find just making tiny Python scripts that manipulate inputs and print outputs is easiest. I know Python much better than utilities like jq, so sometimes it’s just the quickest way for me.
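A sketch of that kind of tiny script (the function name and paths are just for illustration): it rewrites the exported file so the \uXXXX escapes become real characters, since json.dumps with ensure_ascii=False writes Unicode characters as-is.

import json

def decode_jsonl(in_path, out_path):
    """Rewrite a JSONL file so \\uXXXX escapes become real characters."""
    with open(in_path, encoding="utf8") as in_file, \
         open(out_path, "w", encoding="utf8") as out_file:
        for line in in_file:
            record = json.loads(line)
            # ensure_ascii=False writes "é" instead of \u00e9
            out_file.write(json.dumps(record, ensure_ascii=False) + "\n")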


#3

Brilliant, thanks very much!

Yes, encoding='utf8' worked where my own attempts at a Python utility that opened the file with different encodings did not. Thanks for the help!