How to get db-out to use unicode symbols

database
solved

#1

Hey,

In my db-out JSONL file I'm getting escaped Unicode hex values (\uXXXX sequences) instead of the actual characters. It's kind of a pain because I need the original Unicode text to look back up against other ids. I know this is kind of a basic Unicode question, but I started using Python 3 precisely to avoid having to figure this out!

"id":",priceRange[morethan\u00a330]
near Caf\u00e9 Brazil you can

Update: I tried a couple of things. One was opening the text file in Python using with open([file_name], encoding='utf-16') as in_file, but that raised errors.

What did work, though it's more of a pragmatic than a programmatic solution, is this website: https://www.branah.com/unicode-converter. Pasting the text above into the UTF-16 box outputs the original id in the top Unicode box. It will do the trick for now.
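(For the record, those \uXXXX sequences are just JSON's standard string escapes, so plain Python can decode them without any special encoding handling. A minimal sketch, using one of the strings above:)

import json

# \u00e9 is JSON's escaped form of "é"; json.loads decodes it
s = json.loads('"near Caf\\u00e9 Brazil you can"')
print(s)  # near Café Brazil you can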



(Matthew Honnibal) #2

I think that’s just the way the data is represented in JSON. If you do something like this:


import json

for line in open(path, encoding="utf8"):
    record = json.loads(line)
    print(record["text"])

You should get the decoded, human-readable version of the text.

If you want to stay on the command line and browse the data, then you have a few options. Prodigy has three pretty-print utilities that are probably helpful: ner.print-dataset, ner.print-stream and textcat.print-stream. You can also use the jq utility to get the text key of each record.

Finally, sometimes I find just making tiny Python scripts that manipulate inputs and print outputs is easiest. I know Python much better than utilities like jq, so sometimes it’s just the quickest way for me.
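A sketch of that kind of tiny script (the function name and paths are just for illustration): it rewrites the exported file so the \uXXXX escapes become real characters, since json.dumps with ensure_ascii=False writes Unicode characters as-is.

import json

def decode_jsonl(in_path, out_path):
    """Rewrite a JSONL file so \\uXXXX escapes become real characters."""
    with open(in_path, encoding="utf8") as in_file, \
         open(out_path, "w", encoding="utf8") as out_file:
        for line in in_file:
            record = json.loads(line)
            # ensure_ascii=False writes "é" instead of \u00e9
            out_file.write(json.dumps(record, ensure_ascii=False) + "\n")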


#3

Brilliant, thanks very much!

Yes, encoding='utf8' worked where my own attempts at a Python utility that opened the file with different encodings did not. Thanks for the help!