In my db-out JSONL file I’m getting \uXXXX escape sequences instead of the original characters. It’s kind of a pain because I need the original Unicode text to look back up against other IDs. I know this is kind of a basic Unicode question, but I started using Python 3 precisely so I wouldn’t have to figure this out!
"id":",priceRange[morethan\u00a330] near Caf\u00e9 Brazil you can
Update: I tried a couple of things. One was opening the text file in Python using with open([file_name], encoding='utf-16') as in_file, but that gave errors.
What did work, though it’s more of a pragmatic than a programmatic solution, is this website: https://www.branah.com/unicode-converter. Pasting the above text into the UTF-16 box outputs the original id in the top Unicode box. It will do the trick for now.
I think that’s just the way the data is represented in JSON. If you do something like this:
import json

with open(path, encoding="utf8") as in_file:
    for line in in_file:
        record = json.loads(line)
        print(record["text"])
You should get the decoded, human-readable version of the text.
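To show that those \uXXXX sequences are just JSON’s standard escape notation rather than a broken encoding, here’s a minimal sketch using a string modeled on the sample above (the key name "text" is illustrative):

```python
import json

# JSON may escape non-ASCII characters as \uXXXX; json.loads decodes them back.
raw = r'{"text": "priceRange[morethan\u00a330] near Caf\u00e9 Brazil"}'
record = json.loads(raw)
print(record["text"])  # \u00a3 becomes £ and \u00e9 becomes é
```

No special handling is needed: any compliant JSON parser performs this decoding automatically.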
If you want to stay on the command line and browse the data, you have a few options. Prodigy has three pretty-print utilities that are probably helpful: ner.print-dataset, ner.print-stream and textcat.print-stream. You can also use the jq utility to extract the text key from each record.
Finally, sometimes I find just making tiny Python scripts that manipulate inputs and print outputs is easiest. I know Python much better than utilities like jq, so sometimes it’s just the quickest way for me.
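For instance, a tiny script along these lines (the file path and the "text" key are placeholders for whatever your data uses) prints the decoded text of every record in a JSONL file:

```python
import json
import sys

def print_texts(path):
    """Read a JSONL file (one JSON object per line) and print each decoded "text" field."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                print(json.loads(line)["text"])

if __name__ == "__main__" and len(sys.argv) > 1:
    print_texts(sys.argv[1])
```

Run it as `python print_texts.py data.jsonl`; since `json.loads` decodes the \uXXXX escapes, the output is the human-readable text.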