How to see Unicode Vietnamese in human-readable?

donhuvy · April 16, 2022, 1:36pm

This is my file Microsoft OneDrive - Access files anywhere. Create docs with free Office Online. , I used Notepad++ to see. I want to see Unicode in Vietnamese human-readable for editing few item. Please help me.

koaning · April 19, 2022, 7:29am

I could be mistaken, but I believe you're experiencing an issue related to Notepad or perhaps encoding standards in Windows. If you were to load these in Python you should be able to resemble them properly.

For example. If I take this string.

"bơ đậu phộng"

Then I can choose to encode it via;

"bơ đậu phộng".encode()
# b'b\xc6\xa1 \xc4\x91\xe1\xba\xadu ph\xe1\xbb\x99ng'

This is likely what you're seeing in Notepad, it's a representation of the text on disk. But you can turn this into an encoding that prints nicely using .decode();

"bơ đậu phộng".encode().decode('utf-8')
# "bơ đậu phộng"

If I were to take the text from your example then:

"C\u00f4ng".encode().decode('utf-8')
# 'Công'

koaning · April 19, 2022, 9:47am

You might also find this thread insightful. It discusses a similar phenomenon but for Arabic.

celkhaissi · April 23, 2022, 8:22am

Thanks Vincent. Where exactly doe you specify/execute the .encode()/.decode() argument - in Notepad you mean?

koaning · April 24, 2022, 5:45am

The .enocde()/.decode() is Python code, which is run outside of the Notepad. The idea is to load in the file with Python, make some minor edits with Python and save the new file on disk.

Topic		Replies	Views
db-out format usage , database , solved	1	828	March 3, 2022
Language with macrons (āēīōūĀĒĪŌŪ) in output usage , solved	3	664	June 26, 2020
How to get db-out to use unicode symbols database , solved	2	1013	March 12, 2019
Chinese pattern file for text classification	3	244	February 24, 2023
db-out utf-8 character problem usage , database , solved	2	1137	July 14, 2020

How to see Unicode Vietnamese in human-readable?

Related topics