Chinese pattern file for text classification

Hi guys!

Is it possible to create a pattern file in Chinese?

My current approach is to collect all the related terms via "terms.teach" and then export them via "terms.to-patterns". But when I do this for Chinese, the output doesn't look UTF-8 encoded: instead of the characters I'm getting escape sequences like "\u9ad8\uXXXXX".

Any idea how to work around this?

Thanks!

Hi Jason.

I'd like to dive into this one a bit deeper, but in order to do that it'd help if I had an example. Could you share one so that I may try it on my local machine?

I'm trying to run terms.teach on some big company names, e.g. 臉書 (Facebook), 阿里巴巴 (Alibaba), 抖音 (TikTok), etc.

Can you try? Thanks!

I just generated an examples.jsonl file with these contents:

{"text": "臉書"}
{"text": "阿里巴巴"}
{"text": "抖音"}

Next, I annotate them with textcat.manual:

python -m prodigy textcat.manual issue-6383 examples.jsonl --label company

When I now export these annotations via db-out, the output indeed doesn't look UTF-8 encoded at first glance.

python -m prodigy db-out issue-6383 

This yields:

{"text":"\u81c9\u66f8","_input_hash":2129430638,"_task_hash":1813097253,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235133}
{"text":"\u963f\u91cc\u5df4\u5df4","_input_hash":786114873,"_task_hash":-1088016566,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235134}
{"text":"\u6296\u97f3","_input_hash":-1163267003,"_task_hash":165057773,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235134}

However, suppose I save these annotations to a file and re-use them in another recipe:

python -m prodigy db-out issue-6383 > examples2.jsonl
python -m prodigy textcat.manual issue-6383-2 examples2.jsonl --label company

Then the interface renders the characters just fine, which means no information was lost along the way.

This behavior is normal, and it is also explained in more detail here:
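That said, if you'd still prefer the files on disk to show the Chinese characters directly (for example, to make a patterns file easier to read), you can re-serialise the exported lines yourself. A minimal sketch, assuming the export was saved as examples2.jsonl (the output file name here is made up):

import json

# Re-write the exported annotations with ensure_ascii=False so the
# characters appear literally instead of as \uXXXX escapes
with open("examples2.jsonl", encoding="utf-8") as src, \
     open("examples2-readable.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        task = json.loads(line)
        dst.write(json.dumps(task, ensure_ascii=False) + "\n")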