Chinese pattern file for text classification

Hi guys!

Is it possible to create a pattern file in Chinese?

My current approach is to collect all the related terms via "terms.teach" and then export them via "terms.to-patterns". But when I do this for Chinese, the output doesn't look UTF-8 encoded: instead of the characters I'm getting escape sequences like "\u9ad8\uXXXXX".

Any idea how to work around this?

Thanks!

Hi Jason.

I'd like to dive into this one a bit deeper, but in order to do that it'd help if I had an example. Could you share one so that I may try it on my local machine?

I'm trying to run terms.teach on some big company names, e.g. 臉書 (Facebook), 阿里巴巴 (Alibaba), 抖音 (TikTok), etc.

Can you try? Thanks!

I just generated an examples.jsonl file with these contents:

{"text": "臉書"}
{"text": "阿里巴巴"}
{"text": "抖音"}

Next, I annotate them with textcat.manual:

python -m prodigy textcat.manual issue-6383 examples.jsonl --label company

When I now export these annotations via db-out, the output indeed doesn't look UTF-8 encoded at first glance.

python -m prodigy db-out issue-6383 

This yields:

{"text":"\u81c9\u66f8","_input_hash":2129430638,"_task_hash":1813097253,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235133}
{"text":"\u963f\u91cc\u5df4\u5df4","_input_hash":786114873,"_task_hash":-1088016566,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235134}
{"text":"\u6296\u97f3","_input_hash":-1163267003,"_task_hash":165057773,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235134}

However, suppose I save these annotations to a file and re-use them in another recipe:

python -m prodigy db-out issue-6383 > examples2.jsonl
python -m prodigy textcat.manual issue-6383-2 examples2.jsonl --label company

Then the interface renders the characters just fine, which means no information was lost along the way.

This behavior is normal, and it is also explained in more detail here:
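That said, if you'd still prefer the files on disk to show the Chinese characters directly (for example, to make a patterns file easier to read), you can re-serialise the exported lines yourself. A minimal sketch, assuming the export was saved as examples2.jsonl (the output file name here is made up):

import json

# Re-write the exported annotations with ensure_ascii=False so the
# characters appear literally instead of as \uXXXX escapes
with open("examples2.jsonl", encoding="utf-8") as src, \
     open("examples2-readable.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        task = json.loads(line)
        dst.write(json.dumps(task, ensure_ascii=False) + "\n")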