Chinese pattern file for text classification

koaning · February 24, 2023, 10:45am

I just generated an examples.jsonl file with these contents.

{"text": "臉書"}
{"text": "阿里巴巴"}
{"text": "抖音"}

Next, I annotate them with textcat.manual:

python -m prodigy textcat.manual issue-6383 examples.jsonl --label company

When I now export these annotations via db-out, the output indeed does not look like readable UTF-8: the Chinese characters appear as \uXXXX escape sequences.

python -m prodigy db-out issue-6383

This yields:

{"text":"\u81c9\u66f8","_input_hash":2129430638,"_task_hash":1813097253,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235133}
{"text":"\u963f\u91cc\u5df4\u5df4","_input_hash":786114873,"_task_hash":-1088016566,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235134}
{"text":"\u6296\u97f3","_input_hash":-1163267003,"_task_hash":165057773,"label":"company","_view_id":"classification","answer":"accept","_timestamp":1677235134}

However, I can save these annotations to a file and re-use them in another recipe:

python -m prodigy db-out issue-6383 > examples2.jsonl
python -m prodigy textcat.manual issue-6383-2 examples2.jsonl --label company

The interface then renders the characters correctly, meaning no information was lost.
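
To see why nothing is lost, here is a minimal sketch using only Python's standard json module (not Prodigy's own loaders, which I'm not reproducing here). The helper names decode_line and to_readable are made up for illustration, and the record is trimmed to two keys. Parsing an escaped line gives back the original characters, and re-serializing with ensure_ascii=False writes them out literally if you prefer a human-readable file.

```python
import json


def decode_line(line: str) -> dict:
    """Parse one JSONL line; the JSON parser decodes unicode escapes into characters."""
    return json.loads(line)


def to_readable(line: str) -> str:
    """Re-serialize a JSONL line with non-ASCII characters written out literally."""
    return json.dumps(json.loads(line), ensure_ascii=False)


# One record in the escaped form that db-out printed (trimmed for brevity).
# The raw string keeps the backslashes exactly as they appear in the file.
escaped = r'{"text":"\u81c9\u66f8","label":"company"}'

record = decode_line(escaped)
print(record["text"])        # the decoded Chinese text
print(to_readable(escaped))  # same record, without \u escapes
```

Both forms are valid JSON and parse to the same data, so either file can be passed back into a recipe.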

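This behavior is normal: db-out writes non-ASCII characters as JSON \uXXXX escapes, and any JSON parser decodes them back to the original text. It is also explained in more detail here: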
