Chinese pattern file for text classification

I just generated an examples.jsonl file with these contents.

{"text": "臉書"}
{"text": "阿里巴巴"}
{"text": "抖音"}

Next, I annotate them with textcat.manual:

python -m prodigy textcat.manual issue-6383 examples.jsonl --label company

When I now export these annotations with db-out, the output indeed does not look UTF-8 encoded.

python -m prodigy db-out issue-6383 

This yields:


However, suppose I save these annotations to a file and reuse them in another recipe:

python -m prodigy db-out issue-6383 > examples2.jsonl
python -m prodigy textcat.manual issue-6383-2 examples2.jsonl --label company

Then the interface renders the characters correctly, which means no information was lost.
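
Here is a minimal sketch of why nothing gets lost, assuming the db-out output uses standard JSON \uXXXX escapes for non-ASCII characters. It uses Python's built-in json module as a stand-in, not Prodigy's own serialization code, and the record is just one line from the examples.jsonl above.

```python
import json

# One record from examples.jsonl, used here purely for illustration.
record = {"text": "臉書"}

# With ensure_ascii=True (the default), non-ASCII characters are written
# as \uXXXX escape sequences, so the output contains only ASCII.
escaped = json.dumps(record)
print(escaped)

# Parsing the escaped line gives back the original characters unchanged.
restored = json.loads(escaped)
print(restored["text"])

# With ensure_ascii=False, the characters stay readable in the output.
readable = json.dumps(record, ensure_ascii=False)
print(readable)
```

Both forms are valid JSON and decode to the same text, so the escaped output only looks different on screen.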

This behavior is normal, and it is also explained in more detail here: