db-out format

Hi there,
My source JSON files render the Syriac text correctly, but after annotation the db-out file renders everything as ASCII escape sequences (I think).

Input (excerpt)

{"text":"̈ܬܐ ܐܬܐܣܝܘ ܒ̈ܢܝܢܫܐ܂ ܕܘܝܕ ܠܐ ܐܬܩܛܠ܂ ܡܛܠ ܕܡܝܬ ܒܣܝܒ","meta":{"source":"The Digital Syriac Corpus (https://syriaca.org/work/8501)"}}

Output (excerpt)

{"text":"\u0710 \u0717\u0718\u071d\u072c \u071d\u0715\u0725\u0702 \u0715\u0720\u0721\u0720\u072c\u0717 \u0715\u0710\u0720\u0717\u0710 \u0720\u0710 \u0710\u0722\u072b \u0721\u071b\u0710\u0702 \u0710\u0718 \u0721\u071b\u0710 \u0720\u0723\u071f\u0717","meta":{"source":"The Digital Syriac Corpus (https://syriaca.org/work/8501)"},"_input_hash":-455951441,"_task_hash":-1480668003,"_is_binary":false,"tokens":[{"text":"\u0710","start":0,"end":1,"id":0,"ws":true},

Is there a way I can preserve the input format when I db-out?

Hi! This is just how non-ASCII characters are represented within JSON: each \uXXXX sequence is an escape for a single character, so no information is lost. If you load the data back in Python (e.g. with json.loads), you'll get the Unicode characters back.
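As a quick check, here is a minimal sketch of that round trip. The record is a shortened, hand-copied piece of the output excerpt above, not a real export, and the variable names are only illustrative.

```python
import json

# A shortened record in the escaped form shown in the db-out excerpt.
# The raw string keeps the backslashes literal, as they appear in the file.
escaped_line = r'{"text": "\u0710 \u0717\u0718\u071d\u072c"}'

# json.loads decodes every \uXXXX escape back into the real character.
record = json.loads(escaped_line)

# The decoded text is ordinary Unicode: 5 Syriac letters and one space.
print(len(record["text"]))  # 6
```

The escaped and unescaped forms are two spellings of the same JSON value, so any JSON parser should give you identical text from either one.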

If you want, you can write the data back out from Python with json.dumps and ensure_ascii=False set. In that case, you just need to watch out for the encoding: write the file with an explicit encoding such as UTF-8, because otherwise your output may become unusable later.
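A minimal sketch of that conversion, assuming the export is newline-delimited JSON with one record per line. The function name and the file names in the usage note are hypothetical, so adjust them to your setup.

```python
import json
from pathlib import Path


def rewrite_unescaped(src: Path, dst: Path) -> int:
    """Copy a JSONL file, writing non-ASCII characters literally.

    Returns the number of records written.
    """
    count = 0
    # Open both files with an explicit encoding so the result does not
    # depend on the platform default.
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # ensure_ascii=False keeps the Syriac characters as-is
            # instead of emitting \uXXXX escapes.
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

You could then call something like rewrite_unescaped(Path("annotations.jsonl"), Path("annotations_utf8.jsonl")) on the exported file. Anything that reads the new file later needs to open it as UTF-8 as well.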
