Writing Direction for Multi-Language Texts

Dear Prodigy Team,

I am working on a NER task for a dataset that contains both Arabic and English sentences, and might even contain sentences having a mixture of Arabic and English (within the same sentence), for example:
[image: example sentence mixing Arabic and English]
This causes display problems in the Prodigy browser UI, because English requires ltr while Arabic requires rtl. Is there a way to switch the direction automatically within the same project, depending on the language of each sentence?

Looking forward to your reply!

Hi! 🙂 While Prodigy has a "writing_dir" setting, it's expected to be set on a per-session basis, so it wouldn't work if you have texts in both writing directions within the same session.

Are you using ner.manual for manual annotation? Of course, the easiest solution would be to write a custom stream that filters by language so you can start two sessions with different "writing_dir" settings, one for English and one for Arabic.
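As a rough illustration of the filtering idea, the split could also happen as a preprocessing step before the data reaches Prodigy. This is only a sketch: splitByDirection is a hypothetical helper, not part of Prodigy, and it assumes each task has a "text" key and that any sentence containing Arabic-script characters (including mixed ones) should go to the RTL session.

```javascript
// Matches any character in the Arabic script (needs the `u` flag).
const ARABIC = /\p{Script=Arabic}/u;

// Hypothetical helper: split tasks into two groups, one per session.
// Assumption: mixed Arabic/English sentences are treated as RTL.
function splitByDirection(tasks) {
  const groups = { rtl: [], ltr: [] };
  for (const task of tasks) {
    groups[ARABIC.test(task.text) ? 'rtl' : 'ltr'].push(task);
  }
  return groups;
}

const tasks = [
  { text: 'Hello world' },
  { text: '\u0645\u0631\u062d\u0628\u0627' },        // Arabic only
  { text: '\u0623\u0639\u0645\u0644 \u0641\u064a Google' }, // mixed
];
const groups = splitByDirection(tasks);
console.log(groups.ltr.length, groups.rtl.length); // prints: 1 2
```

Each group could then be written to its own JSONL file and loaded in a session with the matching "writing_dir" setting.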

Alternatively, you could add a key to each incoming task that contains the writing direction (or language code) and then use custom JavaScript to change the writing direction of the annotation card. For example, this assumes each of your tasks has a "direction" key that's either "rtl" or "ltr":

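// Runs every time the UI updates, e.g. when a new task is shown
document.addEventListener('prodigyupdate', event => {
    const content = event.detail.content
    const container = document.querySelector('.prodigy-content')
    // Apply the task's own "direction" value ("rtl" or "ltr") to the card
    container.style = `direction: ${content.direction} !important`
})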

There are a few details that are a bit more fiddly and not covered by just toggling the top-level writing direction: the spacing around highlighted spans, and the position of the label (which would typically change depending on the writing direction).
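The "direction" key itself has to be added to the tasks beforehand. One possible way to guess it, sketched here under simple assumptions, is to count letters from right-to-left scripts against all other letters and take the majority. guessDirection and addDirection are hypothetical names, only Arabic and Hebrew script are treated as RTL, and ties fall back to "ltr".

```javascript
// Letters from the RTL scripts considered here (assumption: Arabic and Hebrew only).
const RTL_LETTER = /[\p{Script=Arabic}\p{Script=Hebrew}]/u;
// Any letter in any script.
const LETTER = /\p{L}/u;

// Hypothetical helper: majority vote over the letters in the text.
function guessDirection(text) {
  let rtl = 0;
  let ltr = 0;
  for (const ch of text) {
    if (RTL_LETTER.test(ch)) rtl += 1;
    else if (LETTER.test(ch)) ltr += 1;
  }
  // Ties and letter-free texts default to "ltr".
  return rtl > ltr ? 'rtl' : 'ltr';
}

// Return a copy of the task with a "direction" key added.
function addDirection(task) {
  return { ...task, direction: guessDirection(task.text) };
}

const mixed = addDirection({
  text: '\u0623\u0639\u0645\u0644 \u0641\u064a \u0634\u0631\u0643\u0629 Google',
});
console.log(mixed.direction); // prints: rtl
console.log(addDirection({ text: 'I work at Google' }).direction); // prints: ltr
```

A majority vote is a crude heuristic; for sentences that are close to half and half, a fixed rule per dataset may be more predictable.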

If you have mixed text and you want to set different directions for individual tokens, that's currently more difficult, because you'd have to keep a record of tokens and their directions somewhere, and then toggle the writing direction in JavaScript for each token <span>, based on its "id" (the token ID).

Some ideas for how we could better support this going forward:

  1. Allow individual tasks to override "config" properties. I've been thinking about adding this anyway for other specific use cases, since it can sometimes make sense to override config on a per-task basis.
  2. Allow individual tokens to specify a "writing_dir" for token-based interfaces like ner_manual. You'd still have to set this yourself on the "tokens" of each outgoing task, but for something like English vs. Arabic, this should be pretty easy to do programmatically. The direction of a highlighted span would then follow the top-level writing direction of the sentence. So if you have an Arabic sentence, a highlighted span would be formatted RTL, but tokens within it that are English words would still be formatted LTR. That makes sense, right?

Thank you, Ines, for your reply! I ended up adding the "direction" key to each task as you suggested, and the final JavaScript in prodigy.json that worked for me was the following:

"javascript": "window.prodigy.update(window.prodigy.content); document.addEventListener('prodigyupdate', event => {const container = document.querySelector('.prodigy-content'); container.style = `direction: ${window.prodigy.content.direction} !important`;})",

The ideas you mentioned that you might implement in the future are definitely interesting and very useful. Looking forward to the coming updates from Prodigy 🙂


Thanks for updating and sharing your code. Glad it worked in such a straightforward way 🙂

My suggestion #1 from above is definitely coming to Prodigy v1.10, so you'll then be able to remove your JS and just add "config": {"writing_dir": "rtl"} etc. to the individual annotation tasks.

#2 I'd still have to experiment with. If it's possible, could you share one of your mixed-writing-dir texts and their "tokens" so I could use that in my tests?


Hello Ines! Apologies for the late reply; I wanted to fix some issues before sending the examples. Attached below are JSONL files containing 4 examples with their annotation results, and another file showing the results with Arabic instead of Unicode, in case it helps. I understood that you only needed a few examples; let me know if there's something else I can do!
arabishh.jsonl (395 Bytes)
test_arabish_ner.jsonl (4.0 KB)
arabish_ner_results_converted.jsonl (3.7 KB)
On a side note, I noticed that the direction gets messed up again when labeling since the labels are in English (an example is shown below). I thought it might be normal, but I decided to ask anyway in case it has a fix.
Before labeling:
[screenshot: before]
After labeling:
[screenshot: after]
I'm not sure if setting the direction for each token instead might be one fix, but I was trying to avoid doing so.
Thank you again for your fast replies and looking forward to the updates you mentioned 🙂


Just released Prodigy v1.10, which includes improvements to the display of RTL spans and the option to include a "config" key in the individual tasks to customise the UI, e.g. to set the "writing_dir" on a per-example basis.
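For tasks that already carry the custom "direction" key from earlier in this thread, a small conversion step could move that value into the per-task "config" instead. This is a sketch only: toPerTaskConfig is a hypothetical helper, and it assumes "direction" is either "rtl" or "ltr" and leaves any other task untouched.

```javascript
// Hypothetical helper: replace the custom "direction" key with a
// per-task "config" entry of the form {"writing_dir": "rtl" | "ltr"}.
function toPerTaskConfig(task) {
  const { direction, ...rest } = task;
  // Leave tasks without a usable direction unchanged.
  if (direction !== 'rtl' && direction !== 'ltr') return task;
  // Keep any config the task already had and add writing_dir to it.
  return { ...rest, config: { ...(task.config || {}), writing_dir: direction } };
}

const converted = toPerTaskConfig({ text: 'x', direction: 'rtl' });
console.log(JSON.stringify(converted));
// prints: {"text":"x","config":{"writing_dir":"rtl"}}
```

With tasks in this shape, the custom JavaScript that toggled the direction should no longer be needed.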

I still want to look into the token-based settings, so thanks a lot for providing the data samples 🙏


Thank you for this post. I had some problems with this, and after reading the post I changed my Prodigy config to "rtl". It now works well, but I still have a problem with text that mixes Persian and English.

As you can see, in sentences with both Persian and English words the word order is wrong. Any idea how I can solve this?

Usually if you have mixed text with multiple writing directions, you need to decide on one general direction.
However, if you already know beforehand the exact tokens that should have a different direction, you can set a "style" on just those tokens. For example:

{"style": {"direction": "ltr"}}
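If the tokens that need a different direction can be identified by script, the "style" entry could be added programmatically. The sketch below assumes an RTL sentence (for example Persian) in which any token containing Latin letters should be LTR, and it writes the style as a nested object, {"style": {"direction": "ltr"}}. markLtrTokens is a hypothetical helper, and the token fields ("text", "start", "end", "id") follow the usual "tokens" format.

```javascript
// Matches any Latin-script letter (needs the `u` flag).
const LATIN = /\p{Script=Latin}/u;

// Hypothetical helper: add an LTR style to tokens containing Latin letters.
// Other tokens are left as they are and follow the sentence direction.
function markLtrTokens(task) {
  const tokens = task.tokens.map(token =>
    LATIN.test(token.text) ? { ...token, style: { direction: 'ltr' } } : token
  );
  return { ...task, tokens };
}

const example = markLtrTokens({
  text: '\u0645\u0646 Google',
  tokens: [
    { text: '\u0645\u0646', start: 0, end: 2, id: 0 },
    { text: 'Google', start: 3, end: 9, id: 1 },
  ],
});
console.log(JSON.stringify(example.tokens[1].style)); // prints: {"direction":"ltr"}
```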