I’m running into an issue where Arabic text is not being correctly rendered in ner.manual. Words are spread across a couple lines above and below, and then jump down into their correct place once highlighted (see example below). The word that was above the line is now incorporated into it.
Thanks a lot for the report!
Do you have one or two example texts that I can copy-paste to debug the interface? And could you check the "tokens" property on the incoming tasks that are created and verify that it’s all split correctly? Just so we can make sure that spaCy’s tokenization is not the problem here.
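For anyone wanting to run that check, here is a minimal sketch. It assumes each task is a dict with a "text" string and a "tokens" list whose entries carry "text", "start", "end" and "id" keys; the helper name `find_misaligned_tokens` and the sample task are made up for illustration and are not part of Prodigy.

```python
def find_misaligned_tokens(task):
    """Return tokens whose character offsets don't match the task text."""
    text = task["text"]
    problems = []
    for token in task.get("tokens", []):
        # Slice the task text with the token's offsets and compare it
        # with the text the token claims to cover.
        span = text[token["start"]:token["end"]]
        if span != token["text"]:
            problems.append((token["id"], token["text"], span))
    return problems


# Hypothetical task, shaped like the assumption described above.
task = {
    "text": "Hello world.",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "world.", "start": 6, "end": 12, "id": 1},
    ],
}
print(find_misaligned_tokens(task))
```

An empty list only means the offsets are consistent with the text. Punctuation that stays attached to a word (like "world." here) still has to be spotted by reading the token texts themselves.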
(I was going to suggest modifying the writing direction to RTL, but I just realised that the manual UI doesn’t support the "card_css" config setting at the moment, since this can easily lead to unexpected results in this particular interface. But if you want to hack at it in your browser’s dev tools, I’d be curious to see the effect of setting direction: rtl on the parent container’s styles.)
Sorry, I was wrong: "card_css": "direction: rtl" is definitely possible, even in manual NER mode. Not sure if this is the full solution, though.
Adding "card_css": "direction: rtl" to .prodigy.json fixed the weird display problem. Thank you!
There’s a separate tokenization problem, which is that spaCy attaches periods to the previous token in the xx and en models. But that’s an issue with Arabic in spaCy, not a Prodigy issue.
If you still want a piece of example text, here’s one:
ﺕﻮﺟﺩ ﺏﺎﻠﺒﺣﺮﻴﻧ ﺄﻜﺑﺭ ﻢﻘﺑﺭﺓ ﺕﺍﺮﻴﺨﻳﺓ ﺏﺎﻠﻋﺎﻠﻣ ﻮﻬﻳ ﻊﻟﻯ ﺶﻜﻟ ﺭﻭﺎﺒﻳ (ﺕﻼﻟ) ﻮﺘﺴﻣﻯ ﻢﻗﺎﺑﺭ ﻉﺎﻠﻳ ﻢﺨﺘﻠﻓﺓ ﺎﻠﺤﺠﻣ ﻭﺎﻠﺸﻜﻟ. ﻮﺗﻮﺟﺩ ﺐﻤﻨﻄﻗﺓ ﻉﺎﻠﻳ ﺏﺎﻠﻤﺣﺎﻔﻇﺓ ﺎﻟﻮﺴﻃﻯ، ﻮﻠﻫﺬﻫ ﺎﻠﻤﻗﺎﺑﺭ ﺕﺍﺮﻴﺧ ﻉﺮﻴﻗ ﻕﺪﻴﻣ ﻢﻧﺫ ﻊﻫﺩ ﺢﺿﺍﺭﺓ ﺪﻠﻣﻮﻧ. ﺢﻴﺛ ﻙﺎﻧ ﺎﻟﺪﻠﻣﻮﻨﻳﻮﻧ ﻱﺪﻔﻧﻮﻧ ﻡﻮﺗﺎﻬﻣ ﻒﻳ ﺖﻠﻛ ﺎﻠﺗﻼﻟ ﻮﻬﻳ (ﻖﺑﻭﺭ) ﻮﻳﺪﻔﻧﻮﻧ ﻢﻌﻬﻣ ﺡﺎﺠﻳﺎﺘﻬﻣ ﺎﻠﻨﻔﻴﺳﺓ ﻭﺎﻠﺜﻤﻴﻧﺓ، ﻚﻣﺍ ﻱﺪﻔﻧﻮﻧ ﻢﻌﻬﻣ ﺞِـِﺭﺎﺑ ﺎﻠﻣﺍﺀ ﺎﻠﻔﺧﺍﺮﻳﺓ، ﻭﺎﻠﻤﻘﺘﻨﻳﺎﺗ ﺎﻠﺜﻤﻴﻧﺓ ﻆﻧًﺍ ﻢﻨﻬﻣ ﻭﺎﻌﺘﻗﺍﺩﺍ ﺄﻧ ﻩﺫﺍ ﺎﻠﻤﻴﺗ ﻕﺩ ﻲﻋﻭﺩ ﻞﻠﺤﻳﺍﺓ ﻒﻳ ﺄﻳ ﻞﺤﻇﺓ. [ﺐﺣﺎﺟﺓ ﻞﻤﺻﺩﺭ]
Thanks, and nice to hear that it's working!
I'm trying to think of a better way to handle this from the user's perspective, but I'm not sure what the best solution is. We could introduce a "writing_direction" config setting, which might be more intuitive to write and less abstract than "card_css": "direction: rtl"... But it'd still be adding more complexity that's maybe not necessary. But I guess there are also too many RTL languages to have the front-end or back-end detect this automatically based on nlp.lang... and it'd also be making too many assumptions on the user's behalf. (And it still wouldn't solve the problem if you're using an xx model.) Anyway, if you have any ideas or suggestions, let me know!
Yes, this is likely because the unicode character classes used in the punctuation rules don't include the unicode ranges for Arabic. We actually just got two pull requests (#1879, #1893) adding Persian to spaCy, and one of the PRs adds \p{Arabic} to the list of uncased characters. So this should probably solve the underlying issue.
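To illustrate the kind of failure described here, a simplified sketch using Python's built-in re module. The two patterns are made-up stand-ins for a suffix rule, not spaCy's actual punctuation rules, and the range U+0600 to U+06FF only approximates what \p{Arabic} would cover.

```python
import re

# Stand-in suffix rule: split off a trailing period, but only when it
# follows a character from the listed class (Latin letters only).
LATIN_ONLY = re.compile(r"(?<=[a-zA-Z])\.$")
# The same rule with the basic Arabic block added to the character class.
WITH_ARABIC = re.compile(r"(?<=[a-zA-Z\u0600-\u06FF])\.$")


def split_suffix(token, rule):
    """Split a trailing period off the token if the rule matches."""
    match = rule.search(token)
    if match is None:
        return [token]
    return [token[:match.start()], token[match.start():]]


word = "\u0642\u062f\u064a\u0645."  # an Arabic word followed by a period
print(split_suffix("old.", LATIN_ONLY))       # period is split off
print(len(split_suffix(word, LATIN_ONLY)))    # period stays attached
print(len(split_suffix(word, WITH_ARABIC)))   # period is split off
```

With the Latin-only class the lookbehind never matches an Arabic letter, so the period stays glued to the word. Once the Arabic range is in the class, the same rule splits it off.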
I think just having to specify "card_css": "direction: rtl" is just fine. That’s a pretty easy note in the README. I think people working with RTL languages in NLP are used to the RTL terminology.
I just checked out the PRs, and hopefully they’ll fix it!
Yeah, now that I think about it, maybe some type of auto-detection isn’t so bad after all. I mean, if the user isn’t aware of the config setting and just comes across this issue in ner.manual, it’s pretty difficult to figure out what’s going on. And we also don’t want people to think we’ve only ever thought about LTR languages, and everything else requires a “hack”.
So one alternative idea would be this: If all relevant recipes expose 'lang': nlp.lang in their config, the controller could perform a simple check for the most common languages, and set the writing direction accordingly.
    # Only applies when the user hasn't set "writing_direction" themselves
    if config.get('lang') in ('ar', 'fa', 'ur', 'he', 'yi'):
        config.setdefault('writing_direction', 'rtl')
The direction would only be set if no "writing_direction" setting is found in the existing user config. This would allow the user to manually set "writing_direction": "ltr" if their model is set to "lang": "ar", but is actually an Arabic transliteration model.
I like that idea. There aren’t that many RTL languages in spaCy so it’s not bad to enumerate them. If it does set writing_direction to rtl based on the language, it might be nice to have a big message pop up in the log that tells users it’s auto-switching and that they can override with "writing_direction": "ltr" in .prodigy.json.
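A rough sketch of how the auto-switch plus log message could look. Everything here is hypothetical: "writing_direction" is only the setting proposed in this thread, and the function name, logger name and message wording are placeholders rather than anything Prodigy provides.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("writing_direction_sketch")

# The RTL language codes enumerated earlier in the thread.
RTL_LANGS = ("ar", "fa", "ur", "he", "yi")


def apply_writing_direction(config):
    """Default to RTL for known RTL languages unless the user already chose."""
    if "writing_direction" in config:
        return config  # an explicit user setting always wins
    if config.get("lang") in RTL_LANGS:
        config["writing_direction"] = "rtl"
        # Tell the user what happened and how to undo it.
        logger.warning(
            'Detected lang "%s": switching to right-to-left display. '
            'Set "writing_direction": "ltr" in your config to override.',
            config["lang"],
        )
    return config


print(apply_writing_direction({"lang": "ar"}))
print(apply_writing_direction({"lang": "ar", "writing_direction": "ltr"}))
```

Logging at warning level means the message shows up even with default logging settings, which seems to fit the "big message" idea, though the right level is a judgment call.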
The only recipe that wouldn’t really get covered with this is mark, where people will have text but no model to detect the language from. Maybe a hack that looks for characters in those languages and prints/logs a suggestion to set RTL? But that’s pretty clunky.
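One way that character check could be sketched, using only the standard library. The function name `looks_rtl` and the 0.5 threshold are arbitrary choices for illustration; the idea is to count letters whose Unicode bidirectional class is right-to-left.

```python
import unicodedata


def looks_rtl(text, threshold=0.5):
    """Guess whether text is mostly right-to-left from its letters' bidi classes."""
    rtl = ltr = 0
    for char in text:
        bidi = unicodedata.bidirectional(char)
        if bidi in ("R", "AL"):   # right-to-left letters (Hebrew, Arabic, ...)
            rtl += 1
        elif bidi == "L":         # left-to-right letters
            ltr += 1
        # digits, spaces and punctuation have other classes and are ignored
    total = rtl + ltr
    return total > 0 and rtl / total >= threshold


print(looks_rtl("\u0645\u0631\u062d\u0628\u0627"))  # Arabic letters
print(looks_rtl("hello world"))                      # Latin letters
```

A recipe without a model could run something like this over the first few texts and log a suggestion to set RTL if most of them come back true. It is still a heuristic, so it would be safer as a hint than as an automatic switch.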