Character-based annotation issues with Arabic

Ferial · April 22, 2022, 9:06am

Hello,

I am trying to use ner.manual with character-based annotation on Arabic.
The goal is to tag specific characters within tokens, for example entities without their attached prepositions or possessive pronouns, because Arabic tends to agglutinate these particles with words.

But I am facing the following issue: the characters are detached from each other, which makes the result illegible, as you can see in this screenshot:

Do you have any idea?

Thank you

ines · April 22, 2022, 12:18pm

Hi! ner.manual has an option to set --highlight-chars, which lets you highlight characters instead. However, whether this makes sense to do kinda depends on the model you're looking to train later and whether you're training a character-based model or predicting token-based tags. If you're predicting token-based tags, annotating characters can be counterproductive because you'd be creating annotations that don't map to tokens your model produces, so you won't easily be able to learn from them.
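To illustrate the mismatch, here's a minimal sketch in plain Python. It assumes simple whitespace tokenization and an example sentence of my own ("سافرت لباريس", where the preposition "ل" is attached to "باريس"); the helper names are hypothetical and not part of Prodigy or spaCy.

```python
import re

def whitespace_tokens(text):
    """Tokenize on whitespace and keep character offsets (assumed tokenization)."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\S+", text)
    ]

def is_token_aligned(start, end, tokens):
    """True if the character span begins and ends exactly on token boundaries."""
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    return start in starts and end in ends

text = "سافرت لباريس"
tokens = whitespace_tokens(text)

# Span covering the whole word, including the attached preposition.
whole_word = (6, 12)
# Span covering only the entity, without the preposition.
entity_only = (7, 12)

print(is_token_aligned(*whole_word, tokens))
print(is_token_aligned(*entity_only, tokens))
```

The second span starts in the middle of a token, so a model that predicts one tag per token has no token to attach it to.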

If it's possible, a better approach could be to do the splitting beforehand and use custom tokenization rules or a custom tokenizer that splits subwords.
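As a rough sketch of what splitting beforehand could look like: the snippet below pre-tokenizes the text and writes the tokens into the task, so that particles become their own tokens. Everything here is an assumption for illustration: the particle list, the `min_stem` heuristic and the function names are hypothetical, and naive prefix stripping will over-split some words, so a real setup would need proper rules or a morphological segmenter. The `tokens` entries use the `text`, `start`, `end`, `id` and `ws` keys of Prodigy's token format; whether the recipe keeps pre-set tokens as-is is worth verifying for your version.

```python
import json

# Hypothetical list of single-letter particles to split off (needs linguistic review).
PARTICLES = ("و", "ف", "ب", "ل", "ك")

def split_word(word, min_stem=3):
    """Split at most one leading particle off `word` if enough of the word remains."""
    for particle in PARTICLES:
        if word.startswith(particle) and len(word) - len(particle) >= min_stem:
            return [particle, word[len(particle):]]
    return [word]

def make_task(text):
    """Build a task dict with pre-split tokens and character offsets."""
    tokens = []
    pos = 0
    for word in text.split():
        offset = text.index(word, pos)
        for piece in split_word(word):
            tokens.append({
                "text": piece,
                "start": offset,
                "end": offset + len(piece),
                "id": len(tokens),
                "ws": False,  # corrected below for the last piece of each word
            })
            offset += len(piece)
        pos = offset
        # Only the last piece of a word can be followed by whitespace.
        tokens[-1]["ws"] = pos < len(text) and text[pos].isspace()
    return {"text": text, "tokens": tokens}

if __name__ == "__main__":
    task = make_task("سافرت لباريس")
    # One JSON object per line, as in a JSONL input file.
    print(json.dumps(task, ensure_ascii=False))
```

With tokens like these, the entity can be selected as a whole token without the attached particle, and the annotations stay aligned with the tokenization you'd train on, provided the same splitting is applied at training and prediction time.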

(It could also be useful to look into annotation guidelines for existing Arabic corpora to see how this is normally handled and whether they annotate the whole word or subwords for NER.)

Ferial · April 22, 2022, 12:33pm

Thank you for your reply. This result was actually produced with --highlight-chars enabled.
I also tried bert.ner.manual with a multilingual NER model. The result is better, but still not good enough:

I think your suggestion of custom tokenization rules may solve the problem.
