Dari language support for Annotation and Classification


(Ghezal Ahmad Zia) #1

Does Prodigy support annotating Dari, which is a right-to-left (RTL) language? If it does, I will go ahead with installation and usage. I am asking first to avoid wasting time.

Thank you

(Ines Montani) #2

Hi! Rendering and labelling RTL text should be no problem – we actually have several users annotating Arabic text with Prodigy. You can set "writing_dir": "rtl" in your config and the interface will be adjusted accordingly.
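As a rough sketch of that config change, the snippet below merges the setting into a prodigy.json file using only Python's standard library. The helper name set_writing_dir and the file location are assumptions for illustration; Prodigy looks for prodigy.json in the working directory or in your Prodigy home directory, so adjust the path to your setup.

```python
import json
from pathlib import Path


def set_writing_dir(config_path, direction="rtl"):
    """Merge "writing_dir" into a JSON config file, keeping other keys."""
    path = Path(config_path)
    # Load existing settings if the file is already there
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config["writing_dir"] = direction
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config


if __name__ == "__main__":
    # Hypothetical location: a prodigy.json in the current working directory
    print(set_writing_dir("prodigy.json"))
```

Editing the file by hand works just as well; the point is only that "writing_dir" sits alongside your other settings as a top-level key.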

spaCy currently doesn’t have any pre-trained models for Dari, but we do have alpha tokenization support for Farsi, which you could use to bootstrap a model and as a tokenizer. Here’s how to export a blank model from spaCy:

import spacy
nlp = spacy.blank("fa")  # create a blank Farsi language class
nlp.to_disk("/path/to/blank-fa-model")  # save it so Prodigy can load it

This will give you a model in the directory /path/to/blank-fa-model, which you can then load into Prodigy – for instance, to add a text classifier or to manually label text (which pre-tokenizes the text for fast highlighting). Here’s an example:

prodigy ner.manual your_dataset /path/to/blank-fa-model /path/to/data.jsonl --label PERSON
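The input file in that command is newline-delimited JSON. Here is a minimal sketch of producing one, assuming each record only needs a "text" key. The helper name write_jsonl and the sample sentences are placeholders, not anything from Prodigy itself.

```python
import json


def write_jsonl(path, texts):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for text in texts:
            # ensure_ascii=False keeps the Perso-Arabic script readable in the file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Placeholder sentences; replace with your own Dari text
    write_jsonl("data.jsonl", ["سلام دنیا", "این یک جمله است"])
```

Saving the file as UTF-8 matters here, since other encodings may garble the script before it reaches the annotation interface.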