Dari language support for Annotation and Classification

usage

(Ghezal Ahmad Zia) #1

Hi,
Does Prodigy support annotating Dari, which is a right-to-left (RTL) language? If it does, I will go ahead with installation and usage. I am asking first to avoid wasting time.

Thank you


(Ines Montani) #2

Hi! Rendering and labelling RTL text should be no problem – we actually have several users annotating Arabic text with Prodigy. You can set "writing_dir": "rtl" in your config and the interface will be adjusted accordingly.
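As a rough sketch of that setting, the snippet below merges "writing_dir": "rtl" into a JSON config file while keeping any keys already there. The helper name write_rtl_config and the default file name prodigy.json in the working directory are assumptions made for illustration, so point it at whichever config file your installation actually reads.

```python
import json
from pathlib import Path


def write_rtl_config(path="prodigy.json"):
    """Set "writing_dir": "rtl" in a JSON config file, preserving other keys.

    The default path is an assumption; pass the location of your own config.
    """
    config_path = Path(path)
    config = {}
    if config_path.exists():
        # Load existing settings so they are not overwritten.
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config["writing_dir"] = "rtl"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


if __name__ == "__main__":
    print(write_rtl_config())
```

Merging rather than overwriting means any other settings in the file survive the change.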

spaCy currently doesn’t have any pre-trained models for Dari, but we do have alpha tokenization support for Farsi, which you could use to bootstrap a model and as a tokenizer. Here’s how to export a blank model from spaCy:

import spacy
nlp = spacy.blank("fa")  # create blank language class
nlp.to_disk("/path/to/blank-fa-model")

This will give you a model in the directory /path/to/blank-fa-model, which you can then load into Prodigy – for instance, to add a text classifier or to manually label text (which pre-tokenizes the text for fast highlighting). Here’s an example:

prodigy ner.manual your_dataset /path/to/blank-fa-model /path/to/data.jsonl --label PERSON
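The data.jsonl file in that command holds one JSON object per line, each with a "text" key. Here is a minimal sketch of writing such a file from Python, assuming your Dari texts are already in a list. The helper name write_jsonl and the sample sentence are illustrative only.

```python
import json


def write_jsonl(path, texts):
    """Write one {"text": ...} JSON object per line and return the line count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps the Dari script readable in the file
            # instead of escaping it to \uXXXX sequences.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Illustrative sample sentence; replace with your own data.
    write_jsonl("data.jsonl", ["کابل پایتخت افغانستان است."])
```

The resulting file path is what you pass in place of /path/to/data.jsonl.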