Persian (Farsi) text not rendered correctly in ner.manual

As you may remember, I was using Prodigy to post-edit Google Translate output from English to Persian. That part is done, thank you for your comments. Now it is time to annotate the edited corpus, and I am running into an issue where Farsi text is not rendered correctly in ner.manual, as you can see here:

All the letters are separated from each other by a small "semi-space" gap...

This is what I used:

import spacy
nlp = spacy.blank("fa")

!python -m prodigy ner.manual PerMT_V02 ../data/blank-farsi-model "../data/PerMTV02.jsonl" --label "PREMISE","CLAIM" --highlight-chars    

Following what I had done before, I used this in the NER recipe file in Prodigy:

    # Excerpt from inside the customized ner.manual recipe function
    blocks = [{"view_id": "text"}, {"view_id": "text_input", "field_autofocus": True}]
    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        # Only applied when --highlight-chars is set
        "before_db": remove_tokens if highlight_chars else None,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
            "field_autofocus": True,
            "blocks": blocks,
        },
    }

I also applied the "rtl" setting.

I do not know what the problem is.

I would be happy to hear your ideas.

Many thanks

I solved that by using the "en" model. It seems Persian (fa) tokenization does not work with my corpus! Any idea why "pr" and "ar" do not work?

Using this command, I can start annotating:

!python -m prodigy ner.manual PerMT_V02 blank:en "../data/PerMTV02.jsonl" --label "PREMISE","CLAIM"  

So it is basically English tokenization! But as you can see, there are extra spaces between words. Any idea why?

However, the real challenge :slight_smile: will start afterward, as Persian has no pipeline in Prodigy :frowning:

Is there any way to use other models (e.g. from Hugging Face) in Prodigy or spaCy?

Hi! The spaces between the tokens indicate the token boundaries and/or whitespace. I can't read Farsi, unfortunately, so it's difficult for me to spot the differences. But you can check how the text is tokenized by just running spaCy over the text directly and looking at the tokens, e.g.:

import spacy

nlp = spacy.blank("fa")
doc = nlp("Your text here")
print([token.text for token in doc])

We haven't had any reports about tokenization problems for Persian but if there's something you think is inconsistent, feel free to open an issue.
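If the tokens look fine but the rendering still seems off, it may also be worth looking at the raw characters in the corpus. Persian text commonly contains the zero-width non-joiner (U+200C, often called a half-space), and machine-translated or post-edited text can pick up other invisible characters. The snippet below is only a minimal sketch using the Python standard library; the helper name `count_suspect_chars` and the selection of characters are my own assumptions, not something Prodigy or spaCy provides.

```python
import unicodedata
from collections import Counter

# Characters that are easy to miss on screen but can affect rendering
# and tokenization. This selection is an assumption; extend as needed.
SUSPECT = {"\u200b", "\u200c", "\u200d", "\u200e", "\u200f", "\u00a0"}


def count_suspect_chars(text):
    """Count invisible or special spacing characters by Unicode name."""
    return Counter(unicodedata.name(ch) for ch in text if ch in SUSPECT)


# A Persian word written with a zero-width non-joiner in the middle
sample = "می\u200cروم"
print(count_suspect_chars(sample))
```

Running this over a few lines of the input file should show whether the text contains an unusual number of such characters.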

What exactly is your goal? Do you want to train a new pipeline from scratch? You don't need a pretrained pipeline in order to do this. If you're annotating categories like PREMISE and CLAIM, you probably want to train a new pipeline, right? Using some arbitrary pretrained model might not make much sense, because it'll likely be trained on different text and with different labels.

You can train your own spaCy pipeline in Prodigy by starting off with just the blank tokenizer and setting --lang fa when you run prodigy train. You can definitely also export Prodigy annotations and train any other model, or use any other model to pre-annotate your data. See here for an example:
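Independently of that, here is a rough sketch of the export route: reading annotations that were written to a JSONL file (for example with `prodigy db-out`) and turning them into plain `(text, spans)` pairs that any other training setup could consume. It assumes the usual task fields `"text"`, `"spans"` and `"answer"`; the function name `load_accepted_spans` and the demo records are made up for illustration.

```python
import json


def load_accepted_spans(jsonl_lines):
    """Yield (text, [(start, end, label), ...]) for accepted examples."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        # Skip rejected or ignored examples
        if example.get("answer") != "accept":
            continue
        spans = [
            (span["start"], span["end"], span["label"])
            for span in example.get("spans", [])
        ]
        yield example["text"], spans


# Made-up records for illustration, not taken from the PerMT_V02 dataset
demo = [
    '{"text": "Taxes should fall because growth is slow.", '
    '"spans": [{"start": 0, "end": 17, "label": "CLAIM"}], "answer": "accept"}',
    '{"text": "Skipped example.", "spans": [], "answer": "reject"}',
]
records = list(load_accepted_spans(demo))
print(records)
```

With a real export, you would pass the open file object instead of the `demo` list.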


Thank you for your quick response.

If you look at my first post, you will see that there is a "semi-space" between the letters (not the tokens) when I use "fa" or "pr".

So I could not solve that. It seems the problem is not tokenization but the stream app with "rtf" (or my corpus). As soon as I used "en", this semi-space distance disappeared. Do you know why that is?

As for training an NER model on my annotated data starting from a blank model, I will try it for sure (probably also with spancat) and report the results here. Thank you for that. I already have experience with custom NER, which worked pretty well, but there I had more examples (sentences) and here I have fewer. This task, however, is more or less span categorization. Since my corpus is limited, I thought a pretrained model would also improve the results of the custom NER. Is that not so?
Many thanks

Ah, I think one thing that doesn't help here is that so many things are called a "model" these days, including trained pipelines, embeddings and so on :sweat_smile:

It can definitely be useful to initialise your pipeline with pretrained embeddings, e.g. a Farsi variant of BERT etc. This can give you a good boost in accuracy, especially if you're working with smaller datasets. You can easily use any transformer embeddings that are available via spaCy's config:
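As a minimal sketch, the transformer part of a training config might look roughly like the fragment below. It assumes the `spacy-transformers` package is installed, and the model name is only a placeholder I picked because multilingual BERT covers Persian; a Farsi-specific model from the Hugging Face hub could be substituted. The fragment is kept in a Python string and parsed with the standard library's `configparser` purely as a sanity check; spaCy reads the actual config file with its own loader.

```python
import configparser

# Placeholder model name (an assumption): swap in any suitable
# Hugging Face transformer that covers Farsi.
MODEL_NAME = "bert-base-multilingual-cased"

TRANSFORMER_CFG = f"""
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "{MODEL_NAME}"
tokenizer_config = {{"use_fast": true}}
"""

# Sanity check only: confirm the fragment has the expected sections and keys
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(TRANSFORMER_CFG)
model_section = parser["components.transformer.model"]
print(model_section["name"])
```

The downstream components (e.g. ner or spancat) would then need to be configured to listen to this transformer in the rest of the config.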
