If you remember, I was using Prodigy to post-edit Google Translate output from English to Persian. That part is done, thank you for your comments. Now it is time to annotate the edited corpus, and I am running into an issue where Farsi text is not rendered correctly in ner.manual, as you can see here.
Hi! The spaces between the tokens indicate the token boundaries and/or whitespace. I can't read Farsi, unfortunately, so it's difficult for me to spot the differences. But you can check how the text is tokenized by just running spaCy over the text directly and looking at the tokens, e.g.:
nlp = spacy.blank("fa")
doc = nlp("Your text here")
print([token.text for token in doc])
We haven't had any reports about tokenization problems for Persian but if there's something you think is inconsistent, feel free to open an issue.
What exactly is your goal: do you want to train a new pipeline from scratch? You don't need a pretrained pipeline in order to do this. If you're annotating categories like PREMISE and CLAIM, you probably want to train a new pipeline, right? Using some arbitrary pretrained model might not make much sense, because it'll likely be trained on different text and with different labels.
You can train your own spaCy pipeline in Prodigy by starting off with just the blank tokenizer and setting --lang fa when you run prodigy train. You can definitely also export Prodigy annotations and train any other model, or use any other model to pre-annotate your data. See here for an example: Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP
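For example, assuming your annotations are saved in a Prodigy dataset called fa_claims (swap in your own dataset and output directory names), the command could look something like this:

# Train a new NER pipeline from a blank Persian tokenizer
prodigy train ./output_fa --ner fa_claims --lang fa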
If you look at my first post, you can see there is a "semi-space" between the letters (not between tokens) when I use "fa" or "pr". So I could not solve that; it seems it is not a tokenization problem but an issue with how the stream app handles RTL text (or with my corpus). As soon as I used "en", these semi-space gaps disappeared. Do you know why that is?
About training an NER model on my annotated data using a blank model: I will try it for sure (probably also with spancat) and will report the results here. Thank you for that. I already have experience with custom NER, which worked pretty well, but there I had more examples (sentences); here I have fewer. However, this is more or less a span categorization task. Since my corpus is limited, I thought a pretrained model would also improve the results of the custom NER. Isn't that so?
Many thanks!
Ah, I think one thing that doesn't help here is that so many things are called a "model" these days, including trained pipelines, embeddings and so on.
It can definitely be useful to initialise your pipeline with pretrained embeddings, e.g. a Farsi variant of BERT etc. This can give you a good boost in accuracy, especially if you're working with smaller datasets. You can easily use any transformer embeddings that are available via spaCy's config: Embeddings, Transformers and Transfer Learning · spaCy Usage Documentation
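For example, the relevant part of the training config could look something like this (just a sketch: "HooshvareLab/bert-fa-base-uncased" is one Farsi BERT variant available on the Hugging Face Hub, and the exact architecture version depends on your spacy-transformers install):

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "HooshvareLab/bert-fa-base-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96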