Annotate using ner.manual for a new language

Hi, I'm trying to annotate new data using the ner.manual recipe for Indonesian, which doesn't have any existing models. Should I train using en_core_web_sm, or are there other options for me to do this?

Thanks in advance :)

Hi! The ner.manual recipe only uses the model for tokenization, so all you need is an Indonesian tokenizer (which spaCy already supports). There's no need to start from en_core_web_sm. Once you're ready to train, you can start off with a blank Indonesian model and train your entity recognizer from scratch.

To save out a blank model that only includes the tokenization rules and no pipeline components, you can run the following:

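import spacy
# Create a blank Indonesian pipeline: tokenization rules only, no components
nlp = spacy.blank("id")
# Save it to a directory that can be passed to Prodigy as the base model
nlp.to_disk("/path/to/model")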

Or as a handy one-liner on the command line:

python -c "import spacy; spacy.blank('id').to_disk('/path/to/model')"

In Prodigy, you can then pass in /path/to/model as the base model during annotation and later for training. Also see this thread for more details and examples for working with languages that don't have pre-trained models: Working with languages not yet supported by Spacy
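If you want to double-check the saved model before annotating, here's a minimal sketch that saves the blank pipeline, loads it back and tokenizes a sentence. The temporary directory, the `id_blank` folder name and the sample sentence are just placeholders for illustration, so swap in your own path and text:

```python
import tempfile
from pathlib import Path

import spacy

# Blank Indonesian pipeline: tokenizer rules only, no trained components.
nlp = spacy.blank("id")

# Save to a throwaway directory (placeholder for your real model path)
# and load it back the same way any saved spaCy model is loaded.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "id_blank"  # hypothetical directory name
    nlp.to_disk(model_dir)
    reloaded = spacy.load(model_dir)

# Tokenize an example sentence (made up for this sketch).
sample = "Saya tinggal di Jakarta."
tokens = [token.text for token in reloaded(sample)]

print("language:", reloaded.lang)
print("pipeline components:", reloaded.pipe_names)
print("tokens:", tokens)
```

The pipeline component list should come back empty, which is expected: for ner.manual, only the tokenizer matters.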

Perfect! Thank you so much