Unsupervised training of a spaCy language model using a specialized corpus

My task involves Greek text used by a branch of the Greek State Administration. As a starting point, I will use the large Greek spaCy model.

However, since the text is specialized, I would like to fine-tune the Greek spaCy model on this particular corpus. I am referring to unsupervised learning, i.e. training on the raw, unannotated text. My hope is that after fine-tuning, the existing Greek spaCy model will perform better on NER tasks for documents originating from this corpus.

Could you please advise how such fine-tuning can be accomplished? Please be generous in suggesting possible answers, tutorials and pertinent links, if you happen to know of any.

Hi @a.konstantinidis,

I think what you want is the spacy pretrain command, which you can find documented here: https://spacy.io/api/cli#pretrain
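
To make that a bit more concrete, here is a minimal sketch of preparing the raw text for pretraining. It assumes your corpus is available as a list of plain strings and that you want the JSONL format with one object per line and a "text" key, which is what the pretrain command reads. The function name write_pretrain_jsonl and the file name raw_text.jsonl are made up for illustration, and the command shown in the comment assumes spaCy v3 with a config that contains a [pretraining] block, so please check it against the docs for the version you have installed.

```python
import json
import os
import tempfile
from pathlib import Path


def write_pretrain_jsonl(texts, out_path):
    """Write raw documents as JSONL, one {"text": ...} object per line.

    Returns the number of documents written. Empty documents are skipped.
    (Hypothetical helper, not part of spaCy.)
    """
    out_path = Path(out_path)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for text in texts:
            # Collapse runs of whitespace and newlines into single spaces.
            text = " ".join(text.split())
            if not text:
                continue
            # ensure_ascii=False keeps the Greek characters readable in the file.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Placeholder documents; replace with text from your own corpus.
    docs = [
        "Πρώτο έγγραφο   της υπηρεσίας.",
        "",
        "Δεύτερο έγγραφο\nμε αλλαγή γραμμής.",
    ]
    out_file = os.path.join(tempfile.mkdtemp(), "raw_text.jsonl")
    written = write_pretrain_jsonl(docs, out_file)
    print(f"Wrote {written} documents to {out_file}")

    # Assumed spaCy v3 invocation (verify against the linked docs):
    #   python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text raw_text.jsonl
```

Keep in mind that pretraining only uses the raw text to initialize the model's token-to-vector layer. The NER component itself still has to be trained on annotated examples afterwards, so this complements labelled data rather than replacing it.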

There has also been some discussion of pretraining on the forum before. For example, you can look at these threads: https://support.prodi.gy/search?q=pretrain

In general we can only provide limited support for spaCy-only questions that don't involve Prodigy directly here, as we need to make sure the forum stays more or less on-topic. Fortunately spaCy has quite an active community, so you should be able to find a lot of information from other users, and if you need more direct help there are several consultants who know the software well.