We’ve actually had some pretty good contributions to the Norwegian language data, thanks to the community – so tokenization should work quite well out-of-the-box. Any model you save out from spaCy can be used directly with Prodigy – so you can easily get a blank Norwegian model, and then use Prodigy to add and train the text classifier:
from spacy.lang.nb import Norwegian
nlp = Norwegian()  # alternatively: nlp = spacy.blank('nb')
Once you've saved that model to disk, you can pass it to a recipe like textcat.teach:
prodigy textcat.teach nb_dataset /path/to/nb-model my_data.jsonl --label SOME_LABEL
Vectors are always nice, though. You can either import the FastText ones, or train your own. If you have a corpus with lots of Norwegian text, you can use the
terms.train-vectors recipe (see here for details). This will let you bootstrap lists of seed terms from your vectors using
terms.teach, to speed up the annotation process and get over the “cold start” problem.
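If you do go the FastText route: the pre-trained vectors are distributed as a plain-text .vec file – a header line with the vocabulary size and dimension, then one token per line followed by its float values. If you want to inspect or filter them before importing, a minimal parser might look like this (the load_vectors helper and the tiny inline sample are just illustrations, not part of any spaCy or FastText API):

```python
def load_vectors(lines):
    """Parse FastText .vec-style lines into a {token: [floats]} dict."""
    it = iter(lines)
    n_words, n_dims = (int(x) for x in next(it).split())  # header: count, dimension
    vectors = {}
    for line in it:
        parts = line.rstrip().split(' ')
        token, values = parts[0], [float(v) for v in parts[1:]]
        assert len(values) == n_dims  # every row should match the header dimension
        vectors[token] = values
    return vectors

# tiny inline sample instead of the real (very large) .vec file
sample = ["2 3", "hei 0.1 0.2 0.3", "verden 0.4 0.5 0.6"]
vecs = load_vectors(sample)
```

From a dict like this you can then add the rows you care about to your model's vocab.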
Integrating Polyglot into spaCy will take a little more work, because you’ll have to dig into the internals. (The new custom pipeline components and attributes might help a lot, though!) You could also use Polyglot to pre-process your text or extract examples for annotation. Here’s a simple dummy recipe that shows a function extracting sentiment scores and wrapping a stream of annotation tasks:
import prodigy
from prodigy.components.loaders import JSONL

def add_sentiment(stream):
    for eg in stream:
        sent_score = get_sent_score_from_polyglot(eg['text'])  # extract a score for the text
        # add it to the task – you'll likely want to do this more elegantly ;)
        eg['label'] = 'POSITIVE' if sent_score > 0.5 else 'NEGATIVE'
        yield eg

@prodigy.recipe('sentiment-analysis')
def sentiment_analysis(dataset, source):
    stream = JSONL(source)
    return {
        'dataset': dataset,               # add annotations to this dataset
        'stream': add_sentiment(stream),  # add labels to stream based on sentiment
        'view_id': 'classification'       # annotate in classification mode
    }
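To see what the wrapper does to the stream independent of Polyglot, here's a self-contained sketch with a stubbed scorer standing in for get_sent_score_from_polyglot (the stub's logic and the 0.5 threshold are just assumptions for illustration):

```python
def get_sent_score_stub(text):
    # stand-in for a real Polyglot sentiment call
    return 0.9 if 'great' in text else 0.1

def label_by_sentiment(stream):
    # same pattern as the recipe above: enrich each task, then yield it
    for eg in stream:
        score = get_sent_score_stub(eg['text'])
        eg['label'] = 'POSITIVE' if score > 0.5 else 'NEGATIVE'
        yield eg

stream = [{'text': 'This is great'}, {'text': 'This is bad'}]
labelled = list(label_by_sentiment(stream))
# labelled[0]['label'] == 'POSITIVE', labelled[1]['label'] == 'NEGATIVE'
```

The generator keeps the stream lazy, which matters if your source is large – Prodigy only pulls tasks as the annotator needs them.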
The data produced by the recipe can then be used to train your spaCy model.
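For that training step, Prodigy's exported records typically carry at least 'text', 'label' and an 'answer' field ('accept'/'reject'). A minimal sketch of filtering accepted examples into (text, annotations) pairs for spaCy's text classifier – the record layout shown matches the usual Prodigy output, but double-check it against your own export:

```python
def to_training_pairs(records):
    """Keep only accepted annotations; each becomes (text, {'cats': {label: True}})."""
    pairs = []
    for rec in records:
        if rec.get('answer') == 'accept':
            pairs.append((rec['text'], {'cats': {rec['label']: True}}))
    return pairs

records = [
    {'text': 'Flott film!', 'label': 'POSITIVE', 'answer': 'accept'},
    {'text': 'Elendig.', 'label': 'NEGATIVE', 'answer': 'reject'},
]
train_data = to_training_pairs(records)
# only the accepted record survives
```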