Prodigy created model does not work

I work with Nordic languages (Swedish, Danish, Finnish, Norwegian). As you surely know, spaCy has no complete language packages for the Nordic languages (I mean complete ones like English or French), so I built a spaCy model for each of them based on what exists in spaCy's language folder. spaCy has a tokenizer and lemmatizer for all of them, and on top of that I created a POS tagger, word vectors (VEC) and a basic NER. So our Nordic models now have:
tokenizer, lemmatizer, POS tagger, VEC and NER.

I took all the pre-trained VEC from the Facebook project and created a spaCy model from them.
Here is the link for VEC:


This is the result (similarity scores for pairs of words):

är är 1.0
är en 0.74227923
är bra 0.5291633
är nyheter 0.60735524
glad det 0.50049955
glad här 0.5660494
glad är 0.47149867
glad en 0.5388159
glad bra 0.5673869
glad nyheter 0.7072487
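
For reference, scores like these are cosine similarities between two word vectors, which is what spaCy's similarity methods compute by default. A minimal sketch in plain Python, where the three-dimensional vectors are made-up placeholders and not values from the actual Swedish model:

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention used here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, for illustration only.
vec_a = [1.0, 2.0, 0.5]
vec_b = [0.9, 1.8, 0.7]

print(cosine_similarity(vec_a, vec_a))  # a vector compared with itself
print(cosine_similarity(vec_a, vec_b))  # two nearby vectors
```

A word compared with itself always scores 1.0 (up to rounding), which matches the "är är 1.0" line above.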

Keep in mind that, for simplicity, I created an NER model with only two entities, “Person” and “Finance”. I want to annotate with Prodigy and train the NER. I started Prodigy for annotation using:

prodigy ner.teach

The first 20-40 suggestions were just punctuation, like “.” or “,”, or something totally irrelevant. Then the suggestions became more meaningful, but not completely: even after a few good ones, there were still obviously wrong ones like “en” or “ett”, which are articles in Swedish. After approximately 1500 annotations (1000 “YES” and 500 “NO”), I ran ner.batch-train.

I tested the resulting model in spaCy and it was a total mess: more or less everything came back as an entity. In my opinion this is not the “catastrophic forgetting” problem, because in that case the model would just forget entities, whereas here it looks corrupted rather than forgetful.
If you could do me a favour and first tell me whether my pipeline looks OK, I can then give you more details about my model and so on. The problem might also be the quality of the word vector model; I have no idea whether, for example, the following scores make sense. I also do not have a parser. Does that cause any problems?

glad bra 0.5673869
glad nyheter 0.7072487
är en 0.74227923 (Should this be that high?)

Thanks for your time.

The way you set up your model looks okay to me, but I think the problem lies here:

If you want to train a completely new category from scratch, it's very important that the model sees enough positive examples so it can make meaningful suggestions during ner.teach. If you start with nothing, the model is naturally going to suggest completely random tokens and it'd take very long and lots of annotations for it to converge. 1500 annotations including a bunch of random suggestions is a very small dataset if you're training from scratch (for comparison, the English models are trained on a corpus of 2 million words). As a result, you get a model that isn't very useful.

To get over the cold-start problem, you can start by annotating a subset of your corpus by hand using ner.manual. (You probably also want to hold back some fully labelled gold-standard data so you can perform a more reliable evaluation.) This will give you enough positive examples to pre-train the model, so you can then use ner.teach to improve it.
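
Holding back evaluation data can be as simple as shuffling the manually annotated examples once and setting a fixed share aside. A rough sketch in plain Python; the function name, the 20% share and the placeholder examples are all assumptions for illustration, not part of Prodigy:

```python
import random


def split_train_eval(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off an evaluation set."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical stand-ins for manually annotated sentences.
annotated = [{"text": "example sentence %d" % i} for i in range(10)]
train, evaluation = split_train_eval(annotated)
print(len(train), len(evaluation))
```

The important part is that the held-back examples never end up in the training set, otherwise the reported accuracy will look better than it really is.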

If you have word lists and examples of your entity types, you could also use ner.teach with match patterns (see here for examples of pattern files). This would mean that the model in the loop also sees suggestions from your patterns if they occur in the text, which will increase the number of positive examples.
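
As a rough illustration, a patterns file is JSONL: one JSON object per line with a "label" and a "pattern", where a token-based pattern is a list of per-token attribute dicts. The sketch below builds such lines from a word list; the helper names and the seed terms are made up for the example, and splitting on whitespace only approximates the real tokenizer:

```python
import json


def make_patterns(label, terms):
    """Build one case-insensitive token pattern per term."""
    patterns = []
    for term in terms:
        # One {"lower": ...} dict per whitespace-separated word. Terms that
        # contain punctuation may need hand-written token lists instead.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(records):
    """Serialize records as JSONL, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical seed terms; replace them with names from your own data.
patterns = make_patterns("PERSON", ["Anna Svensson", "Erik"])
patterns += make_patterns("COMPANY", ["Exempelbanken"])
print(to_jsonl(patterns))
```

Saving the printed lines to a .jsonl file gives something that can be passed to ner.teach via --patterns.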

Finally, you definitely want to collect more examples – ideally a few thousand – before you train and try out the model. When you run ner.batch-train, check out the accuracy to make sure your model is improving. You can also run ner.train-curve, which will run a training experiment with different amounts of data (25%, 50%, 75%, 100%) to give you a rough idea of whether your model is improving with more data. As a rule of thumb, if you see an increase in accuracy in the last 25%, it's likely that the model will improve more with more data.
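
The rule of thumb boils down to comparing the accuracy at 75% of the data with the accuracy at 100%. A tiny sketch, assuming you have copied the four accuracy figures out of the ner.train-curve output by hand; the numbers below are invented for illustration and are not real results:

```python
def improved_in_last_quarter(accuracies, min_gain=0.0):
    """Check whether accuracy rose between 75% and 100% of the data.

    accuracies: scores at 25%, 50%, 75% and 100% of the data, in that order.
    """
    if len(accuracies) != 4:
        raise ValueError("expected four accuracy values")
    return accuracies[3] - accuracies[2] > min_gain


# Hypothetical accuracy figures, for illustration only.
curve = [0.41, 0.52, 0.58, 0.63]
print(improved_in_last_quarter(curve))
```

If the check comes back false, more annotations of the same kind are less likely to help, and it may be worth looking at the label scheme or the data instead.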

Btw, if you haven't seen it already, you might also find this video helpful, which shows a full end-to-end workflow of training a new entity type:

Thanks, Ines, for your prompt reply. Today I started with an already trained NER model with 6 entities; the model was trained on a huge dataset of about 9 million words and over 250000 entities. I just wanted to check how Prodigy could improve the model and how well it works for annotation. A strange error occurred just from running ner.teach (see below). Keep in mind that I checked the spaCy model locally and it was fine: it loaded in Python 2.7.13 without any problem and found all the entities. I also tried loading a spaCy model with an empty NER and, surprisingly, it worked, so surely my model has a problem. But as it loads locally in spaCy and works fine, I am confused. Please take a look; you might see the issue. Here is the error:

Using 6 labels:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data#
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER data/text/ner-swedish-finance.txt --label EVENT
Using 1 labels: EVENT
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data# cd NER/
root@39396ecbf955:/prodigy/data/NER# ls
accuracy.json meta.json ner tokenizer vocab
root@39396ecbf955:/prodigy/data/NER# cd ..
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER data/text/ner-swedish-finance.txt --patterns data/pattern-files/patterns.jsonl --label
Using 6 labels: EVENT, MORTGAGE_LOAN, CUSTOMER_SITUATION, PRODUCT, COMPANY, PERSON
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data# ls
NER pattern-files prodigy.db text
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER/ data/text/ner-swedish-finance.txt --patterns data/pattern-files/patterns.jsonl --label EVENT,MORTGAGE_LOAN,CUSTOMER_SITUATION,PRODUCT,COMPANY,PERSON
Using 6 labels: EVENT, MORTGAGE_LOAN, CUSTOMER_SITUATION, PRODUCT, COMPANY, PERSON
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed

thanks
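
For anyone hitting the same message: a character such as '\udcf2' is a lone surrogate, and Python 3 refuses to encode those as UTF-8. One way such a character can appear is when a byte that is not valid UTF-8 (here 0xF2, which is "ò" in Latin-1) gets decoded with the surrogateescape error handler. The sketch below only demonstrates that mechanism in plain Python; whether this is what happened to the model's strings (for example because they were written under a different encoding or Python version) is an assumption, not a confirmed diagnosis, and find_lone_surrogates is a made-up helper name:

```python
def find_lone_surrogates(text):
    """Return (index, code point) pairs for surrogate characters in text."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text)
            if 0xD800 <= ord(ch) <= 0xDFFF]


# The byte 0xF2 is not valid UTF-8 on its own; with surrogateescape it is
# carried through as the lone surrogate U+DCF2 instead of raising an error.
bad = b"\xf2".decode("utf-8", errors="surrogateescape")
print(find_lone_surrogates(bad))

# Encoding that string back to UTF-8 fails with the message from the traceback.
try:
    bad.encode("utf-8")
except UnicodeEncodeError as err:
    print(err.reason)

# The same byte read as Latin-1 is an ordinary character.
print(b"\xf2".decode("latin-1"))
```

Running a check like find_lone_surrogates over the strings a model was built from is one way to see whether any of them carry such characters before loading the model under Python 3.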