Prodigy created model does not work

Kasra_Moh · November 8, 2018, 3:22pm

I work with Nordic languages (Swedish, Danish, Finnish, Norwegian). As you surely know, there are no complete spaCy language packages for the Nordic languages (I mean complete ones like those for English or French), so I built a spaCy model for each of them based on what exists in spaCy's language folder. spaCy has a tokenizer and lemmatizer for all of them, and on top of that I created a POS tagger, word vectors (VEC) and a basic NER. So our Nordic models now have:
tokenizer, lemmatizer, POS tagger, word vectors and NER.

I took the pre-trained vectors from Facebook's fastText project and built a spaCy model from them.
Here is the link for the vectors:

https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

# Pre-trained word vectors

We are publishing pre-trained word vectors for 294 languages, trained on [*Wikipedia*](https://www.wikipedia.org) using fastText.
These vectors in dimension 300 were obtained using the skip-gram model described in [*Bojanowski et al. (2016)*](https://arxiv.org/abs/1607.04606) with default parameters.

## Format

The word vectors come in both the binary and text default formats of fastText.
In the text format, each line contains a word followed by its embedding. Each value is space separated.
Words are ordered by their frequency in a descending order.

## License

The pre-trained word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

## References

If you use these word embeddings, please cite the following paper:

P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)
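
The text format described in the excerpt above (one word per line, followed by its space-separated values) can be read with a few lines of Python. This is only a minimal sketch: the function name `read_vec` and the sample data are made up for illustration, and the code assumes the file starts with a header line giving the word count and the dimension, which the downloaded `.vec` files normally do.

```python
import io


def read_vec(lines):
    """Parse fastText text-format vectors into a dict of word -> list of floats."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip("\n").split(" ")
        # Assumed header line "<word count> <dimension>": skip it if present.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        word, values = parts[0], parts[1:]
        # Lines may end with a trailing space, so drop empty fields.
        vectors[word] = [float(v) for v in values if v]
    return vectors


# Tiny made-up sample standing in for a real .vec file.
sample = io.StringIO("2 3\när 0.1 0.2 0.3\nen 0.2 0.1 0.4\n")
vectors = read_vec(sample)
print(sorted(vectors))
```

For a real file, the same function can be given an open file handle, for example `read_vec(open(path, encoding="utf-8"))`, where `path` points at the downloaded vectors.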


This is the result (similarity scores for some word pairs):

är är 1.0
är en 0.74227923
är bra 0.5291633
är nyheter 0.60735524
glad det 0.50049955
glad här 0.5660494
glad är 0.47149867
glad en 0.5388159
glad bra 0.5673869
glad nyheter 0.7072487
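
For reference, scores like the ones above are typically cosine similarities between the two word vectors, which is what spaCy's `similarity` methods compute by default. The post does not say how the numbers were produced, so treat that as an assumption. The sketch below uses made-up 3-dimensional vectors, not the real 300-dimensional fastText ones, so the values will not match the list above.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy = {
    "är": [0.9, 0.1, 0.3],
    "en": [0.8, 0.3, 0.2],
    "glad": [0.1, 0.9, 0.4],
}

for a in ("är", "glad"):
    for b in toy:
        print(a, b, round(cosine(toy[a], toy[b]), 3))
```

A word compared with itself always scores 1.0, which matches the first line of the results. Very frequent function words tend to occur in similar contexts, so a fairly high score between two of them is not by itself a sign of broken vectors.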

Keep in mind that, for simplicity, I created a NER model with just two entity types, “Person” and “Finance”, which I want to annotate with Prodigy and then train. I started annotating in Prodigy with:

prodigy ner.teach

The first 20-40 suggestions were just punctuation, like “.” or “,”, or something totally irrelevant. After that the suggestions became more meaningful, but not consistently: after a few good ones there were still obviously wrong ones like “en” or “ett”, which are articles in Swedish. After approximately 1500 annotations (1000 “YES” and 500 “NO”), I ran ner.batch-train.

I tested the resulting model in spaCy and it was a total mess: more or less every token came back as an entity. In my opinion this is not the “catastrophic forgetting” problem in NLP, because in that case the model would just forget entities, whereas here it looks corrupted rather than forgetful.
Could you do me a favour and first tell me whether my pipeline looks OK? Then I can give you more details about my model and so on. The problem might also be the quality of the word vectors; I have no idea whether, for example, the following scores make sense. Also, I do not have a parser. Does that cause any problems?

glad bra 0.5673869
glad nyheter 0.7072487
är en 0.74227923 (Should this be that high?)

Thanks for your time.

ines · November 8, 2018, 3:35pm

The way you set up your model looks okay to me, but I think the problem lies here:

If you want to train a completely new category from scratch, it's very important that the model sees enough positive examples so it can make meaningful suggestions during ner.teach. If you start with nothing, the model is naturally going to suggest completely random tokens and it'd take very long and lots of annotations for it to converge. 1500 annotations including a bunch of random suggestions is a very small dataset if you're training from scratch (for comparison, the English models are trained on a corpus of 2 million words). As a result, you get a model that isn't very useful.

To get over the cold-start problem, you can start by annotating a subset of your corpus by hand using ner.manual. (You probably also want to hold back some fully labelled gold-standard data so you can perform a more reliable evaluation.) This will give you enough positive examples to pre-train the model, so you can then use ner.teach to improve it.
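
Holding back gold-standard data can be as simple as shuffling the manually annotated examples once and setting a fixed share aside. The following is a rough sketch, not part of Prodigy: the helper `split_examples`, the 20% share and the placeholder examples are all assumptions made for illustration.

```python
import random


def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and return (train, evaluation) lists."""
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder records standing in for manually annotated examples.
examples = [{"text": "example sentence %d" % i} for i in range(10)]
train, evaluation = split_examples(examples)
print(len(train), len(evaluation))
```

The important part is that the evaluation examples are never used for training, so the accuracy measured on them reflects text the model has not seen.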

If you have word lists and examples of your entity types, you could also use ner.teach with match patterns (see here for examples of pattern files). This would mean that the model in the loop also sees suggestions from your patterns if they occur in the text, which will increase the number of positive examples.
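
A patterns file is a JSONL file with one pattern per line, each with a `label` and a `pattern` (a list of token descriptions). A word list can be turned into that shape with a short script. This is a sketch under assumptions: the terms below are invented examples, the helper names are hypothetical, and splitting on whitespace only approximates how the tokenizer will split the terms.

```python
import json


def make_patterns(label, terms):
    """Build one case-insensitive token pattern per term."""
    patterns = []
    for term in terms:
        # Assumption: whitespace splitting matches the tokenizer for these terms.
        tokens = [{"lower": tok.lower()} for tok in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as JSONL, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in patterns)


# Invented example terms; replace them with entries from your own word lists.
persons = make_patterns("PERSON", ["Anna Svensson"])
companies = make_patterns("COMPANY", ["Nordea"])
jsonl = to_jsonl(persons + companies)
print(jsonl)
```

Writing `jsonl` to a file encoded as UTF-8 gives something that can be passed to ner.teach via `--patterns`.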

Finally, you definitely want to collect more examples – ideally a few thousand – before you train and try out the model. When you run ner.batch-train, check out the accuracy to make sure your model is improving. You can also run ner.train-curve, which will run a training experiment with different amounts of data (25%, 50%, 75%, 100%) to give you a rough idea of whether your model is improving with more data. As a rule of thumb, if you see an increase in accuracy in the last 25%, it's likely that the model will improve more with more data.
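
The rule of thumb about the last 25% can be written down as a tiny check. The accuracy values here are invented for illustration and are not output from ner.train-curve; the function name and the `min_gain` threshold are assumptions as well.

```python
def still_improving(accuracies, min_gain=0.0):
    """Return True if accuracy rose between the last two data sizes.

    `accuracies` holds the scores at 25%, 50%, 75% and 100% of the data, in order.
    """
    return accuracies[-1] - accuracies[-2] > min_gain


# Made-up scores for the four steps of a training curve.
curve = [0.41, 0.52, 0.58, 0.63]
print(still_improving(curve))
```

If the check is true for the real numbers, collecting more annotations is probably worth the effort; if the curve has flattened out, more of the same data is less likely to help.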

Btw, if you haven't seen it already, you might also find this video helpful, which shows a full end-to-end workflow of training a new entity type:

Kasra_Moh · November 9, 2018, 4:12pm

Thanks ines for your prompt reply. Today I started with an already trained NER model with 6 entity types; the model was trained on a huge dataset of about 9 million words and over 250,000 entities. I just want to check how Prodigy could improve the model and how well the annotation works. A strange error occurred as soon as I ran ner.teach, and I got the following error. Keep in mind that I checked the spaCy model locally and it was fine: it loaded in Python 2.7.13 without any problem and found all the entities. I also tried a spaCy model with an empty NER and, surprisingly, it worked, so surely the problem is with my model. But since it loads locally in spaCy and works perfectly there, I am confused. Please take a look; maybe you can spot the issue. Here is the error:

Using 6 labels:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data#
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER data/text/ner-swedish-finance.txt --label EVENT
Using 1 labels: EVENT
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data# cd NER/
root@39396ecbf955:/prodigy/data/NER# ls
accuracy.json meta.json ner tokenizer vocab
root@39396ecbf955:/prodigy/data/NER# cd ..
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER data/text/ner-swedish-finance.txt --patterns data/pattern-files/patterns.jsonl --label
Using 6 labels: EVENT, MORTGAGE_LOAN, CUSTOMER_SITUATION, PRODUCT, COMPANY, PERSON
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data# ls
NER pattern-files prodigy.db text
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER/ data/text/ner-swedish-finance.txt --patterns data/pattern-files/patterns.jsonl --label EVENT,MORTGAGE_LOAN,CUSTOMER_SITUATION,PRODUCT,COMPANY,PERSON
Using 6 labels: EVENT, MORTGAGE_LOAN, CUSTOMER_SITUATION, PRODUCT, COMPANY, PERSON
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed

thanks
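
One possible reading of the error above, which nothing in the thread confirms: the character '\udcf2' is what Python 3 produces when the single byte 0xF2 is decoded with the `surrogateescape` error handler, which would suggest that some string in the saved vocab is not valid UTF-8 (for example, text written in a Latin-1 style encoding by the Python 2 process). The sketch below reproduces that mechanism and shows a hypothetical helper, `find_non_utf8_lines`, for locating such bytes. It does not touch spaCy or Prodigy.

```python
def find_non_utf8_lines(data):
    """Return (line number, raw bytes) for each line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Byte 0xF2 is "ò" in Latin-1 but is not valid UTF-8 on its own.
raw = b"\xf2"
escaped = raw.decode("utf-8", errors="surrogateescape")
print(repr(escaped))

# Encoding the escaped string back to UTF-8 fails like the traceback does.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as err:
    message = str(err)
    print(message)

# Made-up sample: one valid UTF-8 line followed by one Latin-1 encoded line.
sample = "är\n".encode("utf-8") + "ò\n".encode("latin-1")
print(find_non_utf8_lines(sample))
```

If this guess applies, reading the files in the model's vocab directory in binary mode and passing their contents to the helper should point at the offending entries, which could then be re-saved as UTF-8.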
