Whatever model I use, a very large part of the results is nonsensical (assigning the LOC tag to the contracted article l’ or to verbal forms that could hardly be mistaken for a LOC). I have tried using Prodigy to improve the model, but it looks like I’m very far off: after retraining I’m still at 52% accuracy (while the baseline is at 34%).
I suspect there must be something wrong with the word vectors used. Any idea how to look up the vectors associated with their tokens in a text file? Or better, any idea why the performance is so far off? (I’m using news articles to train.)
The French NER model we provide with spaCy was trained on Wikipedia text, using a semi-automatic process. Basically, sentences with links are used as the training data, with the link anchors used to guess entity types for the anchor texts. Next year we’ll be starting annotation projects to provide better free NER models for the languages we support, doing the annotations with Prodigy Scale.
For now, if the pretrained NER model isn’t helpful, you might want to start off with a blank model instead, with only vectors. You could try the French vectors from https://fasttext.cc . After downloading the archive, you should be able to create a spaCy model like this: python -m spacy init-model fr ./fr_vectors_web_lg --vectors ./cc.fr.300.vec.gz
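For reference, here’s a minimal sketch of what “blank model with only vectors” looks like in code (assuming spaCy v2.x and that the init-model command above produced ./fr_vectors_web_lg):

```python
import spacy

# Load the vectors-only model created by `spacy init-model`
nlp = spacy.load('./fr_vectors_web_lg')
print(nlp.vocab.vectors.shape)  # (number of vectors, 300) if the vectors were read correctly

# Add a fresh, untrained NER component on top of the vectors
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
for label in ('PER', 'ORG', 'LOC', 'MISC'):  # example label scheme, use your own
    ner.add_label(label)
```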
To answer your more specific question: you should be able to get the word vector for a word like this:
import spacy
nlp = spacy.load('fr_core_news_md')
lexeme = nlp.vocab[u'Paris']
print(lexeme.vector)
# Alternatively, you can get .vector from a token
doc = nlp(u'Paris')
print(doc[0].vector)
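If you suspect the vectors themselves, a quick sanity check (a sketch, assuming spaCy v2.x) is to look at whether the tokens you care about actually come with a non-zero vector:

```python
import spacy

nlp = spacy.load('fr_core_news_md')
doc = nlp(u"L'équipe de Paris a gagné.")
for token in doc:
    # has_vector / vector_norm show whether a real (non-zero) vector was found
    print(token.text, token.has_vector, round(token.vector_norm, 3), token.is_oov)
```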
Another idea that came to mind when reading your post: As a quick sanity check, you could try running an experiment with a small set of manually labelled gold-standard examples. Basically, run ner.manual with the full label scheme and label every entity in the text for maybe a few hundred examples. Then run ner.batch-train with the --no-missing flag (to indicate that all unlabelled tokens should be treated as outside an entity and not just missing values).
If the results look reasonable, this could indicate that the problem is related to the example selection and that the existing French model you’re using just doesn’t produce good enough suggestions to converge on your data.
If the results look bad and if there’s barely a visible improvement, this could indicate that there’s a deeper problem – maybe in the data, the vectors or the tokenization.
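For reference, a fully annotated example in Prodigy’s JSON format looks roughly like the dict below (the text, offsets and labels are made up). Because every entity in the text is labelled, --no-missing can safely treat all remaining tokens as outside an entity:

```python
# Hypothetical complete, gold-standard annotation (character offsets into "text")
example = {
    "text": "Emmanuel Macron a visité Bordeaux mardi.",
    "spans": [
        {"start": 0, "end": 15, "label": "PER"},
        {"start": 25, "end": 33, "label": "LOC"},
    ],
    "answer": "accept",
}
```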
I’ll try to start from a blank model after sanity checking my examples.
However, I’ve done some experiments with fastText in TF, and the word vectors are lowercased (at least in the wiki set – that’s not the case with GloVe vectors, but those aren’t available for French). I had to construct a character-level feature to retain this information. Could I do the same in spaCy / how would I retain the fact that most ORG / PER entities are capitalized?
Another spaCy beginner’s question: is there a benefit to keeping the old model and training entirely new tags (is there any potential interaction between the pretrained POS tags and the NER training task)?
I believe the common-crawl FastText vectors are case insensitive. I could be wrong, but give it a try – I think they should be okay. I would guess FastText should have an edge for French because of the subword features.
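One way to check this empirically (a rough sketch, assuming you’ve built the vectors model described above) is to compare the cased and lowercased entries directly:

```python
import spacy

# Hypothetical path: the model produced by `spacy init-model` earlier in the thread
nlp = spacy.load('./fr_vectors_web_lg')
for word in (u'Paris', u'paris'):
    lexeme = nlp.vocab[word]
    print(word, lexeme.has_vector, round(lexeme.vector_norm, 3))
```

If the two forms give different results, the table stores cased entries separately.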
I've been working on improving the add-a-label workflows for the NER, but at the moment it's hard to give good guarantees around it. The problem is that the new label may be very surprising given the policies learnt for the original label scheme, especially if the new label overlaps partly with one or more previous entities. I would suggest training a new entity recogniser, rather than trying to "resume" training over the previous NER model.
You don't need to worry about dependencies between the pipeline components, though. They don't currently share any weights, and the POS tags aren't used as features in the NER or parser. So it should be safe to mix and match the components, to only retrain the NER, etc.
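For example (a sketch, assuming spaCy v2.x), you can switch the other components off while only the NER is updated, or load the model without them entirely:

```python
import spacy

nlp = spacy.load('fr_core_news_md')

# The components don't share weights, so the others can be disabled
# while only the NER is being updated:
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    pass  # your nlp.update(...) calls for the NER would go here

# Or load the model without the components you don't need:
nlp_ner_only = spacy.load('fr_core_news_md', disable=['tagger', 'parser'])
```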
OK, using my labelled text and the --no-missing flag, the output gives 74% accuracy.
Now how come the results are so different? Should I stick with the provided French pretrained model or start a new one? (What would be the factors going into making the right choice?)
I also tried the fastText vectors, but I get errors when training the model…
```
Incorrect   240
Baseline    0.003
Accuracy    0.712

Traceback (most recent call last):
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 455, in batch_train
    model.from_bytes(best_model)
  File "cython_src/prodigy/models/ner.pyx", line 429, in prodigy.models.ner.EntityRecognizer.from_bytes
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/site-packages/spacy/language.py", line 690, in from_bytes
    msg = util.from_bytes(bytes_data, deserializers, {})
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/site-packages/spacy/util.py", line 490, in from_bytes
    msg = msgpack.loads(bytes_data, raw=False)
  File "/home/holl/dev/rsc/miniconda3/envs/dlnlp/lib/python3.6/site-packages/msgpack_numpy.py", line 184, in unpackb
    return _unpackb(packed, **kwargs)
  File "msgpack/_unpacker.pyx", line 200, in msgpack._unpacker.unpackb
ValueError: 2681950465 exceeds max_bin_len(2147483647)
```
Damn, we’ll have that error fixed in the next release. In the meantime try python -m spacy init-model fr ./fr_vectors_web_lg --vectors ./cc.fr.300.vec.zip --prune-vectors 20000
The difference is that when you annotate whole texts and use the --no-missing flag, the learning problem is a lot simpler, since the model knows it’s being given the complete and correct answer. The model also doesn’t start off confused by the other categories — it gets to just focus on the one you care about.
If you’re starting from scratch, it’s often better to start with some manual annotations. These can become your evaluation set later on as well. Once you’ve already got a reasonably accurate model, you can try using the ner.teach recipe, to more quickly correct its errors. It’s just that at the start, before you have any data, it’s all errors: so just identifying the errors doesn’t really help the model improve quickly.
I tried your command – it’s the same one as higher up in the post – and that’s where I get the error.
Any workaround?
Looking again at what I got using fr_core_news_{sm,md}, it really looks like something is behaving weirdly, as it keeps proposing tags for articles (Le, L’) at the beginning of a sentence (i.e. capitalized).
[later]
Continuing my experiments, I noticed that the number of entities in my training batch (as reported by ner.batch-train in the BEFORE section) varies a lot: from 397 when using fr_core_news_sm, to 406 with fr_core_news_md, and 2116 with a blank model using cc.fr.300.vec. Would that be a sign that tokens are mapped to vectors with varying success depending on the model used?
It looks like you potentially ended up with the latest version of msgpack, which introduces this problem. Can you try downgrading and installing msgpack==0.5.6?
I think the explanation might be a bit simpler: If you don't supply a separate evaluation set, Prodigy will hold back a certain percentage (20% or 50%, depending on the dataset size) for evaluation, so it can give you results. This is done by shuffling and splitting the training data. The "Entities" reported are the spans in the data – so depending on how the data was split, you could end up with very different distributions here. (So for running more "serious" experiments, we usually recommend a dedicated evaluation set to avoid this.)
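For what it’s worth, a dedicated evaluation set can be as simple as splitting your exported annotations once with a fixed seed and reusing the same evaluation file for every run (the file names below are just placeholders):

```python
import json
import random

# Hypothetical file exported with `prodigy db-out`
with open('annotations.jsonl', encoding='utf8') as f:
    examples = [json.loads(line) for line in f]

random.seed(0)           # fixed seed, so the split stays identical across runs
random.shuffle(examples)
split = int(len(examples) * 0.8)

for path, chunk in (('train.jsonl', examples[:split]), ('eval.jsonl', examples[split:])):
    with open(path, 'w', encoding='utf8') as out:
        out.writelines(json.dumps(eg, ensure_ascii=False) + '\n' for eg in chunk)
```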
I’m already using msgpack==0.5.6. For now I’ve just truncated the fastText vector file to about 1500k vectors, which doesn’t trigger the error. It’s not ideal, but since the vectors at the end of the file are the less common ones, it should still be good enough while building the data set.
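For anyone curious, the truncation itself is simple enough (a sketch, assuming the standard fastText .vec text format where the first line holds the vector count and dimension):

```python
import gzip

N = 1500000  # keep only the N most frequent vectors

with gzip.open('cc.fr.300.vec.gz', 'rt', encoding='utf8') as f_in, \
        open('cc.fr.300.truncated.vec', 'w', encoding='utf8') as f_out:
    n_total, dim = f_in.readline().split()
    f_out.write('{} {}\n'.format(min(N, int(n_total)), dim))
    for i, line in enumerate(f_in):
        if i >= N:
            break
        f_out.write(line)
```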
As for the count of entities, I think I see your point, but my intuition (and the fact that I ran batch-train several times and always obtained similar results) is that something isn’t quite right – unless pretrained models count “entities” differently.
@holl Sorry, it looks like my command got truncated (or at least, to me the last args aren’t visible). I added the argument --prune-vectors 20000 to it, which truncates the vectors to the 20k most common words and maps the other words in the vocab to their nearest neighbour within those twenty thousand.
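If you prefer doing that step from Python (a sketch, assuming spaCy v2.x), the same pruning is available on the vocab:

```python
import spacy

# Hypothetical: a model built from the full vectors file
nlp = spacy.load('./fr_vectors_web_lg')
# Keep the 20,000 highest-ranked vectors; the remaining words are remapped
# to their nearest neighbour among the kept ones.
remap = nlp.vocab.prune_vectors(20000)
nlp.to_disk('./fr_vectors_web_lg_pruned')
```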
I managed to keep 1500k vectors with the --prune-vectors option.
I somehow always get results in the 71–74% accuracy range with about 1000 labelled sentences (accuracy didn’t improve significantly between 500 and 1000 sentences).
My feeling is that I should stick with creating a lot more annotations to have a larger gold standard.
On the other hand, I have a large set of news articles plus a DB of entities (orgs and persons) which should match reasonably well. Would there be a strategy for leveraging this with the matcher? Would that help?
I think the matcher and perhaps the ner.teach command could both be good now, yes. Now that you have a good start, you can start using the model to bootstrap itself a bit. If you haven’t already, make sure you have a good evaluation set that you can keep consistent across your experiments, so that you know how much you’re improving.
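As a sketch of the matcher idea (the names and lists below are made up), you could build a PhraseMatcher from your entity database and use its matches to bootstrap annotations, or export the names as match patterns for ner.teach:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('fr_core_news_md')
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical lists pulled from your database of organisations and persons
org_names = ['Société Générale', 'Airbus']
per_names = ['Emmanuel Macron']
matcher.add('ORG', None, *[nlp.make_doc(name) for name in org_names])
matcher.add('PER', None, *[nlp.make_doc(name) for name in per_names])

doc = nlp("Emmanuel Macron a rencontré la direction d'Airbus.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```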