NER and blank models

My goals are

  1. Add 3 labels by annotating based on a pattern file - done on my end.
  2. Train a model based on my annotations/dataset with `prodigy ner.batch-train mydataset_ner en_core_web_lg` - done on my end.

But now I can see that some general labels such as CARDINAL or ORG are starting to mess up my own labels.

How can I use en_core_web_lg without its default named entities, so that in the end only my labels are applied correctly?

I tried to manually create a blank model based on en_core_web_lg by replacing its ner component with a blank one, but I cannot run ner.batch-train on it.

```python
import spacy

nlp = spacy.load('en_core_web_lg')
ner = nlp.create_pipe("ner")
nlp.replace_pipe("ner", ner)

nlp.to_disk('blank_ner_en_core_web_lg')
```

Help most appreciated guys! :slight_smile:

Your approach sounds correct and should work! Which versions of Prodigy and spaCy were you on, and what error did you see?

One problem that occurred in the past was that blank components needed to have their weights initialized first. For example:

```python
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    nlp.begin_training()  # initializes the weights
```

If you’re using the latest version of Prodigy, you should also be able to pass in a base model that doesn’t have an ner component at all. Prodigy will then create it for you, add it to the pipeline, initialize it, and train it with your examples. So in your code, you’d only have to call `nlp.remove_pipe("ner")`.

Hello ines, thanks for your quick response!
I believe I am using the latest version of Prodigy, 1.8.3.

$ prodigy stats

```
Prodigy stats

Version          1.8.3
Location         C:\WORKSPACE\ENV\lib\site-packages\prodigy
Prodigy Home     C:\WORKSPACE\.prodigy
Platform         Windows-10-10.0.17134-SP0
Python Version   3.7.2
Database Name    SQLite
Database Id      sqlite
Total Datasets   1
Total Sessions   48
```

$ python make_blank_ner_model.py

```python
import spacy

nlp = spacy.load('en_core_web_lg')
ner = nlp.create_pipe("ner")
nlp.replace_pipe("ner", ner)

nlp.to_disk('blank_ner_en_core_web_lg')
```

$ prodigy ner.match 3_ner blank_ner_en_core_web_lg email_rows.jsonl -pt patterns.jsonl

```
Traceback (most recent call last):
  File "C:\WORKSPACE\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\WORKSPACE\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\WORKSPACE\ENV\lib\site-packages\prodigy\__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\WORKSPACE\ENV\lib\site-packages\plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "C:\WORKSPACE\ENV\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\WORKSPACE\ENV\lib\site-packages\prodigy\recipes\ner.py", line 64, in match
    model = PatternMatcher(spacy.load(spacy_model)).from_disk(patterns)
  File "C:\WORKSPACE\ENV\lib\site-packages\spacy\__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "C:\WORKSPACE\ENV\lib\site-packages\spacy\util.py", line 133, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "C:\WORKSPACE\ENV\lib\site-packages\spacy\util.py", line 173, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\WORKSPACE\ENV\lib\site-packages\spacy\language.py", line 791, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\WORKSPACE\ENV\lib\site-packages\spacy\util.py", line 630, in from_disk
    reader(path / key)
  File "C:\WORKSPACE\ENV\lib\site-packages\spacy\language.py", line 787, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=["vocab"])
  File "nn_parser.pyx", line 629, in spacy.syntax.nn_parser.Parser.from_disk
  File "nn_parser.pyx", line 54, in spacy.syntax.nn_parser.Parser.Model
TypeError: Model() takes exactly 1 positional argument (0 given)
```


This works:
$ prodigy ner.match 3_ner en_core_web_lg email_rows.jsonl -pt patterns.jsonl

```
Starting the web server at http://localhost:8080
Open the app in your browser and start annotating!
```

This also does not work, with the same error stack as above:

$ prodigy ner.teach 3_ner blank_ner_en_core_web_lg email_rows.jsonl --label MV,VESSEL,DWT -U

My guess is that my model with the replaced empty ner pipe is not being created properly.


> So in your code, you’d only have to call `nlp.remove_pipe("ner")`

Ines, do you mean that I should create a custom recipe for that?

Ah, yes – that looks like what I was referring to. The error message here isn’t very nice, but I’m pretty sure it fails because the blank NER component isn’t initialized.

No, I just meant in the code you use to create `blank_ner_en_core_web_lg` :slightly_smiling_face: Instead of calling `create_pipe` and `replace_pipe`, you just call `remove_pipe` and get rid of the old entity recognizer:

```python
nlp = spacy.load("en_core_web_lg")
nlp.remove_pipe("ner")
nlp.to_disk("en_core_web_lg_without_ner")
```

Prodigy should then add it and initialize it when you train. Alternatively, you can also use the code I posted above to initialize the blank NER component after you add it.

Thanks a lot ines!

You are right. I just used the code below, and my blank model works now.

```python
import spacy

nlp = spacy.load('en_core_web_lg')
nlp.remove_pipe("ner")
# ner = nlp.create_pipe("ner")
# nlp.replace_pipe("ner", ner)

nlp.to_disk('blank_ner_en_core_web_lg')
```

Unlucky me, though, because then I tried ner.make-gold and ran into:

```
    examples, batch_size=batch_size, drop=dropout, beam_width=beam_width
  File "cython_src\prodigy\models\ner.pyx", line 362, in prodigy.models.ner.EntityRecognizer.batch_train
  File "cython_src\prodigy\models\ner.pyx", line 453, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src\prodigy\models\ner.pyx", line 446, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src\prodigy\models\ner.pyx", line 447, in prodigy.models.ner.EntityRecognizer._update
  File "C:\Users\dimitar.danev\Desktop\PYTHON3.7_Playground\ENV\lib\site-packages\spacy\language.py", line 457, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
  File "transition_system.pyx", line 148, in spacy.syntax.transition_system.TransitionSystem.set_costs
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?
```

Now I am reading https://github.com/explosion/spaCy/issues/3558
and trying to figure out how to export and then clean my dataset/annotations :frowning:

See here for details – you might have to remove a few whitespace spans from your data:
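For anyone hitting the same E024, the cleanup the linked issue suggests can be sketched roughly like this (assuming a Prodigy-style export, e.g. from db-out, with `"text"` and `"spans"` keys; the helper name and the records below are made up for illustration):

```python
def strip_whitespace_spans(examples):
    """Drop entity spans whose text is empty or only whitespace.

    Such spans can trigger spaCy's E024 ("Could not find an optimal
    move to supervise the parser") during NER training.
    """
    cleaned = []
    for eg in examples:
        spans = [
            span for span in eg.get("spans", [])
            if eg["text"][span["start"]:span["end"]].strip()
        ]
        cleaned.append({**eg, "spans": spans})
    return cleaned

# Hypothetical records: the second span covers only newlines.
examples = [
    {"text": "mv yong feng eta monday",
     "spans": [{"start": 3, "end": 12, "label": "VESSEL"}]},
    {"text": "see attached\n\n",
     "spans": [{"start": 12, "end": 14, "label": "VESSEL"}]},
]
cleaned = strip_whitespace_spans(examples)  # second record loses its span
```

After filtering, the cleaned records can be re-imported (e.g. with db-in) and training run again.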

Thank you ines, you are pure gold!
All clear and fixed.


Ines, even though it works, in the sense that there are no errors and I get to do training and annotation, it behaves very strangely.

I start with ner.match, a 23k-row patterns file, and a single label VESSEL.
Some examples of what the patterns look like:

```json
{"label":"VESSEL","pattern":[{"lower":"tanto"},{"lower":"rejeki"}]}
{"label":"VESSEL","pattern":[{"lower":"amrta"},{"lower":"jaya"},{"lower":"i"}]}
{"label":"VESSEL","pattern":[{"lower":"yong"},{"lower":"feng"}]}
{"label":"VESSEL","pattern":[{"lower":"pyramids"}]}
{"label":"VESSEL","pattern":[{"lower":"nile"}]}
{"label":"VESSEL","pattern":[{"lower":"caraka"},{"lower":"jaya"},{"lower":"niaga"},{"lower":"iii-31"}]}
{"label":"VESSEL","pattern":[{"lower":"caraka"},{"lower":"jaya"},{"lower":"niaga"},{"lower":"iii-32"}]}
```
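For readers wondering how these patterns are interpreted: each `{"lower": ...}` entry matches one token of the text, case-insensitively. A naive pure-Python stand-in (not Prodigy's or spaCy's actual matcher, and it assumes simple whitespace tokenization) would look something like:

```python
def pattern_matches(pattern, text):
    """Check whether a patterns.jsonl token pattern matches anywhere in
    `text`. Naive stand-in for spaCy's Matcher: whitespace tokenization,
    and {"lower": ...} means a case-insensitive exact token match."""
    tokens = [t.lower() for t in text.split()]
    wanted = [p["lower"] for p in pattern]
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

pattern = [{"lower": "yong"}, {"lower": "feng"}]
pattern_matches(pattern, "MV YONG FENG eta monday")    # True
pattern_matches(pattern, "yong is not a vessel here")  # False
```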

Then, after about 1000 annotations, I train with the blank_ner_en_core_web_lg model based on en_core_web_lg:

prodigy ner.batch-train vessels_ner_1 blank_ner_en_core_web_lg --output vessels_model_1 --label VESSEL --eval-split 0.30 --n-iter 24 --batch-size 8 --drop 0.3

I get nice accuracy:

```
Correct     72
Incorrect   4
Baseline    0.000
Accuracy    0.947
```
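(For readers following along, the reported accuracy is simply correct / (correct + incorrect):

```python
correct, incorrect = 72, 4
accuracy = correct / (correct + incorrect)
round(accuracy, 3)  # 0.947
```

so it is measured only over the annotated spans in the evaluation split.)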

But then when I start teaching mode:

prodigy ner.teach vessels_ner_1 vessels_model_1 email_10k_with_pos.jsonl --label VESSEL -U

Most of the time it asks me whether a punctuation character is a VESSEL, for example characters like [,@/:], or lines used as horizontal separators in the text, such as ***********************************.

So most of my annotating consists of holding the reject (x) button, for something in the range of 200-300 questions.

It seems very weird to have 94% accuracy but get asked about high-scoring punctuation characters.

It looks like it sees my VESSEL label almost everywhere, in all tokens.

Here is an example of what my displaCy NER output looks like, even though the accuracy of the model is 0.944.

@telemmaite, how did you solve your problem of everything being labeled as VESSEL?

I know this is quite a late answer but I thought I'd chip in for the sake of others reading this...

It looks like you just provided a lot of positive examples of words that are "VESSEL". The way Prodigy's binary ("silver") training data works is that you're only saying whether the single labelled span is or isn't "VESSEL"; you're not saying anything about the other words in the example sentence. So based on what you've told your model, all spans could be classified as "VESSEL" and it would still get a high score. You need some negative examples for your model to be able to discriminate. You can do this either by labelling some "VESSEL" suggestions as "reject" (negative), or by adding some other entity labels to the data (ideally both).
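A toy illustration of that point (this is not how Prodigy actually scores models, just the logic of why all-positive data is a problem): if every annotation is an "accept", a degenerate model that accepts everything agrees with 100% of the data.

```python
# Hypothetical all-positive binary annotations: every suggested span
# was accepted as VESSEL, none were rejected.
annotations = [
    {"span": "yong feng", "label": "VESSEL", "answer": "accept"},
    {"span": "nile",      "label": "VESSEL", "answer": "accept"},
    {"span": "pyramids",  "label": "VESSEL", "answer": "accept"},
]

def degenerate_model(span):
    return "accept"  # claims every span is a VESSEL

agreement = sum(
    degenerate_model(a["span"]) == a["answer"] for a in annotations
) / len(annotations)
# agreement == 1.0: perfect "accuracy" without learning anything.
# Mix in some {"answer": "reject"} examples and this strategy stops
# looking perfect -- that's what negative examples buy you.
```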

Hope that makes sense (and is correct...)
