Labeling a sequence labeling task (e.g. NER) from scratch

Hi, is there support for training sequence labeling tasks from scratch? I hacked around and I have the following workflow:

  1. Ran this script after changing the seed data so it has custom categories (e.g. FOOD, ANIMAL)
  2. Saved the model from the above script to SAVED_MODEL_DIR
  3. Changed meta.json in SAVED_MODEL_DIR so it has a name and added tagger and parser to the pipeline (needed for sentence segmentation)
  4. Copied over the missing parser and tagger folders from another directory created with nlp.to_disk to SAVED_MODEL_DIR
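
Step 3 could also be scripted rather than done by hand. This is only a minimal sketch using the standard library. It assumes meta.json keeps the component names in a top-level "pipeline" list, and the helper name patch_meta and the demo values are made up for illustration:

```python
import json
import tempfile
from pathlib import Path


def patch_meta(model_dir, name, extra_pipes):
    """Set the model name and add missing component names to meta.json.

    Assumption: meta.json keeps component names in a top-level "pipeline" list.
    """
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta["name"] = name
    existing = meta.get("pipeline", [])
    # New names go first to mirror the usual tagger, parser, ner order;
    # adjust this if your model needs a different order.
    meta["pipeline"] = [p for p in extra_pipes if p not in existing] + existing
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Demo on a throwaway directory standing in for SAVED_MODEL_DIR.
demo_dir = tempfile.mkdtemp()
demo_meta = {"lang": "en", "pipeline": ["ner"]}
(Path(demo_dir) / "meta.json").write_text(json.dumps(demo_meta), encoding="utf8")

patched = patch_meta(demo_dir, "food_animal_ner", ["tagger", "parser"])
print(patched["pipeline"])
```

This only touches meta.json. The tagger and parser folders from step 4 still have to be copied into the model directory separately.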

Then ran Prodigy:

prodigy ner.teach NAME_OF_DATASET SAVED_MODEL_DIR FILENAME.txt

Thanks

The NER recipe loads in a spaCy model, which it assumes has all of the entity types already listed. spaCy 1 supported adding an entity type to an existing NER model. I haven’t restored support for this in spaCy 2 yet (but it’ll definitely be added).

So, you need to create a spaCy model directory with an NER parser in the pipeline, that has the entities you want. The easiest way to do this is to set up an nlp instance with the pipeline you want, and then call nlp.to_disk():


import spacy
from spacy.pipeline import TokenVectorEncoder
from spacy.pipeline import NeuralEntityRecognizer

nlp = spacy.blank('en')
nlp.pipeline.append(TokenVectorEncoder(nlp.vocab))
nlp.pipeline.append(NeuralEntityRecognizer(nlp.vocab))
nlp.pipeline[-1].add_label('FOOD')

optimizer = nlp.begin_training(lambda: [])  # The API here is admittedly a bit inconvenient
output_path = 'SAVED_MODEL_DIR'  # placeholder: directory to save the model to
nlp.to_disk(output_path)

I’m just updating the spaCy train_new_entity_type example for spaCy 2 as well, to give the complete script.

Thanks! Looks good.

Would it be possible to add multiple sequence models (not necessarily NER)?

It would be awesome to have a .cats attribute on a Token, similar to the one on a Doc.

Hi, I tried creating a blank NER model (with the FOOD label).

import spacy
from prodigy.util import export_model_data


def _get_blank_ner_model():
    from spacy.pipeline import TokenVectorEncoder
    from spacy.pipeline import NeuralEntityRecognizer
    from spacy.pipeline import SentenceSegmenter
    from spacy.pipeline import NeuralDependencyParser

    nlp = spacy.blank('en')

    nlp.pipeline.append(TokenVectorEncoder(nlp.vocab))
    nlp.pipeline.append(NeuralEntityRecognizer(nlp.vocab))
    nlp.pipeline[-1].add_label('FOOD')
    nlp.pipeline.append(SentenceSegmenter(nlp.vocab))
    optimizer = nlp.begin_training(lambda: []) # The API here is admittedly a bit inconvenient
    nlp.meta['name'] = 'some arbitrary name'
    return nlp  # hand the configured pipeline back to the caller

nlp = _get_blank_ner_model()

export_model_data('out', nlp, [], [])

Then I ran the ner.batch-train command on the blank NER model.

prodigy ner.batch-train DATASET_NAME out -o model_out -l FOOD

I get an error from the sentence segmentation (same error happens when I replace SentenceSegmenter with NeuralDependencyParser).

Loaded model here
Using 20% of examples (355) for evaluation
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/prodigy/__main__.py", line 235, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 129, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 233, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 7, in split_sentences
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/language.py", line 466, in pipe
    for doc, context in izip(docs, contexts):
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/language.py", line 479, in pipe
    for doc in docs:
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/language.py", line 588, in _pipe
    func(doc)
TypeError: 'str' object is not callable
Exception ignored in: <generator object at 0x1141527b8>
Traceback (most recent call last):
  File "cython_src/prodigy/components/preprocess.pyx", line 6, in genexpr
AttributeError: 'weakref' object has no attribute 'cline_in_traceback'

When I comment out the split_sentences in the recipe, the command seems to work.

    # examples = list(split_sentences(model.orig_nlp, examples))
    # evals = list(split_sentences(model.orig_nlp, evals))

Thanks

I’m also trying to create an empty model from scratch to use in Prodigy, but I’m getting the following error:
TypeError: 'TokenVectorEncoder' object is not iterable
spaCy version 2.0.0a17
Python version 3.5.2

prodigy-0.4.0-cp35.cp36-cp35m.cp36m-linux_x86_64.whl

This looks like the model you’re using may be out of date / incompatible with your spaCy version (the latest alpha models don’t have a Tensorizer pipeline component anymore). In the latest spaCy alpha version, you can now check this using this handy command:

spacy validate

If there’s an outdated package or link, it will highlight it and show you the download command.

Sorry about the frequent model updates and compatibility issues btw :sweat: The models have been improving very quickly over the past few weeks, and we wanted to share them asap. This is an unfortunate side-effect of spaCy 2.0 still being in alpha and will stabilise very soon.

Thanks, but spacy validate did not show any errors.
I managed to fix it by downgrading spaCy to 2.0.0a16. That way I was able to create a new model, but when I tried to use it I got the same error as @plusepsilon.

Hmm, interesting… Could you post more of the stack trace from the "TypeError: 'TokenVectorEncoder' object is not iterable" error and the command you used?

As annoying as the error is, it’s a pretty simple issue and should be easy to fix, as it’s mostly related to the new pipeline structure in spaCy.

Here you go:

Creating initial model en
Traceback (most recent call last):
  File "train_new_entity_type.py", line 125, in <module>
    plac.call(main)
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "train_new_entity_type.py", line 106, in main
    train_ner(nlp, train_data, output_directory)
  File "train_new_entity_type.py", line 53, in train_ner
    optimizer = nlp.begin_training(lambda: [])
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/spacy/language.py", line 410, in begin_training
    for name, proc in self.pipeline:
TypeError: 'TokenVectorEncoder' object is not iterable

Used https://github.com/explosion/spaCy/blob/v2.0.0a17/examples/training/train_new_entity_type.py without any changes. On spaCy version v2.0.0a17.

Ah, sorry – I didn't realise you were using the train_new_entity_type.py example. This issue actually just came up on the spaCy issue tracker, as the example needs a small fix to work with the latest alpha version. I haven't had time to test the fix yet, but simply removing one line from the example and not appending the TokenVectorEncoder should work. See here for details:

Yeah I found that issue too, but it doesn’t fix it for me.
Now I get the same error but for NeuralEntityRecognizer.
It rather seems that the code tries to loop through self.pipeline in a wrong way.

Ah, this makes a lot of sense actually. Since the pipeline architecture has changed and nlp.pipeline entries are now (name, func) tuples, these lines also have to be adjusted to use spaCy’s new add_pipe method:

ner = NeuralEntityRecognizer(nlp.vocab)
ner.add_label('ANIMAL')
nlp.add_pipe(ner)

Untested at the moment so I can’t guarantee that this was the only fix needed – will verify this asap and we’ll adjust the spaCy example accordingly (also just updated my comment on the spaCy issue thread).
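
To make the cause of the TypeError a bit more concrete, here is a toy sketch that runs without spaCy. ToyComponent and component_names are made-up stand-ins. The only thing taken from the traceback is the shape of the loop, `for name, proc in self.pipeline`:

```python
class ToyComponent:
    """Stand-in for a pipeline component such as an entity recognizer."""

    def __call__(self, doc):
        return doc


def component_names(pipeline):
    # Same shape as the loop in the traceback:
    #     for name, proc in self.pipeline
    return [name for name, proc in pipeline]


old_style = [ToyComponent()]           # bare component appended directly
new_style = [("ner", ToyComponent())]  # (name, func) tuple

try:
    component_names(old_style)
    old_style_error = None
except TypeError as err:
    # A bare component cannot be unpacked into (name, proc).
    old_style_error = err

names = component_names(new_style)
print(type(old_style_error).__name__, names)
```

The exact wording of the TypeError depends on the Python version, but the cause is the same: the loop expects pairs, and a component appended via nlp.pipeline.append is not a pair.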

I’ve got the model now, but there’s a new error when loading it in Prodigy:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/prodigy/__main__.py", line 235, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 143, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/util.pyx", line 173, in prodigy.util.suggest_view_id
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "cython_src/prodigy/components/sorters.pyx", line 127, in __iter__
  File "cython_src/prodigy/components/sorters.pyx", line 53, in genexpr
  File "cython_src/prodigy/models/ner.pyx", line 215, in __call__
  File "cython_src/prodigy/models/ner.pyx", line 185, in get_tasks
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
  File "cython_src/prodigy/models/ner.pyx", line 151, in predict_spans
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
  File "cython_src/prodigy/components/preprocess.pyx", line 8, in split_sentences
  File "doc.pyx", line 502, in __get__
ValueError: sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://alpha.spacy.io/usage/models

(using the model in spaCy works)

Ah, so in the pre-processing and split_sentences logic, Prodigy currently assumes that doc.sents is available, which is only the case if a dependency parse is set, e.g. if the model has a parser. This is probably bad and should be fixed.
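
As a rough illustration of the kind of guard that would avoid this error, one could check doc.is_parsed before touching doc.sents. Note that sentences_or_whole_doc is a hypothetical helper and not Prodigy’s actual split_sentences, and the toy objects below only stand in for spaCy Docs so the sketch runs without a model:

```python
from types import SimpleNamespace


def sentences_or_whole_doc(doc):
    """Yield sentences if the doc has a parse, otherwise the doc itself.

    Hypothetical helper; relies only on `doc.is_parsed` and `doc.sents`.
    """
    if getattr(doc, "is_parsed", False):
        for sent in doc.sents:
            yield sent
    else:
        # No dependency parse: fall back to treating the text as one unit.
        yield doc


# Toy stand-ins for spaCy Doc objects.
parsed = SimpleNamespace(is_parsed=True, sents=["First sentence.", "Second one."])
unparsed = SimpleNamespace(is_parsed=False)

parsed_units = list(sentences_or_whole_doc(parsed))
unparsed_units = list(sentences_or_whole_doc(unparsed))
print(len(parsed_units), len(unparsed_units))
```

Whether falling back to the whole text is acceptable depends on how long the input texts are, so treat this as a sketch of the idea rather than a recommended fix.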

In the meantime, the easiest way to work around this would be to start off with a model that already has a parser and replace the NER component. The train_new_entity_type.py example starts entirely from scratch to better illustrate how all things fit together. But it doesn’t mean you have to start completely from scratch and with a blank model – for example, you could start with spaCy’s en_core_web_sm model and just replace the NER component with your own, using the new replace_pipe() method:

import spacy
from spacy.pipeline import NeuralEntityRecognizer

nlp = spacy.load('en_core_web_sm') # start with small English model
blank_ner = NeuralEntityRecognizer(nlp.vocab)  # create a new entity recognizer
blank_ner.add_label('ANIMAL')  # add your label
nlp.replace_pipe('ner', blank_ner)  #  replace the model's entity recognizer with the new one

# ... train the model

Sadly that breaks the training :frowning_face:

Traceback (most recent call last):
  File "/home/jeroen/spaCy/examples/training/train_new_entity_type.py", line 123, in <module>
    plac.call(main)
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/jeroen/spaCy/examples/training/train_new_entity_type.py", line 105, in main
    train_ner(nlp, train_data, output_directory)
  File "/home/jeroen/spaCy/examples/training/train_new_entity_type.py", line 53, in train_ner
    optimizer = nlp.begin_training(lambda: [])
  File "/home/jeroen/prodigy/lib/python3.5/site-packages/spacy/language.py", line 413, in begin_training
    pipeline=self.pipeline)
  File "pipeline.pyx", line 426, in spacy.pipeline.NeuralTagger.begin_training
  File "morphology.pyx", line 63, in spacy.morphology.Morphology.__init__
  File "morphology.pyx", line 129, in spacy.morphology.Morphology.add_special_case
KeyError: 13656873538139661788

– Edit: also breaks when you load the original NER pipeline, add the entities, replace the pipeline and then try to train on that

Fair enough – didn’t consider that it’s now also trying to train the tagger and parser, since they’re also in the pipeline. We’ll definitely think about the “official” recommended solutions for these problems.

In the meantime, I just played around with the example and my solution isn’t perfect and kinda hacky – but it worked. The idea is: before training, you remove the tagger and parser and keep them, and after training and just before saving the model, you add them back.

import spacy
from spacy.pipeline import NeuralEntityRecognizer

nlp = spacy.load('en_core_web_sm')
ner = NeuralEntityRecognizer(nlp.vocab)
nlp.add_pipe(ner)

_, tagger = nlp.remove_pipe('tagger')  # remember that these return (name, func) tuples
_, parser = nlp.remove_pipe('parser')

# ... the training loop
# I changed it to return nlp instead of saving to disk within train_ner()

nlp.add_pipe(tagger, before='ner')
nlp.add_pipe(parser, before='ner')
print(nlp.pipe_names)  # check that pipeline components are all there
nlp.to_disk(output_dir)  # save model

nlp_test = spacy.load(output_dir)  # load model back for testing
doc = nlp_test('Do you like horses?')
assert doc.is_parsed  # check that document was parsed
print([(ent.label_, ent.text) for ent in doc.ents])  # make sure entities are recognized
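
If you end up doing this remove-and-restore dance more than once, it could be wrapped in a small context manager. This is just a sketch: without_pipes is a made-up helper, ToyNLP is a minimal stand-in so the example runs without spaCy, and passing name= to add_pipe is an assumption about the spaCy version you have installed:

```python
from contextlib import contextmanager


@contextmanager
def without_pipes(nlp, names, before):
    """Temporarily remove the named components and re-add them on exit."""
    removed = [nlp.remove_pipe(name) for name in names]  # (name, func) tuples
    try:
        yield nlp
    finally:
        # Re-adding in the original order, each one before `before`,
        # restores e.g. tagger, parser, ner.
        for name, component in removed:
            nlp.add_pipe(component, name=name, before=before)


class ToyNLP:
    """Minimal stand-in mimicking only remove_pipe/add_pipe/pipe_names."""

    def __init__(self, pipeline):
        self.pipeline = list(pipeline)

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def remove_pipe(self, name):
        return self.pipeline.pop(self.pipe_names.index(name))

    def add_pipe(self, component, name=None, before=None):
        self.pipeline.insert(self.pipe_names.index(before), (name, component))


toy = ToyNLP([("tagger", len), ("parser", len), ("ner", len)])
with without_pipes(toy, ["tagger", "parser"], before="ner") as trimmed:
    names_during = trimmed.pipe_names  # the training loop would go here
names_after = toy.pipe_names
print(names_during, names_after)
```

The try/finally means the tagger and parser are put back even if the training loop raises, so a failed run does not leave you with a half-assembled pipeline.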

Thanks! Getting somewhere now. Model is loaded in Prodigy and it’s showing the new entities :smile: