Prodigy annotations to SpaCy train

Hello Explosion,

After creating a training set (for a new NER entity) using Prodigy, I was playing with training hyperparameters (like dropout) to optimize my model's performance. In the spaCy documentation I found lots of useful advice on how to set those parameters. I also saw that the spaCy CLI has a train command, which expects input in a specific JSON format.

Do you have a script somewhere that creates this format from Prodigy annotations?

I can probably write one easily if it does not exist yet. I just wanted to ask first whether you have one, as Prodigy and spaCy both come from you.

All the best,

Stephan

Here's @ines's answer to a similar question I had:

I haven't implemented her suggested code yet but I'm also interested in this.

Thanks @andy!

The example code I posted in the thread converts the dataset to spaCy’s “simple training style” format, but it should be easy to adjust to make it generate the JSON training data instead. For NER, the main difference here is that you’ll need the entity annotations in BILUO format. spaCy’s gold.biluo_tags_from_offsets function might be helpful here (you might not need to use it directly, but you can take inspiration from the source).
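To illustrate what the BILUO scheme looks like, here is a minimal pure-Python sketch that turns character offsets into one tag per token. It is not spaCy's implementation: the helper names (whitespace_token_spans, biluo_from_offsets), the whitespace tokenizer, and the use of '-' for spans that don't line up with token boundaries are all assumptions made for this example.

```python
def whitespace_token_spans(text):
    """Return (start_char, end_char) for each whitespace-separated token."""
    spans = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        spans.append((start, end))
        pos = end
    return spans


def biluo_from_offsets(token_spans, entities):
    """Convert (start_char, end_char, label) entities to one BILUO tag per token."""
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None or last < first:
            # Offsets don't match token boundaries: flag every overlapping token
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end_char and tok_end > start_char:
                    tags[i] = "-"
            continue
        if first == last:
            tags[first] = "U-" + label  # Unit: single-token entity
        else:
            tags[first] = "B-" + label  # Begin
            for i in range(first + 1, last):
                tags[i] = "I-" + label  # Inside
            tags[last] = "L-" + label  # Last
    return tags


text = "Apple opened an office in New York City"
entities = [(0, 5, "ORG"), (26, 39, "GPE")]
tags = biluo_from_offsets(whitespace_token_spans(text), entities)
print(tags)
```

Every multi-token entity opens with B- and closes with L-, and a single-token entity gets U-, so a well-formed sequence never has a B- tag while another entity is still open.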

Definitely keep us updated on how you go and if you can make it work! A converter recipe like this might be a nice addition for Prodigy.

Hey @ines, thanks for your answer.

If I only want to train an NER model, do I still have to set all the other properties, such as the following?

"dep": string,      # dependency label
"head": int,        # offset of token head relative to token index
"tag": string,      # part-of-speech tag
"orth": string,     # verbatim text of the token

@Stephan It’s probably best to fill those in, because I suspect some of the conversion functions might not handle missing values properly. Just set the head to 0 and the dep and tag to the empty string.
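A minimal sketch of that suggestion, assuming the document/paragraphs/sentences/tokens nesting of the JSON training format. The helper name ner_only_tokens and the sample sentence are made up for illustration, and whether every conversion function accepts these placeholder values is not verified here.

```python
import json


def ner_only_tokens(words, ner_tags):
    """Build token dicts with NER tags and placeholder syntax fields."""
    if len(words) != len(ner_tags):
        raise ValueError("need exactly one NER tag per word")
    return [
        {
            "id": i,
            "orth": word,
            "tag": "",   # placeholder: no part-of-speech tag
            "head": 0,   # placeholder: offset 0, so the token is its own head
            "dep": "",   # placeholder: no dependency label
            "ner": ner,
        }
        for i, (word, ner) in enumerate(zip(words, ner_tags))
    ]


words = ["Apple", "opened", "an", "office"]
tokens = ner_only_tokens(words, ["U-ORG", "O", "O", "O"])
training_doc = {
    "id": 0,
    "paragraphs": [{"raw": " ".join(words), "sentences": [{"tokens": tokens}]}],
}
print(json.dumps([training_doc], indent=2))
```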

(Let me know if I should switch this discussion to GitHub issues.) I’m following your advice, @honnibal, to set the dependency and POS tags to “” or 0 so I can use the full spaCy JSON format to train an NER system. When I run spacy train with --no-tagger --no-parser, I get this error:

Traceback (most recent call last):
  File "/Users/ahalterman/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/ahalterman/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/spacy/__main__.py", line 31, in <module>
    plac.call(commands[command])
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/spacy/cli/train.py", line 130, in train
    scorer = nlp_loaded.evaluate(dev_docs)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/spacy/language.py", line 472, in evaluate
    scorer.score(doc, gold, verbose=verbose)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/spacy/scorer.py", line 91, in score
    for annot in gold.orig_annot]))
  File "gold.pyx", line 31, in spacy.gold.tags_to_entities
AssertionError: ['B-ORG', 'I-ORG', 'L-ORG', 'O', 'U-GPE', 'O', 'B-DATE', 'L-DATE', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL', 'O', 'O', 'O', 'U-ORDINAL', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'L-DATE', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-GPE', 'L-GPE', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'L-DATE', 'O', 'B-GPE', 'L-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-DATE', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'O', 'O', 'O', 'O', 'U-NORP', 'O', 'O', 'O', 'O', 'U-GPE', 'O', 'U-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-NORP', 'O', 'O', 'U-DATE', 'O', 'O', 'O', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'I-DATE', 'L-DATE', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'L-ORG']

Ah, damn. The formats on this always confound me. There are a number of functions taking and transforming the gold annotations, and I’m not 100% sure they’re all consistent in how missing values must be represented. Try None or "-"? Not ideal, I know…

Great! Will try. I just crossed the streams between here and GitHub. Sorry about that.

I tried a few things, no success:

  • None gives an AttributeError: 'NoneType' object has no attribute 'lower'
  • - gives the same error as before
  • using fake labels (nsubj and NN) also gives the original error.

Hopefully the model isn’t even looking at POS tags and dependencies when it gets those flags. I’m wondering if it might be a problem with the format of the NER tags, in part because the error message seems to point that way. I get the error on different input texts, but it could be something systemic in the way I’m producing the data.

Looking at the AssertionError above, it seems like the main issue here is that the data contains an invalid sequence (see the last few labels). So maybe something went wrong during the conversion?

'B-ORG', 'I-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'L-ORG'
                              ^ start of new entity  within open entity
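A quick way to find sequences like this before training is to walk the tags and track whether an entity is currently open. The following is a small sketch written for this thread, not spaCy's own check; the function name first_invalid_biluo is hypothetical.

```python
def first_invalid_biluo(tags):
    """Return (index, reason) for the first BILUO violation, or None if valid."""
    open_label = None  # label of the entity currently open, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("O", "B", "U"):
            if open_label is not None:
                return i, "{} while {} entity is still open".format(tag, open_label)
            if prefix == "B":
                open_label = label
        elif prefix in ("I", "L"):
            if open_label is None:
                return i, "{} without a preceding B- tag".format(tag)
            if label != open_label:
                return i, "{} inside {} entity".format(tag, open_label)
            if prefix == "L":
                open_label = None
        else:
            # Unknown prefix, including a bare '-' placeholder
            return i, "unrecognised tag {!r}".format(tag)
    if open_label is not None:
        return len(tags) - 1, "{} entity never closed".format(open_label)
    return None


bad = ['B-ORG', 'I-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'L-ORG']
good = ['B-ORG', 'I-ORG', 'L-ORG', 'O', 'U-GPE']
print(first_invalid_biluo(bad))
print(first_invalid_biluo(good))
```

Running a check like this over each converted example shows which documents need to be fixed or dropped before they reach spacy train.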

Btw, quick update since it’s related to the topic: The upcoming version will also include a recipe ner.gold-to-spacy that lets you convert datasets to spaCy training data, both in “simple training style” and BILUO format. (There’ll also be a pos.gold-to-spacy recipe to convert part-of-speech tags annotated with the new pos.make-gold recipe.)

Here’s one solution, working for my purposes.

import json

import spacy
from prodigy.components.db import connect
from prodigy.util import split_evals
from spacy.gold import GoldCorpus, minibatch, biluo_tags_from_offsets, tags_to_entities


def prodigy_to_spacy(nlp, dataset):
    """Create spaCy JSON training data from a Prodigy dataset.

    See https://spacy.io/api/annotation#json-input.
    """
    db = connect()
    examples = db.get_dataset(dataset)

    offsets = []
    for eg in examples:
        if eg['answer'] == 'accept':
            entities = [(span['start'], span['end'], span['label'])
                        for span in eg['spans']]
            offsets.append((eg['text'], {'entities': entities}))

    docs = docs_from_offsets(nlp, offsets)
    trees = docs_to_trees(docs)
    return trees


def docs_from_offsets(nlp, gold):
    """Create a sequence of Docs from a sequence of text, entity-offsets pairs."""
    docs = []
    for text, entities in gold:
        doc = nlp(text)
        entities = entities['entities']
        # One BILUO tag per token, computed from the character offsets
        tags = biluo_tags_from_offsets(doc, entities)
        if entities:
            for start, end, label in entities:
                # char_span returns None if the offsets don't align with token boundaries
                span = doc.char_span(start, end, label=label)
                if span:
                    doc.ents = list(doc.ents) + [span]
        if doc.ents:  # remove to return documents without entities too
            docs.append((doc, tags))
    return docs


def docs_to_trees(docs):
    """Create spaCy JSON training data from a sequence of Docs."""
    doc_trees = []
    for d, doc_tuple in enumerate(docs):
        doc, tags = doc_tuple
        try:
            # Skip documents whose tag sequence is not valid BILUO
            tags_to_entities(tags)
        except AssertionError:
            print('Dropping {}'.format(d))
            continue
        if not tags:
            print('Dropping {}'.format(d))
            continue
        sentences = []
        for s in doc.sents:
            s_tokens = []
            for t in s:
                token_data = {
                    'id': t.i,
                    'orth': t.orth_,
                    'tag': t.tag_,
                    'head': t.head.i - t.i,
                    'dep': t.dep_,
                    'ner': tags[t.i],
                }
                s_tokens.append(token_data)
            sentences.append({'tokens': s_tokens})
        doc_trees.append({
            'id': d,
            'paragraphs': [
                {
                    'raw': doc.text,
                    'sentences': sentences,
                }
            ]
        })
    return doc_trees

nlp = spacy.load('en_core_web_sm')
doc_trees = prodigy_to_spacy(nlp, 'test-dataset')

train, dev, _ = split_evals(doc_trees, .2)
with open('train.json', 'wt') as f:
    json.dump(train, f)
with open('dev.json', 'wt') as f:
    json.dump(dev, f)

Confirm these are good inputs:

corpus = GoldCorpus('train.json', 'dev.json')
optimizer = nlp.begin_training(lambda: corpus.train_tuples)
train_docs = corpus.train_docs(nlp, projectivize=True)
train_docs = list(train_docs)
with nlp.disable_pipes('tagger', 'parser'):
    losses = {}
    for batch in minibatch(train_docs, size=16):
        docs, golds = zip(*batch)
        nlp.update(docs, golds, drop=.2, sgd=optimizer, losses=losses)

:stars:


Hey @jamesdunham,

Thanks a lot for your code, it is really helpful to me. However, I get an error when running nlp.update. Did you ever run into the following error?

(py3) ~/projects/tripler/data-analysis (master): python stuff.py
Traceback (most recent call last):
  File "stuff.py", line 15, in <module>
    nlp.update(docs, golds, drop=.2, sgd=optimizer, losses=losses)
  File "/Users/dedan/.virtualenvs/py3/lib/python3.6/site-packages/spacy/language.py", line 407, in update
    proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
  File "nn_parser.pyx", line 587, in spacy.syntax.nn_parser.Parser.update
  File "/Users/dedan/.virtualenvs/py3/lib/python3.6/site-packages/thinc/api.py", line 67, in continue_update
    gradient = callback(gradient, sgd)
  File "/Users/dedan/.virtualenvs/py3/lib/python3.6/site-packages/thinc/neural/_classes/affine.py", line 58, in finish_update
    self.d_W += self.ops.batch_outer(grad__BO, input__BI)
ValueError: operands could not be broadcast together with shapes (73,200) (77,200) (73,200)

Great to hear, @Stephan. I’m a rookie at reading spaCy tracebacks, but are you trying to train the parser in addition to the NER component? I see a call to spacy.syntax.nn_parser.Parser.update. In my demo I disabled every pipe in en_core_web_sm except ner with the disable_pipes() context manager. That might be the reason for the different result.

The call to the parser update is a good point, thanks for pointing it out. However, I used the same code as you did and also disabled the tagger and parser pipes, so I’m surprised that there is a call to the parser.

I will further investigate, thanks for the hint.