Best format for a synthetic NER corpus

I am training an NER model to detect birthdays. Given the text

Napoleon was born on August 15, 1769.
He became emperor on May 18, 1804.
He died on May 5, 1821.

It should mark only the span “August 15, 1769” as BIRTHDAY.

I am generating a synthetic corpus, so I have many examples of text containing dates, some of which are birthdays and some of which aren’t. I know the character spans of the dates for both the positive and negative examples. How should I train the model?

I’m not sure if I should use Prodigy or spaCy. (I don’t need to play with hyperparameters right away, so Prodigy is fine if that’s easier.) I figure I’ll use either the prodigy ner.batch-train or spacy train commands. I assume it would be good to incorporate both positive and negative examples. I also assume if I choose a format that requires that I put in a confidence score, I should say 1.0 because I’m entirely confident of my gold examples.

Can you give an example of what the JSON format for these three examples would be, and which tool I should use to train it?

Additional question: to avoid the catastrophic forgetting problem, I plan on running my generated text through the standard spaCy English model and adding the named entity spans that it finds. It will label my birthday spans as DATE entities. Can I leave the DATE span annotations in for mentions that I also want to label BIRTHDAY, or do I have to remove them? I don’t know what your span collision logic is.

If I use Prodigy to annotate the above text I get this when I do a db-out:

{"text":"Napoleon was born on August 15, 1769.","_input_hash":-1642365362,"_task_hash":1054542698,"spans":[{"text":"August 15, 1769","start":21,"end":36,"label":"BIRTHDAY","priority":0.7142857313,"score":0.7142857313,"pattern":0,"answer":"accept"}],"meta":{"pattern":0,"score":0.7142857313},"answer":"accept"}
{"text":"He became emperor on May 18, 1804.","_input_hash":601536104,"_task_hash":-730977706,"spans":[{"text":"May 18, 1804","start":21,"end":33,"label":"BIRTHDAY","priority":0.7142857313,"score":0.7142857313,"pattern":0,"answer":"reject"}],"meta":{"pattern":0,"score":0.7142857313},"answer":"reject"}
{"text":"He died on May 5, 1821.","_input_hash":-407401634,"_task_hash":-1646523735,"spans":[{"text":"May 5, 1821","start":11,"end":22,"label":"BIRTHDAY","priority":0.7142857313,"score":0.7142857313,"pattern":0,"answer":"reject"}],"meta":{"pattern":0,"score":0.7142857313},"answer":"reject"}

So I figure if my synthetic data set looks like this I can train from it the same as if I had manually annotated. However, a lot of the fields above have to do with the annotation process and not the span annotations themselves. When I strip this data down to the bare minimum I get this.

{"text":"Napoleon was born on August 15, 1769.","spans":[{"text":"August 15, 1769","start":21,"end":36,"label":"BIRTHDAY","answer":"accept"}]}
{"text":"He became emperor on May 18, 1804.", "spans":[{"text":"May 18, 1804","start":21,"end":33,"label":"BIRTHDAY","answer":"reject"}]}
{"text":"He died on May 5, 1821.", "spans":[{"text":"May 5, 1821","start":11,"end":22,"label":"BIRTHDAY","answer":"reject"}]}

If I do a db-in on that stripped-down data I can run ner.batch-train on it, so I suppose it’s in the right format. (Though I can’t be sure unless I use a bigger set from which the model could actually learn.)
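
For a synthetic corpus, records in the stripped-down shape shown above could be written out along these lines. This is only a minimal sketch: make_record and its arguments are hypothetical names, and the set of fields simply mirrors the stripped records in this post. It does not settle whether a score or an outermost answer is also needed.

```python
import json


def make_record(text, start, end, is_birthday, label="BIRTHDAY"):
    """Build one stripped-down record for a single date span (hypothetical helper)."""
    if not (0 <= start < end <= len(text)):
        raise ValueError("span offsets fall outside the text")
    span = {
        "text": text[start:end],
        "start": start,
        "end": end,
        "label": label,
        # Positive examples are accepted, negative examples are rejected.
        "answer": "accept" if is_birthday else "reject",
    }
    return {"text": text, "spans": [span]}


# (text, start, end, is_birthday) tuples, as a synthetic generator might emit them.
examples = [
    ("Napoleon was born on August 15, 1769.", 21, 36, True),
    ("He became emperor on May 18, 1804.", 21, 33, False),
    ("He died on May 5, 1821.", 11, 22, False),
]

records = [make_record(*example) for example in examples]

# One JSON object per line, i.e. the JSONL layout shown in this post.
jsonl = "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
print(jsonl)
```

Slicing the span text out of the sentence, rather than passing it in separately, keeps the offsets and the text from drifting apart.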

  • Is this the right way to go about this?
  • Am I stripping out too much information? E.g. should I be including a score of 1.0? Do I need that outermost accept/reject in addition to the ones on the individual spans?
  • Are negative examples helpful for the NER model? Should I bother including the rejects?

I could also write my own training code following the example in “Training an additional entity type”, but I suspect that algorithm is already implemented somewhere in the spacy command line tool.

You might first try pre-processing the text so that each DATE entity is merged into a single token, with tag=DATE. I think this will probably make the task easier for the model, because the distance between a verb like born and the date itself will be shorter.

You might also try a rule-based approach, using the dependency parse. You might actually find that this works better than the classifier, at least at first.
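
To get a feel for how far simple rules go, here is a much cruder stand-in for the dependency-parse rule: a trigger-word check on the text just before the date. It does not use the parse at all, and the pattern, the window size and the function name are all assumptions made for illustration. A parse-based rule would instead check that the date attaches to a verb like born.

```python
import re

# Hypothetical trigger pattern. A dependency-based rule would look at the
# syntactic head of the date instead of the raw preceding characters.
BIRTH_TRIGGER = re.compile(r"\bborn\b", re.IGNORECASE)


def looks_like_birthday(text, start, window=30):
    """Return True if a birth trigger occurs in the `window` characters before the date."""
    context = text[max(0, start - window):start]
    return bool(BIRTH_TRIGGER.search(context))


# (text, start offset of the date) pairs from the example in this thread.
examples = [
    ("Napoleon was born on August 15, 1769.", 21),
    ("He became emperor on May 18, 1804.", 21),
    ("He died on May 5, 1821.", 11),
]

results = [looks_like_birthday(text, start) for text, start in examples]
for (text, _), is_birthday in zip(examples, results):
    print(is_birthday, text)
```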

I can see how to use Doc.merge to merge tokens, but I don’t know how to reflect that in the training data. Do I include a sample that annotates the same span as both DATE and BIRTHDAY, like this?

{"text":"Napoleon was born on August 15, 1769.","spans":[{"text":"August 15, 1769","start":21,"end":36,"label":"DATE","answer":"accept"},{"text":"August 15, 1769","start":21,"end":36,"label":"BIRTHDAY","answer":"accept"}]}

I think that won’t work, because from my experimenting overlapping spans are not allowed. (Which makes sense.)
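
Since overlapping spans do not seem to be accepted, one option when mixing in entities found by the standard English model is to drop any model span that collides with a synthetic BIRTHDAY span. The sketch below shows that policy only. It is an assumption, not a description of Prodigy's or spaCy's own collision logic, and the function names and the PERSON span are made up for illustration rather than taken from real model output.

```python
def overlaps(a, b):
    """True if two spans with character offsets share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def add_model_spans(gold_spans, model_spans):
    """Keep every gold span, plus only the model spans that collide with none of them."""
    kept = [
        model_span
        for model_span in model_spans
        if not any(overlaps(model_span, gold_span) for gold_span in gold_spans)
    ]
    return sorted(gold_spans + kept, key=lambda span: span["start"])


# Synthetic gold annotation for "Napoleon was born on August 15, 1769."
gold = [{"start": 21, "end": 36, "label": "BIRTHDAY"}]

# Illustrative spans of the kind a pretrained model might add.
model = [
    {"start": 0, "end": 8, "label": "PERSON"},
    {"start": 21, "end": 36, "label": "DATE"},
]

merged = add_model_spans(gold, model)
print([span["label"] for span in merged])
```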

You can store information in the records in any way that’s convenient to you. For instance you might have a key tokens that gives you the start and end offsets of the tokens, so you can split the string up according to it. Alternatively you could have a key that just tells you the multi-word tokens, so you can merge them.
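
As a rough illustration of the first option, a record could carry the token boundaries alongside the text and spans. The tokens key and its list-of-offsets layout are hypothetical here, chosen only for the sketch. Nothing in this thread says that any existing training command reads such a key.

```python
record = {
    "text": "Napoleon was born on August 15, 1769.",
    # Hypothetical key: [start, end] character offsets of every token,
    # with the whole date kept as a single token.
    "tokens": [[0, 8], [9, 12], [13, 17], [18, 20], [21, 36], [36, 37]],
    "spans": [
        {"text": "August 15, 1769", "start": 21, "end": 36,
         "label": "BIRTHDAY", "answer": "accept"}
    ],
}


def split_by_offsets(text, offsets):
    """Cut the text into token strings according to stored character offsets."""
    return [text[start:end] for start, end in offsets]


words = split_by_offsets(record["text"], record["tokens"])
print(words)
```

With the date stored as one token, the BIRTHDAY span lines up with exactly one entry in the list.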

Then you would have a spaCy pipeline component that performs the necessary merges before the Doc is passed to the NER model.
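
A minimal sketch of the merging step, assuming spaCy 2.1 or later, where doc.retokenize() is the replacement for Doc.merge. The function name and the offsets argument are hypothetical. A blank English pipeline stands in for the real one, so only the tokenizer runs. How the function is registered as a pipeline component differs between spaCy versions, so the sketch just calls it on a Doc directly.

```python
import spacy


def merge_multiword_tokens(doc, offsets):
    """Merge each (start, end) character range in `offsets` into a single token.

    Assumes the ranges do not overlap one another.
    """
    # Resolve all spans before retokenizing, while token indices are still stable.
    spans = [doc.char_span(start, end) for start, end in offsets]
    with doc.retokenize() as retokenizer:
        for span in spans:
            # char_span returns None when the offsets do not fall on token boundaries.
            if span is not None:
                retokenizer.merge(span)
    return doc


nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp.make_doc("Napoleon was born on August 15, 1769.")
doc = merge_multiword_tokens(doc, [(21, 36)])
print([token.text for token in doc])
```

In a real pipeline the offsets would have to come from somewhere at prediction time as well, for example from DATE entities found by an earlier component.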

Ok. I misunderstood. I was thinking that the token merging could be stored in the training records in a format that was already handled by a spaCy training component. But actually I have to write this preprocessing component to merge tokens myself. (Which shouldn't be hard.) For some reason your use of the term "pre-processing" didn't register with me. :)

I understand now. This makes sense.