I am training an NER model to detect birthdays. Given the text
Napoleon was born on August 15, 1769.
He became emperor on May 18, 1804.
He died on May 5, 1821.
It should mark only the span “August 15, 1769” as BIRTHDAY.
I am generating a synthetic corpus, so I have many examples of text containing dates, some of which are birthdays and some of which are not. I know the character spans of the dates in both the positive and the negative examples. How should I train the model?
I’m not sure whether I should use Prodigy or spaCy. (I don’t need to play with hyperparameters right away, so Prodigy is fine if that’s easier.) I figure I’ll use either the prodigy ner.batch-train or the spacy train command. I assume it would be good to incorporate both positive and negative examples. I also assume that if I choose a format that requires a confidence score, I should use 1.0 because I’m entirely confident of my gold examples.
Can you give an example of what the JSON format for these three examples would be, and which tool I should use to train it?
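To make the question concrete, here is my current guess at the data, as a minimal sketch. It assumes a Prodigy-style JSONL task with "text", "spans" (each with "start", "end", "label") and an "answer" of "accept" or "reject". I have not confirmed that this is what either training command expects for gold negatives, and the EXAMPLES list and to_task helper are just my own illustrative names.

```python
import json

# Hypothetical in-memory form of my synthetic corpus:
# (text, date string, whether that date is a birthday).
EXAMPLES = [
    ("Napoleon was born on August 15, 1769.", "August 15, 1769", True),
    ("He became emperor on May 18, 1804.", "May 18, 1804", False),
    ("He died on May 5, 1821.", "May 5, 1821", False),
]


def to_task(text, date, is_birthday):
    """Build one task dict; to_task is my own helper, not a Prodigy API."""
    start = text.index(date)   # character offset where the date begins
    end = start + len(date)    # exclusive end offset
    return {
        "text": text,
        "spans": [{"start": start, "end": end, "label": "BIRTHDAY"}],
        # Assumption: negatives are expressed as a rejected BIRTHDAY span.
        "answer": "accept" if is_birthday else "reject",
    }


tasks = [to_task(*example) for example in EXAMPLES]
jsonl = "\n".join(json.dumps(task) for task in tasks)
print(jsonl)
```

Each line of the output is one example, with the positive example accepted and the two negative examples rejected for the same BIRTHDAY label. Is this the right shape, or should the negatives be encoded differently?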
Additional question: to avoid the catastrophic forgetting problem, I plan to run my generated text through the standard spaCy English model and add the named entity spans that it finds. It will label my birthday spans as DATE entities. Can I leave the DATE span annotations in for mentions that I also want to label BIRTHDAY, or do I have to remove them? I don’t know what your span collision logic is.
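If the colliding DATE spans do have to go, this is roughly what I would do before training. It is only a sketch, assuming spans are plain dicts with "start", "end" and "label" keys; overlaps and merge_spans are hypothetical helpers of my own, not part of spaCy or Prodigy.

```python
def overlaps(a, b):
    """True if two half-open character spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_spans(model_spans, birthday_spans):
    """Drop model spans that collide with a BIRTHDAY span, keep the rest."""
    kept = [
        span for span in model_spans
        if not any(overlaps(span, birthday) for birthday in birthday_spans)
    ]
    return sorted(kept + birthday_spans, key=lambda span: span["start"])


# Spans for "Napoleon was born on August 15, 1769."
model_spans = [
    {"start": 0, "end": 8, "label": "PERSON"},
    {"start": 21, "end": 36, "label": "DATE"},
]
birthday_spans = [{"start": 21, "end": 36, "label": "BIRTHDAY"}]

merged = merge_spans(model_spans, birthday_spans)
print(merged)
```

With this approach the PERSON span survives and the DATE span under the birthday is replaced by BIRTHDAY, while DATE spans elsewhere in the text would be left alone.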