I want to train a model to extract mentions of birthdates from text. For example, in the sentence “Roger Smith was born on September 5, 1956 and died March 10, 1990” I want to tag the span “September 5, 1956” as a BIRTHDAY.
I will generate text along with the offsets and labels I want to learn. I want to get accuracy numbers from cross-validation and generate a training curve. Because this is a synthetic data set, I don’t need to do any manual annotation.
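To make that concrete, here is roughly the kind of generator I have in mind. The names, templates, and helper function are all made up for illustration; the point is just that I can emit text together with end-exclusive character offsets for each `BIRTHDAY` span:

```python
import random

# Hypothetical generator sketch: names, months, and the template are invented.
NAMES = ["Roger Smith", "Jane Doe", "Wei Chen"]
MONTHS = ["January", "March", "May", "September", "December"]

def make_example():
    """Return (text, annotations), where annotations holds end-exclusive
    character offsets for the BIRTHDAY span."""
    name = random.choice(NAMES)
    date = f"{random.choice(MONTHS)} {random.randint(1, 28)}, {random.randint(1900, 1999)}"
    prefix = f"{name} was born on "
    text = f"{prefix}{date}."
    annotations = {"entities": [(len(prefix), len(prefix) + len(date), "BIRTHDAY")]}
    return text, annotations

text, ann = make_example()
start, end, label = ann["entities"][0]
print(text)
print(text[start:end], label)
```

So the raw material is easy; my question is only about what file format to pour it into.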
The spaCy documentation has a “Training spaCy’s Statistical Models” section that I could copy code from. However, both spaCy and Prodigy have command-line training interfaces, and I’d rather use those than write my own training loop — but I’m not sure what input formats they take.
There is a `spacy train` command that takes paths to training and development data. What is the format of the files or directories at those paths? (I can see that they get passed as arguments to `GoldCorpus`, but I’m not sure what format `GoldCorpus` expects.)
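In case the expected corpus format is token-based rather than offset-based, here is my guess at the conversion I would need: mapping character-offset entities onto per-token BILUO-style tags. This is a rough pure-Python sketch assuming naive whitespace tokenization, which I know differs from spaCy’s real tokenizer — it just shows the shape of data I think I would have to produce:

```python
def offsets_to_biluo(text, entities):
    """Map end-exclusive character-offset entities onto per-token BILUO tags.

    Assumes whitespace tokenization (a simplification for illustration;
    spaCy's own tokenizer splits differently).
    """
    # Recover each whitespace token's character span.
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append((tok, start, start + len(tok)))
        pos = start + len(tok)

    tagged = []
    for tok, ts, te in tokens:
        tag = "O"
        for es, ee, label in entities:
            if ts >= es and te <= ee:  # token lies inside the entity span
                if ts == es and te == ee:
                    tag = f"U-{label}"   # single-token entity
                elif ts == es:
                    tag = f"B-{label}"   # first token
                elif te == ee:
                    tag = f"L-{label}"   # last token
                else:
                    tag = f"I-{label}"   # interior token
        tagged.append((tok, tag))
    return tagged
```

Is something along these lines what the training command wants, or does it consume character offsets directly?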
There is also a `prodigy ner.batch-train` recipe that trains a model given data already in a dataset. I guess I could use `prodigy db-in` to populate the dataset, imitating the format created by a `prodigy ner.teach` session, but I’m doing a bit of reverse engineering there.
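If `db-in` is the route, this is my guess at the JSONL shape it would accept — one task per line with `"text"`, `"spans"`, and `"answer"` keys, imitating what `ner.teach` appears to store. The key names here are my reverse-engineered guess, not something I’ve confirmed against the docs:

```python
import json

# Guessed task shape for `prodigy db-in`: one JSON object per line, with
# character-offset spans and an "accept" answer, mimicking ner.teach output.
examples = [
    {
        "text": "Roger Smith was born on September 5, 1956 and died March 10, 1990",
        "spans": [{"start": 24, "end": 41, "label": "BIRTHDAY"}],
        "answer": "accept",
    }
]

with open("birthdays.jsonl", "w") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")
```

Then presumably something like `prodigy db-in my_dataset birthdays.jsonl` followed by `ner.batch-train` — but I’d rather hear the blessed format than keep guessing.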
What is the best way to train an NER model on a synthetic data set?