Training NER models with synthetic data sets

I want to train a model to extract mentions of birthdates from text. For example, in the sentence “Roger Smith was born on September 5, 1956 and died March 10, 1990” I want to tag the span “September 5, 1956” as a BIRTHDAY.

I will generate text along with the offsets and labels I want to learn. I want to get accuracy numbers from cross-validation and generate a training curve. Because this is a synthetic data set, I don’t need to do any manual annotation.

The spaCy documentation has a Training spaCy’s Statistical Models section that it looks like I could copy code from. However, both spaCy and Prodigy have command line training interfaces. I’d like to use them instead of writing code, but I’m not sure what input formats they take.

There is a spacy train command that takes paths to training and development data. What is the format of the files or directories on those paths? (I can see that they get passed as arguments to GoldCorpus, but I’m not sure what format GoldCorpus is expecting.)

There is also a prodigy ner.batch-train command that trains a model on data already in a dataset. I guess I could use prodigy db-in to populate the dataset, imitating the format created by a prodigy ner.teach session, but I’d be doing a bit of reverse engineering there.

What is the best way to train an NER model on a synthetic data set?


Thanks, this is a good question. I see your point about the convenience of the built-in commands, and we definitely put a lot of work into the user experience and making them as useful as possible :blush: Prodigy’s ner.batch-train and ner.train-curve recipes are primarily designed to train from Prodigy-style accept/reject annotations – but this shouldn’t be a problem.

For instance, you could create two examples for each template – a correct one and an incorrect one, i.e. “Roger Smith was born on {date}” and “Roger Smith died on {date}” etc. This will give you a nice and even 50/50 distribution of accept/reject examples. You can then import those to a new dataset using db-in and should be able to train your model pretty much out of the box.
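A generator along these lines might look like the sketch below. The “born on” / “died on” templates and the BIRTHDAY label come from this thread; the helper names, placeholder names, and dates are illustrative, not part of any Prodigy API:

```python
import json
import random

# Illustrative templates: the "born on" variant yields an accept,
# the "died on" variant a reject for the BIRTHDAY label.
TEMPLATES = [
    ("{name} was born on {date}.", "accept"),
    ("{name} died on {date}.", "reject"),
]
NAMES = ["Roger Smith", "Jane Doe"]            # placeholder values
DATES = ["September 5, 1956", "March 10, 1990"]

def make_example(template, answer, name, date):
    text = template.format(name=name, date=date)
    start = text.index(date)                   # character offset of the span
    return {
        "text": text,
        "spans": [{"text": date, "label": "BIRTHDAY",
                   "start": start, "end": start + len(date)}],
        "answer": answer,
    }

def generate(n, seed=0):
    rng = random.Random(seed)
    for _ in range(n):
        template, answer = rng.choice(TEMPLATES)
        yield make_example(template, answer,
                           rng.choice(NAMES), rng.choice(DATES))

if __name__ == "__main__":
    # One JSON object per line (JSONL), ready for import
    for example in generate(4):
        print(json.dumps(example))
```

The resulting JSONL file could then be imported into a new dataset with db-in.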

This should do pretty well, especially for the experimentation phase. Pre-populating datasets like this is totally in line with what we’ve had in mind for possible Prodigy workflows. It also means that you can try out different templates quickly, add manual annotations to your set later on, or combine already existing annotations from different sources.

Once you’re getting more “serious” about training and want to use a larger corpus, you might still want to transition to training with spaCy directly. This will give you more flexibility, and full control of all hyperparameters and knobs to twiddle. But if you run your experiments with Prodigy first, you won’t have to worry about this until you’ve found the best approach.

You can find more details on spaCy’s training input format in the annotation specs documentation. Note that the entity here is supplied in the BILUO format, i.e. B-BIRTHDAY. The training examples are all designed with simple command-line interfaces already, so it’ll hopefully be easy to mix and match those to build your own. You can also take some inspiration from spaCy’s built-in train command. And if you haven’t seen it already, the “Optimization tips and advice” section and our blog post on working around the “catastrophic forgetting” problem should also be helpful.
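As a quick illustration of the BILUO scheme (Begin, In, Last, Unit, Out), the example sentence’s entity span would be tagged roughly as follows. The tokenization shown is simplified; spaCy’s own tokenizer decides the real token boundaries:

```python
# One BILUO tag per token; the four tokens of "September 5, 1956"
# are tagged Begin, In, In, Last, and everything else is Out.
tokens = ["Roger", "Smith", "was", "born", "on",
          "September", "5", ",", "1956",
          "and", "died", "March", "10", ",", "1990"]
biluo  = ["O", "O", "O", "O", "O",
          "B-BIRTHDAY", "I-BIRTHDAY", "I-BIRTHDAY", "L-BIRTHDAY",
          "O", "O", "O", "O", "O", "O"]
assert len(tokens) == len(biluo)
```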

So would I use prodigy db-in to import the following JSON data for this sample?

{"text":"Roger Smith was born on September 5, 1956 and died March 10, 1990",
 "spans": [
    {"text":"September 5, 1956", "answer":"accept", "label":"BIRTHDAY", "start":24, "end":41},
    {"text":"March 10, 1990", "answer":"reject", "label":"BIRTHDAY", "start":51, "end":65}
 ]}

i.e. I am accepting “September 5, 1956” as a BIRTHDAY and rejecting “March 10, 1990”. start and end are character offsets.
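Since start and end are plain character offsets, a quick sanity check before importing can catch off-by-one mistakes. A standalone sketch (standard library only, helper name is made up):

```python
import json

def offset_mismatches(lines):
    """Return (line_no, expected, actual) for every span whose offsets
    don't slice out the span's own 'text' field."""
    bad = []
    for line_no, line in enumerate(lines, 1):
        example = json.loads(line)
        for span in example.get("spans", []):
            actual = example["text"][span["start"]:span["end"]]
            if actual != span["text"]:
                bad.append((line_no, span["text"], actual))
    return bad
```

Running this over the JSONL file before the import should come back empty.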

Yes, this should work!

Alternatively, you can also add one annotation task per span and a global "answer" key on the example:

{"text":"Roger Smith was born on September 5, 1956 and died March 10, 1990", "spans": [{"text":"September 5, 1956", "label":"BIRTHDAY", "start":24, "end":41}], "answer": "accept"}
{"text":"Roger Smith was born on September 5, 1956 and died March 10, 1990", "spans": [{"text":"March 10, 1990", "label":"BIRTHDAY", "start":51, "end":65}], "answer": "reject"}

That appears to be working. My model is getting around 94% accuracy.

Here’s a gist of the dataset generator in case it’s helpful to other people.


This sounds promising! Definitely keep us updated on the progress. I’m very curious to hear how the approach works at a larger scale, and how the model performs on real-world data.

And thanks for sharing the script! Just played around with it and it works great. I felt inspired and made a little variation of your script that wraps it in a Prodigy recipe:

Could even take it one step further and automatically add the created examples to a dataset – might add this as well, just for fun :blush:

One thing I don’t understand. I used this script to generate 10,000 samples with a roughly equal number of accepted and rejected spans. However, the stats for the dataset are as follows:

$ pgy stats birthday

  ✨  Prodigy stats

  Version            1.1.0              
  Location           /anaconda3/lib/python3.6/site-packages/prodigy 
  Prodigy Home       /Users/wmcneill/.prodigy 
  Platform           Darwin-17.3.0-x86_64-i386-64bit 
  Python Version     3.6.3              
  Database Name      SQLite             
  Database Id        sqlite             
  Total Datasets     10                 
  Total Sessions     38                  

  ✨  Dataset 'birthday'

  Dataset            birthday           
  Created            2018-01-03 13:43:42 
  Description        Birthday detector  
  Author             None               
  Annotations        10000              
  Accept             10000              
  Reject             0                  
  Ignore             0                   

Why does it say 10000 accepts and no rejects?

I can see accept: false in the dataset.

$ pgy db-out birthday | head -3
{"text":"On 7/10/2001 Cinderella died.","spans":[{"text":"7/10/2001","start":3,"end":12,"label":"BIRTHDAY","accept":false}],"_input_hash":-905765345,"_task_hash":-146242737,"answer":"accept"}
{"text":"RIP Jeanelle: 9/29/1973-6/20/1978.","spans":[{"text":"9/29/1973","start":14,"end":23,"label":"BIRTHDAY","accept":true},{"text":"6/20/1978","start":24,"end":33,"label":"BIRTHDAY","accept":false}],"_input_hash":1343077284,"_task_hash":397021865,"answer":"accept"}
{"text":"June 9, 1987 was the day Carry died.","spans":[{"text":"June 9, 1987","start":0,"end":12,"label":"BIRTHDAY","accept":false}],"_input_hash":-1823874530,"_task_hash":795811923,"answer":"accept"}

Ah, the problem here is that you only have the "answer" on the spans, not the task itself. So when you add the examples to the database, they’re automatically added as accepted answers. This is fine for NER training, as Prodigy looks at the spans. But the count you’re seeing in prodigy stats is a bit misleading in this case, because it only looks at the tasks and their answers, not the individual components.

Edit: I was wrong. Prodigy does handle this correctly. See my comment below on what the problem is, sorry!

Ah, this is potentially problematic and a different issue. Sorry I didn’t spot this before in your script. It should be "answer": "accept" and "answer": "reject". I’m surprised that it even worked before, but I think what happened here is that you accidentally trained on all examples as "accept". ner.batch-train checks if the spans have an "answer" key present and if not, it uses the task’s "answer" for all spans…
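Given that fallback behaviour, a small pre-import check can catch this exact bug, i.e. spans that carry an "accept" boolean instead of the "answer": "accept"/"reject" string. A sketch (the helper name is illustrative, not a Prodigy function):

```python
import json

def span_answer_problems(lines):
    """Flag line numbers whose spans use an 'accept' boolean instead of
    the 'answer': 'accept'/'reject' string described in this thread."""
    flagged = []
    for line_no, line in enumerate(lines, 1):
        example = json.loads(line)
        for span in example.get("spans", []):
            if "accept" in span or span.get("answer") not in (None, "accept", "reject"):
                flagged.append(line_no)
                break
    return flagged
```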

Edit: I updated my gist to include a --dataset option and fixed the "answer" bug on the spans.

Hi @ines

A quick question on data formatting for Prodigy training. I’m using PhraseMatcher to create ‘free data’ annotations for the NER task. I can create a very large list of ‘accept’ text examples (extracted by matching); my question is how to create ‘reject’ cases.

My first (naive) solution was to pick those sentences where PhraseMatcher found nothing and then randomly annotate a token with the ‘reject’ label.

For example, I’m interested in identifying drug names.

Positive case (both ‘epinephrine’ and ‘bicarb’ are drugs):

{"text":"he received epinephrine, and bicarb.", "spans": [{"text":"epinephrine", "label":"DRUG", "start":12, "end":23}], "answer": "accept"}
{"text":"he received epinephrine, and bicarb.", "spans": [{"text":"bicarb", "label":"DRUG", "start":29, "end":35}], "answer": "accept"}

Now, to make a balanced accept/reject data set for training, I can pick random sentences without drugs at all and randomly pick a word:

{"text":"after approximately 15-20 minutes of resuscitation, he had rosc.", "spans": [{"text":"after", "label":"DRUG", "start":0, "end":5}], "answer": "reject"}
{"text":"he had a cth which was unremarkable.", "spans": [{"text":"unremarkable", "label":"DRUG", "start":23, "end":35}], "answer": "reject"}

Does my approach make sense? Any other ideas how to create a good training set for Prodigy?
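For what it’s worth, the two steps described in this question can be sketched in a few lines. Plain string matching stands in for spaCy’s PhraseMatcher here (which would handle tokenization-aware matching), and the DRUGS list and helper names are illustrative:

```python
import random
import re

DRUGS = ["epinephrine", "bicarb"]  # stands in for the real terminology list

def positive_examples(text):
    """One 'accept' example per matched drug name."""
    for drug in DRUGS:
        for m in re.finditer(re.escape(drug), text):
            yield {"text": text,
                   "spans": [{"text": drug, "label": "DRUG",
                              "start": m.start(), "end": m.end()}],
                   "answer": "accept"}

def negative_example(text, rng):
    """A 'reject' example from a random word in a sentence with no matches."""
    words = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)]
    word, start, end = rng.choice(words)
    return {"text": text,
            "spans": [{"text": word, "label": "DRUG",
                       "start": start, "end": end}],
            "answer": "reject"}
```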

@Andrey This is definitely something you can try, yes! I’d recommend running a few small experiments to see how you go – create a few hundred examples, run a training experiment and compare the results. It’s possible that some of these negative examples are too random – so maybe you could try using more specific token-based patterns to only focus on, say, nouns or proper nouns and mark those that aren’t part of your terminology list as negative examples?

If you have a word list and raw text, couldn’t you also use the ner.teach recipe with a patterns file? This will take care of the positive and negative examples and might even give you more entities that weren’t covered by your word list when the model kicks in and starts making suggestions.
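A word list can be turned into a patterns file with a few lines. This sketch assumes lowercase token patterns, one token entry per word of each term; the function name is made up:

```python
import json

def word_list_to_patterns(words, label):
    """Build match patterns from a terminology list: one lowercase token
    pattern per term, multi-word terms becoming one entry per word."""
    patterns = []
    for term in words:
        tokens = [{"lower": w.lower()} for w in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns

if __name__ == "__main__":
    # Write one pattern per line (JSONL) for use as a patterns file
    for p in word_list_to_patterns(["epinephrine", "sodium bicarbonate"], "DRUG"):
        print(json.dumps(p))
```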

Are you training a new model from scratch, and do you have data for all the categories you need? Because if so, you could also generate a full gold-standard dataset, annotate all matches in a text and then run ner.batch-train with the --no-missing flag. This will essentially tell spaCy to treat all unlabelled tokens as “outside of an entity” and assume that the labelled entities are all entities that occur in the text. It also means that you won’t need any negative examples anymore.

hi @ines, great advice, many thanks!

ner.teach is of course a good start, but since I already have all the words (a huge word list), clicking ‘accept’/‘reject’ for each one would be very time consuming. I’ve written a simple script with PhraseMatcher to match and annotate all those entities, and can easily generate more than 100k ‘gold’-annotated text spans (figures that would take a very long time to reach by annotating manually, even in Prodigy), so your latter suggestion seems very promising and efficient.

Just two quick questions:

  • (or if you could point me to a link): what format should the training examples have for training with Prodigy? I tried to follow what’s written in the Prodigy cookbook, but still can’t make the code run (it throws an error).
  • How can I change the accuracy metric in ner.batch-train, for example to F1?

Many thanks in advance!

@ines: The format that @wpm suggests in post #3, where is it documented?

I've been searching for hours for a way to import existing data containing both accepts and rejects into a dataset, before I finally - more or less by accident - stumbled across this thread.

If you look at the data that prodigy's db-out command exports, it is far more complex.
How would I know that this much simpler format would work?
Is there some documentation that I'm just too stupid to find?

@kamiwa Check out the "Annotation task formats" section in your PRODIGY_README.html. It should have examples of the different JSON formats for the different interfaces / tasks, and the minimum required properties.

Hi @ines,

thanks for the reply.

Check out the "Annotation task formats" section in your PRODIGY_README.html .

I had checked it. But due to missing KI (kamiwa intelligence) I didn't comprehend that the shown format examples can be combined. (Format for "Annotated task" plus format for "NER" in this case).

As Douglas Adams once said:

A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools.