Placing Data in One Dataset

Hey there,

So I wanted to know if I should place all of the data I want to use for training in one dataset. It was mentioned here that it is beneficial to do so to reduce the effects of the catastrophic forgetting problem:

Using the same dataset to store all annotations related to the same project is definitely a good strategy. It means that every time you train and evaluate your model, it will train and evaluate on the whole set, not just the annotations you collected in the last step. This also helps prevent the “catastrophic forgetting” problem.

However, it was also mentioned that if we’re getting annotations from both the ner.teach and ner.make-gold recipes, we should place them in SEPARATE datasets:

I would usually recommend working with separate datasets for different annotations. The annotations you collect with ner.teach are binary decisions on different spans, whereas in ner.make-gold, you’re labelling the whole sentence until it’s complete. It’s usually cleaner to keep those separate, and it also makes it easy to try out different things and run experiments with different configurations.

So I just wanted to clarify whether we should keep them all in one dataset or in multiple datasets. Also, does it make sense to train the model on annotations from BOTH ner.teach AND ner.make-gold?


In general, we recommend using separate datasets for every experiment or small annotation project and combining them later on when you’re ready to train a full model.

You usually don’t want to use a single dataset for all annotations – otherwise, it’s going to be super difficult to interpret the results and find out which types of annotations made a difference (in a positive or negative way).
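To illustrate the "separate datasets first, combine later" idea, here is a minimal sketch. It assumes each dataset has been exported as a list of Prodigy-style task dicts with "text", "spans" and "answer" keys. The merge_datasets helper, the dataset names and the fallback key are hypothetical and only for illustration; where an export carries Prodigy's own "_task_hash", the sketch uses that to spot duplicates.

```python
import json


def example_key(eg):
    # Prefer Prodigy's task hash if the export includes one. Otherwise fall
    # back to a key built from text, spans and answer (an assumption made
    # for this sketch, not Prodigy's own hashing).
    if "_task_hash" in eg:
        return eg["_task_hash"]
    spans = sorted((s["start"], s["end"], s["label"]) for s in eg.get("spans", []))
    return json.dumps([eg["text"], spans, eg.get("answer")])


def merge_datasets(datasets):
    """Combine several named datasets, dropping exact duplicates and
    recording which dataset each example came from."""
    merged, seen = [], set()
    for name, examples in datasets.items():
        for eg in examples:
            key = example_key(eg)
            if key in seen:
                continue
            seen.add(key)
            merged.append(dict(eg, source=name))
    return merged


# Hypothetical exports from two small annotation projects.
news = [
    {"text": "Apple is hiring", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"},
]
forum = [
    {"text": "Apple is hiring", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"},
    {"text": "Berlin is nice", "spans": [{"start": 0, "end": 6, "label": "GPE"}], "answer": "accept"},
]
merged = merge_datasets({"ner_gold_news": news, "ner_gold_forum": forum})
```

Keeping the source name on each example means you can still tell afterwards which annotation project an example came from when you compare results.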

Yes, but you usually want to train with different configurations, especially if you know that the data you’ve created with ner.make-gold is gold-standard and “complete”, i.e. includes all entity types that occur in the text. In that case, you can set the --no-missing flag during ner.batch-train. Instead of treating all unlabelled tokens as unknown missing values, the model will then be updated in a way that explicitly treats all other tokens as “not part of an entity” / “outside an entity”.
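The difference between "missing" and "outside" can be shown with token tags. This is only a sketch of the idea, not Prodigy's implementation: biluo_tags is a hypothetical helper, the token offsets are written by hand, and it follows the BILUO convention in which "O" means outside an entity and "-" marks an unknown value.

```python
def biluo_tags(tokens, spans, no_missing=False):
    """tokens: list of (start, end) character offsets.
    spans: list of (start, end, label) entity annotations.
    Unlabelled tokens become "O" if no_missing is set, otherwise "-"."""
    default = "O" if no_missing else "-"
    tags = [default] * len(tokens)
    for start, end, label in spans:
        # Indices of the tokens that fall completely inside the span.
        idx = [i for i, (ts, te) in enumerate(tokens) if ts >= start and te <= end]
        if not idx:
            continue
        if len(idx) == 1:
            tags[idx[0]] = "U-" + label
        else:
            tags[idx[0]] = "B-" + label
            for i in idx[1:-1]:
                tags[i] = "I-" + label
            tags[idx[-1]] = "L-" + label
    return tags


# "Ines works at Explosion AI", with only the ORG span annotated.
tokens = [(0, 4), (5, 10), (11, 13), (14, 23), (24, 26)]
spans = [(14, 26, "ORG")]

with_missing = biluo_tags(tokens, spans)
without_missing = biluo_tags(tokens, spans, no_missing=True)
```

In the second case, "Ines" is tagged "O", so the model would be told it is definitely not an entity. That is why the flag only makes sense if the data really does include every entity type that occurs in the text.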

Prodigy also supports a workflow for converting binary annotations to gold standard data. The ner.print-best recipe will take binary annotations and find the best-possible analysis of the examples given the constraints defined in the data (known correct entities, known incorrect entities etc). You can then export this data and correct it using the ner.manual recipe.
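Conceptually, "best-possible analysis given the constraints" can be sketched like this. The candidate analyses and their scores are written by hand here as an assumption; in practice they would come from the model, and best_analysis is a hypothetical helper rather than the recipe's actual code.

```python
def best_analysis(candidates, accepted, rejected):
    """candidates: list of (score, spans), where spans is a set of
    (start, end, label) tuples. Returns the highest-scoring set of spans
    that contains every accepted span and no rejected span, or None."""
    valid = [
        (score, spans)
        for score, spans in candidates
        if accepted <= spans and not (rejected & spans)
    ]
    if not valid:
        return None
    return max(valid, key=lambda c: c[0])[1]


# Hypothetical analyses of one text, with made-up scores.
candidates = [
    (0.6, {(0, 5, "ORG")}),
    (0.3, {(0, 5, "ORG"), (15, 21, "GPE")}),
    (0.1, {(0, 5, "PERSON"), (15, 21, "GPE")}),
]
accepted = {(15, 21, "GPE")}    # binary "accept" decisions
rejected = {(0, 5, "PERSON")}   # binary "reject" decisions

best = best_analysis(candidates, accepted, rejected)
```

The top-scoring candidate is ruled out because it lacks the accepted GPE span, so the binary decisions act as filters on what the model would otherwise predict.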

There’s also an example of a ner.silver-to-gold recipe that uses the same principle.

Hi Ines,

Thank you for your response! I have a few follow up questions. It would be awesome if I could get a response fairly soon! I have a project deadline coming up, and the answers to these questions would help me immensely.

(Sorry to keep editing this post haha. I keep finding more information that I figure would help further clarify my questions)

  1. Is it recommended to convert binary annotations to gold standard ones, if I am already using gold standard annotations to train? If so, does it make sense to run BOTH ner.teach and ner.make-gold on the SAME data (I only have one file of text data)? Converting the binary annotations from ner.teach to gold standard ones when using the same data should just give me the same exact annotations as the ones from make gold right? Thus, should I simply JUST use ner.make-gold and then use batch-train to train the NER model?

1a) I found this piece of info:

“Prodigy’s ner.batch-train workflow was also created under the assumption that annotations would be collected using ner.teach – e.g. a selection of examples biased by the score, and binary decisions only.”

Thus if I were to ONLY use make-gold, would I be better off converting the annotations to the BILUO format using the gold-to-spacy function and then training the model with the spaCy train function rather than the batch-train function from Prodigy?

1b) If it is recommended to use both make gold and ner.teach and to convert the binary annotations to gold standard ones using the strategy you mentioned above, does it make sense to start with ner.teach 1st, since the model does actively learn from it, and then move to ner.make-gold? Or does the order not matter?

  2. This might be a little bit of a tangent, but I ran into the catastrophic forgetting problem and would like to circumvent that. We’re annotating on 10 out of the existing 18 entities and are not introducing any new entity. Does the catastrophic forgetting problem primarily apply to situations where you’re creating a new entity or does it apply even if you’re just dealing with pre-existing entities? I would assume the latter.

2a) If it does apply to all cases, is the best way to add annotations of the text from the base model to use the mark recipe from Prodigy? Or if I solely use make-gold to train, will that be enough to combat the catastrophic forgetting issue?

2b) If I do need to do something in addition to running make gold, which I’d use for training, what’s a ballpark number of examples that we should include of the base model correctly annotating the text? Lastly, could I use the same data for this as I used to train my model or should I find another text dataset?

  3. I assume one way to improve a NER model would be to annotate different data and train the model off that. However, if I only have one dataset that I’m making annotations from, what are other ways in which I could improve the model? For example, fiddling with which hyper-parameters would make the most difference?

Thanks so much for your help!


If you want to train from gold-standard data and take advantage of the --no-missing flag, then yes, it makes sense to convert your binary annotations to gold-standard data.

Yes, that’s correct. And no, if you don’t mind labelling your entire corpus, you might as well do the whole thing with ner.make-gold. The ner.teach recipe is mostly useful if you have a lot of data and want to run quick experiments, but don’t necessarily care about labelling every single example. If you have limited data and don’t mind annotating everything, you might as well do that straight away. The data you end up with at the end will be the same – the gold-standard annotations of your corpus.

If the data you create includes all entity types you want in your model, then yes. If you’re creating gold-standard data including a new label (e.g. WEBSITE) and you also want the model to keep predicting PERSON, your gold-standard data should include annotations for both.

If you want to add a new label to your data, you might as well add it to the previous training data and then re-train the model from scratch – for example, if you decide you now also want to label CLOTHING, or if you want to change the definition of PRODUCT. Otherwise, a few hundred examples is always a good starting point. If the results look good, you can increase the number of examples to a thousand, and then to a few thousand.

Hyperparameters can make some difference – for example, you can try retraining with a different batch size or dropout rate and maybe you’ll see a small increase in accuracy on your evaluation set. However, this is only going to (potentially) improve how the model is learning from your data. If your dataset isn’t good enough and doesn’t have enough examples, the hyperparameters can’t make up for that.

A much better (and more predictable) solution would be to find more data from different sources. Even if the source isn’t perfectly related, you can still try and annotate a few examples, use Prodigy to run a few experiments, see if the model is improving, try again with different examples, and so on. And then do this until you’ve found a promising approach, annotate a bit more, train again and try to reason about why this approach is producing better results.

Also make sure you have a good evaluation set in place. By default, Prodigy will hold back some of your data randomly, but once you’re getting serious about training, you usually want to create a separate evaluation set and pass that in as the --eval-id setting when you train. This will give you much better results to work with, and a more reliable way to measure how your model is really doing.
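One way to carve out a fixed evaluation set is a deterministic split on the text, so the same example always lands on the same side no matter how often you re-run it. This is a rough sketch under my own assumptions: split_examples and the 20% fraction are hypothetical choices, and you would still need to store the evaluation half in its own dataset before passing its name as --eval-id.

```python
import hashlib


def split_examples(examples, eval_fraction=0.2):
    """Assign each example to train or evaluation based on a hash of its
    text, so duplicates of the same text never end up on both sides."""
    train, evaluation = [], []
    for eg in examples:
        digest = hashlib.md5(eg["text"].encode("utf8")).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < eval_fraction * 100:
            evaluation.append(eg)
        else:
            train.append(eg)
    return train, evaluation


# Hypothetical annotated examples (spans omitted for brevity).
examples = [{"text": "Example sentence number %d" % i} for i in range(50)]
train, evaluation = split_examples(examples)
```

Unlike a random hold-out that changes on every run, a fixed split like this keeps your accuracy numbers comparable across experiments.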

Hi Ines,

Thank you for your response! That all makes sense. I just have one final follow-up question. If I were to solely use make-gold to train the model, would I be better off training the model via spaCy, since I read that the ner.batch-train function is primarily suited for ner.teach?


Before we introduced the --no-missing flag, ner.batch-train was mostly optimised for binary annotations, yes. With the --no-missing flag, you can also train from gold-standard annotations, just like you do if you use spaCy directly.

However, once you’re serious about training your model, you probably want to be using spaCy directly, since it gives you more control over the process. ner.batch-train is best suited for running many quick experiments during development.
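If you go the spaCy route, the gold-standard annotations first need to be in a shape spaCy can train from. As a minimal sketch, assuming Prodigy-style task dicts and the simple (text, {"entities": [...]}) tuples used in spaCy v2's training examples, a conversion could look like this. to_spacy_format is a hypothetical helper for illustration, not a Prodigy or spaCy function.

```python
def to_spacy_format(examples):
    """Convert accepted gold-standard examples to
    (text, {"entities": [(start, end, label), ...]}) tuples."""
    train_data = []
    for eg in examples:
        # Skip rejected or ignored examples; only accepted ones are gold.
        if eg.get("answer") != "accept":
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        train_data.append((eg["text"], {"entities": entities}))
    return train_data


# Hypothetical examples as they might come out of a gold-standard dataset.
examples = [
    {"text": "Apple is hiring", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"},
    {"text": "Nothing here", "spans": [], "answer": "accept"},
    {"text": "Skipped one", "spans": [], "answer": "ignore"},
]
train_data = to_spacy_format(examples)
```

Accepted examples without any spans are kept on purpose: in gold-standard data, a text with no entities is still useful information for the model.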
