Form and prevalence of negative examples in the training set when training a custom NER spaCy model

To perform my task, I need to train a spaCy model with custom NER labels.

Now, I understand that the format of a "positive" training example (I use the term "positive" loosely, to mean a sentence in which the named entities of interest actually occur) is the following:

(sentence, {'entities' : [(start1, end1, label1), (start2, end2, label2) ... ]})
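
For instance, a concrete positive example in this format might look like the following (the sentence, the offsets and the "GADGET" label are all invented for illustration; character offsets are 0-based and the end offset is exclusive):

# Hypothetical positive example with a made-up custom label "GADGET".
# The offsets (8, 20) cover the substring "UltraPhone X".
positive_example = (
    "The new UltraPhone X ships next month.",
    {'entities': [(8, 20, 'GADGET')]},
)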

I also understand that I should include "negative" examples in the training process, i.e. sentences where the named entities of interest are not present.

I then have two questions:

  1. What should be the format of these "negative" examples? Possibly the following (see also the sketch after this list)?

(sentence, {'entities' : []})

  2. What should be the proportion of "negative" examples? I.e., if I train the model with 10,000 examples, how many of those should be "negative" and how many "positive", in the sense I have given to these words?
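
To make question 1 concrete, here is a small sketch of the mixed training set I have in mind (the sentences, offsets and the "GADGET" label are invented for illustration):

# Hypothetical mix of one "positive" and one "negative" example.
TRAIN_DATA = [
    ("The new UltraPhone X ships next month.", {'entities': [(8, 20, 'GADGET')]}),
    ("The weather was lovely yesterday.", {'entities': []}),  # no entities of interest
]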

Hi Alex,

We've been working hard on reforming the training data handling in spaCy for v3, as some aspects of it were a bit hard to use in v2. You can find more details in the docs here: https://spacy.io/usage/training
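
For reference, turning the tuple format from your question (including the entity-less examples with an empty 'entities' list) into the binary .spacy format used for v3 training could look roughly like this. It's a minimal sketch that assumes a TRAIN_DATA list like the one above, and the output path is just a placeholder:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations['entities']:
        # char_span returns None if the offsets don't align with token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:
            spans.append(span)
    doc.ents = spans  # an empty list is fine for a "negative" example
    db.add(doc)
db.to_disk("./train.spacy")  # pass this file to `spacy train` via the config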

We do have to stay focused on Prodigy-related usage here. Some of your questions (e.g. how many sentences without entities to include) are really quite general machine learning questions. The general tip there is that you usually want to approximate the distribution of data you'll find at runtime, so you want the proportion of entity-less sentences in your training set to roughly match the proportion of such sentences in the text you expect to analyse. There can be situations where it's useful to over-sample the positive cases, though; it's ultimately an empirical question.
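
To make the distribution point a bit more concrete: if you can estimate what fraction of sentences at runtime will contain no entities, you could down-sample your entity-less sentences to roughly match it. The helper below and its neg_fraction parameter are just an illustration, not part of spaCy or Prodigy:

import random

def sample_training_set(positives, negatives, neg_fraction, seed=0):
    # Keep every positive example and down-sample the entity-less ones so
    # that they make up roughly `neg_fraction` of the returned training set.
    rng = random.Random(seed)
    n_neg = int(len(positives) * neg_fraction / (1.0 - neg_fraction))
    n_neg = min(n_neg, len(negatives))
    sampled = positives + rng.sample(negatives, n_neg)
    rng.shuffle(sampled)
    return sampled

# e.g. if you expect ~30% of runtime sentences to contain no entities:
# train_data = sample_training_set(positive_examples, negative_examples, 0.3)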