Form and prevalence of negative examples in the training set when training a custom spaCy NER model

To perform my task I must train a spaCy model with custom NER labels.

Now, I understand that the format of a "positive" training example (I use the term "positive" loosely to mean a sentence that actually contains the named entities of interest) is the following:

(sentence, {'entities' : [(start1, end1, label1), (start2, end2, label2) ... ]})

I also understand that I should include "negative" examples in the training process -- i.e., sentences where the Named Entities of interest are not present.

I then have two questions:

  1. What should be the format of these "negative" examples? Possibly the following?

(sentence, {'entities' : []})

  2. What should be the proportion of "negative" examples? I.e., if I train the model with 10,000 examples, how many of those should be "negative" and how many "positive" in the sense I have given to these words?

Hi Alex,

We've been working hard on reforming the data handling stuff in spaCy for v3, as some aspects of this were a bit hard to use in v2. You can find more details about it in the docs here: https://spacy.io/usage/training

We do have to stay focused on Prodigy-related usage. Some of your questions (e.g. how many sentences without entities to include) are also quite general machine learning issues. The general tip there is that you usually want to approximate the distribution of the data you'll see at runtime, so you want the share of entity-less sentences in your training set to roughly match the share of such sentences in the text you expect to analyse. There can be situations where it's useful to over-sample the positive cases, though. It's ultimately an empirical question.
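To make the proportion tip concrete, here is a minimal sketch in plain Python. It uses the tuple format from the question, with an empty 'entities' list for sentences that contain no entities. The toy sentences, the helper names negative_fraction and match_negative_fraction, and the target value of 0.5 are all made up for illustration; in practice the target would come from measuring a sample of your own runtime text.

```python
import random

# Hypothetical toy data in the (sentence, {'entities': [...]}) format from the question.
# Negative examples simply carry an empty entity list.
TRAIN_DATA = [
    ("Alice moved to Berlin.", {"entities": [(0, 5, "PERSON"), (15, 21, "GPE")]}),
    ("Bob works at Acme.", {"entities": [(0, 3, "PERSON"), (13, 17, "ORG")]}),
    ("It rained all day.", {"entities": []}),
    ("Nothing happened.", {"entities": []}),
    ("See you tomorrow.", {"entities": []}),
]


def negative_fraction(data):
    """Share of examples whose entity list is empty."""
    negatives = sum(1 for _, annotations in data if not annotations["entities"])
    return negatives / len(data)


def match_negative_fraction(data, target, seed=0):
    """Keep all positives and sample negatives so they make up about `target`.

    `target` must be in [0, 1). If there are too few negatives to reach the
    target, all available negatives are kept.
    """
    positives = [ex for ex in data if ex[1]["entities"]]
    negatives = [ex for ex in data if not ex[1]["entities"]]
    wanted = round(target * len(positives) / (1.0 - target))
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(wanted, len(negatives)))
    combined = positives + sampled
    rng.shuffle(combined)
    return combined


# Assumed runtime distribution: half of the sentences contain no entity.
balanced = match_negative_fraction(TRAIN_DATA, target=0.5)
print(negative_fraction(TRAIN_DATA), negative_fraction(balanced))
```

This only downsamples negatives. Over-sampling positives, as mentioned above, would be a different choice, and which one works better is something to check on a held-out evaluation set.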

Hi, Matthew,
Is there any new feature that would work? I need negative examples for recognizing "no entity" cases. I tried using doc.char_span(0, 0, "NO_ENTITIY") as a placeholder, but it didn't work; it just returns None.

Hi @Raymond1415926!

Thanks for your question and welcome to the Prodigy community :wave:

Since this original post, we've created the spaCy GitHub discussions forum, which is where the spaCy core team answers spaCy specific questions. This forum is generally more for Prodigy-related questions.

I found this post there, which mentions that you can pass negative NER examples through a SpanGroup. Feel free to reply to that post if you have follow-up questions.
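As a rough sketch of what that can look like in spaCy v3: the snippet below builds two kinds of negative example. It assumes a recent spaCy v3 release in which the "ner" component accepts the incorrect_spans_key setting, and it assumes that this setting is what the linked post refers to. The key name "incorrect_spans", the example sentences and the GPE label are made up for illustration.

```python
import spacy
from spacy.tokens import Span
from spacy.training import Example

nlp = spacy.blank("en")
# Assumption: a spaCy v3 release where the "ner" factory supports incorrect_spans_key.
# The key name "incorrect_spans" is arbitrary; it only has to match the span group below.
ner = nlp.add_pipe("ner", config={"incorrect_spans_key": "incorrect_spans"})

# (a) Whole-sentence negative: an empty entity list annotates the text
# as containing no entities at all.
text_a = "It rained all day."
sentence_negative = Example.from_dict(nlp.make_doc(text_a), {"entities": []})

# (b) Span-level negative: record that "Paris" is NOT a GPE in this sentence,
# without saying anything about the rest of the tokens.
text_b = "Paris Hilton stayed at the hotel."
predicted = nlp.make_doc(text_b)
reference = nlp.make_doc(text_b)
reference.spans["incorrect_spans"] = [Span(reference, 0, 1, label="GPE")]
span_negative = Example(predicted, reference)

print(sentence_negative.reference.ents)
print([(s.text, s.label_) for s in span_negative.reference.spans["incorrect_spans"]])
```

Neither case needs a placeholder span or a special "no entity" label. Examples built this way would then go into the usual training loop or be serialized for spacy train alongside the positive examples.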

Hope this helps!