Form and prevalence of negative examples in the training set when training a custom NER spaCy model

To perform my task, I need to train a spaCy model with custom NER labels.

Now, I understand that the format of a "positive" training example (I use the term "positive" loosely, to mean a sentence in which the named entities of interest actually occur) is the following:

(sentence, {'entities' : [(start1, end1, label1), (start2, end2, label2) ... ]})
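
For instance, a concrete positive example in this format might look like the following (the sentence, the offsets and the "GADGET" label are all invented for illustration; character offsets are 0-based and the end offset is exclusive):

# Hypothetical positive example with a made-up custom label "GADGET".
# The offsets (8, 20) cover the substring "UltraPhone X".
positive_example = (
    "The new UltraPhone X ships next month.",
    {'entities': [(8, 20, 'GADGET')]},
)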

I also understand that I should include "negative" examples in the training process, i.e. sentences where the named entities of interest are not present.

I then have two questions:

  1. What should be the format of these "negative" examples? Possibly the following (see also the sketch after this list)?

(sentence, {'entities' : []})

  2. What should be the proportion of "negative" examples? I.e., if I train the model with 10,000 examples, how many of those should be "negative" and how many "positive", in the sense I have given to these words?
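
To make question 1 concrete, here is a small sketch of the mixed training set I have in mind (the sentences, offsets and the "GADGET" label are invented for illustration):

# Hypothetical mix of one "positive" and one "negative" example.
TRAIN_DATA = [
    ("The new UltraPhone X ships next month.", {'entities': [(8, 20, 'GADGET')]}),
    ("The weather was lovely yesterday.", {'entities': []}),  # no entities of interest
]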

Hi Alex,

We've been working hard on reforming the training data handling in spaCy for v3, as some aspects of it were a bit hard to use in v2. You can find more details in the docs here: https://spacy.io/usage/training
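
For reference, turning the tuple format from your question (including the entity-less examples with an empty 'entities' list) into the binary .spacy format used for v3 training could look roughly like this. It's a minimal sketch that assumes a TRAIN_DATA list like the one above, and the output path is just a placeholder:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations['entities']:
        # char_span returns None if the offsets don't align with token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:
            spans.append(span)
    doc.ents = spans  # an empty list is fine for a "negative" example
    db.add(doc)
db.to_disk("./train.spacy")  # pass this file to `spacy train` via the config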

We do have to stay focused on Prodigy-related usage here. Some of your questions (e.g. how many sentences without entities to include) are really quite general machine learning questions. The general tip there is that you usually want to approximate the distribution of data you'll find at runtime, so you want the proportion of entity-less sentences in your training set to roughly match the proportion of such sentences in the text you expect to analyse. There can be situations where it's useful to over-sample the positive cases, though; it's ultimately an empirical question.
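
To make the distribution point a bit more concrete: if you can estimate what fraction of sentences at runtime will contain no entities, you could down-sample your entity-less sentences to roughly match it. The helper below and its neg_fraction parameter are just an illustration, not part of spaCy or Prodigy:

import random

def sample_training_set(positives, negatives, neg_fraction, seed=0):
    # Keep every positive example and down-sample the entity-less ones so
    # that they make up roughly `neg_fraction` of the returned training set.
    rng = random.Random(seed)
    n_neg = int(len(positives) * neg_fraction / (1.0 - neg_fraction))
    n_neg = min(n_neg, len(negatives))
    sampled = positives + rng.sample(negatives, n_neg)
    rng.shuffle(sampled)
    return sampled

# e.g. if you expect ~30% of runtime sentences to contain no entities:
# train_data = sample_training_set(positive_examples, negative_examples, 0.3)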