For my task, I need to train a spaCy model with custom NER labels.
I understand that a "positive" training example has the following format (I use "positive" loosely to mean a sentence that actually contains the named entities of interest):
(sentence, {'entities' : [(start1, end1, label1), (start2, end2, label2) ... ]})
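As a concrete illustration of that format, here is a minimal sketch in plain Python. The sentence, the offsets, and the ORG / GPE labels are made up for illustration; the helper only confirms that each (start, end) pair slices out the intended text, assuming character offsets with an exclusive end.

```python
# Hypothetical "positive" example; the sentence and labels are illustrative only.
sentence = "Acme Corp opened an office in Berlin."
positive_example = (
    sentence,
    {"entities": [(0, 9, "ORG"), (30, 36, "GPE")]},
)


def entity_texts(example):
    """Return the (text, label) pairs that the character offsets point to."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


print(entity_texts(positive_example))
```

A check like this is a cheap way to catch off-by-one offsets before training.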
I also understand that I should include "negative" examples in the training process, i.e., sentences in which the named entities of interest are not present.
I then have two questions:
- What should be the format of these "negative" examples? Possibly the following?
(sentence, {'entities' : []})
- What should be the proportion of "negative" examples? I.e., if I train the model on 10,000 examples, how many of those should be "negative" and how many "positive", in the sense I have given these words?
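To make the two questions concrete, here is a small sketch in plain Python of what I have in mind. It assumes the empty-list form proposed above for "negative" examples; the sentences, the ORG label, and the helper name negative_fraction are hypothetical, and the ratio shown is just whatever this toy data happens to contain, not a recommendation.

```python
import random

# Hypothetical data; the sentences and the ORG label are illustrative only.
positive_examples = [
    ("Acme Corp hired ten engineers.", {"entities": [(0, 9, "ORG")]}),
    ("She joined Acme Corp last year.", {"entities": [(11, 20, "ORG")]}),
]
# Assumed "negative" format: same structure, with an empty entity list.
negative_examples = [
    ("The weather was mild all week.", {"entities": []}),
]


def negative_fraction(examples):
    """Fraction of examples whose entity list is empty."""
    if not examples:
        return 0.0
    negatives = sum(1 for _, annotations in examples if not annotations["entities"])
    return negatives / len(examples)


train_data = positive_examples + negative_examples
random.Random(0).shuffle(train_data)  # fixed seed so the order is reproducible
print(f"negative fraction: {negative_fraction(train_data):.2f}")
```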