Gold/Silver Dataset Confusion


I'm still in Prodigy newbie status, so I'm sorry for asking these probably stupid questions in order to get a grip on the correct usage of the various recipes...

My workflow:
I'm trying to extract company names from the first line of their imprint website. This line typically has the form "Impressum - <NAME>", "<NAME>", or "This is the site of <NAME>".

  1. I start with a blank model that I batch-trained on a small (<100 examples) dataset, which was manually annotated using ner.manual. For this and the following training runs I use the -U flag to ensure that my examples are not further split into sentences.

  2. Despite knowing that these few examples will only give a poor model, I used it in a teach loop to generate a bigger training dataset with fast binary decisions.
    Because there is a well-known set of wrong guesses ("Impressum", "-", "|", ...), I created a custom ner.teach wrapper recipe that automatically rejects those and updates the model accordingly. This way I created a dataset containing approx. 4000 examples.
    (for anyone interested in my endeavours: NER not containing <word_list>)

  3. When batch-training with this dataset (and using 20% of it as the evaluation set), I get a high accuracy of 90%. Loading and sampling the model, however, shows that this stems from a very high false-positive rate, i.e. in some cases nearly every token in the sentence is predicted as my entity.

  4. After considering "Limit number of predicted entities" (which is still interesting), I found the --no-missing flag of the ner.batch-train recipe. This reduced the accuracy to 70%, but the model now predicts with much more caution, because my binary annotations contain more "implicit rejects".
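The auto-reject wrapper from step 2 can be sketched without depending on Prodigy itself. This is only a minimal illustration of the filtering idea, assuming Prodigy-style task dicts with "text" and "spans" keys; the blacklist contents and the function name are my own:

```python
# Hypothetical sketch: auto-reject well-known wrong span suggestions
# before they reach the annotation UI.
AUTO_REJECT = {"Impressum", "-", "|"}  # known-bad guesses (assumption)

def filter_stream(stream):
    """Yield (task, auto_answer) pairs. Tasks whose suggested span texts
    all fall in the blacklist get an automatic 'reject'; everything else
    passes through for manual annotation (auto_answer is None)."""
    for task in stream:
        span_texts = {task["text"][s["start"]:s["end"]]
                      for s in task.get("spans", [])}
        if span_texts and span_texts <= AUTO_REJECT:
            task["answer"] = "reject"
            yield task, "reject"
        else:
            yield task, None
```

In a real custom recipe, the auto-rejected tasks would additionally be fed back to the model's update callback, as described above, instead of only being tagged.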

Would it make sense to further improve my dataset with ner.make-gold or ner.silver-to-gold (since I already have the 4000 examples)? I'm quite confused about the difference between my dataset and a "gold" one.

In my particular case, one restriction is that each sentence will contain at most ONE entity. So my "silver"(?) dataset, in which I only accepted the correct entity prediction, combined with the --no-missing flag will automatically lead to correct gold-style handling, right?

Or are there any benefits to creating a gold standard? I mean, I could simply take all accept entries from my silver dataset. If I compare a gold standard generated from a few examples with `ner.silver-to-gold`, the only difference seems to be a "tokens" entry in which the tokenization is saved...
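Taking all accept entries from the silver dataset could look like this. A minimal sketch, assuming the dataset was exported as Prodigy-style JSONL with an "answer" field; the function name is my own:

```python
import json

def accepted_examples(path):
    """Keep only the examples with answer == 'accept' from a JSONL
    export, i.e. the entries that could seed a gold dataset."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            eg = json.loads(line)
            if eg.get("answer") == "accept":
                examples.append(eg)
    return examples
```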

Let's assume I generate a large gold standard for batch training. It is advised to use a separate evaluation set. But if I use the ner.eval recipe with my input JSONL, it will start with the same entries I already have in my gold standard. The only option I see is to manually split the gold dataset beforehand and save it as distinct datasets via a script...
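Such a split script can be quite small. A sketch under the assumption that the gold dataset is a JSONL file of example dicts; the output file naming and the 20% fraction are assumptions:

```python
import json
import random

def split_dataset(path, eval_frac=0.2, seed=0):
    """Shuffle a JSONL dataset and split it into train/eval files so
    the evaluation examples never overlap with the training set."""
    with open(path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    n_eval = int(len(examples) * eval_frac)
    eval_set, train_set = examples[:n_eval], examples[n_eval:]
    for name, data in (("train", train_set), ("eval", eval_set)):
        with open(f"{path}.{name}.jsonl", "w", encoding="utf-8") as out:
            for eg in data:
                out.write(json.dumps(eg) + "\n")
    return len(train_set), len(eval_set)
```

The two resulting files can then be imported into separate Prodigy datasets, one for batch training and one purely for evaluation.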

If you have that constraint, then yeah, the information in your "accept" data is complete. The distinction only arises in other situations, where you might have one entity annotated but not know about other entities.

There's also a distinction in your "reject" examples: you might have said no to some suggested span, while some other span is actually correct. In that situation, you don't want the model to assume there are no entities in the example, even though the annotations don't specify any.

Ok, thank you, I just wanted to be sure that I got it right :wink: