I'm still a Prodigy newbie, so I'm sorry for asking these probably stupid questions in order to get a grip on the correct usage of the various recipes...
I'm trying to extract company names from the first line of their imprint ("Impressum") websites. This line typically has the form "Impressum - <NAME>", "<NAME>" or "This is the site of <NAME>".
I started with a blank model that I batch-trained on a small (<100 examples) dataset, which was manually annotated using `ner.manual`. For this and the following training runs I use the `-U` flag to make sure that my examples are not further split into sentences.
Despite knowing that these few examples would only give a poor model, I used it in a teach loop to generate a bigger training dataset with fast binary decisions.
Because there is a well-known set of wrong guesses ("Impressum", "-", "|", ...), I created a custom `ner.teach` wrapper recipe that automatically rejects those and updates the model accordingly. This way I created a dataset containing approx. 4000 examples.
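To make the wrapper idea concrete, here is a minimal, stdlib-only sketch of the filtering logic. It is hypothetical and simplified: real Prodigy examples carry more fields, and the real wrapper sits around the stream and `update()` callback returned by the built-in `ner.teach` recipe.

```python
# Sketch of the auto-reject idea behind my ner.teach wrapper.
# Hypothetical and simplified: real Prodigy examples carry more fields.

KNOWN_BAD = {"Impressum", "-", "|"}

def auto_reject(stream, update):
    """Filter a stream of binary NER examples.

    Examples whose suggested span text is a known wrong guess are
    answered 'reject' and fed straight back to the model via update();
    everything else is passed on to the annotator.
    """
    for eg in stream:
        span_text = eg["spans"][0]["text"] if eg.get("spans") else ""
        if span_text in KNOWN_BAD:
            eg["answer"] = "reject"
            update([eg])          # let the model learn from the reject
        else:
            yield eg              # needs a human decision

# Tiny demonstration with fake examples
rejected = []
stream = [
    {"text": "Impressum - ACME GmbH", "spans": [{"text": "Impressum"}]},
    {"text": "Impressum - ACME GmbH", "spans": [{"text": "ACME GmbH"}]},
]
kept = list(auto_reject(iter(stream), update=rejected.extend))
print(len(kept), len(rejected))  # 1 1
```

In the real recipe the `update` argument would be the model-updating callback from `ner.teach`, so the known-bad rejects never reach the annotation UI.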
(For anyone interested in my endeavours: NER not containing <word_list>.)
When batch-training with this dataset (and using 20% of it as an evaluation set), I get a high accuracy of 90%. However, loading the model and sampling predictions shows that this number is misleading: there is a very high false-positive rate, i.e. in some cases nearly every token in the sentence is predicted as my entity.
After considering "Limit number of predicted entities" (which is still interesting), I found the `--no-missing` flag of the `ner.batch-train` recipe. This reduced the accuracy to 70%, but the model now predicts much more cautiously, because my binary annotations contain more "implicit rejects" now.
Would it make sense to further improve my dataset with `ner.silver-to-gold` (since I already have the 4000 examples)? I'm quite confused about the difference between my dataset and a "gold" one.
In my particular case, one restriction is that each sentence will contain at most ONE entity. So my "silver"(?) dataset, in which I only accepted the correct entity prediction, combined with the `--no-missing` flag will automatically lead to a correct gold-standard treatment of it, right?
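To illustrate why I think the one-entity restriction makes my accepts effectively gold, here is a small stdlib-only sketch (the token/span format is simplified from Prodigy's): one accepted span expands into per-token labels where every other token is explicitly "O" rather than unknown, which is the interpretation I understand `--no-missing` to enforce.

```python
def accept_to_gold_tags(tokens, span_start, span_end, label):
    """Expand a single accepted span into explicit per-token tags.

    Under the one-entity-per-sentence assumption, accepting one span
    implies every other token is a non-entity ('O'); nothing is left
    as "missing" information.
    """
    tags = []
    for i, tok in enumerate(tokens):
        if span_start <= i < span_end:
            tags.append((tok, label))
        else:
            tags.append((tok, "O"))  # implicit reject made explicit
    return tags

tokens = ["Impressum", "-", "ACME", "GmbH"]
print(accept_to_gold_tags(tokens, 2, 4, "COMPANY"))
# [('Impressum', 'O'), ('-', 'O'), ('ACME', 'COMPANY'), ('GmbH', 'COMPANY')]
```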
Or are there any benefits to creating a gold standard? I mean, I could simply take all accept entries from my silver dataset. If I compare a gold standard generated from a few examples with `ner.silver-to-gold`, the only difference seems to be a "tokens" entry in which the tokenization is saved...
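What I mean by "simply take all accept entries", as a stdlib-only sketch: assuming the silver dataset was exported to JSONL (e.g. with `prodigy db-out`), I would just filter on the `answer` field.

```python
import json

def accepted_only(lines):
    """Yield only examples whose binary answer was 'accept'.

    `lines` is an iterable of JSONL strings, one example per line,
    as produced by a Prodigy dataset export.
    """
    for line in lines:
        eg = json.loads(line)
        if eg.get("answer") == "accept":
            yield eg

# Fake two-line export for demonstration
silver = [
    json.dumps({"text": "ACME GmbH", "answer": "accept"}),
    json.dumps({"text": "Impressum", "answer": "reject"}),
]
gold_candidates = list(accepted_only(silver))
print(len(gold_candidates))  # 1
```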
Let's assume I generate a large gold standard for batch training. It is advised to use a separate evaluation set. But if I use the `ner.eval` recipe with my input JSONL, it will start with the same entries I already have in my gold standard. The only option I see is to manually split the gold dataset beforehand and save it as distinct datasets via a script...
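The split script I have in mind would be something like this stdlib-only sketch; the resulting two lists could then be re-imported as separate Prodigy datasets (I assume via `prodigy db-in`, one dataset for training and one for evaluation).

```python
import random

def train_eval_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle and split examples into (train, eval) lists.

    A fixed seed keeps the split reproducible, so re-running the
    script never leaks evaluation examples into the training set.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Demonstration with placeholder examples
examples = [{"text": f"example {i}"} for i in range(10)]
train, evaluation = train_eval_split(examples)
print(len(train), len(evaluation))  # 8 2
```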