Annotation strategy for gold-standard data

Hi Frederik,

The --label k1,k2 argument tells Prodigy to only suggest entities that have been assigned those labels. These labels are not in the de_core_news_sm pre-trained model you're using, and the make-gold recipe doesn't update the model. This means no entities will ever be suggested by the model.

We should add a warning (or possibly an error) if you specify labels that aren't in the model during make-gold. We have similar warnings for most other recipes, since it's an easy mistake to make, especially by mistyping a label name.
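For illustration, the check could look something like this. This is a hypothetical sketch, not Prodigy's actual code; the function name, message wording, and label sets are made up for the example:

```python
def check_labels(requested, model_labels):
    """Warn about labels the loaded model can't predict.

    Hypothetical helper: `requested` would come from --label,
    `model_labels` from the model's NER component.
    """
    missing = sorted(set(requested) - set(model_labels))
    if missing:
        print(
            "Warning: the model doesn't produce these labels, "
            "so it will never suggest them: " + ", ".join(missing)
        )
    return missing

# Example: labels from --label k1,k2 checked against a German
# model's label set (LOC, MISC, ORG, PER)
check_labels(["k1", "k2"], ["LOC", "MISC", "ORG", "PER"])
```

Both `k1` and `k2` would trigger the warning here, which is exactly the situation in your session: the model's suggestions are filtered down to labels it can never produce.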

The simple answer for Case 3: Mark accept.

The workflow in ner.make-gold should be that you mark all and only the correct entities, then hit ACCEPT once the example is correct. You can use REJECT to flag deeper problems for you to resolve later. For instance:

  • Sometimes the tokenization is incorrect, preventing you from marking the entity boundaries correctly;

  • Sometimes you don't have a correct category to put the entity in, so you'd like to revisit the example once you've updated your label scheme;

  • Sometimes the entity contains other entities within it, and you'd like to note that in your downstream evaluation.

If there are no problems like this, it's often the case that the correct analysis simply has no entities. These examples are important for the model to learn from, so you want them in your training data.
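To make that concrete, here's a minimal sketch of what an accepted no-entity example looks like in Prodigy's JSONL task format (the text itself is made up):

```python
import json

# An accepted example where the correct analysis has no entities:
# the spans list is simply empty, and the answer is "accept".
task = {
    "text": "Das Wetter war gestern sehr schön.",
    "spans": [],  # no entities is a valid, useful annotation
    "answer": "accept",
}
print(json.dumps(task, ensure_ascii=False))
```

Accepting such examples teaches the model that "no entities here" is a legitimate answer, which helps reduce false positives.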
