make gold multilabel

Hi there,

I am trying to reduce an existing labelling scheme to 20 labels, and I suspect that this could give me a warm start but that there will also be a lot of errors. My planned workflow is:

  • try my best to distil the existing labelled categories into the 20 I've determined to be independent
  • train a model on this dataset to get an idea of initial performance; which labels perform worst, etc.
  • correct incorrect labels using the trained model (or a blank model?) with either the manual or the make-gold recipe.

I am unsure of the best workflow generally, but also whether I can just use my custom manual recipe to correct/extend the dataset, or whether I should try to customize the make-gold recipe for a multilabel task?

Hi @dany,

I think I might be understanding your problem incorrectly. Do you already have data annotated with some higher number of labels --- say, 100 entity types --- and you want to reduce that to only 20 entity types?

Is there a many-to-one mapping of your fine-grained types to the coarse-grained ones? So for instance, if you have labels for CAR and LAPTOP, can you map both of those to PRODUCT? Or are there categories where the mapping is more complicated: for instance, maybe your fine-grained labels have a category MUSICIAN, some of which you'd sort into PERSON and some of which you'd sort into ORG?
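To make the many-to-one case concrete, here is a minimal sketch. It assumes the annotations are a list of dicts with a "text" field and a "spans" list carrying "start", "end" and "label"; that layout, the `LABEL_MAP` contents and the `remap_spans` helper are all illustrative assumptions, not something taken from your data. Labels with no single target (like MUSICIAN) are set aside for manual review rather than guessed.

```python
# Hypothetical many-to-one mapping from fine-grained to coarse-grained labels.
LABEL_MAP = {"CAR": "PRODUCT", "LAPTOP": "PRODUCT"}


def remap_spans(examples, label_map):
    """Relabel spans that have a known target; set the rest aside for review."""
    remapped, needs_review = [], []
    for eg in examples:
        new_spans = []
        unresolved = False
        for span in eg.get("spans", []):
            if span["label"] in label_map:
                # Copy the span and swap in the coarse-grained label.
                new_spans.append({**span, "label": label_map[span["label"]]})
            else:
                # No unambiguous target: keep the old label and flag the example.
                unresolved = True
                new_spans.append(dict(span))
        new_eg = {**eg, "spans": new_spans}
        (needs_review if unresolved else remapped).append(new_eg)
    return remapped, needs_review


# Made-up examples, only to show the shape of the input.
examples = [
    {"text": "I bought a laptop.",
     "spans": [{"start": 11, "end": 17, "label": "LAPTOP"}]},
    {"text": "Prince played here.",
     "spans": [{"start": 0, "end": 6, "label": "MUSICIAN"}]},
]
remapped, needs_review = remap_spans(examples, LABEL_MAP)
```

Everything in `remapped` can go straight into the new dataset, while `needs_review` holds only the examples that actually need a human decision.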

If you have a many-to-one mapping, obviously that's pretty easy. But even for the many-to-many cases, I would suggest making a frequency list of your entities and working down the types (each distinct entity string) rather than the tokens (each individual mention). For instance, you might have several mentions of an entity like Justin Bieber. You're going to re-type all of those instances to the same category, so there's no need to do them all individually. Doing them individually can only introduce errors, because it's hard to remember all the decisions you made.
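A rough sketch of the frequency-list idea, under the same assumed data layout ("text" plus "spans" with "start", "end" and "label"). The helper names and the `decisions` table are hypothetical; the point is only that each distinct entity string gets one decision, which is then applied to every mention.

```python
from collections import Counter


def entity_type_counts(examples):
    """Count each distinct (entity string, old label) pair across all mentions."""
    counts = Counter()
    for eg in examples:
        for span in eg.get("spans", []):
            surface = eg["text"][span["start"]:span["end"]]
            counts[(surface, span["label"])] += 1
    return counts


def apply_decisions(examples, decisions):
    """Apply one relabelling decision per entity type to every mention of it."""
    relabelled = []
    for eg in examples:
        new_spans = []
        for span in eg.get("spans", []):
            key = (eg["text"][span["start"]:span["end"]], span["label"])
            # Undecided entities keep their old label for now.
            new_spans.append({**span, "label": decisions.get(key, span["label"])})
        relabelled.append({**eg, "spans": new_spans})
    return relabelled


# Made-up examples, only to show the shape of the input.
examples = [
    {"text": "Justin Bieber released a single.",
     "spans": [{"start": 0, "end": 13, "label": "MUSICIAN"}]},
    {"text": "We saw Justin Bieber live.",
     "spans": [{"start": 7, "end": 20, "label": "MUSICIAN"}]},
    {"text": "Daft Punk split up.",
     "spans": [{"start": 0, "end": 9, "label": "MUSICIAN"}]},
]

counts = entity_type_counts(examples)
# Work down counts.most_common() and record one decision per entity type.
decisions = {
    ("Justin Bieber", "MUSICIAN"): "PERSON",
    ("Daft Punk", "MUSICIAN"): "ORG",
}
relabelled = apply_decisions(examples, decisions)
```

Starting from the most frequent entries means the first few decisions already cover a large share of the mentions, and the `decisions` table doubles as a record of what was decided.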