make gold multilabel

Dany · October 26, 2019, 5:10pm

Hi there,

I am trying to reduce an existing labelling scheme to 20 labels, and I suspect that this could give me a warm start but that there will also be a lot of errors. My planned workflow is:

1. Try my best to distil the existing labelled categories into the 20 I've determined to be independent.
2. Train a model on this dataset to get an idea of initial performance: which labels perform worst, etc.
3. Correct incorrect labels using the trained model (or a blank model?) with either the manual or the make-gold recipe.

I am unsure of the best workflow generally, but also whether I can just use my custom manual recipe to correct/extend the dataset, or whether I should try to customize the make-gold recipe for a multilabel task?

honnibal · November 1, 2019, 4:20pm

Hi @dany,

I think I might be understanding your problem incorrectly. Do you already have data annotated with some higher number of labels (say, 100 entity types), and you want to reduce that to only 20 entity types?

Is there a many-to-one mapping of your fine-grained types to the coarse-grained ones? So for instance, if you have labels for CAR and LAPTOP, can you map both of those to PRODUCT? Or are there categories where the mapping is more complicated: for instance, maybe your fine-grained labels have a category MUSICIAN, some of which you'd sort into PERSON and some of which you'd sort into ORG?
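To make the distinction concrete, here is a minimal sketch in plain Python (not a Prodigy recipe). The `FINE_TO_COARSE` table, the `remap_label` helper and the sample annotations are all hypothetical, built only from the example labels above. Labels with exactly one target are remapped automatically; anything ambiguous or unmapped is flagged for a manual decision.

```python
# Hypothetical fine-to-coarse table using the example labels from this post.
FINE_TO_COARSE = {
    "CAR": {"PRODUCT"},
    "LAPTOP": {"PRODUCT"},
    "MUSICIAN": {"PERSON", "ORG"},  # ambiguous: needs a per-entity decision
}


def remap_label(fine_label):
    """Return (coarse_label, needs_review) for one fine-grained label."""
    targets = FINE_TO_COARSE.get(fine_label, set())
    if len(targets) == 1:
        # Many-to-one case: safe to convert automatically.
        return next(iter(targets)), False
    # Zero or several candidate targets: leave for manual review.
    return None, True


# Hypothetical annotations as (entity text, fine-grained label) pairs.
annotations = [
    ("Tesla Model 3", "CAR"),
    ("ThinkPad", "LAPTOP"),
    ("Justin Bieber", "MUSICIAN"),
]

remapped = [(text, *remap_label(label)) for text, label in annotations]
for text, coarse, needs_review in remapped:
    print(text, coarse, "REVIEW" if needs_review else "ok")
```

Only the rows flagged for review need any annotation effort; the rest can be converted in bulk.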

If you have a many-to-one mapping, obviously that's pretty easy. But even for the many-to-many cases, I would suggest making a frequency list of your entities, and working down the types, rather than the tokens. For instance, you might have several mentions of an entity like Justin Bieber. You're going to re-type all of those instances to the same category, so there's no need to do them all individually. Doing them individually can only introduce errors, because it's hard to remember all the decisions you made.
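As a rough sketch of the frequency-list idea, again in plain Python with made-up data: count each unique (text, label) pair once, decide its new category once, then apply that decision to every mention. The `mentions` list and the `decisions` table are hypothetical placeholders for your own data and your own judgement calls.

```python
from collections import Counter

# Hypothetical mentions as (entity text, fine-grained label) pairs.
mentions = [
    ("Justin Bieber", "MUSICIAN"),
    ("The Beatles", "MUSICIAN"),
    ("Justin Bieber", "MUSICIAN"),
    ("Justin Bieber", "MUSICIAN"),
]

# Count unique types rather than individual tokens, most frequent first.
type_counts = Counter(mentions)
work_list = type_counts.most_common()

# One decision per type, filled in by hand while going down work_list.
decisions = {
    ("Justin Bieber", "MUSICIAN"): "PERSON",
    ("The Beatles", "MUSICIAN"): "ORG",
}

# Apply each decision to all mentions of that type in one pass.
relabelled = [(text, decisions[(text, label)]) for text, label in mentions]

for (text, label), count in work_list:
    print(f"{count:>3}  {text}  {label} -> {decisions[(text, label)]}")
```

Working from the top of the list means the most frequent entities are settled first, and every mention of the same entity ends up with the same category.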
