Hierarchical text classification - multiple passes on same dataset

Hi - I'm working with a hierarchical text classification problem. Within each level of the hierarchy, the child labels can overlap, so a single text may have several of them. If I break this into a set of one-vs-all problems, and annotate the dataset separately for each, I would have N datasets. In order to train a model for prediction, would I have to create a gold dataset that combines all of these sets? Is there a way I can do that with Prodigy so that, if a text has multiple labels, they are combined? Or would I have to merge them manually?

Yes, Prodigy will do this automatically if you pass multiple datasets to the built-in train or data-to-spacy recipes :smiley:

Alternatively, it should also be pretty easy to combine the annotations yourself if you want to do it in a separate process. Each annotation has two hashes, an _input_hash and a _task_hash (the Prodigy docs have more on how the hashing works). The input hash identifies all examples with the same input, which in this case means all annotations on the same text. So examples that share an input hash are annotations of the same text with different labels, and you can group on that hash to merge them.
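
If you go the manual route, the grouping could look roughly like the sketch below. It assumes each one-vs-all dataset has already been loaded as a list of dicts (for example from a JSONL export), and that every record carries "text", "_input_hash", a single "label" and an "answer" of "accept", "reject" or "ignore". The function name `merge_by_input_hash`, the "cats" output key and the sample records are all made up for illustration, so adjust them to whatever your data actually looks like.

```python
def merge_by_input_hash(*datasets):
    """Merge binary annotations that share an _input_hash into one record per text.

    Hypothetical helper: assumes one "label" per example and an "answer"
    of "accept", "reject" or "ignore".
    """
    merged = {}
    for examples in datasets:
        for eg in examples:
            # Skipped examples carry no information about the label.
            if eg.get("answer") == "ignore":
                continue
            entry = merged.setdefault(
                eg["_input_hash"], {"text": eg["text"], "cats": {}}
            )
            # accept -> label applies, reject -> label does not apply
            entry["cats"][eg["label"]] = eg["answer"] == "accept"
    return list(merged.values())


# Made-up sample annotations from two one-vs-all passes over the same texts.
sports = [
    {"text": "The match went to penalties.", "_input_hash": 111,
     "label": "SPORTS", "answer": "accept"},
    {"text": "Shares fell after the merger.", "_input_hash": 222,
     "label": "SPORTS", "answer": "reject"},
]
finance = [
    {"text": "The match went to penalties.", "_input_hash": 111,
     "label": "FINANCE", "answer": "reject"},
    {"text": "Shares fell after the merger.", "_input_hash": 222,
     "label": "FINANCE", "answer": "accept"},
]

combined = merge_by_input_hash(sports, finance)
for record in combined:
    print(record["text"], record["cats"])
```

One thing to watch with this approach: a label that was never annotated for a given text is simply absent from its "cats", which is not the same as a reject, so you may want to decide explicitly how to treat those gaps before training.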
