Hi guys,
I followed this article: Merging annotations from different datasets to deal with the following scenario:
Trained Dataset A w/ Label A off training data TD-A
Trained Dataset B w/ Label B off training data TD-B (TD-A and TD-B don’t actually share any input_hash… they are targeted subsets of the overall data, primed to make best use of the labellers’ time for A and B respectively)
Dataset A is only labelled w/ Label A, and same for B…
I ran the snippet of code outlined in the article linked above to generate Dataset C (A + B). I did this because I want to train a combined model for A + B instead of having separate models (which I guess I could do… but it would be more performant at prediction time to have one, right? Especially since I plan on training a dozen or so different labels).
My question is what happens in the following case:
If I accepted the “absence” of a label in Dataset A, that gives the algo useful information… specifically, that the text block is devoid of the label, which is very important for it to learn in nuanced/borderline cases. I’ll call this a “negative” example for the purposes of the rest of this question.
Dataset C is then trained on both labels, e.g. --label A,B. Dataset C was synthesized from A and B, but no longer keeps any information about which source each example originated from. So when training for the B label, how will the algo know that a Dataset A “negative” example (i.e. accepted as devoid of the label) shouldn’t also count against the B label as a negative example? How will it know that this “absence” information is meant to steer the A labelling and not the B labelling? In other words… won’t the algo interpret the absence of spans as a negative example for B as well? What if that text (which originated from A) really did have a B label present? Won’t it be learning incorrectly?
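To make the concern concrete, here is a minimal sketch in plain Python. It is not the snippet from the article and does not call any Prodigy API. The field names (`text`, `spans`, `answer`) are my assumption of what a span annotation record looks like, and `merge_naive` and `implicit_negatives` are hypothetical helpers I made up just for illustration.

```python
# Toy records; the field names are assumptions, not a confirmed schema.
dataset_a = [
    {"text": "alpha beta",
     "spans": [{"start": 0, "end": 5, "label": "A"}],
     "answer": "accept"},
    # Accepted with no spans: a deliberate "negative" example for label A.
    {"text": "gamma delta", "spans": [], "answer": "accept"},
]
dataset_b = [
    {"text": "epsilon zeta",
     "spans": [{"start": 8, "end": 12, "label": "B"}],
     "answer": "accept"},
]


def merge_naive(*datasets):
    """Concatenate datasets without recording where each example came from."""
    return [dict(eg) for ds in datasets for eg in ds]


def implicit_negatives(examples, label):
    """Return texts of accepted examples that carry no span for `label`.

    Looking only at the merged data, there is no way to tell whether such an
    example was reviewed for `label` or simply never annotated for it.
    """
    return [
        eg["text"]
        for eg in examples
        if eg["answer"] == "accept"
        and not any(span["label"] == label for span in eg["spans"])
    ]


dataset_c = merge_naive(dataset_a, dataset_b)
print(implicit_negatives(dataset_c, "B"))
# → ['alpha beta', 'gamma delta']
```

Both texts from Dataset A show up as candidates for a B “negative”, even though nobody ever checked them for B. That is exactly the ambiguity I’m worried about.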
Any ideas how to remedy this? Thanks