Multiple models or one single model?

I just finished saving a model with one label (based on 600+ manual annotations of text) using the following command:

prodigy train ner my_annotated_data_1 en_vectors_web_lg \
  --init-tok2vec ../tok2vec_cd8_model289.bin \
  --output ./my_model_1 \
  --eval-split 0.2

Having obtained an F-score of >95, I now would like to add a second label through the same steps of ner.manual and prodigy train.

I am not sure whether I should create a separate annotation dataset my_annotated_data_2 with the second label and then:

  • train and save a separate model my_model_2; or
  • train and save to the same model my_model_1 by providing both my_annotated_data_1 and my_annotated_data_2 as comma-separated datasets to the prodigy train recipe

I'm not sure which of these is better practice and would give the most accurate results. Is there a third alternative?

Hi! If your goal is to have one pipeline predicting both labels, training a single model on both datasets (your second option) is definitely the approach I would recommend. The presence or absence of one label can always be relevant for all other labels as well, since the entity recognizer predicts token-based tags, and named entities can't overlap.
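
To make the point about token-based tags concrete, here's a minimal sketch in plain Python. It assumes a simplified BIO tagging scheme purely for illustration (not necessarily what the entity recognizer uses internally), and the function name `spans_to_bio` and the labels `ORG` and `CITY` are made up for the example. Each token can hold exactly one tag, so two entities can never claim the same token:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, to BIO tags.

    Raises ValueError if two spans claim the same token, because each
    token can carry only one tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            if tags[i] != "O":
                raise ValueError(
                    f"token {i} ({tokens[i]!r}) is already tagged {tags[i]}"
                )
            # First token of an entity gets B-, the rest get I-
            tags[i] = ("B-" if i == start else "I-") + label
    return tags


# Hypothetical example sentence with two non-overlapping entities
tokens = ["Acme", "Corp", "opened", "in", "Berlin"]
tags = spans_to_bio(tokens, [(0, 2, "ORG"), (4, 5, "CITY")])
print(tags)  # ['B-ORG', 'I-ORG', 'O', 'O', 'B-CITY']
```

Because a token tagged as part of one label can't also be part of another, what the model learns about one label is also evidence about where the other label can and can't occur.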

It'll definitely be interesting to run different experiments here, though, and compare the per-label evaluation scores of the joint model to those of models trained separately on only one label at a time. If there's a big difference, this could point to potential problems and conflicts in the data.
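
As a rough sketch of what such a comparison could look like, the snippet below computes exact-match precision, recall and F-score per label from sets of gold and predicted spans. Everything in it is an assumption for illustration: the helper `per_label_scores`, the `(doc_id, start, end, label)` tuple layout, the labels `A` and `B`, and the toy data. It is not how Prodigy or spaCy compute their scores:

```python
from collections import defaultdict


def per_label_scores(gold, predicted):
    """Exact-match precision, recall and F-score for each label.

    gold and predicted are sets of (doc_id, start, end, label) tuples.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in predicted:
        counts[span[3]]["tp" if span in gold else "fp"] += 1
    for span in gold - predicted:
        counts[span[3]]["fn"] += 1
    scores = {}
    for label, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores


# Toy data: the same held-out examples scored for both setups
gold = {(0, 0, 2, "A"), (0, 5, 6, "B"), (1, 3, 4, "A")}
joint_pred = {(0, 0, 2, "A"), (0, 5, 6, "B"), (1, 3, 5, "A")}
separate_pred_a = {(0, 0, 2, "A"), (1, 3, 4, "A")}

joint = per_label_scores(gold, joint_pred)
# Score the single-label model only against gold spans of its own label
gold_a = {span for span in gold if span[3] == "A"}
separate_a = per_label_scores(gold_a, separate_pred_a)

gap = separate_a["A"]["f"] - joint["A"]["f"]
print(f"Label A: joint F={joint['A']['f']:.2f}, "
      f"separate F={separate_a['A']['f']:.2f}, gap={gap:.2f}")
```

For the comparison to be meaningful, both setups should be evaluated on the same held-out examples.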

A workflow you probably want to avoid is updating the same trained artifact multiple times with different datasets and different labels. This will make the process and the results much harder to reason about, and you're risking forgetting effects at every step.
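
The alternative is to make sure every training run sees all annotations at once. Here's a minimal sketch of combining two sets of annotations before training, assuming task dictionaries with `"text"` and `"spans"` keys and character offsets. The function `merge_annotations` and the sample data are hypothetical, and this is not Prodigy's own merging logic:

```python
def merge_annotations(*datasets):
    """Combine examples from several datasets into one list.

    Examples with identical text are merged into a single example
    that carries the spans from all datasets.
    """
    merged = {}
    for dataset in datasets:
        for eg in dataset:
            entry = merged.setdefault(
                eg["text"], {"text": eg["text"], "spans": []}
            )
            for span in eg.get("spans", []):
                if span not in entry["spans"]:  # skip exact duplicates
                    entry["spans"].append(span)
    for entry in merged.values():
        entry["spans"].sort(key=lambda s: s["start"])
    return list(merged.values())


# Hypothetical annotations: one label per dataset
data_1 = [
    {"text": "Acme Corp opened in Berlin",
     "spans": [{"start": 0, "end": 9, "label": "ORG"}]},
]
data_2 = [
    {"text": "Acme Corp opened in Berlin",
     "spans": [{"start": 20, "end": 26, "label": "CITY"}]},
    {"text": "Nothing to see here", "spans": []},
]

combined = merge_annotations(data_1, data_2)
for eg in combined:
    print(eg["text"], [s["label"] for s in eg["spans"]])
```

A model trained from scratch on the combined examples learns both labels together, rather than having one label layered on top of a model that only knows the other.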

Thank you so much, @ines. I intend to experiment with both the combined model and separate models.