How to compare the performance of 2 textcat models

I have 2 textcat models. One was trained on manually annotated data, the other was trained with active learning. The active learning model only used a subset of the total dataset. Now I want to compare the performance of the 2 models.

Prodigy's train command doesn't seem to do this, because it only evaluates against its own data. In my case, we would just be comparing apples with oranges.

It would make more sense to evaluate both models on the same unseen dataset. I haven't seen how to do that in Prodigy. It looks like I will have to export the 2 models to spaCy and evaluate them there with the unseen evaluation set. Am I right?

You don't have to let Prodigy split the examples – that's just the default if no evaluation set is provided, so there's something to evaluate on. The --eval-id argument lets you pass in the name of a dataset used for evaluation. So if you're serious about evaluating your models and comparing the performance, that's probably what you want to be using.
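To make the idea of a shared evaluation set concrete, here is a minimal Python sketch that scores several classifiers against the same held-out examples. It is not part of Prodigy or spaCy: the names `evaluate`, `compare`, `Example` and `Predictor` are hypothetical, the 0.5 threshold is an assumption, and the micro-averaged precision/recall/F1 shown here may differ from the score Prodigy reports.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical types for this sketch:
Example = Tuple[str, Dict[str, bool]]          # (text, gold label -> True/False)
Predictor = Callable[[str], Dict[str, float]]  # text -> label -> score in [0, 1]


def evaluate(predict: Predictor, examples: Iterable[Example],
             threshold: float = 0.5) -> Dict[str, float]:
    """Micro-averaged precision, recall and F1 over all annotated (text, label) pairs."""
    tp = fp = fn = 0
    for text, gold in examples:
        scores = predict(text)
        for label, is_gold in gold.items():
            # A label missing from the prediction counts as score 0.0.
            is_pred = scores.get(label, 0.0) >= threshold
            if is_pred and is_gold:
                tp += 1
            elif is_pred and not is_gold:
                fp += 1
            elif is_gold:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare(models: Dict[str, Predictor],
            examples: List[Example]) -> Dict[str, Dict[str, float]]:
    """Score every model on the same list of held-out examples."""
    return {name: evaluate(predict, examples) for name, predict in models.items()}
```

For a trained spaCy pipeline `nlp`, a predictor could be as simple as `lambda text: nlp(text).cats`, since `doc.cats` holds the per-label scores. The important part is that both models see exactly the same examples, none of which were used to train either one.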

That said, you can also export your data and train with spaCy directly, which makes sense once you're done with annotation and just want to run experiments, or if you want more fine-grained control over the training. The prodigy train command was mostly designed as a quick way to run experiments from Prodigy datasets and see how you're doing, but it's not necessarily how you have to train your final models.
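If you go the export route, a dataset can be written out as JSONL with `prodigy db-out` and then converted into whatever format your training or evaluation script expects. The sketch below is only an illustration: `records_to_examples` is a hypothetical helper, and it assumes binary textcat annotations where each record has `"text"`, `"label"` and `"answer"` keys. Records from a multiple-choice interface store the selected labels differently and would need their own handling.

```python
import json
from typing import Dict, Iterable, List, Tuple


def records_to_examples(lines: Iterable[str]) -> List[Tuple[str, Dict[str, bool]]]:
    """Turn exported JSONL lines into (text, {label: True/False}) pairs.

    Assumes binary annotations: "accept" means the label applies,
    "reject" means it does not. Anything else (e.g. "ignore") is skipped.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        answer = record.get("answer")
        if answer not in ("accept", "reject"):
            continue
        if "text" not in record or "label" not in record:
            continue
        examples.append((record["text"], {record["label"]: answer == "accept"}))
    return examples
```

You could call it on an open file, e.g. `records_to_examples(open("eval.jsonl", encoding="utf8"))`, where `eval.jsonl` is a placeholder for your own export.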