Combining NER with text classification

Yes, absolutely! Both ner.batch-train and textcat.batch-train export loadable spaCy models, so you could start off with a blank or default spaCy model, train a model on your NER annotations and then use it as the input model for textcat.teach or textcat.batch-train. For example:

prodigy textcat.batch-train my_textcat_dataset /path/to/ner-model ...

Ideally, you would create two separate datasets – one for your NER annotations and one for your text classifier annotations. You can use the same input data for both sets.

To achieve better NER accuracy, you might also want to try collecting annotations with ner.teach – especially if you're training new entity types from scratch. ner.manual is great for creating gold-standard data and evaluation sets, but to properly train a new type from manual annotations alone, you need a lot of them – ideally thousands or more. Using ner.teach with a patterns file containing examples of the entities you're looking for can speed up the process, because the model in the loop can help you collect more relevant annotations.
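To make the patterns file a bit more concrete, here's a minimal sketch that writes one from a list of example phrases. It assumes the JSONL patterns format with a "label" and a token-based "pattern" per line. The SYSTEM label, the phrases, the file name and the helper functions are all made up for illustration, and the whitespace split is only a rough stand-in for spaCy's tokenizer.

```python
import json
from pathlib import Path


def build_patterns(label, phrases):
    """Create one token-based pattern per example phrase."""
    patterns = []
    for phrase in phrases:
        # Naive whitespace split; spaCy's tokenizer may split some phrases differently.
        tokens = [{"lower": token.lower()} for token in phrase.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Hypothetical label and seed phrases; replace them with your own.
    patterns = build_patterns("SYSTEM", ["Linux", "Windows 10", "macOS"])
    write_jsonl("system_patterns.jsonl", patterns)
```

You could then pass the resulting file to ner.teach via its --patterns argument, together with the matching --label.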

In case you haven't seen it yet, here's our video tutorial on training a new entity type. I also wrote more detailed comments about training NER from scratch here and here.

textcat.batch-train expects all annotations in the dataset to have a "label" field containing the category label. Maybe your dataset contains examples without a label? As I mentioned above, annotations you collect for different tasks (NER, textcat) should ideally have their own datasets. So a possible explanation for the error could be that your dataset contains both text classification and NER annotations (which don't have a label set).

You can use the db-out command to preview or export your dataset and check:

prodigy db-out mytest_systems_2 | less  # preview dataset
prodigy db-out mytest_systems_2 /tmp    # export dataset to a JSONL file in /tmp

If it turns out that your dataset contains examples you want to exclude, you can edit the exported JSONL file manually and use db-in to import it into a new dataset. Each annotation session is also available in the database as a session dataset (named after the timestamp) – so you can also view and export individual sessions. To see a list of all datasets and session datasets, you can use the prodigy stats -ls command.
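If the export is too large to edit by hand, the check can also be scripted. Here's a rough sketch, assuming the export is a JSONL file with one annotation per line and that text classification examples have a non-empty "label" field, while NER examples don't. The file paths and helper names are just placeholders.

```python
import json
import sys
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def split_by_label(examples):
    """Separate examples with a non-empty "label" from those without one."""
    labelled, unlabelled = [], []
    for eg in examples:
        if eg.get("label"):
            labelled.append(eg)
        else:
            unlabelled.append(eg)
    return labelled, unlabelled


if __name__ == "__main__":
    # Usage: python filter_labels.py exported.jsonl textcat_only.jsonl
    in_path, out_path = sys.argv[1], sys.argv[2]
    labelled, unlabelled = split_by_label(read_jsonl(in_path))
    print(f"{len(labelled)} examples with a label, {len(unlabelled)} without")
    write_jsonl(out_path, labelled)
```

The filtered file can then be imported into a fresh dataset with db-in and used for textcat.batch-train.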