Combining NER with text classification

ines · February 11, 2018, 2:54pm

Yes, absolutely! Both ner.batch-train and textcat.batch-train export loadable spaCy models, so you could start off with a blank or default spaCy model, train a model on your NER annotations and then use it as the input model for textcat.teach or textcat.batch-train. For example:

prodigy textcat.batch-train my_textcat_dataset /path/to/ner-model ...
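
Once the text classifier has been trained on top of the NER model, the exported model should expose both predictions on the same doc. Here's a minimal sketch of what using it could look like, assuming the exported pipeline kept the entity recognizer and gained a text classifier. The analyze helper is a hypothetical name for illustration, not part of Prodigy or spaCy:

```python
def analyze(nlp, text):
    """Run a loaded spaCy pipeline and collect entities and category scores.

    `nlp` is any loaded spaCy pipeline, e.g. the result of spacy.load()
    on the directory exported by textcat.batch-train.
    """
    doc = nlp(text)
    # doc.ents holds the spans predicted by the entity recognizer
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # doc.cats maps each category label to a score from the text classifier
    return {"entities": entities, "cats": dict(doc.cats)}
```

You would call it with something like analyze(spacy.load("/path/to/exported-model"), "Some text"), where the path is a placeholder for wherever you exported the model.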

Ideally, you would create two separate datasets – one for your NER annotations and one for your text classifier annotations. You can use the same input data for both sets.

To achieve better NER accuracy, you might also want to try training your model with ner.teach – especially if you're training new entity types from scratch. ner.manual is great to create gold-standard data and evaluation sets, but in order to properly train a new type, you need a lot of manual annotations – ideally thousands or more. Using ner.teach and a patterns file with examples of the entities you're looking for can speed up the process, because the model in the loop can help you collect more relevant annotations.

In case you haven't seen it yet, here's our video tutorial on training a new entity type. I also wrote more detailed comments about training NER from scratch here and here.

textcat.batch-train expects all annotations in the dataset to have a "label" field containing the category label. Maybe your set contains examples without a label set? As I mentioned above, annotations you collect for different tasks (NER, textcat) should ideally have their own datasets. So a possible explanation for the error could be that your set contains both text classification and NER annotations (which don't have a label set).

You can use the db-out command to preview or export your dataset and check:

prodigy db-out mytest_systems_2 | less  # preview dataset
prodigy db-out mytest_systems_2 /tmp    # export dataset to a file

If it turns out that your set contains examples you want to exclude, you can edit the JSONL file manually and use db-in to import it to a new dataset. Each annotation session is also available in the database as a session dataset (named after the timestamp) – so you can also view and export individual sessions. To see a list of all datasets and session sets, you can use the prodigy stats -ls command.
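
If you'd rather not edit the JSONL by hand, a small script can do the filtering. This is only a sketch using the standard library: it keeps the records that have a non-empty "label" field and sets aside the rest. The function names and file paths are made up for illustration, and it assumes the export is one JSON object per line:

```python
import json
from pathlib import Path


def split_by_label(lines):
    """Split JSONL lines into records with and without a "label" value."""
    with_label, without_label = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("label"):
            with_label.append(record)
        else:
            without_label.append(record)
    return with_label, without_label


def filter_file(src, dst):
    """Write only the labelled records from src to dst and return both counts."""
    with Path(src).open(encoding="utf8") as f:
        keep, drop = split_by_label(f)
    with Path(dst).open("w", encoding="utf8") as f:
        for record in keep:
            f.write(json.dumps(record) + "\n")
    return len(keep), len(drop)
```

For example, filter_file("/tmp/mytest_systems_2.jsonl", "/tmp/textcat_only.jsonl") returns the number of examples kept and dropped (both paths are placeholders, so point them at wherever db-out wrote your export). You can then import the filtered file into a new dataset with db-in.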
