Combining NER with text classification


I’m quite new to NLP and I’m trying to do chatbot intent detection. So I’m able to do text classification and NER using Prodigy, but can I combine the two into a single model which I can load into spacy? Does that make sense?

This is what I’ve done:

  • Created a dataset
  • Used ner.manual to identify entities in a set of input data (which is basically a bunch of email transcripts)
  • Used textcat.teach to associate labels to text (such as SUPPORT_REQUEST, ACCESS_REQUEST etc)

I then tried to export a model for use in spacy, and I get the error below.

$ prodigy textcat.batch-train mytest_systems_2 --output mytest_model --eval-split 0.2

Loaded blank model
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dev/prodigy_test/lib/python3.6/site-packages/prodigy/", line 253, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/dev/prodigy_test/lib/python3.6/site-packages/", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/dev/prodigy_test/lib/python3.6/site-packages/", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/dev/prodigy_test/lib/python3.6/site-packages/prodigy/recipes/", line 110, in batch_train
    labels = {eg['label'] for eg in examples}
  File "/Users/dev/prodigy_test/lib/python3.6/site-packages/prodigy/recipes/", line 110, in <setcomp>
    labels = {eg['label'] for eg in examples}
KeyError: 'label'

Yes, absolutely! Both ner.batch-train and textcat.batch-train export loadable spaCy models, so you could start off with a blank or default spaCy model, train a model on your NER annotations and then use it as the input model for textcat.teach. For example:

prodigy textcat.batch-train my_textcat_dataset /path/to/ner-model ...

Ideally, you would create two separate datasets – one for your NER annotations and one for your text classifier annotations. You can use the same input data for both sets.

To achieve better NER accuracy, you might also want to try training your model with ner.teach – especially if you're training new entity types from scratch. ner.manual is great to create gold-standard data and evaluation sets, but in order to properly train a new type, you need a lot of manual annotations – ideally thousands or more. Using ner.teach and a patterns file with examples of the entities you're looking for can speed up the process, because the model in the loop can help you collect more relevant annotations.

In case you haven't seen it yet, here's our video tutorial on training a new entity type. I also wrote more detailed comments about training NER from scratch here and here.

textcat.batch-train expects all annotations in the dataset to have a "label" field containing the category label. Maybe your set contains examples without a label set? As I mentioned above, annotations you collect for different tasks (NER, textcat) should ideally have their own datasets. So a possible explanation for the error could be that your set contains both text classification and NER annotations (which don't have a label set).

You can use the db-out command to preview or export your dataset and check:

prodigy db-out mytest_systems_2 | less  # preview dataset
prodigy db-out mytest_systems_2 /tmp    # export dataset to a file

If it turns out that your set contains examples you want to exclude, you can edit the JSONL file manually and use db-in to import it to a new dataset. Each annotation session is also available in the database as a session dataset (named after the timestamp) – so you can also view and export individual sessions. To see a list of all datasets and session sets, you can use the prodigy stats -ls command.
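For example, here's a minimal sketch of that check. The example lines below are made up, but they show the difference: textcat annotations carry a top-level "label", while NER annotations store their labels on the "spans" instead, which is exactly what would trip up textcat.batch-train.

```python
import json

# Hypothetical lines as exported by `prodigy db-out mytest_systems_2`:
# two textcat examples (top-level "label") and one NER example ("spans").
lines = [
    '{"text": "Please reset my password", "label": "SUPPORT_REQUEST", "answer": "accept"}',
    '{"text": "Grant me access to the wiki", "label": "ACCESS_REQUEST", "answer": "accept"}',
    '{"text": "Email from John Smith", "spans": [{"start": 11, "end": 21, "label": "PERSON"}], "answer": "accept"}',
]

with_label, without_label = [], []
for line in lines:
    eg = json.loads(line)
    # Examples without a top-level "label" are the ones that would
    # raise KeyError: 'label' in textcat.batch-train.
    (with_label if "label" in eg else without_label).append(eg)

print(len(with_label), "examples with a top-level label")
print(len(without_label), "examples without one")
```

If `without_label` is non-empty, those are the examples to move out into a separate NER dataset.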

@ines thank you for the detailed reply.

This is the bit I don't get. Why should I create two separate datasets? And if I do create them as separate datasets, how do I combine them into a single model? Or do I generate a model from each dataset and load both into spaCy? It's quite possible that I'm speaking complete nonsense.

No worries :slightly_smiling_face: We're definitely introducing a lot of new concepts in Prodigy, so it's totally fine if you have questions. Answers below!

The main reason is that the annotation data produced by the ner and textcat recipes is specific to the training task. Prodigy comes with separate training recipes: ner.batch-train updates or adds the 'ner' component to the model, and textcat.batch-train updates or adds the 'textcat' component. There are various subtle differences in how we've optimised the updating of the different components, and the recipes also output slightly different statistics.

So if you want to train both components, you need to update the same model twice with the respective annotations. For example:

prodigy ner.batch-train your_ner_set --output /path/to/ner-model
prodigy textcat.batch-train your_textcat_set /path/to/ner-model --output /path/to/ner-textcat-model

The model exported to /path/to/ner-textcat-model should then include weights for both the entity recognizer and text classifier. You can also verify this by looking at the directories within the model.
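For example, you could load the exported model and inspect its pipeline. This sketch uses the spaCy v3 `add_pipe` API, with a blank pipeline standing in for the exported model, since the path above is only a placeholder:

```python
import spacy

# Stand-in for nlp = spacy.load("/path/to/ner-textcat-model"):
# a blank English pipeline with both components added, to show
# what you'd expect to see after both training steps.
nlp = spacy.blank("en")
nlp.add_pipe("ner")
nlp.add_pipe("textcat")

# A model trained on both datasets should list both components.
print(nlp.pipe_names)  # ['ner', 'textcat']
```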

In theory, you could write a custom recipe that trains both at the same time. As far as spaCy is concerned, this is definitely possible. But this also means that if the results are not satisfying, or there is some problem with your data, it'll make it a lot harder to debug and find out what's wrong.

Another reason is that you'll likely end up wanting slightly different annotations for training the two components. A big advantage of Prodigy is that it lets you iterate quickly and try out new ideas to see if they can improve your model. So as you keep experimenting, you might also want to experiment with different datasets.

If you're using the active learning-powered recipes like ner.teach or textcat.teach, Prodigy will use the model's predictions to suggest what to annotate next. This means that the selection of examples will always be biased (in a good way, though!). But it also means that the examples the NER model selects to improve its predictions aren't necessarily the best examples to annotate for text classification, and vice versa.

I’m also interested in training NER and text classification on the same corpus so this thread is relevant to me. However, I may have a different use case than @cmtru because I want to do joint learning of these tasks.

As I’ve already described in other posts, I’m trying to do NER on documents, but the documents are tens of pages long. The context is likely too long for CNN- or LSTM-type methods to be effective, so I need to find a way to segment the documents into smaller pieces.

Luckily the entities I’m trying to extract appear in contexts about a paragraph in length. So if I can find the right “paragraph of interest” I can do a good job of extracting entities from it. These paragraphs of interest are themselves variable in form, so it’s a machine learning task to distinguish them from the other paragraphs in the document.

I’ve been framing this as a two stage process. First a binary text categorization model identifies the likely paragraphs of interest, then an NER model extracts the entities from those paragraphs. Both the text categorization and NER models are trained using Prodigy’s standard active learning techniques. (You and @honnibal have been helping me find a way to seed this process with phrases instead of just words.) And if I want two separate models, the reasons you give earlier in this post for training them separately make sense.

However, it seems like joint learning might be more effective. Instead of using the text classifier to make a hard decision about whether to examine a particular paragraph, it should merely contribute a probability. Likewise, the presence of the named entities I’m looking for can be a clue that the paragraph that contains them is one I care about. Basically I have two separate but related kinds of signal, and I want to combine them, both at runtime and during Prodigy’s active learning loop.

I don’t think Prodigy/spaCy is set up to do this kind of joint learning out of the box. Even if you have both NER and text categorization pipelines in the same model, the NER model doesn’t incorporate the textcat labels (as far as I can tell from watching @honnibal’s video tutorial about the NER model), and the text categorization model doesn’t take labeled NER spans as features. Am I correct about this?

I think if I want to do this kind of joint learning I have to write the model myself. Maybe use spaCy to extract features and then write my own CNN or LSTM in Keras that does NER with an additional paragraph-of-interest feature. Or maybe find a way to reframe paragraph detection as an attention mechanism. This seems doable, and because spaCy/Prodigy is a pluggable architecture, I’d be able to incorporate it, but it still seems like a lot of work, so I’m wondering if there’s some easier way to accomplish the task already built into these tools. (Like if I just attached a contained-in-a-paragraph-of-interest probability to each token as a feature, would that fold the segmentation signal into an NER model? Or is this a job for a multitask objective?)

Do I have to roll my own joint learning system, or is this capability already built into spaCy/Prodigy in a way that I’m just overlooking?


@wpm If you segment the document into paragraphs, you can run the NER over all the paragraphs though, right?

I think it makes good sense to use the text classifier during training to find paragraphs with a high enough density of entities to make your annotation effort productive. But at runtime, where you just want the tool to extract entities, you may as well run it over the whole text.

As far as doing joint learning goes: there are a few ways you can do this. One solution would be to share the CNN layer between the NER component and the text classifier. This may or may not help: it does help a little to share the weights between the POS tagger and parser, but the disadvantage is you have to train the two together, which is a pain.

Another way to do joint NER and textcat would be to condition the NER labels on the type label applied to the text. For instance, you might jointly learn role labels for movie reviews with a scheme like POSITIVE_ACTOR and NEGATIVE_ACTOR.
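To make that concrete, here's a toy sketch (all names hypothetical, not a Prodigy API) of deriving joint labels by combining the document-level category with each span's label before training:

```python
# Sketch: prefix each span's NER label with the document-level
# category predicted by the text classifier, so ACTOR becomes
# POSITIVE_ACTOR or NEGATIVE_ACTOR. Hypothetical helper, not part
# of Prodigy or spaCy.
def join_labels(example):
    doc_label = example["label"]  # e.g. "POSITIVE" from the text classifier
    for span in example.get("spans", []):
        span["label"] = f"{doc_label}_{span['label']}"
    return example

example = {
    "text": "Tom Hanks was wonderful in this film.",
    "label": "POSITIVE",
    "spans": [{"start": 0, "end": 9, "label": "ACTOR"}],
}
print(join_labels(example)["spans"][0]["label"])  # POSITIVE_ACTOR
```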

While it’s not a joint strategy, a cheap way of including text classification labels as features would be to add the label as a token in the sentence (likely the first token). I doubt this would be very effective, though.
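A toy sketch of that idea, purely illustrative:

```python
# Sketch: prepend the predicted category as a pseudo-token so the NER
# model can "see" it. As noted above, this is cheap but probably not
# very effective. Hypothetical helper, not a Prodigy/spaCy API.
def add_label_token(text, label):
    return f"__{label}__ {text}"

print(add_label_token("Tom Hanks was wonderful.", "POSITIVE"))
# __POSITIVE__ Tom Hanks was wonderful.
```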


Good point. Maybe I should try just doing a standard NER training recipe except with a phrase matcher prepending likely paragraphs to the head of the stream, like in the current textcat recipe.

That, however, means I'm back to being blocked on my other question about extracting a set of examples from a stream.

I want to run an NER training task over a stream of paragraphs, and I want to move the paragraphs that are likely to contain named entities to the head of the stream. I can recognize these paragraphs because they also contain particular phrases. So I want to write a stream filter that moves paragraphs containing those phrases to the front of the stream. I'm back to wanting a function like find_with_terms(stream, seeds, at_least=10, at_most=1000, give_up_after=10000) except it would be find_with_phrases. The problem is I'm still not sure how to write a find_with_phrases that doesn't exhaust the original stream.

In the other thread you gave me an example recipe that did a combine_models on a text categorization model and a phrase matcher. That got around the "exhaust the stream" problem by having the combined model rank a single stream.

I'm playing with cloning the generator stream right now, but any guidance you could give me would help here. Maybe just a thumbnail sketch of how find_with_terms works, so I could write my own modification of it.
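For what it's worth, here's the kind of sketch I'm playing with: a find_with_phrases that tees the generator so that scanning for matches doesn't exhaust the original stream. Plain substring matching stands in for a real PhraseMatcher, and the signature just mirrors the find_with_terms signature above; none of this is Prodigy's actual implementation.

```python
from itertools import chain, tee

def find_with_phrases(stream, phrases, at_most=1000, give_up_after=10000):
    """Collect up to `at_most` tasks whose text contains one of the
    phrases, scanning at most `give_up_after` tasks, then replay the
    untouched copy of the stream after them. tee() buffers what the
    scan consumes, so the second copy still yields every task."""
    stream_a, stream_b = tee(stream, 2)
    matched = []
    for i, task in enumerate(stream_a):
        if i >= give_up_after or len(matched) >= at_most:
            break
        if any(p in task["text"] for p in phrases):
            matched.append(task)
    # Matched tasks come first, then the full replayed stream. The
    # matched tasks appear twice, so duplicates still need filtering
    # downstream (e.g. Prodigy's dedup hashing).
    return chain(matched, stream_b)

tasks = ({"text": t} for t in ["alpha beta", "gamma", "delta beta"])
reordered = list(find_with_phrases(tasks, ["beta"], at_most=10))
print([t["text"] for t in reordered])
```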

I Figured It Out

I pass in a graf_patterns option to the ner.teach recipe and use it to make the following modifications to the task stream.

# Inside the modified ner.teach recipe. Assumed imports (Prodigy v1-era
# module paths; concat may come from toolz or cytoolz depending on install):
from itertools import tee
from cytoolz import concat
from spacy.matcher import PhraseMatcher
from prodigy.components.loaders import get_stream
from prodigy.util import log

if graf_patterns:
    # Build a phrase matcher from one phrase per line in the patterns file.
    matcher = PhraseMatcher(nlp.vocab)
    with open(graf_patterns) as f:
        matcher.add("Paragraph", None, *nlp.pipe(line.strip() for line in f))
    # Clone the stream so scanning for matches doesn't exhaust it.
    stream, stream_a, stream_b = tee(stream, 3)
    tasks = zip(nlp.pipe(task["text"] for task in stream_a), stream_b)
    likely_paragraphs = [task for document, task in tasks if matcher(document)]
    for task in likely_paragraphs:
        task["meta"]["source"] = "graf-match"
        log("GRAF MATCH: {}".format(task))
    # Put the likely paragraphs first; rehup and dedup filter out the
    # copies replayed by the tee'd stream.
    stream = concat([likely_paragraphs, stream])
    stream = get_stream(stream, rehash=True, dedup=True)

This seems to do the trick. I'm still curious how you implement find_with_terms though.

That's a great and helpful discussion. I know this is an old post, but I was wondering if there has been any update on tackling this issue?
I am trying to do the same thing: use NER and text categorization in the same model, so the model can look at the NER component and then leverage the identified entities for text categorization.
I have used this recipe in spaCy before
python -m spacy init config configs/config_trf.cfg --lang en --pipeline ner,textcat
However, after reading @wpm's comment I'm not sure whether this pipeline actually leverages the NER component for text categorization, and if it does, whether there is an equivalent recipe for Prodigy.

Hi @shahinshirazi ,

python -m spacy init config configs/config_trf.cfg --lang en --pipeline ner,textcat

This config will, indeed, update the ner and textcat components in isolation, i.e. you wouldn't get the effect of joint learning.

As of spaCy 3.1 it is possible to propagate predictions between components, so that labels obtained from NER can be used as features in textcat. The default textcat architectures don't use NER features, though, so you'd need to provide a custom model that can leverage the annotating_components feature for textcat.
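For reference, a minimal sketch of how that setting looks in a training config. This is only a fragment; the surrounding config is assumed to come from `spacy init config` as above, and a custom textcat architecture is still needed to actually use the entities:

```ini
[nlp]
lang = "en"
pipeline = ["ner","textcat"]

[training]
# Components listed here write their predictions to the Doc during
# training, so components later in the pipeline (here: textcat) can
# see them as annotations.
annotating_components = ["ner"]
```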