Good point. Maybe I should try just doing a standard NER training recipe except with a phrase matcher prepending likely paragraphs to the head of the stream, like in the current textcat recipe.
That, however, means I'm back to being blocked on my other question about extracting a set of examples from a stream.
I want to run an NER training task over a stream of paragraphs, and I want to move the paragraphs that are likely to contain named entities to the head of the stream. I can recognize these paragraphs because they also contain particular phrases, so I want to write a stream filter that moves paragraphs containing those phrases to the front. In other words, I'm back to wanting a function like find_with_terms(stream, seeds, at_least=10, at_most=1000, give_up_after=10000), except it would be find_with_phrases. The problem is that I'm still not sure how to write a find_with_phrases that doesn't exhaust the original stream.
In the other thread you gave me an example recipe that did a combine_models on a text categorization model and a phrase matcher. That got around the "exhaust the stream" problem by having the combined model rank a single stream.
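Just so I'm sure I followed the idea: I read it as something like the toy sketch below, where two scoring functions share one pass over the stream instead of each needing its own copy of the generator. This is only my paraphrase of the concept, not the real combine_models, and the max() combination is a placeholder I made up.

    def combine_rankers(model_score, matcher_score):
        # Toy illustration: both rankers score each task on the same pass,
        # so neither one needs its own copy of the generator.
        def rank(stream):
            for task in stream:
                yield max(model_score(task), matcher_score(task)), task
        return rank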
I'm playing with cloning the generator stream right now, but any guidance you could give me would help here. Maybe just a thumbnail sketch of how find_with_terms works, so I could write my own modification of it.
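For what it's worth, this is roughly the shape of what I've been experimenting with: a hypothetical find_with_phrases that uses itertools.tee so the lookahead pass doesn't consume the original generator. Everything here is my own guess at how such a helper could look (it's not Prodigy code), and it assumes the spaCy 2 PhraseMatcher.add signature and tasks with a "text" key.

    from itertools import islice, tee
    from spacy.matcher import PhraseMatcher

    def find_with_phrases(stream, phrases, nlp,
                          at_least=10, at_most=1000, give_up_after=10000):
        # Build a matcher for the seed phrases.
        matcher = PhraseMatcher(nlp.vocab)
        matcher.add("Phrase", None, *nlp.pipe(phrases))

        # tee() gives a lookahead copy; items pulled from it are buffered so
        # the main copy can still yield them later.
        stream, lookahead = tee(stream)

        matches = []
        for task in islice(lookahead, give_up_after):
            if matcher(nlp(task["text"])):
                matches.append(task)
                if len(matches) >= at_most:
                    break

        if len(matches) >= at_least:
            # Matching paragraphs first, then the rest of the stream, skipping
            # the texts that were already moved to the front.
            matched_texts = {task["text"] for task in matches}
            yield from matches
            yield from (task for task in stream if task["text"] not in matched_texts)
        else:
            # Not enough matches: fall back to the untouched stream.
            yield from stream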
I Figured It Out
I pass in a graf_patterns option to the ner.teach recipe and use it to make the following modifications to the task stream.
    if graf_patterns:
        # Build a phrase matcher from the patterns file (spaCy 2.x add() signature).
        matcher = PhraseMatcher(nlp.vocab)
        with open(graf_patterns) as f:
            matcher.add("Paragraph", None, *nlp.pipe(line.strip() for line in f))
        # Clone the stream: one copy to keep, two copies for the lookahead pass.
        stream, stream_a, stream_b = tee(stream, 3)
        tasks = zip(nlp.pipe(task["text"] for task in stream_a), stream_b)
        # Collect the tasks whose text matches one of the phrase patterns.
        likely_paragraphs = [task for document, task in tasks if matcher(document)]
        for task in likely_paragraphs:
            task["meta"]["source"] = "graf-match"
            log("GRAF MATCH: {}".format(task))
        # Put the matching paragraphs at the head of the stream, then dedupe so
        # they don't show up a second time when the cloned copy reaches them.
        stream = concat([likely_paragraphs, stream])
        stream = get_stream(stream, rehash=True, dedup=True)
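One caveat I'm aware of with this: building likely_paragraphs consumes stream_a and stream_b completely, so tee ends up buffering every incoming task before the first question is shown. The matching paragraphs get moved to the front, but the whole source is effectively read into memory up front, so this only makes sense when the input is small enough for that to be acceptable.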
This seems to do the trick. I'm still curious how you implement find_with_terms though.