Good point. Maybe I should try just doing a standard NER training recipe except with a phrase matcher prepending likely paragraphs to the head of the stream, like in the current textcat recipe.
That, however, means I'm back to being blocked on my other question about extracting a set of examples from a stream.
I want to run an NER training task over a stream of paragraphs, and I want to move the paragraphs that are likely to contain named entities to the head of the stream. I can recognize these paragraphs because they also contain particular phrases, so I want to write a stream filter that moves paragraphs containing those phrases to the front. In other words, I'm back to wanting a function like find_with_terms(stream, seeds, at_least=10, at_most=1000, give_up_after=10000), except it would be find_with_phrases. The problem is that I'm still not sure how to write a find_with_phrases that doesn't exhaust the original stream.
In the other thread you gave me an example recipe that did a combine_models on a text categorization model and a phrase matcher. That got around the "exhaust the stream" problem by having the combined model rank a single stream.
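Just so I'm sure I followed the idea: I read it as something like the toy sketch below, where two scoring functions share one pass over the stream instead of each needing its own copy of the generator. This is only my paraphrase of the concept, not the real combine_models, and the max() combination is a placeholder I made up.

    def combine_rankers(model_score, matcher_score):
        # Toy illustration: both rankers score each task on the same pass,
        # so neither one needs its own copy of the generator.
        def rank(stream):
            for task in stream:
                yield max(model_score(task), matcher_score(task)), task
        return rank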
I'm playing with cloning the generator stream right now, but any guidance you could give me would help here. Maybe just a thumbnail sketch of how find_with_terms works, so I could write my own modification of it.
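For what it's worth, this is roughly the shape of what I've been experimenting with: a hypothetical find_with_phrases that uses itertools.tee so the lookahead pass doesn't consume the original generator. Everything here is my own guess at how such a helper could look (it's not Prodigy code), and it assumes the spaCy 2 PhraseMatcher.add signature and tasks with a "text" key.

    from itertools import islice, tee
    from spacy.matcher import PhraseMatcher

    def find_with_phrases(stream, phrases, nlp,
                          at_least=10, at_most=1000, give_up_after=10000):
        # Build a matcher for the seed phrases.
        matcher = PhraseMatcher(nlp.vocab)
        matcher.add("Phrase", None, *nlp.pipe(phrases))

        # tee() gives a lookahead copy; items pulled from it are buffered so
        # the main copy can still yield them later.
        stream, lookahead = tee(stream)

        matches = []
        for task in islice(lookahead, give_up_after):
            if matcher(nlp(task["text"])):
                matches.append(task)
                if len(matches) >= at_most:
                    break

        if len(matches) >= at_least:
            # Matching paragraphs first, then the rest of the stream, skipping
            # the texts that were already moved to the front.
            matched_texts = {task["text"] for task in matches}
            yield from matches
            yield from (task for task in stream if task["text"] not in matched_texts)
        else:
            # Not enough matches: fall back to the untouched stream.
            yield from stream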
I Figured It Out
I pass in a graf_patterns option to the ner.teach recipe and use it to make the following modifications to the task stream.
    if graf_patterns:
        # Build a phrase matcher from the patterns file (spaCy 2.x add() signature).
        matcher = PhraseMatcher(nlp.vocab)
        with open(graf_patterns) as f:
            matcher.add("Paragraph", None, *nlp.pipe(line.strip() for line in f))
        # Clone the stream: one copy to keep, two copies for the lookahead pass.
        stream, stream_a, stream_b = tee(stream, 3)
        tasks = zip(nlp.pipe(task["text"] for task in stream_a), stream_b)
        # Collect the tasks whose text matches one of the phrase patterns.
        likely_paragraphs = [task for document, task in tasks if matcher(document)]
        for task in likely_paragraphs:
            task["meta"]["source"] = "graf-match"
            log("GRAF MATCH: {}".format(task))
        # Put the matching paragraphs at the head of the stream, then dedupe so
        # they don't show up a second time when the cloned copy reaches them.
        stream = concat([likely_paragraphs, stream])
        stream = get_stream(stream, rehash=True, dedup=True)
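One caveat I'm aware of with this: building likely_paragraphs consumes stream_a and stream_b completely, so tee ends up buffering every incoming task before the first question is shown. The matching paragraphs get moved to the front, but the whole source is effectively read into memory up front, so this only makes sense when the input is small enough for that to be acceptable.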
This seems to do the trick. I'm still curious how you implement find_with_terms though.