## Initial Question
I’m doing a text categorization task on paragraphs. The best way to recognize the paragraphs I’m interested in is not with individual words but rather short phrases.
I’d like to write a `--seeds` file that takes two- and three-word phrases on each line instead of individual words, but that doesn’t appear to be how the `textcat.teach` recipe works. (Because `find_with_terms` uses proximity in embedding space? I can’t tell, since I can’t see the source.)
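To make the goal concrete, the kind of seeds file I have in mind would look something like this (the phrases are made up):

```
machine learning
neural network
support vector machine
```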
- Is there currently a Prodigy command line configuration that allows me to seed text categorization with short phrases? (Or patterns?)
- If not, what is the easiest way to write one? I’m guessing I copy `textcat.teach` and replace `find_with_terms` with my own code. Can I just write this as a filter on a set of patterns to match?
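Independent of Prodigy’s internals, the filter I have in mind is just a generator over the stream. A minimal sketch, with a stub substring check standing in for spaCy’s `PhraseMatcher` (the phrases are made up):

```python
# Hypothetical phrase seeds; a real version would use spaCy's PhraseMatcher.
PHRASES = {"machine learning", "neural network"}

def phrase_match(text):
    # Stub matcher: plain substring check instead of tokenized matching.
    return any(phrase in text.lower() for phrase in PHRASES)

def filter_stream(stream):
    # Yield only the examples whose text contains one of the seed phrases.
    for eg in stream:
        if phrase_match(eg["text"]):
            yield eg

examples = [
    {"text": "Machine learning is everywhere"},
    {"text": "An unrelated paragraph"},
    {"text": "We trained a neural network"},
]
matched = list(filter_stream(examples))
# matched keeps the first and third examples
```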
## My First Attempt at an Answer (With Further Questions)
I wrapped the `textcat.teach` recipe with code that reads the seeds file into a `PhraseMatcher`. In essence I replace the original recipe’s `find_with_terms` with my own `find_with_phrases`.
This recipe finds four examples that my patterns pick out, and then Prodigy says “No More Samples”. I haven’t been able to figure out how to make the recipe continue to stream examples, both the additional ones my patterns match and others that the model hypothesizes.
I suspect I’m mishandling the `stream` object by exhausting the generator. However, I can’t figure out the right way to handle this. Do I have to make copies of the stream? Are streams set up to loop infinitely? I don’t see this in the documentation, and I can’t step into `find_with_terms` to see how it is implemented.
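For reference, this is the generator behavior I suspect is biting me (plain Python, no Prodigy involved):

```python
def make_stream():
    # Stand-in for a Prodigy stream: a one-shot generator of examples.
    for i in range(5):
        yield {"text": "example {}".format(i)}

stream = make_stream()
first_pass = list(stream)   # consumes every item
second_pass = list(stream)  # the generator is now exhausted

assert len(first_pass) == 5
assert second_pass == []
```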
Here’s my code.
```python
import cytoolz
import spacy
from prodigy import recipe, recipe_args
from prodigy.recipes.textcat import teach
from prodigy.util import log, get_seeds
from spacy.matcher import PhraseMatcher


@recipe("textcat.teach",
        dataset=recipe_args["dataset"],
        spacy_model=recipe_args["spacy_model"],
        source=recipe_args["source"],
        label=recipe_args["label"],
        api=recipe_args["api"],
        loader=recipe_args["loader"],
        seeds=recipe_args["seeds"],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args["exclude"])
def teach_with_phrases(dataset, spacy_model, source=None, label="", api=None,
                       loader=None, seeds=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log("RECIPE: Starting recipe textcat.teach with phrase seeds", locals())
    components = teach(dataset, spacy_model, source=source, label=label,
                       api=api, loader=loader, seeds=None,
                       long_text=long_text, exclude=exclude)
    if seeds is not None:
        stream = components["stream"]
        nlp = spacy.load(spacy_model)
        seeds = get_seeds(seeds)
        matcher = PhraseMatcher(nlp.vocab)
        patterns = list(nlp.pipe(seeds))
        matcher.add("Filter Patterns", None, *patterns)
        examples_with_seeds = list(find_with_phrases(
            nlp, stream, matcher,
            at_least=1, at_most=1000, give_up_after=10000))
        log("RECIPE: Prepending {} examples with seeds to the stream".format(
            len(examples_with_seeds)))
        components["stream"] = cytoolz.concat((examples_with_seeds, stream))
    return components


def find_with_phrases(nlp, stream, matcher, at_least, at_most, give_up_after):
    found = 0
    for i, eg in enumerate(stream):
        document = nlp(eg["text"])
        if matcher(document):
            found += 1
            yield eg
            if found == at_most:
                break
        if i > give_up_after and not found:
            raise Exception(
                "Give up after {} examples not matching the patterns".format(i))
    if found < at_least:
        raise Exception("Only found {} examples".format(found))
```
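To check my mental model with plain iterators: a function like `find_with_phrases` that breaks early should leave the rest of the underlying iterator intact, so prepending the matches (here with `itertools.chain` standing in for `cytoolz.concat`) ought to work. A stdlib-only sketch:

```python
import itertools

def find_first_matches(stream, at_most):
    # Stand-in for find_with_phrases: "matching" means the number is even.
    found = 0
    for n in stream:
        if n % 2 == 0:
            found += 1
            yield n
            if found == at_most:
                break

stream = iter(range(10))
matches = list(find_first_matches(stream, at_most=2))  # consumes 0, 1, 2
combined = list(itertools.chain(matches, stream))      # matches, then the rest

assert matches == [0, 2]
assert combined == [0, 2, 3, 4, 5, 6, 7, 8, 9]
```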
A couple of things I tried:
- I tried getting a new stream from the existing one in `find_with_phrases`.

```python
def find_with_phrases(nlp, stream, matcher, at_least, at_most, give_up_after):
    ...
    for i, eg in enumerate(get_stream(stream)):
        ...
```
- I also tried to use `itertools` to split off my own stream.

```python
def find_with_phrases(nlp, stream, matcher, at_least, at_most, give_up_after):
    stream, phrase_stream = itertools.tee(stream)
    found = 0
    for i, eg in enumerate(phrase_stream):
        ...
```
Both of these did the same thing.

@ines, am I overlooking something?
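As a sanity check on the `tee` idea outside Prodigy: `itertools.tee` does give two independently consumable copies, so exhausting the phrase branch shouldn’t by itself empty the main one.

```python
import itertools

def make_stream():
    # Stand-in for a Prodigy stream of examples.
    for i in range(3):
        yield {"text": "example {}".format(i)}

stream, phrase_stream = itertools.tee(make_stream())

phrase_items = list(phrase_stream)  # consume one branch completely first
stream_items = list(stream)         # the other branch still sees everything

# tee buffers items, so each branch independently yields the full sequence.
assert len(phrase_items) == 3
assert len(stream_items) == 3
```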