Seeding text categorization with phrases

Initial Question

I’m doing a text categorization task on paragraphs. The best way to recognize the paragraphs I’m interested in is not with individual words but rather with short phrases.

I’d like to write a --seeds file that takes two- and three-word phrases on each line instead of individual words, but that doesn’t appear to be how the textcat.teach recipe works. (Because find_with_terms uses proximity in embedding space? I can’t tell, since I can’t see the source.)

  • Is there currently a Prodigy command line configuration that allows me to seed text categorization with short phrases? (Or patterns?)
  • If not, what is the easiest way to write one? I’m guessing I copy textcat.teach and replace find_with_terms with my own code. Can I just write this as a filter on a set of patterns to match?

My First Attempt at an Answer (With Further Questions)

I wrapped the textcat.teach recipe with code that reads the seeds file into a PhraseMatcher. In essence I replace that original recipe’s find_with_terms with my own find_with_phrases.

This recipe finds four examples that my patterns pick out, and then Prodigy says “No More Samples”. I haven’t been able to figure out how to make the recipe continue to stream examples, both the additional ones my patterns match and others that the model hypothesizes.

I suspect I’m mishandling the stream object by exhausting the generator. However, I can’t figure out the right way to handle this. Do I have to make copies of the stream? Are streams set up to loop infinitely? I don’t see this in the documentation, and I can’t step into find_with_terms to see how it is implemented.
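
(To illustrate what I mean by exhausting the generator, here’s a minimal, non-Prodigy sketch with a plain Python generator:)

stream = (x for x in range(3))
print(list(stream))  # [0, 1, 2]
print(list(stream))  # [] -- once consumed, a generator yields nothing more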

Here’s my code.

import cytoolz
import spacy
from prodigy import recipe, recipe_args
from prodigy.recipes.textcat import teach
from prodigy.util import log, get_seeds
from spacy.matcher import PhraseMatcher


@recipe("textcat.teach",
        dataset=recipe_args["dataset"],
        spacy_model=recipe_args["spacy_model"],
        source=recipe_args["source"],
        label=recipe_args["label"],
        api=recipe_args["api"],
        loader=recipe_args["loader"],
        seeds=recipe_args["seeds"],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args["exclude"])
def teach_with_phrases(dataset, spacy_model, source=None, label="", api=None,
                       loader=None, seeds=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log("RECIPE: Starting recipe textcat.teach with phrase seeds", locals())
    components = teach(dataset, spacy_model, source=source, label=label, api=api,
                       loader=loader, seeds=None, long_text=long_text, exclude=exclude)
    if seeds is not None:
        stream = components["stream"]
        nlp = spacy.load(spacy_model)
        seeds = get_seeds(seeds)
        matcher = PhraseMatcher(nlp.vocab)
        patterns = list(nlp.pipe(seeds))
        matcher.add("Filter Patterns", None, *patterns)
        examples_with_seeds = list(find_with_phrases(nlp, stream, matcher,
                                                     at_least=1, at_most=1000, give_up_after=10000))
        log("RECIPE: Prepending {} examples with seeds to the stream".format(len(examples_with_seeds)))
        components["stream"] = cytoolz.concat((examples_with_seeds, stream))
    return components


def find_with_phrases(nlp, stream, matcher, at_least, at_most, give_up_after):
    found = 0
    for i, eg in enumerate(stream):
        document = nlp(eg["text"])
        if matcher(document):
            found += 1
            yield eg
        if found == at_most:
            break
        if i > give_up_after and not found:
            raise Exception("Give up after {} examples not matching the patterns".format(i))
    if found < at_least:
        raise Exception("Only found {} examples".format(found))

A couple of things I tried:

  • I tried getting a new stream from the existing one in find_with_phrases.
def find_with_phrases(nlp, stream, matcher, at_least, at_most, give_up_after):
    ...
    for i, eg in enumerate(get_stream(stream)):
    ...
  • I also tried to use itertools to split off my own stream.
def find_with_phrases(nlp, stream, matcher, at_least, at_most, give_up_after):
    stream, phrase_stream = itertools.tee(stream)
    found = 0
    for i, eg in enumerate(phrase_stream):
    ...

Both of these did the same thing.

@ines, am I overlooking something?

Sorry for the delay getting back to you on this.

Your idea of using patterns is correct here. I’m planning to update the recipe to work more like ner.teach, which uses a patterns file.

Are you currently blocked on this? If so, I’ll try to give you some tips to keep you working for now. If not, I’d like to spend some time fixing the textcat.teach recipe to be more consistent with ner.teach, so that more powerful seed patterns can be used.

Finally, to answer your question about find_with_terms: it doesn’t use proximity in embedding space. From reading your code, your function does pretty much the same thing as the built-in one.

Yes, I’m currently blocked on this. For the past week or so I’ve been able to distract people with progress in other areas, but starting this Monday I’m back to needing to get this done. :slight_smile:

Tips to keep me working would be great. In particular, I think what I’m missing is an understanding of how the generators that underlie the stream objects work.

Thanks for your patience. The approach used in ner.teach seems to work well for textcat as well. It uses the PatternMatcher class, so you can write a patterns.jsonl file that matches on phrases, token attributes, etc.

The mechanism for bootstrapping from the match rules is also better. Previously, we just prepended the “seedy” examples at the beginning of the data. Now we use the combine_models() function to interleave the two models, the PatternMatcher and the TextClassifier. Annotations are sent to the update() callback of both models, so when you accept or reject the rule-based matches, the results are used to train the text classifier. The accept/reject results on the rule-based matches are also used to attach a score to the individual patterns, so that fewer questions are asked about less productive patterns. You can set a prior on the patterns to adjust how quickly these scores change.
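
To give a rough idea of what the prior does (this is only a sketch of the general idea, not the actual internals), you can think of a pattern’s score as a smoothed accept rate, where prior_correct and prior_incorrect act as pseudo-counts. Larger priors mean the score moves more slowly as annotations come in:

def pattern_score(n_accept, n_reject, prior_correct=5.0, prior_incorrect=5.0):
    # Smoothed accept rate with pseudo-counts -- a sketch, not the real internals.
    return (n_accept + prior_correct) / (
        n_accept + n_reject + prior_correct + prior_incorrect)

pattern_score(0, 0)   # 0.5 before any annotations
pattern_score(1, 9)   # 0.3 after one accept and nine rejects

Here’s the updated recipe: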

# coding: utf8
from __future__ import unicode_literals, print_function

import spacy
import random
import cytoolz
import tqdm

from prodigy.models.matcher import PatternMatcher
from prodigy.models.textcat import TextClassifier
from prodigy.components import printers
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import split_sentences
from prodigy.components.db import connect
from prodigy.components.sorters import prefer_uncertain
from prodigy.core import recipe, recipe_args
from prodigy.util import export_model_data, split_evals, get_print
from prodigy.util import combine_models, log


@recipe('textcat.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        label=recipe_args['label'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args['exclude'])
def teach(dataset, spacy_model, source=None, label='', api=None, patterns=None,
          loader=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log('RECIPE: Starting recipe textcat.teach', locals())
    DB = connect()
    nlp = spacy.load(spacy_model)
    disabled = nlp.disable_pipes(nlp.pipe_names) # Note: if you want POS tag patterns, don't disable the tagger.
    log('RECIPE: Creating TextClassifier with model {}'
        .format(spacy_model))
    model = TextClassifier(nlp, label.split(','), long_text=long_text)
    stream = get_stream(source, api, loader, rehash=True, dedup=True,
                        input_key='text')
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = PatternMatcher(model.nlp, prior_correct=5., prior_incorrect=5.)
        matcher = matcher.from_disk(patterns)
        log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
        # Combine the text classification model with the PatternMatcher to
        # annotate both match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
    # Rank the stream. Note this is continuous, as model() is a generator.
    # As we call model.update(), the ranking of examples changes.
    stream = prefer_uncertain(predict(stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'update': update,
        'on_exit': lambda ctrl: disabled.restore(), 
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

I tried this recipe but saw strange results.

First, when I run it as-is, I get an error about the pipeline components:

Traceback (most recent call last):
  File "/anaconda3/envs/semex/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/envs/semex/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/envs/semex/lib/python3.5/site-packages/prodigy/__main__.py", line 242, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/anaconda3/envs/semex/lib/python3.5/site-packages/plac-0.9.6-py3.5.egg/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/anaconda3/envs/semex/lib/python3.5/site-packages/plac-0.9.6-py3.5.egg/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "textcat.py", line 41, in teach
    disabled = nlp.disable_pipes(nlp.pipe_names) # Note: if you want POS tag patterns, don't disable the tagger.
  File "/anaconda3/envs/semex/lib/python3.5/site-packages/spacy/language.py", line 354, in disable_pipes
    return DisabledPipes(self, *names)
  File "/anaconda3/envs/semex/lib/python3.5/site-packages/spacy/language.py", line 685, in __init__
    self.extend(nlp.remove_pipe(name) for name in names)
  File "/anaconda3/envs/semex/lib/python3.5/site-packages/spacy/language.py", line 685, in <genexpr>
    self.extend(nlp.remove_pipe(name) for name in names)
  File "/anaconda3/envs/semex/lib/python3.5/site-packages/spacy/language.py", line 312, in remove_pipe
    raise ValueError(msg.format(name, self.pipe_names))
ValueError: Can't find '['tagger', 'parser', 'ner']' in pipeline. Available names: ['tagger', 'parser', 'ner']

This goes away if I comment out the disabled = nlp.disable_pipes(nlp.pipe_names) line.

I have a corpus in which each document is a paragraph of legal text. I want to learn to classify the paragraphs that describe a legal jurisdiction versus those that describe something else. (So JURISDICTION is a text category label applied to the entire paragraph.) All legal jurisdiction paragraphs mention the names of U.S. states, so I’m using those as my seed patterns. (I can’t use them as seed terms, both because some U.S. state names are multi-word, and because I want to use more general short phrases for other tasks.)

I have a patterns file that looks like this:

{"label": "JURISDICTION", "pattern": [{"ORTH": "Alabama"}]}
{"label": "JURISDICTION", "pattern": [{"ORTH": "Alaska"}]}
{"label": "JURISDICTION", "pattern": [{"ORTH": "Arizona"}]}

If I run the following Prodigy command, which does not specify the patterns file,

prodigy textcat.teach -F textcat.py jurisdiction-paragraph.000 en corpus.jsonl --label JURISDICTION

I see the text classification task that I’d expect: the full paragraph with the JURISDICTION label shown above it.

However, I am not using any seed phrases to pick out likely candidate paragraphs.

Now if I use the following command line to include the patterns file

prodigy textcat.teach -F textcat.py jurisdiction-paragraph.000 en corpus.jsonl --patterns us-states.patterns.en.jsonl --label JURISDICTION

then, whenever the candidate paragraph contains the name of a U.S. state, the task is rendered differently.

It looks like Prodigy has switched from a text classification task to an NER task. The view still says classification, but the matched U.S. state is highlighted with the label JURISDICTION.

If the candidate paragraph does not contain a mention of a U.S. state, Prodigy presents an ordinary text classification task, as before.

This looks weird. Am I misunderstanding something? Like maybe I’m using the wrong format for the patterns file? (I’ve been experimenting with this, but haven’t found a format that works.)

To restate just to be clear: I am trying to do text classification. I’m not trying to do NER at all here. I just want to use multi-word phrases as the “seed patterns” for my text classification.

(This is Prodigy 1.3.0)

The result actually looks pretty good so far – the difference in rendering is probably related to different task formats produced by the two models. If a classification task also contains "spans", Prodigy will render those in "NER style". Essentially, the interface makes very few assumptions about what exactly you're labelling, and will simply render tasks how they come in.

So it looks like the only problem here is that the text classification model produces annotation tasks with a text and a "label", whereas the pattern matcher model produces tasks with "spans" and no label. The highlighted "entity" you see is the match from the patterns. This is likely because the pattern matcher was originally designed for NER.
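
To illustrate (the text and offsets here are made up, just to show the two shapes), the text classifier produces tasks like the first line below, while the pattern matcher produces tasks like the second:

{"text": "This Agreement is governed by the laws of Alaska.", "label": "JURISDICTION"}
{"text": "This Agreement is governed by the laws of Alaska.", "spans": [{"start": 42, "end": 48, "label": "JURISDICTION"}]}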

You might be able to just add a simple function that reformats the stream to make sure all tasks are presented in the same style (have the label set and no label on the spans). For example:

def normalize_stream(stream, label):
    for eg in stream:
        eg['label'] = label  # make sure all tasks have the label set
        if 'spans' in eg:
            for span in eg['spans']:
                span.pop('label')  # remove label from spans
        yield eg

Alternatively, if you don't want the matches to be highlighted at all, you could also remove the "spans" altogether.

stream = prefer_uncertain(predict(stream))
stream = normalize_stream(stream, label)

That makes sense. I’m closer to a solution.

First, there’s a minor bug in the original recipe. You need to disable the pipes with the following call:

disabled = nlp.disable_pipes(*nlp.pipe_names)

Then, before you re-enable the original pipes on exit, you have to disable the pipes created in this recipe. I did it like so:

def restore(_):
    nlp.disable_pipes(*nlp.pipe_names)
    disabled.restore()

...

    return {
        ...
        'on_exit': restore,
        ...
    }

I don’t want the annotator to see the highlighted seed phrases at all, so I wrote a variation on @ines’ stream normalization from above.

def normalize_stream(stream, label):
    for eg in stream:
        eg['label'] = label  # make sure all tasks have the label set
        eg["spans"] = []
        yield eg

This works: I always see the text category label at the top of the annotation task, and I never see the highlighting.

However, there’s still one problem. If the candidate paragraph contains more than one seed phrase, multiple eg objects are queued up for it in the stream, one per seed phrase, and the annotator is presented with the same paragraph several times in a row.

I want the paragraph to only be presented for annotation once. How do I do this? I’m guessing that after removing the spans from the eg object I have to rehash it, because the removal may render some eg objects identical. I don’t know how to do this. I tried

stream = get_stream(stream, rehash=True, dedup=True)

but that didn’t do the trick.

(BTW eg stands for “example”, right?)

Just posted an update on this thread with an updated version of textcat.teach using the PatternMatcher. (It still includes the entity labels, but you can easily filter them out using a function like yours above).

You could write a little wrapper for your stream that checks the _input_hash, which will be identical for tasks with the same text, and either merges the spans, or removes the duplicates. (This depends on how you want the tasks to look – i.e. if you want all matches to be highlighted, or just the first one.)
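
For example, here's a rough sketch of the merging variant, assuming the duplicate tasks come through back-to-back as you describe (merge_spans is just a hypothetical helper, not a Prodigy built-in):

def merge_spans(stream):
    # Merge consecutive tasks that share an _input_hash (i.e. the same text)
    # into a single task carrying all of their spans.
    previous = None
    for eg in stream:
        if previous is not None and eg['_input_hash'] == previous['_input_hash']:
            previous.setdefault('spans', []).extend(eg.get('spans', []))
        else:
            if previous is not None:
                yield previous
            previous = eg
    if previous is not None:
        yield previous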

Ah yes – sorry if this was confusing. We just ended up using eg because it's short.

Just released v1.4.0, which lets you bootstrap textcat.teach with a patterns file instead of only seed terms (just like ner.teach) :tada:
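
So you should be able to drop the -F flag and use the built-in recipe directly, with a command along the lines of:

prodigy textcat.teach jurisdiction-paragraph.000 en corpus.jsonl --label JURISDICTION --patterns us-states.patterns.en.jsonl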

Hey,

Just ran into some of these issues myself.

Is there any circumstance in which you'd want to show the same text multiple times in a classification task? That is to say, wouldn't deduplicating be a better default behavior? I was pretty confused when I saw the same text multiple times in a classification task.

I'm also seeing errors when I try to save my classification annotations when using patterns. Specifically:

14:54:46 - Exception when serving /give_answers
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/waitress/channel.py", line 338, in service
    task.service()
  File "/usr/local/lib/python3.6/site-packages/waitress/task.py", line 169, in service
    self.execute()
  File "/usr/local/lib/python3.6/site-packages/waitress/task.py", line 399, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/usr/local/lib/python3.6/site-packages/hug/api.py", line 424, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "/usr/local/lib/python3.6/site-packages/hug/interface.py", line 734, in __call__
    raise exception
  File "/usr/local/lib/python3.6/site-packages/hug/interface.py", line 709, in __call__
    self.render_content(self.call_function(input_parameters), request, response, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/hug/interface.py", line 649, in call_function
    return self.interface(**parameters)
  File "/usr/local/lib/python3.6/site-packages/hug/interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/app.py", line 101, in give_answers
    controller.receive_answers(answers)
  File "cython_src/prodigy/core.pyx", line 98, in prodigy.core.Controller.receive_answers
  File "cython_src/prodigy/util.pyx", line 277, in prodigy.util.combine_models.update
  File "cython_src/prodigy/models/textcat.pyx", line 169, in prodigy.models.textcat.TextClassifier.update
KeyError: 'label'

That said, Prodigy is super awesome so far. Thanks for making it!

EDIT: Upon reading @wpm's answer more carefully, I saw that he was adding 'label' back into the stream. It seems the PatternMatcher is overriding the label property somehow when you use seed patterns, which causes the saves to fail. Adding the label back in fixes the saving issue. I also fixed the duplication problem; for anyone who wants it, the code is:

def filter_duplicates(stream, label):
    seen = set()  # input hashes we have already yielded a task for
    for eg in stream:
        if eg['_input_hash'] not in seen:
            seen.add(eg['_input_hash'])
            eg['label'] = label
            eg['spans'] = []
            yield eg

And then in the 'teach' recipe code:

stream = filter_duplicates(stream, label[0])

Note that this will not work correctly for multi-class problems. I'm not sure what to do about that though.
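
(One possible tweak, untested: key the dedup on the input hash plus the label, so the same text can still be asked about once per candidate label. Something like:)

def filter_duplicates_multilabel(stream):
    # Untested sketch: for pattern-matcher tasks that only carry "spans", fall
    # back to the span's label, then dedupe on the (text hash, label) pair.
    seen = set()
    for eg in stream:
        label = eg.get('label')
        if label is None and eg.get('spans'):
            label = eg['spans'][0].get('label')
        key = (eg['_input_hash'], label)
        if key in seen:
            continue
        seen.add(key)
        if label is not None:
            eg['label'] = label
        eg['spans'] = []
        yield eg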
