Finally able to resolve this. High-level summary:
- It's fundamentally not the learning dynamics. The learning dynamics are indeed different, but they were masking a separate bug.
- The actual bug is in the way the pattern matcher and model predictions were combined (see the sketch after this list). That code just didn't do what it was supposed to do, and the result was that the patterns had little impact on the enqueued data.
- A side issue we encountered is that, depending on which pipeline components you have running, following something closer to the original logic can cause performance problems.
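For the curious, here's roughly what the combination is supposed to look like. This is a minimal sketch assuming the Prodigy v1.x API (combine_models from prodigy.util, plus the loaders and sorters used in the recipes below), not the actual patch we shipped:
import spacy
from prodigy.components.loaders import get_stream
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.models.textcat import TextClassifier
from prodigy.util import combine_models

nlp = spacy.load("en_core_web_sm")  # placeholder pipeline
model = TextClassifier(nlp, ["insult"])
matcher = PatternMatcher(nlp).from_disk("patterns.jsonl")
stream = get_stream("examples.jsonl", rehash=True, dedup=True, input_key="text")
# combine_models interleaves the predictions of both models and returns a
# single predict function plus a single update callback, so that pattern
# matches actually influence which examples get scored and enqueued
predict, update = combine_models(model, matcher)
stream = prefer_uncertain(predict(stream))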
The old recipe
I'm mostly pasting this for illustration -- the imports below are approximate, so it may not run verbatim; if you do want to run it, I can provide a fully runnable version. But see below for a different suggestion.
# NOTE: approximate imports, assuming the Prodigy v1.x layout this recipe
# was written against -- exact module paths may differ between versions
import cytoolz
import spacy
from prodigy.components.db import connect
from prodigy.components.loaders import get_stream
from prodigy.components.sorters import prefer_uncertain
from prodigy.core import recipe
from prodigy.models.textcat import TextClassifier
from prodigy.util import log, get_seeds, get_seeds_from_set, recipe_args

DB = connect()


def find_with_terms(stream, terms, at_least=0, at_most=100,
                    give_up_after=100000):
    """
    Yield examples that contain at least one of the given terms (matched on
    whole words). You can specify a maximum number of matches, and/or a
    maximum number of examples to search through. You can also specify a
    minimum number of examples. If fewer are found, an error is raised.
    """
    log("SORTER: Looking for at least {} examples containing {} seed terms"
        .format(at_least, len(terms)))
    n_matches = 0
    for i, example in enumerate(stream):
        words = set(example['text'].split())
        if any(term in words for term in terms):
            n_matches += 1
            yield example
            if n_matches >= at_most:
                log("SORTER: Stop after finding maximum of {} examples from seeds"
                    .format(at_most))
                break
        if i >= give_up_after:
            log("SORTER: Give up finding seed terms after {} examples"
                .format(give_up_after))
            break
    if n_matches < at_least:
        msg = ("Tried to find at least %d examples containing the %d seed "
               "terms provided, but only found %d matches. Gave up after "
               "searching %d examples from the stream.")
        raise ValueError(msg % (at_least, len(terms), n_matches,
                                give_up_after))
@recipe('textcat.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        label=recipe_args['label'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        seeds=recipe_args['seeds'],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args['exclude'])
def teach(dataset, spacy_model, source=None, label='', api=None,
          loader=None, seeds=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log('RECIPE: Starting recipe textcat.teach', locals())
    nlp = spacy.load(spacy_model)
    log('RECIPE: Creating TextClassifier with model {}'
        .format(spacy_model))
    model = TextClassifier(nlp, label.split(','), long_text=long_text)
    stream = get_stream(source, api, loader, rehash=True, dedup=True,
                        input_key='text')
    if seeds is not None:
        if isinstance(seeds, str) and seeds in DB:
            seeds = get_seeds_from_set(seeds, DB.get_dataset(seeds))
        else:
            seeds = get_seeds(seeds)
        # Find 'seedy' examples
        examples_with_seeds = list(find_with_terms(stream, seeds,
                                                   at_least=10, at_most=1000,
                                                   give_up_after=10000))
        for eg in examples_with_seeds:
            eg.setdefault('meta', {})
            eg['meta']['via_seed'] = True
        print("Found {} examples with seeds".format(len(examples_with_seeds)))
        examples_with_seeds = [task for _, task in model(examples_with_seeds)]
    # Rank the stream. Note this is continuous, as model() is a generator.
    # As we call model.update(), the ranking of examples changes.
    stream = prefer_uncertain(model(stream))
    # Prepend 'seedy' examples, if present
    if seeds:
        log("RECIPE: Prepending examples with seeds to the stream")
        stream = cytoolz.concat((examples_with_seeds, stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'update': model.update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }
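For reference, this recipe would be invoked with something like the following (the dataset, model, and file names here are placeholders):
python -m prodigy textcat.teach insult_dataset en_core_web_sm ./examples.jsonl --label INSULT --seeds seeds.txt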
Analysis
What we're doing here is searching for a set of examples that have a high density of our seed terms. We then prepend this set onto the stream:
# Find 'seedy' examples
examples_with_seeds = list(find_with_terms(stream, seeds,
                                           at_least=10, at_most=1000,
                                           give_up_after=10000))
...
stream = cytoolz.concat((examples_with_seeds, stream))
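As a toy illustration of what that last line does (not from the recipe itself): cytoolz.concat chains the iterables lazily, so the seed examples are served first and the ranked stream is only consumed once they're exhausted.
import cytoolz

seeded = [{"text": "a seed example"}]
ranked = iter([{"text": "ranked example 1"}, {"text": "ranked example 2"}])
# concat is lazy: nothing from `ranked` is pulled until `seeded` runs out
stream = cytoolz.concat((seeded, ranked))
print(next(stream))  # -> {'text': 'a seed example'}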
It very much bothers me that this recipe has had this bug for so long, so to me it's worth taking a broader look at what went wrong. Fundamentally, I think there's no reason for all of these steps to happen in one function. We should break this down into separate steps: one recipe that filters the data to find good candidates for annotation, and a second recipe that enqueues that data.
Naturally, it's already possible to work this way with Prodigy, and I think that's what a lot of users have been doing (which might also be why this bug wasn't more prominent). So going forward, the multi-step approach is what we'll be recommending. We'll also be recording new tutorials for the v2 release we're working on.
Separate filtering recipe
@koaning has written a recipe to filter examples by pattern. You could also easily modify this to work with substring matches, or use some other heuristic. We expect to publish this with v1.12, but here it is (along with Vincent's description) so you can try it out already if you want.
Here's a demo recipe, written in a file called filter-recipe.py, that filters candidates by re-using parts of the previous recipe:
from typing import Iterable, List, Optional, Union

import srsly
import toolz
from prodigy.components.loaders import get_stream
from prodigy.core import recipe
from prodigy.models.matcher import PatternMatcher
from prodigy.types import StreamType
from prodigy.util import get_labels, load_model, log, msg
from spacy.language import Language


def find_matches(
    matcher: PatternMatcher,
    stream: StreamType,
    at_least: int = 10,
    at_most: int = 1000,
    give_up_after: int = 10000,
):
    """
    Yield examples that have at least one matched pattern.
    You can specify a maximum number of matches, and/or a maximum number
    of examples to search through. You can also specify a minimum number of
    examples. If fewer are found, a warning is shown.
    """
    log(f"SORTER: Looking for at least {at_least} examples containing patterns")
    n_matches = 0
    for _, example in matcher(toolz.take(give_up_after, stream)):
        # PatternMatcher.__call__ only yields examples that matched a pattern
        assert example["spans"]
        n_matches += 1
        yield example
        if n_matches >= at_most:
            log(
                f"SORTER: Stopped after finding maximum of {at_most} examples from patterns"
            )
            break
    else:
        log(f"SORTER: Gave up finding pattern matches after {give_up_after} examples")
    if n_matches < at_least:
        warning_msg = (
            f"This recipe uses patterns to find candidates to annotate first. It "
            f"tried to find at least {at_least} examples containing one of the patterns "
            f"provided, but only found {n_matches} matches after searching "
            f"{give_up_after} examples from the stream. The output file will only "
            f"contain the matches that were found."
        )
        msg.warn(warning_msg)


@recipe(
    "custom.filter",
    # fmt: off
    source=("Data to filter (file path or '-' to read from standard input)", "positional", None, str),
    output=("Path to .jsonl file to write subset into", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline or blank:lang (e.g. blank:en)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    patterns=("Path to match patterns file", "option", "pt", str),
    patterns_at_least=("If patterns are provided, the minimum number of matches", "option", "al", int),
    patterns_at_most=("If patterns are provided, the maximum number of matches", "option", "am", int),
    patterns_give_up_after=("If patterns are provided, when to stop searching", "option", "gu", int),
    # fmt: on
)
def filter_feed(
    source: Union[str, Iterable[dict]],
    output: str,
    spacy_model: Union[str, Language],
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,
    loader: Optional[str] = None,
    patterns_at_least: int = 10,
    patterns_at_most: int = 100,
    patterns_give_up_after: int = 2000,
) -> None:
    """
    Filter the dataset based on a patterns.jsonl file
    """
    log("RECIPE: Starting recipe custom.filter", locals())
    if label is None:
        msg.fail("custom.filter requires at least one --label", exits=1)
    if patterns is None:
        msg.fail("custom.filter requires --patterns", exits=1)
    nlp = load_model(spacy_model)
    stream = get_stream(
        source, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    matcher = PatternMatcher(
        nlp,
        prior_correct=5.0,
        prior_incorrect=5.0,
        label_span=False,
        label_task=True,
        filter_labels=label,
        combine_matches=True,
        task_hash_keys=("label",),
    )
    matcher = matcher.from_disk(patterns)
    log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
    matched_stream = find_matches(
        matcher,
        stream,
        at_least=patterns_at_least,
        at_most=patterns_at_most,
        give_up_after=patterns_give_up_after,
    )
    srsly.write_jsonl(output, matched_stream)
This recipe has the following --help output:
usage: prodigy custom.filter [-h] [-l None] [-pt None] [-lo None] [-al 10] [-am 100] [-gu 2000] source output spacy_model

    Filter the dataset based on a patterns.jsonl file

positional arguments:
  source                Data to filter (file path or '-' to read from standard input)
  output                Path to .jsonl file to write subset into
  spacy_model           Loadable spaCy pipeline or blank:lang (e.g. blank:en)

optional arguments:
  -h, --help            show this help message and exit
  -l None, --label None
                        Comma-separated label(s) to annotate or text file with one label per line
  -pt None, --patterns None
                        Path to match patterns file
  -lo None, --loader None
                        Loader (guessed from file extension if not set)
  -al 10, --patterns-at-least 10
                        If patterns are provided, the minimum number of matches
  -am 100, --patterns-at-most 100
                        If patterns are provided, the maximum number of matches
  -gu 2000, --patterns-give-up-after 2000
                        If patterns are provided, when to stop searching
And it can be called via something like:
python -m prodigy custom.filter examples.jsonl interesting-subset.jsonl en_core_web_md --patterns patterns.jsonl --label insult -F filter-recipe.py
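From there, the filtered subset can be annotated with whichever recipe fits your task as a second step, for example with textcat.manual (the dataset name here is a placeholder):
python -m prodigy textcat.manual insult_annotations interesting-subset.jsonl --label insult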
I personally prefer the two-step approach for a few reasons.
- It's more explicit. Instead of having one recipe that tries to apply two techniques, we're able to split that up into two annotation tasks that can each contribute useful training data.
- This filtering technique is useful for multiple recipes, not just textcat. You could use interesting-subset.jsonl for whatever use case you like.
- The "patterns tactic" for active learning seems very valuable early on in the active learning process, but in later phases the uncertain examples from the textcat model deserve more attention. These are the examples the model finds confusing, so at some point the model might learn the most from them.
- This filtering approach might invite users to iterate on their patterns files (see the example below), which seems worth encouraging.
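For anyone new to patterns files: patterns.jsonl is just JSONL with one pattern per line, where each pattern is either a list of spaCy token descriptions or a plain string for exact matching. A made-up example for the insult label used above:
{"label": "insult", "pattern": [{"lower": "idiot"}]}
{"label": "insult", "pattern": [{"lower": "total"}, {"lower": "moron"}]}
{"label": "insult", "pattern": "you clown"}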
I hope that by sharing these recipes we help solve your current issues, and that it also helps motivate why the recipe is designed the way it is. I also hope this reply inspires some custom ideas around annotation techniques. The cool thing about Prodigy is that it's customisable from Python, and you're free to implement any filtering/active learning mechanism that might work well for your problem.
If there are follow-up questions or comments, do let us know!