prodigy-recipes repo – feedback appreciated!

docs
project
news

(Ines Montani) #1

We’ve just published the prodigy-recipes repo, which includes a selection of annotated open-source recipes for Prodigy. We hope that this makes it easier for users to get started and adapt the templates to write their own custom recipes :sparkles:

https://github.com/explosion/prodigy-recipes

The repo also includes more straightforward, standalone versions of the built-in recipes with added comments that explain what’s going on. We’re also planning on adding more recipes in the future, as well as a quick option for other users to contribute and share custom recipes they’ve built with the community.

We’re still in the process of testing and developing, but I’d appreciate your feedback :smiley: Also, if you have any suggestions for recipes and recipe templates you’d like to see, let me know!


(Ines Montani) #3

4 posts were split to a new topic: Using patterns for multi-word expressions


(Claudio84destri) #4

Hi,

I am trying to build an active learning custom recipe based on textcat.teach with the following changes:

  1. adding the --memorize option that's present in the mark recipe to avoid duplication during the annotation process, i.e. keeping a cache of the texts within a single batch to track which texts have already been asked and remove them

  2. setting a high probability score threshold to remove most of the negative samples when using prefer_high_scores(algorithm='probability')

The reason is that when using active learning with textcat.teach on a very imbalanced dataset, I tend to use quite a lot of patterns to reduce the number of negative samples, which ends up producing a lot of duplicated questions during active learning. Some patterns are even matched multiple times within the same text.

Moreover, once the CNN is trained well, I hope that using prefer_high_scores() will help, but I would like to set a different score threshold depending on the current accuracy of the CNN.
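Something like this is what I have in mind (just a sketch – the function name and threshold value are my own, and I'm assuming the stream yields (score, example) tuples like the ones Prodigy's sorters consume):

```python
# Sketch: drop low-scoring candidates before they reach the sorter.
# The name filter_by_score and the 0.8 default are illustrative only.
def filter_by_score(scored_stream, threshold=0.8):
    for score, eg in scored_stream:
        if score >= threshold:
            yield score, eg
```

The threshold could then be adjusted between sessions as the model's accuracy improves.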

My question is: where can I find the --memorize function so I can include it in a custom recipe?

Thanks very much in advance,
kind regards

claudio nespoli


(Ines Montani) #5

The built-in recipes are shipped with Prodigy as well, so you can check out the source of mark in recipes/generic.py to see how it’s implemented. It’s pretty straightforward – so maybe we can also add an example with this to the repo.
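For illustration, the core idea is a filter that remembers the hashes of examples it has already sent out and skips repeats. This is a sketch of the concept, not the actual mark source – the "_task_hash" key is hard-coded here only so the demo is self-contained (in a real recipe you'd use the hash constants from prodigy.util):

```python
# Sketch of a memorize-style filter: yield each task only once,
# keyed on its task hash. Hard-coded key is an assumption for the
# standalone demo; real recipes get it from prodigy.util.
TASK_HASH_ATTR = "_task_hash"

def memorize(stream, seen=None):
    """Skip any example whose task hash we've already yielded."""
    seen = set() if seen is None else seen
    for eg in stream:
        if eg[TASK_HASH_ATTR] not in seen:
            seen.add(eg[TASK_HASH_ATTR])
            yield eg
```

Seeding `seen` with the hashes already stored in your dataset would extend the same idea across annotation sessions.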

Btw, to find the location of your Prodigy installation, you can run the following:

python -c "import prodigy; print(prodigy.__file__)"

(Claudio84destri) #6

Thank you very much – the Prodigy support is very helpful!


(Claudio84destri) #7

During active learning with textcat.teach, I can see that when multiple patterns match in a text, Prodigy asks me to annotate the same text multiple times, once for each pattern.

Having a look at the recipe for textcat.teach, I can see that a generator is defined as below using the PatternMatcher:

    matcher = PatternMatcher(model.nlp, prior_correct=5.,
                             prior_incorrect=5., label_span=False,
                             label_task=True)
    matcher = matcher.from_disk(patterns)
    log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
    # Combine the textcat model with the PatternMatcher to annotate both
    # match results and predictions, and update both models.
    predict, update = combine_models(model, matcher)

Is it possible to modify the arguments of PatternMatcher in a custom recipe so that only one question is generated for each piece of text, to avoid being asked multiple times about the same text?

Thank you in advance,
kind regards


(Matthew Honnibal) #8

That’s a very reasonable thing to want, but currently there’s no argument on the PatternMatcher class itself for this. You could try a post-process filter like this:

from prodigy.util import INPUT_HASH_ATTR

def one_question_per_text(stream):
    """Filter the stream so we only get one question per text."""
    last_hash = None
    for eg in stream:
        if eg[INPUT_HASH_ATTR] != last_hash:
            last_hash = eg[INPUT_HASH_ATTR]
            yield eg

If you apply that to the stream, it should make sure you don’t get asked consecutive questions about the same text.
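To make the behaviour concrete, here's a self-contained version of that filter run on plain dicts. The INPUT_HASH_ATTR constant normally comes from prodigy.util as in the snippet above; it's hard-coded here only so the demo runs on its own:

```python
# Standalone demo of the consecutive-duplicate filter.
# In a real recipe, import INPUT_HASH_ATTR from prodigy.util instead.
INPUT_HASH_ATTR = "_input_hash"

def one_question_per_text(stream):
    """Yield an example only if its input hash differs from the previous one."""
    last_hash = None
    for eg in stream:
        if eg[INPUT_HASH_ATTR] != last_hash:
            last_hash = eg[INPUT_HASH_ATTR]
            yield eg

# Two pattern matches on the same text, then a new text.
examples = [
    {"text": "Example A", INPUT_HASH_ATTR: 101},
    {"text": "Example A", INPUT_HASH_ATTR: 101},
    {"text": "Example B", INPUT_HASH_ATTR: 202},
]
filtered = list(one_question_per_text(iter(examples)))
# filtered keeps one copy of Example A, plus Example B
```

Note that this only deduplicates consecutive questions – if the same text reappears later in the stream, it would be asked again.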


(Claudio84destri) #9

thank you very much