prodigy-recipes repo – feedback appreciated!

docs
project
news

(Ines Montani) #1

We’ve just published the prodigy-recipes repo, which includes a selection of annotated open-source recipes for Prodigy. We hope that this makes it easier for users to get started and adapt the templates to write their own custom recipes :sparkles:

https://github.com/explosion/prodigy-recipes

The repo also includes more straightforward, standalone versions of the built-in recipes with added comments that explain what’s going on. We’re also planning on adding more recipes in the future, as well as a quick option for other users to contribute and share custom recipes they’ve built with the community.

We’re still in the process of testing and developing, but I’d appreciate your feedback :smiley: Also, if you have any suggestions for recipes and recipe templates you’d like to see, let me know!


(Ines Montani) #3

4 posts were split to a new topic: Using patterns for multi-word expressions


(Claudio84destri) #4

Hi,

I am trying to build an active learning custom recipe based on textcat.teach with the following changes:

  1. adding the --memorize option that's present in the mark recipe to avoid duplication during the annotation process, i.e. keeping a cache of the texts within a single batch to track which texts have already been asked and remove them

  2. setting a high probability score threshold to remove most of the negative samples when using prefer_high_scores(algorithm='probability')

The reason is that when using active learning with textcat.teach on a very imbalanced dataset, I tend to use quite a lot of patterns to reduce the number of negative samples, which ends up producing a lot of duplicated questions during active learning. Some patterns are even matched multiple times within the same text.

Moreover, once the CNN is trained well, I hope that using prefer_high_scores() will help, but I would like to set a different score threshold depending on the current accuracy of the CNN.
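Something like this is what I have in mind (just a sketch – the function name and threshold value are my own, and I'm assuming the stream yields (score, example) tuples like the ones Prodigy's sorters consume):

```python
# Sketch: drop low-scoring candidates before they reach the sorter.
# The name filter_by_score and the 0.8 default are illustrative only.
def filter_by_score(scored_stream, threshold=0.8):
    for score, eg in scored_stream:
        if score >= threshold:
            yield score, eg
```

The threshold could then be adjusted between sessions as the model's accuracy improves.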

My question is: where can I find the --memorize function so I can include it in a custom recipe?

Thanks very much in advance,
kind regards

claudio nespoli


(Ines Montani) #5

The built-in recipes are shipped with Prodigy as well, so you can check out the source of mark in recipes/generic.py to see how it’s implemented. It’s pretty straightforward – so maybe we can also add an example with this to the repo.
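For illustration, the core idea is a filter that remembers the hashes of examples it has already sent out and skips repeats. This is a sketch of the concept, not the actual mark source – the "_task_hash" key is hard-coded here only so the demo is self-contained (in a real recipe you'd use the hash constants from prodigy.util):

```python
# Sketch of a memorize-style filter: yield each task only once,
# keyed on its task hash. Hard-coded key is an assumption for the
# standalone demo; real recipes get it from prodigy.util.
TASK_HASH_ATTR = "_task_hash"

def memorize(stream, seen=None):
    """Skip any example whose task hash we've already yielded."""
    seen = set() if seen is None else seen
    for eg in stream:
        if eg[TASK_HASH_ATTR] not in seen:
            seen.add(eg[TASK_HASH_ATTR])
            yield eg
```

Seeding `seen` with the hashes already stored in your dataset would extend the same idea across annotation sessions.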

Btw, to find the location of your Prodigy installation, you can run the following:

python -c "import prodigy; print(prodigy.__file__)"

(Claudio84destri) #6

Thank you very much – the Prodigy support is very helpful!


(Claudio84destri) #7

During active learning with textcat.teach, I can see that when multiple patterns match in a text, Prodigy asks me to annotate the same text multiple times, once for each pattern.

Having a look at the recipe for textcat.teach, I can see that a generator is defined as below using the PatternMatcher:

    matcher = PatternMatcher(model.nlp, prior_correct=5.,
                             prior_incorrect=5., label_span=False,
                             label_task=True)
    matcher = matcher.from_disk(patterns)
    log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
    # Combine the textcat model with the PatternMatcher to annotate both
    # match results and predictions, and update both models.
    predict, update = combine_models(model, matcher)

Is it possible to modify the arguments of PatternMatcher in a custom recipe so that only one question is generated for each piece of text, to avoid being asked multiple times about the same text?

Thank you in advance,
kind regards


(Matthew Honnibal) #8

That’s a very reasonable thing to want, but currently there’s no argument on the PatternMatcher class itself for this. You could try a post-process filter like this:

from prodigy.util import INPUT_HASH_ATTR

def one_question_per_text(stream):
    """Filter the stream so we only get one question per text."""
    last_hash = None
    for eg in stream:
        if eg[INPUT_HASH_ATTR] != last_hash:
            last_hash = eg[INPUT_HASH_ATTR]
            yield eg

If you apply that to the stream, it should make sure you don’t get asked consecutive questions about the same text.
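To make the behaviour concrete, here's a self-contained version of that filter run on plain dicts. The INPUT_HASH_ATTR constant normally comes from prodigy.util as in the snippet above; it's hard-coded here only so the demo runs on its own:

```python
# Standalone demo of the consecutive-duplicate filter.
# In a real recipe, import INPUT_HASH_ATTR from prodigy.util instead.
INPUT_HASH_ATTR = "_input_hash"

def one_question_per_text(stream):
    """Yield an example only if its input hash differs from the previous one."""
    last_hash = None
    for eg in stream:
        if eg[INPUT_HASH_ATTR] != last_hash:
            last_hash = eg[INPUT_HASH_ATTR]
            yield eg

# Two pattern matches on the same text, then a new text.
examples = [
    {"text": "Example A", INPUT_HASH_ATTR: 101},
    {"text": "Example A", INPUT_HASH_ATTR: 101},
    {"text": "Example B", INPUT_HASH_ATTR: 202},
]
filtered = list(one_question_per_text(iter(examples)))
# filtered keeps one copy of Example A, plus Example B
```

Note that this only deduplicates consecutive questions – if the same text reappears later in the stream, it would be asked again.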


(Claudio84destri) #9

thank you very much