prodigy-recipes repo – feedback appreciated!

Hi,

I am trying to write a custom active learning recipe based on textcat.teach, with the following changes:

  1. Adding the --memorize option, which is present in the mark recipe, to avoid duplicates during the annotation process. Basically, I want a cache that keeps track of texts that have already been asked, including within a single batch, and removes them from the stream.

  2. Setting a high probability score threshold to remove most of the negative samples when using prefer_high_scores(algorithm='probability').
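
To make the two points above concrete, here is a rough sketch of what I have in mind. It is plain Python and does not use Prodigy's own API: it assumes the stream is an iterable of (score, example) tuples, as a scoring model would yield them, and the names filter_stream, threshold and seen are ones I made up for illustration.

```python
def filter_stream(scored_stream, threshold=0.8, seen=None):
    """Yield examples that score at least `threshold` and whose text is new.

    scored_stream: iterable of (score, example) tuples (an assumption about
    the stream format), where each example is a dict with a "text" key.
    seen: optional set of texts already asked; pass the same set again to
    keep the cache across batches or calls.
    """
    seen = set() if seen is None else seen
    for score, eg in scored_stream:
        text = eg["text"]
        # Skip likely negatives and texts that were already asked.
        # Low-scoring texts are not cached, so they can come back later
        # if the model scores them higher.
        if score < threshold or text in seen:
            continue
        seen.add(text)
        yield eg


# Hypothetical usage with hand-written scores
stream = [
    (0.90, {"text": "a"}),
    (0.95, {"text": "a"}),  # duplicate text, dropped
    (0.20, {"text": "b"}),  # below the threshold, dropped
    (0.85, {"text": "c"}),
]
print([eg["text"] for eg in filter_stream(stream, threshold=0.8)])
```

Keeping the threshold as an argument would let me raise or lower it as the model improves.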

The reason is that when I use textcat.teach for active learning on a very imbalanced dataset, I tend to use quite a lot of patterns to reduce the number of negative samples, and this results in a lot of duplicate questions during active learning. Some patterns even match multiple times within the same text.

Moreover, once the CNN is trained well, I hope that prefer_high_scores() will help, but I would like to set a different score threshold depending on how accurate the CNN currently is.

My question is: where can I find the code behind the --memorize option so that I can include it in a custom recipe?

Thank you very much in advance.
Kind regards,

claudio nespoli