Hi,
I am trying to write a custom active learning recipe based on textcat.teach with the following changes:
- Add the --memorize option, which is present in the mark recipe, to avoid duplicates during annotation. Basically, keep a cache of the texts that have already been asked, including those within a single batch, and drop any text that is already in it.
- Set a high probability score threshold to remove most of the negative samples when using prefer_high_scores(algorithm='probability').
The reason is that when I use textcat.teach for active learning on a very imbalanced dataset, I rely on quite a lot of patterns to reduce the number of negative samples, and this produces many duplicated questions during annotation. Some patterns even match several times within the same text.
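To make the caching idea concrete, here is a rough sketch of the kind of filter I have in mind. It is plain Python and not Prodigy's actual --memorize implementation: the function name drop_seen_texts is made up, and I am assuming each example is a dict with a "text" key.

```python
import hashlib


def drop_seen_texts(stream, seen=None):
    """Yield only examples whose text has not been yielded before.

    Hypothetical helper, not part of Prodigy. `seen` can be passed in
    so the cache survives across batches or sessions.
    """
    seen = set() if seen is None else seen
    for eg in stream:
        # Hash the raw text so the cache stays small for long documents.
        key = hashlib.md5(eg["text"].encode("utf8")).hexdigest()
        if key in seen:
            continue  # already asked, skip the duplicate question
        seen.add(key)
        yield eg


examples = [{"text": "good product"}, {"text": "bad"}, {"text": "good product"}]
unique = list(drop_seen_texts(examples))
```

Since several patterns can match the same text, keying the cache on the text alone would also drop the extra pattern matches, which is what I want here.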
Moreover, once the CNN is trained well, I hope that prefer_high_scores() will help, but I would like to set a different score threshold depending on how accurate the CNN currently is.
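For the threshold part, this is roughly what I am picturing. Again it is only a sketch under my own assumptions: I assume the scored stream consists of (score, example) tuples, and both filter_by_score and threshold_for (including its cut-off values) are hypothetical names and numbers I chose for illustration.

```python
def threshold_for(accuracy):
    """Hypothetical schedule: be lenient while the model is weak,
    stricter once it is accurate. The numbers are placeholders."""
    return 0.5 if accuracy < 0.7 else 0.8


def filter_by_score(scored_stream, get_threshold):
    """Keep only (score, example) pairs at or above the current threshold.

    `get_threshold` is called for every example, so the threshold can
    change while the stream is being consumed.
    """
    for score, eg in scored_stream:
        if score >= get_threshold():
            yield score, eg


scored = [(0.9, {"text": "a"}), (0.6, {"text": "b"}), (0.3, {"text": "c"})]
kept_early = list(filter_by_score(scored, lambda: threshold_for(0.6)))
kept_late = list(filter_by_score(scored, lambda: threshold_for(0.9)))
```

The idea would be to apply this filter to the scored stream before it reaches prefer_high_scores(), so that low-scoring negatives never get to the sorter.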
My question is: where can I find the --memorize functionality so that I can include it in a custom recipe?
Thank you very much in advance.
kind regards
claudio nespoli