textcat.teach: how to exclude target dataset examples by hash, but auxiliary datasets by input?

hi @einarbmag!

Thanks for your question! It's an interesting use case. It's a tricky workflow, so please correct me if I've misinterpreted anything or missed something.

Thinking outside the box, could you use the "choice" interface for textcat.teach and do all labels at the same time (instead of one label at a time)?

I typically prefer binary annotation (i.e., one label at a time) early on, but with active learning it may be easier to consider all labels at once (I suspect that would eliminate the problem).

Ines recommended this approach with a custom recipe in this post (definitely read through that thread; it has lots of examples):

In theory, this should solve your problem, since you can annotate all labels at the same time. The one issue may be if you're dealing with many labels (say, 10+). I'll wait to hear whether you think this is a viable alternative.
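To make the idea concrete, the core of a custom "choice" recipe is a stream transform that attaches one option per label to each task. Here's a minimal, runnable sketch of just that transform (label names and the example task are placeholders, not from your project); in a full recipe you'd wrap your stream with it and return `"view_id": "choice"`:

```python
# Sketch of the stream transform at the heart of a custom "choice" recipe.
# Label names are placeholders. In a full Prodigy recipe you'd wrap your
# stream with this and return it alongside "view_id": "choice" (and set
# "choice_style": "multiple" in the config to allow multi-label answers).
LABELS = ["POSITIVE", "NEGATIVE", "NEUTRAL"]

def add_options(stream, labels=LABELS):
    """Attach one selectable option per label to each incoming task."""
    for task in stream:
        task["options"] = [{"id": label, "text": label} for label in labels]
        yield task

# Example: a task as it would come out of a JSONL loader
tasks = [{"text": "Great product, fast shipping."}]
example = next(add_options(tasks))
print(example["options"])
```

This way the annotator sees every label on each example, so there's no per-label hash bookkeeping to worry about.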

You can find the location of your Prodigy installation by running python -m prodigy stats and checking the Location field in the output. You'd likely be most interested in core.py (for the controller) or components/feeds.py (get_batch).
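If it helps, you can also get a package's install location straight from Python. The sketch below uses the stdlib importlib machinery, demonstrated on the built-in json package so it runs anywhere; substitute "prodigy" on your machine:

```python
# Locate an installed package's source on disk without the CLI.
# Demonstrated on the stdlib "json" package; swap in "prodigy" locally.
import importlib.util

def package_location(name):
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

print(package_location("json"))  # e.g. .../lib/python3.x/json/__init__.py
```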

By default, the target (aka current) dataset is always excluded (see Configuration). You can change this by setting "auto_exclude_current": false in your prodigy.json. However, if you do modify this, I'm not sure how it would interact with your setup; it would be interesting to find out.
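For reference, the override would look like this in your prodigy.json (a sketch; double-check the key against the Configuration docs for your version):

```json
{
  "auto_exclude_current": false
}
```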

If the alternative suggestion (a custom textcat.teach recipe with all labels in a "choice" interface) doesn't work, then both of these make sense. Hopefully the tips above help.

Also, just in case: are you using Prodigy's logging functionality? I find PRODIGY_LOGGING=verbose especially helpful, as it shows the result of the feed (get_batch).
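Logging is enabled per-invocation via an environment variable. A quick sketch (the dataset, model, and file names below are placeholders):

```shell
# Enable verbose logging for a single Prodigy run (placeholders below):
#   PRODIGY_LOGGING=verbose prodigy textcat.teach my_dataset en_core_web_sm data.jsonl --label MY_LABEL
# The VAR=value command pattern sets the variable for that one process only:
PRODIGY_LOGGING=verbose python -c 'import os; print(os.environ["PRODIGY_LOGGING"])'
```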

Hope this is a start and let me know your comments/feedback. Happy to iterate on more ideas until you find a solution.