textcat.teach: how to exclude target dataset examples by hash, but auxiliary datasets by input?

Here's what I'm trying to achieve:

I've created a simple UI on top of Prodigy for use in my team. The typical workflow is that a user defines a labelling project with a number of textcat labels (these can be exclusive or not, but let's assume multilabel for this discussion), then does some "gold" labelling with textcat.manual and textcat.correct. We save those annotations to a single dataset. Then, the user wants to improve the model more efficiently, so we allow them to use textcat.teach and select which label to work on. We save the textcat.teach labelled examples to a separate dataset (a single one, not one per label). We don't want to label examples already in the golden dataset, so we add --exclude for that dataset, and have to exclude by input hash.

If the user first does textcat.teach with --label="L1" for a while, and then does textcat.teach with --label="L2", I don't want to filter out examples previously labelled with "L1", but this does seem to be happening.

As we don't have visibility of the Controller code I'm not sure how filtering is handled w.r.t. examples already in the target dataset. Does the target dataset get lumped in with the "exclude" datasets and therefore the "exclude_by" in the config file applies equally to the "exclude" datasets and the target dataset?

It seems to me that I have to either save the textcat.teach labelled examples to separate datasets depending on the label focused on, or modify the recipe to use custom stream filtering logic? Or probably simpler, write a custom script to handle the stream creation and required exclusion, and pipe the output of that into textcat.teach, making sure there is no further exclusion happening in the Controller?

hi @einarbmag!

Thanks for your question! It's an interesting use case. Please correct me if I misinterpreted anything. It's a tricky workflow so I may have missed something.

Thinking outside the box, could you use the "choice" interface for textcat.teach and do all labels at the same time (instead of one label at a time)?

I typically like binary labels (i.e., one label at a time) early on, but with active learning, it may make it easier to consider all labels at the same time (I suspect that would eliminate the problem).

Ines had recommended this through a custom recipe in this post (definitely read through this thread with lots of examples):

In theory, this should solve your problem, since you can do all the labels at the same time. The one issue may be if you're dealing with many labels (say 10+). I'll wait to see if you think this is a viable alternative.

You can find the location of your Prodigy installation by running python -m prodigy stats and looking for the Location entry in the output. You would likely be interested in core.py (for the Controller) or components/feeds.py (get_batch).

By default, the target (aka current) dataset is always excluded (see Configuration). You can change this by setting "auto_exclude_current": false in your prodigy.json. However, I'm not sure how the filtering behaves once you change this, so it would be interesting to find out.
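To make the two settings concrete, here is a minimal sketch of the overrides as a Python dict, the way they might be merged into the "config" returned by a custom recipe or written out to prodigy.json. The setting names come from the Configuration docs; the merge_config helper and the base values are made up for illustration, and I haven't verified how the two settings interact.

```python
import json

# Setting names from the Prodigy configuration docs; values are illustrative.
overrides = {
    "auto_exclude_current": False,  # don't automatically exclude the target dataset
    "exclude_by": "input",          # compare excluded examples by "input" or "task" hash
}


def merge_config(base, extra):
    """Hypothetical helper: return a copy of base with the keys in extra applied."""
    merged = dict(base)
    merged.update(extra)
    return merged


# "batch_size" stands in for whatever is already in your config.
config = merge_config({"batch_size": 10}, overrides)
print(json.dumps(config, indent=2))
```

The printed JSON is what the relevant part of prodigy.json would look like with both settings applied.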

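If the alternative suggestion (a custom textcat.teach recipe with all labels in a "choice" interface) doesn't work, then both of the options you describe make sense: saving the textcat.teach annotations to a separate dataset per label, or handling the exclusion yourself with custom stream filtering. Hopefully the tips above can help.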
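As a rough illustration of the custom filtering option, here is a sketch of a stream filter that drops anything whose text is already in the gold dataset (by input hash) but only drops exact repeats of a teach task (by task hash). It assumes each example already carries Prodigy's "_input_hash" and "_task_hash" keys and that the label is already attached, since the task hash can only tell L1 from L2 once the label is part of the task. The function name and the hash values are made up; in a real recipe the two sets would come from the database (I believe db.get_input_hashes and db.get_task_hashes, but please check against your version).

```python
from typing import Dict, Iterable, Iterator, Set


def filter_stream(
    stream: Iterable[Dict],
    gold_input_hashes: Set[int],
    teach_task_hashes: Set[int],
) -> Iterator[Dict]:
    """Skip texts already in the gold dataset and exact tasks already in the teach dataset."""
    for eg in stream:
        if eg["_input_hash"] in gold_input_hashes:
            continue  # same text was already annotated in the gold dataset
        if eg["_task_hash"] in teach_task_hashes:
            continue  # same text + same label was already answered in textcat.teach
        yield eg


# Made-up hashes, for illustration only.
gold_input_hashes = {1}
teach_task_hashes = {201}
stream = [
    {"text": "a", "label": "L2", "_input_hash": 1, "_task_hash": 102},
    {"text": "b", "label": "L1", "_input_hash": 2, "_task_hash": 201},
    {"text": "b", "label": "L2", "_input_hash": 2, "_task_hash": 202},
]
kept = list(filter_stream(stream, gold_input_hashes, teach_task_hashes))
print([(eg["text"], eg["label"]) for eg in kept])  # [('b', 'L2')]
```

Text "b" was already answered for L1, but it still comes through for L2, which is the behaviour you're after. If you go this route, you'd also want to make sure the Controller isn't applying its own exclusion on top.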

Also, just in case, are you using Prodigy's logging functionality? I find PRODIGY_LOGGING=verbose especially helpful, as it shows the result of the feed (get_batch).

Hope this is a start and let me know your comments/feedback. Happy to iterate on more ideas until you find a solution.