Filter inputs in textcat.teach

Is it possible to use the built-in filter_inputs helper with the built-in textcat.teach recipe, or do I need to create my own custom recipe that is basically a copy-paste of textcat.teach? At first I figured I could do it in my custom loader, but unfortunately the loader doesn't know about the existing task IDs (i.e. which dataset is being used).

Alternatively, I could create a custom loader for each dataset or embed the dataset into the source argument, but that feels a little too hacky.

What are you interested in filtering?

I guess the simplest way to go about this is to write a small Python script that runs filter_inputs on your original data (say original.jsonl) and then saves the subset of interest (say useful-subset.jsonl). You can then feed this subset to the Prodigy recipe.

The benefit of such a Python script is that you can really just do anything you like. This includes using some of the helpers in Prodigy, but also anything that Python can do. That means that you can always expand the script and get creative.
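
To make the pre-filtering idea concrete, here is a minimal sketch of such a script. It deliberately uses only the standard library so it runs without Prodigy installed. The helper names (read_jsonl, write_jsonl, drop_seen, filter_file) and the choice of the "text" field as the key are assumptions for illustration, not Prodigy's API. In a real setup you would probably swap drop_seen for Prodigy's filter_inputs and take the IDs from your existing dataset.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, examples):
    """Write an iterable of dicts as JSONL and return how many were written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")
            count += 1
    return count


def drop_seen(examples, seen_keys, key="text"):
    """Yield only examples whose `key` value is not in `seen_keys`.

    Hypothetical stand-in for a real filter: here "already seen" simply
    means the value of `key` appears in `seen_keys`.
    """
    seen = set(seen_keys)
    for eg in examples:
        if eg.get(key) not in seen:
            yield eg


def filter_file(src, dst, seen_keys, key="text"):
    """Read `src`, drop already-seen examples, and save the rest to `dst`."""
    return write_jsonl(dst, drop_seen(read_jsonl(src), seen_keys, key))
```

Calling filter_file("original.jsonl", "useful-subset.jsonl", seen_keys) would write the remaining examples to useful-subset.jsonl, which you can then pass to the recipe as its source. Because the filter is an ordinary generator function, you can chain further conditions onto it without touching the recipe itself.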

Does this help?
