Filter inputs in textcat.teach

nix411 · February 4, 2022, 7:40am

Is it possible to use the built-in filter_inputs in the built-in textcat.teach recipe? Or do I need to create my own custom recipe that is basically a copy-paste of textcat.teach? At first I figured I could just do it in my custom loader, but unfortunately the loader doesn't know about the existing task ids (i.e. which dataset is being used).

Alternatively, I could create a custom loader for each dataset or embed the dataset name in the source argument, but that feels a little too hacky.

koaning · March 11, 2023, 9:26am

What are you interested in filtering?

I guess the simplest way to go about this is to write a small Python script that runs filter_inputs on your original data (say original.jsonl) and then saves the subset of interest (say useful-subset.jsonl). You can then feed this subset to the Prodigy recipe.

The benefit of such a Python script is that you can really just do anything you like. This includes using some of the helpers in Prodigy, but also anything that Python can do. That means that you can always expand the script and get creative.
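To make the idea concrete, here is a rough sketch of what such a script could look like. It deliberately uses only the standard library so it runs without a Prodigy install. Two things in it are assumptions rather than anything from this thread: the file annotated.jsonl (assumed to be an export of the existing dataset, e.g. via `prodigy db-out`), and the matching on the raw "text" field. As far as I know, Prodigy's own filter_inputs compares input hashes rather than text, so treat `filter_seen` as a stand-in for it, not a replica. The function names are made up for this example.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, examples):
    """Write an iterable of dicts to a JSONL file, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


def filter_seen(stream, seen_texts):
    """Yield only examples whose "text" value is not in seen_texts.

    Assumption: matching on raw text is a simplified stand-in for
    hash-based filtering of already annotated inputs.
    """
    for eg in stream:
        if eg.get("text") not in seen_texts:
            yield eg


def make_subset(original, annotated, out):
    """Write every example from `original` that is not in `annotated` to `out`."""
    # Collect the texts that were already annotated (hypothetical export file).
    seen = {eg["text"] for eg in read_jsonl(annotated) if "text" in eg}
    write_jsonl(out, filter_seen(read_jsonl(original), seen))
```

Calling `make_subset("original.jsonl", "annotated.jsonl", "useful-subset.jsonl")` would then produce the file to pass to the recipe as its source. Since it is plain Python, any extra filtering logic can go into `filter_seen`.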

Does this help?
