Thanks for the report!
Just tested it with your example data and came across the same behaviour. There weren't any errors either, so I think what might be happening here is the following: textcat.teach scores the stream and tries to only show you the most relevant tasks. Since there are only 100 examples, the "most relevant" selection seems to come out to only about 10-20%, which means the stream is exhausted after 10-20 examples.

This obviously isn't ideal, and we'll think about the best way to handle it. JSONL is streamed in line by line, so Prodigy can't know upfront how many examples there are. But it could, for example, loop over the data again with a lower score threshold if the stream is exhausted and the number of examples collected so far is very low.
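If you want to experiment with something like that yourself in the meantime, you could write a loader that keeps looping over your file, so the stream never runs dry (at the cost of potentially being asked about the same examples more than once). This is just a minimal sketch, not a built-in Prodigy feature, and the path is a placeholder:

import json

def infinite_stream(path):
    # keep re-reading the JSONL file so the recipe always has tasks to score
    while True:
        with open(path, encoding='utf8') as f:
            for line in f:
                yield json.loads(line)

You could print each task to stdout and pipe it into the recipe via stdin, as shown further down, or plug it into a custom recipe like the one at the end of this post.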
Here are some potential solutions for now:
Label all examples without the active-learning component
If you want to label all examples in your dataset, try using the mark recipe instead, which disables the active learning component and sorting, and simply lets you annotate all examples. You can save all annotations for the different labels to the same dataset, and when you're done, use textcat.batch-train to train your model.
prodigy mark your_dataset /path/to/file.jsonl -l LABEL -v classification
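Once you're done annotating, the training command could look something like this (the output path is a placeholder, and the available flags may differ between versions, so it's worth checking prodigy textcat.batch-train --help):

prodigy textcat.batch-train your_dataset en_core_web_sm --output /path/to/model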
Use more examples
If you have enough data for each class, you don't necessarily have to extract a certain number first – you can simply stream in everything you have and let Prodigy take care of making the selection. If you need specific processing logic to convert your data to Prodigy's JSONL format, you can simply do this in Python and forward the data to Prodigy.
If you don't set the source argument on the command line, it defaults to stdin, which means you can pipe in data from any other source or script on the command line. Assuming you have a processing script like this:
import json

stream = load_my_texts_from_somewhere()  # your custom loading logic
for text in stream:
    task = {'text': text}
    # print the JSON to make it available to the recipe via stdin
    print(json.dumps(task))
You can then use it with the recipe like this:
python my_processing_script.py | prodigy textcat.teach my_dataset en_core_web_sm --label MY_LABEL
Alternatively, you can also achieve the same result with a custom recipe that delegates to textcat.teach. All a recipe does is return a simple dictionary of components (which is later interpreted by Prodigy when you run the recipe), so your custom recipe can simply call the textcat.teach function and return its output:
import prodigy
from prodigy.recipes.textcat import teach

@prodigy.recipe('custom_recipe')
def custom_recipe(dataset, model, label):
    stream = load_texts_and_process_them()  # your custom logic here
    return teach(dataset, model, stream, label=label)
You can then use the recipe like this:
prodigy custom_recipe my_dataset en_core_web_sm MY_LABEL -F recipe.py
There should be a more detailed example of this in the documentation as well.
This is weird – I'll double-check that the file upload permissions are set correctly. I specifically remember adding .jsonl to the list of allowed file types, because it seemed like something that would come up a lot.