First, thanks for Prodigy. Judging from recent feedback, my teammates are happy with this new tool compared to the previous tagging tools that we have tried like
One issue we have run into is that Prodigy won’t iterate through all the data that we provide under prodigy mark. For example, our .jsonl file contains 1000 records, but we can only tag about 800 before the screen shows No tasks available. I saw this post, and the solution recommended by Ines is to use mark. Is there something I have misunderstood about it?
Another question is where can I dow
Thanks for the report – and I’m glad you’re finding Prodigy useful so far.
The issue you describe definitely shouldn’t happen.
What I meant in my reply was: The teach recipes will select the most relevant examples and thus will skip examples with very high or low predictions, to help you focus on the most important ones. Ideally, this means you’ll have to annotate fewer examples in total, while still getting similar results after training.
The mark recipe should go through your examples in order and just ask questions, without making any predictions, selecting examples or modifying the stream. So if you don’t want to use the “active learning component” and just want to annotate a fixed set of examples in order, it’s generally recommended to use mark instead of one of the teach recipes.
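To make the difference concrete, here is a rough sketch of the two kinds of stream. It is only an illustration and not Prodigy's actual implementation: the function names, the fixed `margin` cut-off and the `toy_score` stand-in for a model are all made up for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Task = Dict[str, str]


def mark_like_stream(examples: Iterable[Task]) -> Iterator[Task]:
    """Pass every example through in order, without scoring or filtering."""
    yield from examples


def teach_like_stream(
    examples: Iterable[Task],
    score: Callable[[Task], float],
    margin: float = 0.25,
) -> Iterator[Task]:
    """Keep only examples whose score is within `margin` of 0.5.

    Simplified stand-in for "skip very high or very low predictions";
    the fixed cut-off is an assumption made for this sketch.
    """
    for eg in examples:
        if abs(score(eg) - 0.5) <= margin:
            yield eg


def toy_score(eg: Task) -> float:
    # Hypothetical "model": longer text gets a higher score, capped at 1.0.
    return min(len(eg["text"]) / 10.0, 1.0)


# Eleven toy examples with scores 0.0, 0.1, ..., 1.0.
examples: List[Task] = [{"text": "a" * n} for n in range(11)]
marked = list(mark_like_stream(examples))
taught = list(teach_like_stream(examples, toy_score))

print(len(marked))  # 11: every example is shown
print(len(taught))  # 5: only the scores from 0.3 to 0.7 survive
```

So with a teach-style stream, seeing fewer tasks than there are records is expected, while a mark-style stream should hand over every record it receives.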
Two possible explanations and solutions I can think of:
- Can you check and make sure that your examples do not contain any duplicates? Prodigy will assign a unique input hash to each example that comes in, based on the properties (text, spans etc.) and filter out duplicates, to make sure you’re not annotating an example twice. So is it possible that your data contains 200 duplicate tasks?
- Are you setting the --memorize flag when using the mark recipe? Setting the flag will exclude all examples that were already annotated in the same dataset. For example, if you’ve already annotated 200 examples and stored them in a dataset, and then restart prodigy mark using the same dataset ID, the tasks you’ve already annotated will be skipped.
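If you want to check both explanations outside of Prodigy, something like the following sketch might help. It only compares the "text" field, which is an approximation: Prodigy's input hash can take other properties into account, so the numbers may not match exactly. The helper names and the toy data are made up for this example.

```python
import json
from collections import Counter
from typing import Iterable, List, Set


def load_texts(lines: Iterable[str]) -> List[str]:
    """Parse JSONL lines and return the "text" value of each record."""
    return [json.loads(line)["text"] for line in lines if line.strip()]


def count_duplicates(texts: List[str]) -> int:
    """Count records that repeat a text already seen earlier in the file."""
    counts = Counter(texts)
    return sum(n - 1 for n in counts.values())


def remaining_tasks(texts: List[str], already_annotated: Set[str]) -> List[str]:
    """Drop repeated texts and texts that were annotated before.

    Rough imitation of duplicate filtering plus the effect of --memorize,
    keyed on the text only (an assumption made for this sketch).
    """
    seen = set(already_annotated)
    queue = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            queue.append(text)
    return queue


# Toy data: 1000 records, of which 200 repeat an earlier text.
lines = [json.dumps({"text": f"example {i % 800}"}) for i in range(1000)]
texts = load_texts(lines)

print(count_duplicates(texts))             # 200
print(len(remaining_tasks(texts, set())))  # 800

# Pretend the first 200 distinct texts are already stored in the dataset.
annotated = {f"example {i}" for i in range(200)}
print(len(remaining_tasks(texts, annotated)))  # 600
```

To run it on your own data, pass an open file handle for your .jsonl file to `load_texts` instead of the toy `lines` list. If the duplicate count comes out at around 200, that would explain the missing tasks.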
Looks like your sentence was somehow cut off?