"No tasks available" even though there's plenty of samples left

Hi,

For my multi-class classification task, I’ve prepared 100 samples for each class and then started tagging. But after tagging 10-20 samples the system says “No tasks available”. What’s happening? I’ve attached one of my .jsonl files for reference.

Sample file: [Dropbox link]

[Screenshot: “No tasks available” system message]

p.s. Why can’t I attach a .jsonl file here? The editor says the extension is authorized, but I still can’t attach my sample file (so I used the Dropbox link instead).

Thanks for the report!

Just tested it with your example data and came across the same behaviour. There weren't any errors either, so I think what might be happening here is the following: textcat.teach scores the stream and tries to only show you the most relevant tasks. Since there are only 100 examples, the "most relevant" selection seems to be only about 10-20%, which means the stream is exhausted after 10-20 examples.

This is probably not ideal, and we'll think about the best way to handle it. JSONL is streamed in line by line, so Prodigy can't know upfront how many examples there are. But it could, for example, loop over the data again with a lower score threshold if the stream is exhausted and the number of examples collected so far is very low.
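
Just to illustrate that idea (purely hypothetical, not how Prodigy currently works): the fallback could look something like the sketch below, where scored_examples is a re-iterable list of (score, example) tuples and the threshold is halved on every pass.

# Hypothetical sketch only; not Prodigy's actual behaviour.
def collect_with_fallback(scored_examples, threshold=0.5, min_tasks=50):
    collected = []
    while len(collected) < min_tasks and threshold > 0.01:
        for score, eg in scored_examples:
            if score >= threshold and eg not in collected:
                collected.append(eg)
        threshold /= 2  # relax the threshold and scan the data again
    return collected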

Here are some potential solutions for now:

Label all examples without the active-learning component

If you want to label all examples in your dataset, try using the mark recipe instead, which disables the active learning component and sorting and simply lets you annotate all examples. You can save all annotations for the different labels to the same dataset, and when you're done, use textcat.batch-train to train your model.

prodigy mark your_dataset /path/to/file.jsonl -l LABEL -v classification

Use more examples

If you have enough data for each class, you don't necessarily have to extract a certain number first – you can simply stream in everything you have and let Prodigy take care of making the selection. If you need specific processing logic to convert your data to Prodigy's JSONL format, you can simply do this in Python and forward the data to Prodigy.

If you don't set the source argument on the command line, it defaults to stdin, which means you can pipe in data from any other source or script on the command line. Assuming you have a processing script like this:

import json

stream = load_my_texts_from_somewhere()  # your custom loading logic here
for text in stream:
    task = {'text': text}
    # print the JSON to make it available to the recipe via stdin
    print(json.dumps(task))

You can then use it with the recipe like this:

python my_processing_script.py | prodigy textcat.teach my_dataset en_core_web_sm --label MY_LABEL

Alternatively, you can also achieve the same result with a custom recipe that delegates to textcat.teach. All recipes do is return a simple dictionary of components (which is later interpreted by Prodigy when you run the recipe), so your custom recipe can simply call the textcat.teach function and return its output:

import prodigy
from prodigy.recipes.textcat import teach

@prodigy.recipe('custom_recipe')
def custom_recipe(dataset, model, label):
    stream = load_texts_and_process_them() # your custom logic here
    return teach(dataset, model, stream, label=label)

You can then use the recipe like this:

prodigy custom_recipe my_dataset en_core_web_sm MY_LABEL -F recipe.py

There should be a more detailed example of this in the documentation as well :blush:

This is weird – I'll double-check that the file upload permissions are set correctly. I specifically remember adding .jsonl to the list of allowed file types, because it seemed like something that would come up a lot.

I see. I’ll try with more samples first, as you suggested. Btw, please check the file attachment permissions in the editor again :slight_smile:

Great! Definitely keep us updated.

Just disabled a setting that would prevent new users from uploading attachments – so this might have been the problem in your case. Could you try uploading a file again and let me know if it worked?

Yeah it works!

apply for government job in india.jsonl (41.0 KB)

A quick note: multi-label text classification will likely suffer from the problem fixed here: https://github.com/explosion/spaCy/pull/1391

This fix should go live on spaCy nightly pretty soon. Until then, you might want to focus on training one label at a time.

Is there some way to turn that “most relevant” threshold all the way down to zero? Something I could do in a custom recipe?

I want to use active learning, but I want it to impose an ordering over my entire training set, not just some subset of it. I feel like a single ner.teach session should eventually cycle through every example in the training set if I have the patience to sit through it.

Thinking about it some more…

Oh, I think I understand. Are you trying to stream through the entire JSONL file with an O(1) memory requirement, so you set a threshold and only emit samples for annotation that are above that threshold? (So that you only ever have to have one piece of text in memory at a time.)

I guess the way to implement cycling through every example in the training set would be to just keep looping over the data again with an ever-lower threshold. (Maybe cut it in half after every epoch.)

I feel like I only see “No Tasks Available” after I have literally given an annotation decision for every span in every text. (So basically never.)

@wpm Yes, we assume the stream is larger than memory. We don’t take just one text, but instead a batch. You can get all the questions by removing the prefer_uncertain sorter.
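
For illustration, here is a minimal sketch of a teach-like recipe without the prefer_uncertain sorter. It assumes the Prodigy 1.x internals (the JSONL loader, the TextClassifier model and its (score, example) output), so check your version's recipe source for the exact signatures.

import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.models.textcat import TextClassifier

@prodigy.recipe('textcat.teach-all')
def teach_all(dataset, spacy_model, source, label):
    nlp = spacy.load(spacy_model)
    model = TextClassifier(nlp, label.split(','))
    stream = JSONL(source)
    # model(stream) yields (score, example) tuples; keep every example in
    # its original order instead of passing it through prefer_uncertain
    stream = (eg for score, eg in model(stream))
    return {'view_id': 'classification', 'dataset': dataset,
            'stream': stream, 'update': model.update}

You could then run it like the other custom recipe above, e.g. prodigy textcat.teach-all my_dataset en_core_web_sm data.jsonl MY_LABEL -F recipe.py.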

Finally, the other thing to do is just repeat the stream! You can do this by wrapping your stream in itertools.cycle. However, note that the task hashes will still prevent you from being asked the exact same question twice. If you want to remove that constraint as well, you could reset the task hash so that it becomes sensitive to the number of repetitions over the stream. Something like this:

from prodigy.util import set_hashes

def infinite_stream(stream):
    # stream needs to be re-iterable (e.g. a list), not an exhausted generator
    epoch = 0
    while True:
        for eg in stream:
            eg['epoch'] = epoch
            # include the epoch in the hash so repeated questions aren't filtered out
            eg = set_hashes(eg, input_keys=('text', 'epoch'), overwrite=True)
            yield eg
        epoch += 1

Note that you should do this after the model, not before it. The models reset the hashes, so the above code won’t work if you do it on the input stream. Possibly we should add an attribute with the hash keys to the models?

I have loaded 15K text examples to use with ner.teach. After 85 annotations and around 44% (???) I get “No tasks available”. The first time I tried, I only got up to 41%.

One thing that's important to keep in mind is that the progress in ner.teach doesn't mean the percentage of examples you've annotated in relation to your corpus (Prodigy has no way of knowing the total, because streams are generators). The progress you're seeing is an estimation of when the loss will hit 0 and when you've annotated enough examples. (You can think of it as "When has my model learned everything it can learn from my data?").

When using an active learning-powered recipe, you're only seeing a selection of the examples (e.g. the ones that the model is most uncertain about). Depending on the model's predictions, a large number of examples may be skipped. So it's definitely possible that your stream is exhausted earlier.

85 out of 15k is definitely very low, though – assuming that your 15k texts include enough candidates. Did you use patterns with ner.teach and are you training a new category, or improving an existing one?

I am training a new category and I have loaded ~60 patterns as JSONL.

I am thinking that the problem could come from the model I used. I did not use the PubMed one, as it causes some kind of buffer overflow (I have written about it in another thread). I have tried the en, en_core_web_sm and en_core_web_lg models. The larger one gave me more training examples.

What I found weird is that I can only get to ~40% before it says there are no more examples, and it doesn't let me get to 100% of the training.

If I try to export the training result, it performs really badly. (I did a batch train.)

The batch train is also weird: I use the saved training data and I only have 7 to 10 examples…

Anyway, I'll try more tomorrow.

I managed to go through with this.

First I split 5k text examples into individual sentences, and Prodigy let me annotate more than 2,000 examples.

With the larger texts it actually worked a little better with the new Prodigy version 1.5.1. What happens now is that I can annotate 10 examples, then it says there are no more examples; I save and refresh and I get 10 more examples. Those 10 examples are new ones.

Basically it works; I guess it creates a new session every time I save/refresh.

Anyway, thank you. Prodigy is really useful!

Seems that putting itertools.cycle in the loop either before or after the model indeed repeats the stream forever, but it definitely also shows the same examples multiple times… In fact I end up with multiple rows with the same task_hash in the database - it doesn’t update the row, it inserts another row. How would I add a hash-based check in the recipe to avoid seeing the same examples?

I see the “mark” recipe does that. I guess I would have to bring the hash-checking parts over into a custom version of textcat.teach, correct?

Yes, I think the most efficient way would be to create a set of seen hashes, populate them once on startup via the db.get_task_hashes method, make sure your stream sets hashes (if they’re not yet available in the data) and then add them to the set of seen hashes as you iterate over the examples. (You definitely want to avoid making too many database calls within your stream.)

For example, something like this should work:

from prodigy.components.db import connect
from prodigy import set_hashes

def filter_seen(dataset, stream):
    # populate the set of seen hashes once on startup
    db = connect()
    seen = set(db.get_task_hashes(dataset))
    # stream can be the infinite_stream generator from above
    for eg in stream:
        eg = set_hashes(eg)  # make sure the example has hashes set
        if eg['_task_hash'] not in seen:
            seen.add(eg['_task_hash'])
            yield eg

While the filtering part seems to work fine, putting an itertools.cycle(stream) before the filtering seems to just hang on a small dataset after pulling ~20 examples…

You might find it easier to figure out what’s going wrong if you do the cycle manually, like this:

seen = set()
while True:
    for eg in data:  # data should be a list, not an exhausted generator
        if eg['_task_hash'] not in seen:
            yield eg
            seen.add(eg['_task_hash'])

This should make it easier to add print statements to figure out why the generator stops yielding you examples.

I ended up building an internal cache that memorizes all examples it’s seen until “update” is called on them. If the input stream runs out of data, it cycles through the cached examples. It works until you get to the last 2 examples, because the autosave is 2 examples behind in batch_size=1 mode.

I also had to take the model out of the loop, because it seems that if the model deems the whole current batch “don’t show this”, the app hangs.
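
Roughly, the caching idea looks like this (a sketch with a hypothetical RecyclingStream class, not actual Prodigy API; in a real recipe you'd still need to chain its update() with the model's update, and the examples are assumed to already have hashes set):

class RecyclingStream:
    def __init__(self, source):
        self.source = iter(source)
        self.cache = {}  # _task_hash -> example, kept until update() sees it

    def __iter__(self):
        while True:
            served = False
            for eg in self.source:  # yields nothing once the source is exhausted
                self.cache[eg['_task_hash']] = eg
                served = True
                yield eg
            for eg in list(self.cache.values()):  # recycle unanswered examples
                served = True
                yield eg
            if not served:  # nothing left to serve at all
                return

    def update(self, answers):
        # called with annotated answers; drop them from the cache
        for eg in answers:
            self.cache.pop(eg['_task_hash'], None)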

Good afternoon, Ines!

I have the same issue as [kyoungrok_jang].

The command was:

prodigy textcat.teach dataset_name en_core_web_lg ./5461_dataset_prodigy.jsonl --label "L1","L2",..."L262"

I tried all the ways that you suggested, but none of them works. In particular, when I try to pipe data from stdin I get the following error:

prodigy textcat.teach: error: the following arguments are required: source

Then I tried to reload the page and continue working, but "No tasks available" appeared after each piece of text.
In case it helps:

  • I have 262 labels ( --label "L1","L2",..."L262")
  • I have 5461 pieces of text in jsonl format (attached file below)

Another thing that surprised me was that after each reload I got the same piece of text, but with a different suggested label. So one piece of text was repeated many times and I ended up with a lot of labels per text. Because of that, the model became imbalanced, with lots of rejected answers and only a few accepted ones.

Thank you for any reply,
Yaroslav

5461_dataset_prodigy.jsonl (490.1 KB)

@YarrDOpanas Which version of Prodigy are you using?

To read from stdin, you'll need to set the source argument to a -. For example:

cat ./5461_dataset_prodigy.jsonl  | prodigy textcat.teach dataset_name en_core_web_lg - --label L1,L2,...

To some extent, this is part of the concept of textcat.teach: the model produces predictions for the given labels and the sorter function selects the most relevant examples to annotate. By default, those are the ones with the most uncertain scores, where the decision makes the biggest difference. I've shared some more details on this in this thread: using sorters (prefer_uncertain or prefer_high_scores) result in prodigy showing me the same data samples with different predictions - #2 by ines
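
For reference, this is roughly how the sorter is wired up inside a teach-style recipe (a Prodigy 1.x-style sketch, where model is the recipe's TextClassifier and stream is the loaded input):

from prodigy.components.sorters import prefer_uncertain

def sort_stream(model, stream):
    # model(stream) yields (score, example) tuples; prefer_uncertain emits
    # the examples whose scores are closest to 0.5, i.e. the most uncertain ones
    return prefer_uncertain(model(stream))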

However, if you're starting from scratch with a model that knows nothing, it will take a while until it can make meaningful suggestions. 262 labels is an unusually large label scheme for text classification, at least at the top level. The model will need to see enough examples to make meaningful suggestions for all 262 labels, which is going to be very difficult from a cold start, with imbalanced classes and only 5461 raw texts to choose from.

Are your labels hierarchical? If so, you could split your classifier into multiple steps: start by predicting the top-level categories, then train separate classifiers for the different subcategories (e.g. given that a text is about sport, is it about football?). This is likely going to be much easier to learn.
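
As a rough sketch (with hypothetical label names and file paths), the two-step setup could look like this:

# Step 1: annotate only the top-level categories
prodigy textcat.teach top_level_dataset en_core_web_lg ./5461_dataset_prodigy.jsonl --label SPORT,POLITICS,ECONOMY

# Step 2: for texts accepted as SPORT, annotate the subcategories in a separate dataset
prodigy textcat.teach sport_dataset en_core_web_lg ./sport_texts.jsonl --label FOOTBALL,TENNIS,BASKETBALL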