"No tasks available" even though there's plenty of samples left

I see. I’ll try with more samples first, as you suggested. Btw, please check the file attachment permissions in the editor again :slight_smile:

Great! Definitely keep us updated.

Just disabled a setting that would prevent new users from uploading attachments – so this might have been the problem in your case. Could you try uploading a file again and let me know if it worked?

Yeah it works!

apply for government job in india.jsonl (41.0 KB)

A quick note: multi-label text classification will likely suffer from the problem fixed here: https://github.com/explosion/spaCy/pull/1391

This fix should go live on spaCy nightly pretty soon. Until then, you might want to focus on training one label at a time.
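
For example, a session restricted to a single label might look like this (the dataset, model and file names here are placeholders, not from the thread):

prodigy textcat.teach my_dataset en_core_web_sm ./data.jsonl --label MY_LABEL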

Is there some way to turn that “most relevant” threshold all the way down to zero? Is that something I could do in a custom recipe?

I want to use active learning, but I want it to impose an ordering over my entire training set, not just some subset of it. I feel like a single ner.teach session should eventually cycle through every example in the training set if I have the patience to sit through it.

Thinking about it some more…

Oh, I think I understand. Are you trying to stream through the entire JSONL file with an O(1) memory requirement, so you set a threshold and then only emit samples for annotation for a given text that are above that threshold? (So that you only ever have to have one piece of text in memory at a time.)

I guess the way to implement cycling through every example in the training set would be to just keep looping over the data with an ever-lower threshold (maybe cut it in half after every epoch).
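
A minimal sketch of that idea, assuming the data fits in a list and a hypothetical score_example function that returns the model's score for an example (both names are made up for illustration):

def decaying_threshold_stream(data, threshold=0.5, min_threshold=0.001):
    # Keep looping over the data, halving the threshold each epoch,
    # so every example is eventually emitted for annotation.
    # Note: `data` must be re-iterable (e.g. a list), not a one-shot generator.
    while threshold > min_threshold:
        for eg in data:
            if score_example(eg) >= threshold:  # hypothetical scorer
                yield eg
        threshold /= 2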

I feel like I only see “No Tasks Available” after I have literally given an annotation decision for every span in every text. (So basically never.)

@wpm Yes, we assume the stream is larger than memory. We don’t take just one text, but instead a batch. You can get all the questions by removing the prefer_uncertain sorter.
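
In a custom copy of the recipe, that might look roughly like this (a sketch based on how the built-in teach recipes wire up the model and sorter):

from prodigy.components.sorters import prefer_uncertain

# The default pipeline in the teach recipes is roughly:
#     stream = prefer_uncertain(model(stream))
# Without the sorter, the model yields (score, example) tuples,
# so unpack them yourself to get every question in order:
stream = (eg for score, eg in model(stream))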

Finally, the other thing to do is just repeat the stream! You can do this by wrapping your stream in itertools.cycle. However, note that the task hashes will still prevent you from being asked the exact same question twice. If you want to remove that constraint as well, you could reset the task hash so that it becomes sensitive to the number of repetitions over the stream. Something like this:

from prodigy.util import set_hashes

def infinite_stream(stream):
    # Note: `stream` must be re-iterable (e.g. a list) for more than one pass
    epoch = 0
    while True:
        for eg in stream:
            eg['epoch'] = epoch
            eg = set_hashes(eg, input_keys=('text', 'epoch'), overwrite=True)
            yield eg
        epoch += 1  # count repetitions, so each pass gets new hashes

Note that you should do this after the model, not before it. The models reset the hashes, so the above code won’t work if you do it on the input stream. Possibly we should add an attribute with the hash keys to the models?
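
Concretely, the ordering in a custom recipe would be roughly this (using the same names as the built-in teach recipes; the sorter has to come first anyway, since it consumes the model's (score, example) tuples):

stream = prefer_uncertain(model(stream))  # the model resets the hashes here
stream = infinite_stream(stream)          # so re-hash afterwards, not before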

I have loaded 15K text examples to use with ner.teach. After 85 annotations and at around 44% progress (???) I get "No tasks available". The first time I tried, I got up to 41%.

One thing that's important to keep in mind is that the progress in ner.teach isn't the percentage of examples you've annotated in relation to your corpus (Prodigy has no way of knowing the total, because streams are generators). The progress you're seeing is an estimate of when the loss will hit 0, i.e. when you've annotated enough examples. (You can think of it as "When has my model learned everything it can learn from my data?")

When using an active learning-powered recipe, you're only seeing a selection of the examples (e.g. the ones that the model is most uncertain about). Depending on the model's predictions, a large number of examples may be skipped. So it's definitely possible that your stream is exhausted earlier.

85 out of 15k is definitely very low, though – assuming that your 15k texts include enough candidates. Did you use patterns with ner.teach and are you training a new category, or improving an existing one?

I am training a new category and I have loaded ~60 patterns as JSONL.

I am thinking that the problem could come from the model I used. I did not use the Pubmed one, as it causes some kind of buffer overflow (I have written about it in another thread). I have tried the en, en_core_web_sm and en_core_web_lg models. The larger one gave me more training examples.

What I found weird is that I can only get to ~40%, and then it says there are no training examples and does not let me go to 100% of the training.

If I try to export the training result, it performs really badly. (I did a batch train.)

The batch train is also weird: I use the saved training data and end up with only 7 to 10 examples…

Anyway, I'll try more tomorrow.

I managed to go through with this.

First I split the 5k text examples into individual sentences, and Prodigy let me annotate more than 2,000 examples.

With the larger texts, it actually worked a little better with the new Prodigy version 1.5.1. What happens now is that I can annotate 10 examples, then it says there are no more examples; I save and refresh, and I get 10 more examples. Those 10 examples are new ones.

Basically it works; I guess it creates a new session every time I save/refresh.

Anyway, thank you. Prodigy is really useful.

It seems that putting itertools.cycle in the loop either before or after the model indeed repeats the stream forever, but it also shows the same examples multiple times… In fact, I end up with multiple rows with the same task_hash in the database – it doesn’t update the row, it inserts another row. How would I add a hash-based check in the recipe to avoid seeing the same examples?

I see the “mark” recipe does that – I guess I would have to bring the hash-checking parts over into a custom version of textcat.teach, correct?

Yes, I think the most efficient way would be to create a set of seen hashes, populate them once on startup via the db.get_task_hashes method, make sure your stream sets hashes (if they’re not yet available in the data) and then add them to the set of seen hashes as you iterate over the examples. (You definitely want to avoid making too many database calls within your stream.)

For example, something like this should work:

from prodigy.components.db import connect
from prodigy import set_hashes

def filter_seen(stream, dataset):
    # Populate the set of seen hashes once on startup
    db = connect()
    seen = set(db.get_task_hashes(dataset))
    for eg in stream:
        eg = set_hashes(eg)  # sets hashes if the data doesn't have them yet
        if eg['_task_hash'] not in seen:
            seen.add(eg['_task_hash'])
            yield eg

While the filtering part seems to work fine, putting an itertools.cycle(stream) before the filtering seems to just hang on a small dataset after pulling ~20 examples…

You might find it easier to figure out what’s going wrong if you do the cycle manually, like this:

def cycle_unseen(data):
    # Note: `data` must be re-iterable (e.g. a list), not a one-shot generator
    seen = set()
    while True:
        for eg in data:
            if eg['_task_hash'] not in seen:
                yield eg
                seen.add(eg['_task_hash'])

This should make it easier to add print statements to figure out why the generator stops yielding you examples.

I ended up building an internal cache that memorizes all examples it’s seen until “update” is called on them. If the input stream runs out of data, it cycles through the cached examples. It works until you get to the last 2 examples, because the autosave is 2 examples behind in batch_size=1 mode.
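
A rough sketch of that caching idea (the class and its wiring are my reconstruction, not the poster's actual code):

class StreamCache:
    def __init__(self, stream):
        self.stream = iter(stream)
        self.pending = {}  # task hash -> example still awaiting an update

    def __iter__(self):
        for eg in self.stream:
            self.pending[eg["_task_hash"]] = eg
            yield eg
        while self.pending:  # input exhausted: cycle the cached examples
            for eg in list(self.pending.values()):
                yield eg

    def update(self, answers):
        # Hook this up as the recipe's "update" callback, so answered
        # examples are dropped from the cache
        for eg in answers:
            self.pending.pop(eg["_task_hash"], None)

In the recipe's return dict, the same object would then serve as both the "stream" and the "update" callback.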

I also had to take the model out of the loop, because it seems that if the model deems the whole current batch "don't show this", the app hangs.

Good afternoon, Ines!

I have the same issue as @kyoungrok_jang.

The command was
prodigy textcat.teach dataset_name en_core_web_lg ./5461_dataset_prodigy.jsonl --label "L1","L2",..."L262"
I tried all the ways that you suggested, but none of them worked. In particular, when I try to pipe data from stdin I get the following error:
prodigy textcat.teach: error: the following arguments are required: source

Then I tried to reload the page and continue working on it, but "No tasks available" appeared after each piece of text.
If necessary:

  • I have 262 labels ( --label "L1","L2",..."L262")
  • I have 5461 pieces of text in jsonl format (attached file below)

Another thing that surprised me was that after each reload I got the same piece of text but with another suggested label. So one piece of text was repeated many times, and as a result I had a lot of labels per text. As a result, the model became imbalanced because of lots of rejected answers and few accepted ones.

Thank you for any reply,
Yaroslav

5461_dataset_prodigy.jsonl (490.1 KB)

@YarrDOpanas Which version of Prodigy are you using?

To read from stdin, you'll need to set the source argument to a -. For example:

cat ./5461_dataset_prodigy.jsonl  | prodigy textcat.teach dataset_name en_core_web_lg - --label L1,L2,...

To some extent, this is part of the concept of textcat.teach: the model will produce predictions for the given labels, and the sorter function will select the most relevant examples to annotate. By default, that's the examples with the most uncertain scores, i.e. where the decision makes the biggest difference. I've shared some more details on this in this thread: using sorters (prefer_uncertain or prefer_high_scores) result in prodigy showing me the same data samples with different predictions - #2 by ines
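
To illustrate the idea (a simplification, not Prodigy's actual internals), an uncertainty sorter ranks examples by how close their score is to 0.5:

def uncertainty(score):
    # Highest for scores near 0.5, where accept/reject is least clear;
    # 0.0 for confident scores near 0.0 or 1.0
    return 1.0 - abs(score - 0.5) * 2.0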

However, if you're starting from scratch with a model that knows nothing, it will take a while until it can make meaningful suggestions. 262 labels is an unusually large label scheme for text classification, at least at the top level. The model will need to see enough examples to make meaningful suggestions for all 262 labels, which is going to be very difficult given a cold start, imbalanced classes and only 5461 raw texts to choose from.

Are your labels hierarchical? If so, you could split the task into multiple steps: start by predicting the top-level categories, then train separate classifiers for the different subcategories (e.g. given that the text is about sport, is it about football?). This is likely going to be much easier to learn.
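
For example, the two-step workflow could look like this (the label names, datasets and files here are hypothetical):

# Step 1: annotate and train the top-level categories
prodigy textcat.teach top_level_data en_core_web_lg ./texts.jsonl --label SPORT,POLITICS,CULTURE
# Step 2: one separate session per top-level category for its subcategories
prodigy textcat.teach sport_data en_core_web_lg ./sport_texts.jsonl --label FOOTBALL,TENNIS,CRICKET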

Many thanks for the quick reply and the advice, Ines.

I have Prodigy version 1.10.7. Your command with the - argument worked fine for me.
Yes, my labels are hierarchical, so I'll do as you recommended.

Hi,
I am using a custom recipe for multi-label text classification, but I am getting a similar "No tasks available" issue after a few annotations.

Could you please check my recipe?

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "article_cat",
    dataset=("The dataset to save to", "positional", None, str),
    file_path=("Path to texts", "positional", None, str),
)
def article_cat(dataset, file_path):
    """Annotate the sentiment of texts using different mood options."""
    stream = JSONL(file_path)  # load in the JSONL file
    stream = add_options(stream)  # add options to each task
    blocks = [
        {"view_id": "html"},
        {"view_id": "text"},
        {"view_id": "choice", "text": None, "html": None},
    ]
    return {
        "dataset": dataset,  # save annotations in this dataset
        "view_id": "blocks",  # set the view_id to "blocks"
        "stream": list(stream),
        "config": {
            "blocks": blocks,  # add the blocks to the config
            # "html_template": html_temp
        },
    }

def add_options(stream):
    # Helper function to add options to every task in a stream
    options = [
        {"id": "1", "text": "A"},
        {"id": "2", "text": "B"},
        {"id": "3", "text": "C"},
        {"id": "4", "text": "D"},
    ]
    for task in stream:
        task["options"] = options
        yield task