I have a CSV file with around 3600 lines of unlabelled data and I'm using ner.manual to label it. After about 2800 examples, Prodigy tells me "No tasks available". Is there a way to debug whether there are issues parsing the CSV? (pandas is able to parse it.)
Would you have any idea why?
I don't have the official answer, but I have this problem whenever I read a file into Prodigy. What I have started doing is padding the end of the file with a few hundred "documents" that have all the required fields and just say something like "FILLER MESSAGE IGNORE". Whenever my annotators get there, they stop, knowing the actual dataset is finished.
It's a little hacky, but while we wait for a real answer, it works.
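For anyone who wants to try the same workaround, here is a minimal sketch of the padding step. The function name `pad_csv`, the column name `text`, and the numbered filler text are assumptions for illustration and not part of Prodigy. The filler rows are numbered so that they are not identical to each other, in case identical texts get filtered out as duplicates.

```python
import csv

def pad_csv(src, dst, n_filler=200, text_field="text",
            filler="FILLER MESSAGE IGNORE"):
    """Copy src to dst and append n_filler clearly marked filler rows."""
    with open(src, newline="", encoding="utf8") as f_in:
        reader = csv.DictReader(f_in)
        fieldnames = reader.fieldnames
        rows = list(reader)
    with open(dst, "w", newline="", encoding="utf8") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        for i in range(1, n_filler + 1):
            # Fill the text column with a numbered marker, leave the rest empty
            writer.writerow({
                name: f"{filler} {i}" if name == text_field else ""
                for name in fieldnames
            })
    return len(rows) + n_filler
```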
Hi! The CSV loader mostly just calls csv.DictReader and then iterates over the rows – so if there was a parsing issue, you should see an error.
The most common reasons why you wouldn't see all examples are:
duplicate records in the data or rows with an empty text/Text column
the dataset you're using already contains an annotation for an example with the same text
new batches of examples are requested and not submitted (e.g. if you refresh the browser or multiple people access the same session) – if this is a problem, try setting "force_stream_order": true in your prodigy.json, which will enforce the exact order of examples and re-send batches until they're answered
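To check the first of these reasons against your own file, a rough sketch like the following counts empty and duplicate texts using only the standard library. The helper name `audit_csv` is made up for this example, and it only looks for exact duplicates after stripping whitespace, so it may not match Prodigy's own hashing exactly.

```python
import csv
from collections import Counter

def audit_csv(path, text_field="text"):
    """Count rows, empty texts and exact duplicate texts in a CSV file."""
    total = 0
    empty = 0
    counts = Counter()
    with open(path, newline="", encoding="utf8") as f:
        for row in csv.DictReader(f):
            total += 1
            # Accept either "text" or "Text" as the column name
            value = row.get(text_field) or row.get(text_field.capitalize()) or ""
            value = value.strip()
            if not value:
                empty += 1
            else:
                counts[value] += 1
    # Every repeat beyond the first occurrence counts as one duplicate
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {"total": total, "empty": empty,
            "duplicates": duplicates, "unique": len(counts)}
```

If `unique` comes out close to the number of examples you were actually shown, duplicates and empty rows probably explain the gap.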
That's interesting! So does it seem like the problem is that it's just cut off after X% of the stream, and adding the padding solves that because it ensures the cutoff doesn't happen before the actual examples are sent out?
It's pretty mysterious, because the stream is really just a Python generator. The "No tasks available" message is shown if the server returns an empty batch – so typically when there are no more examples left. So I don't understand how there would just randomly not be a next batch... Really want to get to the bottom of this.
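To illustrate what "the server returns an empty batch" means, here is a simplified sketch of batching over a generator. This is not Prodigy's actual server code, and `get_batch` is a made-up name; it only shows that an empty batch normally appears once the generator is exhausted.

```python
from itertools import islice

def get_batch(stream, batch_size=10):
    """Take up to batch_size items; returns [] once the generator is exhausted."""
    return list(islice(stream, batch_size))

# A stand-in stream of 25 examples
stream = ({"text": f"example {i}"} for i in range(25))

sizes = []
while True:
    batch = get_batch(stream)
    sizes.append(len(batch))
    if not batch:
        # This is the point where the app would show "No tasks available"
        break

print(sizes)  # [10, 10, 5, 0]
```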
What's your setup like, and are you using multi-user sessions?
Then I press 'a' to validate annotations (not necessarily labelling anything), save every 10 inputs and make sure I don't go too fast, otherwise the browser complains. After about 2900, I get "No tasks available".
I also tried going through 1000 examples, then CTRL-C to save. It showed 1000 added. I went through 1000 more, then CTRL-C to save. It showed 2000 total. I went through 1000 more, and after about 900, "No tasks available".
My setup is a bit wacky (using prodigy 1.4.2, for starters). Based on what you said, I'd guess our problem is with different users accessing our application, refreshing the data, and so some batches go missing. I'll try out the force stream order though!
I will add though, that recently I was locally annotating 100 examples and prodigy would consistently be unable to save the 100th example. It was odd, not sure if it was a problem connecting to the database or what, but it was always on the very last one.
So what did you run and how did you find the 4083 number? Did you consume the whole stream generator, or did you go through all the examples? If so, what are the other 1000 records?
The CSV loader is really just a wrapper around the built-in csv.DictReader and doesn't do anything special. You could try running it over your file and check the number it loads:
from prodigy.components.loaders import CSV
stream = list(CSV("/path/to/your/file"))  # consume the whole generator
print(len(stream))  # number of examples the loader produced
If you're using a recipe that splits sentences and have sentence segmentation enabled, that would explain why the total number of examples presented can be higher than the number of texts in the input data. But that's not the case for ner.manual.
Ah, okay, so that'd mean you were presented with duplicates somewhere during the annotation process.
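One way to confirm this is to count how often each task hash occurs in the saved annotations. The sketch below works on a plain list of example dicts; the helper name `duplicate_task_hashes` is made up, and the commented-out lines for loading the dataset assume the `connect()` and `get_dataset()` database helpers and a placeholder dataset name.

```python
from collections import Counter

def duplicate_task_hashes(examples):
    """Return {task_hash: count} for hashes that occur more than once."""
    counts = Counter(eg["_task_hash"] for eg in examples if "_task_hash" in eg)
    return {h: n for h, n in counts.items() if n > 1}

# With Prodigy installed, the examples could be loaded like this (assumption):
# from prodigy.components.db import connect
# examples = connect().get_dataset("your_dataset")

# Stand-in data so the sketch runs on its own
examples = [
    {"text": "a", "_task_hash": 1},
    {"text": "b", "_task_hash": 2},
    {"text": "a", "_task_hash": 1},
]
print(duplicate_task_hashes(examples))  # {1: 2}
```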
How did you annotate? Did you do it all yourself, or did you have other annotators? Did they use named sessions? Did you do the annotation in multiple sessions and restart the server? Did you have force_stream_order enabled?
Okay, I think I have a theory then. The reason I was asking about this was because forcing the stream order depends on sending the hashes of the current tasks back to the server, and checking if they have already been sent back. So if you just hold down a button and annotate super fast, you may end up with a race condition where the app requests batch 1, batch 1 goes out and the app already requests batch 2, before having received batch 1. So the server receives a request for batch 2 without info about batch 1. So it thinks batch 1 isn't there (which would happen if you refresh the app), so it sends it out again. This happens infrequently, which is why you only end up with some batches duplicated.
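Here is a toy model of that theory, just to make the sequence of events concrete. It is not Prodigy's implementation: `ToyServer`, `annotate` and the `racy` parameter are all hypothetical, and the "hashes" are simply the example values themselves.

```python
class ToyServer:
    """Resends any batch the client does not report having received."""

    def __init__(self, examples, batch_size=2):
        self.queue = list(examples)
        self.batch_size = batch_size
        self.sent = []  # batches that have gone out so far

    def get_questions(self, client_has):
        known = set(client_has)
        for batch in self.sent:
            # Client doesn't seem to have this batch: assume it was lost
            if not set(batch) <= known:
                return batch
        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]
        if batch:
            self.sent.append(batch)
        return batch


def annotate(examples, n_requests, racy=frozenset()):
    """Simulate a client; requests listed in `racy` fire before the
    previous response has arrived, so they report stale information."""
    server = ToyServer(examples)
    received = []
    snapshot = []  # what the client had before the latest response
    for i in range(n_requests):
        report = snapshot if i in racy else received
        batch = server.get_questions(report)
        snapshot = list(received)
        received.extend(batch)
    return received


print(annotate(list("abcdef"), 3))            # ['a', 'b', 'c', 'd', 'e', 'f']
print(annotate(list("abcdef"), 4, racy={1}))  # ['a', 'b', 'a', 'b', 'c', 'd', 'e', 'f']
```

In the second run, the request that fires too early makes the toy server send the first batch again, which is how duplicates could end up in the dataset.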
If this is the case, at least the good news is that this is unlikely to happen in a real-world setting where you're annotating at a more "human" speed of ~1 second or more per annotation. But we'll still investigate this and see if we can find a good solution that prevents the problem entirely.