ner.manual not going through all annotations in a CSV file

srandoux · February 26, 2020, 9:16pm

Hi,

I have a CSV file with around 3600 lines of unlabelled data and I use ner.manual for labeling. After about 2800, Prodigy tells me "No tasks available". Is there a way to debug whether there are issues parsing the CSV? (pandas is able to parse it.)
Would you have any idea why?

Thanks

araykhel · February 26, 2020, 9:58pm

I don't have the official answer, but I have this problem whenever reading in a file to prodigy. What I have started doing is padding the file at the end with ~a few hundred "documents" that have all the required fields and just says something like "FILLER MESSAGE IGNORE" and whenever my annotators get there, they stop, knowing the actual dataset is finished.

It's a little hacky, but while we wait for a real answer, it works.

ines · February 27, 2020, 2:42pm

Hi! The CSV loader mostly just calls csv.DictReader and then iterates over the rows – so if there was a parsing issue, you should see an error.

The most common reasons why you wouldn't see all examples are:

* duplicate records in the data or rows with an empty text/Text column
* the dataset you're using already contains an annotation for an example with the same text
* new batches of examples are requested and not submitted (e.g. if you refresh the browser or multiple people access the same session) – if this is a problem, try setting "force_stream_order": true in your prodigy.json, which will enforce the exact order of examples and re-send batches until they're answered
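The first reason in the list can be checked outside Prodigy. Here is a minimal sketch, assuming the text lives in a column named text or Text; the helper name audit_rows and the sample data are made up for illustration:

```python
import csv
import io
from collections import Counter


def audit_rows(file_obj, text_keys=("text", "Text")):
    """Count total rows, rows with empty text, and repeated texts."""
    total = 0
    empty = 0
    counts = Counter()
    for row in csv.DictReader(file_obj):
        total += 1
        # Take the first text column that is present and non-empty.
        text = next((row[k] for k in text_keys if row.get(k)), "")
        if not text.strip():
            empty += 1
        else:
            counts[text] += 1
    # Every occurrence beyond the first counts as one duplicate.
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {"total": total, "empty": empty, "duplicates": duplicates}


# Tiny in-memory sample (hypothetical data). For a real file, pass
# open("my.csv", newline="", encoding="utf8") instead.
sample = io.StringIO("id,text\n1,hello\n2,world\n3,hello\n4,\n")
print(audit_rows(sample))
```

If the total minus the empty and duplicate rows roughly matches the number of examples you were shown, the input data is the likely explanation.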

That's interesting. So does it seem like the problem is that it's just cut off after X% of the stream, and adding the padding solves that because it ensures the cutoff doesn't happen before the actual examples are sent out?

It's pretty mysterious, because the stream is really just a Python generator. The "No tasks available" message is shown if the server returns an empty batch – so typically when there are no more examples left. So I don't understand how there would just randomly not be a next batch... Really want to get to the bottom of this.
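To make the "empty batch" point concrete, here is a toy sketch of pulling fixed-size batches from a generator. It is not Prodigy's implementation; get_batch and the batch size of 10 are hypothetical.

```python
from itertools import islice


def get_batch(stream, batch_size=10):
    """Take up to batch_size items from an iterator; empty list once exhausted."""
    return list(islice(stream, batch_size))


# A generator standing in for the example stream (hypothetical data).
stream = ({"text": f"example {i}"} for i in range(25))
sizes = []
while True:
    batch = get_batch(stream)
    if not batch:
        # An empty batch is the point where a UI would report no tasks left.
        break
    sizes.append(len(batch))
print(sizes)  # [10, 10, 5]
```

In this model, an empty batch can only come from an exhausted generator, which is why a premature "No tasks available" is surprising.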

What's your setup like, and are you using multi-user sessions?

srandoux · February 27, 2020, 3:52pm

For me:

* No duplicate records
* No empty text column
* No already annotated annotations
* Browser is not refreshed
* And single user session.

To reproduce (I did not check the minimal amount, but after 2000, it was ok and after 3000 not):

prodigy ner.manual mydb blank:en my.csv --label mylabel

Then I press 'a' to validate annotations (not necessarily labelling anything), save every 10 inputs and make sure I don't go too fast, or else the browser complains. After about 2900, I get "No tasks available".

I also tried to:

* go through 1000 examples, CTRL-C to save. It showed 1000 added.
* go through 1000 more, CTRL-C to save. It showed 2000 total.
* go through 1000 more; after about 900, "No tasks available"

araykhel · February 27, 2020, 4:51pm

My setup is a bit wacky (using prodigy 1.4.2, for starters). Based on what you said, I'd guess our problem is with different users accessing our application, refreshing the data, and so some batches go missing. I'll try out the force stream order though!

araykhel · February 27, 2020, 4:55pm

I will add though, that recently I was locally annotating 100 examples and prodigy would consistently be unable to save the 100th example. It was odd, not sure if it was a problem connecting to the database or what, but it was always on the very last one.

srandoux · March 17, 2020, 2:45pm

I tried with Prodigy 1.9.8, and now it generates more examples than there are in my CSV input file... I have 3010 in the input file and it found 4083 in ner.manual.

ines · March 17, 2020, 3:00pm

So what did you run and how did you find the 4083 number? Did you consume the whole stream generator, or did you go through all the examples? If so, what are the other 1000 records?

The CSV loader is really just a wrapper around the built-in csv.DictReader and doesn't do anything special. You could try running it over your file and check the number it loads:

from prodigy.components.loaders import CSV

stream = list(CSV("/path/to/your/file"))
print(len(stream))

If you're using a recipe that splits sentences and have sentence segmentation enabled, that would explain why the total number of examples presented can be higher than the number of texts in the input data. But that's not the case for ner.manual.

srandoux · March 17, 2020, 3:12pm

Yes, I went through all the examples and it found more than are in the actual CSV (if I export with db-out, the export has duplicates, whereas my original does not).

If I use the code above, it returns 3010, the correct value.
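The duplicates in the export can be listed with a short script. This is a sketch that assumes the db-out export is JSONL with one JSON object per line and a "text" key on each; the sample lines below are invented stand-ins for a real export.

```python
import json
from collections import Counter


def repeated_texts(lines):
    """Return {text: count} for texts that appear more than once in JSONL lines."""
    counts = Counter(json.loads(line)["text"] for line in lines if line.strip())
    return {text: n for text, n in counts.items() if n > 1}


# In-memory stand-in for an exported file (hypothetical data).
# With a real export, pass an open file handle instead of this list.
exported = [
    '{"text": "first", "answer": "accept"}',
    '{"text": "second", "answer": "accept"}',
    '{"text": "first", "answer": "accept"}',
]
print(repeated_texts(exported))
```

Comparing the repeated texts against their position in the original CSV would show whether whole batches were repeated or only single examples.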

ines · March 17, 2020, 4:20pm

Ah, okay, so that'd mean you were presented with duplicates somewhere during the annotation process.

How did you annotate? Did you do it all yourself, or did you have other annotators? Did they use named sessions? Did you do the annotation in multiple sessions and restart the server? Did you have force_stream_order enabled?

srandoux · March 17, 2020, 4:24pm

Yes, single annotator. Single Session.

here is my prodigy.json file:

{
"feed_overlap": false,
"force_stream_order": true
}

srandoux · March 17, 2020, 5:05pm

If I set "force_stream_order": false, I see the appropriate amount.

ines · March 18, 2020, 9:54am

Okay, so that sounds like the forced stream order led to batches being sent out again, even though they were already annotated / being annotated.

So how did you annotate? Did you do it all in one go, did you ever refresh, did you use a named session via ?session? Did you just click through really fast for testing purposes?

srandoux · March 20, 2020, 2:36pm

Just pressed 'a' fast for testing purposes (no refresh, no named session).

ines · March 21, 2020, 9:48am

Okay, I think I have a theory then. The reason I was asking about this was because forcing the stream order depends on sending the hashes of the current tasks back to the server, and checking if they have already been sent back. So if you just hold down a button and annotate super fast, you may end up with a race condition where the app requests batch 1, batch 1 goes out and the app already requests batch 2, before having received batch 1. So the server receives a request for batch 2 without info about batch 1. So it thinks batch 1 isn't there (which would happen if you refresh the app), so it sends it out again. This happens infrequently, which is why you only end up with some batches duplicated.
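The theory can be illustrated with a toy model. This is only a sketch of the described behaviour, not Prodigy's actual code: ToyServer, get_questions and client_has are hypothetical names, and plain integers stand in for task hashes.

```python
from itertools import islice


class ToyServer:
    """Toy model: re-send any unanswered batch the client does not report holding."""

    def __init__(self, examples, batch_size=2):
        self.stream = iter(examples)
        self.batch_size = batch_size
        self.unanswered = []  # batches sent out but not yet answered

    def get_questions(self, client_has):
        for batch in self.unanswered:
            # The client did not mention this batch, so assume it was lost.
            if not set(batch) <= set(client_has):
                return batch
        batch = list(islice(self.stream, self.batch_size))
        if batch:
            self.unanswered.append(batch)
        return batch


def run(race):
    server = ToyServer(range(6))
    first = server.get_questions(client_has=[])
    # In the race, the second request is sent before the first response
    # arrived, so the client cannot report the hashes of batch 1.
    known = [] if race else first
    second = server.get_questions(client_has=known)
    return first, second


print(run(race=False))  # ([0, 1], [2, 3])
print(run(race=True))   # ([0, 1], [0, 1])
```

In the race case the same batch goes out twice, which matches the duplicates seen in the export.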

If this is the case, at least the good news is that this is unlikely to happen in a real-world setting where you're annotating at a more "human" speed of ~1 second or more per annotation. But we'll still investigate this and see if we can find a good solution that prevents the problem entirely.
