End of tasks hit when many tasks left

We have exactly 548 source pages for annotation and correction.
I am a bit confused why we reach "No tasks available" when we still have them.

One observation: the number of available pages after restarting (in the last 2 iterations) is suspiciously equal to the batch_size parameter we set as a buffer before autosave.

Let me know if you need my file stream or the parameters I am starting Prodigy with, but we're sort of stuck right now.

yes | cp -rf custom.js /usr/local/lib/python3.6/site-packages/prodigy/static/ | prodigy ner.manual lineslip_ner_manual_crime_gfactor_GOLD_only en_core_web_sm lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl --label QuoteFieldLabelList_CRM_header.txt

It sounds like you might have one or two batches that were sent out but haven't come back – for instance, if you refreshed the browser or if the same session was accessed by multiple users. By default, Prodigy assumes that multiple people could potentially be accessing the same session, so it can only know whether a batch is coming back at the end of the session. When you restart the server, any examples that went missing should be queued up again.

Alternatively, if you want to ensure that batches are always sent out in the same order and that examples are re-sent if they haven't been annotated yet, you can set "force_stream_order": true. (In that scenario, you just have to make sure you don't have two people on the same session, as this can lead to duplicates.)
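For example, in your prodigy.json (just a sketch: the batch_size value here is only an illustration, keep whatever you're currently using; force_stream_order is the setting mentioned above):

{
    "batch_size": 10,
    "force_stream_order": true
}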

We only have one person using it, but we will shortly have two more annotators. I can try setting force_stream_order, but I am not convinced that's it. I have attached the script we use to stream data; perhaps it's something we do in here?

from os.path import isfile
import json
import csv


def load_text_files(data_dir):
    # metadata.csv maps each source PDF to its extracted text file plus metadata
    with open(f"{data_dir}/temp/metadata.csv", mode="r") as csv_file:
        csv_reader = csv.DictReader(csv_file)
        line_count = 0
        for row in csv_reader:
            text_file_path = f"{data_dir}/{row['file']}".replace(".pdf", ".txt")
            # note: DictReader already skips the header row, so line_count > 0
            # also skips the first data row
            if line_count > 0 and isfile(text_file_path):
                with open(text_file_path, encoding="utf-16") as f:
                    file_text = f.read()
                    # one page per form feed character
                    file_pages = file_text.split("\f")
                    page_count = len(file_pages)
                    page_num = 1
                    for file_page in file_pages:
                        # emit one Prodigy task per page as a JSON line
                        print(json.dumps({"text": file_page, "meta": {"source": f"carrierGroupName: {row['carrierGroupName']} | quoteMainID: {row['quoteMainID']} | {row['originalFileName']} | page {page_num} of {page_count}", "via": row["url"]}}))
                        page_num += 1
            line_count += 1


if __name__ == "__main__":
    data_dir = "Crime with meta"
    load_text_files(data_dir)

I don't think the loader itself is the problem – unless that's not loading all examples or your data has duplicate rows.
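If you want to rule that out, here's a minimal sketch of a sanity check on the generated JSONL (assuming you've written the stream to the lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl file from your command):

import json
from collections import Counter

# load the stream file produced by the loader script
with open("lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print(f"pages in stream: {len(examples)}")  # you'd expect 548 here

# count how many page texts appear more than once
dupes = {t: n for t, n in Counter(eg["text"] for eg in examples).items() if n > 1}
print(f"duplicate page texts: {len(dupes)}")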

Once your data is loaded, Prodigy will send it out in batches and filter out examples that are already in the dataset. A good way to check whether batches were skipped is to start up the server again and see if you're presented with unannotated examples. If so, the most likely explanation is that a batch was requested but never annotated and sent back (e.g. because you opened the app and closed it again, or something like that). In that case, setting "force_stream_order": true would help, because on each batch, Prodigy will check whether it was annotated and re-send it if it hasn't been.
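If it helps, here's a rough way to compare what's already in the dataset against the source stream, using Prodigy's database API (this assumes the default database settings and the dataset and file names from your command):

import json
from prodigy.components.db import connect

# connect to Prodigy's database (SQLite by default) and load the annotated examples
db = connect()
annotated = db.get_dataset("lineslip_ner_manual_crime_gfactor_GOLD_only")

# count the tasks in the source file for comparison
with open("lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl", encoding="utf-8") as f:
    total = sum(1 for line in f if line.strip())

print(f"annotated so far: {len(annotated)} of {total}")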

Another explanation could be that you forgot to hit "save" at the end of the session – but normally, your browser should alert you if you try to close the window with unsaved work.

In general, having more annotators is no problem, even if you force the stream order. You just need to give them separate named sessions, or start up separate instances for them, so Prodigy knows who is who.
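For the named sessions, each annotator can just append their own session name to the app URL (assuming the default host and port), for example:

http://localhost:8080/?session=annotator1
http://localhost:8080/?session=annotator2

Their annotations then end up in the same dataset, but Prodigy keeps track of which session produced which example.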