End of tasks hit when many tasks left

We have exactly 548 source pages for annotation and correction.
I am a bit confused why we reach "No tasks available" when we still have them.

One observation: the number of available pages after restarting (in the last 2 iterations) is suspiciously equal to the batch_size parameter we set as a buffer before autosave.

Let me know if you need my file stream or the parameters I am starting Prodigy with, but we're sort of stuck right now.

yes | cp -rf custom.js /usr/local/lib/python3.6/site-packages/prodigy/static/ | prodigy ner.manual lineslip_ner_manual_crime_gfactor_GOLD_only en_core_web_sm lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl --label QuoteFieldLabelList_CRM_header.txt

It sounds like you might have one or two batches that were sent out but haven't come back – for instance, if you refreshed the browser or if the same session was accessed by multiple users. By default, Prodigy assumes that multiple people could potentially be accessing the same session, so it can only know whether a batch is coming back at the end of the session. When you restart the server, any examples that went missing should be queued up again.

Alternatively, if you want to ensure that batches are always sent out in the same order and that examples are re-sent if they haven't been annotated yet, you can set "force_stream_order": true. (In that scenario, you just have to make sure you don't have two people on the same session, as this can lead to duplicates.)
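For example, in your prodigy.json (just a sketch: the batch_size value here is only an illustration, keep whatever you're currently using; force_stream_order is the setting mentioned above):

{
    "batch_size": 10,
    "force_stream_order": true
}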

We only have one person using it, but we will shortly have two more annotators. I can try setting force_stream_order, but I am not convinced that's it. I have attached the script we use to stream data; perhaps it's something we do in here?

from os.path import isfile
import json
import csv


def load_text_files(data_dir):
    # metadata.csv maps each source PDF to its extracted text file plus metadata
    with open(f"{data_dir}/temp/metadata.csv", mode="r") as csv_file:
        csv_reader = csv.DictReader(csv_file)
        line_count = 0
        for row in csv_reader:
            text_file_path = f"{data_dir}/{row['file']}".replace(".pdf", ".txt")
            # note: DictReader already skips the header row, so line_count > 0
            # also skips the first data row
            if line_count > 0 and isfile(text_file_path):
                with open(text_file_path, encoding="utf-16") as f:
                    file_text = f.read()
                    # one page per form feed character
                    file_pages = file_text.split("\f")
                    page_count = len(file_pages)
                    page_num = 1
                    for file_page in file_pages:
                        # emit one Prodigy task per page as a JSON line
                        print(json.dumps({"text": file_page, "meta": {"source": f"carrierGroupName: {row['carrierGroupName']} | quoteMainID: {row['quoteMainID']} | {row['originalFileName']} | page {page_num} of {page_count}", "via": row["url"]}}))
                        page_num += 1
            line_count += 1


if __name__ == "__main__":
    data_dir = "Crime with meta"
    load_text_files(data_dir)

I don't think the loader itself is the problem – unless that's not loading all examples or your data has duplicate rows.
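If you want to rule that out, here's a minimal sketch of a sanity check on the generated JSONL (assuming you've written the stream to the lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl file from your command):

import json
from collections import Counter

# load the stream file produced by the loader script
with open("lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print(f"pages in stream: {len(examples)}")  # you'd expect 548 here

# count how many page texts appear more than once
dupes = {t: n for t, n in Counter(eg["text"] for eg in examples).items() if n > 1}
print(f"duplicate page texts: {len(dupes)}")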

Once your data is loaded, Prodigy will send it out in batches and filter out examples that are already in the dataset. A good way to check whether batches were skipped is to start up the server again and see if you're presented with unannotated examples. If so, the most likely explanation is that a batch was requested but never annotated and sent back (e.g. because you opened the app and closed it again, or something like that). In that case, setting "force_stream_order": true would help, because on each batch, Prodigy will check whether it was annotated and re-send it if it hasn't been.
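If it helps, here's a rough way to compare what's already in the dataset against the source stream, using Prodigy's database API (this assumes the default database settings and the dataset and file names from your command):

import json
from prodigy.components.db import connect

# connect to Prodigy's database (SQLite by default) and load the annotated examples
db = connect()
annotated = db.get_dataset("lineslip_ner_manual_crime_gfactor_GOLD_only")

# count the tasks in the source file for comparison
with open("lineslip_ner_manual_crime_gfactor_GOLD_only.jsonl", encoding="utf-8") as f:
    total = sum(1 for line in f if line.strip())

print(f"annotated so far: {len(annotated)} of {total}")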

Another explanation could be that you forgot to hit "save" at the end of the session – but normally, your browser should alert you if you try to close the window with unsaved work.

In general, having more annotators is no problem, even if you force the stream order. You just need to give them separate named sessions, or start up separate instances for them, so Prodigy knows who is who.
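For the named sessions, each annotator can just append their own session name to the app URL (assuming the default host and port), for example:

http://localhost:8080/?session=annotator1
http://localhost:8080/?session=annotator2

Their annotations then end up in the same dataset, but Prodigy keeps track of which session produced which example.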