Hi! Are you storing anything with your individual examples that indicates the document the text belongs to (or the line number, a running index, something like that)? Ultimately, it depends on how you define "progress" and the metrics you're looking for, but let's say your examples look like this and include the number of the document and a running ID (e.g. nth sentence in the document):
{"text": "...", "document_no": 5, "id": 1234}
You could then, for instance, do something like this to find the highest ID available in your dataset for that document – this tells you how far your annotators have come already.
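from prodigy.components.db import connect

def get_progress_for_document(dataset, document_no):
    db = connect()  # connect to the database Prodigy is configured to use
    examples = db.get_dataset(dataset) or []  # all annotated examples in the dataset
    # Collect the running IDs of all examples that belong to the document
    ids = [eg["id"] for eg in examples if eg.get("document_no") == document_no]
    # max() fails on an empty list, so fall back to None if nothing is annotated yet
    progress = max(ids) if ids else None
    print(f"Progress for {dataset}, document {document_no}:", progress)
    return progress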
If you're using named multi-user sessions for your annotators, you could also include the "_session_id" here, which tells you the session (and user) that annotated the given example. There might also be other metadata that you can include or analyse, depending on what's in your data. (Pro tip: You could even put together a mini Streamlit app to visualise these stats, so you have a custom dashboard that's specific to your dataset.)
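As a rough sketch of the "_session_id" idea: the function below groups the examples of one document by session and reports the highest running ID each session has reached. It works on a plain list of example dicts, so you could pass it the result of db.get_dataset(...). The function name, the "unknown" fallback and the sample data are just assumptions for illustration, and it relies on your examples having the "document_no" and "id" keys described above.

```python
def progress_by_session(examples, document_no):
    """Return {session_id: highest "id" annotated} for one document."""
    progress = {}
    for eg in examples:
        if eg.get("document_no") != document_no:
            continue
        # "_session_id" is only present when named sessions are used,
        # so fall back to a placeholder label (assumption) otherwise
        session = eg.get("_session_id", "unknown")
        progress[session] = max(progress.get(session, eg["id"]), eg["id"])
    return progress


# Hypothetical sample data; in practice you'd use something like
# examples = connect().get_dataset("my_dataset")
sample = [
    {"text": "...", "document_no": 5, "id": 10, "_session_id": "my_dataset-alice"},
    {"text": "...", "document_no": 5, "id": 42, "_session_id": "my_dataset-alice"},
    {"text": "...", "document_no": 5, "id": 17, "_session_id": "my_dataset-bob"},
    {"text": "...", "document_no": 6, "id": 99, "_session_id": "my_dataset-bob"},
]
print(progress_by_session(sample, 5))
```

Returning a dict rather than printing makes it easy to feed the numbers into whatever you use for reporting, e.g. a chart in a small dashboard.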