Identify Unlabelled text

We are manually tagging entities in paragraphs extracted from several documents.
The JSON generated by extracting the paragraphs contains around 10,000 paragraphs to be labeled.

We are halfway through the labeling and have realized there was a mistake in how the paragraphs were generated. We know how to fix it.


  1. First, identify the paragraphs that have not been labeled yet.
  2. Apply the fix to these paragraphs.
  3. Update the input JSON for Prodigy so that it contains only these fixed, unlabeled paragraphs.

What is the best way to identify the unlabeled paragraphs, other than comparing the input JSON with a dump of the SQLite database?

Thank You

I think the easiest way would be to compare the input hashes. When examples come in, Prodigy assigns each one a hash representing the input data, e.g. the text. So you can fetch just the input hashes stored in the dataset and check each example in your input data against them. Something like this:

from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()
input_hashes = db.get_input_hashes("your_dataset")

unlabelled = []
for eg in YOUR_INPUT_DATA:
    eg = set_hashes(eg)
    if eg["_input_hash"] not in input_hashes: