Identify Unlabelled text

PuneethaPai · August 27, 2019, 11:59am

We are manually tagging entities in paragraphs extracted from several documents.
The JSON generated by extracting all the paragraphs has around 10,000 paras to be labeled.

We are halfway through the labeling activity and realized there was a mistake in how the paras were generated. We know how to fix it.

Problem:

We first need to identify paras which are not labeled yet.
Apply the fix to these paras.
Update the input JSON for Prodigy so it contains only these fixed, unlabeled paras.

What is the best way to identify the unlabeled paras, other than comparing the input JSON with the SQLite database dump?

Thank You

ines · August 27, 2019, 2:31pm

I think the easiest way would be to compare the input hashes. When examples come in, Prodigy assigns a hash representing the input data, e.g. the text. So you can fetch the input hashes already stored in your dataset and check each example from your input file against them. Something like this:

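from prodigy.components.db import connect
from prodigy import set_hashes

# Connect to the database using the settings from your prodigy.json
db = connect()
# Input hashes of everything already annotated in the dataset
input_hashes = db.get_input_hashes(["your_dataset"])

unlabelled = []
# YOUR_INPUT_DATA is a placeholder for the examples loaded from your input JSON
for eg in YOUR_INPUT_DATA:
    # Adds "_input_hash" and "_task_hash" to the example
    eg = set_hashes(eg)
    if eg["_input_hash"] not in input_hashes:
        unlabelled.append(eg)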
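
If it helps, here is a rough, self-contained sketch of the whole workflow (filter, fix, write out) that runs without Prodigy. It's only an illustration: text_hash is a hypothetical stand-in for the real input hash (it is not how Prodigy computes hashes), apply_fix is a placeholder for whatever your correction is, and the sample paras are made up. In your real script you'd use set_hashes and the hashes from the database as above.

```python
import hashlib
import json


def text_hash(example):
    # Hypothetical stand-in for Prodigy's input hash: hashes only the "text" field.
    return hashlib.md5(example["text"].encode("utf8")).hexdigest()


def split_unlabelled(input_examples, labelled_hashes):
    """Return the examples whose hash is not among the already labelled ones."""
    seen = set(labelled_hashes)  # set lookup keeps the comparison fast for 10,000 paras
    return [eg for eg in input_examples if text_hash(eg) not in seen]


def apply_fix(example):
    # Placeholder for the real correction; here it only strips surrounding whitespace.
    return {**example, "text": example["text"].strip()}


def to_jsonl(examples):
    # One JSON object per line, a format that can be used as an input file.
    return "\n".join(json.dumps(eg) for eg in examples)


# Made-up sample data: pretend the first paragraph was already annotated.
paras = [{"text": "First para"}, {"text": " Second para "}, {"text": "Third para"}]
labelled = [text_hash(paras[0])]

# Filter on the ORIGINAL text first, then apply the fix to what is left.
todo = [apply_fix(eg) for eg in split_unlabelled(paras, labelled)]
jsonl = to_jsonl(todo)
```

One thing to watch: the hash is based on the input text, so do the comparison on the original paras before you apply your fix. Once the text changes, the hash changes too, and the fixed paras would no longer match anything in the database.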
