Missing data

We have 196 lines starting with {"text" in the JSONL that we give to Prodigy – after 185, Prodigy says "No Tasks Available". Any hint on how to find the missing 11?

Which recipe are you using? If you're running an active learning powered recipe like ner.teach, for instance, Prodigy will create various possible analyses for each example and only show you the most relevant ones for annotation. This means it'll skip examples with very high or very low prediction scores and focus on the uncertain ones. This is nice if you have lots of data and want the best possible annotations, but not so helpful if you want to label every example.
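To illustrate the idea – this is just a conceptual sketch, not Prodigy's actual implementation – an active learning stream drops examples the model is already confident about, so fewer tasks come out than went in:

def keep_uncertain(scored_stream, low=0.35, high=0.65):
    # scored_stream is assumed to yield (score, example) tuples; only
    # examples with scores in the uncertain band are passed through
    for score, example in scored_stream:
        if low <= score <= high:
            yield example

scored = [(0.05, {"text": "a"}), (0.5, {"text": "b"}), (0.95, {"text": "c"})]
print(list(keep_uncertain(scored)))  # only {'text': 'b'} is kept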

If you're running a manual annotation recipe like ner.manual, the examples should be streamed in order, exactly as they appear in your original data. If you end up with fewer examples, the most likely explanation is that your data contains duplicates, which are filtered out by default. Another problem could be invalid JSON – like an unescaped " character somewhere down the line. Invalid lines are also skipped by default.
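If you want to check for duplicates yourself, a few lines of Python will do – the file name here is just a placeholder:

import json
from collections import Counter

counts = Counter()
with open("your_data.jsonl", encoding="utf8") as f:
    for line in f:
        if line.strip():
            # json.loads will raise on invalid lines, which also flags them
            counts[json.loads(line)["text"]] += 1

for text, n in counts.most_common():
    if n > 1:
        print(n, "x", text[:60])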

You might also want to check that the dataset you're saving the annotations to doesn't already contain annotations on the same examples. By default, Prodigy will exclude examples that have already been annotated in the current dataset.

Finally, you can also set PRODIGY_LOGGING=verbose, which will log everything that’s going on and will print all examples that pass through the application. You could also edit the recipe code and call list around the stream (which will evaluate the generator and give you a list of all examples that will be sent out), and then compare that to your original file.
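For instance, a sketch like this – assuming the connect and get_dataset helpers from Prodigy's database API, with placeholder dataset and file names – would show you which input texts never made it into your dataset:

import json
from prodigy.components.db import connect

db = connect()  # connect using the database settings from your prodigy.json
annotated = {eg["text"] for eg in db.get_dataset("your_dataset")}

with open("your_data.jsonl", encoding="utf8") as f:
    input_texts = [json.loads(line)["text"] for line in f if line.strip()]

missing = [text for text in input_texts if text not in annotated]
print(len(missing), "input texts are not in the dataset")
for text in missing:
    print(text[:80])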

Thanks a lot – that helps – the JSONL had duplicate data.


I'm using the ner.manual recipe for annotation, and 20 out of 58 rows of data from the input .jsonl file are missing from the output .jsonl file. A few are duplicates and a few may be due to invalid characters, but at least 10 rows of data without any issues are missing from the output file.

I'm using the following command:

python -m prodigy ner.manual trialdb blanknermodel inputdata.jsonl

I'm creating a blank model "blanknermodel" using the code below. Please let me know if there's an issue in this code that could be causing the problem.

def buildspacymodel(TRAIN_DATA, savemodelpath):
    import random

    import spacy
    from spacy.util import minibatch, compounding

    n_iter = 5
    nlp = spacy.blank("en")  # create a blank Language class
    print("Created blank 'en' model")

    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)

    # register all entity labels that occur in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly for the new model
        nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # sanity-check the trained model on the training data
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    nlp.to_disk(savemodelpath)  # to_disk() returns None, so save first ...
    return nlp                  # ... and return the model itself

This function takes TRAIN_DATA and a folder path as input.

The TRAIN_DATA will be in the following format, where LABEL is a placeholder for the entity label string (see the usage sketch after the example):

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]
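Note that LABEL needs to be defined before TRAIN_DATA is built. A minimal usage sketch, with a hypothetical label, might look like this:

LABEL = "ANIMAL"  # hypothetical label - define it before building TRAIN_DATA

nlp = buildspacymodel(TRAIN_DATA, "blanknermodel")

# the saved folder can then be used with Prodigy:
# python -m prodigy ner.manual trialdb blanknermodel inputdata.jsonl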

Also, is there a way to validate the .jsonl file for invalid lines or issues caused by unescaped " characters, using any code or method?

The training part shouldn't be relevant here – ner.manual will just use the model for tokenization, so whether you've updated it with examples or not won't impact the annotations you see.
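So for ner.manual, an untrained blank model is all you need – something like this (spaCy v2 style, to match your code) is enough:

import spacy

nlp = spacy.blank("en")       # blank pipeline: tokenizer only, nothing trained
nlp.to_disk("blanknermodel")  # Prodigy only needs this for tokenization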

What's relevant is whether the data includes duplicates or invalid entries and whether annotations on the same data are already in the dataset.

If you're trying to find actual invalid JSON, you could use any JSON linter – or a Python script that calls json.loads on each line. If that fails, the line contains invalid JSON.
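For example – the file name here is a placeholder:

import json

with open("your_data.jsonl", encoding="utf8") as f:
    for number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip empty lines
        try:
            json.loads(line)
        except ValueError as error:
            print("Invalid JSON on line", number, "-", error)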