Missing data

We have 196 lines starting with {"text" in the JSONL that we give to Prodigy – after 185, Prodigy says "No Tasks Available". Any hint on how to find the missing 11?

Which recipe are you using? If you're running an active learning powered recipe like ner.teach, for instance, Prodigy will create various possible analyses for each example and only show you the most relevant ones for annotation. This means it'll skip examples with very high or very low prediction scores and focus on the uncertain ones. This is nice if you have lots of data and want the best possible annotations, but not so helpful if you want to label every example.
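To illustrate the idea – this is just a conceptual sketch, not Prodigy's actual implementation – an active learning stream drops examples the model is already confident about, so fewer tasks come out than went in:

def keep_uncertain(scored_stream, low=0.35, high=0.65):
    # scored_stream is assumed to yield (score, example) tuples; only
    # examples with scores in the uncertain band are passed through
    for score, example in scored_stream:
        if low <= score <= high:
            yield example

scored = [(0.05, {"text": "a"}), (0.5, {"text": "b"}), (0.95, {"text": "c"})]
print(list(keep_uncertain(scored)))  # only {'text': 'b'} is kept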

If you're running a manual annotation recipe like ner.manual, the examples should be streamed in order, exactly as they appear in your original data. If you end up with fewer examples, the most likely explanation is that your data contains duplicates, which are filtered out by default. Another problem could be invalid JSON – like an unescaped " character somewhere down the line. Invalid lines are also skipped by default.
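If you want to check for duplicates yourself, a few lines of Python will do – the file name here is just a placeholder:

import json
from collections import Counter

counts = Counter()
with open("your_data.jsonl", encoding="utf8") as f:
    for line in f:
        if line.strip():
            # json.loads will raise on invalid lines, which also flags them
            counts[json.loads(line)["text"]] += 1

for text, n in counts.most_common():
    if n > 1:
        print(n, "x", text[:60])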

You might also want to check that the dataset you're saving the annotations to doesn't already contain annotations on the same examples. By default, Prodigy will exclude examples that have already been annotated in the current dataset.

Finally, you can also set PRODIGY_LOGGING=verbose, which will log everything that’s going on and will print all examples that pass through the application. You could also edit the recipe code and call list around the stream (which will evaluate the generator and give you a list of all examples that will be sent out), and then compare that to your original file.
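For instance, a sketch like this – assuming the connect and get_dataset helpers from Prodigy's database API, with placeholder dataset and file names – would show you which input texts never made it into your dataset:

import json
from prodigy.components.db import connect

db = connect()  # connect using the database settings from your prodigy.json
annotated = {eg["text"] for eg in db.get_dataset("your_dataset")}

with open("your_data.jsonl", encoding="utf8") as f:
    input_texts = [json.loads(line)["text"] for line in f if line.strip()]

missing = [text for text in input_texts if text not in annotated]
print(len(missing), "input texts are not in the dataset")
for text in missing:
    print(text[:80])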

Thanks a lot – that helps – the JSONL had duplicate data.


I'm using the ner.manual recipe for annotation, and 20 out of 58 rows of data from the input .jsonl file are missing from the output .jsonl file. A few are duplicates and a few may be due to invalid characters, but at least 10 rows of data without any issues are missing from the output file.

I'm using the following command:

python -m prodigy ner.manual trialdb blanknermodel inputdata.jsonl

I'm creating a blank model "blanknermodel" using the code below. Please let me know if there's an issue in this code that could be causing the problem.

def buildspacymodel(TRAIN_DATA, savemodelpath):
    import random

    import spacy
    from spacy.util import minibatch, compounding

    n_iter = 5
    nlp = spacy.blank("en")  # create a blank Language class
    print("Created blank 'en' model")

    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)

    # register all entity labels that occur in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly for the new model
        nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # sanity-check the trained model on the training data
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    nlp.to_disk(savemodelpath)  # to_disk() returns None, so save first ...
    return nlp                  # ... and return the model itself

This function takes TRAIN_DATA and a folder path as input.

The TRAIN_DATA will be in the following format, where LABEL is a placeholder for the entity label string (see the usage sketch after the example):

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]
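Note that LABEL needs to be defined before TRAIN_DATA is built. A minimal usage sketch, with a hypothetical label, might look like this:

LABEL = "ANIMAL"  # hypothetical label - define it before building TRAIN_DATA

nlp = buildspacymodel(TRAIN_DATA, "blanknermodel")

# the saved folder can then be used with Prodigy:
# python -m prodigy ner.manual trialdb blanknermodel inputdata.jsonl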

Also, is there a way to validate the .jsonl file for invalid lines or issues caused by unescaped " characters, using any code or method?

The training part shouldn't be relevant here – ner.manual will just use the model for tokenization, so whether you've updated it with examples or not won't impact the annotations you see.
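So for ner.manual, an untrained blank model is all you need – something like this (spaCy v2 style, to match your code) is enough:

import spacy

nlp = spacy.blank("en")       # blank pipeline: tokenizer only, nothing trained
nlp.to_disk("blanknermodel")  # Prodigy only needs this for tokenization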

What's relevant is whether the data includes duplicates or invalid entries and whether annotations on the same data are already in the dataset.

If you're trying to find actual invalid JSON, you could use any JSON linter – or a Python script that calls json.loads on each line. If that fails, the line contains invalid JSON.
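For example – the file name here is a placeholder:

import json

with open("your_data.jsonl", encoding="utf8") as f:
    for number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip empty lines
        try:
            json.loads(line)
        except ValueError as error:
            print("Invalid JSON on line", number, "-", error)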