We have 196 lines starting with {"text" in the .jsonl which we give to Prodigy – after 185, Prodigy says "No Tasks Available". Any hint how to find the missing 11?
Which recipe are you using? If you're running an active learning powered recipe like ner.teach, for instance, Prodigy will create various possible analyses for each example and only show you the most relevant ones for annotation. This means it'll skip the examples with very high or very low prediction scores and focus on the most informative ones. This is nice if you have lots of data and want the best possible annotations, but not so helpful if you want to label every example.
If you're running a manual annotation recipe like ner.manual, the examples should be streamed in order, as they appear in your original data. If you end up with fewer examples, the most likely explanation is that your data contains duplicates, which are filtered out by default. Another problem could be invalid JSON – like an unescaped quote character somewhere down the line. Invalid lines are also skipped by default.
You might also want to check whether the dataset you're saving the annotations to already contains examples from the same data. By default, Prodigy will exclude examples that have already been annotated in the current dataset.
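For example, a quick way to see what's already in the dataset is Prodigy's database API (a sketch, assuming the default database configuration and the dataset name from your command):

from prodigy.components.db import connect

# Connect using the database settings from your prodigy.json (SQLite by default)
db = connect()
# "trialdb" is the dataset name from the command in this thread
examples = db.get_dataset("trialdb")
print(len(examples), "examples already annotated in this dataset")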
Finally, you can also set PRODIGY_LOGGING=verbose, which will log everything that's going on and print all examples that pass through the application. You could also edit the recipe code and call list around the stream (which will evaluate the generator and give you a list of all examples that will be sent out), and then compare that to your original file.
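If you'd rather not edit the recipe, a similar check works as a standalone script using Prodigy's JSONL loader (a sketch; assumes the loader location in Prodigy v1.x and the file name from this thread):

from prodigy.components.loaders import JSONL

# list() evaluates the lazy stream, so we can count what would be sent out
stream = list(JSONL("inputdata.jsonl"))
print(len(stream), "examples in the stream")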
Thanks a lot – that helps. The .jsonl had duplicate data.
I'm using the ner.manual recipe for annotation, and 20 out of 58 rows of data from the input .jsonl file are missing in the output .jsonl file. A few are duplicates and a few may be due to invalid characters, but at least 10 rows without any apparent issue are missing from the output file.
I'm using the following command:
python -m prodigy ner.manual trialdb blanknermodel inputdata.jsonl
I'm creating a blank model "blanknermodel" with the code below. Please let me know if there is an issue in this code that could be causing the problem.
def buildspacymodel(TRAIN_DATA, savemodelpath):
    import random

    import spacy
    from spacy.util import minibatch, compounding

    model = None
    n_iter = 5
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
    return nlp.to_disk(savemodelpath)
This function takes TRAIN_DATA and a folder path as input (a sample call is sketched after the data below).
The TRAIN_DATA will be in the following format:
TRAIN_DATA = [
(
"Horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, LABEL)]},
),
("Do they bite?", {"entities": []}),
(
"horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, LABEL)]},
),
("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
(
"they pretend to care about your feelings, those horses",
{"entities": [(48, 54, LABEL)]},
),
("horses?", {"entities": [(0, 6, LABEL)]}),
]
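For reference, a call to the function above might look like this (a sketch; LABEL's value is hypothetical and must be defined before TRAIN_DATA, and the output path matches the command earlier):

LABEL = "ANIMAL"  # hypothetical label name; substitute your own
buildspacymodel(TRAIN_DATA, "blanknermodel")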
Also, is there a way to validate the .jsonl file for invalid lines or issues caused by unescaped quotes, using code or some other method?
The training part shouldn't be relevant here – ner.manual will just use the model for tokenization, so whether you've updated it with examples or not won't impact the annotations you see.
What's relevant is whether the data includes duplicates or invalid entries and whether annotations on the same data are already in the dataset.
If you're trying to find actual invalid JSON, you could use any JSON linter – or a Python script that calls json.loads on each line. If that fails, the line contains invalid JSON.
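For instance, a minimal script along these lines would report both invalid lines and duplicate texts (a sketch; assumes the file name from this thread and that each task has a "text" field):

import json

seen_texts = set()
with open("inputdata.jsonl", encoding="utf8") as f:
    for line_no, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            task = json.loads(line)
        except json.JSONDecodeError as err:
            print(f"Line {line_no}: invalid JSON ({err})")
            continue
        text = task.get("text")
        if text in seen_texts:
            print(f"Line {line_no}: duplicate text {text[:50]!r}")
        seen_texts.add(text)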