Thanks for your post and welcome to the Prodigy community!
I think the problem is that you uploaded your data with `db-in` into `chall_transcripts2`, then tried to put your annotations into the same dataset, `chall_transcripts2`. But the `db-in` step wasn't necessary.
Said differently: first clean out the Prodigy dataset that holds your annotations, then rerun your same command:

```
prodigy drop chall_transcripts2
prodigy spans.manual chall_transcripts2 blank:en ../../../data/prodigy_jsonl/transcripts2.jsonl --label M,U,R,UNK,...
```
What's happening is that you're hitting Prodigy's dedupe: you're trying to annotate examples from your source file (`transcripts2.jsonl`) that are already in your `chall_transcripts2` dataset. You would likely see this in your logs.
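If it helps, here's a minimal sketch of why the duplicates get skipped, assuming Prodigy's default hashing and exclude-by-task behavior (the example text is made up):

```
# Minimal sketch: identical tasks produce identical hashes,
# so Prodigy's default dedupe skips the incoming copy.
from prodigy import set_hashes

eg_in_dataset = set_hashes({"text": "hello world"})
eg_from_source = set_hashes({"text": "hello world"})

# Same input -> same _input_hash and _task_hash, so when the stream
# is filtered against the hashes already stored in chall_transcripts2,
# the source example is excluded.
assert eg_from_source["_task_hash"] == eg_in_dataset["_task_hash"]
```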
Also, why did you include `--use-annotations` in your `spans.manual` command? That's not a default argument, so it won't do anything. If you pass a `.jsonl` with spans, it'll automatically use those annotations.
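For example, a pre-annotated line in your source `.jsonl` could look something like this (the text, offsets, and label here are made up for illustration):

```
{"text": "Example utterance", "spans": [{"start": 0, "end": 7, "label": "M"}]}
```

`spans.manual` will render that span pre-highlighted so you can correct it.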
I can understand that this is a bit confusing. I posted on this previously.
You can interpret this as:
So while it prints the message `Found and keeping existing "answer" in 0 examples`, it's saying that it kept your original `"answer"` tags for 0 examples, because it replaced them for you.
The recipe keeps a count of the records whose `"answer"` was replaced:
```
added_answers = 0
for task in data:
    task = set_hashes(task, overwrite=rehash)
    if "answer" not in task or overwrite:
        task["answer"] = answer
        added_answers += 1
    examples.append(task)
```
Then printing out at the end:
```
n_total = len(examples)
msg.good(
    f"Imported {n_total} annotated examples and saved them to '{set_id}' "
    f"(session {session_id}) in database {DB.db_name}",
    f'Found and keeping existing "answer" in {n_total - added_answers} examples',
)
```
So that number is `n_total - added_answers`, i.e. the total number of records minus those that had an answer added. Since your data didn't have any answers, `added_answers` was incremented for every record (so it equals `n_total`), hence why it showed `0`.
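Plugging in your numbers:

```
n_total = 1838           # total examples imported
added_answers = 1838     # none of them had an "answer", so all got one
n_total - added_answers  # = 0, the number shown in the message
```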
What's important is that one line above in the output, you should see that the same number of annotations (`1838`) were still loaded into the database and automatically populated as `"accept"`.
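Reconstructed from the f-strings above with your numbers (session id and database name elided, since I can't see them):

```
✔ Imported 1838 annotated examples and saved them to 'chall_transcripts2'
(session ...) in database ...
Found and keeping existing "answer" in 0 examples
```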
Fyi, if you want to view the recipe: run `prodigy stats`, look for `Location`, then open `recipes/commands.py` under that path and find the `db-in` recipe. You can do the same for the other built-in recipes.
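For example (the install path below is just a placeholder for whatever `Location` shows on your machine):

```
prodigy stats
# Location   /path/to/site-packages/prodigy
# -> open /path/to/site-packages/prodigy/recipes/commands.py
#    and search for "db-in"
```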
But again, you really don't need this `db-in` step if you're trying to correct these labels, as you can simply load them from your source file directly.
One other tip: we typically wouldn't recommend having more than 5-7 labels at a time -- the cognitive load of switching between them can be a bit challenging. However, if you need more than 7 labels, I recommend creating a `labels.txt` with the names of your labels on new lines:
```
# labels.txt
M
U
R
UNK
ADJ
...
```
Then pass that `labels.txt` instead of the raw label names in your Prodigy command, like `--label labels.txt`. That'll minimize the risk of fat-finger errors in your labels.
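So your earlier command would become something like:

```
prodigy spans.manual chall_transcripts2 blank:en ../../../data/prodigy_jsonl/transcripts2.jsonl --label labels.txt
```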