Hello @ljvmiranda921, and thank you very much for your quick response!
I was following the Prodigy documentation, more specifically the suggestion "When you view or export your data later, e.g. with db-out, you can then explicitly filter out those examples and deal with them." provided here, because I was looking for a robust way to separate "ACCEPTED", "REJECTED" and "IGNORED" texts for later analysis. I exported a small annotation job (ner.manual) consisting of 5 texts, in which I purposely "IGNORED" the first, "REJECTED" the second and "ACCEPTED" the last 3. BTW, I am using the jsonlines Python library to manually open and handle those files. I tried the following script to see what I was obtaining:
import jsonlines

# 'demo_df.jsonl' is the exported file I am analyzing here:
with jsonlines.open('demo_df.jsonl') as reader:
    for obj in reader:
        print(obj.keys())
For which I obtained the following output:
dict_keys(['text', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'answer', '_timestamp'])
dict_keys(['text', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'answer', '_timestamp'])
dict_keys(['text', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'spans', 'answer', '_timestamp'])
dict_keys(['text', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'spans', 'answer', '_timestamp'])
dict_keys(['text', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'spans', 'answer', '_timestamp'])
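The only structural difference I can spot in that output is that the first two examples lack the 'spans' key. To double-check, I tried the following quick sketch over plain dicts mimicking the key sets printed above (the values are just placeholders; only the presence or absence of 'spans' matters here):

```python
# Placeholder dicts mimicking the exported examples above:
# the first two have no "spans" key, the last three do
exported = [
    {"text": "first", "answer": "?"},
    {"text": "second", "answer": "?"},
    {"text": "third", "spans": [], "answer": "?"},
    {"text": "fourth", "spans": [], "answer": "?"},
    {"text": "fifth", "spans": [], "answer": "?"},
]

has_spans = ["spans" in eg for eg in exported]
print(has_spans)  # → [False, False, True, True, True]
```

But I am not sure whether relying on the presence of 'spans' is the intended way to tell the examples apart.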
From that result, I have the following questions:
1. I was hoping to see a larger difference between the "IGNORED" and the "REJECTED" texts. Is what I am getting here OK / normal?
   1.1. If the output is indeed correct, how can I tell an "IGNORED" text apart from a "REJECTED" one?
2. If I use the Prodigy defaults, will I always get the same keys as the ones shown in my current experiment (i.e., 'text', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'spans', 'answer', '_timestamp')?
   2.1. In which cases would those keys vary?
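For reference, this is how I was planning to split the exported examples, assuming the 'answer' field takes the values "accept", "reject" and "ignore" (please correct me if that assumption is wrong). I am using hypothetical inline JSON lines here instead of the real export, with only the relevant fields shown:

```python
import json

# Hypothetical lines mimicking my db-out export (most fields omitted);
# I am assuming "answer" is one of "accept", "reject", "ignore"
lines = [
    '{"text": "first", "answer": "ignore"}',
    '{"text": "second", "answer": "reject"}',
    '{"text": "third", "answer": "accept"}',
    '{"text": "fourth", "answer": "accept"}',
    '{"text": "fifth", "answer": "accept"}',
]

examples = [json.loads(line) for line in lines]

# Partition the examples by their "answer" value
by_answer = {"accept": [], "reject": [], "ignore": []}
for eg in examples:
    by_answer[eg["answer"]].append(eg)

print({key: len(value) for key, value in by_answer.items()})
# → {'accept': 3, 'reject': 1, 'ignore': 1}
```

Is this the kind of explicit filtering the documentation is referring to?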
Thank you very much for your support!