error while loading pre-annotated jsonl file

I am trying to load a pre-annotated .jsonl file (label is already present in the data I am loading) using the following command:

prodigy textcat.manual dataset_name /path_to_file/filename.jsonl --label /path_to_labels/labels.txt

My file looks like this:

{"0":"id","1":"deployment","2":"date","3":"text","4":"label","5":"score"}
{"0":"1234","1":"xxxx","2":"2021-08-24","3":"random text","4":"label-name","5":"0.5"}

I am getting the following error:

✘ Error while validating stream: no first example
This likely means that your stream is empty.This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.

Any thoughts as to why I cannot load this dataset?

Thanks!
Cheyanne

Hi! The problem is that the data you're loading doesn't follow the expected format and doesn't include a "text" key, so Prodigy can't know which information to display or which text you want to annotate. (In your case, that seems to be "3"? Or is it "1"? I can't tell for certain which field is the text and which are the labels.)

You can find an example of the expected format for the multiple choice interface here: https://prodi.gy/docs/api-interfaces#choice

To pre-select labels, add them to the "accept" list of each example, e.g. "accept": ["label1", "label2"].
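
As a rough sketch of that conversion, the snippet below renames the numeric keys and moves the existing label into "accept". It assumes the first row of the file is a header that maps column index to column name (that is how the sample above looks), and `convert_rows` is a hypothetical helper name, not part of Prodigy.

```python
import json

def convert_rows(lines):
    """Turn rows keyed by column index into tasks with a "text" key."""
    rows = [json.loads(line) for line in lines if line.strip()]
    # Assumption: the first row maps column index -> column name.
    header, records = rows[0], rows[1:]
    tasks = []
    for record in records:
        # Rename "0", "1", ... to "id", "deployment", ...
        task = {header[idx]: value for idx, value in record.items()}
        # Move the pre-annotated label into the "accept" list.
        task["accept"] = [task.pop("label")]
        tasks.append(task)
    return tasks

raw = [
    '{"0":"id","1":"deployment","2":"date","3":"text","4":"label","5":"score"}',
    '{"0":"1234","1":"xxxx","2":"2021-08-24","3":"random text","4":"label-name","5":"0.5"}',
]
for task in convert_rows(raw):
    print(json.dumps(task))
```

Each printed line is one JSONL record with a "text" key and an "accept" list; the other columns are kept as extra keys.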


Ok! I edited the file, and that seemed to work:

{"id":"1234","deployment":"xxxx","date":"2021-08-24","text":"random text","label":"label-name","score":"0.5"}

One more question:
I am using a checklist for all of the labels, and I see the pre-annotated label shows up above the text. Will this be logged in the db if someone changes the label?

Glad it worked!

Based on your example, wouldn't you want the label-name to be added as a selected option? Or is this entirely separate from the choice options? If you want to pre-select a choice option, you can pre-populate the "accept": [], e.g. "accept": ["test-label1"] will pre-select the first option.

And yes, if you change the label in the UI, the "accept": [] list will be updated with the currently selected labels. If you want to preserve the original answers (e.g. to later compare how often the pre-annotated data was changed), you could just add it to the JSON under an arbitrary key that's then preserved in the data. For example, "orig_accept": ["label1", "label2"]. If the user then edits the labels, the data saved in the database would then have both.
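
A minimal sketch of that idea: copy the pre-annotated labels into "orig_accept" before annotation, then compare the two lists after export. The helper names `add_orig_accept` and `was_changed` are made up for illustration, and "orig_accept" is just the arbitrary key from the example above.

```python
def add_orig_accept(task):
    """Copy the pre-selected labels under a second key before annotation."""
    out = dict(task)
    out["orig_accept"] = list(task.get("accept", []))
    return out

def was_changed(annotated):
    """True if the annotator's final selection differs from the original."""
    return set(annotated.get("accept", [])) != set(annotated.get("orig_accept", []))

task = add_orig_accept({"text": "random text", "accept": ["label1"]})
# Simulate an annotator replacing label1 with label2 in the UI.
edited = dict(task, accept=["label2"])

print(was_changed(task))    # False
print(was_changed(edited))  # True
```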


Can I refer to a labels file, or does each label need to be written out in a list for "accept": [] and "orig_accept": []? I have a long list of labels. I would like the label to be pre-selected, and the annotator can choose to unselect it and choose another, or keep it selected. So in my .jsonl file, my "label" (the pre-annotated portion) would become "orig_accept"? I tested out a few .jsonl formats, and found that having the pre-annotated label as both "accept": [] and "orig_accept": [] in the .jsonl file (replacing "label":"label1") resulted in the pre-annotated label being pre-checked in the UI.

{"id":"1234","deployment":"xxxx","date":"2021-08-24","text":"random text","orig_accept":["label1"],"score":"0.5","accept":["label1"]}

And I see you can also have both: the pre-annotated label as a banner on top in the UI, and the same label pre-selected (checkbox is marked) in the list:

{"id":"1234","deployment":"xxxx","date":"2021-08-24","text":"random text","orig_accept":["label1"],"score":"0.5","accept":["label1"],"label":"label1"}

The labels here should be a list of the actual label values that should be pre-selected in the UI. (Pointing to a file here wouldn't really make sense because you do want the labels to be present explicitly in the data so you can export the data later on and know the exact label values.)
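
Since the values have to be written out in every example, a quick check before loading can catch typos. The sketch below assumes the labels file holds one label per line and has already been read into a list; `find_unknown_labels` is a hypothetical helper, not a Prodigy function.

```python
def find_unknown_labels(tasks, labels):
    """Return (task index, unknown labels) for "accept" values not in labels."""
    allowed = set(labels)
    problems = []
    for i, task in enumerate(tasks):
        unknown = [lab for lab in task.get("accept", []) if lab not in allowed]
        if unknown:
            problems.append((i, unknown))
    return problems

# Assumption: these came from a labels file with one label per line.
labels = ["label1", "label2", "label3"]
tasks = [
    {"text": "random text", "accept": ["label1"]},
    {"text": "more text", "accept": ["label-1"]},  # typo, not a known label
]
print(find_unknown_labels(tasks, labels))  # [(1, ['label-1'])]
```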

I'm trying to load pre-labeled spans. This does not work -- is the data formatted correctly?

{"text":"chicken chicken chicken","accept":"[{'text': 'chicken', 'start': 0, 'end': 7, 'id': 0, 'ws': True}, {'text': 'chicken', 'start': 8, 'end': 15, 'id': 1, 'ws': True}, {'text': 'chicken', 'start': 16, 'end': 23, 'id': 2, 'ws': False}]"}

I also tried the following:

{"text":"chicken chicken chicken","accept":"[{'text': 'chicken', 'start': 0, 'end': 7, 'id': 0, 'ws': True, 'label': 'STT_ERROR'}, {'text': 'chicken', 'start': 8, 'end': 15, 'id': 1, 'ws': True, 'label': 'STT_ERROR'}, {'text': 'chicken', 'start': 16, 'end': 23, 'id': 2, 'ws': False, 'label': 'STT_ERROR'}]"}

hi @cheyanneb!

For pre-annotated spans, you'd need data more like this (this is the output from spans.manual):

{
  "text": "Multivariate analysis revealed that septic shock and bacteremia originating from lower respiratory tract infection were two independent risk factors for 30-day mortality.",
  "tokens":  [
    {"text": "Multivariate", "start": 0, "end": 12, "id": 0, "ws": true},
    {"text": "analysis", "start": 13, "end": 21, "id": 1, "ws": true},
    {"text": "revealed", "start": 22, "end": 30, "id": 2, "ws": true},
    {"text": "that", "start": 31, "end": 35, "id": 3, "ws": true},
    {"text": "septic", "start": 36, "end": 42, "id": 4, "ws": true},
    {"text": "shock", "start": 43, "end": 48, "id": 5, "ws": true},
    {"text": "and", "start": 49, "end": 52, "id": 6, "ws": true},
    {"text": "bacteremia", "start": 53, "end": 63, "id": 7, "ws": true},
    {"text": "originating", "start": 64, "end": 75, "id": 8, "ws": true},
    {"text": "from", "start": 76, "end": 80, "id": 9, "ws": true},
    {"text": "lower", "start": 81, "end": 86, "id": 10, "ws": true},
    {"text": "respiratory", "start": 87, "end": 98, "id": 11, "ws": true},
    {"text": "tract", "start": 99, "end": 104, "id": 12, "ws": true},
    {"text": "infection", "start": 105, "end": 114, "id": 13, "ws": true},
    {"text": "were", "start": 115, "end": 119, "id": 14, "ws": true},
    {"text": "two", "start": 120, "end": 123, "id": 15, "ws": true},
    {"text": "independent", "start": 124, "end": 135, "id": 16, "ws": true},
    {"text": "risk", "start": 136, "end": 140, "id": 17, "ws": true},
    {"text": "factors", "start": 141, "end": 148, "id": 18, "ws": true},
    {"text": "for", "start": 149, "end": 152, "id": 19, "ws": true},
    {"text": "30", "start": 153, "end": 155, "id": 20, "ws": false},
    {"text": "-", "start": 155, "end": 156, "id": 21, "ws": false},
    {"text": "day", "start": 156, "end": 159, "id": 22, "ws": true},
    {"text": "mortality", "start": 160, "end": 169, "id": 23, "ws": false},
    {"text": ".", "start": 169, "end": 170, "id": 24, "ws": false}
  ],
  "spans": [
    {"start": 0, "end": 21, "token_start": 0, "token_end": 1, "label": "METHOD"},
    {"start": 36, "end": 48, "token_start": 4, "token_end": 5, "label": "FACTOR"},
    {"start": 36, "end": 48, "token_start": 4, "token_end": 5, "label": "CONDITION"},
    {"start": 53, "end": 114, "token_start": 7, "token_end": 13, "label": "FACTOR"},
    {"start": 53, "end": 63, "token_start": 7, "token_end": 7, "label": "CONDITION"},
    {"start": 81, "end": 114, "token_start": 10, "token_end": 13, "label": "CONDITION"},
    {"start": 153, "end": 169, "token_start": 20, "token_end": 23, "label": "EFFECT"}
  ]
}

Your data doesn't look like it has "spans", only token entries that were put under the "accept" key. Let me know if you have questions.
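
If it helps, here is a rough sketch of reshaping a record like yours into that layout. It assumes the "accept" value is a string holding a Python-style list (as in your second example) and that every labelled token should become its own single-token span; `to_spans_task` is a hypothetical helper name.

```python
import ast
import json

def to_spans_task(record):
    """Split a stringified, labelled token list into "tokens" and "spans"."""
    # The string uses Python literals (single quotes, True/False),
    # so ast.literal_eval is used here rather than json.loads.
    raw_tokens = ast.literal_eval(record["accept"])
    tokens, spans = [], []
    for tok in raw_tokens:
        label = tok.get("label")
        tokens.append({k: v for k, v in tok.items() if k != "label"})
        if label is not None:
            # Assumption: one span per labelled token, no merging.
            spans.append({
                "start": tok["start"],
                "end": tok["end"],
                "token_start": tok["id"],
                "token_end": tok["id"],
                "label": label,
            })
    return {"text": record["text"], "tokens": tokens, "spans": spans}

record = {
    "text": "chicken chicken chicken",
    "accept": "[{'text': 'chicken', 'start': 0, 'end': 7, 'id': 0, 'ws': True, 'label': 'STT_ERROR'}, "
              "{'text': 'chicken', 'start': 8, 'end': 15, 'id': 1, 'ws': True, 'label': 'STT_ERROR'}, "
              "{'text': 'chicken', 'start': 16, 'end': 23, 'id': 2, 'ws': False, 'label': 'STT_ERROR'}]",
}
print(json.dumps(to_spans_task(record)))
```

Writing the result with json.dumps also turns Python's True/False into JSON's true/false, which the .jsonl file needs.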


Can you confirm what ws signifies? Whitespace?

Yes. Technically, it indicates whether the token is followed by whitespace or not.
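
To illustrate, the text can be rebuilt from the tokens by appending a space wherever "ws" is true. This is only a sketch and assumes the whitespace is always a single space; `rebuild_text` is a made-up helper name. The tokens are the last five from the example above.

```python
def rebuild_text(tokens):
    """Join token texts, adding a space after each token whose "ws" is true."""
    return "".join(tok["text"] + (" " if tok["ws"] else "") for tok in tokens)

tokens = [
    {"text": "30", "start": 153, "end": 155, "id": 20, "ws": False},
    {"text": "-", "start": 155, "end": 156, "id": 21, "ws": False},
    {"text": "day", "start": 156, "end": 159, "id": 22, "ws": True},
    {"text": "mortality", "start": 160, "end": 169, "id": 23, "ws": False},
    {"text": ".", "start": 169, "end": 170, "id": 24, "ws": False},
]
print(rebuild_text(tokens))  # 30-day mortality.
```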
