error while loading pre-annotated jsonl file

cheyanneb · October 5, 2021, 1:04am

I am trying to load a pre-annotated .jsonl file (label is already present in the data I am loading) using the following command:

prodigy textcat.manual dataset_name /path_to_file/filename.jsonl --label /path_to_labels/labels.txt

My file looks like this:

{"0":"id","1":"deployment","2":"date","3":"text","4":"label","5":"score"}
{"0":"1234","1":"xxxx","2":"2021-08-24","3":"random text","4":"label-name","5":"0.5"}

I am getting the following error:

✘ Error while validating stream: no first example
This likely means that your stream is empty.This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.

Any thoughts as to why I cannot load this dataset?

Thanks!
Cheyanne

ines · October 5, 2021, 2:45pm

Hi! The problem is that the data you're loading doesn't follow the expected format and doesn't include a "text" key, so Prodigy can't know which information to display or which text you want to annotate. (In your case, that seems to be "3"? Or is it "1"? I can't tell for certain which field is the text and which are the labels.)

You can find an example of the expected format for the multiple choice interface here: https://prodi.gy/docs/api-interfaces#choice

To pre-select labels, you can add them to the list of "accept": [], e.g. "accept": ["label1", "label2"].
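For readers with a file like the one in the question, where every row is keyed by column index and the first row holds the column names, a small conversion script is one way to produce rows with a real "text" key. This is a minimal sketch under that assumption; `remap_rows` is a hypothetical helper name and not part of Prodigy.

```python
import json


def remap_rows(lines):
    """Turn rows keyed by column index into rows keyed by column name.

    Assumption: the first non-empty line is a header row that maps each
    index ("0", "1", ...) to a column name, as in the file shown above.
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    # Replace each index key with the column name from the header row.
    return [{header[idx]: value for idx, value in record.items()} for record in records]


raw = [
    '{"0":"id","1":"deployment","2":"date","3":"text","4":"label","5":"score"}',
    '{"0":"1234","1":"xxxx","2":"2021-08-24","3":"random text","4":"label-name","5":"0.5"}',
]

tasks = remap_rows(raw)
for task in tasks:
    # One JSON object per line, ready to be written to a .jsonl file.
    print(json.dumps(task))
```

Each resulting row then carries the text under "text", which is the key the error above was complaining about.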

cheyanneb · October 5, 2021, 3:12pm

Ok! I edited the file, and that seemed to work:

{"id":"1234","deployment":"xxxx","date":"2021-08-24","text":"random text","label":"label-name","score":"0.5"}

One more question:
I am using a checklist for all of the labels, and I see the pre-annotated label shows up above the text. Will this be logged in the db if someone changes the label?

ines · October 6, 2021, 10:30am

Glad it worked!

Based on your example, wouldn't you want the label-name to be added as a selected option? Or is this entirely separate from the choice options? If you want to pre-select a choice option, you can pre-populate the "accept": [], e.g. "accept": ["test-label1"] will pre-select the first option.

And yes, if you change the label in the UI, the "accept": [] list will be updated with the currently selected labels. If you want to preserve the original answers (e.g. to later compare how often the pre-annotated data was changed), you could just add it to the JSON under an arbitrary key that's then preserved in the data. For example, "orig_accept": ["label1", "label2"]. If the user then edits the labels, the data saved in the database would then have both.
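The idea of keeping the original answer under a separate key can be sketched as follows. This is only an illustration, assuming each input row stores its single pre-annotated label under "label"; `preselect_label` and `was_changed` are hypothetical helper names, and "orig_accept" is the arbitrary key suggested above.

```python
import json


def preselect_label(task):
    """Copy the pre-annotated label into "accept" and "orig_accept".

    Assumption: the row has exactly one pre-annotated label under "label".
    """
    task = dict(task)  # work on a copy so the input row is left untouched
    label = task.pop("label")
    task["accept"] = [label]       # pre-selected in the UI, may be edited
    task["orig_accept"] = [label]  # arbitrary key, kept as-is for comparison
    return task


def was_changed(saved):
    """Return True if the annotator's selection differs from the original."""
    return sorted(saved["accept"]) != sorted(saved["orig_accept"])


row = {"id": "1234", "deployment": "xxxx", "date": "2021-08-24",
       "text": "random text", "label": "label1", "score": "0.5"}

task = preselect_label(row)
print(json.dumps(task))

# Simulate an annotator replacing the pre-selected label with another one.
edited = dict(task, accept=["label2"])
print(was_changed(task), was_changed(edited))
```

After annotation, running a check like `was_changed` over the exported examples gives a rough count of how often the pre-annotated label was overridden.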

cheyanneb · October 6, 2021, 3:12pm

Can I refer to a labels file, or does each label need to be written out in a list for "accept": [] and "orig_accept": []? I have a long list of labels. I would like the label to be pre-selected, and the annotator can choose to unselect it and choose another, or keep it selected. So in my .jsonl file, my "label" (the pre-annotated portion) would become "orig_accept"? I tested out a few .jsonl formats, and found that having the pre-annotated label as both "accept": [] and "orig_accept": [] in the .jsonl file (replacing "label":"label1") resulted in the pre-annotated label being pre-checked in the UI.

{"id":"1234","deployment":"xxxx","date":"2021-08-24","text":"random text","orig_accept":["label1"],"score":"0.5","accept":["label1"]}

I also see that you can have both, with the pre-annotated label shown as a banner at the top of the UI and pre-selected (checkbox marked) in the list:

{"id":"1234","deployment":"xxxx","date":"2021-08-24","text":"random text","orig_accept":["label1"],"score":"0.5","accept":["label1"],"label":"label1"}

ines · October 7, 2021, 9:00am

The labels here should be a list of the actual label values that should be pre-selected in the UI. (Pointing to a file here wouldn't really make sense because you do want the labels to be present explicitly in the data so you can export the data later on and know the exact label values.)

cheyanneb · March 22, 2023, 8:00pm

I'm trying to load pre-labeled spans. This does not work -- is the data formatted correctly?

{"text":"chicken chicken chicken","accept":"[{'text': 'chicken', 'start': 0, 'end': 7, 'id': 0, 'ws': True}, {'text': 'chicken', 'start': 8, 'end': 15, 'id': 1, 'ws': True}, {'text': 'chicken', 'start': 16, 'end': 23, 'id': 2, 'ws': False}]"}

I also tried the following:

{"text":"chicken chicken chicken","accept":"[{'text': 'chicken', 'start': 0, 'end': 7, 'id': 0, 'ws': True, 'label': 'STT_ERROR'}, {'text': 'chicken', 'start': 8, 'end': 15, 'id': 1, 'ws': True, 'label': 'STT_ERROR'}, {'text': 'chicken', 'start': 16, 'end': 23, 'id': 2, 'ws': False, 'label': 'STT_ERROR'}]"}

ryanwesslen · March 22, 2023, 8:10pm

hi @cheyanneb!

For pre-annotated spans, you'd need data more like this (this is the output from spans_manual):

{
  "text": "Multivariate analysis revealed that septic shock and bacteremia originating from lower respiratory tract infection were two independent risk factors for 30-day mortality.",
  "tokens":  [
    {"text": "Multivariate", "start": 0, "end": 12, "id": 0, "ws": true},
    {"text": "analysis", "start": 13, "end": 21, "id": 1, "ws": true},
    {"text": "revealed", "start": 22, "end": 30, "id": 2, "ws": true},
    {"text": "that", "start": 31, "end": 35, "id": 3, "ws": true},
    {"text": "septic", "start": 36, "end": 42, "id": 4, "ws": true},
    {"text": "shock", "start": 43, "end": 48, "id": 5, "ws": true},
    {"text": "and", "start": 49, "end": 52, "id": 6, "ws": true},
    {"text": "bacteremia", "start": 53, "end": 63, "id": 7, "ws": true},
    {"text": "originating", "start": 64, "end": 75, "id": 8, "ws": true},
    {"text": "from", "start": 76, "end": 80, "id": 9, "ws": true},
    {"text": "lower", "start": 81, "end": 86, "id": 10, "ws": true},
    {"text": "respiratory", "start": 87, "end": 98, "id": 11, "ws": true},
    {"text": "tract", "start": 99, "end": 104, "id": 12, "ws": true},
    {"text": "infection", "start": 105, "end": 114, "id": 13, "ws": true},
    {"text": "were", "start": 115, "end": 119, "id": 14, "ws": true},
    {"text": "two", "start": 120, "end": 123, "id": 15, "ws": true},
    {"text": "independent", "start": 124, "end": 135, "id": 16, "ws": true},
    {"text": "risk", "start": 136, "end": 140, "id": 17, "ws": true},
    {"text": "factors", "start": 141, "end": 148, "id": 18, "ws": true},
    {"text": "for", "start": 149, "end": 152, "id": 19, "ws": true},
    {"text": "30", "start": 153, "end": 155, "id": 20, "ws": false},
    {"text": "-", "start": 155, "end": 156, "id": 21, "ws": false},
    {"text": "day", "start": 156, "end": 159, "id": 22, "ws": true},
    {"text": "mortality", "start": 160, "end": 169, "id": 23, "ws": false},
    {"text": ".", "start": 169, "end": 170, "id": 24, "ws": false}
  ],
  "spans": [
    {"start": 0, "end": 21, "token_start": 0, "token_end": 1, "label": "METHOD"},
    {"start": 36, "end": 48, "token_start": 4, "token_end": 5, "label": "FACTOR"},
    {"start": 36, "end": 48, "token_start": 4, "token_end": 5, "label": "CONDITION"},
    {"start": 53, "end": 114, "token_start": 7, "token_end": 13, "label": "FACTOR"},
    {"start": 53, "end": 63, "token_start": 7, "token_end": 7, "label": "CONDITION"},
    {"start": 81, "end": 114, "token_start": 10, "token_end": 13, "label": "CONDITION"},
    {"start": 153, "end": 169, "token_start": 20, "token_end": 23, "label": "EFFECT"}
  ]
}

Your data didn't appear to contain any spans, only "tokens" that were placed under the "accept" key. Let me know if you have questions.
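As a rough illustration of that structure applied to the "chicken chicken chicken" example, the sketch below builds a "tokens" list and a "spans" list as real JSON lists rather than a quoted string. It assumes plain whitespace tokenization, which may not match the tokenizer used by your recipe, and `whitespace_tokens` and `token_span` are hypothetical helper names, not Prodigy functions.

```python
import json
import re


def whitespace_tokens(text):
    """Split on whitespace and record character offsets for each token.

    Assumption: whitespace tokenization is good enough for this data.
    """
    tokens = []
    for i, match in enumerate(re.finditer(r"\S+", text)):
        start, end = match.span()
        tokens.append({
            "text": match.group(),
            "start": start,
            "end": end,
            "id": i,
            # True if the token is followed by whitespace in the text.
            "ws": end < len(text) and text[end].isspace(),
        })
    return tokens


def token_span(tokens, first, last, label):
    """Build a span covering tokens first..last (both inclusive)."""
    return {
        "start": tokens[first]["start"],
        "end": tokens[last]["end"],
        "token_start": first,
        "token_end": last,
        "label": label,
    }


text = "chicken chicken chicken"
tokens = whitespace_tokens(text)
# Label every token as its own single-token span.
spans = [token_span(tokens, i, i, "STT_ERROR") for i in range(len(tokens))]
task = {"text": text, "tokens": tokens, "spans": spans}
print(json.dumps(task))
```

The key difference from the rows in the question is that the labels live in "spans", with character and token offsets, while "tokens" only describes the tokenization.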

cheyanneb · March 29, 2023, 3:09pm

Can you confirm what ws signifies? Whitespace?

ryanwesslen · March 29, 2023, 3:18pm

Yes. Technically it is "whether tokens are followed by whitespace or not".
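A quick way to see what the flag does is to rebuild a piece of text from its tokens. The sketch below uses the last five tokens from the example above; `join_tokens` is a hypothetical helper, and it assumes each trailing whitespace is a single space.

```python
def join_tokens(tokens):
    """Rebuild text by adding a space after every token whose "ws" is true."""
    return "".join(tok["text"] + (" " if tok["ws"] else "") for tok in tokens)


tokens = [
    {"text": "30", "start": 153, "end": 155, "id": 20, "ws": False},
    {"text": "-", "start": 155, "end": 156, "id": 21, "ws": False},
    {"text": "day", "start": 156, "end": 159, "id": 22, "ws": True},
    {"text": "mortality", "start": 160, "end": 169, "id": 23, "ws": False},
    {"text": ".", "start": 169, "end": 170, "id": 24, "ws": False},
]

print(join_tokens(tokens))  # 30-day mortality.
```

Because "30", "-" and "mortality" have "ws" set to false, they are joined directly to the token that follows, while "day" is followed by a space.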
