Retrieving binary-annotated records from a mixed dataset

I made the mistake of saving my new ner.teach annotations into a dataset that already contained all of the original un-annotated examples (but with answer = accept). If I use this dataset for training, will ner.batch-train use only the annotated records, or will it assume the un-annotated records have no entities?

Assuming the latter, I would like to pull out only the annotated records from that dataset's JSONL, put them in a new JSONL file, and create a new dataset from it for training.

On inspection, I believe the un-annotated records look like this:

{"text":"If YES, continue to question 9. If NO, skip to question 13. YES NO Appears on","_input_hash":-1734817824,"_task_hash":484192225,"answer":"accept"}

while the binary-annotated records seem to be all at the end of the JSONL file and look more like this:

{"text":"$ % $ $ % $ TRAVEL $ $ OTHER DIRECT COSTS (ODC) $ $ General and Administrative (G&A) ","_input_hash":149251391,"_task_hash":504422277,"tokens":[{"text":"$","start":0,"end":1,"id":0},{"text":" ","start":2,"end":3,"id":1},{"text":"%","start":3,"end":4,"id":2},{"text":" ","start":5,"end":6,"id":3},{"text":"$","start":6,"end":7,"id":4},{"text":" ","start":8,"end":9,"id":5},{"text":"$","start":9,"end":10,"id":6},{"text":" ","start":11,"end":12,"id":7},{"text":"%","start":12,"end":13,"id":8},{"text":" ","start":14,"end":16,"id":9},{"text":"$","start":16,"end":17,"id":10},{"text":" ","start":18,"end":19,"id":11},{"text":"TRAVEL","start":19,"end":25,"id":12},{"text":" ","start":26,"end":27,"id":13},{"text":"$","start":27,"end":28,"id":14},{"text":" ","start":29,"end":30,"id":15},{"text":"$","start":30,"end":31,"id":16},{"text":" ","start":32,"end":34,"id":17},{"text":"OTHER","start":34,"end":39,"id":18},{"text":"DIRECT","start":40,"end":46,"id":19},{"text":"COSTS","start":47,"end":52,"id":20},{"text":"(","start":53,"end":54,"id":21},{"text":"ODC","start":54,"end":57,"id":22},{"text":")","start":57,"end":58,"id":23},{"text":" ","start":59,"end":61,"id":24},{"text":"$","start":61,"end":62,"id":25},{"text":" ","start":63,"end":64,"id":26},{"text":"$","start":64,"end":65,"id":27},{"text":" ","start":66,"end":67,"id":28},{"text":"General","start":67,"end":74,"id":29},{"text":"and","start":75,"end":78,"id":30},{"text":"Administrative","start":79,"end":93,"id":31},{"text":"(","start":94,"end":95,"id":32},{"text":"G&A","start":95,"end":98,"id":33},{"text":")","start":98,"end":99,"id":34},{"text":" ","start":100,"end":101,"id":35}],"spans":[{"start":95,"end":98,"text":"G&A","rank":0,"label":"PERSON","score":0.5181355889,"source":"en_core_web_lg","input_hash":149251391}],"meta":{"score":0.5181355889},"_session_id":null,"_view_id":"ner","answer":"reject"}

Is there a sure-fire way to extract all the right ones (e.g., "if 'meta' in j.keys(): ...")? I didn't see a detailed description of the JSONL format in the documentation.

I'd hate to lose the annotations. Thanks in advance!

Filtering for the meta field returned the lines I needed, and I was able to train off only those records after creating a new dataset.
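
For reference, here's roughly the filter I used; a minimal sketch, and the file names are placeholders for my actual paths:

import json

# Keep only the records that came out of ner.teach - in this export,
# those are the ones carrying a "meta" field (along with tokens/spans).
# "mixed.jsonl" and "annotated_only.jsonl" are placeholder file names.
with open("mixed.jsonl", encoding="utf8") as infile, \
        open("annotated_only.jsonl", "w", encoding="utf8") as outfile:
    for line in infile:
        if "meta" in json.loads(line):
            outfile.write(line)

After that, prodigy db-in loaded the filtered file into a fresh dataset for training.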


Glad you found a solution! Annotations created with Prodigy v1.8.0 and above should also store a "_view_id" field on each task, containing the ID of the annotation interface used to create it. So this could be an alternative solution: filter for the examples that have "_view_id": "ner", since those were added via the app.
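
For example, a quick sketch along the same lines (file names are placeholders again):

import json

# Alternative filter: keep only tasks whose "_view_id" is "ner", i.e.
# examples that were annotated in the ner interface via the app.
with open("mixed.jsonl", encoding="utf8") as infile, \
        open("annotated_only.jsonl", "w", encoding="utf8") as outfile:
    for line in infile:
        if json.loads(line).get("_view_id") == "ner":
            outfile.write(line)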