Hi prodi.gy gurus,
I'm developing an annotation workflow for targeted sentiment analysis.
At a high level, this consists of 2 steps:
- perform NER on a document to find entities of interest (e.g., "chest pain", "dyspnea")
- perform text classification with respect to each entity detected in step 1 (e.g., "Pt has chest pain but no dyspnea" yields the following: CHEST_PAIN=PRESENT, DYSPNEA=ABSENT)
Step 1 is super easy with prodigy's out-of-the-box ner recipes (thanks for making it so easy!)
Step 2 requires some custom recipe development...
I've developed a custom recipe based on textcat.manual, but I've replaced the "stream" key with an iterator that yields one object per entity per document (rather than one object per document). Basically, something like this:
obj = textcat.manual(dataset, source, loader, label, exclusive, exclude)  # override textcat.manual
obj['stream'] = partition_by_span(obj['stream'])
The partition_by_span function yields multiple copies of each document, each copy containing only one span in the "spans" key. Something like this:
from copy import deepcopy
from prodigy import set_hashes

def partition_by_span(obj_iter):
    for obj in obj_iter:
        for i, span in enumerate(obj.get('spans', [])):
            newobj = deepcopy(obj)
            # just include the one span we are iterating on. This will highlight the entity/span
            # in the prodigy UI and tell the annotator what entity to consider for sentiment
            newobj['spans'] = [span]
            newobj['span_index'] = i  # store the index so the input hash below can actually use it
            # drop the old _task_hash and _input_hash, since we're about to change them
            newobj.pop('_task_hash', None)
            newobj.pop('_input_hash', None)
            # rehash _input_hash based on text and span_index (otherwise, entities in the same text document will be lost)
            newobj = set_hashes(newobj, input_keys=("text", "span_index"))
            yield newobj
This means that I'll see two annotation opportunities for the document "Pt has chest pain but no dyspnea" (the highlighted span is shown in brackets):
- "Pt has [chest pain] but no dyspnea"
- "Pt has chest pain but no [dyspnea]"
This works pretty well, but now I have a bunch of annotated "documents" that are actually entities masquerading as documents. I'd like to stitch them back up and put each entity sentiment in the appropriate spans array in the parent document where it belongs.
Is there a way to re-aggregate my partitioned spans/entities within my custom recipe? Or do I have to export my collection of annotated single-entity-documents to jsonl and stitch them back together with a separate python script?
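For reference, the separate-script route could look roughly like the sketch below. It is only a sketch under a few assumptions: the annotated records have already been loaded as plain dicts (for example from a prodigy db-out export), the selected sentiment is stored under an "accept" list as in a multiple-choice task, and copies of the same document can be recognised by having identical "text". The stitch_by_text helper and the "sentiment" key it writes are made-up names for illustration, not part of prodigy.

```python
from collections import OrderedDict
from copy import deepcopy


def stitch_by_text(examples):
    """Merge single-span annotated records into one record per document.

    Hypothetical helper: groups records by identical "text" and copies the
    annotator's choice onto each span under an illustrative "sentiment" key.
    """
    merged = OrderedDict()
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored annotations
        # Assumption: all copies of one document share exactly the same text.
        doc = merged.setdefault(eg["text"], {"text": eg["text"], "spans": []})
        for span in eg.get("spans", []):
            span = deepcopy(span)
            # Assumption: the chosen label(s) live in an "accept" list.
            span["sentiment"] = list(eg.get("accept", []))
            doc["spans"].append(span)
    for doc in merged.values():
        doc["spans"].sort(key=lambda s: s["start"])  # restore document order
    return list(merged.values())


text = "Pt has chest pain but no dyspnea"
annotated = [
    {"text": text, "answer": "accept", "accept": ["ABSENT"],
     "spans": [{"start": 25, "end": 32, "label": "DYSPNEA"}]},
    {"text": text, "answer": "accept", "accept": ["PRESENT"],
     "spans": [{"start": 7, "end": 17, "label": "CHEST_PAIN"}]},
]
docs = stitch_by_text(annotated)
for span in docs[0]["spans"]:
    print(span["label"], span["sentiment"])
```

Grouping on "text" would merge two different documents that happen to have identical text, so a dedicated document id carried through the stream would be a safer key if one is available.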
Thanks!
Ford