Document Images - Textual Images Labeling

I am using Prodigy with great results on computer vision annotation tasks. I now have a requirement to add labels to specific entries in text, where the text to be annotated is in an image/scanned document. The mapping between the image and the actual text is done via OCR extraction. The annotation task is elaborate, requiring labels to be added to individual text spans. I was curious whether there is a way to combine ner.manual and image.manual for this use case.
Currently, I am planning to use the image_manual view and to simplify the annotation process by only requiring users to draw an approximate bounding box. With that information, I would like to look up the corresponding word in the OCR output and replace the approximate bounding box with a more accurate one. To do this, I tried implementing a callback for the spanselected event, but unfortunately it is not triggered; the update callback does not trigger either. As an alternative, I could use validate_answer, which does get triggered, but with that approach the user cannot see the corrected boxes before accepting the answer.

Below is an excerpt from my recipe dict, with placeholders for the callbacks. Any suggestions on how I could go about this are greatly appreciated.
Thank you,

def validate_answer(eg):
    print(eg)
    # selected = eg.get("accept", [])
    # print(selected)

def span_selected(span_data):
    print(span_data)

def update(examples):
    # This function is triggered when Prodigy receives annotations
    print(f"Received {len(examples)} annotations!")


return {
    "view_id": "image_manual",  # Annotation interface to use
    "dataset": dataset,  # Name of dataset to save annotations
    "stream": stream,  # Incoming stream of examples
    "exclude": exclude,  # List of dataset names to exclude
    "prodigyupdate": update,
    "prodigyspanselected": span_selected,
    "validate_answer": validate_answer,
    "config": {  # Additional config settings, mostly for app UI
        "label": ", ".join(label) if label is not None else "all",
        "labels": label,  # Selectable label options
        "darken_image": 0.3 if darken else 0,
    },
}
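For what it's worth, the box-correction step itself does not depend on any Prodigy callback: given the annotator's rough box and the OCR word boxes, you can snap to the words it covers. A rough sketch of that idea (the `snap_to_ocr` name, the `(x, y, width, height)` box format, and the overlap threshold are my own assumptions, not Prodigy's or any OCR library's API):

```python
def snap_to_ocr(approx, ocr_words, min_overlap=0.5):
    """Replace a rough annotator box with the tight union of the OCR words it covers.

    approx: (x, y, w, h) box drawn by the annotator.
    ocr_words: list of dicts like {"text": ..., "bbox": (x, y, w, h)}.
    min_overlap: fraction of a word's area that must fall inside the rough box.
    """
    ax, ay, aw, ah = approx
    hits = []
    for word in ocr_words:
        wx, wy, ww, wh = word["bbox"]
        # Intersection of the rough box and this word's box
        ix = max(0, min(ax + aw, wx + ww) - max(ax, wx))
        iy = max(0, min(ay + ah, wy + wh) - max(ay, wy))
        if ww * wh and (ix * iy) / (ww * wh) >= min_overlap:
            hits.append(word)
    if not hits:
        return None
    # Union of all matched word boxes, back in (x, y, w, h) form
    x1 = min(w["bbox"][0] for w in hits)
    y1 = min(w["bbox"][1] for w in hits)
    x2 = max(w["bbox"][0] + w["bbox"][2] for w in hits)
    y2 = max(w["bbox"][1] + w["bbox"][3] for w in hits)
    return {"text": " ".join(w["text"] for w in hits),
            "bbox": (x1, y1, x2 - x1, y2 - y1)}
```

Run against the stream, this would let you emit the corrected boxes in the examples themselves, so the annotator sees tight boxes regardless of which callback fires.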

A few ideas come to mind that could help, but could you share more information about the task? Is there a reason you're considering named entity recognition instead of span categorization? Does your task parse full sentences, or are you dealing with items in a list/table? Have you compared multiple OCR libraries?

One idea that popped into my mind is that you might be able to build a custom recipe that uses blocks. Technically this should allow you to add multiple interfaces in a single view. This might be worth an experiment but I can also imagine the merit of separating the two labeling tasks. You might be able to get more labels per hour, as well as more high-quality labels if the person who is labeling only needs to concern themselves with one labeling task at a time.
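To make the blocks idea concrete, here's a minimal sketch of what the returned recipe components might look like (the dataset name and label names are placeholders; whether the two interfaces share the annotations the way you need is exactly what the experiment would have to show):

```python
# Hypothetical recipe components combining two interfaces via the "blocks" view.
components = {
    "view_id": "blocks",
    "dataset": "ocr_labels",  # placeholder dataset name
    "stream": [],             # your stream of image + OCR-text examples
    "config": {
        "blocks": [
            {"view_id": "image_manual", "labels": ["WORD"]},   # draw boxes
            {"view_id": "ner_manual", "labels": ["FIELD"]},    # label text spans
        ],
    },
}
```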

Another idea that popped into my mind is that you might be able to combine OCR libraries. If two different approaches agree on the parsed text you could argue there's more confidence that it's correct. When two libraries disagree, then you may want to skip.
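The agreement check could be as simple as a word-by-word comparison. A rough sketch, assuming the two engines return word lists that have already been aligned to the same regions (the alignment itself is the hard part and is glossed over here):

```python
def reconcile(words_a, words_b):
    """Keep words where two OCR engines agree; flag disagreements for skipping.

    words_a, words_b: parallel lists of recognized strings for the same regions.
    """
    kept, skipped = [], []
    for a, b in zip(words_a, words_b):
        # Case-insensitive comparison, ignoring surrounding whitespace
        if a.strip().lower() == b.strip().lower():
            kept.append(a)
        else:
            skipped.append((a, b))
    return kept, skipped
```

You could then only stream examples whose words all land in `kept`, and route the `skipped` pairs to a separate queue for manual review.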