Query regarding ner.manual input field for OCR-corrected data (Prodigy pdf.ocr.correct)

Dear Prodigy Support Team,

I am currently working on a digital humanities project involving historical letter collections, using Prodigy for my NLP pipeline. I've successfully completed the OCR correction phase using pdf.ocr.correct.

My workflow is as follows:

  1. Raw PDFs are processed and initial OCR is performed (thank you for supporting German and French too :)!

  2. OCR transcription is manually corrected using pdf.ocr.correct, and these corrections are stored in a Prodigy dataset (e.g., my_master_ocr_corrections).

Upon inspecting the exported data from pdf.ocr.correct (e.g., using db-out), I've observed that the original OCR output is stored in the text field, while my manual corrections are saved in the transcription field. An example of an exported record shows:

{
  "label": "CONCERN",
  "text": "640/A II 4A/-",
  "transcription": "640/All4A/--Ki",
  "meta": {
    "field_id": "transcription",
    "field_label": "Transcript",
    // ... other meta fields
  },
  // ... other fields
}

Now, I'm moving to the NER annotation phase using ner.manual. My understanding, based on the ner.manual documentation, is that it primarily uses the text field as the source for annotation. This means that if I directly feed my ocr_corrected_dataset to ner.manual, it will present the uncorrected OCR text (from the text field) for annotation, rather than my validated corrections (from the transcription field).

Currently, my workaround involves:

  1. Exporting the ocr_corrected_dataset to a .jsonl file.

  2. Running a custom Python script to read this .jsonl file, move the content from the transcription field to the text field for each record, and save it to a new .jsonl file (a simplified sketch follows this list).

  3. Importing this newly prepared .jsonl file into ner.manual.
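For reference, the script in step 2 amounts to roughly the following (a simplified sketch; the file names are placeholders):

import json

# Copy the corrected transcription into "text" so ner.manual picks it up.
with open("ocr_corrected_dataset.jsonl", encoding="utf8") as infile, \
        open("ocr_corrected_for_ner.jsonl", "w", encoding="utf8") as outfile:
    for line in infile:
        record = json.loads(line)
        if "transcription" in record:
            record["text"] = record["transcription"]
        outfile.write(json.dumps(record, ensure_ascii=False) + "\n")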

While this workaround is functional, it adds an extra step and additional file management.

My question is: Is there a direct parameter or a more elegant, built-in way within the ner.manual recipe (or other Prodigy recipes) to specify which field (e.g., transcription) should be used as the primary text source for annotation, instead of defaulting to text? This would streamline the process significantly.

Thank you for your time and assistance.

Sincerely,

Hi @dh_gerard ,

While there's no CLI argument to specify which key to use (there will be in the future as we move from unstructured, generator-based streams to structured streams), you can definitely automate the procedure so that no manual step is involved.

I recommend you implement your logic as a custom Prodigy loader and wrap it as a tiny Prodigy recipe, for example:

import copy
import json

import prodigy
from prodigy.components.stream import get_stream


@prodigy.recipe("load-data")
def load_data(source):
    # Load the source (e.g. dataset:ocr) as a Prodigy stream of task dicts
    stream = get_stream(
        source, rehash=True, dedup=True, is_binary=False, view_id="ner_manual"
    )
    for eg in stream:
        # Copy each example, overwrite "text" with the corrected transcription
        # and print it as JSON so the output can be piped into another recipe
        eg_copy = copy.deepcopy(eg)
        eg_copy["text"] = eg["transcription"]
        print(json.dumps(eg_copy))

You can then pipe the output of this recipe into the ner.manual command, setting the input source to stdin:

python -m prodigy load-data dataset:ocr -F transcript_loader.py | python -m prodigy ner.manual test blank:en --label FOO - --loader jsonl

The first command calls the tiny loading recipe and prints the examples one at a time to stdout. It could be a plain Python script as well, but by wrapping it as a Prodigy recipe you can use the familiar Prodigy CLI to pass arguments, such as the input dataset name (dataset:ocr in the example above). ner.manual then reads from stdin (that's the - on the CLI) and uses the jsonl loader to load the examples into the UI.

This way there's no need to run db-out and the preprocessing script manually. The only requirement is to have the custom loader script available and to modify the command to pipe the data through it.
You can find more info on loaders, including stdin and custom loaders, in our docs here (especially the "Using custom loaders with built-in recipes" section).
Hope that helps!