Dear Prodigy Support Team,
I am currently working on a digital humanities project involving historical letter collections, using Prodigy for my NLP pipeline. I've successfully completed the OCR correction phase using pdf.ocr.correct.
My workflow is as follows:
- Raw PDFs are processed, and initial OCR is performed (thank you for making this work in German and French too!).
- The OCR transcription is manually corrected using pdf.ocr.correct, and these corrections are stored in a Prodigy dataset (e.g., my_master_ocr_corrections).
Upon inspecting the exported data from pdf.ocr.correct (e.g., using db-out), I've observed that the original OCR output is stored in the text field, while my manual corrections are saved in the transcription field. An example of an exported record shows:
{
  "label": "CONCERN",
  "text": "640/A II 4A/-",
  "transcription": "640/All4A/--Ki",
  "meta": {
    "field_id": "transcription",
    "field_label": "Transcript",
    // ... other meta fields
  },
  // ... other fields
}
Now, I'm moving to the NER annotation phase using ner.manual. My understanding, based on the ner.manual documentation, is that it primarily uses the text field as the source for annotation. This means that if I directly feed my ocr_corrected_dataset to ner.manual, it will present the uncorrected OCR text (from the text field) for annotation, rather than my validated corrections (from the transcription field).
Currently, my workaround involves:
- Exporting the ocr_corrected_dataset to a .jsonl file.
- Running a custom Python script to read this .jsonl file, move the content from the transcription field to the text field for each record, and save it to a new .jsonl file.
- Importing this newly prepared .jsonl file into ner.manual.
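For reference, the conversion script in the second step looks roughly like the sketch below. It assumes each exported line is one JSON object with the text and transcription fields shown above. The function names, the file names in the usage comment, and the ocr_text field used to keep the original OCR output are my own illustrative choices, not part of Prodigy.

```python
import json
from pathlib import Path


def promote_transcription(record, source_field="transcription", backup_field="ocr_text"):
    """Return a copy of `record` whose "text" holds the corrected transcription.

    `backup_field` is a hypothetical field name used here only to keep the
    original OCR output around; it is not a Prodigy convention.
    """
    updated = dict(record)
    corrected = updated.get(source_field)
    if corrected:  # records without a correction are left unchanged
        updated[backup_field] = updated.get("text")
        updated["text"] = corrected
    return updated


def convert_file(in_path, out_path):
    """Read a JSONL export and write a JSONL file with corrected "text" values."""
    with Path(in_path).open(encoding="utf8") as src, \
            Path(out_path).open("w", encoding="utf8") as dst:
        for line in src:
            if not line.strip():  # skip blank lines
                continue
            record = promote_transcription(json.loads(line))
            # ensure_ascii=False keeps German and French characters readable
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example call (file names are placeholders):
# convert_file("ocr_corrected_dataset.jsonl", "ner_input.jsonl")
```

Applied to the record above, this yields "text": "640/All4A/--Ki" and keeps "640/A II 4A/-" under ocr_text.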
While this workaround is functional, it adds an extra processing step and extra intermediate files to manage.
My question is: Is there a direct parameter or a more elegant, built-in way within the ner.manual recipe (or other Prodigy recipes) to specify which field (e.g., transcription) should be used as the primary text source for annotation, instead of defaulting to text? This would streamline the process significantly.
Thank you for your time and assistance.
Sincerely,