load data from parquet and display metadata in ner.manual UI

Dear Explosion team,

I have a parquet file with a couple of columns that I want to use as input for the ner.manual recipe. I managed to get it to work with a little trial and error (the parquet loader isn't documented at all). I want to show the filename (which is a column in the parquet file) during annotation, to make it easier to go back to the original source of a text in case parsing or OCR issues surface during annotation. But I don't even know how to start implementing this seemingly trivial change.

Do I need to create a customized version of ner.manual, or is there an easier way to modify the interface? The loader seems to stream in my metadata correctly: I can see the "filename" variable when I run window.prodigy() in the JavaScript console.

Welcome to the forum @nikita :waving_hand:

You don't need to reimplement ner.manual; this is really an input data-shape issue. Prodigy's built-in meta panel only renders keys under task["meta"], and the parquet loader yields each row as a dict, with the columns becoming top-level keys on the task. So your task looks like:

{"text": "...", "filename": "doc_042.pdf"}

That's why you can see filename via window.prodigy() but the UI doesn't show it. The fix is to get the filename under meta:

{"text": "...", "meta": {"filename": "doc_042.pdf"}}

You can do that either by preprocessing the input or with a small recipe wrapper — both leave ner.manual itself untouched.

Option 1 — preprocess to JSONL (simplest)

Convert the input file with an external script and pass the result to the built-in ner.manual:

# parquet_to_jsonl.py
import json, sys
import pyarrow.parquet as pq

for row in pq.read_table(sys.argv[1]).to_pylist():
    print(json.dumps({
        "text": row["text"],
        "meta": {"filename": row["filename"]},
    }))

Then run the script and start Prodigy on the result:

python parquet_to_jsonl.py data.parquet > data.jsonl
prodigy ner.manual my_dataset blank:en data.jsonl --label PERSON,ORG
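
If the parquet file is too large to load in one go, the same conversion can probably be done batch by batch. The following is a minimal sketch under a few assumptions: the columns are named text and filename as above, row_to_task and convert are hypothetical helper names chosen for illustration, and the batch size of 1000 is arbitrary. It relies on pyarrow's ParquetFile.iter_batches and has not been run against your data.

```python
# parquet_to_jsonl_batched.py (hypothetical file name)
import json
import sys


def row_to_task(row, text_key="text", meta_keys=("filename",)):
    """Build a task dict with the selected columns moved under "meta"."""
    # Skip columns that are missing or null so "meta" only holds real values.
    meta = {k: row[k] for k in meta_keys if row.get(k) is not None}
    return {"text": row[text_key], "meta": meta}


def convert(parquet_path, out=None, text_key="text",
            meta_keys=("filename",), batch_size=1000):
    """Write one JSON line per row, reading the file in batches."""
    # Imported here so row_to_task can be used without pyarrow installed.
    import pyarrow.parquet as pq

    out = out or sys.stdout
    columns = [text_key, *meta_keys]
    parquet_file = pq.ParquetFile(parquet_path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        for row in batch.to_pylist():
            out.write(json.dumps(row_to_task(row, text_key, meta_keys)) + "\n")


# Only run the conversion when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    convert(sys.argv[1])
```

It is used the same way as the script above (python parquet_to_jsonl_batched.py data.parquet > data.jsonl). Passing more column names in meta_keys would show them in the meta panel as well.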

Option 2 — a thin wrapper recipe (keeps parquet as input)

# ner_manual_parquet.py
import spacy
import prodigy
from prodigy.recipes.ner import manual as ner_manual

def lift_to_meta(stream, keys=("filename",)):
    for eg in stream:
        meta = eg.setdefault("meta", {})
        for k in keys:
            if k in eg:
                meta[k] = eg[k]
        yield eg

@prodigy.recipe(
    "ner.manual.parquet",
    dataset=("Dataset to save to", "positional", None, str),
    lang=("Language for blank tokenizer, e.g. 'en'", "positional", None, str),
    source=("Parquet file path", "positional", None, str),
    label=("Comma-separated labels", "option", "l", str),
)
def ner_manual_parquet(dataset, lang, source, label=None):
    nlp = spacy.blank(lang)
    components = ner_manual(dataset, nlp, source, label=label)
    components["stream"].apply(lift_to_meta)
    return components

Run it with -F:

prodigy ner.manual.parquet my_dataset en data.parquet --label PERSON,ORG -F ner_manual_parquet.py
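
Before starting the server, it can help to check the helper on a few plain dicts. This is just a sanity check: it repeats lift_to_meta from the recipe so it runs without Prodigy, and the example tasks (including doc_043.pdf and the page key) are made up for illustration.

```python
def lift_to_meta(stream, keys=("filename",)):
    # Same helper as in the recipe: copy selected top-level keys into "meta".
    for eg in stream:
        meta = eg.setdefault("meta", {})
        for k in keys:
            if k in eg:
                meta[k] = eg[k]
        yield eg


# Made-up tasks covering three cases: no meta yet, existing meta, no filename.
examples = [
    {"text": "first", "filename": "doc_042.pdf"},
    {"text": "second", "filename": "doc_043.pdf", "meta": {"page": 7}},
    {"text": "third"},
]

lifted = list(lift_to_meta(examples))
for eg in lifted:
    print(eg["text"], eg["meta"])
```

Existing meta entries are kept, tasks without a filename get an empty meta, and the top-level filename key stays on the task, so it is still saved with the annotation.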

Also — fair point on the parquet loader docs. It's a built-in (.parquet files are picked up automatically as long as pyarrow is installed), but it's currently undocumented. I'll flag that internally so we can get it added.

Thanks for the warm welcome and your super helpful reply!

I went with the second option. After I changed label=("Comma-separated labels", "option", "l", str) to label=("Comma-separated labels", "option", "l", split_string) and added from prodigy.util import split_string, everything works exactly as I hoped!
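
For anyone who runs into the same thing: with str as the argument type, --label PERSON,ORG reaches the recipe as the single string "PERSON,ORG", while ner.manual works with a list of labels. The snippet below is a rough stand-in for what a converter like split_string does. It is an assumption about the behaviour written for illustration, not Prodigy's source code, and split_labels is a made-up name.

```python
def split_labels(value):
    """Turn a comma-separated string into a list of labels."""
    # An empty or missing value means no labels were passed.
    if not value:
        return []
    # Strip surrounding whitespace and drop empty pieces such as a trailing comma.
    return [part.strip() for part in value.split(",") if part.strip()]


labels = split_labels("PERSON,ORG")
print(labels)
```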