Welcome to the forum, @nikita!
You don't need to reimplement ner.manual. This is really an input data-shape issue: Prodigy's built-in meta panel only renders keys under task["meta"], whereas the parquet loader yields each row as a dict whose columns become top-level keys on the task. So your task looks like:
{"text": "...", "filename": "doc_042.pdf"}
That's why you can see filename via window.prodigy() but the UI doesn't show it. The fix is to get the filename under meta:
{"text": "...", "meta": {"filename": "doc_042.pdf"}}
You can do that either by preprocessing the input or with a small recipe wrapper — both leave ner.manual itself untouched.
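To make the shape difference concrete, here is a minimal plain-Python sketch with no Prodigy involved. The helper name nest_under_meta is hypothetical (it is not a Prodigy API). It copies the selected top-level keys into "meta" and leaves the original keys in place, which should be harmless for annotation.

```python
def nest_under_meta(task, keys=("filename",)):
    """Return a copy of the task with the given top-level keys copied under 'meta'.

    Hypothetical helper for illustration only; not part of Prodigy.
    """
    new_task = dict(task)
    meta = dict(new_task.get("meta", {}))
    for key in keys:
        if key in new_task:
            meta[key] = new_task[key]
    new_task["meta"] = meta
    return new_task


flat = {"text": "...", "filename": "doc_042.pdf"}
nested = nest_under_meta(flat)
print(nested["meta"])  # {'filename': 'doc_042.pdf'}
```

Both options below apply the same idea, just at different points in the pipeline.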
Option 1 — preprocess to JSONL (simplest)
Convert the parquet file to JSONL with a small external script, then pass the result to the built-in ner.manual:
# parquet_to_jsonl.py
import json, sys
import pyarrow.parquet as pq
for row in pq.read_table(sys.argv[1]).to_pylist():
    print(json.dumps({
        "text": row["text"],
        "meta": {"filename": row["filename"]},
    }))
python parquet_to_jsonl.py data.parquet > data.jsonl
prodigy ner.manual my_dataset blank:en data.jsonl --label PERSON,ORG
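Before starting the server, it can help to confirm that every converted line has the expected shape. This is only a sketch: check_jsonl is a hypothetical helper, and it assumes each task needs "text" plus a "filename" entry under "meta", as in the example above.

```python
import json


def check_jsonl(lines, required_meta=("filename",)):
    """Return (line_number, problem) tuples for tasks missing text or meta keys.

    Hypothetical helper for illustration only; not part of Prodigy.
    """
    problems = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        if "text" not in task:
            problems.append((i, "missing text"))
        meta = task.get("meta", {})
        for key in required_meta:
            if key not in meta:
                problems.append((i, f"missing meta.{key}"))
    return problems


sample = [
    '{"text": "a", "meta": {"filename": "doc_042.pdf"}}',
    '{"text": "b", "filename": "doc_043.pdf"}',
]
print(check_jsonl(sample))  # [(2, 'missing meta.filename')]
```

To check the real file, pass an open file handle instead of the sample list, e.g. check_jsonl(open("data.jsonl", encoding="utf8")).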
Option 2 — a thin wrapper recipe (keeps parquet as input)
# ner_manual_parquet.py
import prodigy
from prodigy.recipes.ner import manual as ner_manual


def lift_to_meta(stream, keys=("filename",)):
    # Copy the selected top-level keys into task["meta"] so the meta panel shows them.
    for eg in stream:
        meta = eg.setdefault("meta", {})
        for k in keys:
            if k in eg:
                meta[k] = eg[k]
        yield eg


@prodigy.recipe(
    "ner.manual.parquet",
    dataset=("Dataset to save to", "positional", None, str),
    lang=("Language for blank tokenizer, e.g. 'en'", "positional", None, str),
    source=("Parquet file path", "positional", None, str),
    label=("Comma-separated labels", "option", "l", str),
)
def ner_manual_parquet(dataset, lang, source, label=None):
    # Pass the same values the CLI would hand to ner.manual:
    # a "blank:<lang>" model string and a list of labels.
    labels = label.split(",") if label else None
    components = ner_manual(dataset, f"blank:{lang}", source, label=labels)
    # Wrap the stream so each task gets its filename copied under meta.
    components["stream"].apply(lift_to_meta)
    return components
Run it with -F:
prodigy ner.manual.parquet my_dataset en data.parquet --label PERSON,ORG -F ner_manual_parquet.py
Also — fair point on the parquet loader docs. It's a built-in (.parquet files are picked up automatically as long as pyarrow is installed), but it's currently undocumented. I'll flag that internally so we can get it added.