How do I download the dataset I annotated with the Prodigy tool in JSON format?

From the Prodigy documentation, I can use the following command to download the annotated dataset to my local computer, but the dataset is in spaCy's binary format. I want the dataset in JSON format. How can I do that? Thank you.

python -m prodigy data-to-spacy ./corpus --ner ner-d2 --eval-split 0.2

I have another question: after I run this command, there is a JSON file in the labels folder inside the corpus folder. What is it for? Thanks again.

hi @Zanejins!

The db-out recipe is the easiest way to export a dataset as JSONL.

For example:

python -m prodigy db-out ner-d2 > my_ner_dataset.jsonl
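db-out writes newline-delimited JSON (JSONL), where each line is one annotation task. If you specifically need a single .json array instead, here's a minimal conversion sketch using only the standard library (the filenames just mirror the command above):

```python
import json
from pathlib import Path


def jsonl_to_json(src: str, dest: str) -> None:
    """Convert a newline-delimited JSONL file into a single JSON array."""
    # Each non-empty line in a .jsonl file is one JSON object.
    records = [
        json.loads(line)
        for line in Path(src).read_text(encoding="utf8").splitlines()
        if line.strip()
    ]
    Path(dest).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf8"
    )


# e.g. jsonl_to_json("my_ner_dataset.jsonl", "my_ner_dataset.json")
```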

Alternatively, you could use Prodigy's DB components to pull the data into Python and then export it however you deem fit.

from prodigy.components.db import connect
import srsly

db = connect()  # uses the DB settings from your prodigy.json
examples = db.get_dataset("ner-d2")  # list of annotation task dicts
srsly.write_jsonl("ner-d2.jsonl", examples)

Those are pre-generated labels that speed up training with spacy train, since spaCy won't have to preprocess the data to extract the labels first.

Thank you. I used this command to export the ner-d2 data as JSON, but it doesn't seem to provide an --eval-split option.
The prodigy train command provides --eval-split, but the generated training and validation sets are in spaCy's binary format. Can these two .spacy files be converted directly to JSON? Thanks

It's probably easier to write a quick script. Here's one that can help:

# prodigy-partition.py
from prodigy.components.db import connect
import typer
from pathlib import Path
import srsly
import random
import shutil


def main(prodigy_data: str, out_dir: Path, fraction: float = 0.2, seed: int = 0):
    """Partition the data into train/test/dev split."""
    random.seed(seed)
    db = connect()
    examples = db.get_dataset(prodigy_data)
    random.shuffle(examples)
    # Reserve `fraction` of the examples for dev and the same for test;
    # everything left over goes to train.
    dev_size = int(fraction * len(examples))
    test_size = int(fraction * len(examples))
    train_size = len(examples) - dev_size - test_size
    if out_dir.exists():
        shutil.rmtree(out_dir)

    out_dir.mkdir(parents=True)
    srsly.write_jsonl(out_dir / "train.jsonl", examples[:train_size])
    srsly.write_jsonl(out_dir / "dev.jsonl", examples[train_size : train_size + dev_size])
    srsly.write_jsonl(out_dir / "test.jsonl", examples[train_size + dev_size :])


if __name__ == "__main__":
    typer.run(main)

Then run it like so:

python prodigy-partition.py ner-d2 data_folder

This writes three files from ner-d2 into a new folder called data_folder. You can change the amount of data with the fraction argument, which sets the share used for each of dev and test. By default it is 20%; whatever is left over is used for train.
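As a quick sanity check after running it, you can count the examples in each output file. A small stdlib sketch (the file names match what the script above writes):

```python
from pathlib import Path


def split_sizes(out_dir: str) -> dict:
    """Count annotation examples in each JSONL partition."""
    sizes = {}
    for name in ("train", "dev", "test"):
        path = Path(out_dir) / f"{name}.jsonl"
        # Each non-empty line is one JSON-encoded example.
        sizes[name] = sum(
            1 for line in path.read_text(encoding="utf8").splitlines() if line.strip()
        )
    return sizes


# e.g. split_sizes("data_folder") -> {"train": 60, "dev": 20, "test": 20}
# for 100 examples with the default fraction of 0.2
```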

You can modify this as you deem fit (e.g., only have two partitions in the data, etc.).