How to download the dataset I annotated using the Prodigy tool in JSON format?

Zanejins · March 2, 2023, 12:12pm

According to the Prodigy documentation, I can use the following command to export the annotated dataset to my local computer, but the output is in spaCy's format. I want to get the dataset in JSON format. How can I do that? Thank you.

python -m prodigy data-to-spacy ./corpus --ner ner-d2 --eval-split 0.2

I have another question: after I run this command, there is a JSON file in the label folder inside the corpus folder. What is it for? Thanks again.

ryanwesslen · March 2, 2023, 1:31pm

hi @Zanejins!

The db-out recipe is the easiest way.

For example:

python -m prodigy db-out ner-d2 > my_ner_dataset.jsonl

Another alternative could be to use Prodigy's DB components to pull the data into Python and then export the data as you deem fit.

from prodigy.components.db import connect
import srsly

db = connect()
examples = db.get_dataset("ner-d2")
srsly.write_jsonl("ner-d2.jsonl", examples)
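Either way, the export is JSONL: one JSON object per line. Here is a minimal sketch for reading the file back and counting entity labels, using only the standard library. It assumes each record carries the usual NER task fields ("text", "spans" with a "label", and "answer"); read_jsonl and count_labels are hypothetical helper names, not part of Prodigy.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def count_labels(examples):
    """Count span labels across accepted examples.

    Assumption: records look like Prodigy NER tasks, i.e. they may have
    an "answer" key and a "spans" list whose items have a "label" key.
    """
    counts = Counter()
    for eg in examples:
        # Skip rejected or ignored annotations; treat a missing answer as accepted.
        if eg.get("answer", "accept") != "accept":
            continue
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Usage, assuming the file exported above exists:
# examples = read_jsonl("my_ner_dataset.jsonl")
# print(count_labels(examples))
```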

As for the JSON file you mention: those are pre-generated labels that make training with spacy train faster, since spaCy won't have to preprocess the data to extract the labels.

Zanejins · March 4, 2023, 8:19am

Thank you. I used this command to export the data in ner-d2 as JSON, but it does not seem to provide --eval-split.
The prodigy train command provides --eval-split, but the generated training and validation sets are in spaCy's format. Can these two spaCy files be converted directly to JSON? Thanks.

ryanwesslen · March 6, 2023, 4:07pm

It's probably easiest to just write a quick script. Here's one that can help:

# prodigy-partition.py
from prodigy.components.db import connect
import typer
from pathlib import Path
import srsly
import random
import shutil


def main(prodigy_data: str, out_dir: Path, fraction: float = 0.2, seed: int = 0):
    """Partition the data into train/test/dev split."""
    random.seed(seed)
    db = connect()
    examples = db.get_dataset(prodigy_data)
    random.shuffle(examples)
    dev_size = int(fraction * len(examples))
    test_size = int(fraction * len(examples))
    train_size = (len(examples) - dev_size) - test_size
    if out_dir.exists():
        shutil.rmtree(out_dir)

    out_dir.mkdir(parents=True)
    srsly.write_jsonl(out_dir / "train.jsonl", examples[:train_size])
    srsly.write_jsonl(out_dir / "dev.jsonl", examples[train_size : train_size + dev_size])
    srsly.write_jsonl(out_dir / "test.jsonl", examples[train_size + dev_size :])


if __name__ == "__main__":
    typer.run(main)

python prodigy-partition.py ner-d2 data_folder

This writes three files (train.jsonl, dev.jsonl and test.jsonl) from ner-d2 into a new folder called data_folder. Note that if data_folder already exists, the script deletes it first. You can change the split sizes with the --fraction option, which sets the share of the data used for dev and for test, each. By default it is 0.2 (20% each). Whatever is left over is used for train.

You can modify this as you deem fit (e.g., only have two partitions in the data, etc.).
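For the two-partition case, here is a minimal sketch that mimics --eval-split using only the standard library. The names split_examples and write_jsonl are hypothetical helpers, not part of Prodigy, and the examples list is assumed to come from db.get_dataset as in the script above.

```python
import json
import random
from pathlib import Path


def split_examples(examples, eval_split=0.2, seed=0):
    """Shuffle a copy of the examples and return (train, dev)."""
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    dev_size = int(eval_split * len(shuffled))
    return shuffled[dev_size:], shuffled[:dev_size]


def write_jsonl(path, examples):
    """Write one JSON object per line, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


# Usage, assuming `examples` is a list of dicts loaded from the database:
# train, dev = split_examples(examples, eval_split=0.2)
# write_jsonl("data_folder/train.jsonl", train)
# write_jsonl("data_folder/dev.jsonl", dev)
```

Unlike the script above, this sketch does not delete an existing output folder; it only overwrites the two files it writes.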
