According to the Prodigy documentation, I can use the following command to download the annotated dataset to my local machine, but the dataset comes out in spaCy format. I want to get the dataset in JSON format. How can I do that? Thank you.
Thank you. I used this command to export the data in ner-d2 as JSON, but the command does not seem to provide --eval-split.
The 'prodigy train' command provides --eval-split, but the generated training and validation sets are in spaCy format. Can these two spaCy files be converted directly to JSON? Thanks.
It's probably easiest to just write a quick script. Here's one that can help:
# prodigy-partition.py
import random
import shutil
from pathlib import Path

import srsly
import typer
from prodigy.components.db import connect


def main(prodigy_data: str, out_dir: Path, fraction: float = 0.2, seed: int = 0):
    """Partition the data into train/test/dev split."""
    random.seed(seed)
    db = connect()
    examples = db.get_dataset(prodigy_data)
    random.shuffle(examples)

    # dev and test each get `fraction` of the data; train gets the rest
    dev_size = int(fraction * len(examples))
    test_size = int(fraction * len(examples))
    train_size = (len(examples) - dev_size) - test_size

    # start from a clean output folder (an existing one is deleted)
    if out_dir.exists():
        shutil.rmtree(out_dir)
    out_dir.mkdir(parents=True)

    srsly.write_jsonl(out_dir / "train.jsonl", examples[:train_size])
    srsly.write_jsonl(out_dir / "dev.jsonl", examples[train_size : train_size + dev_size])
    srsly.write_jsonl(out_dir / "test.jsonl", examples[train_size + dev_size :])


if __name__ == "__main__":
    typer.run(main)
python prodigy-partition.py ner-d2 data_folder
This writes three files (train.jsonl, dev.jsonl and test.jsonl) built from ner-d2 into a new folder called data_folder. Note that the script deletes data_folder first if it already exists. You can change the split sizes with the fraction argument, which sets the share of the data used for dev and for test, each. By default it is 20% each. Whatever is left over is then used in train.
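To make the split sizes concrete, here is a small sketch that repeats the size arithmetic from the script on its own, without touching the Prodigy database. The helper name partition_sizes is made up for illustration and is not part of Prodigy.

```python
from typing import Tuple


def partition_sizes(n_examples: int, fraction: float = 0.2) -> Tuple[int, int, int]:
    """Return (train, dev, test) sizes using the same arithmetic as the script.

    Hypothetical helper for illustration only; not a Prodigy function.
    """
    dev_size = int(fraction * n_examples)
    test_size = int(fraction * n_examples)
    # train receives everything that dev and test did not take
    train_size = n_examples - dev_size - test_size
    return train_size, dev_size, test_size


if __name__ == "__main__":
    train, dev, test = partition_sizes(1000)
    print(f"train={train} dev={dev} test={test}")
```

Because int() rounds down, any leftover examples from rounding end up in train, so the three sizes always add up to the full dataset.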
You can modify this as you deem fit (e.g., only have two partitions in the data, etc.).
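For instance, a two-partition variant (train and dev only) could look roughly like the sketch below. It is written against plain lists of dicts so it runs without Prodigy installed; the names two_way_split and write_jsonl are hypothetical, and the toy examples stand in for what db.get_dataset would return in the real script, where you could also keep using srsly.write_jsonl instead of the stdlib writer shown here.

```python
import json
import random
from pathlib import Path
from typing import Dict, List, Tuple


def two_way_split(
    examples: List[Dict], fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Shuffle a copy of the examples and return (train, dev).

    Hypothetical helper: `fraction` is the share of data that goes to dev.
    """
    rng = random.Random(seed)   # local RNG so the split is reproducible
    shuffled = list(examples)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    dev_size = int(fraction * len(shuffled))
    return shuffled[dev_size:], shuffled[:dev_size]


def write_jsonl(path: Path, examples: List[Dict]) -> None:
    """Write one JSON object per line (stdlib stand-in for srsly.write_jsonl)."""
    with path.open("w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Toy data standing in for the annotated examples from the database.
    toy = [{"text": f"example {i}"} for i in range(10)]
    train, dev = two_way_split(toy, fraction=0.2, seed=0)
    print(len(train), len(dev))
```

Using a local random.Random(seed) rather than the global seed is a design choice in this sketch; it keeps the split reproducible without affecting other code that uses the random module.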