CSV with NER classifications to dataset

Hi, I already have a dataset with Text, Entity and Label columns, and I want to use it as input to tag another dataset that only has a Text column. How can I do that with Prodigy? All I can see is that, to get started, people type "prodigy dataset NAME ’ ’ ", and I don't know where that NAME file comes from.

Sorry if this was confusing. In Prodigy, a “dataset” is the named set in the database that your annotations are saved to, not the input file you want to label. So when you run prodigy dataset your_cool_dataset, Prodigy will create an empty set called “your_cool_dataset” in the database.

When you annotate, you can tell Prodigy to save all labelled examples there. When you’re done, you can use that dataset to train a model, or run the db-out command to export it to a file to use it in a different process.

The data you want to label and load is usually specified as the source argument. Prodigy supports loading in CSV files if they contain a text or Text column. Alternatively, you can also convert your data to JSON or JSONL (see the PRODIGY_README.html for details on the format).
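If you would rather go the JSONL route, the conversion can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the CSV has a header row with a column named Text, and it writes one {"text": ...} object per line. The function name csv_to_jsonl and its parameters are made up for illustration and are not part of Prodigy.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path, text_column="Text"):
    """Write one {"text": ...} JSON object per CSV row; return rows written.

    Hypothetical helper, not a Prodigy function. Assumes the CSV has a
    header row that contains `text_column`.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows that have no text to annotate
            dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            written += 1
    return written
```

You could then call csv_to_jsonl("/path/to/data.csv", "/path/to/data.jsonl") and pass the resulting .jsonl file as the source instead of the CSV.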

For example, the following command will start the ner.manual recipe so you can label data by hand:

prodigy ner.manual your_dataset en_core_web_sm /path/to/data.csv --label PERSON,ORG
  • ner.manual - the name of the recipe to run
  • your_dataset - the name of a dataset in the Prodigy database to save the examples to
  • en_core_web_sm - name of an installed spaCy model used for tokenization
  • /path/to/data.csv - the path to your data (can also be a JSON or JSONL file)
  • --label PERSON,ORG - the labels that will be available

When you’re done, you can export the annotated dataset and check it out:

prodigy db-out your_dataset > some_file.jsonl
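To get a quick overview of the exported file, something like the following Python sketch could help. It assumes each line of some_file.jsonl is a JSON object with an "answer" field and a "spans" list whose entries carry a "label", which is what NER annotations typically look like in the export. The function name count_labels is made up for illustration.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count span labels across accepted examples.

    Hypothetical helper. Assumes each non-empty line is a JSON object with
    an "answer" field and an optional "spans" list of {"label": ...} dicts.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue  # only count examples that were accepted
        for span in example.get("spans") or []:
            counts[span["label"]] += 1
    return counts
```

To use it, open the export and pass the file object in, for example: with open("some_file.jsonl", encoding="utf-8") as f: print(count_labels(f)).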