Hi, I already have a dataset with Text, Entity and Label columns, and I want to use it as input to tag another dataset that only has a Text column. How can I do that with Prodigy? I only see that to start working they type "prodigy dataset NAME ’ ’ ", and I don't know where they get that NAME file from.
Sorry if this was confusing. What Prodigy calls a “dataset” is the place in its database where the annotations you create are saved, not a file you load. So when you run prodigy dataset your_cool_dataset, Prodigy will create an empty set called “your_cool_dataset” in the database.
When you annotate, you can tell Prodigy to save all labelled examples there. When you’re done, you can use that dataset to train a model, or run the db-out command to export it to a file to use it in a different process.
The data you want to load and label is usually specified as the source argument. Prodigy supports loading in CSV files if they contain a text or Text column. Alternatively, you can also convert your data to JSON or JSONL (see the PRODIGY_README.html for details on the format).
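If you'd rather convert the file yourself, here is a minimal Python sketch of the CSV-to-JSONL step. The function name csv_to_jsonl and the file names are hypothetical, and it assumes each output line only needs a "text" key; check PRODIGY_README.html for the exact format your version expects.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path, text_column="Text"):
    """Write one {"text": ...} JSON object per CSV row; return the number of rows written.

    Assumption: the input has a header row containing `text_column`.
    Any other columns (e.g. Entity, Label) are ignored here.
    """
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows that have no text
            dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical usage:
# csv_to_jsonl("/path/to/data.csv", "/path/to/data.jsonl")
```

The resulting .jsonl path can then be passed as the source argument in place of the CSV file.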
For example, the following command will start the ner.manual recipe so you can label data by hand:
prodigy ner.manual your_dataset en_core_web_sm /path/to/data.csv --label PERSON,ORG
- ner.manual - the name of the recipe to run
- your_dataset - the name of a dataset in the Prodigy database to save the examples to
- en_core_web_sm - the name of an installed spaCy model used for tokenization
- /path/to/data.csv - the path to your data (can also be a JSON or JSONL file)
- --label PERSON,ORG - the labels that will be available
When you’re done, you can export the annotated dataset and check it out:
prodigy db-out your_dataset > some_file.jsonl
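To check the export programmatically, something like the following Python sketch could work. It assumes each line of some_file.jsonl is a JSON object with an "answer" field and a "spans" list whose entries carry a "label"; that layout is an assumption here, so inspect a line of your own export first. The function name summarize_export is hypothetical.

```python
import json
from collections import Counter


def summarize_export(jsonl_path):
    """Count span labels across accepted examples in an exported JSONL file.

    Assumption: each record looks roughly like
    {"text": ..., "spans": [{"label": ...}, ...], "answer": "accept"}.
    """
    label_counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            example = json.loads(line)
            if example.get("answer") != "accept":
                continue  # only count examples that were accepted
            for span in example.get("spans") or []:
                label_counts[span.get("label")] += 1
    return label_counts


# Hypothetical usage:
# print(summarize_export("some_file.jsonl"))
```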