Converting data to Prodigy's format

usage
ner

(Abhinandan Srivastava) #1

Hi,

I have spaCy-formatted datasets, but I want to train on that data in Prodigy.

Let's say I have this dataset:

[("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .", {"entities": [(48, 54, "GPE"), (77, 81, "GPE"), (111, 118, "NORP")]}), ("Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Stop the Bombings . \"", {"entities": [(109, 113, "PERSON")]}), ("They marched from the Houses of Parliament to a rally in Hyde Park .", {"entities": [(57, 66, "GPE")]})]

What is the Prodigy format, and how do I convert my data into that format for training?


(Ines Montani) #2

@abhinandansrivastava I moved your post to a new thread to keep things organised and easier to find :slightly_smiling_face:

Prodigy uses a pretty straightforward JSONL format (newline-delimited JSON). The text goes in the "text" property, and entities and other highlighted spans can be defined in the "spans" property. So your first example could look like this:

{"text": "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .", "spans": [{"start": 48, "end": 54, "label": "GPE"}, {"start": 77, "end": 81, "label": "GPE"}, {"start": 111, "end": 118, "label": "NORP"}]}

So you should be able to write a script that takes your data and outputs this JSONL format. You can find more details on the exact data formats for the different annotation types in your PRODIGY_README.html, available for download with Prodigy (see the “Annotation formats” section).
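For example, a minimal conversion script could look like this (the variable and file names here are just placeholders, and I'm assuming your data is available as a list of (text, annotations) tuples like the one you posted):

```python
import json

# spaCy-style training data, as posted above (shortened to one example)
TRAIN_DATA = [
    (
        "They marched from the Houses of Parliament to a rally in Hyde Park .",
        {"entities": [(57, 66, "GPE")]},
    ),
]

def spacy_to_prodigy(examples):
    """Convert (text, {"entities": [(start, end, label)]}) tuples
    into Prodigy-style task dicts with "text" and "spans"."""
    for text, annotations in examples:
        spans = [
            {"start": start, "end": end, "label": label}
            for start, end, label in annotations.get("entities", [])
        ]
        yield {"text": text, "spans": spans}

# Write one JSON object per line (JSONL)
with open("converted.jsonl", "w", encoding="utf-8") as f:
    for task in spacy_to_prodigy(TRAIN_DATA):
        f.write(json.dumps(task) + "\n")
```

The script only uses the standard library, so you can run it as-is and then point Prodigy at the resulting converted.jsonl file.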

Once you have data in Prodigy’s JSONL format, you can either import it to a new dataset using the db-in command, or use it as input data for a recipe like ner.manual (for example, if you want to double-check and correct your annotations).

Btw, if your data is gold-standard and the annotations include all entities that exist in the text, don’t forget to set the --no-missing flag when you run ner.batch-train to train a model. This will treat all tokens that are not annotated as “outside an entity” (instead of just missing values), which can increase your accuracy and help the model learn.
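Putting the steps together, the workflow on the command line could look roughly like this (the dataset name my_ner_data and the file converted.jsonl are placeholders; check your PRODIGY_README.html for the exact arguments in your version):

```
prodigy db-in my_ner_data converted.jsonl
prodigy ner.manual my_ner_data en_core_web_sm converted.jsonl --label GPE,NORP,PERSON
prodigy ner.batch-train my_ner_data en_core_web_sm --no-missing
```

You'd typically run either db-in (to import the annotations directly) or ner.manual (to review and correct them first), not both on the same data.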