Converting data to Prodigy's format

usage
ner

(Abhinandan Srivastava) #1

Hi,

I have spaCy-formatted datasets, but I want to train on that data in Prodigy.

Let's say I have this dataset:

[("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .", {"entities": [(48, 54, "GPE"), (77, 81, "GPE"), (111, 118, "NORP")]}), ("Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Stop the Bombings . \"", {"entities": [(109, 113, "PERSON")]}), ("They marched from the Houses of Parliament to a rally in Hyde Park .", {"entities": [(57, 66, "GPE")]})]

What is the Prodigy format, and how do I convert my data into that format for training?


(Ines Montani) #2

@abhinandansrivastava I moved your post to a new thread to keep things organised and easier to find :slightly_smiling_face:

Prodigy uses a pretty straightforward JSONL format (newline-delimited JSON). The text goes in the "text" property, and entities and other highlighted spans can be defined in the "spans" property. So your first example could look like this:

{"text": "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .", "spans": [{"start": 48, "end": 54, "label": "GPE"}, {"start": 77, "end": 81, "label": "GPE"}, {"start": 111, "end": 118, "label": "NORP"}]}

So you should be able to write a script that takes your data and outputs this JSONL format. You can find more details on the exact data formats for the different annotation types in your PRODIGY_README.html, available for download with Prodigy (see the “Annotation formats” section).
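For example, a minimal conversion script could look like this (the variable and file names here are just placeholders, and I'm assuming your data is available as a list of (text, annotations) tuples like the one you posted):

```python
import json

# spaCy-style training data, as posted above (shortened to one example)
TRAIN_DATA = [
    (
        "They marched from the Houses of Parliament to a rally in Hyde Park .",
        {"entities": [(57, 66, "GPE")]},
    ),
]

def spacy_to_prodigy(examples):
    """Convert (text, {"entities": [(start, end, label)]}) tuples
    into Prodigy-style task dicts with "text" and "spans"."""
    for text, annotations in examples:
        spans = [
            {"start": start, "end": end, "label": label}
            for start, end, label in annotations.get("entities", [])
        ]
        yield {"text": text, "spans": spans}

# Write one JSON object per line (JSONL)
with open("converted.jsonl", "w", encoding="utf-8") as f:
    for task in spacy_to_prodigy(TRAIN_DATA):
        f.write(json.dumps(task) + "\n")
```

The script only uses the standard library, so you can run it as-is and then point Prodigy at the resulting converted.jsonl file.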

Once you have data in Prodigy’s JSONL format, you can either import it to a new dataset using the db-in command, or use it as input data for a recipe like ner.manual (for example, if you want to double-check and correct your annotations).

Btw, if your data is gold-standard and the annotations include all entities that exist in the text, don’t forget to set the --no-missing flag when you run ner.batch-train to train a model. This will treat all tokens that are not annotated as “outside an entity” (instead of just missing values), which can increase your accuracy and help the model learn.
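Putting the steps together, the workflow on the command line could look roughly like this (the dataset name my_ner_data and the file converted.jsonl are placeholders; check your PRODIGY_README.html for the exact arguments in your version):

```
prodigy db-in my_ner_data converted.jsonl
prodigy ner.manual my_ner_data en_core_web_sm converted.jsonl --label GPE,NORP,PERSON
prodigy ner.batch-train my_ner_data en_core_web_sm --no-missing
```

You'd typically run either db-in (to import the annotations directly) or ner.manual (to review and correct them first), not both on the same data.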