Formatted data for pre-trained spaCy models

dzenilee · January 30, 2021, 10:21pm

I see the data sources listed for each of the pre-trained spaCy models here: https://spacy.io/models/en

The links take me to the websites that host/describe the data, but could you also share the files that contain the fully processed data formatted for NER training? For example:

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

The format doesn't have to match what you get when exporting a Prodigy labeling job, as long as the entity information is available for each item, including the span indices and the label.

Context: I'm trying to augment the training data used for one of your pre-trained models by concatenating an uncased/lowercase version of the original text (while preserving the entity annotations from the original dataset).
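To make the idea concrete, here is a minimal sketch of the augmentation I have in mind, assuming the data is in the tuple format shown above. The helper name `add_lowercased` is hypothetical, not part of spaCy or Prodigy. It relies on lowercasing keeping every character offset in place, which holds for most text but not for a few Unicode characters, so those examples are skipped.

```python
def add_lowercased(train_data):
    """Return the original examples followed by lowercased copies.

    Hypothetical helper: expects (text, {"entities": [(start, end, label)]})
    tuples and reuses the original character offsets for the lowercased text.
    """
    augmented = list(train_data)
    for text, annotations in train_data:
        lowered = text.lower()
        # Offsets stay valid only if lowercasing keeps the length unchanged;
        # a few Unicode characters (e.g. "İ") expand when lowercased.
        if len(lowered) != len(text):
            continue
        augmented.append((lowered, {"entities": list(annotations["entities"])}))
    return augmented


TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

AUGMENTED = add_lowercased(TRAIN_DATA)
```

With the two examples above, `AUGMENTED` holds four items: the two originals plus their lowercased copies with the same `ORG` spans.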

ines · January 31, 2021, 2:08am

Hi! If you train via the command line or use Prodigy's data-to-spacy to export your Prodigy annotations, the data uses spaCy's JSON format for training. You can see an example of this on the "Data formats" page of the spaCy API documentation.

The English models are trained on the OntoNotes 5 corpus, which requires a license, so we're not allowed to actually share the data, including our converted version.
