Formatted data for pre-trained spaCy models

I see the data sources listed for each of the pre-trained spaCy models here:

The links take me to the websites that host/describe the data, but could you also share the files that contain the fully processed data formatted for NER training? For example:

    TRAIN_DATA = [
        ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
        ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]})]

The format doesn't have to match what you get when exporting a Prodigy labeling job, as long as the entity information is available for each item, including the span indices and the label.

Context: I'm trying to augment the training data used for one of your pre-trained models by adding an uncased/lowercase copy of each original text (while preserving the entity annotations from the original dataset).

Hi! If you train via the command line or use Prodigy's data-to-spacy to export your Prodigy annotations, the data uses spaCy's JSON format for training. You can see an example of this here:

The English models are trained on the OntoNotes 5 corpus, which requires a license, so we're not allowed to actually share the data, including our converted version.
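
If you have your own licensed copy of the data and convert it to the tuple format shown in the question, the lowercasing step could look roughly like the sketch below. The function name `add_lowercased` is made up for illustration and isn't part of spaCy or Prodigy. It assumes that lowercasing doesn't change the length of the text, so the character offsets stay valid; that holds for plain ASCII but not for every Unicode character, so examples whose length changes are skipped rather than re-aligned.

```python
def add_lowercased(examples):
    """Return the original examples plus a lowercased copy of each one.

    `examples` is a list of (text, {"entities": [(start, end, label), ...]})
    tuples. Hypothetical helper, not a spaCy or Prodigy API.
    """
    augmented = list(examples)
    for text, annotations in examples:
        lowered = text.lower()
        # Some Unicode characters change length when lowercased, which would
        # shift the entity offsets. Skip those instead of guessing.
        if len(lowered) != len(text):
            continue
        # Don't add a duplicate if the text was already lowercase.
        if lowered == text:
            continue
        # Copy the entity list so the two examples don't share one object.
        augmented.append((lowered, {"entities": list(annotations["entities"])}))
    return augmented


TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

augmented = add_lowercased(TRAIN_DATA)
for text, annotations in augmented:
    print(text, annotations["entities"])
```

The originals are kept at the front of the list and the lowercased copies are appended, so you may want to shuffle the result before training.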