I see the data sources listed for each of the pre-trained spaCy models here: https://spacy.io/models/en
The links take me to the websites that host/describe the data, but could you also share the files that contain the fully processed data formatted for NER training? For example:
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]
The data doesn't need to be in the same format as a Prodigy labeling export, as long as the entity information (span indices and the label) is available for each item.
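For example, if an export record looks roughly like a JSONL line with "text" and "spans" keys (an assumption on my part about the exact shape), converting it to the tuple format above would be straightforward:

def prodigy_record_to_tuple(record):
    # Assumed record shape (not confirmed):
    # {"text": "...", "spans": [{"start": 0, "end": 4, "label": "ORG"}, ...]}
    entities = [(span["start"], span["end"], span["label"]) for span in record.get("spans", [])]
    return (record["text"], {"entities": entities})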
Context: I'm trying to augment the training data used for one of your pre-trained models by concatenating a lowercased copy of each original text, while preserving the entity annotations from the original dataset, roughly as sketched below.
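Here's a minimal sketch of the augmentation I have in mind, assuming the data is in the (text, {"entities": [...]}) tuple format shown above (add_lowercased_copies is just a name I made up):

def add_lowercased_copies(train_data):
    """Return the original examples plus a lowercased copy of each one."""
    augmented = list(train_data)
    for text, annotations in train_data:
        lowered = text.lower()
        # str.lower() can change the string length for a few Unicode characters,
        # which would invalidate the character offsets, so skip those cases.
        if len(lowered) != len(text):
            continue
        augmented.append((lowered, {"entities": list(annotations["entities"])}))
    return augmented

# e.g. using the TRAIN_DATA list from above:
# AUGMENTED_DATA = add_lowercased_copies(TRAIN_DATA)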