I think the meta field will probably be the nicest way to display that to the user. You can set it on the example dict, and it'll be displayed as a subscript in the card. We use that to display things like the subreddit for Reddit data.
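To make that concrete, here's a minimal sketch of a task dict with a meta field. The keys inside "meta" (here "source" and "subreddit") are just illustrative; Prodigy renders whatever you put there as small print on the annotation card.

```python
# Hypothetical Prodigy task: the "meta" values are made up for illustration.
task = {
    "text": "What's the best way to learn Python?",
    "label": "QUESTION",
    "meta": {"source": "reddit", "subreddit": "learnpython"},
}
```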
To include this as a feature, the best way in spaCy v2 is to include it as an additional token. If you want to build a custom model with a library like PyTorch, a common recipe for this sort of thing is to embed the extra features with a separate embedding table, and then sum the word embeddings with the extra feature embeddings before you pass the data into the contextual encoder. This puts the information into the model early, so that it can condition on it easily. I actually found this solution surprising, but it's what Devlin et al. recommend in the BERT paper, and I must say that transformers have challenged my intuitions about modelling in general (I still find it really unintuitive that positional embeddings work, for instance).
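The summing recipe can be sketched in a few lines of PyTorch. All the sizes and IDs below are made up; the point is just that the word embeddings and the extra-feature embeddings share the same width, so they can be added elementwise before the result goes into the contextual encoder.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: vocab of 1000 words, 4 possible extra-feature values,
# 32-dimensional embeddings.
vocab_size, n_feature_values, dim = 1000, 4, 32

word_embed = nn.Embedding(vocab_size, dim)
feat_embed = nn.Embedding(n_feature_values, dim)

word_ids = torch.tensor([[5, 17, 42]])  # (batch, seq): token IDs
feat_ids = torch.tensor([[1, 1, 1]])    # same contextual feature ID per token

# Sum the two embedding streams; the result feeds the contextual encoder
# (e.g. a transformer or CNN) exactly as plain word embeddings would.
x = word_embed(word_ids) + feat_embed(feat_ids)  # (batch, seq, dim)
```

This is the same trick BERT uses for its segment embeddings: the extra signal is mixed in at the input, rather than concatenated on later.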
The next major release of Prodigy will use spaCy v3, which will make it much easier to use a custom model with Prodigy. For now, I would add the extra information in the annotation tool, and then run the feature experiments separately from Prodigy, so that you have one less layer of software to work with. You can export the annotations from Prodigy and run the experiments with your favourite combination of tooling. Once you've figured out what works best, you can wire that into Prodigy if you want the active learning model to do the same thing.
If you just want to try the easiest thing to implement first, do try simply prepending the contextual markers to the text. Another easy solution is to insert the contextual markers between every token of the text. This looks weird and redundant, but it might be the best way to help the CNN exploit the information.
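Both tricks are a few lines of string handling. The marker string `[WEB]` below is hypothetical; use whatever token encodes your context, ideally one that can't occur in the real text.

```python
def prepend_marker(text: str, marker: str) -> str:
    # Easiest option: put the contextual marker once, at the start.
    return f"{marker} {text}"

def interleave_marker(text: str, marker: str) -> str:
    # Redundant option: repeat the marker before every token, so even a
    # narrow CNN window always sees the contextual feature.
    tokens = text.split()
    return " ".join(piece for tok in tokens for piece in (marker, tok))

print(prepend_marker("great product", "[WEB]"))     # [WEB] great product
print(interleave_marker("great product", "[WEB]"))  # [WEB] great [WEB] product
```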