Sensitive data in custom model

Hi all, I trained a spaCy model [fr_core_news_lg] on specific data annotated with Prodigy. The data that we used to train the model is sensitive, and we need to move the model out of the secure environment. Do any of the files or folders in the custom model (ner, parser, tagger, vocab or tokenizer) contain the training data or traces of it?
I am pretty sure that it doesn't, but I would just like to confirm.

Thanks a lot :)

Oliver.

It's very hard to guarantee 100% that there is no data leakage whatsoever. Technically, the NER component of a model needs to store information about the entity types it was trained on. That information alone is potentially private. This is not spaCy-specific, though; every ML implementation has this issue.

A trained model will not contain a copy of the training data, but it will contain state (weights) that was updated as a result of training on it. If a "bad actor" were to gain access to the model, they might be able to infer what kind of data it was trained on just by interacting with it. This is especially true if you added your own word vectors or something like that.

For example, if the model is able to detect names like "Billy" and "Joe" in addition to names like "Francois" then one might infer that the original dataset contains some English names.

It bears repeating that this is independent of spaCy or Prodigy. Any machine learning model is exposed to this kind of "reverse engineering by interacting".

A few more specific details about the data saved with spaCy pipelines: every single token seen during training is stored in the string store, which is saved under vocab/strings.json.
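To get a feel for what this means in practice, here is a minimal sketch that checks whether known sensitive terms appear verbatim in the saved string store. It assumes that vocab/strings.json is a flat JSON list of strings; the function name find_sensitive_strings and its arguments are made up for illustration and are not part of spaCy.

```python
import json
from pathlib import Path


def find_sensitive_strings(model_dir, sensitive_terms):
    """Return the sensitive terms found verbatim in vocab/strings.json.

    Hypothetical helper, not part of spaCy. Assumes the file holds a
    flat JSON list of strings.
    """
    strings_path = Path(model_dir) / "vocab" / "strings.json"
    with strings_path.open(encoding="utf8") as f:
        stored = set(json.load(f))
    # Exact matches only; tokens are stored as separate strings,
    # so multi-word terms would need to be checked word by word.
    return sorted(term for term in sensitive_terms if term in stored)
```

Calling it with the path to the saved model directory and a list of names or identifiers from your annotations returns whichever of them are still present in the file.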

For the pipeline components you mentioned above it's usually okay to delete all the strings from the string store, but you should test it carefully to make sure everything still works as intended. (To delete the strings, replace the contents of this file with an empty JSON list, [].)
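The clean-up step could be scripted roughly as follows. This is only a sketch: clear_string_store and backup_path are hypothetical names, and it again assumes strings.json is a flat JSON list. If you keep a backup, store it outside the model directory, since the backup still contains every string.

```python
import json
import shutil
from pathlib import Path


def clear_string_store(model_dir, backup_path=None):
    """Overwrite vocab/strings.json with an empty JSON list.

    Hypothetical helper, not part of spaCy. Returns the number of
    strings that were removed.
    """
    strings_path = Path(model_dir) / "vocab" / "strings.json"
    with strings_path.open(encoding="utf8") as f:
        n_removed = len(json.load(f))
    if backup_path is not None:
        # The backup still holds the sensitive strings: keep it
        # inside the secure environment only.
        shutil.copy2(strings_path, backup_path)
    with strings_path.open("w", encoding="utf8") as f:
        json.dump([], f)
    return n_removed
```

After running it, reload the pipeline (for example with spacy.load) and compare its predictions on a held-out sample against the original model's before moving anything out of the secure environment.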

More info here (in particular about the strings for vectors in case you're doing any vector similarity calculations outside the pipeline): Does vocab/strings.json essential for name entity recognition? · Discussion #7794 · explosion/spaCy · GitHub