PII removal, data anonymization, NER

NehaP · September 2, 2020, 1:45pm

Hi,

I am a .NET C# developer. I have a requirement in my project to remove all PII from the text inside .doc/.docx files (patient notes from a GP practice).
Information like:
Patient name
City
Building
Zip code
Street name
Email
Social security number
Vehicle identification number, etc.

I learned Python and tried spaCy. It's good, but I want to add custom entities like street names, building names, etc. I am really new to Python and data science, so I am not sure what will work best for me. I looked at the Prodigy documentation for creating my own dataset, but it's not free, and even if I buy it I am not sure how useful it would be in my case.
Can someone please guide me on this?

Regards,
NP

honnibal · September 3, 2020, 8:29pm

Hi Neha,

I think you could consider delaying the use of spaCy and Prodigy until later in the project, and work on the other parts first. You might find you actually don't need them. I think the most important part of your project will be the review interface to confirm that all PII is removed. I think a Word plugin would probably be a good solution for this, so actually I think you're experienced in the right technology after all.

You should not consider deploying a machine learning solution for PII removal without manual review of the predictions. The machine learning can only be part of decision support.
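To make the "decision support" idea concrete, here is a minimal sketch in plain Python. It assumes predictions arrive as character offsets and that nothing is redacted unless a human reviewer has approved it. The `Candidate` class, its `approved` flag, and the `redact` helper are hypothetical names used only for illustration; they are not part of spaCy or Prodigy.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A suggested PII span (hypothetical structure for this sketch)."""
    start: int              # character offset where the span begins
    end: int                # character offset just past the span
    label: str              # e.g. "NAME", "CITY"
    approved: bool = False  # set to True only by a human reviewer


def redact(text, candidates):
    """Replace reviewer-approved spans with a [LABEL] placeholder.

    Assumes the spans do not overlap. Unapproved suggestions are left
    untouched, so the model or rules never redact on their own.
    """
    result = text
    # Work from right to left so earlier offsets stay valid.
    for cand in sorted(candidates, key=lambda c: c.start, reverse=True):
        if cand.approved:
            result = result[:cand.start] + f"[{cand.label}]" + result[cand.end:]
    return result


note = "Seen by Dr Smith in Leeds."
suggestions = [
    Candidate(11, 16, "NAME", approved=True),  # reviewer confirmed
    Candidate(20, 25, "CITY"),                 # still awaiting review
]
print(redact(note, suggestions))
```

A real review interface would also need a way for the reviewer to add spans the predictions missed, since the point of the review is to confirm that all PII is gone.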

spaCy has a good rule-matching engine (the Matcher) that's generally better than writing regular expressions, and Prodigy can be very helpful for debugging the rules. But if you're unsure about the Python part and about getting a Prodigy license, I would say it makes sense to focus on the core part first: the review and correction interface that will be used for manual review. A Word plugin would probably be a good solution for that. You could then consider developing a Python service that runs locally and provides better predictions using spaCy's Matcher. Once those pieces are in place, if developing better rules becomes the bottleneck in your process, Prodigy would likely be a good solution to help with that.
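As a rough illustration of the Matcher approach, here is a minimal sketch assuming spaCy v3 is installed. It uses a blank English pipeline, so no trained model is needed. The pattern names (`EMAIL`, `ZIP_CODE`, `SSN`), the US-style formats, and the `find_pii_candidates` helper are assumptions made for this example. The patterns are deliberately loose (any 5-digit token is flagged as a possible zip code), so the output should be treated as suggestions for manual review, not as final decisions.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# A token that looks like an email address.
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

# Any 5-digit token (assumed US-style zip code; will over-match).
matcher.add("ZIP_CODE", [[{"IS_DIGIT": True, "LENGTH": 5}]])

# US-style SSN. Two alternatives, because the number may be split at
# the hyphens by the tokenizer or kept as a single token.
matcher.add("SSN", [
    [
        {"IS_DIGIT": True, "LENGTH": 3}, {"ORTH": "-"},
        {"IS_DIGIT": True, "LENGTH": 2}, {"ORTH": "-"},
        {"IS_DIGIT": True, "LENGTH": 4},
    ],
    [{"TEXT": {"REGEX": r"^\d{3}-\d{2}-\d{4}$"}}],
])


def find_pii_candidates(text):
    """Return (label, start_char, end_char, text) tuples in document order."""
    doc = nlp(text)
    found = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        label = nlp.vocab.strings[match_id]
        found.append((label, span.start_char, span.end_char, span.text))
    return sorted(found, key=lambda item: item[1])


sample = "Email jane.doe@example.com about SSN 123-45-6789 near 90210 today"
for label, start, end, value in find_pii_candidates(sample):
    print(label, start, end, value)
```

Custom entities such as street or building names could be added the same way with further token patterns. The character offsets returned here are what a review interface (for example, a Word plugin calling a local Python service) would need in order to highlight each suggestion.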
