PII removal, data anonymization, NER

Hi,

I am a .NET C# developer. I have a requirement in my project to remove all PII from the text inside .doc/.docx files (patient notes from a GP practice).
Information like:
Patient name
City
Building
Zip code
Street name
Email
Social security number
Vehicle identification number, etc.

I learned Python and tried spaCy. It's good, but I want to add custom entities such as street names and building names. I am really new to Python and data science, so I am not sure what will work best for me. I looked at the Prodigy documentation for creating my own dataset, but it's not free, and even if I buy it I am not sure how useful it would be in my case.
Can someone please guide me on this?

Regards,
NP

Hi Neha,

I think you could consider delaying the use of spaCy and Prodigy until later in the project and work on the other parts first. You might find you don't actually need them. The most important part of your project will be the review interface used to confirm that all PII has been removed. A Word plugin would probably be a good solution for this, so you're experienced in the right technology after all.

You should not deploy a machine learning solution for PII removal without manual review of its predictions. Machine learning can only be one part of the decision support.

spaCy has a good rule-based matching engine (the Matcher) that's generally better than writing regular expressions, and Prodigy can be very helpful for debugging the rules. But if you're unsure about the Python part and about getting a Prodigy license, I would say it makes sense to focus on the core part first: the review and correction interface that will be used for manual review. A Word plugin would probably be a good solution for that. You could then consider developing a Python service, running locally, that provides better predictions using spaCy's Matcher. Once those pieces are in place, if developing better rules becomes the bottleneck in your process, Prodigy would likely be a good solution to help with that.
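
To give a rough idea of what Matcher rules look like, here is a minimal sketch, assuming spaCy v3 is installed. The two rules (EMAIL and STREET), the street keyword list, the helper names find_pii and redact, and the sample note are all made up for illustration; real patient notes would need many more rules. The output is meant as a list of suggestions for a reviewer to confirm, not as a final redaction.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# A blank English pipeline is enough here: the Matcher only needs the tokenizer.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical rules, for illustration only.
# EMAIL: any single token that looks like an email address.
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])
# STREET: optional house number, one capitalized word, then a street keyword.
matcher.add(
    "STREET",
    [[
        {"IS_DIGIT": True, "OP": "?"},
        {"IS_TITLE": True},
        {"LOWER": {"IN": ["street", "road", "avenue", "lane"]}},
    ]],
    greedy="LONGEST",  # keep the longest match when candidates overlap
)


def find_pii(text):
    """Return (start_char, end_char, label) suggestions for a reviewer."""
    doc = nlp(text)
    spans = filter_spans(matcher(doc, as_spans=True))  # drop overlapping spans
    return [(span.start_char, span.end_char, span.label_) for span in spans]


def redact(text, findings):
    """Replace each confirmed finding with a [LABEL] placeholder."""
    # Work from the end of the text so earlier character offsets stay valid.
    for start, end, label in sorted(findings, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


note = "Patient lives at 12 Baker Street and can be reached at jane@example.com today."
findings = find_pii(note)
print(redact(note, findings))
```

Because find_pii returns character offsets, a local service built this way could pass the suggestions to the Word plugin, which would highlight them for the reviewer to accept or reject before anything is removed.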