We have discovered an intriguing phenomenon in the Machine Learning (ML) datasets hosted on the Hugging Face Hub: the inclusion of private, undocumented information about individuals. This phenomenon poses some special challenges for machine learning practitioners.
In this blog, we'll take a deep dive into the various types of datasets that contain a type of private information called Personally Identifiable Information (PII), analyze the problems with these datasets, and introduce a new feature we're testing on the Dataset Hub that's designed to help address these challenges.
Types of datasets containing personally identifiable information (PII)
We note that there are two main types of datasets containing personally identifiable information (PII).
- Labeled PII datasets: for example, PII-Masking-300k by Ai4Privacy. Datasets of this type are specifically designed to train PII detection models. These models are used to detect and mask PII, and can help with online content moderation or with producing anonymized databases.
- Pre-training datasets: These are typically large-scale datasets, often several terabytes in size, and are usually obtained through web crawls. Although these datasets are generally filtered to remove certain types of PII, small amounts of sensitive information can still slip through due to the sheer volume of data and the imperfections of PII detection models.
Challenges with Personally Identifiable Information (PII) in Machine Learning Datasets
The presence of personally identifiable information (PII) in machine learning datasets can pose several challenges for practitioners. First, it raises privacy concerns and may be used to infer sensitive information about individuals.
In addition, if PII is not handled properly, it may also affect the behavior of a machine learning model. For example, a model trained on a dataset that contains PII may learn to associate specific PII with specific outcomes, which can lead to biased predictions or to the model reproducing PII from its training set.
New Experiments on the Dataset Hub: Presidio Reports
To address these challenges, we are experimenting with a new feature on the Dataset Hub that uses Presidio, an open-source, state-of-the-art tool for detecting Personally Identifiable Information (PII). Presidio relies on detection patterns and machine learning models to recognize PII.
With this new feature, users will be able to see a report estimating the presence of PII in the dataset. This information is valuable to machine learning practitioners, helping them make informed decisions before training their models. For example, if the report indicates that the dataset contains sensitive PII, practitioners may choose to further filter the dataset using a tool like Presidio.
Dataset owners can also benefit from this functionality by using these reports to validate their PII filtering processes before releasing their datasets.
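To make the idea of a PII report more concrete, here is a minimal sketch of pattern-based detection followed by masking. It is not Presidio's API and not how the Dataset Hub computes its reports: the two regular expressions, the entity labels, and the function names `pii_report` and `mask_pii` are assumptions chosen for illustration, and a real tool such as Presidio combines many more patterns with machine learning models.

```python
import re
from collections import Counter

# Simplified stand-in patterns (illustrative assumptions, not Presidio's recognizers).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def pii_report(rows):
    """Count pattern matches per entity type across an iterable of text rows."""
    counts = Counter()
    for text in rows:
        for entity, pattern in PATTERNS.items():
            counts[entity] += len(pattern.findall(text))
    return dict(counts)


def mask_pii(text):
    """Replace every match with a placeholder such as <EMAIL_ADDRESS>."""
    for entity, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text


# A tiny hypothetical sample standing in for the first rows of a dataset.
sample_rows = [
    "Contact jane.doe@example.com for details.",
    "The server at 192.168.0.1 was rebooted.",
    "No personal data in this row.",
]

report = pii_report(sample_rows)
masked = [mask_pii(row) for row in sample_rows]

print(report)
for row in masked:
    print(row)
```

A practitioner could run something along these lines on a sample of rows to get a rough estimate before deciding whether the full dataset needs further filtering. Simple patterns like these miss names, addresses, and other context-dependent PII, which is where model-based detection comes in.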
An example of a Presidio report
Let's take a look at an example of a Presidio report for this pre-training dataset.
In this example, Presidio detects small amounts of email addresses and sensitive personally identifiable information (PII) in the dataset.
Conclusion
The presence of Personally Identifiable Information (PII) in machine learning datasets is one of the evolving challenges for the machine learning community. At Hugging Face, we are committed to transparency and helping practitioners address these challenges. By experimenting with new features like Presidio reporting on the Dataset Hub, we hope to empower users to make informed decisions and build more robust and ethical machine learning models.
We would also like to thank the CNIL (Commission Nationale de l'Informatique et des Libertés, the French data protection authority) for its help with GDPR compliance. Its guidance has been invaluable in navigating the complexities of AI and personal data issues. Check out their updated AI how-to guide.
Stay tuned for more updates on this exciting development!
Original in English: /blog/presidio-pii-detection
Original authors: Quentin Lhoest, Margaret Mitchell, Omri M, Omri Mendels
Translator: Evinci