🔒 Data Security 11 guides · updated 2026

Protecting data through its whole lifecycle — encryption, access control, masking, and the compliance frameworks (GDPR, SOC 2) that shape modern data platforms.

What Counts as PII — and Why the Definition Has Expanded

Personally Identifiable Information is any data that can identify a specific individual, either directly or when combined with other available data. The definition has grown considerably over the past decade, and what counts as PII now covers a broader range than most organizations initially expect.

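Direct identifiers, such as a full name, Social Security number, or passport number, leave little ambiguity.

Indirect identifiers, such as an IP address or location data, become PII depending on context.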

Quasi-identifiers are fields that appear harmless individually but can uniquely identify someone in combination. Latanya Sweeney's 2000 analysis of 1990 US census data estimated that 87% of the US population could be uniquely identified using only 5-digit ZIP code, date of birth, and sex. This problem has not gone away: datasets with “anonymized” quasi-identifiers are regularly re-identified in academic research.
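A rough way to see the quasi-identifier problem in your own data is to count how many rows share each combination of those fields. The sketch below is illustrative only: the sample rows, the field names, and both helper functions are assumptions, not part of any particular tool.

```python
from collections import Counter

# Hypothetical sample rows; the field names are illustrative only.
records = [
    {"zip": "02139", "birth_date": "1985-04-12", "sex": "F"},
    {"zip": "02139", "birth_date": "1985-04-12", "sex": "F"},
    {"zip": "02139", "birth_date": "1990-01-30", "sex": "M"},
    {"zip": "94105", "birth_date": "1972-11-03", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def group_sizes(rows, fields=QUASI_IDENTIFIERS):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def unique_combinations(rows, fields=QUASI_IDENTIFIERS):
    """Return the combinations that match exactly one row (re-identifiable)."""
    return [combo for combo, n in group_sizes(rows, fields).items() if n == 1]


def k_anonymity(rows, fields=QUASI_IDENTIFIERS):
    """Smallest group size in the dataset; 1 means someone is unique."""
    return min(group_sizes(rows, fields).values(), default=0)


print(k_anonymity(records))          # smallest group size
print(unique_combinations(records))  # combinations held by a single person
```

A result of 1 from `k_anonymity` means at least one person is unique on those fields alone, even though no direct identifier is present.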

Special category data gets additional protection under GDPR and similar laws. This includes health data, genetic data, biometric data used for identification, racial or ethnic origin, political opinions, religious beliefs, sexual orientation, and data about criminal convictions. Processing these requires a stricter legal basis and heightened safeguards.


The Core Principle: Collect Less

The single most effective PII protection strategy is minimizing how much you collect in the first place. Data you never collect cannot be breached, subpoenaed, or misused.

This sounds obvious, and it is — but it runs directly against common engineering instincts. Data is useful. You might need it later. Collecting it now is cheap. These impulses lead to systems that accumulate far more PII than they actually need.

Questions to ask at the design stage:

Getting these questions answered before building is much cheaper than retrofitting minimization into an existing system.
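One way to make the collection decision explicit is an allow-list at the ingestion boundary, so that any field without a documented purpose is dropped before it is stored. This is a minimal sketch; the field names and the `minimize` helper are hypothetical.

```python
# Hypothetical allow-list: the only fields this service has a documented need for.
ALLOWED_SIGNUP_FIELDS = {"email", "display_name", "country"}


def minimize(payload: dict, allowed=ALLOWED_SIGNUP_FIELDS) -> dict:
    """Keep only allow-listed fields; everything else is dropped before storage."""
    return {key: value for key, value in payload.items() if key in allowed}


# A client submits more than the service needs.
submitted = {
    "email": "jane.doe@example.com",
    "display_name": "Jane",
    "country": "DE",
    "date_of_birth": "1985-04-12",   # not needed, never stored
    "phone": "+1 (555) 234-5678",    # not needed, never stored
}
stored = minimize(submitted)
```

Adding a field then requires changing the allow-list, which gives reviewers a natural point to ask why the field is needed.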


Classification: You Cannot Protect What You Have Not Found

Before you can protect PII, you need to know where it lives. In most organizations of any size, the answer is: more places than anyone realizes.

Common PII Locations in a Typical Organization
Production Systems
├── Transactional database (users table, orders, profiles)
├── Log files (often contain emails, IPs, session tokens)
├── File storage (uploaded documents, profile photos)
└── Message queues (events containing user data)
Non-Production Systems
├── Development databases (often copies of prod)
├── Staging environments
├── Test fixtures and seed data
└── CI/CD build artifacts
Third-Party Systems
├── CRM and marketing platforms
├── Customer support tools
├── Analytics and monitoring services
└── Data warehouses and BI tools
Other
├── Email archives
├── Spreadsheets and CSV exports
├── Backups (often forgotten)
└── Collaboration tool attachments

Data discovery tools can help identify PII in structured storage automatically. AWS Macie, Microsoft Purview, and open-source alternatives like Presidio (from Microsoft) scan data stores and flag fields containing common PII patterns. These should be part of a regular scan cycle, not a one-time exercise.
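To illustrate the idea behind pattern-based discovery, here is a deliberately simplified scanner using only regular expressions. It is not how Macie, Purview, or Presidio work internally; the patterns, the `scan_rows` helper, and the sample rows are assumptions for illustration, and real tools add validation and context to reduce false positives.

```python
import re

# Simplified patterns; production scanners use checksums and context as well.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_rows(rows):
    """Map each column name to the set of PII pattern labels found in it."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(label)
    return findings


# Hypothetical rows sampled from a table with free-text columns.
sample = [
    {"id": 1, "notes": "Contact jane.doe@example.com", "origin": "203.0.113.45"},
    {"id": 2, "notes": "SSN on file: 123-45-6789", "origin": "198.51.100.7"},
]
findings = scan_rows(sample)
```

Here the free-text `notes` column is flagged for both emails and SSNs, which is the kind of unexpected location a scheduled scan is meant to surface.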

Once discovered, data needs to be classified by sensitivity level. A practical three-tier model:

Level    Description                                        Example
High     Direct identifiers, financial data, health data    SSN, credit card, medical records
Medium   Indirect identifiers, account data                 Email, username, IP address, location
Low      Aggregated or de-identified data                   Country, age bracket, anonymized analytics

Classification drives what controls apply. High sensitivity data gets encryption at rest, strict access control, audit logging, and short retention. Low sensitivity data has fewer overhead requirements.
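The link between classification and controls can be expressed as a small policy table that other tooling reads. The sketch below assumes hypothetical retention numbers and column names; the one design choice worth noting is that unclassified columns fall back to the strictest tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controls:
    encrypt_at_rest: bool
    audit_logging: bool
    max_retention_days: int


# Hypothetical policy values; real numbers depend on legal and business needs.
POLICY = {
    "high": Controls(encrypt_at_rest=True, audit_logging=True, max_retention_days=90),
    "medium": Controls(encrypt_at_rest=True, audit_logging=False, max_retention_days=365),
    "low": Controls(encrypt_at_rest=False, audit_logging=False, max_retention_days=1095),
}

# Hypothetical catalog entries mapping columns to classification levels.
COLUMN_LEVELS = {"ssn": "high", "email": "medium", "country": "low"}


def controls_for(column: str) -> Controls:
    """Look up required controls; unclassified columns get the strictest tier."""
    level = COLUMN_LEVELS.get(column, "high")
    return POLICY[level]
```

Defaulting unknown columns to "high" means a newly added field is over-protected until someone classifies it, rather than under-protected.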


Encryption

For data classified as sensitive, encryption is a baseline requirement — not an optional enhancement.

At rest: AES-256 is the de facto standard for encrypting stored data, with GCM the usual mode where authenticated encryption is needed. Most databases and cloud storage services support this natively, but you should verify it is enabled and that the encryption keys are managed separately from the data itself.

For field-level encryption of particularly sensitive columns (SSNs, payment data), application-layer encryption goes further — the data is encrypted before it reaches the database, so even a database administrator with full access cannot read the plaintext values.

In transit: TLS 1.2 minimum, TLS 1.3 preferred. This applies to all connections where PII moves — not just user-facing APIs, but also internal service-to-service communication and database connections. Plaintext connections between internal services are a common and underappreciated gap.
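For internal Python clients, the TLS floor described above can be enforced with the standard library's `ssl` module. This is a minimal sketch; `make_client_context` is a hypothetical helper name, and how the context is passed to a given database driver or HTTP client varies by library.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Client-side TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


ctx = make_client_context()
```

The context can then be used with `ctx.wrap_socket(sock, server_hostname=...)` or handed to any client library that accepts an `SSLContext`.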

Key management: The security of encryption is only as good as the security of the keys. Keys stored next to the data they encrypt provide little real protection. Use a dedicated key management service (AWS KMS, GCP Cloud KMS, Azure Key Vault, HashiCorp Vault) and implement key rotation.


Pseudonymization and Tokenization

Pseudonymization replaces PII with a consistent substitute that allows records to be linked without exposing the original values.

import hashlib
import hmac
import os

# Store this securely in a secrets manager, never in source code
PSEUDONYM_KEY = os.environ["PSEUDONYM_SECRET"].encode()


def pseudonymize(pii_value: str) -> str:
    """
    Produces a consistent, non-reversible token for a PII value.
    The same input always produces the same token (deterministic),
    and without the key nobody can recompute tokens to test guessed inputs.
    """
    # Lowercase first so "Jane.Doe@..." and "jane.doe@..." map to the same token
    return hmac.new(PSEUDONYM_KEY, pii_value.lower().encode(), hashlib.sha256).hexdigest()


# Usage
raw_email = "Jane.Doe@example.com"
token = pseudonymize(raw_email)
# Result: consistent hex string, usable as a join key without exposing the email

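Using HMAC rather than plain SHA-256 adds a secret key. Values like email addresses are easy to guess, so an unkeyed hash can be reversed by hashing candidate inputs or using rainbow tables; with HMAC, an attacker who obtains the tokens cannot do this without also obtaining the key.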

Tokenization is a different technique, more common in payment contexts. Rather than algorithmically transforming the value, tokenization looks up or generates a random, opaque token and stores the mapping in a secure vault. The token and the real value exist in separate systems, and only the vault can translate between them. PCI DSS compliance for payment card data commonly uses tokenization — card numbers are replaced with tokens that cannot be used anywhere except as a reference to retrieve the real number from the vault.
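The difference from pseudonymization is easiest to see in code: the token is random, and the only link back to the value is a stored mapping. The class below is an in-memory stand-in for a vault, written purely for illustration; `TokenVault` and the `tok_` prefix are hypothetical, and a real vault is a separate, access-controlled service with encrypted storage.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a tokenization vault (illustration only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_urlsafe(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Only the vault can translate a token back to the real value."""
        return self._token_to_value[token]


vault = TokenVault()
card_token = vault.tokenize("4111111111111111")
```

Downstream systems store only `card_token`; a breach of those systems exposes nothing usable unless the vault is also compromised.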


Masking for Non-Production Environments

One of the most common PII exposure vectors is development and testing environments. Developers need realistic data to test against, and the path of least resistance is copying production data. This is a significant and frequent source of breaches.

The correct approach is data masking: replacing PII in non-production copies with realistic but fake values that preserve the format and statistical properties of the original.

Production → Masked Copy
John Doe → Michael Torres
john@example.com → mtrrs2847@testmail.org
+1 (555) 234-5678 → +1 (555) 819-2340
192.168.1.45 → 192.168.8.72
4111-1111-1111-1111 → 4532-7819-2048-3761 (valid format, fake number)

The masked data should still allow meaningful testing — relationships between records are preserved, data types and formats are maintained — but the underlying PII is replaced with synthetic values.

Libraries and tools for masking: Faker (Python/JavaScript), DataVeil, Delphix, Informatica Data Masking.
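To show how masking can keep relationships between records intact, the sketch below derives each fake value deterministically from the real one, so the same customer gets the same masked name in every table. It uses only the standard library rather than any of the tools listed above; the key, the name pools, and the helper functions are assumptions for illustration.

```python
import hashlib
import hmac
import random

# Assumption: in practice this key comes from a secrets manager, not source code.
MASKING_KEY = b"demo-only-key"
FIRST_NAMES = ["Michael", "Priya", "Elena", "Kwame", "Sofia"]
LAST_NAMES = ["Torres", "Nakamura", "Okafor", "Lindqvist", "Haddad"]


def _rng_for(value: str) -> random.Random:
    """Seed a private random generator from a keyed hash of the real value."""
    digest = hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def mask_name(real_name: str) -> str:
    """Replace a real name with a fake one, stable for the same input."""
    rng = _rng_for(real_name)
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


def mask_email(real_email: str) -> str:
    """Replace an email with a fake address on a reserved test domain."""
    rng = _rng_for(real_email.lower())
    local = "user" + "".join(rng.choice("0123456789") for _ in range(8))
    return f"{local}@testmail.example"
```

Because the output depends only on the input and the key, joins on masked columns still line up across tables. With name pools this small, different people will sometimes receive the same fake name, so a real setup would use much larger pools or a library such as Faker.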


Access Controls: Who Can See What

Restricting access to PII to only those with a legitimate need is one of the most effective controls available — and one of the most frequently neglected.

Role-based access control (RBAC): Define roles based on job function, and grant PII access only to roles that require it. A customer support agent might need to see a customer’s name and account status but not their payment details. A data analyst might need aggregated metrics but not individual user records.
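In application code, this kind of role-to-field mapping can be enforced at the point where records are serialized. The sketch below is a simplified illustration; the role names, field names, and `view_for_role` helper are hypothetical.

```python
# Hypothetical mapping from job-function roles to the fields they may see.
ROLE_FIELDS = {
    "support_agent": {"name", "account_status"},
    "billing": {"name", "account_status", "payment_method_id"},
}


def view_for_role(record: dict, role: str) -> dict:
    """Return only the fields the role is allowed to see."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {key: value for key, value in record.items() if key in allowed}


customer = {
    "name": "Jane Doe",
    "account_status": "active",
    "payment_method_id": "pm_abc123",
    "date_of_birth": "1985-04-12",
}
support_view = view_for_role(customer, "support_agent")
```

Fields that no role lists, such as `date_of_birth` here, are never returned, and a role that is missing from the mapping gets an empty view rather than full access.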

Column-level security: Most modern data warehouses and databases support masking or hiding specific columns based on the accessing user’s role. Snowflake, BigQuery, PostgreSQL, and others all have native column-level controls.

-- PostgreSQL: Create a masked view for analysts
CREATE VIEW user_analytics AS
SELECT
    user_id,
    LEFT(email, 1) || '***@' || SPLIT_PART(email, '@', 2) AS masked_email,
    country,
    created_at,
    last_login
FROM users;

-- Grant analysts access to the view only, not the base table
GRANT SELECT ON user_analytics TO analyst_role;

Audit logging: Access to PII should be logged, and the logs should be monitored. Who accessed which records, when, from where. This both deters misuse and enables incident investigation when something goes wrong.
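One lightweight way to capture who accessed which record is a decorator around the functions that read PII. This is a sketch under assumptions: the logger name, the `audited` decorator, and `read_profile` are hypothetical, log handlers are assumed to be configured elsewhere, and the "from where" detail (for example a source IP) would come from the request context in a real service.

```python
import functools
import logging
from datetime import datetime, timezone

# Dedicated logger so audit events can be routed to their own destination.
audit_log = logging.getLogger("pii.audit")
audit_log.setLevel(logging.INFO)


def audited(resource: str):
    """Decorator that records who read which record before the read happens."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(actor: str, record_id: str, *args, **kwargs):
            audit_log.info(
                "pii_access actor=%s resource=%s record_id=%s at=%s",
                actor, resource, record_id,
                datetime.now(timezone.utc).isoformat(),
            )
            return func(actor, record_id, *args, **kwargs)
        return wrapper
    return decorator


@audited("users.profile")
def read_profile(actor: str, record_id: str) -> dict:
    # Hypothetical lookup; a real implementation would query the database.
    return {"user_id": record_id, "country": "DE"}
```

The log line carries identifiers only, not the PII itself, so the audit trail does not become another copy of the sensitive data.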

Just-in-time access: For very sensitive data, consider systems where access is not permanently granted but must be requested for a specific purpose, is time-limited, and is subject to approval. PAM (Privileged Access Management) tools implement this for administrative access.
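The core of just-in-time access is a grant that records a purpose and an approver and that expires on its own. The sketch below models only that data structure; `AccessGrant` and `grant_access` are hypothetical names, and real PAM tools add approval workflows, revocation, and enforcement at the data store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class AccessGrant:
    user: str
    resource: str
    purpose: str
    approved_by: str
    expires_at: datetime

    def is_active(self, now: Optional[datetime] = None) -> bool:
        """A grant is valid only until its expiry time."""
        if now is None:
            now = datetime.now(timezone.utc)
        return now < self.expires_at


def grant_access(user, resource, purpose, approved_by,
                 minutes: int = 60, now: Optional[datetime] = None) -> AccessGrant:
    """Create a time-limited grant tied to a stated purpose and an approver."""
    if now is None:
        now = datetime.now(timezone.utc)
    return AccessGrant(user, resource, purpose, approved_by,
                       expires_at=now + timedelta(minutes=minutes))
```

An access check then calls `is_active()` on every request, so expiry needs no cleanup job to take effect.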


Handling Data Subject Rights

Under GDPR, CCPA, India’s DPDP Act, and similar laws, individuals have rights to access, correct, and delete their data. The engineering requirement is that systems be capable of fulfilling these requests reliably and within regulatory timeframes.

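The challenge is that user data spreads across many systems. A deletion request typically requires removing or anonymizing data from every place it lives: production databases, logs, backups, non-production copies, and third-party systems.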

Building a reliable deletion capability requires knowing where all the data is — which brings it back to data discovery and cataloging. Systems that did not track this at the design stage find deletion requests extremely difficult to fulfill completely.

A practical approach: design around a canonical user identifier from the beginning. If every system consistently stores and references the same user_id, a deletion pipeline can use that identifier to query and purge records across all systems rather than hunting for different representations of the same person.
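A deletion pipeline built on a canonical identifier can be sketched as a registry of per-system purge functions that all take the same `user_id`. Everything here is illustrative: the in-memory stores stand in for real systems, and the function and registry names are hypothetical.

```python
from typing import Callable, Dict

# In-memory stand-ins for two systems that both key records by user_id.
users_db = {"u_8d3f9a": {"email": "jane.doe@example.com"}}
support_tickets = [
    {"user_id": "u_8d3f9a", "subject": "Refund"},
    {"user_id": "u_11aa22", "subject": "Login issue"},
]


def purge_users_db(user_id: str) -> int:
    """Delete the user's row; return how many records were removed."""
    return 1 if users_db.pop(user_id, None) is not None else 0


def purge_support_tickets(user_id: str) -> int:
    """Delete the user's tickets in place; return how many were removed."""
    before = len(support_tickets)
    support_tickets[:] = [t for t in support_tickets if t["user_id"] != user_id]
    return before - len(support_tickets)


# Registry: every system holding user data registers a purge function here.
PURGERS: Dict[str, Callable[[str], int]] = {
    "users_db": purge_users_db,
    "support_tickets": purge_support_tickets,
}


def delete_user(user_id: str) -> Dict[str, int]:
    """Run every registered purger and report per-system deletion counts."""
    return {system: purge(user_id) for system, purge in PURGERS.items()}


report = delete_user("u_8d3f9a")
```

The per-system report doubles as evidence that a request was fulfilled, and a new system only needs to register a purger to be covered.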


PII in Logs: A Frequently Overlooked Gap

Application logs routinely capture PII without any deliberate decision to do so. Request logs include URL parameters. Error logs include stack traces with data values. Debug logs include object dumps. These logs often flow to centralized logging platforms with broad read access across engineering teams.

Common patterns to watch for and prevent:

PROBLEMATIC LOG LINES:
[2025-06-15 10:23:44] POST /api/user/update email=jane.doe@example.com dob=1985-04-12
[2025-06-15 10:23:44] Processing payment for card=4111111111111111 cvv=123
[2025-06-15 10:23:44] Authentication attempt for user@example.com from 203.0.113.45
BETTER:
[2025-06-15 10:23:44] POST /api/user/update user_id=u_8d3f9a fields=[email, dob]
[2025-06-15 10:23:44] Processing payment for payment_method_id=pm_abc123 (tokenized)
[2025-06-15 10:23:44] Authentication attempt for user_id=u_8d3f9a result=success

Implement log scrubbing at the logging layer for common PII patterns. Several logging libraries support this directly, and tools like presidio-analyzer can be integrated into log pipelines to detect and mask PII before logs are stored.
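With Python's standard `logging` module, scrubbing at the logging layer can be done with a filter attached to a handler. The sketch below is a simplified illustration, not Presidio's API: the patterns are deliberately basic and the `scrub` and `PiiScrubFilter` names are hypothetical.

```python
import logging
import re

# Simplified patterns, applied in order; real deployments need broader coverage.
REDACTIONS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{13,16}\b"), "[CARD]"),
    (re.compile(r"\b(cvv|dob)=\S+"), r"\1=[REDACTED]"),
]


def scrub(text: str) -> str:
    """Replace common PII patterns in a log message with placeholders."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


class PiiScrubFilter(logging.Filter):
    """Logging filter that rewrites each record's message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = scrub(record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True
```

Attaching it with `handler.addFilter(PiiScrubFilter())` applies the scrubbing to everything that handler writes. Pattern-based scrubbing is a backstop: the safer fix is still to log identifiers such as `user_id` instead of the values themselves.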


2025 and 2026 Developments

India’s DPDP Act moved toward enforcement in 2025: the implementing rules were notified in November 2025, with most obligations phasing in after that. For organizations processing data of Indian residents, the requirements closely parallel GDPR’s structure: consent requirements, data minimization, purpose limitation, breach notification within 72 hours, and children’s data protections.

Synthetic data adoption has accelerated significantly as an alternative to masking production data for development and AI training. Modern synthetic data tools can generate datasets that closely match the statistical properties of real data without containing records of actual individuals, although a poorly configured generator can still leak details of its source data. This is becoming a standard practice for ML training pipelines.

Biometric data regulation is tightening globally. Illinois’s BIPA (Biometric Information Privacy Act) has resulted in substantial class-action settlements against companies collecting facial recognition data without consent. Similar laws have passed or are advancing in multiple US states, and the EU’s AI Act creates specific requirements for biometric processing.

AI-generated PII risk: Large language models trained on scraped data can sometimes reproduce personal information from training data. Organizations deploying AI systems need to consider whether their models may inadvertently process or disclose PII, and regulators in the EU and US are beginning to issue guidance specifically addressing this.


Practical Checklist

PII Protection Checklist
[ Discovery and Classification ]
[ ] PII fields tagged in data catalog
[ ] Regular automated scans for PII in unexpected locations
[ ] Classification levels assigned and documented
[ Data Minimization ]
[ ] Collection limited to what is operationally necessary
[ ] Retention periods defined and automated
[ ] Special category data identified and treated appropriately
[ Technical Controls ]
[ ] Encryption at rest for all high and medium sensitivity data
[ ] TLS on all connections involving PII
[ ] Key management via a dedicated KMS
[ ] Field-level encryption for most sensitive fields
[ Access Controls ]
[ ] RBAC implemented and reviewed regularly
[ ] PII access audit logging enabled
[ ] Non-production environments use masked or synthetic data
[ Process Controls ]
[ ] Data subject request process implemented
[ ] Breach detection and response plan in place
[ ] Third-party data sharing documented with DPAs
[ ] PII excluded from application logs

PII protection is not a project with a completion date. It requires ongoing attention as systems change, new data is collected, regulations evolve, and threat landscapes shift. The organizations that treat it as a continuous operational concern rather than a compliance exercise tend to have significantly fewer incidents — and significantly less painful ones when incidents do occur.