Step 4 — Security & Governance

Here’s a question worth sitting with before we get into service names: who is actually allowed to see this column, and how do you prove that three months from now when an auditor asks? Everything in this step exists to answer that question at scale, across a lake that might span dozens of teams and thousands of tables. It’s less glamorous than building pipelines, but it’s where a meaningful chunk of DEA-C01 questions live, because permissions mistakes in a data lake are exactly the kind of thing that ends careers and makes headlines.

Encryption: At Rest and In Transit

Start with the baseline, because every governance conversation assumes it’s already handled.

IN TRANSIT                              AT REST
──────────────────────────              ──────────────────────────
TLS for all service-to-service          SSE-S3 (S3-managed keys, AES-256)
  traffic (AWS endpoints offer TLS;     SSE-KMS (AWS managed or customer
  enforce it, e.g. S3 bucket policy       managed KMS keys, key usage
  on aws:SecureTransport)                 logged in CloudTrail)
JDBC/ODBC connections to Redshift       Redshift: KMS or HSM-backed
  should require SSL                      encryption at the cluster level
Kinesis producers can encrypt           DynamoDB: encryption at rest is
  client-side before PutRecord            always on, AWS owned key by default
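
One common way to require TLS for S3 access is a bucket policy that denies any request where the aws:SecureTransport condition key is false. The sketch below only builds the policy document; the bucket name is a hypothetical placeholder, and attaching the policy (shown in a comment) assumes boto3 and valid credentials.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that rejects any request not made over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-lake-raw" is a made-up bucket name used only for illustration.
policy = deny_insecure_transport_policy("example-lake-raw")
policy_json = json.dumps(policy, indent=2)

# With boto3 installed and credentials configured, this could be attached via:
#   boto3.client("s3").put_bucket_policy(Bucket="example-lake-raw", Policy=policy_json)
```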

The distinction the exam leans on hardest is SSE-S3 vs SSE-KMS. Both encrypt objects at rest, but only SSE-KMS gives you a per-key audit trail (every decrypt call logged in CloudTrail), key rotation policies, and the ability to restrict who can use the key independently of who can access the S3 bucket. A scenario mentioning “must produce an audit log of every time a specific dataset was decrypted” is pointing you at SSE-KMS, not SSE-S3 — SSE-S3 doesn’t give you that visibility because AWS holds the keys with no per-request logging exposed to you.
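
At the API level, the difference between the two modes comes down to a couple of request parameters. The following is a minimal sketch that builds the keyword arguments for an S3 put_object call in either mode; the bucket, object key, and KMS key ARN are hypothetical, and the actual upload (in the comment) assumes boto3.

```python
def sse_put_kwargs(bucket, key, body, kms_key_id=None):
    """Return put_object keyword arguments for SSE-S3 or SSE-KMS.

    Without a KMS key ID the object is encrypted with S3-managed keys
    (SSE-S3). With one, S3 calls KMS, so each use of the key can be
    logged in CloudTrail and governed by the key policy.
    """
    kwargs = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id is None:
        kwargs["ServerSideEncryption"] = "AES256"   # SSE-S3
    else:
        kwargs["ServerSideEncryption"] = "aws:kms"  # SSE-KMS
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs


# Hypothetical names for illustration only.
sse_s3 = sse_put_kwargs("example-lake-raw", "orders/2024/01.parquet", b"...")
sse_kms = sse_put_kwargs(
    "example-lake-raw",
    "orders/2024/01.parquet",
    b"...",
    kms_key_id="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)

# With boto3: boto3.client("s3").put_object(**sse_kms)
```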

Lake Formation: Fine-Grained Access Control

Lake Formation sits on top of the Glue Data Catalog and IAM, adding a permissions model that’s actually usable at the scale of a real data lake. Without it, you’d be writing S3 bucket policies and IAM policies by hand for every table, every user, every combination — which doesn’t scale past a handful of datasets.

IAM alone:
  Grant/deny access to S3 prefixes and Glue API calls directly.
  Coarse. No column-level or row-level concept. Painful at scale.

Lake Formation:
  Grants layered on top of Glue Catalog tables/databases.
  SELECT, ALTER, DROP, DESCRIBE — permissions expressed like a
  database grant, not an S3 policy.
  Supports column-level, row-level, and cell-level filtering.
  Central permissions registry — one place to audit who can see what.

How the Permission Model Works

Data Lake Administrator
   │
   ├──► Registers S3 location with Lake Formation
   │
   ├──► Grants: "finance-analysts" role → SELECT on orders.amount, orders.date
   │                                        (but NOT orders.customer_ssn)
   │
   └──► Grants: "eu-team" role → SELECT on customers
                                    WHERE region = 'EU'   (row filter)

Lake Formation permissions are granted to IAM principals (or, in cross-account setups, to another AWS account or an organization) against catalog resources: databases, tables, or specific columns. When a query runs through Athena, Redshift Spectrum, or EMR (with Lake Formation integration enabled), the engine checks Lake Formation grants before returning any rows. Columns outside the grant are hidden from the principal, so a SELECT * simply omits them, and rows excluded by a data filter are dropped from the result instead of failing the whole query.
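
The column-scoped grant from the diagram can be sketched as a request for the Lake Formation GrantPermissions API. The function below only assembles the request dictionary; the account ID, role ARN, and the "sales" database name are assumptions made up for illustration, and the call itself (in the comment) assumes boto3.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Build a GrantPermissions request limited to the named columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns are visible to the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical role ARN and database name; the columns mirror the diagram.
request = column_grant_request(
    principal_arn="arn:aws:iam::111122223333:role/finance-analysts",
    database="sales",
    table="orders",
    columns=["amount", "date"],
)

# With boto3: boto3.client("lakeformation").grant_permissions(**request)
```

Because customer_ssn is never listed in ColumnNames, the role has no grant on it at all, which is the behavior the diagram describes.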

Column-Level and Row-Level Security

Column-level security: grant SELECT on a named subset of columns instead of the whole table. The classic case: an analytics team can query an orders table, but the customer_ssn and payment_token columns are excluded from their grant entirely. They don’t get nulls in place of the values; as far as their queries are concerned, those columns don’t exist.

Row-level security: implemented through data filters, which attach a WHERE-style predicate to a grant. A grant might restrict a role to only rows where region = 'APAC', so the same table serves multiple regional teams, each seeing only its own slice, with no application-level filtering logic needed.
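
A row-level data filter is created first and then granted like any other resource. The sketch below builds both requests (CreateDataCellsFilter, then GrantPermissions on the filter) as plain dictionaries; the account ID, database, filter name, and role ARN are hypothetical, and the boto3 calls appear only in comments.

```python
def row_filter_requests(catalog_id, database, table, filter_name,
                        expression, principal_arn):
    """Return (create_filter_request, grant_request) for a row-level filter."""
    create_filter = {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            "RowFilter": {"FilterExpression": expression},
            "ColumnWildcard": {},  # no excluded columns: all columns included
        }
    }
    grant = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": catalog_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        },
        "Permissions": ["SELECT"],
    }
    return create_filter, grant


# Hypothetical identifiers for illustration only.
create_req, grant_req = row_filter_requests(
    catalog_id="111122223333",
    database="sales",
    table="customers",
    filter_name="apac_only",
    expression="region = 'APAC'",
    principal_arn="arn:aws:iam::111122223333:role/apac-team",
)

# With boto3:
#   lf = boto3.client("lakeformation")
#   lf.create_data_cells_filter(**create_req)
#   lf.grant_permissions(**grant_req)
```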

Requirement                                                             Lake Formation mechanism
──────────────────────────                                              ──────────────────────────
Hide specific sensitive columns from a role                             Column-level permissions
Restrict a role to a subset of rows (e.g., by region or business unit)  Row-level data filters
Grant access across AWS accounts without copying data                   Cross-account Lake Formation sharing
Let a governed team self-serve grants for their own datasets            Lake Formation tag-based access control (LF-TBAC)

Tag-based access control (LF-TBAC) deserves a callout because it’s how large organizations avoid drowning in individual per-table grants — you attach tags (like confidentiality=restricted or department=finance) to catalog resources, then grant permissions against the tag rather than against every individual table. Add a new table with the right tag, and it inherits the existing grants automatically.
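
The inheritance behavior is easier to see in a toy model. The following is not Lake Formation's implementation, just a simplified sketch of the matching idea: a grant carries a tag expression, and any resource whose tags satisfy it picks up the grant. All table names, tags, and roles are made up.

```python
def tag_grant_matches(expression, resource_tags):
    """True when every tag key in the expression is on the resource
    with one of the allowed values."""
    return all(
        resource_tags.get(key) in allowed
        for key, allowed in expression.items()
    )


def effective_permissions(principal, resource_tags, grants):
    """Union of permissions from every grant whose expression matches."""
    perms = set()
    for grant in grants:
        if grant["principal"] == principal and tag_grant_matches(
            grant["expression"], resource_tags
        ):
            perms |= grant["permissions"]
    return perms


# One grant written against a tag, not against any specific table.
grants = [
    {
        "principal": "finance-analysts",
        "expression": {"department": {"finance"}},
        "permissions": {"SELECT"},
    }
]

# A table added later only needs the right tag to be covered.
new_table_tags = {"department": "finance", "confidentiality": "internal"}
hr_table_tags = {"department": "hr"}

finance_perms = effective_permissions("finance-analysts", new_table_tags, grants)
hr_perms = effective_permissions("finance-analysts", hr_table_tags, grants)
```

In the real service the equivalent grant targets an LF-tag expression rather than a table, so no per-table grant has to be issued when the new table arrives.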

Glue Data Catalog as the Governance Backbone

Every governance mechanism in this step depends on the Glue Data Catalog actually reflecting reality — a table Lake Formation doesn’t know about is a table it can’t protect. The catalog stores:

Database → Table → Columns (name, type, comment)
                 → Partitions
                 → Location (S3 path)
                 → Table properties (classification, format)

This is also the same catalog Athena, Redshift Spectrum, EMR, and Glue ETL jobs all read from — one metadata store, many compute engines, which is precisely what makes centralized Lake Formation permissions possible in the first place. If you bypass the catalog (querying S3 directly from a Spark job without going through Glue Catalog), you also bypass Lake Formation’s enforcement, which is a governance gap the exam expects you to spot.
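
One way to look for that kind of gap is to compare the S3 prefixes that actually hold data against the locations registered in the catalog. The sketch below assumes both lists have already been collected (for example, table locations from the StorageDescriptor.Location field that Glue returns for each table); the paths are hypothetical.

```python
def _normalize(path: str) -> str:
    """Ensure a trailing slash so prefix comparison respects path boundaries."""
    return path.rstrip("/") + "/"


def uncataloged_prefixes(data_prefixes, catalog_locations):
    """Return data prefixes that are not under any cataloged table location."""
    locations = [_normalize(loc) for loc in catalog_locations]
    return sorted(
        prefix
        for prefix in data_prefixes
        if not any(_normalize(prefix).startswith(loc) for loc in locations)
    )


# Hypothetical paths for illustration only.
data_prefixes = [
    "s3://example-lake/orders/",
    "s3://example-lake/customers/",
    "s3://example-lake/scratch/exports/",
]
catalog_locations = [
    "s3://example-lake/orders/",
    "s3://example-lake/customers",
]

gaps = uncataloged_prefixes(data_prefixes, catalog_locations)
```

Anything this returns is data that no catalog table points at, and therefore data that Lake Formation grants do not cover.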

Data Lineage and Governance with Amazon DataZone

Where Lake Formation answers “who can access this,” Amazon DataZone answers “where did this come from, who owns it, and can I trust it.” DataZone is the catalog-and-governance layer built for organization-wide data discovery — think of it as the business-facing front door to data that Lake Formation and Glue Catalog manage underneath.

Data Producer (owns "orders" dataset)
     │
     ▼
Publishes to DataZone Data Portal
     │  (business metadata: description, owner, glossary terms,
     │   data quality score, lineage back to source pipeline)
     ▼
Data Consumer searches portal ──► requests subscription
     │
     ▼
Approval workflow (owner or governed approver signs off)
     │
     ▼
Access granted (DataZone provisions underlying Lake Formation grant)

DataZone’s project-and-domain model organizes data assets by business domain rather than by AWS account or service boundary, which matters increasingly for organizations adopting data mesh principles — where each business domain owns and publishes its own data products rather than a central team owning every pipeline. DataZone gives that federated model a shared catalog and subscription workflow so domains can discover and request access to each other’s data products without every request becoming a ticket to a central platform team.

Lineage tracking in DataZone captures the chain from source system through transformation jobs to the published asset, which is what lets a consumer (or an auditor) answer “where did this number actually come from” without archaeology through pipeline code.
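
Conceptually, answering that question is a walk up a dependency graph. The sketch below is a toy model, not the DataZone API: lineage is a dictionary mapping each asset to its direct upstream inputs, and the asset names are invented for illustration.

```python
from collections import deque


def upstream_sources(asset, edges):
    """Breadth-first walk from an asset to everything it was derived from."""
    seen = set()
    order = []
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for parent in edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical lineage: asset -> direct upstream inputs.
lineage = {
    "orders_published": ["orders_curated"],
    "orders_curated": ["orders_raw", "fx_rates_raw"],
    "orders_raw": ["erp_source"],
    "fx_rates_raw": ["rates_api"],
}

trail = upstream_sources("orders_published", lineage)
```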

Compliance Considerations for Regulated Data

For regulated datasets — PII, financial records, healthcare data — a few patterns recur across exam scenarios:

Data residency: some regulations require data to stay within a specific geographic boundary; S3 bucket region choice and Lake Formation cross-region sharing rules need to respect this.
Right to erasure / retention limits: S3 Object Lock can enforce WORM (write-once-read-many) for retention compliance, while a separate deletion pipeline has to handle erasure requests without breaking downstream aggregates that depend on the deleted record.
Masking vs tokenization: for fields that need to exist in non-production or analytics environments without exposing the real value, tokenization (reversible, via a secure lookup) is different from masking (irreversible obfuscation). The exam expects you to know that an analytics team generally needs masked or tokenized data, not the raw sensitive value. A transformation step in Glue can mask or tokenize the field before data ever reaches that team, and Lake Formation column-level permissions can withhold the raw column entirely.
Audit trail completeness: CloudTrail logging on KMS key usage, S3 data events, and Lake Formation grant changes together form the audit story regulators ask for; missing any one of the three leaves a gap.
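
The masking vs tokenization distinction can be made concrete with a small sketch. It is illustrative only: the in-memory vault stands in for a secured token store, and the SSN format and class name are assumptions, not part of any AWS service.

```python
import secrets


def mask_ssn(ssn: str) -> str:
    """Masking: irreversibly hide all but the last four digits."""
    digits = [ch for ch in ssn if ch.isdigit()]
    return "***-**-" + "".join(digits[-4:])


class TokenVault:
    """Tokenization: swap a value for a random token, reversible via lookup.

    A real vault would be a hardened, access-controlled store; this
    in-memory dictionary only demonstrates the round trip.
    """

    def __init__(self):
        self._value_to_token = {}
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


vault = TokenVault()
masked = mask_ssn("123-45-6789")         # original cannot be recovered
token = vault.tokenize("123-45-6789")    # original recoverable via the vault
restored = vault.detokenize(token)
```

The same input always maps to the same token here, which keeps joins on the tokenized column working; the masked value, by contrast, cannot be turned back into the original.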

Exam Focus: What Questions Test From This Step

SSE-S3 vs SSE-KMS, specifically around audit trail and key management control
Lake Formation’s permission model layered over the Glue Data Catalog, vs raw IAM/S3 policies
Column-level permissions vs row-level data filters — matching the right mechanism to a stated requirement
Tag-based access control (LF-TBAC) for scaling grants across many tables
Why bypassing the Glue Data Catalog also bypasses Lake Formation enforcement
Amazon DataZone’s role in lineage, discovery, and subscription-based access requests
How DataZone supports a data mesh model of domain-owned data products
Masking/tokenization vs raw access for regulated or sensitive fields
Which CloudTrail/KMS/Lake Formation logs together satisfy an audit requirement

Written by NPBlue Cloud Team: Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.