Step 4: Security & Governance
Here's a question worth sitting with before we get into service names: who is actually allowed to see this column, and how do you prove that three months from now when an auditor asks? Everything in this step exists to answer that question at scale, across a lake that might span dozens of teams and thousands of tables. It's less glamorous than building pipelines, but it's where a meaningful chunk of DEA-C01 questions live, because permissions mistakes in a data lake are exactly the kind of thing that ends careers and makes headlines.
Encryption: At Rest and In Transit
Start with the baseline, because every governance conversation assumes it's already handled.

| In transit | At rest |
|---|---|
| TLS for all service-to-service traffic (S3, Redshift, and Glue endpoints all support TLS; S3 only rejects plain HTTP if a bucket policy denies it) | SSE-S3 (Amazon S3 managed keys, AES-256) |
| JDBC/ODBC connections to Redshift should require SSL | SSE-KMS (KMS keys, including customer-managed keys; audit trail via CloudTrail) |
| Kinesis producers can encrypt client-side before PutRecord | Redshift: KMS or HSM-backed encryption at the cluster level |
| | DynamoDB: encryption at rest is always on, using an AWS owned key by default |

The distinction the exam leans on hardest is SSE-S3 vs SSE-KMS. Both encrypt objects at rest, but only SSE-KMS gives you a per-key audit trail (every decrypt call logged in CloudTrail), key rotation policies, and the ability to restrict who can use the key independently of who can access the S3 bucket. A scenario mentioning "must produce an audit log of every time a specific dataset was decrypted" is pointing you at SSE-KMS, not SSE-S3: SSE-S3 doesn't give you that visibility because AWS holds the keys with no per-request logging exposed to you.
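As a rough illustration of both halves of the baseline, the sketch below builds a bucket policy that denies non-TLS requests and the keyword arguments for an S3 PutObject call that asks for SSE-KMS. The bucket name, object key, and KMS key ARN are hypothetical, and the block only constructs plain dictionaries (the shapes one would hand to boto3's `put_bucket_policy` and `put_object`) so it runs without AWS credentials.

```python
import json

# Hypothetical names used for illustration only.
BUCKET = "example-data-lake-bucket"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def deny_insecure_transport_policy(bucket: str) -> dict:
    """Bucket policy that rejects any request not made over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SecureTransport is "false" for plain-HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def sse_kms_put_object_args(bucket: str, key: str, body: bytes, kms_key_arn: str) -> dict:
    """Keyword arguments for an S3 PutObject call that requests SSE-KMS."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # "AES256" would select SSE-S3 instead
        "SSEKMSKeyId": kms_key_arn,
    }


policy_json = json.dumps(deny_insecure_transport_policy(BUCKET), indent=2)
put_args = sse_kms_put_object_args(
    BUCKET, "orders/2024/part-0000.parquet", b"example-bytes", KMS_KEY_ARN
)
```

The only difference between the two at-rest modes at write time is the `ServerSideEncryption` value and the key ID; the audit and access-control differences come from KMS itself.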
Lake Formation: Fine-Grained Access Control
Lake Formation sits on top of the Glue Data Catalog and IAM, adding a permissions model that's actually usable at the scale of a real data lake. Without it, you'd be writing S3 bucket policies and IAM policies by hand for every table, every user, every combination, which doesn't scale past a handful of datasets.
IAM alone:
- Grant/deny access to S3 prefixes and Glue API calls directly.
- Coarse. No column-level or row-level concept.
- Painful at scale.

Lake Formation:
- Grants layered on top of Glue Catalog tables/databases.
- SELECT, ALTER, DROP, DESCRIBE: permissions expressed like a database grant, not an S3 policy.
- Supports column-level, row-level, and cell-level filtering.
- Central permissions registry: one place to audit who can see what.

How the Permission Model Works
Data Lake Administrator:
- Registers the S3 location with Lake Formation
- Grants the "finance-analysts" role SELECT on orders.amount, orders.date (but NOT orders.customer_ssn)
- Grants the "eu-team" role SELECT on customers WHERE region = 'EU' (row filter)

Lake Formation permissions are granted to IAM principals (or, in cross-account setups, to another AWS account or an organization) against catalog resources: databases, tables, or specific columns. When a query runs through Athena, Redshift Spectrum, or EMR (with Lake Formation integration enabled), the engine checks Lake Formation grants before returning any rows, and it can silently exclude denied columns or rows rather than erroring the whole query.
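To make the check-before-returning-rows idea concrete, here is a toy model in plain Python. It is not how Lake Formation or any query engine is implemented; the `Grant` class and `run_query` function are hypothetical names invented for this sketch, and the data mirrors the two example grants above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Grant:
    """Toy stand-in for a SELECT grant: allowed columns plus an optional row predicate."""
    principal: str
    table: str
    columns: set
    row_filter: Optional[Callable[[dict], bool]] = None


def run_query(principal: str, table: str, rows: list, grants: list) -> list:
    """Return only the rows and columns the principal's grant allows."""
    matching = [g for g in grants if g.principal == principal and g.table == table]
    if not matching:
        raise PermissionError(f"{principal} has no grant on {table}")
    grant = matching[0]
    visible = []
    for row in rows:
        # Row-level filter: rows failing the predicate are dropped silently.
        if grant.row_filter is not None and not grant.row_filter(row):
            continue
        # Column-level filter: columns outside the grant never reach the result.
        visible.append({col: val for col, val in row.items() if col in grant.columns})
    return visible


orders = [{"amount": 120.0, "date": "2024-01-05", "customer_ssn": "123-45-6789"}]
customers = [{"id": 1, "region": "EU"}, {"id": 2, "region": "APAC"}]

grants = [
    Grant("finance-analysts", "orders", {"amount", "date"}),
    Grant("eu-team", "customers", {"id", "region"}, lambda r: r["region"] == "EU"),
]

finance_view = run_query("finance-analysts", "orders", orders, grants)
eu_view = run_query("eu-team", "customers", customers, grants)
```

In this model `finance_view` never contains `customer_ssn`, and `eu_view` contains only the EU row, without either caller writing any filtering logic of their own.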
Column-Level and Row-Level Security
Column-level security: grant SELECT on a named subset of columns instead of the whole table. The classic case: an analytics team can query an orders table, but the customer_ssn and payment_token columns are excluded from their grant entirely. They don't see nulls; those columns simply don't appear in their query results.
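A column-level grant like the one just described might be expressed as the request below. This is a sketch under assumptions: the role ARN, database, and table names are hypothetical, and the dictionary is shaped for the Lake Formation GrantPermissions API as exposed by boto3's `lakeformation` client (`grant_permissions(**request)`), which should be checked against the current API reference before use.

```python
def column_grant_request(principal_arn: str, database: str, table: str,
                         excluded_columns: list) -> dict:
    """Build a SELECT grant on every column of a table except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant all columns except the sensitive ones listed here.
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and catalog names for illustration.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/analytics-team",
    "sales",
    "orders",
    ["customer_ssn", "payment_token"],
)
```

Using an exclusion list means newly added non-sensitive columns become visible automatically; naming the allowed columns explicitly (`ColumnNames`) is the stricter alternative when new columns should stay hidden until reviewed.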
Row-level security: implemented through data filters, which attach a WHERE-style predicate to a grant. A grant might restrict a role to only rows where region = 'APAC', meaning the same table serves multiple regional teams with each seeing only their own slice, with zero application-level filtering logic needed.
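A row filter of that kind could be sketched as two payloads: one defining the data filter and one granting SELECT on it. The account ID, database, table, filter name, and role ARN are hypothetical, and the shapes follow boto3's `lakeformation` `create_data_cells_filter` and `grant_permissions` calls as an assumption to verify against the current documentation.

```python
# Hypothetical identifiers for illustration.
ACCOUNT_ID = "111122223333"


def row_filter_definition(database: str, table: str, name: str, expression: str) -> dict:
    """Payload describing a data filter that keeps all columns but only matching rows."""
    return {
        "TableData": {
            "TableCatalogId": ACCOUNT_ID,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "RowFilter": {"FilterExpression": expression},
            # Empty wildcard: no columns are excluded by this filter.
            "ColumnWildcard": {},
        }
    }


def grant_on_filter(principal_arn: str, database: str, table: str, name: str) -> dict:
    """Payload granting SELECT through the named data filter rather than the raw table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": ACCOUNT_ID,
                "DatabaseName": database,
                "TableName": table,
                "Name": name,
            }
        },
        "Permissions": ["SELECT"],
    }


apac_filter = row_filter_definition("sales", "customers", "apac_only", "region = 'APAC'")
apac_grant = grant_on_filter(
    "arn:aws:iam::111122223333:role/apac-team", "sales", "customers", "apac_only"
)
```

Combining a non-empty column exclusion with a row expression in the same filter is what yields cell-level security.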
| Requirement | Lake Formation mechanism |
|---|---|
| Hide specific sensitive columns from a role | Column-level permissions |
| Restrict a role to a subset of rows (e.g., by region or business unit) | Row-level data filters |
| Grant access across AWS accounts without copying data | Cross-account Lake Formation sharing |
| Let a governed team self-serve grants for their own datasets | Lake Formation tag-based access control (LF-TBAC) |
Tag-based access control (LF-TBAC) deserves a callout because it's how large organizations avoid drowning in individual per-table grants: you attach tags (like confidentiality=restricted or department=finance) to catalog resources, then grant permissions against the tag rather than against every individual table. Add a new table with the right tag, and it inherits the existing grants automatically.
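The inheritance behaviour can be illustrated with a small toy model. This is not the Lake Formation API; the catalog, grant structure, and function names are hypothetical, and the matching rule assumed here is that every tag key in a grant's expression must match one of its allowed values.

```python
def tags_match(resource_tags: dict, expression: dict) -> bool:
    """True when the resource carries an allowed value for every key in the expression."""
    return all(resource_tags.get(key) in allowed for key, allowed in expression.items())


def allowed_tables(principal: str, permission: str, catalog: dict, tag_grants: list) -> list:
    """List tables the principal can use, derived purely from tags."""
    return sorted(
        name
        for name, tags in catalog.items()
        if any(
            g["principal"] == principal
            and permission in g["permissions"]
            and tags_match(tags, g["expression"])
            for g in tag_grants
        )
    )


# One grant against tags, not against tables.
tag_grants = [
    {
        "principal": "finance-analysts",
        "expression": {"department": {"finance"}},
        "permissions": {"SELECT"},
    }
]

catalog = {
    "orders": {"department": "finance", "confidentiality": "restricted"},
    "web_clicks": {"department": "marketing", "confidentiality": "internal"},
}

before = allowed_tables("finance-analysts", "SELECT", catalog, tag_grants)

# A new table arrives with the right tag; no new grant is written.
catalog["invoices"] = {"department": "finance", "confidentiality": "internal"}
after = allowed_tables("finance-analysts", "SELECT", catalog, tag_grants)
```

The grant list never changes between `before` and `after`; only the catalog does, which is the scaling property the paragraph above describes.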
Glue Data Catalog as the Governance Backbone
Every governance mechanism in this step depends on the Glue Data Catalog actually reflecting reality: a table Lake Formation doesn't know about is a table it can't protect. The catalog stores:
- Database
  - Table
    - Columns (name, type, comment)
    - Partitions
    - Location (S3 path)
    - Table properties (classification, format)

This is also the same catalog Athena, Redshift Spectrum, EMR, and Glue ETL jobs all read from: one metadata store, many compute engines, which is precisely what makes centralized Lake Formation permissions possible in the first place. If you bypass the catalog (querying S3 directly from a Spark job without going through Glue Catalog), you also bypass Lake Formation's enforcement, which is a governance gap the exam expects you to spot.
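The pieces of metadata listed above map fairly directly onto a Glue table definition. The sketch below builds such a definition as a plain dictionary, in the shape boto3's Glue `create_table` call accepts; the database, table, column names, and S3 path are hypothetical, and a Parquet table is assumed.

```python
def orders_table_definition(database: str, s3_location: str) -> dict:
    """Build a CreateTable-style payload for a partitioned Parquet table."""
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": "orders",
            "TableType": "EXTERNAL_TABLE",
            # Table properties (classification, format)
            "Parameters": {"classification": "parquet"},
            # Partitions
            "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
            "StorageDescriptor": {
                # Columns (name, type, comment)
                "Columns": [
                    {"Name": "order_id", "Type": "bigint", "Comment": "Primary key"},
                    {"Name": "amount", "Type": "double", "Comment": "Order total"},
                    {"Name": "customer_ssn", "Type": "string", "Comment": "Sensitive"},
                ],
                # Location (S3 path)
                "Location": s3_location,
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    }


table_def = orders_table_definition("sales", "s3://example-data-lake-bucket/orders/")
column_names = [c["Name"] for c in table_def["TableInput"]["StorageDescriptor"]["Columns"]]
```

Because column names live in the catalog entry, they are what a column-level grant refers to; a column the catalog does not list is one no grant can include or exclude.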
Data Lineage and Governance with Amazon DataZone
Where Lake Formation answers "who can access this," Amazon DataZone answers "where did this come from, who owns it, and can I trust it." DataZone is the catalog-and-governance layer built for organization-wide data discovery. Think of it as the business-facing front door to data that Lake Formation and Glue Catalog manage underneath.
1. Data Producer (owns the "orders" dataset) publishes to the DataZone Data Portal, with business metadata: description, owner, glossary terms, data quality score, lineage back to the source pipeline.
2. Data Consumer searches the portal and requests a subscription.
3. Approval workflow: the owner or a governed approver signs off.
4. Access granted: DataZone provisions the underlying Lake Formation grant.

DataZone's project-and-domain model organizes data assets by business domain rather than by AWS account or service boundary, which matters increasingly for organizations adopting data mesh principles, where each business domain owns and publishes its own data products rather than a central team owning every pipeline. DataZone gives that federated model a shared catalog and subscription workflow so domains can discover and request access to each other's data products without every request becoming a ticket to a central platform team.
Lineage tracking in DataZone captures the chain from source system through transformation jobs to the published asset, which is what lets a consumer (or an auditor) answer "where did this number actually come from" without archaeology through pipeline code.
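Conceptually, a lineage record is a graph of assets and the jobs that produced them, and answering "where did this come from" is a walk upstream. The toy model below illustrates that idea only; it does not use or represent the DataZone API, and every node name is hypothetical.

```python
from collections import deque

# Each key lists the inputs it was directly derived from (hypothetical nodes).
lineage = {
    "datazone:orders_daily_revenue": ["glue_job:aggregate_orders"],
    "glue_job:aggregate_orders": ["s3:curated/orders"],
    "s3:curated/orders": ["glue_job:clean_orders"],
    "glue_job:clean_orders": ["s3:raw/orders"],
    "s3:raw/orders": ["source:orders_db"],
}


def trace_upstream(asset: str, graph: dict) -> list:
    """Breadth-first walk from a published asset back to its original sources."""
    seen = []
    queue = deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # Guard against revisiting shared ancestors.
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


chain = trace_upstream("datazone:orders_daily_revenue", lineage)
origin = chain[-1]
```

With the graph recorded as metadata, the question is answered by a lookup rather than by reading transformation code.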
Compliance Considerations for Regulated Data
For regulated datasets (PII, financial records, healthcare data), a few patterns recur across exam scenarios:
- Data residency: some regulations require data to stay within a specific geographic boundary; S3 bucket region choice and Lake Formation cross-region sharing rules need to respect this.
- Right to erasure / retention limits: S3 Object Lock can enforce WORM (write-once-read-many) for retention compliance, while a separate deletion pipeline has to handle erasure requests without breaking downstream aggregates that depend on the deleted record.
- Masking vs tokenization: for fields that need to exist in non-production or analytics environments without exposing the real value, tokenization (reversible, via a secure lookup) is different from masking (irreversible obfuscation). The exam expects you to know an analytics team generally needs masked or tokenized data, not the raw sensitive value, and Lake Formation column-level permissions or a transformation step in Glue can enforce this before data ever reaches that team.
- Audit trail completeness: CloudTrail logging on KMS key usage, S3 data events, and Lake Formation grant changes together form the audit story regulators ask for; missing any one of the three leaves a gap.
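The masking vs tokenization distinction in the list above can be shown in a few lines. This is a minimal sketch, not a production design: `TokenVault` and `mask_ssn` are hypothetical names, the vault is an in-memory dictionary standing in for a secured lookup service, and the SSN format is assumed to be `NNN-NN-NNNN`.

```python
import secrets


class TokenVault:
    """Toy tokenization service: reversible, but only through the vault's lookup."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so joins on the tokenized column still work.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


def mask_ssn(ssn: str) -> str:
    """Irreversible masking: the first five digits cannot be recovered from the output."""
    return "***-**-" + ssn[-4:]


vault = TokenVault()
raw = "123-45-6789"
token = vault.tokenize(raw)
masked = mask_ssn(raw)
```

Here `vault.detokenize(token)` returns the original value for anyone with vault access, while nothing can turn `masked` back into the full number, which is the property that decides which technique fits a given requirement.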
Exam Focus: What Questions Test From This Step
- SSE-S3 vs SSE-KMS, specifically around audit trail and key management control
- Lake Formation's permission model layered over the Glue Data Catalog, vs raw IAM/S3 policies
- Column-level permissions vs row-level data filters: matching the right mechanism to a stated requirement
- Tag-based access control (LF-TBAC) for scaling grants across many tables
- Why bypassing the Glue Data Catalog also bypasses Lake Formation enforcement
- Amazon DataZone's role in lineage, discovery, and subscription-based access requests
- How DataZone supports a data mesh model of domain-owned data products
- Masking/tokenization vs raw access for regulated or sensitive fields
- Which CloudTrail/KMS/Lake Formation logs together satisfy an audit requirement