
AWS Amazon Web Services Associate Step 4 of 5 106 guides · updated 2026


Step 4: Security & Governance

Here's a question worth sitting with before we get into service names: who is actually allowed to see this column, and how do you prove that three months from now when an auditor asks? Everything in this step exists to answer that question at scale, across a lake that might span dozens of teams and thousands of tables. It's less glamorous than building pipelines, but it's where a meaningful chunk of DEA-C01 questions lives, because permissions mistakes in a data lake are exactly the kind of thing that ends careers and makes headlines.


Encryption: At Rest and In Transit

Start with the baseline, because every governance conversation assumes it's already handled.

In transit:
  TLS for all service-to-service traffic. S3, Redshift, and Glue endpoints
  all support TLS; enforce it where plain connections are still accepted
  (for S3, a bucket policy that denies requests when aws:SecureTransport
  is false).
  JDBC/ODBC connections to Redshift should require SSL.
  Kinesis producers can encrypt client-side before PutRecord.
At rest:
  SSE-S3 (Amazon S3 managed keys, AES-256).
  SSE-KMS (customer-managed keys, audit trail via CloudTrail).
  Redshift: KMS or HSM-backed encryption at the cluster level.
  DynamoDB: encryption at rest is always on, using an AWS owned key by default.

The distinction the exam leans on hardest is SSE-S3 vs SSE-KMS. Both encrypt objects at rest, but only SSE-KMS gives you a per-key audit trail (every decrypt call logged in CloudTrail), key rotation policies, and the ability to restrict who can use the key independently of who can access the S3 bucket. A scenario mentioning "must produce an audit log of every time a specific dataset was decrypted" is pointing you at SSE-KMS, not SSE-S3. SSE-S3 doesn't give you that visibility because AWS holds the keys with no per-request logging exposed to you.
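As a rough illustration of how both halves of the baseline can be enforced at the bucket level, the sketch below builds an S3 bucket policy that denies non-TLS requests and denies uploads that do not request SSE-KMS. The bucket name `example-lake-bucket` and the helper name `build_bucket_policy` are assumptions made for this example. The second statement also rejects uploads that omit the encryption header and rely on bucket default encryption, so treat it as a starting point rather than a finished policy.

```python
import json


def build_bucket_policy(bucket: str) -> dict:
    """Build a bucket policy that denies plain-HTTP access and non-KMS uploads."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # In transit: reject any request that did not arrive over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # At rest: reject uploads that do not ask for SSE-KMS.
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy_json = json.dumps(build_bucket_policy("example-lake-bucket"), indent=2)
print(policy_json)
```

The resulting JSON could then be attached with the S3 `put_bucket_policy` call or pasted into the console.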


Lake Formation: Fine-Grained Access Control

Lake Formation sits on top of the Glue Data Catalog and IAM, adding a permissions model that's actually usable at the scale of a real data lake. Without it, you'd be writing S3 bucket policies and IAM policies by hand for every table, every user, every combination, which doesn't scale past a handful of datasets.

IAM alone:
  Grant/deny access to S3 prefixes and Glue API calls directly.
  Coarse. No column-level or row-level concept. Painful at scale.
Lake Formation:
  Grants layered on top of Glue Catalog tables/databases.
  SELECT, ALTER, DROP, DESCRIBE: permissions expressed like a
  database grant, not an S3 policy.
  Supports column-level, row-level, and cell-level filtering.
  Central permissions registry: one place to audit who can see what.

How the Permission Model Works

Data Lake Administrator
│
├──► Registers S3 location with Lake Formation
│
├──► Grants: "finance-analysts" role → SELECT on orders.amount, orders.date
│        (but NOT orders.customer_ssn)
│
└──► Grants: "eu-team" role → SELECT on customers
         WHERE region = 'EU' (row filter)

Lake Formation permissions are granted to IAM principals (or, in cross-account setups, to another AWS account or an organization) against catalog resources: databases, tables, or specific columns. When a query runs through Athena, Redshift Spectrum, or EMR (with Lake Formation integration enabled), the engine checks Lake Formation grants before returning any rows, and it can silently exclude denied columns or rows rather than erroring the whole query.
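A grant like the "finance-analysts" one in the diagram might be expressed through the boto3 Lake Formation client roughly as follows. This is a minimal sketch: the account ID, the database name `sales`, and the helper names are assumptions for illustration, and the request builder is kept separate from the API call so it can be inspected without AWS credentials.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Build grant_permissions arguments for SELECT on a named column subset."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns are covered; anything else stays hidden.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(request):
    """Send the grant to Lake Formation (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical account ID and database name, for illustration only.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/finance-analysts",
    "sales",
    "orders",
    ["amount", "date"],
)
print(request["Resource"]["TableWithColumns"]["ColumnNames"])
```

Because `customer_ssn` is simply absent from `ColumnNames`, nothing has to be denied explicitly; the grant only ever names what the role may see.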

Column-Level and Row-Level Security

Column-level security: grant SELECT on a named subset of columns instead of the whole table. The classic case: an analytics team can query an orders table, but the customer_ssn and payment_token columns are excluded from their grant entirely. They don't see nulls; they simply don't see those columns exist in their query results.

Row-level security: implemented through data filters, which attach a WHERE-style predicate to a grant. A grant might restrict a role to only rows where region = 'APAC', meaning the same table serves multiple regional teams with each seeing only their own slice, with zero application-level filtering logic needed.
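The row-filter idea can be sketched as two request payloads: one that defines a data filter, and one that grants SELECT on that filter instead of on the table. The shapes below follow the boto3 Lake Formation `create_data_cells_filter` and `grant_permissions` calls as I understand them; the account ID, database name `sales`, filter name, and role name are all assumptions for the example.

```python
def row_filter_request(catalog_id, database, table, filter_name, predicate):
    """Build create_data_cells_filter arguments for a row-level filter."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # WHERE-style predicate evaluated against each row.
            "RowFilter": {"FilterExpression": predicate},
            # Empty wildcard: every column is visible, only rows are filtered.
            "ColumnWildcard": {},
        }
    }


def filter_grant_request(principal_arn, filter_request):
    """Build grant_permissions arguments that target the data filter."""
    table_data = filter_request["TableData"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical identifiers, for illustration only.
apac_filter = row_filter_request(
    "111122223333", "sales", "customers", "apac_only", "region = 'APAC'"
)
apac_grant = filter_grant_request(
    "arn:aws:iam::111122223333:role/apac-team", apac_filter
)
print(apac_filter["TableData"]["RowFilter"]["FilterExpression"])
```

Each regional team would get its own filter and grant against the same underlying table, which is what removes the need for filtering logic in the application.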

Requirement | Lake Formation mechanism
──────────────────────────────────────────────────────────────────────
Hide specific sensitive columns from a role | Column-level permissions
Restrict a role to a subset of rows (e.g., by region or business unit) | Row-level data filters
Grant access across AWS accounts without copying data | Cross-account Lake Formation sharing
Let a governed team self-serve grants for their own datasets | Lake Formation tag-based access control (LF-TBAC)

Tag-based access control (LF-TBAC) deserves a callout because it's how large organizations avoid drowning in individual per-table grants: you attach tags (like confidentiality=restricted or department=finance) to catalog resources, then grant permissions against the tag rather than against every individual table. Add a new table with the right tag, and it inherits the existing grants automatically.
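To make the inheritance behaviour concrete, here is a small pure-Python simulation of tag matching. It is not the Lake Formation API; the function names, grant structure, and tag values are assumptions for illustration. It models a tag expression as "every listed key must match, and any one of the listed values is acceptable for that key".

```python
from typing import Dict, List, Set


def tag_expression_matches(
    resource_tags: Dict[str, str], expression: Dict[str, Set[str]]
) -> bool:
    """True when the resource carries an allowed value for every key in the expression."""
    return all(
        resource_tags.get(key) in allowed for key, allowed in expression.items()
    )


def principals_with_access(resource_tags: Dict[str, str], grants: List[dict]) -> List[str]:
    """List principals whose tag-based grant covers a resource with these tags."""
    return sorted(
        grant["principal"]
        for grant in grants
        if tag_expression_matches(resource_tags, grant["expression"])
    )


# One grant made against tags, not against any specific table (hypothetical).
grants = [
    {
        "principal": "finance-analysts",
        "expression": {
            "department": {"finance"},
            "confidentiality": {"public", "internal"},
        },
    }
]

# A table created later picks up the grant purely through its tags.
new_table_tags = {"department": "finance", "confidentiality": "internal"}
restricted_table_tags = {"department": "finance", "confidentiality": "restricted"}

print(principals_with_access(new_table_tags, grants))
print(principals_with_access(restricted_table_tags, grants))
```

The first lookup returns the finance analysts and the second returns nobody, even though no grant was ever written that names either table.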


Glue Data Catalog as the Governance Backbone

Every governance mechanism in this step depends on the Glue Data Catalog actually reflecting reality: a table Lake Formation doesn't know about is a table it can't protect. The catalog stores:

Database → Table → Columns (name, type, comment)
                 → Partitions
                 → Location (S3 path)
                 → Table properties (classification, format)

This is also the same catalog Athena, Redshift Spectrum, EMR, and Glue ETL jobs all read from: one metadata store, many compute engines, which is precisely what makes centralized Lake Formation permissions possible in the first place. If you bypass the catalog (querying S3 directly from a Spark job without going through Glue Catalog), you also bypass Lake Formation's enforcement, which is a governance gap the exam expects you to spot.
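The catalog structure above maps fairly directly onto what the Glue `get_table` call returns. The sketch below summarizes such a response; the sample response is hand-written for illustration (database `sales`, bucket `example-lake-bucket`, and the column list are assumptions), and a real one would come from something like `boto3.client("glue").get_table(DatabaseName="sales", Name="orders")`.

```python
def summarize_table(response: dict) -> dict:
    """Pull the governance-relevant fields out of a Glue get_table response."""
    table = response["Table"]
    storage = table.get("StorageDescriptor", {})
    return {
        "table": f'{table["DatabaseName"]}.{table["Name"]}',
        "columns": [(c["Name"], c.get("Type", "")) for c in storage.get("Columns", [])],
        "partition_keys": [p["Name"] for p in table.get("PartitionKeys", [])],
        "location": storage.get("Location"),
        "classification": table.get("Parameters", {}).get("classification"),
    }


# Hand-written sample shaped like a get_table response (values are hypothetical).
sample_response = {
    "Table": {
        "DatabaseName": "sales",
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "amount", "Type": "decimal(10,2)"},
                {"Name": "customer_ssn", "Type": "string", "Comment": "PII"},
            ],
            "Location": "s3://example-lake-bucket/sales/orders/",
        },
        "PartitionKeys": [{"Name": "order_date", "Type": "date"}],
        "Parameters": {"classification": "parquet"},
    }
}

summary = summarize_table(sample_response)
print(summary["table"], summary["location"])
```

A summary like this is a cheap way to check that the columns and S3 location Lake Formation is protecting are the ones the pipelines actually write to.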


Data Lineage and Governance with Amazon DataZone

Where Lake Formation answers "who can access this," Amazon DataZone answers "where did this come from, who owns it, and can I trust it." DataZone is the catalog-and-governance layer built for organization-wide data discovery. Think of it as the business-facing front door to data that Lake Formation and Glue Catalog manage underneath.

Data Producer (owns "orders" dataset)
        │
        ▼
Publishes to DataZone Data Portal
        │   (business metadata: description, owner, glossary terms,
        │    data quality score, lineage back to source pipeline)
        ▼
Data Consumer searches portal ──► requests subscription
        │
        ▼
Approval workflow (owner or governed approver signs off)
        │
        ▼
Access granted (DataZone provisions underlying Lake Formation grant)

DataZone's project-and-domain model organizes data assets by business domain rather than by AWS account or service boundary, which matters increasingly for organizations adopting data mesh principles, where each business domain owns and publishes its own data products rather than a central team owning every pipeline. DataZone gives that federated model a shared catalog and subscription workflow so domains can discover and request access to each other's data products without every request becoming a ticket to a central platform team.

Lineage tracking in DataZone captures the chain from source system through transformation jobs to the published asset, which is what lets a consumer (or an auditor) answer "where did this number actually come from" without archaeology through pipeline code.
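Conceptually, answering that question is a walk up a directed graph. The sketch below is a toy model, not the DataZone API: the node names and the `upstream_sources` helper are assumptions made for illustration. It shows how a lineage record that maps each asset to its direct inputs can be traversed back to the original source.

```python
from collections import deque
from typing import Dict, List


def upstream_sources(lineage: Dict[str, List[str]], asset: str) -> List[str]:
    """Return every node upstream of `asset`, nearest first (breadth-first)."""
    seen = set()
    order = []
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:  # guard against shared inputs and cycles
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Each key lists its direct inputs (hypothetical asset names).
lineage = {
    "published.orders": ["glue_job.clean_orders"],
    "glue_job.clean_orders": ["s3.raw_orders"],
    "s3.raw_orders": ["source.orders_db"],
}

print(upstream_sources(lineage, "published.orders"))
# ['glue_job.clean_orders', 's3.raw_orders', 'source.orders_db']
```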


Compliance Considerations for Regulated Data

For regulated datasets (PII, financial records, healthcare data), a few patterns recur across exam scenarios:


Exam Focus: What Questions Test From This Step