Azure Data Lake Storage Gen2: Hierarchical Namespace Object Storage for Analytics

Azure Data Lake Storage Gen2 (ADLS Gen2) is not a separate service — it is Azure Blob Storage with the hierarchical namespace (HNS) feature enabled. That single toggle changes how the storage layer represents directories: instead of simulating folders through key prefixes, it creates real directory objects. This sounds like a small implementation detail, but it has significant performance and security consequences for analytics workloads.

Without HNS, renaming or deleting a "directory" is O(n): no directory object exists, so every blob under the prefix must be copied to its new name and the original deleted. With HNS, a directory rename is a single atomic metadata operation whose cost does not depend on how many files sit beneath it. At data-engineering scale (millions of files, petabyte datasets), this difference can dominate job completion times.
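To make the complexity difference concrete, here is a toy in-memory model that counts how many stored entries each kind of namespace touches during a directory rename. It is only a sketch: the class names are hypothetical and nothing here reflects how Azure implements either namespace internally.

```python
class FlatNamespace:
    """Blob-style store: 'directories' are only shared key prefixes."""

    def __init__(self, keys):
        self.blobs = {key: b"" for key in keys}

    def rename_prefix(self, old, new):
        """Re-key every blob under `old`; returns the number of entries touched."""
        touched = 0
        for key in [k for k in self.blobs if k.startswith(old)]:
            self.blobs[new + key[len(old):]] = self.blobs.pop(key)
            touched += 1
        return touched


class HierarchicalNamespace:
    """HNS-style store: each directory is a real node that holds its children."""

    def __init__(self):
        self.root = {}

    def add_file(self, path):
        *dirs, name = path.strip("/").split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = b""

    def rename_dir(self, parent, old, new):
        """Re-link one directory entry; returns the number of entries touched."""
        parent[new] = parent.pop(old)
        return 1


files = [f"staging/part-{i:05d}.parquet" for i in range(10_000)]

flat = FlatNamespace(files)
flat_touched = flat.rename_prefix("staging/", "production/")

tree = HierarchicalNamespace()
for f in files:
    tree.add_file(f)
tree_touched = tree.rename_dir(tree.root, "staging", "production")

print(flat_touched, tree_touched)  # 10000 1
```

The flat store does work proportional to the number of files, while the tree re-links a single entry regardless of how many files the directory contains.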

Real-World Scenario

A telecom company runs a Databricks pipeline that processes 2 billion call detail records (CDRs) per day. The pipeline writes Parquet files partitioned by date and region. At the end of each day, it renames the staging directory to production. Without HNS, renaming a directory with 40 million files takes over an hour. With HNS, the same rename is a single metadata operation that completes almost instantly, so the entire job fits within its 4-hour SLA window.

Gen1 vs. Gen2 Differences

Azure Data Lake Storage Gen1 was a separate service (not built on Blob Storage) that Microsoft has retired. Gen2 replaced it:

Comparison: ADLS Gen1 vs Gen2
-------------------------------
Feature           | Gen1 (Retired)      | Gen2
------------------|---------------------|--------------------------------
Storage base      | Custom service      | Azure Blob Storage + HNS
Pricing model     | Separate SKU        | Blob Storage pricing (lower)
Protocol          | WebHDFS-compatible  | Blob REST, DFS REST (ABFS), NFS 3.0
POSIX ACLs        | Yes                 | Yes (improved implementation)
Global redundancy | No (region only)    | LRS, ZRS, GRS, RA-GRS
Blob features     | No                  | Yes, most (lifecycle, access tiers)
Lifecycle mgmt    | No                  | Yes (move to Cool/Archive)
Integration       | HDInsight focused   | Synapse, Databricks, HDInsight, ADF

For data that originated on Gen1, the migration paths to Gen2 included Azure Data Factory copy activities and third-party tools such as WANdisco Fusion.

Hierarchical Namespace and POSIX ACLs

With HNS enabled, the storage account presents a true directory tree. Permissions follow the POSIX model: access ACLs (who can read/write/execute this object) and default ACLs (inherited by new children):

ADLS Gen2 Directory Tree
--------------------------
/
├── raw/
│   ├── cdr/
│   │   ├── 2024/
│   │   │   ├── 06/
│   │   │   │   ├── 15/
│   │   │   │   │   ├── region=US/
│   │   │   │   │   └── region=EU/

ACL on /raw/cdr/:
  Owner: pipeline-sp                 rwx
  Group: data-engineers              r-x
  Other:                             ---

Default ACL (inherited by new children):
  pipeline-sp                        rwx
  data-engineers                     r-x

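ACLs are set using the storage SDK, Azure CLI, or Azure Storage Explorer. Azure RBAC is evaluated before ACLs: a managed identity or service principal that holds a data-plane role such as Storage Blob Data Contributor or Storage Blob Data Reader is authorized by the role alone, and ACLs are consulted only when no role assignment grants the requested access. To enforce directory-level restrictions, grant the identity ACL entries rather than a broad data role.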
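The ACL in the sketch above can be written in the short text form that POSIX-style ACL tools use, [default:]type:id:permissions, joined by commas. The helper below is a minimal sketch that builds such a string. The function name and the object ID are hypothetical, and you should confirm the exact format your CLI or SDK version expects before passing it on.

```python
def acl_entry(kind, permissions, principal_id="", default=False):
    """Format one entry as [default:]<kind>:<principal_id>:<permissions>."""
    if kind not in {"user", "group", "mask", "other"}:
        raise ValueError(f"unknown ACL entry type: {kind}")
    # Each position may hold only its own flag (r, w, x) or a dash.
    if len(permissions) != 3 or any(
        p not in (flag, "-") for p, flag in zip(permissions, "rwx")
    ):
        raise ValueError(f"permissions must look like 'r-x', got {permissions!r}")
    prefix = "default:" if default else ""
    return f"{prefix}{kind}:{principal_id}:{permissions}"


# Hypothetical Microsoft Entra object ID for the data-engineers group.
ENGINEERS_GROUP_ID = "00000000-0000-0000-0000-000000000001"

acl = ",".join([
    # Access ACL: applies to the directory itself.
    acl_entry("user", "rwx"),    # owning user (the pipeline service principal)
    acl_entry("group", "r-x"),   # owning group
    acl_entry("other", "---"),
    acl_entry("group", "r-x", ENGINEERS_GROUP_ID),
    # Default ACL: copied onto children created later.
    acl_entry("user", "rwx", default=True),
    acl_entry("group", "r-x", default=True),
    acl_entry("other", "---", default=True),
    acl_entry("group", "r-x", ENGINEERS_GROUP_ID, default=True),
])
print(acl)
```

An empty ID refers to the owning user or owning group, while a named entry carries the principal's object ID. Building the string in code keeps the access and default halves in sync, which is where hand-written ACLs tend to drift.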

Azure Blob File System (ABFS) Driver

Analytics engines like Databricks, Synapse Spark, and HDInsight access ADLS Gen2 through the ABFS driver, a Hadoop file system driver that calls the Data Lake Storage (DFS) REST API rather than the Blob REST API, which lets it issue real directory operations. The URI scheme is:

abfss://<container>@<storage_account>.dfs.core.windows.net/<path>

Example:
abfss://datalake@mycompanyadls.dfs.core.windows.net/raw/cdr/2024/06/15/

The .dfs.core.windows.net endpoint (as opposed to .blob.core.windows.net) exposes the directory-aware Data Lake Storage API. The Blob endpoint also works against an HNS-enabled account, but clients that use it treat paths as flat blob names and lose the benefit of atomic directory operations.
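Because a mistyped endpoint is easy to miss, it can help to build and validate these URIs in code. The following is a small sketch with hypothetical helper names. It assumes the public-cloud suffix dfs.core.windows.net, so sovereign clouds would need a different pattern.

```python
import re

# Matches abfs:// or abfss:// URIs on the public-cloud dfs endpoint only.
ABFS_URI = re.compile(
    r"^(?P<scheme>abfss?)://(?P<container>[^@/]+)@(?P<account>[^./]+)"
    r"\.dfs\.core\.windows\.net(?P<path>/.*)?$"
)


def parse_abfs_uri(uri):
    """Split an ABFS URI into scheme, container, account, and path."""
    match = ABFS_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ABFS URI on the dfs endpoint: {uri}")
    parts = match.groupdict()
    parts["path"] = parts["path"] or "/"
    return parts


def build_abfs_uri(container, account, path="", secure=True):
    """Assemble an ABFS URI; abfss (TLS) is the default scheme."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


uri = build_abfs_uri("datalake", "mycompanyadls", "raw/cdr/2024/06/15/")
print(uri)
print(parse_abfs_uri(uri))
```

Passing a .blob.core.windows.net address to parse_abfs_uri raises ValueError, which catches the wrong-endpoint mistake before a job starts.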

Mounting in Databricks

# Databricks notebook: mount ADLS Gen2 with service principal
configs = {
    "fs.azure.account.auth.type":
        "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id":
        "<service_principal_client_id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="kv", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant_id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://datalake@mycompanyadls.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)

# Access data
df = spark.read.parquet("/mnt/datalake/raw/cdr/2024/06/15/")

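For new Databricks workspaces, Unity Catalog external locations backed by a storage credential (preferably a managed identity, otherwise a service principal) are preferred over mount points. A mount is visible to every user in the workspace under a single identity, whereas Unity Catalog governs data access per user and group.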
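Where Unity Catalog is not an option, a lighter alternative to a workspace-wide mount is to set the same OAuth properties on the Spark session, scoped to one storage account by appending the account host to each key. The sketch below only builds the configuration dictionary. The function name is hypothetical and the placeholder values must be replaced with real ones.

```python
def abfs_oauth_conf(account, tenant_id, client_id, client_secret):
    """Return account-scoped Hadoop properties for service principal OAuth."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


conf = abfs_oauth_conf(
    account="mycompanyadls",
    tenant_id="<tenant_id>",
    client_id="<service_principal_client_id>",
    client_secret="<secret read from a secret scope>",
)

# In a Databricks notebook, where `spark` is predefined, apply it with:
#     for key, value in conf.items():
#         spark.conf.set(key, value)
# and then read with the full abfss:// URI instead of a /mnt path.
for key in conf:
    print(key)
```

Session-scoped settings disappear when the session ends and apply only to the notebook or job that set them, which narrows the exposure compared with a mount.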

Lifecycle Management for Analytics Data

ADLS Gen2 inherits Blob Storage lifecycle management. Analytics workloads typically follow a hot-warm-cold pattern:

Lifecycle Policy for Data Lake
--------------------------------
/raw/  (landing zone, hot access for 30 days)
  After 30 days -> move to Cool tier
  After 365 days -> move to Archive

/processed/  (query result cache, hot for 7 days)
  After 7 days -> Cool
  After 90 days -> delete

/archive/  (compliance hold, do not delete)
  Apply immutability policy (WORM)

Lifecycle rules are evaluated per file (blob) and scoped by path prefix, so old date-partitioned directories age into cooler tiers while recent partitions stay hot.
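The /raw/ and /processed/ rules above map onto a Blob Storage lifecycle management policy, which is a JSON document. The sketch below generates one. It assumes the container is named datalake (as in the earlier URI example), and the rule names and helper function are hypothetical. The /archive/ immutability policy is configured separately and is not part of this document.

```python
import json


def tier_rule(name, prefix, cool_after=None, archive_after=None, delete_after=None):
    """Build one lifecycle rule; day counts are measured from last modification."""
    actions = {}
    if cool_after is not None:
        actions["tierToCool"] = {"daysAfterModificationGreaterThan": cool_after}
    if archive_after is not None:
        actions["tierToArchive"] = {"daysAfterModificationGreaterThan": archive_after}
    if delete_after is not None:
        actions["delete"] = {"daysAfterModificationGreaterThan": delete_after}
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "actions": {"baseBlob": actions},
            # prefixMatch starts with the container name, then the path.
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
        },
    }


policy = {
    "rules": [
        tier_rule("raw-tiering", "datalake/raw/", cool_after=30, archive_after=365),
        tier_rule("processed-expiry", "datalake/processed/", cool_after=7, delete_after=90),
    ]
}
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and applied to the account, for example with az storage account management-policy create. Generating it from code keeps the day thresholds in one reviewable place.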

Architecture: Medallion Pattern on ADLS Gen2

[Source Systems]
      |
  Azure Data Factory / Event Hubs
      |
[Bronze Layer]  /raw/<source>/<date>/
  Raw files as-received (Parquet, JSON, CSV)
  Immutable after landing
      |
  [Databricks / Synapse Spark job]
      |
[Silver Layer]  /clean/<domain>/<date>/
  Validated, de-duplicated, schema-aligned
  Delta Lake format
      |
  [Databricks job / SQL Pool]
      |
[Gold Layer]  /curated/<subject>/<date>/
  Aggregated, business-ready
  Served to Power BI, Synapse SQL Serverless

Each layer is a directory hierarchy in ADLS Gen2. Per-layer ACLs grant write access only to the transformation service principal and read access only to the analysts who need that layer.
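Pipelines usually derive these layer paths in code rather than hard-coding them. A minimal sketch follows, assuming the year/month/day layout from the directory tree shown earlier. The mapping and function names are hypothetical.

```python
from datetime import date

# Medallion layer name -> top-level directory used in the diagram above.
LAYER_ROOTS = {"bronze": "raw", "silver": "clean", "gold": "curated"}


def layer_path(layer, name, day):
    """Return the directory for one source/domain/subject on one date."""
    return f"/{LAYER_ROOTS[layer]}/{name}/{day:%Y/%m/%d}/"


run_date = date(2024, 6, 15)
bronze = layer_path("bronze", "cdr", run_date)
silver = layer_path("silver", "telephony", run_date)
gold = layer_path("gold", "usage_summary", run_date)

print(bronze)  # /raw/cdr/2024/06/15/
print(silver)  # /clean/telephony/2024/06/15/
print(gold)    # /curated/usage_summary/2024/06/15/
```

Keeping the layout in one function means a change to the partitioning scheme touches a single place, and ACLs set on each layer root continue to apply to every path the function produces.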

Key Interview Points

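HNS cannot be disabled once enabled: An existing account without HNS can be upgraded in place, but the upgrade is one-way and requires that the account first meet the upgrade prerequisites (some Blob features block it). An account with HNS enabled can never revert to a flat namespace.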
NFS 3.0 support: Storage accounts with HNS can be mounted by Linux clients over NFS 3.0, making the same data usable as a file system for on-premises or VM workloads alongside analytics use.
Not compatible with all Blob features: Some Blob Storage features are unsupported or limited when HNS is enabled (for example, blob index tags and point-in-time restore). Check the feature support matrix before enabling HNS.
ACL inheritance: Default ACLs are applied to new objects created under a directory, not to existing objects. To change ACLs retroactively on an existing tree, use az storage fs access set-recursive (or update-recursive to merge entries instead of replacing them).
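Delta Lake and ADLS Gen2: Delta Lake commits by writing transaction log files that a concurrent writer must never overwrite, and on ADLS Gen2 it relies on the atomic rename that HNS provides to guarantee this. The Delta Lake documentation also lists Blob Storage without HNS as a supported store, so HNS is the recommended setup on Azure rather than a strict requirement.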
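The ACL inheritance point is the one most often misunderstood, so here is a toy model of it. This is a deliberately simplified sketch with hypothetical names: it ignores umask, masks, and the owner entries that real POSIX ACLs include, and keeps only the timing behaviour.

```python
class Directory:
    """Toy directory whose children copy the parent's default ACL on creation."""

    def __init__(self, name, access_acl=None, default_acl=None):
        self.name = name
        self.access_acl = dict(access_acl or {})
        self.default_acl = dict(default_acl or {})
        self.children = {}

    def mkdir(self, name):
        """Create a child; it snapshots the default ACL as it is right now."""
        child = Directory(name, access_acl=self.default_acl, default_acl=self.default_acl)
        self.children[name] = child
        return child

    def set_default_acl(self, acl):
        """Affects only children created after this call."""
        self.default_acl = dict(acl)

    def set_recursive(self, acl):
        """Push an ACL onto this directory and every existing descendant."""
        self.access_acl = dict(acl)
        self.default_acl = dict(acl)
        for child in self.children.values():
            child.set_recursive(acl)


cdr = Directory("cdr", default_acl={"data-engineers": "r-x"})
june = cdr.mkdir("2024-06")

# Analysts are added later: existing children do not pick this up.
new_acl = {"data-engineers": "r-x", "analysts": "r-x"}
cdr.set_default_acl(new_acl)
july = cdr.mkdir("2024-07")

print(june.access_acl)  # {'data-engineers': 'r-x'}
print(july.access_acl)  # {'data-engineers': 'r-x', 'analysts': 'r-x'}

# A recursive update is what brings the older partition in line.
cdr.set_recursive(new_acl)
print(june.access_acl)  # {'data-engineers': 'r-x', 'analysts': 'r-x'}
```

The first two prints show why a new default ACL alone leaves older partitions unreadable to the added group. The recursive call plays the role of the set-recursive command.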

Best Practices

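Decide on HNS when you create the storage account. An in-place upgrade exists for accounts created without it, but it is one-way and constrained by feature compatibility, so retrofitting costs more than enabling HNS up front.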
Use the .dfs.core.windows.net endpoint for analytics tools to get the full benefit of HNS directory operation performance.
Apply POSIX ACLs at the directory level with default ACLs so new partitions inherit correct permissions automatically.
Separate storage accounts per layer (bronze, silver, gold) rather than containers in one account — this gives independent access control, lifecycle policies, and billing visibility.
Integrate ADLS Gen2 with Microsoft Purview for data lineage and cataloguing; Purview can scan ADLS Gen2 and automatically classify sensitive columns.