AWS Glue: Serverless ETL Service That Crawls, Catalogues, and Transforms Data

Data engineering teams often spend more time moving data than doing anything useful with it. A schema changes upstream and the pipeline breaks. A Spark cluster sits idle between nightly runs. Someone maintains schema definitions by hand in a wiki nobody updates. AWS Glue targets each of these problems: automated schema discovery through crawlers, a shared metadata catalogue, and on-demand Spark clusters that exist only while a job runs.

What Glue Is — and What It Is Not

Glue is a managed ETL service running on Apache Spark. You write the transformation logic in Python or Scala. Glue provisions the cluster, runs the job, and shuts everything down when the job finishes. You pay only for the compute time consumed.

Glue is primarily batch-oriented. It offers a streaming job type, but it does not replace Kinesis or MSK as the streaming transport. Its strength is scheduled or event-triggered jobs that reshape and move large datasets between storage systems.

Data Sources
    |
    +--- S3 (CSV, Parquet, ORC, JSON, Avro)
    +--- RDS / Aurora (via JDBC)
    +--- Redshift (via JDBC or connector)
    +--- DynamoDB
    +--- MongoDB (via Atlas connector)
    v
[Glue Crawler] ──────────> [Glue Data Catalog]
                                    |
                                    v
                           [Glue ETL Job]  <── Python / Scala script
                                    |
                                    v
                           Data Targets
                                    |
                                    +--- S3 (Parquet, ORC)
                                    +--- Redshift
                                    +--- RDS
                                    +--- DynamoDB

The Data Catalogue: One Schema Registry for Everything

The Glue Data Catalogue is a managed Hive-compatible metadata store. It holds database and table definitions — column names, data types, partition keys, and the SerDe (serialisation / deserialisation) information that tells downstream services how to read the underlying format.

Athena, EMR, and Redshift Spectrum all use the Glue Data Catalogue natively. A table registered by a Glue crawler is immediately queryable in Athena with no additional configuration. Define it once; query it everywhere.

Each table record in the catalogue stores:

Data location (S3 prefix or JDBC connection string)
Schema (column names, types)
Partition keys and partition values discovered by crawlers
SerDe information (e.g., org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe)
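
As a rough illustration, the fields listed above can be written out in the shape that the boto3 Glue client accepts as TableInput for create_table. The helper name build_table_input, the bucket path, and the column list are assumptions made for this sketch; it only assembles a dictionary and does not call AWS.

```python
def build_table_input(name, location, columns, partition_keys):
    """Assemble a TableInput-style dict for a Parquet table stored on S3."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        # Partition keys are declared separately from the regular columns
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Hypothetical table: names, path, and columns are placeholders
table_input = build_table_input(
    name="clean_transactions",
    location="s3://my-bucket/processed/transactions/",
    columns=[("txn_id", "string"), ("amount", "double")],
    partition_keys=[("txn_date", "string")],
)

# With credentials configured, the dict could be registered like this:
# boto3.client("glue").create_table(DatabaseName="retail_db", TableInput=table_input)
```

In practice a crawler writes this record for you; building it by hand is mainly useful when the schema is known in advance and a crawl would be wasted work.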

Crawlers: Automated Schema Discovery

A crawler connects to a data source, samples the data, infers the schema, and writes table definitions to the Data Catalogue. It supports S3, JDBC databases, DynamoDB, and document databases via connectors.

Crawler Run
    |
    +-- Connect to data store
    +-- Sample files or rows
    +-- Infer schema (names, types)
    +-- Compare with existing Catalog table
         |
         +-- Table missing  --> Create new table definition
         +-- Schema changed --> Update existing table definition
         +-- No change      --> Skip (no write)
         +-- New partition  --> Add partition metadata

When a crawler processes an S3 prefix with Hive-style partition paths (year=2024/month=06/), it recognises the partition structure and registers partition metadata rather than creating a separate table per folder.
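
The decision tree and the partition handling can be approximated in a few lines of plain Python. This is a simplified model for intuition only, not Glue's actual algorithm; the function names and the sample object keys are invented for the sketch.

```python
def parse_partitions(key):
    """Extract Hive-style name=value segments from an S3 object key, in path order."""
    parts = []
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            parts.append((name, value))
    return tuple(parts)


def plan_table_action(existing_schema, inferred_schema):
    """Mirror the create / update / skip branches of the crawler diagram."""
    if existing_schema is None:
        return "create"
    if existing_schema != inferred_schema:
        return "update"
    return "skip"


def new_partitions(known, object_keys):
    """Return partitions seen in the object keys that are not yet registered."""
    discovered = {parse_partitions(k) for k in object_keys}
    discovered.discard(())  # keys without any partition segments
    return discovered - known


keys = [
    "sales/year=2024/month=05/part-0000.parquet",
    "sales/year=2024/month=06/part-0000.parquet",
    "sales/year=2024/month=06/part-0001.parquet",
]
known = {(("year", "2024"), ("month", "05"))}

action = plan_table_action({"txn_id": "string"}, {"txn_id": "string"})
added = new_partitions(known, keys)
```

Here the schema is unchanged, so the table is skipped, while the month=06 folder is reported once as a new partition even though it holds two files.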

Running crawlers efficiently: Running a crawler on a fixed schedule against a large S3 prefix is expensive, because each run lists the prefix again whether or not anything changed. A better approach is event-driven: S3 publishes an object-created event to EventBridge, and a rule invokes a target such as a Lambda function or a Glue workflow that starts the crawler only when new data lands. The recrawl policy can also be set to crawl only folders added since the last run, which is much faster for large prefixes.
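
One way to wire the event-driven start is a small Lambda function invoked by the EventBridge rule. The sketch below assumes that setup; the crawler name is a placeholder, and the Glue client is passed in so the logic can be exercised without AWS credentials.

```python
def start_crawler_if_idle(glue, crawler_name):
    """Start the crawler unless a run is already in progress.

    `glue` is expected to behave like boto3.client("glue").
    Returns True if a run was started, False otherwise.
    """
    state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
    if state != "READY":
        # RUNNING or STOPPING: a later event will pick up the new files
        return False
    # A concurrent start between the check and this call can still fail;
    # production code would also handle that error.
    glue.start_crawler(Name=crawler_name)
    return True


def lambda_handler(event, context):
    """Entry point for the Lambda function; the crawler name is hypothetical."""
    import boto3  # imported lazily so the helper above has no AWS dependency

    started = start_crawler_if_idle(boto3.client("glue"), "raw-transactions-crawler")
    return {"started": started}
```

Checking the state first keeps a burst of file arrivals from producing a burst of failed start requests.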

Glue ETL Jobs: Serverless Spark

A Glue job is a Spark application. You provide the script; Glue manages the cluster. Compute is measured in DPUs (Data Processing Units). Each DPU provides 4 vCPUs and 16 GB of memory. Glue allocates Spark executors across those DPUs automatically.

Glue introduces the DynamicFrame, a DataFrame-like abstraction in which each record is self-describing, so it tolerates inconsistent schemas. A column that is a string in some rows and an integer in others does not crash a DynamicFrame read; Glue records it as a choice type that you can resolve later with resolveChoice. For well-structured data, convert to a standard Spark DataFrame with toDF() to access the full Spark SQL API.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Reuse the SparkContext if one is already running, otherwise create it
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read from Glue Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="retail_db",
    table_name="raw_transactions"
)

# Convert to DataFrame for complex transforms
df = source.toDF()
# Keep positive amounts only, then keep one row per transaction ID
df_clean = df.filter(df["amount"] > 0).dropDuplicates(["txn_id"])

# Write Parquet to S3, partitioned by date.
# With Spark's default settings, mode("overwrite") replaces everything under the target path.
df_clean.write.mode("overwrite").partitionBy("txn_date") \
    .parquet("s3://my-bucket/processed/transactions/")
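
The script above converts straight to a DataFrame, which works when column types are consistent. When a column arrives with mixed types, a common pattern is to resolve it on the DynamicFrame first. The helper below is a minimal sketch of that step; its name and defaults are assumptions, and it only relies on the frame having resolveChoice and toDF methods.

```python
def to_clean_dataframe(dynamic_frame, column="amount", target_type="double"):
    """Cast a mixed-type column to one type, then convert to a Spark DataFrame.

    `dynamic_frame` is expected to be an awsglue DynamicFrame. The spec
    (column, "cast:<type>") asks Glue to cast every value in that column.
    """
    resolved = dynamic_frame.resolveChoice(specs=[(column, f"cast:{target_type}")])
    return resolved.toDF()


# Inside a Glue job this would be used as:
# df = to_clean_dataframe(source)
```

Casting is only one way to resolve a mixed column; whether it is the right one depends on what the unexpected values mean in the source system.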

Job types available in Glue:

Spark: Full distributed Spark. For large-scale transformations.
Spark Streaming: Continuous processing from Kinesis or Kafka.
Python Shell: Single-node Python, no Spark. For lightweight tasks like API calls or small file operations. Cheaper and faster to start.
Ray: Distributed Python for ML preprocessing workloads.
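
When a job is created through the API, the job type is selected by the Name field of the job's Command. The sketch below maps the four types to those command names; the dictionary keys, the helper name, and the script path are invented for illustration, and a real create_job call needs further settings such as the IAM role.

```python
# Command names used by the Glue API for each job type
JOB_COMMAND_NAMES = {
    "spark": "glueetl",
    "spark_streaming": "gluestreaming",
    "python_shell": "pythonshell",
    "ray": "glueray",
}


def job_command(job_type, script_location):
    """Build the Command dict for a job of the given type."""
    if job_type not in JOB_COMMAND_NAMES:
        raise ValueError(f"unknown job type: {job_type}")
    return {"Name": JOB_COMMAND_NAMES[job_type], "ScriptLocation": script_location}


# Hypothetical script path
command = job_command("python_shell", "s3://my-bucket/scripts/call_api.py")
```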

Job Bookmarks: Incremental Processing Without Custom State

Without bookmarks, every Glue job re-reads the entire dataset. For a growing S3 bucket, that means re-processing files that were already transformed — a waste of compute and time.

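Job bookmarks track the state of each data source after a successful run. On the next execution, Glue processes only data that arrived after the previous bookmark. For S3, Glue tracks which files have been processed based on file metadata. For JDBC sources, you designate a column (typically a timestamp or auto-increment ID) and Glue queries only rows with values greater than the last bookmark. Bookmarks take effect only when the job runs with --job-bookmark-option set to job-bookmark-enable, each source read passes a transformation_ctx, and the script calls job.init() at the start and job.commit() at the end.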
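
The JDBC case can be modelled in plain Python to show the idea. This is a simulation for intuition, not how Glue stores or applies bookmarks internally; the function name and the sample rows are made up.

```python
def read_incremental(rows, bookmark_column, last_bookmark):
    """Return rows newer than the bookmark, plus the bookmark for the next run."""
    new_rows = [
        row for row in rows
        if last_bookmark is None or row[bookmark_column] > last_bookmark
    ]
    # If nothing new arrived, the bookmark stays where it was
    next_bookmark = max((row[bookmark_column] for row in new_rows), default=last_bookmark)
    return new_rows, next_bookmark


table = [{"id": 1}, {"id": 2}, {"id": 3}]

# First run: no bookmark yet, so every row is processed
first_batch, bookmark = read_incremental(table, "id", None)

# Two rows are inserted before the second run
table += [{"id": 4}, {"id": 5}]
second_batch, bookmark = read_incremental(table, "id", bookmark)

# Third run with no new data: nothing to process
third_batch, bookmark = read_incremental(table, "id", bookmark)
```

The model also shows the limitation discussed in this section: a row that is updated in place keeps its old id, so it never appears in a later batch.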

Use bookmarks when:

Source data is append-only (new files arrive in S3, new rows are inserted)
You want incremental ETL without building custom state management

Avoid bookmarks when:

Source records are updated in-place (you need to re-read existing rows)
You are running a historical backfill

To reset a bookmark via CLI:

aws glue reset-job-bookmark --job-name my-etl-job

Glue Studio: Visual Job Authoring

Glue Studio is a drag-and-drop canvas for building Glue jobs. You connect source nodes, transformation nodes, and target nodes visually. Studio generates the PySpark script behind the scenes. The script is editable, but editing it converts the job to script-only mode and the visual canvas is no longer available for that job, so finish the visual design first or clone the job before adding custom logic.

Studio is useful for prototyping transformations, building jobs without Spark experience, and visually reviewing existing job data flows before modifying them. Generated scripts are exportable and version-controllable.

Integration With Athena

Athena uses the Glue Data Catalogue as its default metastore. Tables registered by crawlers are immediately available for SQL queries in Athena.

A common pipeline pattern:

S3 (raw JSON events)
        |
   [Crawler] ──> Data Catalog (raw_events table)
                       |
                   [Athena] <── ad-hoc exploration on raw data
                       |
              [Glue ETL Job]
              (clean, deduplicate, convert to Parquet)
                       |
       S3 (Parquet, partitioned by date)
                       |
               [Crawler] ──> Data Catalog (clean_events table)
                       |
                   [Athena] <── production queries on clean data
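
The last step of the pipeline, a production query on the clean table, could be submitted through the Athena API. The sketch below builds the arguments for boto3's start_query_execution; the database name, the event_date partition column, and the results location are assumptions, and nothing is sent to AWS.

```python
from datetime import date


def daily_count_request(database, table, event_date, output_location):
    """Build start_query_execution arguments for a one-day event count."""
    # Parsing the date rejects malformed input before it reaches the SQL text
    day = date.fromisoformat(event_date).isoformat()
    query = f"SELECT COUNT(*) AS events FROM {table} WHERE event_date = '{day}'"
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = daily_count_request(
    database="events_db",                       # hypothetical database name
    table="clean_events",
    event_date="2024-06-01",
    output_location="s3://my-bucket/athena-results/",  # hypothetical location
)

# With credentials configured:
# boto3.client("athena").start_query_execution(**request)
```

Filtering on the partition column lets Athena read only the matching date folder rather than the whole dataset.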

Real-World Scenario: Multi-Region Retail Pipeline

A retailer receives daily sales files from five regional systems, each in a slightly different CSV format with inconsistent column names. The data team uses Glue to:

Trigger a crawler when new files arrive in S3 (via EventBridge, not on a fixed schedule)
Run a Glue job that reads all five tables, standardises column names with ApplyMapping, filters test transactions, and writes a unified Parquet dataset partitioned by region and date
Query the Parquet dataset in Athena for daily reporting

Before Glue, this required a self-managed EMR cluster running around the clock and a manually maintained schema registry. With Glue, the cluster exists for about 12 minutes per night, and the catalogue updates automatically when a regional system adds a column.
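
The column standardisation step can be sketched as a lookup table that is turned into ApplyMapping tuples of the form (source column, source type, target column, target type). The region names, source column names, and unified schema below are invented, and the sketch assumes every CSV column is catalogued as a string.

```python
# Target schema shared by all regions (hypothetical)
UNIFIED_SCHEMA = {"txn_id": "string", "amount": "double", "txn_date": "string"}

# Per-region source column -> unified column (hypothetical names)
REGION_COLUMNS = {
    "emea": {"TxnID": "txn_id", "Amt": "amount", "Date": "txn_date"},
    "apac": {"transaction_id": "txn_id", "total": "amount", "sale_date": "txn_date"},
}


def build_mappings(region):
    """Build ApplyMapping tuples for one region's table."""
    return [
        (source, "string", target, UNIFIED_SCHEMA[target])
        for source, target in REGION_COLUMNS[region].items()
    ]


def standardise(frame, region):
    """Apply the mapping to a DynamicFrame; runs only inside a Glue job."""
    from awsglue.transforms import ApplyMapping  # available in the Glue runtime

    return ApplyMapping.apply(frame=frame, mappings=build_mappings(region))


emea_mappings = build_mappings("emea")
```

Keeping the aliases in data rather than in code means that a new column in one regional system becomes a one-line change to the table.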

Sizing and Cost Considerations

DPU Sizing Reference:
  Small job (<10 GB data)       -->  2-5 DPUs
  Medium job (10-100 GB)        -->  5-20 DPUs
  Large job (100 GB - 1 TB)     -->  20-50 DPUs
  Very large (>1 TB)            -->  50+ DPUs

Worker Types:
  Standard:  1 DPU per worker     (4 vCPU, 16 GB), 2 executors per worker
  G.1X:      1 DPU per worker     (4 vCPU, 16 GB), 1 executor per worker -- memory-intensive joins
  G.2X:      2 DPUs per worker    (8 vCPU, 32 GB), 1 executor per worker -- very large aggregations
  G.025X:    0.25 DPU per worker  (2 vCPU, 4 GB) -- low-volume streaming jobs only
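
Since billing is by DPU time, a rough cost estimate is workers x DPUs per worker x hours x price per DPU-hour. The sketch below does that arithmetic. The price of 0.44 USD per DPU-hour is an assumption for illustration and varies by region, and the sketch ignores minimum billing durations.

```python
# DPUs consumed by one worker of each type
DPU_PER_WORKER = {"G.025X": 0.25, "G.1X": 1, "G.2X": 2}

ASSUMED_PRICE_PER_DPU_HOUR = 0.44  # assumption; check current pricing for your region


def estimate_job_cost(worker_type, workers, minutes, price=ASSUMED_PRICE_PER_DPU_HOUR):
    """Estimate the cost in USD of one job run."""
    dpu_hours = workers * DPU_PER_WORKER[worker_type] * minutes / 60
    return dpu_hours * price


# A 12-minute nightly run on ten G.1X workers, as in the retail scenario
nightly = estimate_job_cost("G.1X", workers=10, minutes=12)
monthly = nightly * 30
```

Under these assumptions the nightly run uses 2 DPU-hours, which makes the comparison with an always-on cluster easy to quantify.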

Monitor jobs using the Spark UI. Enable it by setting --enable-spark-ui and --spark-event-logs-path job parameters. Glue streams Spark event logs to S3 and hosts a Spark History Server viewable from the Glue console.
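
The two parameters can be supplied as default arguments on the job. A minimal sketch, assuming a placeholder log path; the helper name is invented and the function only builds the dictionary.

```python
def spark_ui_arguments(event_logs_path):
    """Return job arguments that enable Spark UI event logging."""
    if not event_logs_path.startswith("s3://"):
        raise ValueError("event logs path must be an S3 URI")
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_logs_path,
    }


# Hypothetical bucket and prefix
arguments = spark_ui_arguments("s3://my-bucket/spark-event-logs/")

# These would be passed as DefaultArguments when creating or updating the job.
```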

Common Interview Questions

What is the difference between a Glue job and an EMR job? Both run Spark. Glue is fully managed and serverless — no cluster to provision or terminate. EMR gives full control over instance types, Spark configuration, and installed libraries. Glue is simpler to operate; EMR is more flexible and typically cheaper for continuous, long-running workloads.

How does Glue handle schema evolution? Crawlers detect schema changes and update the Data Catalogue. Glue ETL jobs using DynamicFrames handle inconsistent schemas at the row level. For strict schema enforcement, convert to a DataFrame and apply a defined schema using StructType.

What is the maximum Glue job runtime? The limit is the job's timeout, which is 48 hours (2,880 minutes) by default and is configurable per job. Set a shorter timeout to prevent runaway jobs from consuming budget undetected.
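
A shorter timeout can be set on the job or passed per run; the API expresses it in minutes. The sketch below builds arguments for boto3's start_job_run and, as a design choice of the sketch rather than an API rule, rejects any value looser than the 48-hour figure quoted above.

```python
LONGEST_ALLOWED_MINUTES = 48 * 60  # the 48-hour figure from the text


def job_run_request(job_name, timeout_minutes):
    """Build start_job_run arguments with an explicit timeout in minutes."""
    if not 0 < timeout_minutes <= LONGEST_ALLOWED_MINUTES:
        raise ValueError("timeout must be between 1 and 2880 minutes")
    return {"JobName": job_name, "Timeout": timeout_minutes}


# A job that normally takes 12 minutes gets a generous one-hour ceiling
run_request = job_run_request("my-etl-job", timeout_minutes=60)

# With credentials configured:
# boto3.client("glue").start_job_run(**run_request)
```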