AWS Glue Components: Crawlers, Classifiers, Jobs, Triggers, and Workflows

AWS Glue is described as a serverless ETL service, which is accurate but understates its complexity. There are eight distinct components, each with its own configuration surface and runtime behaviour. Understanding each independently — and knowing how they interact — is the difference between a Glue environment that works reliably and one that produces difficult-to-diagnose failures.

Component Map

+------------------+       +----------------------+       +------------------+
|   Data Sources   |       |     Glue Crawler     |       |   Data Catalog   |
| (S3, JDBC, etc.) +-----> | + Classifiers        +-----> | Databases        |
+------------------+       | + Connections        |       | Tables           |
                           +----------------------+       | Partitions       |
                                                          +--------+---------+
                                                                   |
                           +----------------------+                |
                           |     Glue ETL Job     | <--------------+
                           |  (Spark / PySpark)   |
                           +----------+-----------+
                                      |
                           +----------+-----------+
                           |     Data Targets     |
                           | (S3, Redshift, RDS)  |
                           +----------------------+

Orchestration:
  [Triggers] ──> [Workflows] ──> chain of Crawlers + Jobs

1. The Glue Data Catalog

The Data Catalog is the central metadata store for the entire Glue ecosystem. It is a managed, Hive-compatible metastore, so the familiar concepts apply: databases, tables, columns, partitions, and SerDes. Athena, EMR, and Redshift Spectrum can all read table definitions from it, which makes it a natural single source of table schemas across your analytics stack.

Objects in the catalogue:

Database: A logical namespace for tables. It can optionally record a location, such as an S3 prefix, but it is a grouping in the catalogue rather than a physical store.
Table: A schema definition — column names, types, partition keys, and the data location.
Partition: A subset of a table defined by partition column values (e.g., year=2024/month=06).
Connection: Credentials and network configuration for JDBC or other non-S3 data sources.

The catalogue is regional: each AWS account has one catalogue per Region. Resources can be shared across accounts using AWS Lake Formation or AWS Resource Access Manager.
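
To make these objects concrete, the sketch below builds the kind of table definition the catalogue holds, in the shape boto3's glue.create_table accepts as TableInput. The helper parquet_table_input, the bucket, the database name, and the columns are all hypothetical and exist only for illustration.

```python
def parquet_table_input(name, location, columns, partition_keys):
    """Build a TableInput dict for an external Parquet table.

    columns and partition_keys are lists of (name, hive_type) tuples.
    """
    def as_cols(pairs):
        return [{"Name": n, "Type": t} for n, t in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_cols(partition_keys),
        "StorageDescriptor": {
            "Columns": as_cols(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


orders = parquet_table_input(
    name="orders",                            # hypothetical table
    location="s3://example-bucket/orders/",   # hypothetical bucket
    columns=[("order_id", "bigint"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)

# With AWS credentials configured, registering it would look like:
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="sales_db", TableInput=orders)
```

Partition columns live in PartitionKeys rather than in the column list, which mirrors how Hive separates data columns from partition columns.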

2. Crawlers

A crawler populates the Data Catalog from the data itself, without manual schema entry. You point it at a data store and it discovers the schema, then writes or updates table definitions.

Crawler starts
     |
     +-- List objects in target S3 prefix (or connect to JDBC source)
     |
     +-- Sample a subset of files or rows
     |
     +-- Infer schema from samples
     |
     +-- Group files by inferred schema
     |
     +-- Compare against existing Catalog tables
     |       |
     |       +-- New schema      --> Create table
     |       +-- Schema changed  --> Update table
     |       +-- Same schema     --> No write
     |
     +-- Record partition metadata (if Hive-style paths exist)

Crawlers treat each top-level folder as a separate table by default. If your S3 layout includes Hive-style partition directories (s3://bucket/orders/year=2024/month=06/), the crawler recognises them and registers partition metadata rather than creating a new table per folder.
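
As a rough illustration of what "recognising Hive-style partitions" means, the sketch below splits an object path into a table prefix and partition values. It is a simplified stand-in written for this article, not the crawler's actual algorithm, and the function name parse_hive_partitions is hypothetical.

```python
from urllib.parse import urlparse


def parse_hive_partitions(s3_uri):
    """Split an S3 object URI into (table_prefix, partition_dict).

    Path segments of the form key=value are treated as partition
    columns; everything before the first such segment is the table root.
    """
    parsed = urlparse(s3_uri)
    segments = [s for s in parsed.path.split("/") if s]
    table_parts, partitions = [], {}
    for segment in segments[:-1]:          # the last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
        elif not partitions:
            table_parts.append(segment)
    table_prefix = f"s3://{parsed.netloc}/" + "/".join(table_parts) + "/"
    return table_prefix, partitions


prefix, parts = parse_hive_partitions(
    "s3://bucket/orders/year=2024/month=06/part-0000.parquet"
)
print(prefix)  # s3://bucket/orders/
print(parts)   # {'year': '2024', 'month': '06'}
```

Every file under s3://bucket/orders/ therefore maps to one table, with year and month recorded as partition values instead of separate tables.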

Exclusion patterns: Glob-style patterns like **/_temporary/** or **.log tell the crawler to skip matching paths. Useful for avoiding partial uploads or temp files that share your data prefix.

Recrawl policy: The default is to recrawl everything. Setting the policy to crawl new folders only is significantly faster for large S3 prefixes, because the crawler processes only folders added since the last run. Changes inside folders it has already crawled are not picked up in this mode.
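
The two settings above can be expressed together in a crawler definition. The sketch below builds keyword arguments for boto3's glue.create_crawler; the helper crawler_request, the role ARN, the database, and the bucket are hypothetical. Setting both schema-change behaviours to LOG for incremental crawls reflects my understanding of what Glue expects in that mode, so treat it as an assumption to verify against the API reference.

```python
def crawler_request(name, role_arn, database, s3_path, exclusions, incremental=True):
    """Build keyword arguments for boto3's glue.create_crawler."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}]},
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY" if incremental else "CRAWL_EVERYTHING"
        },
    }
    if incremental:
        # Assumption: incremental crawls leave existing tables untouched,
        # so both schema-change behaviours are set to LOG.
        request["SchemaChangePolicy"] = {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"}
    return request


req = crawler_request(
    name="ingest-raw-orders",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",  # hypothetical
    database="sales_db",                                               # hypothetical
    s3_path="s3://example-bucket/orders/",                             # hypothetical
    exclusions=["**/_temporary/**", "**.log"],
)

# With credentials configured: boto3.client("glue").create_crawler(**req)
```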

3. Classifiers

Classifiers determine the file format and schema during a crawler run. Glue has built-in classifiers for JSON, CSV, Parquet, ORC, Avro, XML, and several common log formats (Apache access logs, for example). Custom classifiers attached to the crawler run first, in the order you list them; if none recognises the data with full certainty, the built-in classifiers are tried, and the best match defines the table's format.

Custom classifiers cover cases the built-ins cannot handle:

Grok patterns: For text-based logs with known structures (similar to Logstash grok syntax)
XML classifiers: A row tag naming the element that wraps each record in an XML document
JSON classifiers: JSONPath expressions for JSON documents with a specific root structure
CSV classifiers: Custom delimiter, quote character, and header configuration

A typical use case: a vendor delivers pipe-delimited files with no header row and a .txt extension. The built-in CSV classifier may misread the layout, for instance by inferring the wrong delimiter or treating the first data row as a header. A custom CSV classifier specifies pipe as the delimiter and provides column names, solving the problem without touching the source files.
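
A custom classifier for that vendor feed might be declared as below. This is a sketch of the arguments for boto3's glue.create_classifier; the helper pipe_csv_classifier, the classifier name, and the column names are hypothetical.

```python
def pipe_csv_classifier(name, column_names):
    """Build keyword arguments for boto3's glue.create_classifier."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": "|",
            "QuoteSymbol": '"',
            "ContainsHeader": "ABSENT",     # the files have no header row
            "Header": list(column_names),   # so column names are supplied here
        }
    }


classifier = pipe_csv_classifier(
    "vendor-pipe-delimited",                       # hypothetical name
    ["order_id", "customer_id", "amount", "ts"],   # hypothetical columns
)

# With credentials configured:
#   boto3.client("glue").create_classifier(**classifier)
# The crawler then references the classifier by name in its Classifiers list.
```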

4. Connections

A Glue Connection stores the network and credential configuration for a data source or target that is not S3. JDBC databases, Redshift, Kafka, MongoDB, and Snowflake all require a connection.

A connection stores:

Connection type (JDBC, Kafka, MongoDB, Network, etc.)
JDBC URL or endpoint
Credential reference (ideally an AWS Secrets Manager secret, so the username and password are not stored in the connection itself)
VPC configuration (subnet, security group) required for private network targets

Glue ETL Job
     |
     +-- Connection: prod-mysql
          |
          +-- VPC:            vpc-abc123
          +-- Subnet:         subnet-xyz789
          +-- Security Group: sg-glue-outbound
          +-- Secret ARN:     arn:aws:secretsmanager:.../mysql-creds
          +-- JDBC URL:       jdbc:mysql://db.internal:3306/salesdb

When an RDS instance is in a private subnet, the Glue job attaches an elastic network interface into that subnet using the connection configuration. The database security group must allow inbound traffic from the Glue security group.

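A common failure mode: the connection test in the console passes but the job fails. The test may use a different network path than the job executor. Verify that the subnet has a route to the database, that the database security group admits the Glue security group on the database port, and that the Glue security group has a self-referencing inbound rule so the job's workers can reach each other.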
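
The prod-mysql connection from the diagram could be described as follows. The sketch builds the ConnectionInput structure for boto3's glue.create_connection; the helper name, the availability zone, and the use of a bare secret name (the full ARN is elided in the diagram) are assumptions for illustration.

```python
def jdbc_connection_input(name, jdbc_url, secret_id, subnet_id, security_group_ids, az):
    """Build the ConnectionInput dict for boto3's glue.create_connection."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_id,   # Secrets Manager secret holding the credentials
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": az,
        },
    }


conn = jdbc_connection_input(
    name="prod-mysql",
    jdbc_url="jdbc:mysql://db.internal:3306/salesdb",
    secret_id="mysql-creds",                  # assumed secret name
    subnet_id="subnet-xyz789",
    security_group_ids=["sg-glue-outbound"],
    az="eu-west-1a",                          # hypothetical availability zone
)

# With credentials configured:
#   boto3.client("glue").create_connection(ConnectionInput=conn)
```

Keeping the subnet and security group in the connection, rather than in each job, means every job that uses prod-mysql inherits the same network path.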

5. Glue ETL Jobs

A Glue job runs your script on managed compute; for the Spark job types, that means a Spark application whose cluster lifecycle Glue manages. Job types:

Spark: Full distributed Spark for large-scale transformations (PySpark or Scala)
Spark Streaming: Near-real-time processing from Kinesis or Kafka
Python Shell: Single-node Python, no Spark overhead — for lightweight tasks like API calls, small file moves, or quick validation checks. Cheaper and faster to start than a full Spark job.
Ray: Distributed Python for ML preprocessing at scale

Job parameters are passed as command-line arguments and read inside the script with getResolvedOptions. This makes a single job script reusable across different tables and environments:

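import sys

from awsglue.utils import getResolvedOptions  # available inside the Glue job runtime

# Each name is supplied to the job run as a "--name value" pair,
# e.g. --source_table orders --run_date 2024-06-01
args = getResolvedOptions(sys.argv, ['source_table', 'target_path', 'run_date'])
source_table = args['source_table']
target_path = args['target_path']
run_date = args['run_date']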
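
Because awsglue is only importable inside the Glue runtime, it can help to have a small local stand-in when unit-testing a job script. The sketch below is one possible stand-in built on argparse; resolve_options is a hypothetical helper, not part of the Glue library, and it only approximates the behaviour of getResolvedOptions.

```python
import argparse


def resolve_options(argv, option_names):
    """Local stand-in for getResolvedOptions, for tests outside Glue.

    Reads "--name value" pairs from argv and ignores any other arguments.
    argparse exits with an error if a requested option is missing.
    """
    parser = argparse.ArgumentParser(add_help=False)
    for name in option_names:
        parser.add_argument(f"--{name}", required=True)
    known, _unknown = parser.parse_known_args(argv[1:])
    return vars(known)


argv = [
    "transform_orders.py",                # hypothetical script name
    "--source_table", "orders",
    "--target_path", "s3://example-bucket/curated/orders/",
    "--run_date", "2024-06-01",
    "--JOB_NAME", "transform-orders",     # extra argument, ignored here
]
args = resolve_options(argv, ["source_table", "target_path", "run_date"])
print(args["run_date"])  # 2024-06-01
```

When starting a run through the API, the same parameters go in the Arguments map with the leading dashes included, for example {"--source_table": "orders"}.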

Worker sizing reference (1 DPU = 4 vCPU and 16 GB of memory):

Standard worker:  4 vCPU, 16 GB per worker, 2 executors per worker
G.1X worker:      4 vCPU, 16 GB per worker (1 DPU), 1 executor per worker (memory-heavy joins)
G.2X worker:      8 vCPU, 32 GB per worker (2 DPU), 1 executor per worker (very large datasets)
G.025X worker:    2 vCPU, 4 GB per worker (0.25 DPU), streaming jobs only
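
Since Glue bills by DPU-hour, worker type and count translate directly into cost. The arithmetic below is a minimal sketch that assumes G.025X, G.1X, and G.2X workers count as 0.25, 1, and 2 DPU respectively; the price used in the example is a placeholder, not a quoted AWS rate, and minimum billing durations are ignored.

```python
# Assumed DPUs per worker for each worker type.
DPU_PER_WORKER = {"G.025X": 0.25, "G.1X": 1.0, "G.2X": 2.0}


def dpu_hours(worker_type, num_workers, runtime_minutes):
    """DPU-hours consumed by a single job run."""
    return DPU_PER_WORKER[worker_type] * num_workers * runtime_minutes / 60


def run_cost(worker_type, num_workers, runtime_minutes, price_per_dpu_hour):
    """Cost of a single job run at the given DPU-hour price."""
    return dpu_hours(worker_type, num_workers, runtime_minutes) * price_per_dpu_hour


# Ten G.2X workers running for 30 minutes: 2 DPU x 10 workers x 0.5 h.
usage = dpu_hours("G.2X", 10, 30)
cost = run_cost("G.2X", 10, 30, price_per_dpu_hour=0.44)  # placeholder price
print(usage)           # 10.0
print(f"{cost:.2f}")   # 4.40
```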

6. Triggers

Triggers control when crawlers and jobs run. Glue defines four trigger types:

Scheduled trigger: Runs on a cron expression, for example every day at 02:00 UTC.

On-demand trigger: Runs when manually started via the console, CLI, or API. Used for testing or one-off runs.

Event trigger: Fires in response to an Amazon EventBridge event, such as an object arriving in S3, and starts a workflow.

Conditional trigger: Fires when another crawler or job reaches a watched completion state: SUCCEEDED or FAILED for either, plus STOPPED or TIMEOUT for jobs and CANCELLED for crawlers. This is the building block for dependent pipelines.

[Schedule Trigger: 02:00 UTC]
          |
          v
[Crawler: ingest-raw-orders]
          |
    [on SUCCEEDED]
          v
[Job: transform-orders]
          |
    [on SUCCEEDED]
          v
[Job: load-orders-to-redshift]

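A trigger that watches upstream completion states (type CONDITIONAL in the Glue API) can watch several jobs or crawlers at once and fire when all of them (AND) or any one of them (ANY) reaches the watched state. This is useful when multiple upstream jobs feed a single downstream step. The batch condition is a different feature: it applies to triggers driven by Amazon EventBridge events and delays firing until a set number of events has arrived or a time window has elapsed.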
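
The chain in the diagram above could be declared roughly as follows. The sketch builds keyword arguments for boto3's glue.create_trigger, using the API's CONDITIONAL type for the completion-state steps; the helper conditional_trigger, the trigger names, and the workflow name are hypothetical.

```python
def conditional_trigger(name, workflow, watch, start_job, logical="AND"):
    """Build keyword arguments for boto3's glue.create_trigger.

    watch is a list of ("job" | "crawler", resource_name) pairs that must
    reach SUCCEEDED before start_job is launched.
    """
    conditions = []
    for kind, resource in watch:
        if kind == "crawler":
            conditions.append(
                {"LogicalOperator": "EQUALS", "CrawlerName": resource, "CrawlState": "SUCCEEDED"}
            )
        else:
            conditions.append(
                {"LogicalOperator": "EQUALS", "JobName": resource, "State": "SUCCEEDED"}
            )
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {"Logical": logical, "Conditions": conditions},
        "Actions": [{"JobName": start_job}],
        "StartOnCreation": True,
    }


WORKFLOW = "orders-pipeline"  # hypothetical workflow name

start = {
    "Name": "nightly-0200",
    "WorkflowName": WORKFLOW,
    "Type": "SCHEDULED",
    "Schedule": "cron(0 2 * * ? *)",   # every day at 02:00 UTC
    "Actions": [{"CrawlerName": "ingest-raw-orders"}],
    "StartOnCreation": True,
}
after_crawl = conditional_trigger(
    "after-crawl", WORKFLOW, [("crawler", "ingest-raw-orders")], "transform-orders"
)
after_transform = conditional_trigger(
    "after-transform", WORKFLOW, [("job", "transform-orders")], "load-orders-to-redshift"
)

# With credentials configured:
#   for trigger in (start, after_crawl, after_transform):
#       boto3.client("glue").create_trigger(**trigger)
```

Passing several pairs in watch with logical="AND" gives the fan-in case, where one downstream job waits for every upstream job to succeed.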

7. Workflows

A Glue Workflow is a directed graph of crawlers, jobs, and triggers that visualises and tracks a multi-step pipeline. The Glue console shows which nodes have completed, which are running, and which failed — for any workflow run in the last 90 days.

Workflow: nightly-sales-pipeline
     |
     +-- [Trigger: 01:00 UTC] --> [Crawler: raw-sales]
                                           |
                              [Trigger: on SUCCEEDED]
                                           |
                              [Job: standardise-sales]
                                           |
                              [Trigger: on SUCCEEDED]
                                           |
                              [Job: aggregate-by-region]

Workflows support parallel branches — multiple jobs can run simultaneously, and a downstream trigger can be configured to wait for all branches to complete before firing.
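
One way to reason about parallel branches is to group the nodes of the graph into waves, where every node in a wave has all of its upstream nodes finished. The sketch below does this locally with Python's graphlib; it is a model of the dependency logic, not a call into Glue, and the aggregate-by-product and publish-report nodes are hypothetical additions to the pipeline above.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def execution_waves(dependencies):
    """Group workflow nodes into waves that could run in parallel.

    dependencies maps each node to the set of nodes it waits for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)           # pretend every node in the wave SUCCEEDED
    return waves


pipeline = {
    "crawler:raw-sales": set(),
    "job:standardise-sales": {"crawler:raw-sales"},
    "job:aggregate-by-region": {"job:standardise-sales"},
    "job:aggregate-by-product": {"job:standardise-sales"},   # hypothetical branch
    "job:publish-report": {                                  # hypothetical join step
        "job:aggregate-by-region",
        "job:aggregate-by-product",
    },
}

waves = execution_waves(pipeline)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The third wave contains both aggregation jobs, and the final node only becomes ready once both are done, which is the behaviour a downstream trigger waiting on all branches provides.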

8. Glue Studio and DataBrew

Glue Studio is the visual authoring interface for ETL jobs. You drag and drop sources, transforms, and targets onto a canvas, and Studio generates PySpark code. The generated script is editable, but the conversion is one-way: once you edit the script directly, the job becomes script-only and the changes are not reflected back in the canvas.

Glue DataBrew is a separate product aimed at analysts rather than engineers. It provides 250+ pre-built transformations (pivot, unpivot, normalise, outlier detection, pattern matching) through a spreadsheet-like interface with no code required. DataBrew also profiles datasets to show column statistics, missing value rates, and distribution charts before any transformation is written.

When to use each:

Use Glue ETL job when:
  - Multi-source joins with complex business logic
  - Reusable pipeline with parameterised source/target config
  - Output to Redshift, RDS, or DynamoDB
  - Engineering team comfortable with PySpark

Use DataBrew when:
  - Data quality checks by analysts without Spark experience
  - Exploratory data cleaning with profile statistics
  - One-off transformation projects without pipeline requirements

Common Interview Questions

What is the difference between a trigger and a workflow in Glue? A trigger is a single rule (a schedule, an event, a condition on upstream runs, or a manual start) that launches one or more crawlers or jobs. A workflow is a graph of multiple jobs, crawlers, and triggers wired together into a pipeline. Workflows chain steps with triggers that fire on the completion state of upstream nodes.

When would you use a Python Shell job instead of a Spark job? Python Shell runs on a single node with no Spark overhead. Use it for tasks that do not involve distributed data processing: calling an external API, validating a small config file, sending a notification, or moving a small number of files. It starts faster and costs less than a Spark job.

What happens when a crawler detects a schema change? By default, the crawler updates the table definition in the catalogue, adding new columns and updating changed types. The schema change policy can instead be set to log the change without updating the table (for strict schema control). A separate setting covers objects that have disappeared from the data store: the crawler can mark the table as deprecated (the default), delete it from the catalogue, or only log the event.