AWS Glue Interview Questions and Answers: What Engineers Actually Get Asked
These questions reflect what data engineering and cloud architect interviews actually cover. They are grouped by topic and progress from foundational to advanced.
Fundamentals
Q1: What is AWS Glue, and what problem does it solve?
AWS Glue is a managed serverless ETL service that runs on Apache Spark. It solves the problem of managing ETL infrastructure — you do not provision or terminate Spark clusters. Glue also provides the Data Catalogue, a managed metastore shared by Athena, EMR, and Redshift Spectrum.
Q2: What is the Glue Data Catalogue?
It is a Hive-compatible managed metadata repository. It stores database and table definitions — column names, types, partition keys, and data locations. Any AWS service that supports the Hive metastore interface can use the Glue Data Catalogue directly, including Athena, EMR, and Redshift Spectrum.
Q3: What is a DPU, and how is Glue billed?
A DPU (Data Processing Unit) provides 4 vCPUs and 16 GB of memory. Glue is billed per DPU-hour: a job configured with 10 DPUs that runs for 30 minutes costs 5 DPU-hours. Python Shell jobs can use a fractional DPU (0.0625 DPU). Usage is metered per second, with a 1-minute minimum on Glue 2.0 and later.
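The arithmetic above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, the 60-second minimum is an assumption that depends on the Glue version, and the price per DPU-hour is a parameter you would take from the current pricing page for your region.

```python
def billed_dpu_hours(dpus, runtime_seconds, minimum_seconds=60.0):
    """Return the DPU-hours billed for one job run.

    Assumes per-second metering with a minimum billed duration
    (60 seconds here; treat this default as an assumption).
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * billed_seconds / 3600.0


def run_cost(dpus, runtime_seconds, price_per_dpu_hour):
    """Return the cost of one run for a caller-supplied DPU-hour price."""
    return billed_dpu_hours(dpus, runtime_seconds) * price_per_dpu_hour


# The example from the answer: 10 DPUs running for 30 minutes.
print(billed_dpu_hours(10, 30 * 60))  # 5.0
```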
Q4: What is the difference between a Glue Spark job and a Python Shell job?
A Spark job runs a distributed Spark application across multiple executors. It is designed for large-scale data processing. A Python Shell job runs on a single node with no Spark overhead — it is for lightweight tasks such as calling an API, moving a small file, or running a quick data quality check. Python Shell is significantly cheaper and starts faster.
Q5: What programming languages does Glue support for ETL jobs?
PySpark (Python 3) and Scala for Spark jobs. Python 3 for Python Shell jobs.
Crawlers
Q6: What does a Glue crawler do?
It connects to a data store, samples the data, infers the schema, and writes table definitions to the Data Catalogue. It can detect new data, schema changes, and new partitions.
Q7: What data stores can Glue crawlers access?
S3 (all common file formats), JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server, Redshift), DynamoDB, and document databases (MongoDB, Amazon DocumentDB).
Q8: What is a classifier in Glue?
A classifier determines the format and schema of data during a crawler run. Glue has built-in classifiers for JSON, CSV, Parquet, ORC, Avro, XML, and common log formats. Custom classifiers support grok patterns, JSON paths, and custom CSV configurations.
Q9: What happens when a crawler finds that a table’s schema has changed?
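By default, the crawler updates the table definition in the Data Catalogue. You can configure the schema change policy to log the change without updating the table. A separate setting controls what happens when an object disappears from the data store: delete the table, mark it as deprecated, or only log the change.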
Q10: How does a crawler handle Hive-style partitions in S3?
If your S3 paths follow the pattern s3://bucket/table/year=2024/month=06/day=15/, Glue recognises the partition keys and adds partition metadata to the catalogue rather than creating a new table for each folder.
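To make the naming convention concrete, here is a minimal sketch of pulling the key=value segments out of such a path. The helper name is hypothetical and this is not how the crawler is implemented; it only shows what "Hive-style" means.

```python
def parse_hive_partitions(s3_path):
    """Extract key=value partition segments from an S3 path, in order."""
    partitions = {}
    for segment in s3_path.rstrip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as partitions.
        if sep and key and value:
            partitions[key] = value
    return partitions


print(parse_hive_partitions("s3://bucket/table/year=2024/month=06/day=15/"))
# {'year': '2024', 'month': '06', 'day': '15'}
```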
Q11: What is a crawler’s recrawl policy?
It controls whether a crawler reprocesses previously seen data. Options are: crawl everything (default), crawl only new folders, or crawl based on S3 event notifications. Crawling only new folders is much faster for large S3 prefixes.
Q12: How do you exclude files from a crawler?
Using exclusion patterns — glob-style patterns like **/_temporary/** or **.log that tell the crawler to skip matching paths.
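The crawler settings from the last few answers (schema change policy, recrawl policy, exclusion patterns) come together in the CreateCrawler request. The sketch below only builds the request dictionary so it runs anywhere; the helper name, role ARN, and paths are hypothetical. Pairing the new-folders-only recrawl behaviour with LOG for both schema change behaviours is my understanding of what incremental crawls expect, so verify it against the current API reference.

```python
def build_crawler_request(name, role_arn, database, s3_path, exclusions):
    """Build keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}]
        },
        # Incremental crawl: only folders added since the last run.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Log schema changes and deletions instead of modifying tables.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


request = build_crawler_request(
    name="raw-orders-crawler",                              # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueCrawler",  # hypothetical
    database="sales_db",
    s3_path="s3://bucket/table/",
    exclusions=["**/_temporary/**", "**.log"],
)

# With boto3 installed and credentials configured, the request is sent with:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```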
ETL Jobs and Transformations
Q13: What is a DynamicFrame, and how does it differ from a Spark DataFrame?
A DynamicFrame is Glue’s own abstraction, similar to a Spark DataFrame, in which each record is self-describing, so no schema is required up front. It tolerates inconsistent schemas: a column can be a string in some rows and an integer in others, and Glue records this as a choice type until you resolve it. You can convert between the two using toDF() and fromDF(). For well-structured data, DataFrames and the full Spark SQL API are often more practical.
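A toy model in plain Python may help when explaining the "string in some rows, integer in others" case. It is not Glue's implementation and both helper names are hypothetical; it only mimics detecting an ambiguous column and then casting it, which in a real job would be done with ResolveChoice and a spec such as ("price", "cast:double").

```python
def column_types(records):
    """Map each column to the set of Python type names seen across records."""
    seen = {}
    for record in records:
        for column, value in record.items():
            seen.setdefault(column, set()).add(type(value).__name__)
    return seen


def cast_column(records, column, target):
    """Return new records with `column` cast to `target` where present."""
    return [
        {**r, column: target(r[column])} if column in r else dict(r)
        for r in records
    ]


rows = [{"id": 1, "price": "19.99"}, {"id": 2, "price": 5}]

# Columns observed with more than one type are the ambiguous ones.
ambiguous = {c for c, types in column_types(rows).items() if len(types) > 1}
print(ambiguous)  # {'price'}

resolved = cast_column(rows, "price", float)
```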
Q14: How do you pass parameters to a Glue job?
Job parameters are passed as --key value arguments when starting a job. Inside the script, you retrieve them with getResolvedOptions(sys.argv, ['key']).
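A short sketch of both sides of that hand-off follows. The helper name and parameter names are hypothetical. The boto3 and awsglue calls are shown as comments because they need AWS credentials or the Glue runtime.

```python
def to_job_arguments(params):
    """Prefix each parameter name with '--', the form Glue job arguments use."""
    return {f"--{name}": str(value) for name, value in params.items()}


arguments = to_job_arguments(
    {"source_path": "s3://bucket/raw/", "run_date": "2024-06-15"}
)
print(arguments)
# {'--source_path': 's3://bucket/raw/', '--run_date': '2024-06-15'}

# Starting the job (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").start_job_run(JobName="my-etl-job", Arguments=arguments)
#
# Reading the values inside the job script (Glue runtime only):
#   import sys
#   from awsglue.utils import getResolvedOptions
#   args = getResolvedOptions(sys.argv, ["source_path", "run_date"])
```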
Q15: How do you handle schema mapping in Glue?
Using the ApplyMapping transform, which lets you rename columns, change data types, and drop columns in a single declarative step.
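from awsglue.transforms import ApplyMapping

# Each mapping is (source column, source type, target column, target type).
# Columns that are not listed are dropped from the output.
mapped = ApplyMapping.apply(
    frame=source_frame,
    mappings=[
        ("old_name", "string", "new_name", "string"),
        ("revenue", "string", "revenue", "double"),
    ],
)
Q16: How do you read data from the Glue Data Catalogue in a job?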
Using create_dynamic_frame.from_catalog():
frame = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)
Q17: How do you write output in a specific format and compression?
Using getSink or write options on a DynamicFrame:
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://bucket/output/"},
    format="parquet",
    format_options={"compression": "snappy"},
)
Q18: What is the Relationalize transform in Glue?
Relationalize flattens nested JSON structures (arrays and structs) into a set of flat tables. It is useful when your source data has deeply nested objects and you need to load them into a relational target like Redshift.
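As a rough mental model, the struct half of that flattening looks like the recursive helper below. This is a plain-Python toy with a hypothetical function name, not the Relationalize transform itself, and it does not model how arrays are split out into separate tables.

```python
def flatten(record, parent=""):
    """Flatten nested dicts into a single level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested structs, carrying the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


order = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Leeds"}}}
print(flatten(order))
# {'id': 7, 'customer.name': 'Ada', 'customer.address.city': 'Leeds'}
```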
Job Bookmarks
Q19: What is a Glue job bookmark?
A bookmark is state that Glue maintains to track which data has already been processed. On each run, Glue only processes new data since the last successful bookmark. This enables incremental ETL without custom state management.
Q20: When should you not use job bookmarks?
When your source data is updated in-place (not just appended), when you are doing backfill processing, or when you need to reprocess a specific date range. In these cases, manage state yourself or disable bookmarks.
Q21: How do you reset a job bookmark?
Via the Glue console (reset the bookmark on the job) or via the CLI:
aws glue reset-job-bookmark --job-name my-etl-job
Q22: Do bookmarks work with JDBC sources?
Yes. For JDBC sources, you specify a column (typically a timestamp or auto-increment ID) as the bookmark key. Glue tracks the maximum value seen in that column and only reads rows where the value is greater on the next run.
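The bookmark-key behaviour can be modelled in a few lines. This is a simplified sketch with a hypothetical helper, not Glue's internal logic. It also shows why in-place updates are missed: a row whose key does not increase is never read again. In a real job, bookmarks additionally depend on calling job.init() and job.commit() and on setting transformation_ctx on the source.

```python
def read_incremental(rows, bookmark_key, last_value):
    """Return rows newer than the stored bookmark, plus the new bookmark."""
    new_rows = [
        r for r in rows if last_value is None or r[bookmark_key] > last_value
    ]
    # Keep the old bookmark when nothing new arrived.
    new_bookmark = max((r[bookmark_key] for r in new_rows), default=last_value)
    return new_rows, new_bookmark


table = [{"id": 1}, {"id": 2}, {"id": 3}]

first_batch, bookmark = read_incremental(table, "id", None)  # reads everything
table.append({"id": 4})
second_batch, bookmark = read_incremental(table, "id", bookmark)

print(second_batch, bookmark)  # [{'id': 4}] 4
```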
Performance Tuning
Q23: How do you tune the number of DPUs for a Glue job?
Start with a reasonable default (10 DPUs for medium jobs), then monitor the job metrics in CloudWatch and the Spark UI. Look for:
- High executor idle time: reduce DPUs
- GC pressure or spill to disk: increase DPUs or optimise shuffles
- Skewed partitions: use repartition() or a salt key
Q24: What is Glue’s worker type configuration?
You choose a worker type to define the resources each worker gets:
- Standard: 1 DPU per worker (4 vCPU, 16 GB), 2 executors per worker
- G.1X: 1 DPU per worker (4 vCPU, 16 GB), 1 executor per worker; for memory-intensive jobs
- G.2X: 2 DPUs per worker (8 vCPU, 32 GB), 1 executor per worker; for very large joins or ML workloads
- G.025X: 0.25 DPU per worker; for low-volume Streaming jobs
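Because billing is per DPU, it helps to translate a worker type and worker count into total DPUs. The lookup below is a sketch: the table covers only the types listed above, the function name is hypothetical, and the DPU weights reflect my understanding that a G.2X worker (8 vCPU, 32 GB) counts as two DPUs.

```python
# DPUs consumed by one worker of each type (only the types discussed here).
DPUS_PER_WORKER = {
    "Standard": 1,
    "G.1X": 1,
    "G.2X": 2,
    "G.025X": 0.25,
}


def total_dpus(worker_type, number_of_workers):
    """Return the DPUs a job consumes for a worker type and worker count."""
    return DPUS_PER_WORKER[worker_type] * number_of_workers


print(total_dpus("G.2X", 10))  # 20
```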
Q25: How do you handle skewed data in Glue?
Add a salt column to distribute skewed keys across multiple partitions before a join, then remove the salt after. Alternatively, use Adaptive Query Execution (AQE) by enabling spark.sql.adaptive.enabled = true in the Spark configuration.
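The salting idea can be demonstrated without Spark. In this plain-Python sketch the large side gets a rotating salt (Spark code would more often use a random one), the small side is replicated once per salt value, and the join runs on (key, salt). All names and the data are made up for illustration.

```python
from collections import Counter


def salt_large_side(rows, num_salts):
    """Attach a rotating salt to each key on the large, skewed side."""
    return [((key, i % num_salts), value) for i, (key, value) in enumerate(rows)]


def replicate_small_side(rows, num_salts):
    """Copy each small-side row once per salt so every salted key can match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def join(left, right):
    """Inner join two lists of (key, value) pairs on key."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]


events = [("hot", n) for n in range(1000)] + [("cold", n) for n in range(10)]
dims = [("hot", "H"), ("cold", "C")]

salted_events = salt_large_side(events, 8)
salted_join = join(salted_events, replicate_small_side(dims, 8))

# Drop the salt afterwards to recover the original key.
result = [(key, lv, rv) for (key, _salt), lv, rv in salted_join]

print(len(result))  # 1010
# The largest group shrinks from 1000 rows to 125 once the key is salted.
print(max(Counter(key for key, _ in salted_events).values()))  # 125
```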
Q26: What is the impact of small files on Glue job performance?
Small files cause excessive S3 API calls and metadata overhead. They slow down both reads and writes. Use coalesce() or repartition() before writing to reduce the number of output files. For existing small-file problems in S3, use a compaction job to merge small files into larger ones.
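One way to choose the argument for coalesce() or repartition() is to divide the expected output size by a target file size. The 128 MiB default below is an assumption, not a Glue setting, and the function name is hypothetical.

```python
import math


def target_file_count(total_bytes, target_file_bytes=128 * 1024**2):
    """Return how many output files give roughly target-sized files."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# 10 GiB of output at about 128 MiB per file.
print(target_file_count(10 * 1024**3))  # 80
```

The result would then be passed to df.coalesce(n) or df.repartition(n) before the write.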
Q27: How do you enable Spark UI for a Glue job?
Set the --enable-spark-ui and --spark-event-logs-path job parameters. Glue streams Spark event logs to the specified S3 path, and you can view them in a hosted Spark History Server through the Glue console.
Connections and Security
Q28: What is a Glue Connection, and when is it required?
A Connection stores the network configuration and credentials for accessing a data store such as a JDBC database, Kafka cluster, MongoDB, or Snowflake. It is generally required when the source or target is reached through such an endpoint (JDBC databases, Redshift via JDBC, Kafka, and similar); plain S3 access does not need one.
Q29: How does Glue access a database in a private VPC?
By specifying a Connection that includes the VPC, subnet, and security group. Glue attaches an elastic network interface to the subnet and routes job traffic through it. The database security group must allow inbound traffic from the Glue security group.
Q30: How do you store database credentials for Glue jobs securely?
Store them in AWS Secrets Manager. Reference the secret ARN in the Glue Connection or pass the ARN as a job parameter and retrieve it at runtime using the Secrets Manager API.
Integration
Q31: How does Glue integrate with Athena?
Athena uses the Glue Data Catalogue as its default metastore. Tables created by Glue crawlers are immediately queryable in Athena without additional configuration.
Q32: How does Glue integrate with Redshift?
Glue can read from and write to Redshift using JDBC or the Redshift-optimised connector (from_options with connection_type="redshift"). The Redshift connector uses S3 as an intermediate staging area, which is faster than pure JDBC for large data volumes.
Q33: How does Glue integrate with Lake Formation?
Lake Formation uses the Glue Data Catalogue as its metadata store. Lake Formation adds column-level and row-level access control on top of the catalogue, enforced when Glue jobs or Athena queries run under Lake Formation-governed permissions.
Q34: Can Glue process streaming data?
Yes, using Glue Streaming jobs. These are Spark Structured Streaming jobs that read from Kinesis Data Streams or Apache Kafka. They run continuously, processing micro-batches, rather than terminating after a single run.
Advanced and Design Questions
Q35: You have a Glue job that processes 10 TB of S3 data daily and is taking 4 hours. How would you optimise it?
Start by identifying the bottleneck using the Spark UI. Common optimisations:
- Convert source data to Parquet or ORC (columnar formats read faster)
- Ensure source data is partitioned by a column used in filters (e.g., date) to enable partition pruning
- Increase DPUs if executors are CPU-bound
- Avoid wide shuffles — check if joins can be broadcast joins
- Enable job bookmarks to process only new data rather than the full dataset
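For the partition pruning point, a push-down predicate lets Glue skip partitions before any files are listed. The helper below is a hypothetical sketch that builds the predicate string for year/month/day partition keys. The commented call uses the push_down_predicate argument of from_catalog and needs the Glue runtime.

```python
from datetime import date


def partition_predicate(day):
    """Build a predicate string for year/month/day partition keys."""
    return f"year='{day:%Y}' and month='{day:%m}' and day='{day:%d}'"


predicate = partition_predicate(date(2024, 6, 15))
print(predicate)
# year='2024' and month='06' and day='15'

# Inside a Glue job:
#   frame = glueContext.create_dynamic_frame.from_catalog(
#       database="sales_db",
#       table_name="raw_orders",
#       push_down_predicate=predicate,
#   )
```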
Q36: How would you design a Glue pipeline that handles schema evolution gracefully?
Use DynamicFrames for ingestion (they tolerate schema drift). Run a crawler after ingestion to update the Data Catalogue. Use ApplyMapping or ResolveChoice to handle ambiguous types. Store output in Parquet with schema evolution options enabled. Log all schema changes to CloudWatch for monitoring.
Q37: What is the difference between Glue and AWS Glue DataBrew?
Glue ETL is for writing Spark-based transformation logic in Python or Scala — suited for engineers. DataBrew is a visual, no-code data preparation tool aimed at analysts. DataBrew profiling shows column statistics, outlier detection, and missing value rates before you write any transformation.
Q38: How do you handle errors in a Glue job?
Wrap transformation logic in try/except blocks. Write failed records to a separate S3 error path rather than failing the entire job. Use CloudWatch metrics and alarms to detect elevated error rates. Enable job run notifications via EventBridge to alert on job failures.
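The "route bad records aside" pattern can be sketched in plain Python as follows. The record shape and function names are made up. In a Glue job the bad list would be written to the S3 error path and its size published as a metric.

```python
def transform(record):
    """Example transformation: require order_id and parse revenue as a number."""
    return {"order_id": record["order_id"], "revenue": float(record["revenue"])}


def process(records):
    """Split records into transformed rows and rejected rows with a reason."""
    good, bad = [], []
    for record in records:
        try:
            good.append(transform(record))
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the original record and the error for later inspection.
            bad.append({"record": record, "error": repr(exc)})
    return good, bad


good, bad = process(
    [
        {"order_id": 1, "revenue": "10.5"},
        {"order_id": 2, "revenue": "n/a"},  # not a number
        {"revenue": "3"},                   # missing order_id
    ]
)
print(len(good), len(bad))  # 1 2
```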
Q39: What are Glue Workflows, and when should you use them instead of a simple scheduled trigger?
A Workflow groups jobs, crawlers, and triggers into a single dependency graph that Glue runs and monitors as a unit. Workflows are appropriate when you have multiple dependent jobs and crawlers that must run in sequence or in parallel with dependencies between them. A simple scheduled trigger works when you have a single job that runs independently on a cron schedule.
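Dependencies inside a workflow are expressed as conditional triggers. The sketch below builds the request for one such trigger ("run the load job after the extract job succeeds"). The helper, workflow, trigger, and job names are hypothetical, and the boto3 call is left as a comment because it needs AWS credentials.

```python
def conditional_trigger(workflow, name, after_job, then_job):
    """Build keyword arguments for a Glue CreateTrigger call."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        # Fire only when the upstream job finishes successfully.
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": after_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": then_job}],
    }


trigger = conditional_trigger(
    workflow="nightly-sales",        # hypothetical
    name="load-after-extract",       # hypothetical
    after_job="extract-orders",      # hypothetical
    then_job="load-orders",          # hypothetical
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("glue").create_trigger(**trigger)
```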
Q40: What happens to the Glue Data Catalogue when you delete a table in Athena?
The table definition is removed from the Glue Data Catalogue. The underlying data in S3 is not affected. Any Glue crawlers that previously created the table will recreate it on their next run if the data still exists.