AWS Pipeline Failures
I have directed technology delivery for large-scale BFSI data platforms on AWS, overseeing engineering operations across multiple legal entities simultaneously. This analysis is drawn directly from a production incident register logged during live operations — every failure category, every root cause, every resolution. Nothing here is theoretical.
After logging and resolving 95 production incidents across roughly 68 days of operations on a large-scale banking data platform, the patterns become impossible to ignore.
Most incident post-mortems treat each failure as a standalone event. Root cause identified, fix applied, ticket closed. That process is operationally correct but analytically incomplete. When you read 95 incidents together — across an AWS infrastructure running a medallion architecture at financial-services scale — the same eight failure mechanisms appear, repeatedly, across different services, different components, and different dates.
Seven of the eight are entirely preventable. The eighth is an architectural limit that must be designed around.
This is the breakdown of all eight.
The platform context
The platform serves multiple legal entities within a large financial services group. It runs on AWS, processes data through a medallion architecture — Bronze ingestion, Silver transformation, Gold reporting — and uses Apache Spark for the bulk of heavy data processing. Orchestration runs on a batch server cluster. Real-time ingestion uses Kafka. Analytics workbenches are connected to Athena and a data lake backed by S3.
The 95 incidents span the period February through April 2026, distributed across six distinct operating environments. The failure categories below represent the clustering of all 95 incidents by underlying mechanism — not by surface symptom, not by the service that raised the alert, but by the root engineering cause.
| Category | ~% of incidents | Prevention |
|---|---|---|
| 1 — Apache Spark resource exhaustion | 37% | Config fix |
| 2 — Disk exhaustion on worker nodes | 19% | Config fix |
| 3 — Server routing drift after library updates | 13% | Process fix |
| 4 — SSL certificate expiry | 8% | Automation |
| 5 — IAM failures, KMS gaps, secret key exposure | 9% | Arch fix |
| 6 — Antivirus causing platform outage | 2% | Removal |
| 7 — Docker and container runtime incompatibility | 3% | Runbook |
| 8 — Kafka standalone architecture limits | 9% | Arch redesign |
Category one — Apache Spark resource exhaustion
37% of all incidents

This is the dominant failure mode on the platform. It presents in three distinct technical forms.
The Catalyst optimizer stack overflow
The first form is a java.lang.StackOverflowError at org.apache.spark.sql.catalyst.trees.TreeNode.clone. This is not a memory failure — it is a query complexity failure. The Spark Catalyst optimizer represents SQL query plans as recursive tree structures. When a query contains hundreds of nested CASE statements, thousands of UNION operations, or an excessively long chain of transformations, the logical plan tree grows too deep for the JVM's default thread stack to represent. The optimizer crashes before execution begins.
The resolution is to decompose complex SQL into staged intermediate materialisation steps. Write the intermediate result to a managed table, then continue the transformation from there. The query plan tree depth resets at each materialisation boundary.
-- Problematic pattern: deeply nested UNION chains
SELECT * FROM a UNION SELECT * FROM b UNION ... [n = 500+]
-- Fix: materialise intermediate results
CREATE TABLE stage_1 AS SELECT ...;
CREATE TABLE stage_2 AS SELECT ... FROM stage_1 ...;
The zombie SparkContext — IllegalStateException on a stopped context
The second and more operationally damaging Spark failure presents as java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext. The job continues to show as "Running" in the orchestrator even though the Spark session is internally dead, and all downstream work that depends on it is blocked — silently, indefinitely.
The trigger is typically deduplication via window function against a very large table. The pattern ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) — used to keep only the latest version of each record — forces a full network shuffle of every record in the partition. On a Silver-layer table with tens of millions of records receiving millions of daily delta rows, this shuffle exceeds available executor resources. The executor heartbeat is lost. The cluster manager terminates the SparkContext. The orchestrator continues to show the job as running.
A compounding bottleneck: when the same job calculates an incremental watermark with a subquery like SELECT MAX(timestamp) FROM silver_table without partition pruning, Spark scans the entire Silver table to find one value. At tens of millions of rows, this alone causes significant driver memory pressure before the shuffle begins.
Resolution: Move the job to cluster mode. Increase spark.sql.shuffle.partitions from the default of 200 to 1000 or higher for large-volume jobs. Partition the Silver table by a date column so the watermark subquery can prune to recent partitions instead of scanning the full table.
-- Before: full-table scan for watermark
WHERE timestamp > (SELECT MAX(timestamp) FROM silver.table)
-- After: partition-pruned watermark
WHERE timestamp > (SELECT MAX(timestamp)
                     FROM silver.table
                    WHERE business_date = current_date - 1)
  AND business_date = current_date
-- Spark config for high-volume jobs
SET spark.sql.shuffle.partitions = 1000;
SET spark.driver.maxResultSize = 4g;
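Moving the job to cluster mode is the other half of the fix. A minimal spark-submit sketch of that change is below; the job path, driver memory, and executor memory are illustrative assumptions, not values taken from the platform.
# Hypothetical submission in cluster mode with the higher shuffle-partition count
spark-submit \
  --deploy-mode cluster \
  --conf spark.sql.shuffle.partitions=1000 \
  --conf spark.driver.maxResultSize=4g \
  --conf spark.driver.memory=8g \
  --conf spark.executor.memory=8g \
  /app/jobs/silver_dedup_job.py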
Node loss cascading to full cluster failure
The third Spark failure form is the most severe: multiple worker nodes losing contact with the cluster Master simultaneously. In one documented cluster event, three of five worker nodes lost their heartbeat with the Master at approximately the same time during a scheduled overnight processing window. The remaining two nodes, designed to carry 40% of the workload, were forced to absorb 100% of it.
The cascade proceeded as follows. Within minutes of the node loss, the surviving executors began reporting SparkOutOfMemoryError during external sort phases. Under memory pressure, Spark began aggressively spilling intermediate shuffle data to local disk. The spill rate accelerated. The local scratch partition on the affected worker node reached 100% utilisation. A FileOutputStream write failed with java.io.IOException: No space left on device. The DAG Scheduler aborted the job after repeated task failures. Two concurrent jobs failed simultaneously at the same node.
Resolution: restart Spark Master services to re-establish worker heartbeats. Permanent remediation: add worker nodes to the cluster, allocate dedicated scratch volumes for Spark tmp directories, and remove general-purpose servers from the Spark cluster so that cluster capacity is not shared with other workloads.
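A minimal sketch of the dedicated scratch volume change, assuming a spare NVMe device and a default Spark configuration path (device name, mount point, and config path are illustrative):
# Hypothetical: dedicate a separate volume for Spark scratch space
mkfs -t xfs /dev/nvme1n1
mkdir -p /data/spark-scratch
mount /dev/nvme1n1 /data/spark-scratch
# Point Spark spill and shuffle directories at the dedicated volume
echo "spark.local.dir /data/spark-scratch" >> /opt/spark/conf/spark-defaults.conf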
Category two — Disk exhaustion on worker nodes
19% of all incidents

Disk exhaustion was the second-largest failure category. It is also the most immediately destructive — a full disk stops every process on that node, not just the offending job.
The mechanism is Spark's shuffle architecture. When executors cannot hold intermediate shuffle data in memory, they spill to the local filesystem. On high-volume jobs, spill files accumulate at gigabytes per minute. If the partition hosting the Spark tmp directory is shared with the operating system, application binaries, or log files, the job can fill the node completely within a single processing run.
A particularly instructive variant of this failure occurs during cross-partition file moves. When Java's Files.move() detects that the source directory (/tmp) and the target directory (/app) are on different filesystem partitions, it falls back from an atomic rename to a copy-then-delete operation. The source file is not released until the copy completes, so for the duration of the move the output occupies its full size in both locations. A target partition that appears to have enough free space for the file can still fail if it cannot absorb that temporary copy overhead.
# Diagnose cross-partition boundaries
df -h /tmp /app/framework/tmp /app
# If on different partitions, copy-then-delete applies
# Solution: mount tmp on same partition as target
# or dedicate a separate large volume for /app/framework/tmp
# Monitor disk usage with CloudWatch agent
# Alert thresholds: 70%, 85%, 95%
A separate disk exhaustion event caused every scheduled task across an entire operating environment to stop simultaneously. The batch orchestrator had no working space and could not initiate any new execution. Recovery time was proportional to the size of the cleanup required.
Disk monitoring with automated alerts at 70%, 85%, and 95% thresholds is not optional in production data platforms. A shared batch server with no disk alerting is a scheduled outage waiting for a date.
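One way to wire those thresholds up, assuming the standard CloudWatch agent disk_used_percent metric (the instance ID, dimensions, and SNS topic ARN below are placeholders):
# Hypothetical alarm at the 85% threshold; repeat at 70% and 95%
aws cloudwatch put-metric-alarm \
  --alarm-name batch-server-disk-85pct \
  --namespace CWAgent \
  --metric-name disk_used_percent \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 Name=path,Value=/ \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:region:account:disk-alerts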
Category three — Server routing drift after library updates
13% of all incidents

Every incident in this category shared an identical root cause: a library update applied to one batch server was not replicated to the second batch server before the orchestrator resumed distributing work across both.
The impact was not a single incident. It was twelve separate incidents across three operating environments over three consecutive days, each raised and resolved individually before the shared root cause was identified. By the time the pattern was recognised, twelve separate firefighting cycles had been triggered and data delivery was disrupted across multiple business reporting workflows.
This is configuration drift — the most preventable failure category on the list. The fix is architectural: never apply library updates directly to running servers. Use immutable deployment artefacts — AMI baking, container image promotion, or a blue-green deployment strategy — so that all servers in a cluster are always running identical environments. Add a pre-execution environment parity check to the orchestrator that verifies server runtime versions match before distributing jobs.
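A rough sketch of what that parity check could look like; the hostnames are placeholders, and a real check would cover whichever runtimes the orchestrator actually depends on, not just Python packages:
# Hypothetical pre-execution parity check across the two batch servers
for host in batch-server-01 batch-server-02; do
  ssh "$host" "pip freeze | sort" > "/tmp/libs_${host}.txt"
done
diff /tmp/libs_batch-server-01.txt /tmp/libs_batch-server-02.txt \
  && echo "Environments match" \
  || { echo "Environment drift detected - halting job distribution"; exit 1; }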
Category four — SSL certificate expiry causing silent pending-state failures
8% of all incidents

This failure mode is distinctive because it is not loud. There is no hard error, no stack trace, no immediate alert. A pipeline trigger that depends on a TLS-protected connection simply enters a pending state — waiting for a handshake that will never succeed — until a timeout eventually kills it.
In one case, every scheduled task across an entire operating environment failed on the same day because the SSL certificate used in the connection chain had expired overnight. The diagnosis took time because the surface symptoms — tasks in pending state, no obvious error — did not immediately point to certificate expiry.
# AWS Certificate Manager — native expiry alerting
aws acm describe-certificate \
--certificate-arn arn:aws:acm:region:account:cert/id
# Self-managed certificates — check expiry
openssl x509 -in /path/to/cert.pem -noout -enddate
# Output: notAfter=May 7 18:32:28 2026 GMT
# CloudWatch alarm at 30 days before expiry
# Alert thresholds: 60 days / 30 days / 7 days / 1 day
SSL certificate expiry is entirely preventable. It requires one automation build of a few hours. It eliminates a class of silent, difficult-to-diagnose failures that can affect every pipeline in an environment simultaneously.
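For self-managed certificates, that automation can be as small as a scheduled script. A sketch assuming a placeholder certificate path and SNS topic (ACM-managed certificates can rely on native expiry events instead):
# Hypothetical daily expiry check for a self-managed certificate
expiry=$(openssl x509 -in /path/to/cert.pem -noout -enddate | cut -d= -f2)
days_left=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
if [ "$days_left" -le 30 ]; then
  aws sns publish \
    --topic-arn arn:aws:sns:region:account:cert-expiry-alerts \
    --message "Certificate expires in ${days_left} days"
fi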
Category five — IAM failures, KMS policy gaps, and secret key exposure
9% of all incidents

Access and identity failures occurred in three distinct forms.
Missing KMS Generate* permission
When writing encrypted objects to S3, the KMS policy must explicitly include kms:GenerateDataKey and related kms:Generate* actions. These are not inherited from broader KMS permissions. In a documented case, a KMS policy update included the required permissions for reading and decrypting but omitted the generate actions. S3 metadata writes and batch sync operations failed immediately with Access Denied errors. The policy gap was not visible from the S3 error message alone — it required tracing the error to the KMS call chain.
{
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:Encrypt",
"kms:GenerateDataKey",
"kms:GenerateDataKeyWithoutPlaintext",
"kms:DescribeKey"
],
"Resource": "arn:aws:kms:region:account:key/key-id"
}
AWS Lake Formation per-table permission grants
AWS Lake Formation does not automatically inherit permissions for new tables created in the data catalog from database-level grants. Each new table requires an explicit permission grant. When new tables were added to the Silver layer and made available through Athena, they returned stale or absent data until the per-table Lake Formation grants were applied manually. The fix is a post-deployment automation — a Lambda or Step Function that grants the correct Lake Formation permissions for every new catalog table as part of the table creation process.
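The grant that automation would issue is a single Lake Formation call per table. A hedged sketch, with the principal, database, and table names as placeholders:
# Hypothetical per-table grant issued as a post-deployment step
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::account:role/analytics-workbench-role \
  --resource '{"Table": {"DatabaseName": "silver", "Name": "new_table"}}' \
  --permissions SELECT DESCRIBE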
Business user API key exposure
Two AWS API keys were accidentally exposed by business users during report building activity within an analytics workbench. Security monitoring detected the exposure. Both keys were deactivated immediately. Every service authenticating with those keys — across multiple environments — failed simultaneously and remained failed until replacement keys were generated, distributed, and propagated across all dependent configurations.
The architecture implication: business users and workbench interfaces should never hold raw AWS API keys. All workbench authentication should proxy through IAM roles with temporary STS session tokens. Long-lived access keys distributed to non-engineering users in a regulated BFSI environment are a compliance and operational liability. Secret scanning on workbench outputs and clipboard data is a table-stakes control in this environment.
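Under that model the workbench proxy obtains short-lived credentials on demand rather than holding a key. A minimal sketch, with the role ARN and session name as placeholders:
# Hypothetical: exchange an IAM role for temporary credentials
aws sts assume-role \
  --role-arn arn:aws:iam::account:role/workbench-readonly-role \
  --role-session-name analyst-session \
  --duration-seconds 3600
# Returns an AccessKeyId, SecretAccessKey and SessionToken that expire automatically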
Category six — Antivirus software causing a total platform outage
2% of incidents — highest single-incident impact

This is the highest-impact incident category by operational consequence, despite contributing only two incidents to the total count.
Endpoint antivirus software installed on a production application server interfered with system-level file permissions. The interference caused the API gateway to fail with a 502 error. With the gateway down, the platform could not serve any requests. Every scheduled task failed. Platform availability dropped to zero.
The failure was not immediately attributable to antivirus. The 502 error pointed to a gateway problem. Gateway investigation revealed no application-level error. Systematic elimination of candidates eventually identified the antivirus software as the blocking agent. The resolution was immediate removal of the antivirus software from the production server. Platform availability was restored.
The correct architectural position: runtime antivirus on nodes hosting Java applications with high I/O workloads is an operational hazard. Security scanning belongs at the network perimeter, the container registry, and the CI/CD build pipeline — not at the process level of production application nodes. The correct follow-up action — replicating the failure in a non-production environment to identify the specific system call being intercepted — is essential for preventing recurrence if antivirus reinstallation is ever required.
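As one illustration of moving scanning into the build pipeline (using the open-source image scanner Trivy purely as an example, not as something this platform necessarily runs):
# Hypothetical CI step: fail the build on high-severity findings in the image
trivy image --severity HIGH,CRITICAL --exit-code 1 registry.example.com/data-platform/api-gateway:1.4.2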
Category seven — Docker and container runtime incompatibility
3% of all incidents

A Priority One incident on a newly provisioned server was traced to a compatibility mismatch between the runc container runtime and the installed Docker version. Python Docker containers failed to start. All API-to-File operators that depended on those containers failed with them.
This failure pattern is specific to new server provisioning. The runc OCI runtime has strict compatibility requirements with the Linux kernel version. When Docker packages are installed on a kernel that the packages were not built against, container lifecycle management breaks — either silently or with errors that point to Docker rather than the underlying runc issue.
# Diagnose runc and kernel version alignment
runc --version
uname -r
docker info | grep "Kernel Version"
# Full reinstall to restore compatibility
apt-get remove docker docker-engine docker.io containerd runc
# Assumes the Docker apt repository is already configured for the docker-ce packages
apt-get install docker-ce docker-ce-cli containerd.io
# Verify post-reinstall
docker run --rm hello-world
Resolution: full reinstall of Docker and the container runtime. Prevention: include container runtime smoke tests in server provisioning runbooks before any production workloads are deployed to new nodes.
Category eight — Kafka standalone architecture limits
9% of all incidents — architectural constraint

The final category is an architectural constraint rather than a configuration failure. It is the one item on this list that cannot be fixed with a configuration change or an automation build.
Real-time data ingestion pipelines running on a standalone Kafka architecture were experiencing timeouts every three to four days under sustained production load. The operational response was manual restart of the pipelines twice per week to maintain data flow. This is not a sustainable operating model at banking-grade service levels.
Standalone Kafka has no fault tolerance. A single broker that becomes unstable under sustained load causes producer queuing and consumer offset issues that accumulate until the pipeline stalls. There is no automatic recovery. There is no leader election. There is no replica to promote. Related ingestion failures logged in the same period showed that EMR or EC2 scaling would also be required — the compute layer, not just the messaging layer, needed horizontal scaling.
The required architecture: a Kafka cluster of minimum three brokers with replication factor three for all production topics. ZooKeeper ensemble or KRaft mode (Kafka 3.3+) for leader election. Consumer groups with automatic offset management. Dead-letter queues for poison messages that cannot be processed. This eliminates the twice-weekly manual restart cycle entirely.
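A sketch of a production topic under that design, with broker addresses, topic name, and partition count as illustrative values:
# Hypothetical topic on a three-broker cluster
kafka-topics.sh --create \
  --bootstrap-server broker1:9092,broker2:9092,broker3:9092 \
  --topic realtime-ingest \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2
# With producer acks=all, writes survive the loss of any single broker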
An additional edge case — S3 object storage class and Athena query failures
A distinct failure mode worth documenting separately: Athena batch operations failing with a Null Pointer Exception and the message "The operation is not valid for the object's storage class."
The cause: objects in an S3 bucket had been transitioned to the Glacier storage class by an automated lifecycle policy. Athena cannot query Glacier-class objects directly. The objects must be explicitly restored to Standard or Standard-IA before any query or batch operation can access them. The Null Pointer Exception surface error gave no indication that the actual problem was storage class — the diagnosis required checking the S3 object metadata directly.
# Identify Glacier objects in a bucket
aws s3api list-objects-v2 \
--bucket your-bucket \
--query "Contents[?StorageClass=='GLACIER'].[Key,StorageClass]" \
--output table
# Restore Glacier object (asynchronous; Standard-tier retrievals typically take hours)
aws s3api restore-object \
--bucket your-bucket \
--key path/to/object \
--restore-request Days=7,GlacierJobParameters={Tier=Standard}
# Permanent fix: exclude active query partitions from
# Glacier lifecycle policies, or move to S3 Intelligent-Tiering
The ten engineering actions that would have prevented 80% of these incidents
What this incident record actually tells us
Ninety-five incidents across 68 days of production operations is not evidence of a poorly engineered platform. It is evidence of a platform under real production load — processing hundreds of millions of financial records across multiple legal entities, evolving continuously as the business scales. Failure at this rate is not unusual. What distinguishes high-performing platforms from degrading ones is not the absence of these failures but the speed of pattern recognition.
When twelve incidents share the same root cause, the platform that identifies the pattern at incident two recovers differently from the platform that identifies it at incident twelve. Incident taxonomy — clustering failures by underlying mechanism rather than surface symptom — is an engineering discipline, not an administrative one.
The eight categories documented here, and the ten engineering actions above, represent the operational maturity model for a production AWS data platform at BFSI scale. Six of the eight failure categories can be addressed with configuration and automation changes that take days to implement. The remaining two require architectural decisions that take weeks. All of them are known. All of them have documented fixes. None of them need to recur.
Every incident is a message from your architecture. Reading ninety-five of them at once tells you something no single post-mortem can.
What causes Apache Spark OOM errors in production data platforms?
The most common causes are window function deduplication generating massive network shuffles on high-volume tables; complex SQL plans with deeply nested CASE or UNION logic overflowing the Catalyst optimizer stack; incremental loads against large Silver-layer tables without partition pruning; and node loss redistributing full data volume to surviving workers. The primary fixes are date partitioning on Silver tables, increased shuffle partitions, and cluster mode over standalone.
Why does Spark report No space left on device?
Spark spills intermediate shuffle data to local disk when executors cannot hold it in memory. If the tmp directory partition is shared with the OS or application binaries, a single high-volume job can fill the node entirely. A secondary cause is cross-partition file moves — Java performs copy-then-delete when source and target are on different partitions, requiring double the file size in temporary overhead. The fix is a dedicated scratch volume for Spark tmp, mounted separately from the OS partition.
How does SSL certificate expiry cause silent pipeline failures?
Pipelines depending on TLS connections do not raise hard errors when a certificate expires — they enter a pending state waiting for a handshake that never succeeds, until timeout terminates them. The failure is silent and slow, making diagnosis difficult. Automated certificate monitoring with alerting at 60, 30, 7, and 1-day thresholds before expiry is the only reliable prevention.
What happens when AWS API keys are leaked by business users?
Security monitoring detects the exposure and deactivates the keys immediately. Every service authenticating with those keys fails instantly and stays failed until replacement keys are propagated. The correct architecture is to eliminate long-lived key distribution entirely — use IAM roles with temporary STS session tokens for all workbench and business user access.
Can antivirus software cause a total production outage on AWS?
Yes. Endpoint antivirus on production application servers can block system-level file permissions and socket operations, causing gateway failures that bring an entire platform down. The correct architecture is to remove endpoint AV from production application nodes and operate scanning at the network perimeter, container registry, and build pipeline instead.
Found this useful? I write weekly on cloud data infrastructure, AWS engineering, and the operational discipline that separates data platforms that survive regulatory review from those that do not. Subscribe to the newsletter — no spam, unsubscribe anytime.
Working on a data platform reliability or cloud architecture challenge? Get in touch directly.
Raj Thilak is Head of Technology for Data & Analytics with 24 years of experience in BFSI technology leadership across Citi, Standard Chartered, and Accenture. He directs large-scale engineering programmes for financial services data platforms on AWS. Based in Pune, India. rajthilak.dev