Razor Group's Migration from Amazon Redshift to an Apache Iceberg Lakehouse on AWS
Razor Group's Migration from Amazon Redshift to an Apache Iceberg Lakehouse on AWS
Status: Draft Target: AWS Big Data Blog Date: 2026-04-10
[Preface and problem statement sections already written — not included here]
The Migration Journey: Five Phases
Razor Group didn't flip a switch. The migration was designed as five phases — each one delivering standalone value while building the foundation for the next. At no point was a production pipeline paused or a dashboard unavailable.
Phase 1: Establish the Lakehouse Foundation
Before migrating a single query, the team needed to answer three questions: where does the data live, how is it managed, and how do we query it?
Why S3 Tables over self-managed Iceberg
The team had already committed to Apache Iceberg as the table format — open, engine-agnostic, with ACID transactions and time-travel built in. The question was whether to self-manage Iceberg on standard S3 buckets or use Amazon S3 Tables.
Self-managed Iceberg is powerful but operationally expensive. Someone has to run compaction jobs to prevent small-file proliferation. Someone has to expire old snapshots before metadata bloat degrades query planning. Someone has to clean up orphaned data files after failed writes. At 719 models running across 44 schemas — many materializing multiple times per day — the maintenance burden would scale with the platform, not shrink.
S3 Tables eliminated this entire category of work. Compaction, snapshot management, and unreferenced file removal run continuously and automatically. Tables that would have degraded under growing data volumes instead maintain consistent query performance without any team intervention. The integrated Iceberg REST Catalog API means any compatible engine — Spark, Trino, Athena, Redshift, Flink — can discover and query tables without maintaining a separate metastore. And because tables are first-class AWS resources, access control, encryption, and lifecycle policies operate at the table level rather than through complex S3 bucket policies layered on top of file-path conventions.
For a team migrating hundreds of models and didn't want to trade one operational burden (Redshift cluster management) for another (Iceberg table maintenance), this was the deciding factor.
AWS Glue Data Catalog provides unified metadata discovery across all tiers, and Lake Formation handles column- and table-level access control with IAM roles following least-privilege principles and CloudTrail enabled for a full audit trail.
Choosing the query protocol
The legacy Redshift cluster was 98% ETL — only 1% of compute hours were analyst SELECT queries. The replacement engine needed to handle heavy batch transformations first, interactive queries second.
Traditional Spark (spark-submit) handles batch ETL well, but couples clients to the cluster. Every job requires packaging driver JARs, managing classpaths, and submitting from within the cluster. Getting execution feedback back into Airflow was brittle — the team prototyped this approach and found that polling for job status introduced race conditions, and driver failures didn't propagate cleanly to Airflow task states. Debugging meant SSH-ing into cluster nodes to read logs. For a platform running 221 production DAGs, this operational friction was a non-starter.
Spark Connect — the gRPC-based client-server protocol introduced in Spark 3.4 — solved the coupling problem entirely. The cluster runs a persistent gRPC endpoint; clients connect remotely and submit queries over the wire. Airflow operators become thin clients: open session, submit SQL, get results, with success and failure mapping directly to Airflow task states. No driver JARs, no polling. Multiple consumers — Airflow workers, the web application, developer notebooks — share one cluster without any of them needing Spark installed locally.
Deploying Spark Connect
Amazon EMR was the natural hosting choice, but at the time of evaluation it did not support Spark Connect. The team deployed a self-hosted cluster on EC2 — Graviton-based leader node on-demand, spot workers at ~70% cost savings, with the Connect endpoint exposed through an internal Network Load Balancer. Custom Packer AMIs bake in Spark 3.5.4, Iceberg 1.7.1, and the S3 Tables catalog JAR — because the --packages flag requires internet access that private-subnet nodes don't have. The team plans to migrate to EMR now that Connect support is available, but self-hosting gave them full control during the critical migration period.
This phase produced no immediate business value — but it made everything that followed possible.
Phase 2: Migrate Data Ingestion
Razor Group's ingestion layer pulls data from Amazon SP API, Seller Central portals, NetSuite ERP, and custom web scrapers. In the legacy architecture, all of this landed in Redshift via COPY commands — tying data freshness to the always-on cluster's availability and batch schedule.
The team migrated these pipelines to AWS Lambda functions orchestrated by Apache Airflow 3.2, writing data directly to S3 Tables in Iceberg format. The shift from schedule-driven to event-driven was transformative: Lambda functions spin up only when there's data to process, and Airflow sensors trigger downstream transformations the moment new data lands — replacing rigid hourly batch windows with data freshness measured in minutes.
Airflow itself runs on EC2 with Celery workers, managing 221 DAGs across 92 orchestration flows. Existing DAG definitions migrated with minimal changes — the heavy lift was rewiring destinations from Redshift COPY to Iceberg writes, not rewriting orchestration logic.
This phase alone eliminated roughly 40% of compute costs by severing the always-on cluster dependency for ingestion.
Phase 3: Transform Processing Pipelines
Migrating 719 SQL models from Redshift to Apache Spark was the most technically demanding phase — and where the team learned the most. The migration moved incrementally, starting with Bronze-to-Silver transformations and working up the dependency chain across 44 schemas.
The team built automated conversion tooling and a validation framework that ran both Redshift and Spark outputs in parallel, comparing results row-by-row before decommissioning anything. Several categories of transformation pushed the limits of what automation could handle:
- Window functions: Redshift's
QUALIFYclause has no Spark equivalent. Each instance required wrapping in a subquery with explicit row numbering — 18 models in the inventory schema alone. - JSON serialization: The most time-consuming category. Complex columns stored as JSON STRING in Redshift needed
from_json()with hand-written STRUCT definitions in Spark. Every nested payload column across ads, orders, and transaction pipelines required schema introspection — there was no shortcut. - Function dialect: 20+ function-level conversions —
NVL→COALESCE,DATEADD→ interval arithmetic,LISTAGG→ARRAY_JOIN(COLLECT_LIST()),SPLIT_PART→ zero-indexedSPLIT,ILIKE→LOWER() LIKE LOWER(). - Snapshot elimination: The single biggest hidden cost. Full table copies running 6× daily just to preserve point-in-time state consumed 39+ hours of weekly Redshift compute. With Iceberg's native time-travel, these became zero-cost operations overnight.
The team used Claude to accelerate SQL conversion, building a structured migration workflow directly into the repository: read source SQL → identify dependencies → convert syntax → resolve missing base tables → add JSON parsing → validate → write to lakehouse. What could have been a multi-year slog became a systematic, repeatable process measured in weeks.
Phase 4: Unify the Serving Layer
With data flowing through Iceberg tables, the team collapsed the serving layer. Tableau dashboards, ML notebooks, and the production application all query the same Iceberg tables through Spark Connect — no more maintaining separate data copies, materialized views, or extract jobs for different consumers.
This is the strategic payoff of an open table format. Iceberg tables on S3 are engine-agnostic: Spark for batch transforms today, Trino for interactive queries tomorrow, Flink for streaming next quarter. Any engine that speaks Iceberg can read the data without conversion or migration. The team went from being locked into a single vendor's SQL dialect to having the freedom to evaluate and adopt new engines without touching the storage layer.
Phase 5: Operationalize and Observe
The final phase made the lakehouse production-grade:
- OpenTelemetry collectors on Amazon ECS aggregate metrics, traces, and logs from every pipeline component into a unified view
- Grafana Loki provides centralized log search backed by S3 with 90-day retention
- Guardian — a custom autonomous SRE agent backed by Amazon DynamoDB — monitors cluster health, detects anomalies, correlates failures across Spark and Airflow, and routes contextualized alerts through Slack
Pipeline orchestration moved from EventBridge to Airflow with three DAG patterns: a daily pipeline (ingestion → materialization → export → AI agent analysis), an operations worker polling every 15 minutes, and weekly scraper jobs.
The cutover was zero-downtime by design: Airflow ran in parallel with EventBridge for two weeks. Automated comparison checks validated that every pipeline produced identical outputs before the legacy scheduler was disabled.
Architecture Overview
[Architecture diagram]
The architecture follows a medallion pattern across three data tiers:
- Bronze: Raw data from landing zones — SP API reports, Seller Central scrapers, NetSuite ERP, third-party feeds. Data lands in Iceberg format with minimal transformation, preserving source schemas for auditability.
- Silver: Business logic across 50 schemas — deduplication, type casting, cross-source joins, enrichment. Models are SQL files with Jinja2 templating supporting both incremental and full-refresh execution.
- Gold: Consumption-ready data marts — dashboard aggregates, ML feature stores, API-ready exports powering Tableau, the production application, and downstream automation.
AWS Glue Data Catalog provides unified metadata discovery. Lake Formation enforces access policies down to the column level.
Results
| Metric | Before (Redshift) | After (Lakehouse) | Impact | |--------|-------------------|-------------------|--------| | Weekly compute | 609 hours | ~180 hours | 70% reduction | | Cluster model | 6× ra3.4xlarge, always-on | Spot-based Spark, on-demand | Pay only for what runs | | Storage | 152 TB / 20,252 tables | 97.7 TB / 15,646 tables | 36% reduction | | Snapshot overhead | 39+ hrs/week | Zero | Eliminated by Iceberg | | Data freshness | Hourly batch windows | Event-driven, minutes | Order-of-magnitude faster | | Engine lock-in | Redshift SQL only | Open Iceberg, any engine | Eliminated |
Before the migration even began, the team's workload analysis surfaced waste that had been invisible for years. The single most expensive DAG consumed 87.5 hours of Redshift compute per week — the same UNLOAD operation running 3× daily with two different partition strategies. A storage audit found 4,606 orphaned tables — developer scratch data, abandoned temp tables, dead pipelines with zero readers — consuming 54.6 TB. Dropping them enabled downsizing from 6 nodes to 4 immediately.
These findings alone paid for the first month of the lakehouse infrastructure.
Lessons Learned
The hardest part isn't the technology — it's the archaeology. Iceberg, Spark, and S3 Tables work well together out of the box. The real challenge is mapping 719 models across 44 schemas, tracing dependencies through 221 DAGs, and discovering that "column X" is actually a JSON STRING that was silently serialized differently between two engines. Migration is an excavation project as much as an engineering one.
Know your workload before you replace your warehouse. The Redshift workload analysis was the single highest-ROI activity of the project. Learning that 98% of compute was ETL — and only 1% was analyst queries — validated the move to on-demand Spark and prevented the team from over-provisioning the replacement infrastructure for interactive workloads that barely existed.
Automate conversion, but budget for the 30%. Automated tooling handled the mechanical syntax conversions well. But models with complex JSON payloads, deeply nested window functions, or cross-schema snapshot dependencies required human judgment. Roughly 30% of models needed significant manual work — and those models consumed 70% of the migration effort.
Snapshot operations are hidden cost bombs. Full table copies running 6× daily to preserve point-in-time state cost 39+ hours of weekly compute — and nobody questioned it because "that's how snapshots work." Iceberg's time-travel made them free. This single capability justified a meaningful portion of the migration.
Run in parallel, then trust but verify. The two-week parallel run caught edge cases that row-level validation missed: timezone handling differences between Redshift and Spark, partition pruning behavior under concurrent writes, and subtle ordering differences in non-deterministic window functions. The parallel run wasn't optional — it was where the migration actually proved itself.
Conclusion
Razor Group's migration from Amazon Redshift to an Apache Iceberg lakehouse on S3 Tables shows that even large-scale, production-critical data platforms can be modernized systematically — without heroics and without downtime. The key was treating migration as an engineering discipline: measure the workload, move incrementally, validate continuously, and cut over only when the evidence says you're ready.
The open lakehouse architecture now positions Razor Group for what comes next: real-time streaming, ML-powered forecasting, and multi-engine federation — all reading from the same Iceberg tables, all without another migration.
Ready to automate your Amazon operations?
CorditeOS combines intelligence, AI agents, and execution in one platform.
Get Early Access