
From EC2 to EKS: Inside Coinbase’s 10x Compute Modernization

TL;DR: To prepare for extreme market volatility, Coinbase migrated 3,500 services from EC2 to Amazon EKS and AWS Graviton, a project accelerated by specialized consultants that resulted in 50% faster scaling and significant cost savings. The journey continued with a transition from Cluster Autoscaler to Karpenter for more dynamic, just-in-time node provisioning, further reducing infrastructure costs by 20% and creating a more agile and cost-effective platform for the future.

By Tiberiu Oprisiu, Frances Chong, Sara Reddy, October 27, 2025


The crypto market’s volatility is a defining challenge for our infrastructure. We face traffic surges comparable to a massive video game launch, but with one key difference: we never know when "launch day" is. During the "crypto winter" of 2022, we knew we had to evolve. Our goal was to build a more efficient and responsive platform that could scale instantly during market peaks while remaining cost-effective during quieter periods.

This led us to launch a company-wide initiative to modernize our compute platform by migrating from a traditional EC2 architecture to the containerized environment of Amazon EKS.

The Playbook for a Mass Migration

The scale of this project was immense: migrating 3,500 service configurations in just 12 months. To achieve this without disrupting product development, we developed a new operational model.

We brought in specialized consultants who acted as a force multiplier for our internal teams. They were organized into "pods," with each pod paired with a full-time Coinbase engineer. This structure was critical for overcoming the natural resistance from service owners, as the full-time employee could bridge the trust gap and assist with cultural onboarding.

Automation was another key pillar. An internally built tool called "kubetools" allowed engineers to create a new EKS configuration from an existing Auto Scaling Group configuration with just a few CLI commands. Building trust through validation was equally critical: we provided load-testing capabilities and weighted traffic shifting so that issues could be identified quickly and rolled back without customer impact. By combining this tooling with a clear operational playbook and weekly check-ins to celebrate wins, we built the momentum needed to complete the migration on schedule.
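
kubetools itself is internal, so the snippet below is only a hypothetical illustration of the idea, not the actual tooling: it reads an Auto Scaling Group's desired capacity with boto3 and maps it onto a Kubernetes Deployment manifest. The service name, container image, and resource requests are placeholder assumptions.

```python
# Hypothetical sketch of an ASG-to-EKS conversion, loosely modeled on the
# kubetools workflow described above; this is not the actual Coinbase tooling.
import boto3
import yaml  # PyYAML, only used to print the manifest


def asg_to_deployment(asg_name: str, image: str, namespace: str = "default") -> dict:
    """Map an Auto Scaling Group's sizing onto a Kubernetes Deployment manifest."""
    asg = boto3.client("autoscaling").describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]

    app = asg_name.lower()
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app, "namespace": namespace},
        "spec": {
            # Start from the ASG's desired capacity as the replica count.
            "replicas": asg["DesiredCapacity"],
            "selector": {"matchLabels": {"app": app}},
            "template": {
                "metadata": {"labels": {"app": app}},
                "spec": {
                    "containers": [{
                        "name": app,
                        "image": image,
                        # Placeholder sizing; real requests would be derived from
                        # the instance type and observed utilization.
                        "resources": {"requests": {"cpu": "500m", "memory": "1Gi"}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = asg_to_deployment("payments-api", "example.ecr.aws/payments-api:latest")
    print(yaml.safe_dump(manifest, sort_keys=False))
```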

Learnings from the Field: Navigating Roadblocks

A journey of this scale is never a straight line. Intensive load testing—at 10x our peak traffic—was essential for uncovering risks before they could impact customers.

  • IP Exhaustion: Early tests revealed that the higher density of pods in EKS could lead to IP address exhaustion within our VPCs. Identifying this allowed us to proactively re-architect our networking before it became a production incident.

  • De-risking Prerequisites: We learned to identify and tackle infrastructure-level dependencies early. For instance, we automated the migration from Classic Load Balancers (CLB) to Application Load Balancers (ALB) to simplify the process for engineering teams and reduce risk.

  • Breaking Outdated Practices: Services were previously over-provisioned to compensate for the limitations of legacy infrastructure. Although Kubernetes removed that need, the habit persisted, especially in anticipation of market events. Introducing a shared capacity buffer backed by preemptible placeholder pods (sketched below) addressed these concerns without returning to per-service over-provisioning.
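
The shared buffer follows a common Kubernetes overprovisioning pattern: low-priority placeholder pods reserve headroom that real workloads can preempt instantly, and the autoscaler then backfills the displaced placeholders. The sketch below shows one way to express that pattern with the Kubernetes Python client; the priority value, replica count, and resource sizes are illustrative, not our production settings.

```python
# Illustrative overprovisioning buffer: low-priority "pause" pods hold spare
# capacity that higher-priority service pods can preempt immediately.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

# A priority class well below the default (0) so any real workload preempts the buffer.
buffer_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="cluster-buffer"),
    value=-10,  # illustrative value
    global_default=False,
    description="Placeholder pods that reserve burst headroom",
)
client.SchedulingV1Api().create_priority_class(buffer_priority)

# A deployment of pause containers sized to the headroom we want to keep warm.
buffer = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="capacity-buffer", namespace="kube-system"),
    spec=client.V1DeploymentSpec(
        replicas=10,  # illustrative headroom
        selector=client.V1LabelSelector(match_labels={"app": "capacity-buffer"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "capacity-buffer"}),
            spec=client.V1PodSpec(
                priority_class_name="cluster-buffer",
                containers=[client.V1Container(
                    name="pause",
                    image="registry.k8s.io/pause:3.9",
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "2Gi"},
                    ),
                )],
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="kube-system", body=buffer)
```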

The Result: A Faster, More Efficient Platform

The move to EKS fundamentally transformed our compute platform. Because scaling no longer meant waiting for new EC2 instances to boot, we achieved 50% faster scaling.

The efficiency gains were even more significant. By bin-packing multiple services onto single nodes and using granular resource settings, we saw a 68% reduction in resources used by migrated services. Finally, by offloading the management of the Kubernetes control plane to AWS and leveraging widely adopted open-source solutions, we reduced the operational burden on our infrastructure teams.
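
For intuition on where those gains come from, here is a toy sketch of the bin-packing effect: granular CPU requests let many right-sized pods share a node instead of each service holding its own over-provisioned instance. The first-fit-decreasing heuristic and the numbers below are simplifications for illustration, not the scheduler's actual algorithm.

```python
# Toy illustration of bin-packing: how many nodes do granular CPU requests need?
# A first-fit-decreasing heuristic stands in for the scheduler; numbers are made up.
def nodes_needed(pod_cpu_requests: list[float], node_cpu: float) -> int:
    nodes: list[float] = []  # remaining CPU on each open node
    for request in sorted(pod_cpu_requests, reverse=True):
        for i, free in enumerate(nodes):
            if free >= request:
                nodes[i] -= request  # pack onto an existing node
                break
        else:
            nodes.append(node_cpu - request)  # open a new node
    return len(nodes)


# One service per dedicated instance vs. right-sized pods sharing nodes.
requests = [0.5, 1.0, 0.25, 2.0, 0.75, 0.5, 1.5, 0.25]
print("dedicated instances:", len(requests))                      # 8
print("bin-packed 4-vCPU nodes:", nodes_needed(requests, 4.0))    # 2
```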

Next-Level Efficiency with AWS Graviton

With the EKS migration complete, we moved to the next phase: transitioning our EKS workloads to AWS Graviton instances to take advantage of their superior performance and 20% lower cost.

This introduced a new challenge. The standard Kubernetes Cluster Autoscaler was a blocker, since it selected a node group to scale at random and could not be configured to prioritize Graviton. Our solution was pragmatic: we increased the number of Graviton node groups in our clusters, raising the probability that the autoscaler would choose a Graviton host and allowing us to adopt a "burst into Graviton" strategy while keeping x86 instances as a fallback for resilience.
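
The weighting itself happened at the node-group level, but the same "prefer Graviton, fall back to x86" intent can also be expressed per workload with a soft node-affinity rule, assuming images are built for both architectures. A minimal sketch with the Kubernetes Python client (the image name is a placeholder):

```python
# Soft preference for Graviton (arm64) nodes, with x86_64 remaining available as
# a fallback. Assumes the workload image is built for both architectures.
from kubernetes import client

graviton_preference = client.V1Affinity(
    node_affinity=client.V1NodeAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1PreferredSchedulingTerm(
                weight=100,  # strongest soft preference
                preference=client.V1NodeSelectorTerm(
                    match_expressions=[client.V1NodeSelectorRequirement(
                        key="kubernetes.io/arch",
                        operator="In",
                        values=["arm64"],
                    )]
                ),
            )
        ]
    )
)

# Because the rule is preferred rather than required, the scheduler can still
# place the pod on amd64 nodes when no arm64 capacity is available.
pod_spec = client.V1PodSpec(
    affinity=graviton_preference,
    containers=[client.V1Container(name="app", image="example.ecr.aws/app:multiarch")],
)
```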

This final step delivered an additional 10% in overall compute savings, proving that the combination of EKS and Graviton was the optimal architecture for our needs.

The Next Evolution: Moving to Karpenter for Smarter Scaling

Karpenter before and after

Our continued growth exposed the limitations of an architecture built on Cluster Autoscaler and Managed Node Groups and pushed us toward a longer-term solution. A key pain point was updating node groups, which often meant slow, disruptive upgrade cycles: with a 15-minute timeout and only a single attempt to complete an upgrade, the old system proved inflexible and unreliable. Managing a variety of instance types was equally cumbersome, since each type required its own node group. This complex and inefficient setup no longer met our demands for speed and scalability.

In search of a more dynamic and efficient solution, we transitioned to Karpenter, which takes a more Kubernetes-native approach to node management. Karpenter operates as a controller that continuously watches for unschedulable pods and provisions new nodes in real time. This dynamic, just-in-time provisioning eliminates predefined node groups and their management overhead, which streamlined operations and delivered a further 20% reduction in infrastructure costs. By embracing Karpenter, we achieved a more agile, cost-effective, and scalable foundation to support Coinbase's continued growth.
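
As a rough picture of what replaces the old node groups, the sketch below registers a single Karpenter NodePool that can provision arm64 capacity on demand. It follows the Karpenter v1 NodePool schema as we understand it; the limits, capacity type, and EC2NodeClass name are assumptions for the example, and a matching EC2NodeClass would need to exist.

```python
# Rough sketch of creating a Karpenter NodePool with the Kubernetes Python
# client. Field names follow the Karpenter v1 API; values are illustrative.
from kubernetes import client, config

config.load_kube_config()

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general-arm64"},
    "spec": {
        "template": {
            "spec": {
                # Assumed EC2NodeClass holding the AMI, subnet, and security-group config.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                "requirements": [
                    {"key": "kubernetes.io/arch", "operator": "In", "values": ["arm64"]},
                    {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["on-demand"]},
                ],
            }
        },
        # Cap how much capacity this pool may provision.
        "limits": {"cpu": "1000"},
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "1m",
        },
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=node_pool
)
```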

IP address management is a subtle but critical challenge with Karpenter. While it is great at provisioning resources just in time, it does not always have full visibility into a subnet's IP availability. This can lead to IP exhaustion, where a subnet runs out of addresses, causing newly provisioned nodes to become unready and pods to get stuck in a pending state. The issue often arises because Karpenter's subnet selection does not account for IP availability: it can favor a particular Availability Zone (AZ) with few remaining IPs, leading to an unbalanced distribution of nodes. This is especially problematic for highly dynamic workloads with frequent scaling events. To address this long-term, we are actively working on enabling dual-stack EKS clusters to utilize IPv6, mitigating the risk of IPv4 exhaustion.
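
Until IPv6 removes the constraint, one simple mitigation is to watch subnet headroom directly. The snippet below is a minimal sketch that uses boto3 to flag cluster-tagged subnets whose free IPv4 addresses dip below a threshold; the tag key and threshold are assumptions.

```python
# Minimal sketch: flag cluster subnets running low on free IPv4 addresses.
# The tag key and threshold are illustrative assumptions.
import boto3

LOW_IP_THRESHOLD = 500


def low_ip_subnets(cluster_tag_key: str = "kubernetes.io/cluster/my-cluster") -> list[dict]:
    """Return subnets tagged for the cluster whose free IPv4 count is below the threshold."""
    subnets = boto3.client("ec2").describe_subnets(
        Filters=[{"Name": "tag-key", "Values": [cluster_tag_key]}]
    )["Subnets"]
    return [
        {
            "subnet": s["SubnetId"],
            "az": s["AvailabilityZone"],
            "free_ips": s["AvailableIpAddressCount"],
        }
        for s in subnets
        if s["AvailableIpAddressCount"] < LOW_IP_THRESHOLD
    ]


if __name__ == "__main__":
    for s in low_ip_subnets():
        print(f"LOW: {s['subnet']} ({s['az']}) has {s['free_ips']} free IPs")
```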

Building for the Future

This modernization of our core compute platform represents a multi-year strategic effort, achieved through the dedication and collaboration of countless engineering teams. While this journey marks a significant milestone, our work on platform evolution is never finished. We remain committed to continuous optimization, building an infrastructure that scales effortlessly and efficiently. This foundational work is critical to our mission of supporting the future of the cryptoeconomy and increasing economic freedom around the world.
