Originally published on CloudMystery
Slack is an AI-powered work management and collaboration platform. As business needs have grown, Slack has made significant changes to its internal compute orchestration platform to increase scalability, improve efficiency, and reduce costs. The internal platform, codenamed "Bedrock," is built on Amazon EKS and Karpenter.
By leveraging Bedrock, Slack has simplified container deployment and management into a single YAML file, streamlining workflows for internal developers. With tools such as Jenkins, Nebula overlay networking, and FQDN-based service discovery, more than 80 percent of Slack's applications now run on Bedrock, improving test accuracy and infrastructure management. However, as the business expanded, Slack faced challenges managing scalability and resource utilization, and adopted Karpenter, an open source cluster auto-scaling tool.
In this post, we'll dive into Slack's process of modernizing their container platform on Amazon EKS and how they've been able to save money and increase O&M efficiency by leveraging Karpenter.
Before Karpenter: Challenges for Slack
Prior to integrating Karpenter, Slack relied on managing multiple Auto Scaling Groups (ASGs) for its EKS compute resources. While this approach worked well initially, it began to experience bottlenecks as workloads and complexity continued to grow:
1. Scalability issues: Managing multiple ASGs becomes challenging as instance types and application requirements increase.
2. Upgrading bottlenecks: Frequent updates to an EKS cluster with thousands of worker nodes can slow down deployment.
3. Architectural constraints: A single-replica architecture creates single points of failure, and managing different instance requirements across Availability Zones creates inefficiencies.
These constraints led Slack to seek a more resilient, efficient, and dynamic elastic scaling solution.
Two-phase strategy: gradual rollout of Karpenter
Karpenter's ability to dynamically schedule and scale nodes based on workload demand makes it an ideal solution for Slack's needs. By interacting directly with Amazon EC2's Fleet API, Karpenter can assess a pod's resource requirements and select the best instance type for the job. Slack adopted a meticulous two-phase rollout strategy to ensure a smooth transition:
Phase 1: Validation
Initially, Karpenter was deployed alongside core applications on managed node groups. During this phase, Karpenter's consolidation feature (optimizing node utilization by redistributing workloads) was disabled, and the focus was on validating its functionality with specific workloads. This phase also included extensive testing to identify and resolve pod configuration errors, ensuring workloads were optimized for efficient scheduling.
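As a rough illustration of this validation posture, a NodePool that limits consolidation to empty nodes only (so running workloads are never repacked) might look like the following, assuming Karpenter's v1 API; the names and values here are hypothetical, not Slack's actual manifest:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: validation
spec:
  disruption:
    # Only reclaim nodes that are completely empty; never
    # consolidate nodes that still have running workloads.
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```

Once validation completed, switching `consolidationPolicy` to `WhenEmptyOrUnderutilized` would turn full consolidation on, which matches the behavior described in Phase 2 below.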
Phase 2: Full-scale roll-out
Slack moved the Karpenter controller workload to a dedicated ASG, ensuring that Karpenter would not run on nodes it manages itself. After rigorous testing, Slack eventually fully deployed Karpenter in an environment of more than 200 EKS clusters running thousands of worker nodes. Consolidation was enabled at this stage, allowing Slack to maximize resource utilization and achieve significant cost savings.
Because the rollout was phased, Slack could control which clusters had Karpenter enabled, allowing it to validate workload performance under Karpenter and quickly roll back when issues were reported.
When workloads lack appropriate requests/limits, Karpenter allocates smaller instances or only a fraction of large instances, leading to frequent pod churn (the cycle of creation, destruction, and re-creation of pods and containers) as load increases. Slack identified this issue with Karpenter and improved its platform to ensure that pods are configured appropriately and assigned to suitable nodes. For workloads that require a specific instance type, Slack was able to tune the NodePool custom resource and use Karpenter labels to pin pods to the relevant instance type.
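A minimal sketch of both fixes, assuming the standard Kubernetes well-known instance-type label (the pod name, image, instance type, and sizes are illustrative, not Slack's actual configuration): explicit requests/limits stop Karpenter from under-sizing nodes, and a `nodeSelector` pins the pod to a specific type that the NodePool permits.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload        # hypothetical name
spec:
  # Pin to a specific instance type via the well-known node label;
  # the NodePool's requirements must also allow this type.
  nodeSelector:
    node.kubernetes.io/instance-type: "m5.8xlarge"
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      resources:
        # Explicit requests/limits let Karpenter size nodes
        # correctly and avoid pod churn under load.
        requests:
          cpu: "4"
          memory: 16Gi
        limits:
          cpu: "4"
          memory: 16Gi
```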
The architectural strengths of Slack's Bedrock EKS cluster are its resiliency and efficiency, as shown in the figure below. The combined impact of Bedrock and Karpenter on Slack's EKS architecture is clear.
After Karpenter: What Slack Achieved
Slack's adoption of Karpenter has resulted in significant improvements in several areas:
1. Resource optimization
Karpenter dynamically selects instance types based on pod requirements (from 8xlarge up to 32xlarge), so all available resources can be used efficiently for workloads that do not require a specific instance type, greatly improving cluster utilization. Consolidation eliminates idle instances by redistributing workloads and reduces the need for a minimum ASG size in every Availability Zone (AZ).
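The "8xlarge up to 32xlarge" range can be expressed declaratively in a NodePool's requirements using Karpenter's well-known instance labels; this fragment is a sketch under that assumption, with illustrative values rather than Slack's actual configuration:

```yaml
# NodePool requirements fragment: allow any instance family,
# but constrain the size range Karpenter may pick from.
requirements:
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values: ["8xlarge", "12xlarge", "16xlarge", "24xlarge", "32xlarge"]
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
```

Within these bounds, Karpenter is free to choose whichever type satisfies pending pods at the lowest cost, which is what enables the utilization gains described above.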
Karpenter's intelligent node selection further optimizes dynamic instance allocation, matching user workloads against more than 750 instance types to reduce resource waste, improve compute performance, and enhance application stability.
2. Cost optimization
By automatically scaling nodes and leveraging a richer family of instances, Slack has reduced compute costs by 12 percent. Dynamic instance allocation also reduces the burden on infrastructure teams, who were previously tasked with maintaining multiple ASG configurations.
3. Enhanced performance
Because Karpenter makes on-the-fly node auto-scaling decisions, it accelerates node provisioning and reduces pod startup time, enabling faster scaling during peak workloads. In addition, Slack runs Karpenter with customized over-provisioning to buffer against sudden traffic spikes.
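The source does not detail Slack's over-provisioning mechanism, but a common way to implement it with Karpenter is low-priority placeholder pods: they force Karpenter to keep spare nodes warm, and real workloads evict them instantly during a spike. A sketch under that assumption (all names and sizes hypothetical):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10          # lower than any real workload, so these evict first
globalDefault: false
description: "Placeholder pods that reserve spare capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              # Headroom reserved per placeholder pod; Karpenter
              # provisions nodes to keep these pods scheduled.
              cpu: "2"
              memory: 4Gi
```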
Karpenter's direct API interaction with Amazon EC2 and retry mechanism improves resiliency and ensures faster recovery in the event of an Availability Zone (AZ) failure.
4. Simplified operations
Removing hard-coded instance types from the Terraform configuration helps pods launch faster, and quickly draining and rotating nodes during upgrades eases concerns about upgrading the Slack fleet. Customized Helm chart configurations enable Slack to use a single NodePool and EC2NodeClass across more than 200 EKS clusters.
Because Karpenter can choose from many instance types within a family, switching from one instance type to another under dynamic scheduling constraints is straightforward. This eases the burden on infrastructure teams and reduces the risk of instance type changes.
Future plans
Slack's successful deployment of Karpenter marks the beginning of a longer optimization journey, and Slack is currently working to simplify its Karpenter configuration to further improve operations and save even more money:
- Customized kubelet configuration: Slack plans to bypass the Infrastructure-as-Code (IaC) pipeline by configuring kubelet flags directly through Karpenter's EC2NodeClass, thereby reducing instance startup time.
- Warm pools for rapid scale-out: Slack is exploring reducing startup time by having Karpenter pick instances from a warm pool instead of calling the Amazon EC2 Fleet API.
- Disruption control: Enhanced disruption controls minimize the impact of consolidation on application availability and ensure that applications run smoothly even during periods of high concurrency.
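A rough sketch of how the kubelet and disruption-control items could be expressed, assuming Karpenter's v1 APIs; every name, selector tag, and value here is illustrative, not Slack's actual configuration:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-bedrock"     # hypothetical IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "bedrock"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "bedrock"
  amiSelectorTerms:
    - alias: al2023@latest
  # Kubelet flags set here apply at node launch, without a
  # separate IaC pipeline pass.
  kubelet:
    maxPods: 110
    kubeReserved:
      cpu: 200m
      memory: 512Mi
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      # At most 10% of nodes may be voluntarily disrupted at once.
      - nodes: "10%"
      # Block voluntary disruption during business hours (UTC).
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```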
Summary
In this post, we discussed how Slack's Bedrock platform improved the operation of Amazon EKS clusters by transitioning from ASG-based auto scaling to Karpenter for increased scalability, and how Slack leveraged Karpenter as an elastic scaling tool for EKS clusters to streamline infrastructure and reduce costs. Going forward, Slack will focus on building a robust platform by contributing to and leveraging new Karpenter features to further optimize its environment.