In this edition we feature several exciting announcements related to Amazon EKS including but not limited to support for Ultra Scale Clusters, enhancememnts to GuardDuty Threat Detection, Amazon EKS Pod Identity, Amazon ECR enhanced scanning, Amazon EMR on EKS and Amazon Sagemaker HyperPod operator.
New AWS services and features
Amazon EKS now supports up to 100,000 worker nodes per cluster
- Amazon EKS now supports up to 100,000 worker nodes in a cluster, enabling you to run ultra scale AI/ML training and inference workloads in a single cluster.
- The most advanced AI models with trillions of parameters demonstrate significantly enhanced capabilities in understanding context, reasoning, and solving complex tasks. These workloads cannot be easily distributed across multiple clusters.
- These workloads cannot be easily distributed across multiple clusters. This release solves the challenge by enabling you to run ultra scale AI/ML workloads that require all compute accelerators to be available within a single cluster.
Amazon EKS Pod Identity simplifies the experience for cross-account access
- Pod Identity enables applications in your EKS cluster to access AWS resources across accounts through a process called role chaining. Amazon EKS Pod Identity now provides a simplified experience for configuring application permissions to access AWS resources in separate accounts
- With this release, your applications running in the EKS cluster automatically receive the required AWS credentials during runtime without requiring any code changes. To learn more, navigate to documentation here
Amazon EKS add-ons now supports Private CA Connector for Kubernetes
- Private CA Connector for Kubernetes add-on is now generally available.
- AWS Private CA is a managed service that lets you create private certificate authority hierarchies to issue private certificates. With this the new Amazon EKS add-on, you can quickly and easily set up new and existing clusters using automation to leverage AWS Private CA certificates, enhancing security and simplifying certificate management. Previously, this process could take hours or even days and involved numerous manual steps.
Amazon GuardDuty Extended Threat Detection now supports Amazon EKS
- GuardDuty Extended Threat Detection automatically detects multi-stage attacks that span data sources, multiple types of AWS resources, and time, within an AWS account.
- With this launch GuardDuty Extended Threat Detection includes coverage for multi-stage attacks targeting Amazon EKS clusters by correlating multiple security signals across Amazon EKS audit logs, runtime behavior of processes, malware execution, and AWS API activity to detect sophisticated attack patterns.
Amazon ECR enhanced scanning now surfaces image use status
- ECR enhanced scanning is an integration with Amazon Inspector that provides vulnerability scanning for your container images.
- With this launch, ECR now surfaces how an image is used on Amazon EKS - including last used date, the number of clusters that the image was used, and the cluster ARNs.
Amazon EMR on EKS now supports Service Quotas
- Previously, to request an increase for EMR on EKS quotas, such as maximum number of StartJobRun API calls per second, customers had to open a support ticket and wait for the support team to process the increase.
- With this launch, you can now view and manage EMR on EKS quota limits directly in the Service Quotas console, thereby having better visbility and control.
Announcing Amazon SageMaker HyperPod training operator
- Amazon SageMaker HyperPod training operator, a purpose-built Kubernetes extension for resilient foundation model training on HyperPod, is now Generally Available.
- With this launch, you can now accelerate AI model development across hundreds or thousands of GPUs with built-in resiliency, decreasing model training time by up to 40%.
Amazon VPC CNI now supports higher bandwidth and network performance per pod
- By default, the Amazon VPC CNI plugin assigns one elastic network interface (ENI) to each pod. With this launch, you can now attach multiple network interfaces to pods using the Amazon VPC Container Network Interface (CNI) Plugin.
- With maximized network card capacity, application traffic will be distributed across multiple concurrent ENI connections, raising the bandwidth and packet rate performance in your workloads.
AWS blogs
[Blog] Amazon EKS enables ultra scale AI/ML workloads with support for 100K nodes per cluster
- Kubernetes has emerged as a key enabler for customers to run large scale AI/ML workloads. In order to advance artificial general intelligence (AGI) ambitions, we would require training trillion-parameter models, warranting industry-leading scale with Kubernetes conformance.
- This blog article is intended to serve as walkthrough on newly launched Amazon EKS Ultra Scale clusters that support upto 100K Nodes per Cluster.
[Blog] Under the hood: Amazon EKS ultra scale clusters
- Through multiple technical innovations, architectural improvements and open-source collaboration, Amazon EKS has built the next generation of its cluster control plane and data plane for ultra scale, with full Kubernetes conformance.
- This blog article serves as a technical deep dive on Amazon EKS design changes, such as offloading etcd consensus to AWS’s multi-AZ journal system, scheduling optimizations, scaling optimizations that made order-of-magnitude performance improvements possible all while maintaining full Kubernetes conformance.
[Blog] Maximizing GPU Utilization using NVIDIA Run:ai in Amazon EKS
- In the fast-paced world of artificial intelligence and machine learning, GPU resources are both critical and in high demand. Traditionally in Kubernetes, GPUs are allocated statically which often leads to under-utilization of this scarce resource. This in turn leads to unpredictable throughput and latency.
- This blog article serves as walkthrough guide for steps involved in setting up a self-hosted Run:ai deployment option with Amazon EKS for unlocking maximum value from GPU infrastructure.
[Blog] Accelerate generative AI inference with NVIDIA Dynamo and Amazon EKS
- NVIDIA Dynamo is an open-source, low-latency, modular inference framework for serving generative AI models in distributed environments.
- This blog article serves as introduction to NVIDIA Dynamo and explains how to set it up on Amazon EKS for automated scaling and streamlined Kubernetes operations.
[Blog] Deep Dive: Amazon EKS Dashboard for Visibility into Multi-Cluster Operations and Governance
- Without centralized visibility, teams often discover EKS clusters running outdated software, miss security updates, execute unplanned upgrades, and encounter unexpected extended support costs. Amazon EKS Dashboard addresses these challenges by providing a centralized interface to maintain organization-wide visibility across their Kubernetes clusters.
- This blog article serves as walkthrough guide for steps involved in setting up Amazon EKS Dashboard.
[Blog] Amazon EKS Pod Identity streamlines cross account access
- At re:Invent 2023, Amazon EKS introduced the Pod Identity feature, enabling users to configure Kubernetes applications running on Amazon EKS with fine-grained IAM permissions to access AWS resources.
- This blog article serves as guide on enhancements introduced in EKS Pod Identity to streamline cross account resource access for Kubernetes applications.
[Blog] Amazon GuardDuty expands Extended Threat Detection coverage to Amazon EKS clusters
- Traditional monitoring approaches might detect individual suspicious events, but often miss the broader attack pattern that spans across these different data sources and time periods.
- GuardDuty Extended Threat Detection introduces a new critical severity finding type, which automatically correlates security signals across Amazon EKS audit logs, runtime behaviors of processes associated with EKS clusters, malware execution in EKS clusters, and AWS API activity to identify sophisticated attack patterns that might otherwise go unnoticed.
- This blog articles serves as walks through guide on capabilities provided by GuardDuty Extended Threat Detection for EKS.
[Blog] Accelerate Microservices Development with DAPR and Amazon EKS
- Dapr (Distributed Application Runtime) is an open source runtime for building microservices on cloud and edge. Dapr graduated from the Cloud Native Computing Foundation (CNCF) on Oct. 30, 2024, marking a major milestone in the project’s journey.
- This blog article serves as guide on how Dapr simplifies microservices development on Amazon EKS and shows how to deploy a sample application on EKS using Dapr.
[Blog] Volvo Cars Streamlines In-Vehicle Software Testing with AWS Graviton on Amazon EKS
- Volvo Cars recently introduced a new “game-changing” approach to technology, Volvo Cars Superset tech stack.
- This blog article serves as case study on how Volvo Cars has built a scalable Continuous Integration (CI) platform on Amazon EKS and AWS Graviton supporting the Superset tech stack.
Community news and articles
Multi-VPC Load Balancing for Amazon EKS workloads
- The article explores steps to configure the AWS Load Balancer Controller to create AWS ELBs in a different VPC (VPC B) than the one hosting your EKS cluster (VPC A) for enhanced security isolation, network segmentation, and more flexible network design.
Fractional GPU Sharing on Amazon EKS using NVIDIA KAI Scheduler and Accelerated EC2 Instances
- This article explores steps to leverage NVIDIA’s KAI Scheduler for Fractional GPU Sharing on Amazon EKS.
Optimizing Karpenter on EKS: A Guide to Efficient NodePool Configuration Strategies
- This article explores effective strategies for configuring Karpenter NodePools when managing mixed architecture environments.
EKS Auto Mode: collecting logs for services that run locally on the node
- When you use EKS Auto Mode, multiple services run as daemon processes rather than DaemonSets. This article explores steps to configure CloudWatch agent to collect journal logs from all these processes.
Videos and webinars
Open source projects
- Kiro
- The AI IDE for prototype to production. Kiro is an agentic IDE that helps you do your best work with features such as specs, steering, and hooks.