During the lifecycle of a long-running GKE cluster, workloads experience periodic disruptions because of infrastructure interruptions that Google Cloud issues. These automatic events can occur in response to scheduling decisions (preemption events), control plane or node updates, including GKE node auto-upgrades (maintenance events), or remediation of detected issues (termination events).
This page helps you understand what node disruption means in GKE, monitor maintenance notifications, and minimize disruption impact in your GKE nodes with attached GPUs and TPUs.
This document is for Platform admins and operators who manage the lifecycle of the underlying tech infrastructure. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
What does infrastructure interruption mean in GKE?
Your GKE clusters manage the lifecycle of the GKE nodes. These nodes are provisioned on Compute Engine VMs, which periodically experience the following interruptions:
- Remediation of detected issues (`TerminationEvent`): these events occur because Google Cloud detects an issue and interrupts your cluster infrastructure. `TerminationEvent` events don't support graceful shutdown. `TerminationEvent` events are triggered by the following issues:
  - Auto repair: occurs when GKE repairs a node after repeated failed health checks.
  - HostError: occurs when a hardware or software error on the physical machine causes the VM to stop.
- Maintenance or upgrade events (`MaintenanceEvent`): these events occur when Google Cloud needs to interrupt a VM to perform maintenance. `MaintenanceEvent` events are triggered by the following maintenance tasks:
  - Maintenance events: occur when Google Cloud upgrades the underlying host.
  - Node updates, which include node auto-upgrades: occur when GKE updates the version of Kubernetes running on the node.

  For more information about how you and GKE manage changes during the lifecycle of a cluster, see Types of changes.
- Response to scheduling decisions (`PreemptionEvent`): these events occur when Google Cloud needs to preempt VMs to make capacity available for higher-priority resources. `PreemptionEvent` events can be any of the following:
  - Eviction: occurs when preemptible or Spot infrastructure is preempted to accommodate a higher-priority VM.
  - Defragmentation: occurs when GKE preempts a smaller TPU slice to accommodate a larger TPU slice. Defragmentation occurs only on TPU slices.
During the lifecycle of a long-running GKE cluster, the nodes might experience periodic disruptions to training or serving workloads. When these disruptions affect your GKE nodes that run AI/ML workloads, GKE needs to restart both the running workloads and the underlying node.
Why GPUs and TPUs require interruption management
Most Compute Engine VMs, with some exceptions, have their host maintenance policy set to live migrate, which means that running workloads typically experience little to no disruption. However, certain classes of VMs don't support live migration, including VMs with attached GPUs and TPUs. When a host event happens to a VM within a TPU slice, the entire slice is interrupted and then rescheduled, because all maintenance events are coordinated at the slice level. So if you create a TPU slice that has hundreds of VMs, all of those VMs receive the same maintenance event schedule.
When a host event occurs, GKE terminates the node and its Pods. If the Pods are deployed as part of a larger workload, like a Job or Deployment, GKE restarts the Pods on the affected node.
It is up to you, or the frameworks that you use, to handle the workload configuration to react appropriately to maintenance events. For example, you can save the state of your AI training job to reduce data loss.
To manage disruption on AI/ML workloads, you can do the following:
- Monitor node and node pool interruptions
- Monitor maintenance notifications
- Minimize disruption impact
Monitor node interruptions
The following GKE system metric reports the count of interruptions for a GKE node since the last sample (the metric is sampled every 60 seconds):
kubernetes.io/node/interruption_count
The `interruption_type` field (such as `TerminationEvent`, `MaintenanceEvent`, or `PreemptionEvent`) and the `interruption_reason` field (such as `HostError`, `Eviction`, or `AutoRepair`) can help explain why a node was interrupted.
To get a breakdown of the interruptions and their causes in TPU nodes in the clusters in your project, use the following PromQL query:
```promql
sum by (interruption_type,interruption_reason)(
  sum_over_time(
    kubernetes_io:node_interruption_count{monitored_resource="k8s_node"}[${__interval}]))
```
To see only the host maintenance events, update the query to filter for the `HW/SW Maintenance` value in the `interruption_reason` field. Use the following PromQL query:
```promql
sum by (interruption_type,interruption_reason)(
sum_over_time(
kubernetes_io:node_interruption_count{monitored_resource="k8s_node", interruption_reason="HW/SW Maintenance"}[${__interval}]))
```
To see the interruption count aggregated by node pool, use the following PromQL query:
```promql
sum by (node_pool_name,interruption_type,interruption_reason)(
  sum_over_time(
    kubernetes_io:node_pool_interruption_count{monitored_resource="k8s_node_pool", interruption_reason="HW/SW Maintenance", node_pool_name="NODE_POOL_NAME"}[${__interval}]))
```
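These queries are written for an environment that resolves the `${__interval}` variable for you, such as a dashboard. If you want to run one of them programmatically, a minimal sketch using the Cloud Monitoring Prometheus-compatible query API might look like the following; replace PROJECT_ID with your project ID, and note that the interval variable is swapped for a fixed one-hour window:

```bash
# Minimal sketch: run the interruption-count query through the Cloud Monitoring
# Prometheus-compatible query API. The caller needs a role with monitoring read
# permission (for example, roles/monitoring.viewer).
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'query=sum by (interruption_type,interruption_reason)(sum_over_time(kubernetes_io:node_interruption_count{monitored_resource="k8s_node"}[1h]))' \
  "https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus/api/v1/query"
```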
Monitor maintenance notifications
Compute Engine issues notifications when nodes and their underlying VMs are scheduled for disruptive host events, and when these events become active. The notifications include information about planned start time, the type of event, and other details.
On GKE version 1.31.1-gke.2008000 and later, you can monitor upcoming maintenance events, including the events that are described in this section.
Upcoming maintenance is scheduled but not active
Before a VM with attached GPUs or TPUs has a scheduled maintenance event, Compute Engine pushes out notifications to all of its VMs. These notifications report the start of the maintenance window. When an upcoming maintenance is scheduled for the VM but not yet active, GKE adds the `scheduled-maintenance-time` label to the node.
To query these notifications at the node level, run the following command:
kubectl get nodes -l cloud.google.com/scheduled-maintenance-time \
-L cloud.google.com/scheduled-maintenance-time
The output is similar to the following:
NAME                          STATUS    SCHEDULED-MAINTENANCE-TIME
<gke-accelerator-node-name>   Ready     1733083200
<gke-accelerator-node-name>   Ready     1733083200
[...]
The `SCHEDULED-MAINTENANCE-TIME` column represents the scheduled time in seconds, displayed in Unix epoch time format.
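For example, you can convert a value from this column to a human-readable timestamp with GNU `date`:

```bash
# Convert the epoch value from the sample output to a human-readable UTC time
# (GNU coreutils date, as found on most Linux distributions).
date -u -d @1733083200
```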
To query these notifications at the node metadata level, check instances for a maintenance event notification.
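For example, the following sketch reads the `maintenance-event` metadata key from inside the node; it assumes access to the Compute Engine metadata server, and the value is `NONE` when no host event is pending:

```bash
# Query the Compute Engine metadata server for the current maintenance event.
# Run this from the node itself, or from a Pod that can reach the metadata server.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"
```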
Scheduled maintenance starts
For accelerator-optimized machine families that support advanced maintenance, you can access the `upcoming-maintenance` endpoint, which provides information about scheduled and started maintenance events.
When scheduled maintenance starts, Compute Engine updates the metadata in the http://metadata.google.internal/computeMetadata/v1/instance/attributes/ directory. Compute Engine updates the metadata entries as follows:
- Sets `maintenance-event` to `TERMINATE_ON_HOST_MAINTENANCE`.
- In `upcoming-maintenance`, sets `maintenance_status` to `ONGOING`.
GKE gracefully evicts Pods and terminates workloads within the limited, predefined time of the maintenance notification window.
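To check these values yourself, you can read the `upcoming-maintenance` metadata entry from the node; the following is a sketch that assumes access to the Compute Engine metadata server:

```bash
# Read the upcoming-maintenance entry to see the scheduled window and the
# maintenance_status field (for example, PENDING or ONGOING).
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance"
```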
Minimize disruption impact
To minimize the impact of node disruption, you can manually start a host maintenance event.
If you don't start a maintenance event, Compute Engine will complete the regularly scheduled maintenance.
Manually start a host maintenance event
When Compute Engine issues a notification about a scheduled maintenance event, you can manually start maintenance at a time that aligns with your operational schedule, for example, during periods of reduced activity.
On a node in the node pool, set the node label `cloud.google.com/perform-maintenance` to `true`. For example:
kubectl label nodes <node-name> cloud.google.com/perform-maintenance=true
With the perform-maintenance action, GKE gracefully evicts Pods and terminates workloads before the maintenance event starts. The duration between applying the label and the start of maintenance varies.
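For example, a minimal sketch that applies the label to every node that currently reports a scheduled maintenance window:

```bash
# Trigger maintenance now on every node that has a scheduled maintenance window.
for node in $(kubectl get nodes -l cloud.google.com/scheduled-maintenance-time \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl label node "$node" cloud.google.com/perform-maintenance=true
done
```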
Configure GKE to terminate your workloads gracefully
In this section, you configure GKE to manage your application lifecycle and minimize the disruption to your workload. If you don't configure a grace period, the grace period defaults to 30 seconds.
GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state. GKE sends a `SIGTERM` signal to Pods at the beginning of the grace period. If Pods don't exit by the end of the grace period, GKE sends a follow-up `SIGKILL` signal to any processes still running in any container in the Pod.
To configure the graceful termination period, set the termination grace period (seconds) in the `spec.terminationGracePeriodSeconds` field of your Pod manifest. For example, to set a termination grace period of 10 minutes, set the `spec.terminationGracePeriodSeconds` field in your Pod manifest to 600 seconds, as follows:
spec:
terminationGracePeriodSeconds: 600
We recommend that you set a termination grace period that is long enough for any ongoing
tasks to finish within the notification timeframe.
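For reference, the following is a minimal sketch of a complete Pod manifest that sets this field; the Pod name, image, and GPU request are illustrative placeholders:

```bash
# Illustrative Pod that requests one GPU and allows up to 10 minutes for
# graceful termination. Replace the image with your own training image.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  terminationGracePeriodSeconds: 600
  containers:
  - name: trainer
    image: us-docker.pkg.dev/PROJECT_ID/REPO/trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```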
If your workload uses an ML framework such as MaxText, Pax, or JAX with Orbax, the workload can capture the shutdown `SIGTERM` signal and initiate a checkpointing process. To learn more, see TPU Autocheckpoint.
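If your framework doesn't handle the signal for you, a minimal sketch of a container entrypoint that forwards `SIGTERM` to the training process so it can checkpoint might look like the following; `python train.py` is a placeholder for your own training command:

```bash
#!/bin/bash
# Illustrative entrypoint: forward SIGTERM from GKE to the training process so
# that it can write a final checkpoint before the grace period expires.
# "python train.py" is a placeholder for your training command.
python train.py &
TRAIN_PID=$!

forward_sigterm() {
  echo "SIGTERM received; forwarding to the trainer so it can checkpoint"
  kill -TERM "$TRAIN_PID"
}
trap forward_sigterm TERM

wait "$TRAIN_PID"
STATUS=$?
# If wait was interrupted by the trap, wait again for the trainer to finish.
if [ "$STATUS" -gt 128 ]; then
  wait "$TRAIN_PID"
  STATUS=$?
fi
exit "$STATUS"
```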
Process of graceful termination
When a disruption event begins, whether it's triggered manually or automatically by the VM, Compute Engine signals the impending machine shutdown by updating the `maintenance-event` metadata key. In both cases of impending node shutdown, GKE starts graceful termination.
The following workflow shows how GKE executes graceful node termination when there is an impending node shutdown:
- Within 60 seconds, the following occurs:
  - The system components apply the `cloud.google.com/active-node-maintenance` node label set to `ONGOING` to indicate that workloads are being stopped.
  - GKE applies a node taint to prevent new Pods from being scheduled on the node. The taint has the `cloud.google.com/impending-node-termination:NoSchedule` key. We recommend that you don't modify your workloads to tolerate this taint, because of the known termination that occurs.
- The maintenance-handler component begins to evict Pods, first evicting workload Pods and then evicting system Pods (for example, kube-system).
- GKE sends a `SIGTERM` shutdown signal to workload Pods that are running on the node to alert them of an imminent shutdown. Pods can use this alert to finish any ongoing tasks. GKE makes a best effort to terminate these Pods gracefully.
- After eviction finishes, GKE updates the value of the `cloud.google.com/active-node-maintenance` label to `terminating` to indicate that the node is ready to terminate.
Afterwards, the node termination occurs and a replacement node is allocated. GKE clears the labels and taints when the process is finished. To increase the termination window for your workloads using GPUs or TPUs, complete the steps in the Manually start a host maintenance event section.
Monitor the progress of an active graceful termination
You can filter the GKE logs by the following graceful termination events:
- When the VM detects a disruption due to an impending node termination, such as a Compute Engine host maintenance event, GKE sets the `cloud.google.com/active-node-maintenance` label to `ONGOING` when workloads are being stopped, and to `terminating` when the workloads are finished and the node is ready to terminate.
- When restricting new workloads from being scheduled, GKE applies the `cloud.google.com/impending-node-termination:NoSchedule` taint.
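In addition to the logs, you can check the current state of these markers directly on the nodes; for example:

```bash
# List nodes that are currently in graceful termination and show the label value.
kubectl get nodes -l cloud.google.com/active-node-maintenance \
  -L cloud.google.com/active-node-maintenance

# Show any taints, including impending-node-termination, on a specific node.
# Replace NODE_NAME with the name of the affected node.
kubectl get node NODE_NAME -o jsonpath='{.spec.taints}'
```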