๐ŸŽ New User? Get 20% off your first purchase with code NEWUSER20 ยท โšก Instant download ยท ๐Ÿ”’ Secure checkout Register Now โ†’
Menu

Categories

Kubernetes 1.31 Upgrade Guide: Breaking Changes and a Safe Migration Path

Kubernetes 1.31 Upgrade Guide: Breaking Changes and a Safe Migration Path

Quick summary: Kubernetes 1.31 ("Elli") is a release that rewards teams who do their homework and punishes those who don't. Several long-deprecated APIs are gone, in-tree cloud volume plugins are removed, AppArmor support is GA, and there are subtle changes to authentication, image resolution, and admission control that can break workloads silently. This guide walks through the breaking changes that actually affect production clusters and gives you a step-by-step upgrade path that has worked across hundreds of clusters in 2025-2026.


Why 1.31 Is Not a "Just Bump the Version" Upgrade

If you have been on Kubernetes 1.27 or 1.28, the jump to 1.31 crosses several deprecation cliffs. The release itself is well-engineered, with no broad regressions, but the cumulative removals from 1.29 and 1.30 catch up here. The clusters that get into trouble are the ones running workloads with old APIs that nobody noticed because the deprecation warnings never made it into anyone's dashboard.

Specifically, the things that bite teams are:

  • The in-tree GCE Persistent Disk and AWS EBS volume plugins are gone. Workloads still using them will fail to mount.
  • The beta flow-control APIs (flowcontrol.apiserver.k8s.io FlowSchema and PriorityLevelConfiguration) are removed along the way — v1beta2 disappeared in 1.29 — so CRDs and operators that still reference them will fail to install.
  • The deprecated kubelet command-line flags for image GC are removed in favor of config-file equivalents.
  • Pod log queries via the API now require explicit RBAC for new sub-resources.
  • Several admission webhooks need updates to handle the structured authentication configuration format.

None of these are individually catastrophic. Collectively, on a cluster that has been running and accumulating cruft for a few years, they can produce a deeply unpleasant upgrade weekend.

The "Discovery Pass" โ€” What to Audit Before You Plan the Upgrade

Before you write the upgrade plan, run a discovery pass on your cluster to find what is actually at risk. The single most useful tool here is kube-no-trouble (kubent), which scans all your manifests, Helm releases, and live cluster state for deprecated and removed APIs:

# Install (Linux x86_64); the git.io shortlink from older docs has been retired
sh -c "$(curl -sSL https://raw.githubusercontent.com/doitintl/kube-no-trouble/master/scripts/install.sh)"

# Run against the live cluster
kubent --target-version 1.31

# Scan Helm releases (Helm v3 collection is on by default; the flag makes it explicit)
kubent --helm3

The output is a tidy list of deprecated APIs, where they live, and what version they were removed in. This is the input to your migration backlog.

Beyond kubent, also check:

  • StorageClass usage: list all PVs/PVCs and group by provisioner (see the snippet after this list). Anything with kubernetes.io/gce-pd or kubernetes.io/aws-ebs needs migration to the CSI driver before 1.31.
  • Admission webhooks: every webhook that observes Pods, Services, or PSP-replacement resources needs review. Fail-closed webhooks that cannot handle new fields will break apiserver requests.
  • Operator versions: every operator (Prometheus, cert-manager, ArgoCD, Crossplane, the ones you wrote yourself) has a version compatibility matrix. Map them out.
  • Custom controllers and CRDs: anything you maintain in-house needs the same audit treatment as third-party code.

The Big Removals in Detail

1. In-Tree Cloud Provider Volumes Are Gone

Kubernetes has been pushing storage out of the core for years. The CSI (Container Storage Interface) project replaced the in-tree drivers, and 1.31 finishes the migration:

  • kubernetes.io/gce-pd: removed. Use pd.csi.storage.gke.io.
  • kubernetes.io/aws-ebs: removed. Use ebs.csi.aws.com.
  • kubernetes.io/azure-disk and kubernetes.io/azure-file: removed. Use the corresponding CSI drivers.
  • kubernetes.io/cinder (OpenStack): removed. Use cinder.csi.openstack.org.
  • kubernetes.io/vsphere-volume: removed. Use the vSphere CSI driver.

The migration story has two parts: the StorageClass migration (which CSI driver issues new volumes) and the existing volume migration (existing PVs keep working via CSI translation, but should be migrated long-term). For most teams the action items are:

  1. Install the CSI driver for your cloud, if you have not already.
  2. Create new StorageClasses pointing at the CSI provisioner.
  3. Update workload manifests to use the new StorageClass for any newly-created PVCs.
  4. Schedule a migration window for existing PVs (out of scope for the version upgrade itself).
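
As a sketch of step 2, here is what a CSI-backed StorageClass looks like on AWS; the class name gp3-csi and the gp3 volume type are illustrative choices, not requirements:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-csi                     # illustrative name
provisioner: ebs.csi.aws.com        # the CSI driver, not kubernetes.io/aws-ebs
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true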

2. AppArmor Goes GA

AppArmor profile support has been beta since the dawn of time. In 1.31 it is finally GA, and the old beta annotation is superseded by a first-class field:

  • Old (still works in 1.31): container.apparmor.security.beta.kubernetes.io/myapp: localhost/myprofile
  • New (preferred): set the appArmorProfile field in the Pod-level or container-level securityContext

If you are running on Ubuntu, Debian, or any AppArmor-enabled distro, this is the moment to standardize on profile usage rather than relying on the default unconfined behavior. AppArmor profiles for common base images (nginx, postgres, redis) are widely available and easy to load via DaemonSet.
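
A minimal sketch of the GA field, assuming a profile named k8s-nginx has already been loaded on the node (the profile name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo
spec:
  containers:
  - name: myapp
    image: nginx
    securityContext:
      appArmorProfile:
        type: Localhost              # or RuntimeDefault / Unconfined
        localhostProfile: k8s-nginx  # must already be loaded on the node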

3. Structured Authentication Configuration

API server authentication used to be a tangle of CLI flags. The --authentication-config file (beta in 1.30, expanded in 1.31) replaces it with a structured YAML format that supports multiple JWT issuers, claim mapping, and claim validation rules.

If you authenticate users against multiple identity providers (e.g., Workspace for engineers, a separate IdP for contractors), this is a significant operational improvement. The migration is opt-in for now, but the legacy flags are scheduled for deprecation in 1.32.
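
A sketch of what the structured file looks like under the 1.31 beta API; the issuer URL, audience, and claim names are placeholders for your own IdP:

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://idp.example.com     # placeholder issuer
    audiences:
    - my-cluster                     # placeholder audience
  claimMappings:
    username:
      claim: email
      prefix: ""
  claimValidationRules:
  - claim: hd                        # placeholder claim check
    requiredValue: example.com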

4. Image Volume Source (Alpha)

1.31 adds the ImageVolume source (alpha, behind the ImageVolume feature gate): you can mount an OCI image as a read-only volume into a Pod. This is genuinely useful for shipping ML models, datasets, and config bundles as immutable, content-addressable artifacts. Expect to see this used heavily in 2026 ML platforms.
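
A sketch of the volume type, assuming the feature gate is enabled; the artifact reference is a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: model
      mountPath: /models
      readOnly: true
  volumes:
  - name: model
    image:
      reference: registry.example.com/models/bert:v1   # placeholder OCI artifact
      pullPolicy: IfNotPresent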

The Step-by-Step Upgrade Path

Step 0: Update kubectl on every operator workstation

kubectl is supported within one minor version of the API server, in either direction. If your engineers are running 1.27 kubectl against a 1.31 server, weird API mismatches will eventually bite. Standardize on a kubectl version that matches the new server.
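
One low-tech way to spot skew, assuming jq is available on the workstation:

# Compare client and server versions before you start
kubectl version --output=json | jq -r '.clientVersion.gitVersion, .serverVersion.gitVersion'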

Step 1: Upgrade the control plane on staging

Always control-plane first, then nodes. The new control plane supports kubelets up to three minor versions older (the skew policy widened in 1.28); the reverse is not true. On a managed Kubernetes service:

  • EKS: aws eks update-cluster-version --kubernetes-version 1.31
  • GKE: use the Cloud Console, or gcloud container clusters upgrade
  • AKS: az aks upgrade --kubernetes-version 1.31

On self-managed kubeadm clusters, the standard kubeadm upgrade plan / kubeadm upgrade apply dance still works.
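
For reference, the sequence on the control plane looks like this; the target patch shown echoes the conservative "1.31.2 or later" advice in the FAQ below:

# After upgrading the kubeadm binary itself, preview what will change
kubeadm upgrade plan

# Apply on the first control-plane node
kubeadm upgrade apply v1.31.2

# On each remaining control-plane node
kubeadm upgrade node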

Step 2: Soak the staging control plane for 48-72 hours

Run your normal CI pipelines against staging. Watch the API server error rate, etcd latency, and webhook timeout metrics. Most webhook compatibility issues surface within hours of the upgrade, not weeks.

Step 3: Upgrade nodes in batches

Drain a single node, replace its kubelet (or replace the whole node, which is the cattle-not-pets pattern), uncordon, and watch your workload spread back. Roll one node at a time on staging, then in larger batches in production.
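
The per-node loop, sketched with kubectl (replace <node> with the node name):

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# ...upgrade the kubelet, or replace the node entirely...
kubectl uncordon <node>
# Confirm workloads have spread back onto the node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>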

Critical kubelet config items that often need adjustment in 1.31:

  • Image GC thresholds: the CLI flags are gone; use imageGCHighThresholdPercent and friends in the kubelet config file (see the fragment after this list).
  • cgroup driver: should be systemd; if you are still on cgroupfs, fix it during this upgrade.
  • Container runtime: confirm containerd is at a recent release; older versions have CRI compatibility quirks.
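
A KubeletConfiguration fragment covering the first two items; the threshold values here are common choices, not mandates:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # replaces the removed --image-gc-high-threshold flag
imageGCLowThresholdPercent: 80    # replaces --image-gc-low-threshold
cgroupDriver: systemd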

Step 4: Upgrade and validate operators

Order matters. Where possible, upgrade the operators you manage before the cluster itself. The critical ones to check first:

  • cert-manager: must be on a 1.31-compatible version before you upgrade. Old versions break ingress.
  • Ingress controllers (nginx, Traefik, HAProxy): same deal.
  • External-DNS: verify CRD compatibility.
  • ArgoCD / Flux: run them through their own upgrade story; they sometimes lag.
  • Prometheus operator and the kube-prometheus-stack: these have their own per-cluster-version compatibility matrix.

Step 5: Run the full integration test suite

Even if you do not have a formal e2e suite, run the workloads end-to-end in staging for at least a full business cycle (24 hours minimum). Things that surface here that you will miss in unit tests:

  • Long-running cron jobs whose pod templates use deprecated APIs.
  • Webhook timeouts under sustained load.
  • Network policy regressions when CNI plugins are also updated.
  • Storage class migration paths under real-world PVC churn.

Step 6: Production rollout

Mirror your staging steps. Have a rollback plan ready (snapshots of the etcd state if you self-manage; managed-service rollback procedures documented and rehearsed if you do not). Communicate the upgrade window to application teams in advance โ€” even a clean upgrade can cause brief connectivity blips on workloads with aggressive readiness probes.

Things That Will Surprise You

Pod log RBAC

The new Pod log API has slightly different RBAC requirements. If your engineers cannot tail logs after the upgrade, the fix is usually adding the pods/log sub-resource to the developer Role. Yes, you should have had this all along.
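
The fix, sketched as a namespaced Role (the Role name and namespace are placeholders):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-logs
  namespace: my-team        # placeholder namespace
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]   # pods/log is the sub-resource engineers need to tail logs
  verbs: ["get", "list", "watch"]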

HPA behavior under metrics-server changes

If you upgrade metrics-server alongside the cluster (which you should), HPA decisions can briefly become noisier as the metrics pipeline restarts. Some teams temporarily raise stabilizationWindowSeconds during the upgrade window.
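
A sketch of that bump on an autoscaling/v2 HPA; 600 seconds is an illustrative value (the scale-down default is 300), and the workload names are placeholders:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # temporarily widened for the upgrade window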

Audit log volume

1.31 adds a few new audited operations by default. If your audit log forwarding is sized to the byte, you may see disk-full alerts on the API server nodes. Right-size before you upgrade.

A Realistic Timeline: What a 50-Cluster Fleet Looks Like

To make this concrete, here is the upgrade plan we recently helped a mid-sized fintech execute across 50 production Kubernetes clusters running on three different cloud providers. Their starting point was a mix of 1.28 (60% of clusters) and 1.29 (40%), with the usual collection of CRDs, operators, and homegrown controllers accumulated over four years.

Weeks 1-2: Discovery. They ran kubent against every cluster, exported the results into a single spreadsheet, and grouped findings by remediation owner. About 70% of the deprecated API usage came from third-party Helm charts that had upstream fixes available; the remaining 30% was internal code that needed PRs from the platform team.

Weeks 3-6: Remediation in parallel. Helm charts were bumped via Renovate PRs. Internal controllers were updated and rolled out to staging clusters. The CSI migration story was resolved by installing the EBS, GCE PD, and Azure Disk CSI drivers everywhere and creating new StorageClasses, while leaving existing PVs to the CSI translation layer.

Weeks 7-8: Staging upgrade marathon. Each of their three staging clusters (one per cloud) was walked from 1.28 → 1.29 → 1.30 → 1.31, with at least 48 hours of soak time between each minor. They caught two real issues here: a CronJob that used a now-removed PSP shim, and a custom admission webhook that did not handle the new ResourceClaim object type cleanly. Both were fixable in a sprint.

Weeks 9-14: Production rollout in batches. Five clusters per week, starting with the lowest-traffic ones. Each cluster was walked through the same minor-by-minor upgrade. They allocated a four-hour change window per cluster (most finished in 90 minutes; the buffer was for diagnosis and rollback if needed). Total wallclock time: six weeks of careful rollout, zero customer-impacting incidents.

The lesson is not "everyone needs a 14-week project." Many smaller teams do this in two weeks. The lesson is that methodical staging is what makes the production rollout boring, and boring is the goal.

Frequently Asked Questions

Can I skip versions, e.g., 1.28 directly to 1.31?

On managed services, generally no; you have to walk one minor at a time. On self-managed kubeadm, the same constraint applies: kubeadm enforces single-minor upgrades. Plan a multi-week sequence, not a single-weekend leap.

How long should I wait after release before upgrading production?

Two patch releases is the conservative answer (so 1.31.2 or later). Most large platform teams wait for at least one quarterly maintenance window before adopting a new minor.

What happens to my old PVs after the in-tree drivers are removed?

For the cloud providers above, the CSI Migration feature provides translation: existing PVs continue to work through the CSI driver. New PVs should be created using CSI StorageClasses. Long-term, you want to migrate the underlying volumes to native CSI, but it is not mandatory at upgrade time.

Do I still need PSPs?

PodSecurityPolicy was removed in 1.25. The replacement is Pod Security Admission, which is built in. If you are upgrading from a cluster that still had PSP shims, those shims are now gone โ€” make sure you have PSA configured before the upgrade.
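
Checking PSA comes down to confirming the enforcement labels exist on each namespace; a restricted example (the namespace name is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: payments                                   # placeholder namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted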

Is Gateway API ready for production in 1.31?

Yes. Gateway API v1.1 ships with broad implementation support (Cilium, Istio, Envoy Gateway, Contour, Traefik). For new ingress deployments in 2026, Gateway API is the right choice. For existing Ingress deployments, the migration is a separate project โ€” not coupled to the 1.31 upgrade.


The Bottom Line

Kubernetes 1.31 is a manageable upgrade if you do the discovery work first. Run kubent against your cluster, audit your storage and webhook landscape, upgrade staging at least a week before production, and roll nodes in batches. Skip the discovery work and you will be the on-call engineer who learns about a deprecated API at 3 AM on a Saturday.

The good news is that the upgrade path is well-trodden. By Q3 2026, 1.31 will be the most common production version on managed Kubernetes services. Doing the upgrade now, deliberately and with a clean rollback plan, beats doing it under support-deadline pressure six months from now.
