I went to KubeCon 2025 this year, over in Atlanta. The main talks I attended were on Backstage and developer experience, as well as AI. As everywhere else, AI was a popular topic at the conference, and it showed up at nearly every level of the platform: coding agents, SRE tooling, FinOps, and how to host and use these new tools. The hosting and security questions came up for both hosted foundation models from the usual suspects and self-hosted models.
FinOps is definitely a significant concern, given rising cloud costs. I lost count of the vendors who stopped me to ask about FinOps, spend, and how we combat and contain it.
One question I wanted answered during this trip was: when companies implemented Backstage, what exactly did it solve for them? The answer did not actually come from a talk, as the talks varied quite a bit in what they were solving for. It came at a dinner with platform staff from a large airline. Their answer was that Backstage acted as glue between the disparate pieces of their platform teams, where each niche team otherwise needed to be brought in to get things up and running. This was especially useful for spinning up new environments. For instance, when asked, they could spin up a complete environment for a service in about 10 minutes.
This included the following (sketched roughly in code after the list):
- Identity and access
- Networking for public or private access
- DNS entries to make the service accessible
- Infrastructure for a basic service
- A template for a basic service (front end and back end)
- CI/CD pipelines for the service
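To make the “glue” idea concrete, here is a minimal sketch of what a single golden-path entry point chaining those steps might look like. This is not the airline’s implementation or a Backstage API; every helper name below is a hypothetical stand-in for whatever integration each platform team actually owns.

```typescript
// Hypothetical sketch: one entry point that chains each platform team's step.
// None of these helpers come from Backstage itself or from any real setup.

interface ServiceRequest {
  name: string;
  visibility: "public" | "private";
  owners: string[]; // groups that get identity and access
}

interface ProvisionResult {
  step: string;
  detail: string;
}

// Each function stands in for a real integration owned by a different team.
async function provisionIam(req: ServiceRequest): Promise<ProvisionResult> {
  return { step: "identity-and-access", detail: `roles granted to ${req.owners.join(", ")}` };
}
async function configureNetworking(req: ServiceRequest): Promise<ProvisionResult> {
  return { step: "networking", detail: `${req.visibility} ingress configured` };
}
async function createDnsRecord(req: ServiceRequest): Promise<ProvisionResult> {
  return { step: "dns", detail: `${req.name}.example.internal created` };
}
async function createInfrastructure(_req: ServiceRequest): Promise<ProvisionResult> {
  return { step: "infrastructure", detail: "base compute and datastore provisioned" };
}
async function scaffoldService(_req: ServiceRequest): Promise<ProvisionResult> {
  return { step: "template", detail: "front-end and back-end skeletons generated" };
}
async function createPipelines(_req: ServiceRequest): Promise<ProvisionResult> {
  return { step: "ci-cd", detail: "build and deploy pipelines registered" };
}

// The "glue": one golden-path call that runs every team's step in order.
export async function spinUpEnvironment(req: ServiceRequest): Promise<ProvisionResult[]> {
  const steps = [
    provisionIam,
    configureNetworking,
    createDnsRecord,
    createInfrastructure,
    scaffoldService,
    createPipelines,
  ];
  const results: ProvisionResult[] = [];
  for (const step of steps) {
    results.push(await step(req)); // sequential: later steps may depend on earlier ones
  }
  return results;
}
```

The point is that the developer only ever calls one thing; the sequencing across teams is the part this kind of tooling standardizes.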
So the biggest takeaway is that Backstage serves as the bridge between teams so that developer experience is delivered consistently. This seems to be the bridge needed in larger orgs, where the work of creating the platform experience is spread across multiple teams and timelines. Giving the end user a one-stop shop to get onto the golden path is the deliverable. Overall, it seems more of a tool for fixing team spread in an organization where each team creates a single product that serves the user.
Feature Flags and Observability
Feature flags control many of the flows our users go through, as well as how those flows are deployed. We need to treat flags as first-class observable resources and include them in any telemetry data we collect, since they have a high chance of affecting how services behave. Just as a correlation ID is used to chain events in our traces, we should also record the feature flags evaluated when an event occurs.
One interesting comment from a talk on this was: “Observability shows symptoms but hides the diagnosis”.
Similar to how we add annotations to services when deployments or version changes are made, we should also add annotations when feature flags are toggled. This enables telemetry-driven routing, so we can better understand the underlying changes in a system rather than searching for an issue only to find it was caused by a flag change that wasn’t readily visible in our observability tools. Feature flags need to be given first-class treatment.
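As a small sketch of what that instrumentation could look like at a call site, here is a flag evaluation recorded on the active span with the OpenTelemetry JS API. The flag client is a stand-in, the attribute names mirror OTEL’s (still-evolving) feature-flag semantic conventions, and the service and flag names are made up for the example.

```typescript
import { trace } from "@opentelemetry/api";

// Stand-in flag client for the sketch; a real provider SDK goes here.
const flags = {
  isEnabled: (_key: string, userId: string): boolean => userId.endsWith("1"),
};

const tracer = trace.getTracer("checkout-service");

export function checkout(userId: string): string {
  return tracer.startActiveSpan("checkout", (span) => {
    try {
      const useNewFlow = flags.isEnabled("new-checkout-flow", userId);

      // Record the evaluation on the span so traces can be sliced by flag state.
      span.addEvent("feature_flag", {
        "feature_flag.key": "new-checkout-flow",
        "feature_flag.provider_name": "in-house-flags", // assumption for the sketch
        "feature_flag.variant": useNewFlow ? "on" : "off",
      });

      return useNewFlow ? `new-flow:${userId}` : `legacy-flow:${userId}`;
    } finally {
      span.end();
    }
  });
}
```

With the event on the span, any downstream trace view can be filtered or grouped by flag variant.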
Treating feature flags as first-class observable resources also lets us better monitor progressive delivery. If you can slice your services’ telemetry by feature flag, you can see during rollouts exactly which traffic each flag affects. If traffic is rolling out and you see a spike in error rates during the deployment, you can confirm the errors are isolated to your test group before expanding the rollout. It is much easier to spot that 100% of the traffic going to 1% of your users is erroring out than to look at your dashboard and conclude that, out of all traffic, you only have a 1% error rate. This isn’t a new idea, and I have seen it implemented before, but bright ideas don’t always make it into practice everywhere, and we could use the reminder. Using context-aware feature flags can help minimize the downsides of rollouts, and since feature flags toggle so fast, this can also help reduce any downtime.
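A toy illustration of that last point, with a made-up event shape: failures that disappear into an overall error rate jump out immediately once you group by flag variant.

```typescript
// Each record is a request annotated with the flag variant it was served under.
interface RequestEvent {
  flagVariant: "on" | "off";
  isError: boolean;
}

function errorRate(events: RequestEvent[]): number {
  if (events.length === 0) return 0;
  return events.filter((e) => e.isError).length / events.length;
}

// Overall rate vs. per-cohort rates for the same set of events.
export function rolloutHealth(events: RequestEvent[]) {
  return {
    overall: errorRate(events),
    onCohort: errorRate(events.filter((e) => e.flagVariant === "on")),
    offCohort: errorRate(events.filter((e) => e.flagVariant === "off")),
  };
}

// If 1% of requests have the flag on and every one of them fails,
// `overall` reads ~0.01 while `onCohort` reads 1.0 — the signal you actually want.
```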
One project that came up during the talks was OpenFeature, which has OTEL support. This will be a good place to start, as most observability platforms have already been onboarding OTEL.
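OpenFeature’s hook mechanism is the natural attachment point for the kind of instrumentation above (there are official OTEL hooks in the OpenFeature contrib repositories). As a rough sketch only, and treating the exact package and type names below as assumptions about the Node server SDK, a hand-rolled hook might look like this:

```typescript
import { trace } from "@opentelemetry/api";
// Package and type names assumed from the OpenFeature Node server SDK.
import { OpenFeature, type Hook } from "@openfeature/server-sdk";

// A hook that stamps every flag evaluation onto the currently active span.
const spanFlagHook: Hook = {
  after(_hookContext, details) {
    const span = trace.getActiveSpan();
    span?.addEvent("feature_flag", {
      "feature_flag.key": details.flagKey,
      "feature_flag.variant": details.variant ?? String(details.value),
    });
  },
};

// Register globally so every client evaluation is recorded automatically.
OpenFeature.addHooks(spanFlagHook);

export async function handleRequest(userId: string): Promise<boolean> {
  const client = OpenFeature.getClient();
  // The evaluation shows up in the active trace via the hook above.
  return client.getBooleanValue("new-checkout-flow", false, { targetingKey: userId });
}
```

Registering the hook once means flag evaluations land in traces without touching individual call sites.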
Agentic Runbooks
David Von Thenen’s talk on agentic runbooks was quite interesting. The premise was using an agentic approach to work through a “runbook”, in this case the CISA Kubernetes Hardening Guide, to validate whether a cluster follows best practices. The code run during this presentation can be found here. The setup included RAG (over the hardening guide PDF), MCP, and k8sgpt. This enabled a self-contained system in which you could ask k8sgpt to review the cluster and determine actions, based on the hardening guidelines, to improve the cluster’s security.
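The linked repo has the actual implementation; as a rough mental model of the loop only (nothing below is taken from the talk’s code, and every helper is a hypothetical stub), it boils down to: scan the cluster, retrieve the matching guidance, and ask the model for a remediation.

```typescript
// Rough mental model only; all helpers are hypothetical stand-ins.

interface Finding { resource: string; issue: string }
interface Recommendation { finding: Finding; guidance: string[]; suggestedAction: string }

// 1. Scan the cluster (in the talk this role is played by k8sgpt over MCP).
async function runClusterScan(): Promise<Finding[]> {
  return [{ resource: "deployment/payments", issue: "container runs as root" }];
}

// 2. Retrieve relevant hardening-guide passages for a finding (the RAG step over the PDF).
async function retrieveGuidance(query: string, topK: number): Promise<string[]> {
  return [`(top ${topK} guide excerpts matching: ${query})`];
}

// 3. Ask the model to turn finding + guidance into a concrete action.
async function askModel(prompt: string): Promise<string> {
  return `(model response to: ${prompt})`;
}

// The agent loop: findings in, guideline-backed recommendations out.
export async function reviewCluster(): Promise<Recommendation[]> {
  const findings = await runClusterScan();
  const recommendations: Recommendation[] = [];
  for (const finding of findings) {
    const guidance = await retrieveGuidance(`${finding.resource}: ${finding.issue}`, 3);
    const suggestedAction = await askModel(
      [
        `Cluster finding: ${finding.issue} on ${finding.resource}`,
        `Hardening guidance:`,
        ...guidance,
        `Suggest a specific remediation that follows the guidance.`,
      ].join("\n"),
    );
    recommendations.push({ finding, guidance, suggestedAction });
  }
  return recommendations;
}
```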
There were multiple companies at the con pitching similar products to help either harden systems or troubleshoot alerts/incidents. Some were better than others. At this point, it looks like agents will be able to review your observability data and reference runbooks. A standard runbook configuration that agents can learn from will be needed at some point, unless RAG/vector search gets good enough. Each observability platform also has its own flavor of this popping up, whether from Datadog, Grafana, or other top providers.
The main takeaway I got from this was that we will be able to offer self-hosted solutions for more security-sensitive enterprises, and that these tools will let teams augment their skills and establish at least a baseline of security where that skill set is lacking internally.
Agents and New Network Levels
One idea Solo.io floated was adding another “layer” to the OSI model. This layer would act as a circuit breaker, inspecting traffic more deeply before forwarding it. At this 8th layer, we would take a look at request payloads to check their contents, at least in a cursory way. For instance, when agents or users are making calls to a model, it would help to estimate the number of tokens being forwarded. This would allow us to stop requests that exceed their currently allocated token budget and return a 429 if they are over their limit.
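A minimal sketch of what that payload-aware check might look like at a gateway, assuming a rough characters-per-token heuristic and a hypothetical per-caller budget store (none of this is Solo.io’s implementation):

```typescript
// Sketch of a payload-aware guard a proxy/gateway might run before forwarding
// a request to a model. Token estimation and budget numbers are assumptions.

interface GuardResult { allow: boolean; status: number; reason?: string }

// Very rough heuristic: ~4 characters per token for English-ish text.
function estimateTokens(body: string): number {
  return Math.ceil(body.length / 4);
}

// Hypothetical per-caller budget store; in practice this lives in the gateway.
const remainingTokens = new Map<string, number>([["team-checkout", 10_000]]);

export function checkRequest(caller: string, body: string): GuardResult {
  const estimated = estimateTokens(body);
  const remaining = remainingTokens.get(caller) ?? 0;

  if (estimated > remaining) {
    // Over budget: short-circuit instead of forwarding to the model.
    return { allow: false, status: 429, reason: "token budget exceeded" };
  }

  remainingTokens.set(caller, remaining - estimated);
  return { allow: true, status: 200 };
}
```

In practice the estimate would come from a real tokenizer and the budget from the gateway’s rate-limit state, but the shape of the decision is the same: inspect the body, compare against the caller’s budget, and short-circuit with a 429.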
The Myth of Portability
Corey Quinn gave a great talk on The Myth of Portability: Why Your Cloud Native App is Married to Your Provider. The premise was that you should stick with the devil you know (your cloud provider) instead of spreading your risk across multiple providers. You could hear the groans from people who, over the past 9 days, had been hit by both Azure and good old us-east-1 going down. Not to be defeatist, but no matter what we do, no two deployments or infra setups across different cloud providers are the same. There are either minor differences in how they set up the underlying compute/DB/caching primitives you use, or differences in the limits that must be configured in each.
On top of that, finding engineers who are deeply skilled in a single provider, let alone two or even three, is damn hard. So, at the end of the day, using one cloud provider effectively can help you improve your stability and uptime. The caveat was that this is sometimes not possible due to prior build-outs or acquisitions. In those cases, best of luck: try to converge when possible, or ensure your engineers have sufficient knowledge coverage to keep those systems alive. But do not lie to yourself and think that everything can be swapped over in an instant. This was by far my favorite talk. Once the recording is out, give it a listen.
Closing Notes
The con was a great experience. I was able to talk more in depth with coworkers and vendors in person about current needs and what’s emerging in the industry. There was definitely an emphasis on all things AI, and the odd mix of vendors slapping AI onto anything so they could stay in the relevant zeitgeist. In the halls, though, and before or after the talks, the conversations were different: not about how to shove AI into your product to get that sweet market multiple or to impress leadership, but about day-to-day problems and how teams are solving them. Whether those were challenging technical problems or more complex social/team/org problems and how to navigate them, it seemed like there was a divide between what was being presented on stage and the issues people are having day to day. There is a place for both of those streams of thought, but I look forward to seeing the confluence when they come together.
Other Notes
- Apple makes the best damn slides, with consistent minimal design and excellent transitions. Patrick Bateman would be drooling over them. I would not want to present after them. But it does feel like they are coming a bit late to the party with the new AI wave.
- Context-aware routing and networking. Rate limiting is no longer based just on the number of requests but also on the body, so we may need to inspect the body/payload for things like API/model token limits. This kind of inspection is being referred to loosely as level 8 networking.
- Principles of Building AI Agents was one of the books I found at the con. It’s about an hour’s read and one I suggest teams go through. It gives a concise, high-level overview of building agents, including code snippets. Reading it as a team will give everyone a better understanding of what is currently out there and a better feel for the current nomenclature, so that you can communicate about these systems more easily.