Unpacking the Costs and Value of Observability with Martin Mao
Martin Mao, CEO & Cofounder at Chronosphere, joins Corey on Screaming in the Cloud to discuss the trends he sees in the observability industry. Martin explains why he feels measuring observability costs isn’t nearly as important as understanding the velocity of observability costs increasing, and why he feels efficiency is something that has to be built into processes as companies scale new functionality. Corey and Martin also explore how observability can now be used by business executives to provide top line visibility and value, as opposed to just seeing observability as a necessary cost.
Martin is a technologist with a history of solving problems at the largest scale in the world and is passionate about helping enterprises use cloud native observability and open source technologies to succeed on their cloud native journey. He's now the Co-Founder & CEO of Chronosphere, a Series C startup with $255M in funding, backed by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Previously, he worked at AWS, Microsoft, and Google. He and his family are based in the Seattle area, and he enjoys playing soccer and eating meat pies in his spare time.
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Human-scale teams use Tailscale to build trusted networks. Tailscale Funnel is a great way to share a local service with your team for collaboration, testing, and experimentation. Funnel securely exposes your dev environment at a stable URL, complete with auto-provisioned TLS certificates. Use it from the command line or the new VS Code extensions. In a few keystrokes, you can securely expose a local port to the internet, right from the IDE.
I did this in a talk I gave at Tailscale Up, their first inaugural developer conference. I used it to present my slides and only revealed that that’s what I was doing at the end of it. It’s awesome, it works! Check it out!
Their free plan now includes 3 users & 100 devices. Try it at snark.cloud/tailscalescream
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This promoted guest episode is brought to us by our friends at Chronosphere. It’s been a couple of years since I got to talk to their CEO and co-founder, Martin Mao, who is kind enough to subject himself to my slings and arrows today. Martin, great to talk to you.
Martin: Great to talk to you again, Corey, and looking forward to it.
Corey: I should probably disclose that I did run into you at Monitorama a week before this recording. So, that was an awful lot of fun to just catch up and see people in person again. But one thing that they started off the conference with, in the welcome-to-the-show style of talk, was the question about benchmarking: what observability spend should be as a percentage of your infrastructure spend. And from my perspective, that really feels a lot like a question that looks like, “Well, how long should a piece of string be?” It’s always highly contextual.
Agree, disagree, or are you hopelessly compromised because you are, in fact, an observability vendor, and it should always be more than it is today?
Martin: [laugh]. I would say, definitely agree with you from an exact number perspective. I don’t think there is a magic number like 13.82% that this should be. It definitely depends on the context of how observability is used within a company, and really, ultimately, just like anything else you pay for, it really gets derived from the value you get out of it. So, I feel like if you feel like you’re getting the value out of it, it’s sort of worth the dollars that you put in.
I do see why a lot of companies and people out there are interested, because they’re trying to benchmark, trying to see, am I following best practice? So, I do think that there are probably some best practice ranges that most typical organizations we see fall into. That’s one thing I’d say. The other thing I’d say when it comes to observability costs is that one of the concerns we’ve seen talking with companies is that the proportion of that cost relative to infrastructure is rising over time. And that’s probably a bad sign for companies because, you know, if the relative cost of observability is growing faster than infrastructure, and you extrapolate that out a few years, then the direction in which this is going is bad. So, it’s probably more the velocity of growth than the absolute number that folks should be worried about.
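To make the extrapolation Martin describes concrete, here is a minimal sketch. It assumes constant annual growth rates, and every figure in it is hypothetical rather than taken from the conversation; `project_share` is an illustrative helper name, not part of any tool.

```python
def project_share(obs_cost, infra_cost, obs_growth, infra_growth, years):
    """Project observability spend as a fraction of infrastructure spend.

    Assumes both costs compound at a constant annual rate, which is a
    simplification; real spend rarely grows this smoothly.
    """
    shares = []
    for year in range(years + 1):
        obs = obs_cost * (1 + obs_growth) ** year
        infra = infra_cost * (1 + infra_growth) ** year
        shares.append(obs / infra)
    return shares


# Hypothetical inputs: $1M observability and $10M infrastructure today,
# growing 40% and 20% per year respectively.
shares = project_share(1_000_000, 10_000_000, 0.40, 0.20, 3)
for year, share in enumerate(shares):
    print(f"year {year}: observability is {share:.1%} of infrastructure spend")
```

With these made-up inputs the share starts at 10% and climbs every year, which is the warning sign described above: the starting percentage matters less than the fact that the ratio keeps rising.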
Corey: I think that that is probably a fair assessment. I get it all the time, at least in years past, where companies will say, “For every 1000 daily active users, what should it cost to service them?” And I finally snapped in one of my talks that I gave at DevOps Enterprise Summit, and said, I think it was something like $7.34.
Martin: [laugh]. Right, right.
Corey: It’s an arbitrary number that has no context on your business, regardless of whether those users are, you know, Twitter users or large banks you have partnerships with. But now you have something to cite. Does it help you? Not really. But will it get people to leave you alone and stop asking you awkward questions?
Martin: Right, right.
Corey: Also not really, but at least now you have a number.
Martin: Yeah, a hundred percent. And I’m glad our magic numbers weren’t too far away from each other. But yeah, I mean, there’s no exact number there, for sure. One pattern I’ve been seeing more recently is, like, rather than asking for the number, there’s been a lot more clarity in companies on figuring out, “Well, okay, before I even pick what the target should be, how much am I spending on this per whatever the unit of efficiency is?” Right?
And generally, that unit of efficiency, I’ve actually seen it being mapped more to the business side of things, so perhaps to the number of customers or to customer transactions and whatnot. And those things are generally perhaps easier to model out and easier to justify as opposed to purely, you know, the number of seats or the number of end-users. But I’ve seen a lot more companies at least focus on the measurement of things. And again, it’s been more about the relative change in the number rather than the absolute number, because I think a lot of these companies are trying to figure out, is my business scaling in a linear fashion or sub-linear fashion or perhaps an exponential fashion? If the costs are, you know, growing exponentially, that’s a really bad thing that you want to get ahead of.
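One way to picture the unit-of-efficiency measurement is the rough sketch below. It assumes the business unit is customer transactions and that you have cost figures for a few periods; the function names, the 5% tolerance, and the sample numbers are all hypothetical. It only tells you whether cost per unit is falling, flat, or rising; telling truly exponential growth apart from merely faster-than-linear growth would take more data and a proper curve fit.

```python
def unit_costs(periods):
    """Cost per business unit for each (units, cost) period."""
    return [cost / units for units, cost in periods]


def scaling_trend(periods, tolerance=0.05):
    """Classify how cost scales with the business across the periods.

    Compares the first and last unit cost. The tolerance (an arbitrary
    assumption) is how much relative change still counts as flat.
    """
    if len(periods) < 2:
        raise ValueError("need at least two periods to see a trend")
    costs = unit_costs(periods)
    change = (costs[-1] - costs[0]) / costs[0]
    if change > tolerance:
        return "super-linear"  # cost grows faster than the business
    if change < -tolerance:
        return "sub-linear"    # cost grows slower than the business
    return "linear"


# Hypothetical quarters: (customer transactions, observability cost in dollars)
quarters = [(1_000_000, 50_000), (2_000_000, 110_000), (4_000_000, 260_000)]
print(unit_costs(quarters))
print(scaling_trend(quarters))
```

In this made-up example the cost per transaction rises each quarter even though transaction volume is doubling, so the trend is classified as super-linear, the case worth getting ahead of.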
Corey: That I think is probably the real question people are getting at is, it seems like this number only really goes up and to the right, it’s not something that we have any real visibility into, and in many cases, it’s just the pieces of it that rise to the occasion. A common story is someone who winds up configuring a monitoring system, and they’ll be concerned about how much they’re paying that vendor, but ignore the fact that, well, it’s beating up your CloudWatch API charges all the time on this other side as well, and data egress is not free—surprise, surprise. So, it’s the direct costs, it’s the indirect costs. And the thing people never talk about, of course, is the cost of people to feed and maintain these systems.
Martin: Yeah, a hundred percent, you’re spot on. There’s the direct costs, there’s the indirect costs. Like you mentioned, in observability, network egress is a huge indirect cost. There’s the people that you mentioned that need to maintain these systems. And I think those are things that companies definitely should take into account when they think about the total cost of ownership there.
What matters even more in observability, I think, and this is perhaps a hard thing to measure as well, is the question we often ask companies: “Well, what is the cost of downtime?” Right? Like, if your business is impacted and your customers are impacted and you’re down, what is the cost of each additional minute of downtime, perhaps, right? And then the effectiveness of the tool can be evaluated against that because, you know, observability is not just any other developer tool; it’s the thing that’s giving you insight into, is my business or my product or my service operating in the way that I intend? And, you know, is my infrastructure up, for example, as well, right? So, I think there’s also the piece of, like, what is the tool really doing in terms of, like, lost revenue or brand impact? Those are often things that are quite easily overlooked as well.
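The total-cost-of-ownership and cost-of-downtime ideas from this exchange can be put side by side in a small sketch. Everything here is an assumption for illustration: the class and function names are hypothetical, the dollar figures are invented, and "minutes saved per incident" is one possible way to express tool effectiveness, not a measure named in the conversation.

```python
from dataclasses import dataclass


@dataclass
class ObservabilityCosts:
    """Annual cost buckets mentioned in the discussion (all hypothetical)."""
    vendor: float    # direct cost: what you pay the monitoring vendor
    indirect: float  # indirect cost: API charges, network egress
    people: float    # staff time spent feeding and maintaining the systems

    def total(self):
        return self.vendor + self.indirect + self.people


def downtime_cost_avoided(cost_per_minute, minutes_saved_per_incident,
                          incidents_per_year):
    """Rough annual value of resolving incidents faster."""
    return cost_per_minute * minutes_saved_per_incident * incidents_per_year


costs = ObservabilityCosts(vendor=500_000, indirect=150_000, people=350_000)
avoided = downtime_cost_avoided(cost_per_minute=5_000,
                                minutes_saved_per_incident=20,
                                incidents_per_year=12)
print(f"total cost of ownership: ${costs.total():,.0f}")
print(f"downtime cost avoided:   ${avoided:,.0f}")
```

This leaves out the brand impact Martin mentions, which is much harder to put a number on, so the comparison is at best a lower bound on the value side.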
Corey: I am curious to see whether you have noticed a shifting in the narrative lately, where, as someone who sells AWS cost optimization consulting as a service, something t