IT Pro Podcast

Podcast transcript: Keeping an eye on observability


This automatically-generated transcript is taken from the IT Pro Podcast episode ‘Keeping an eye on observability’. We apologise for any errors.

Adam Shepherd

Hi, I'm Adam Shepherd.

Jane McCallion

And I'm Jane McCallion.

Adam  

And you're listening to the IT Pro Podcast where this week we're taking a look at observability.

Jane  

Back in the early 2000s, Donald Rumsfeld popularised the concept of known knowns and unknown unknowns. Leaving aside the political context of his statement, the core idea he was referencing is a solid one. How can you solve a problem when you don't know it exists in the first place?

Adam  

This is a challenge that IT admins frequently have to grapple with as issues in one system can often introduce errors in another seemingly unrelated system. Over the past few years, the principle of observability has evolved in order to address this problem.

Jane  

Joining us today to discuss the role of observability and how it can be effectively applied to organisations' tech stacks is Greg Ouillion, CTO of New Relic. Greg, welcome to the show.

Greg Ouillion 

Good morning. And yes, happy to join guys.

Jane  

So if we could kick off by asking what exactly is observability?

Greg  

Right. So I would say observability is obviously an evolution from monitoring, and maybe we will discuss both and how they compare. Observability is the ability to instrument all of your software stack, across the full stack, from the infrastructure, the middleware, the software, and the user experience, and from there collect telemetry. All that telemetry is put in one place, unified in one platform, so that you can build an end-to-end understanding of the performance, the health and the behaviour of your software architecture. It enables you to drive performance, to drive velocity, to drive scale. So it's really a change that's been driven by the evolution of architectures as well, and I'm sure we'll elaborate on this.

Adam  

So why is it so important within DevOps organisations particularly?

Greg  

Well, I think we need to take stock that architectures have evolved immensely over the last few years. A few years ago, software was monolithic and reasonably static in terms of the underlying infrastructure, and there were only a few releases per year. So you could still call your servers Bob and Jenny, right? And in that environment, you could rely on your infrastructure monitoring, with a number of alerts telling you that something is happening and where it's happening, and you could rely on your logs to try and make sense of those alerts in your software. We now live in a world where software is much more fragmented, with microservices architectures developed by multiple teams independently. Many organisations roll out software hundreds of times per day, or even thousands of times per day. And they deploy on highly volatile architectures like containers, cloud, Kubernetes clusters, where some of that infrastructure can be very short-lived, literally minutes. So in a context where your architecture is much more fragmented, much more complex and much more volatile, you cannot rely on monitoring as it was before. You have to get that holistic understanding of everything in your architecture, so that you can make sense of what's going on. In terms of behaviour, you can understand how your stack behaves, and if an issue starts to develop, you can understand why.

Adam  

So what are the main ways in which observability differs from what we might call traditional infrastructure monitoring?

Greg  

Yeah, so monitoring was developed by IT professionals who tended to be very horizontally specialised: you had your infrastructure guy, your storage guy, you had the DB admin, you had the head of network, and they tended to use their own specific monitoring tools, so that they got good visibility on their domain. But this created data silos and team silos. This has often developed as a means to tell the other teams that I am not the problem: I'm the only one who gets my stack, I am using my monitoring tool, and I'm telling you everything is fine in my patch. Does that improve end-to-end performance overall? Well, very often it doesn't. So this idea that you bring telemetry from the entire stack into one place and then make it available and visible to everybody completely changes the spirit of collaboration. It allows you to build a body of knowledge of your stack. It makes developers much more open and sensitive to what target architecture they deploy on; they understand their infrastructure. And the ops guys, or the infra guys, start to understand code; they start to sniff out Java errors or poorly coded web pages. So it really fosters collaboration. And it shortens time to understand, time to attribute issues, and therefore time to resolve. That's a massive difference.

Jane  

So the evolution, then, from traditional monitoring to more of an idea of observability: was that driven by the move from kind of traditional IT towards, like you said, these more volatile environments, things like cloud and containers? Or was it more of, I guess, a business impetus, where you just needed to get these teams working together and get greater visibility, and conveniently, that works really well in a cloud-first world?

Greg  

I think it came first from the realisation that the first generation of monitoring was really about your own infrastructure: you monitor your CPU, your RAM, your disk usage, and your network. And then APM arrived, application performance monitoring, which allowed you, without coding, to get a sense of how your application behaves; that was a game changer. And you still have, I would say, two camps in the IT world. There are the teams who have discovered APM and have changed to managing things in an application-centric, user-centric way: I see an issue in my app, and then I will drill down to an underlying issue if necessary. And you still have many, many teams in the world who leverage IT infrastructure logs to check the health of their systems, and this is what they use to infer that they might have an application issue. So observability, I think, is now blending and merging these two worlds, where if you can have your logs, your infrastructure, your application, your browser performance all together, then you get this application-centric and user-centric understanding of what's going on. So I still feel it's been driven by the evolution of technology: how to instrument and collect telemetry easily, at scale, in very large volumes, and cost-effectively. But where you are right is that now we can see more and more business and product teams using observability to drive the performance of their business in real time.

Adam  

So you mentioned some of the elements around things like cost and around, you know, breaking down the silos when it comes to monitoring different parts of the stack. What are the main barriers that IT teams face when it comes to observability?

Greg  

I think one of the barriers is just time: it takes time and effort to instrument your stack with monitoring. A lot of teams have spent many years taming some of these tools, integrating them with the components of their stack; they got used to them, they loved them. And so there's a cultural change, which is to admit that at some stage you're going to lose that exclusivity on your domain, and you're going to get that visibility across. You need also to understand the level of effort required to transition from your existing monitoring tools to instrumenting, either with agents from your vendor or with OpenTelemetry. So there is a transition, and some of those teams are usually afraid of that transition. That's why doing a proof of concept, a pilot project, allows them to get a much better sense of how easy it is, and also how, once they have done it once, they can automate that deployment. So it's often much easier than what they would assume. And then there is usually an uncomfortable moment, right? If all of a sudden all your mistakes, all your stack errors, your technical debt starts to show to everybody, then you have to go through that moment where everybody accepts that this is where we are. This is the kind of baseline moment. And then from there, you build again.

Jane  

It's like being sort of professionally naked in front of your colleagues, like here I am, baring my soul, these are all my errors, warts and all.

Greg  

Absolutely, but I had that moment in my previous company; I was a customer of New Relic before I joined New Relic. The first day I joined that company, I was in front of our largest customer, and I essentially got flamed for two hours, with threats of leaving and cancelling the contract, because their performance was horrendous. So my agenda was pretty much set for the following months, right? I knew what I was going to do. And I discovered many teams working very hard, heavily frustrated, because they did not know how to improve the situation: a monolith, monitoring everywhere, at least 12 different tools. But they did not manage to improve things. We could only release a few times a year, and we had 100% rollback; no deployment was okay. We had massive technical debt. We introduced New Relic by chance; one of our sysadmins said, why don't we try something else? And a few minutes after the observability platform was deployed, people started to gather behind the screen, because they were seeing things that they had never seen before: this kind of end-to-end view of the application, the underlying infrastructure, the middleware, the behaviour in the browser. And we uncovered issues that we had had for years, literally. All that technical debt surfaced, and there was that moment of nakedness, standing on the table, because some of those errors were blatant. Right? And they were fixed within days. So yes, there is that moment, and most companies I'm working with today have had that moment, this aha moment.

Jane  

Yeah, well, presumably it must show up if the problem is, say, the application not working nicely with the middleware: some kind of conflict there that you wouldn't have known about were it not for the fact that you're now looking at everything all at once, rather than each of these things individually, which appear to be working fine.

Greg  

Yes, absolutely. And this is the evolution from reactive incident management to proactive performance management. Most organisations are still built on the idea that some incident is going to develop, we will understand it, we will resolve it, and then we will calculate our SLA impact and our MTTR. Once you start to be able to understand and visualise the behaviour of all the components in your stack, then you start to build an understanding of what is normal. You create a baseline of what is normal, and then you can start to detect very easily any deviation from the normal, either by yourselves, with alerts and visualisations and dashboards, or with artificial intelligence and ML helping you in detecting those issues. So in a matter of weeks, most teams transition from dealing with incidents to proactively managing the performance of their stack, and it makes a huge difference in terms of the number of incidents, especially incidents that actually impact customers.
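The shift described here, from reacting to incidents to baselining what is normal and alerting on deviation, can be sketched in a few lines. This is an illustrative toy (a simple standard-deviation check over invented latency numbers), not how New Relic's platform actually implements anomaly detection:

```python
from statistics import mean, stdev

def deviates_from_baseline(history, sample, threshold=3.0):
    """Flag a telemetry sample that sits more than `threshold` standard
    deviations away from the baseline built from historical data."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return sample != mu
    return abs(sample - mu) / sigma > threshold

# A week of "normal" response times (ms) for one service (invented data)
baseline = [25, 27, 24, 26, 25, 28, 26, 25, 27, 26]

print(deviates_from_baseline(baseline, 26))   # within normal variation
print(deviates_from_baseline(baseline, 120))  # a clear deviation worth alerting on
```

In practice the baseline would be recomputed continuously from live telemetry and the threshold tuned per signal, which is where the ML-assisted detection mentioned above comes in.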

Adam  

It's the idea of, if it ain't broke, don't fix it, right? If there's not an obvious problem that I can see, that is having an impact on things like page loads or error rates or bounce rates, then there's no problem, and I can just carry on until something actually breaks. Versus actually looking at things like logs and data readouts and saying, okay, it's not broken yet, but there is something here we can tweak and optimise and make better within our stack.

Greg  

Absolutely. And also, as we mentioned, the unknown unknowns and ripple effects. Sometimes, you know, you've got one of these weird back-end microservices that usually responds in 25 milliseconds at the 90th percentile. And if you did not know that this is in fact a super-critical microservice that's being called 100 times in any transaction, then when the latency starts to build up, you wouldn't care, right, until everything just collapses. If you can have a sense that this microservice is critical and its behaviour is starting to change, then you can fix it before it collapses the system. So absolutely.
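The 25-millisecond, 90th-percentile example can be made concrete. The sketch below uses a nearest-rank percentile over invented latency samples to show how a drift in p90 on a critical service might be caught before it cascades; it is illustrative only:

```python
def p90(samples):
    """Nearest-rank 90th percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(0.9 * len(ordered)))
    return ordered[rank - 1]

# Invented latencies (ms) for a back-end microservice called ~100 times per transaction
last_week = [23, 24, 25, 25, 26, 24, 25, 27, 25, 24]
this_hour = [24, 31, 45, 52, 25, 60, 48, 26, 55, 70]

if p90(this_hour) > 2 * p90(last_week):
    print("p90 latency on a critical service has more than doubled - investigate now")
```

A real platform would compute this continuously over streaming data; the point is that the signal exists well before the collapse.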

Jane  

So Greg, when we're thinking about the move from a kind of monitoring only strategy or mindset and going towards something that's more observability based, are there any sort of prerequisites in terms of tooling that IT teams need to take into account before they embark on this journey?

Greg  

Not really, I would say it's fairly straightforward. First, you will typically start observability with one part of your stack in mind, one where you have a gap, right? Many companies have started with infrastructure and logs; they don't do APM, they don't do browser or user experience monitoring. So maybe they say, okay, observability will bring that additional capability. There, they will essentially deploy agents, or they will leverage OpenTelemetry or Prometheus, that kind of open source instrumentation of the components of their stack. And they will usually start to deploy on a finite perimeter: one app, one part of their stack. They will learn the practice of interpreting the data, building some dashboards, building a few alerts, learning the interface, and learning how all the telemetry they gather gets grouped into what we call entities. An entity is any component of the architecture that reports telemetry; as simple as that. They will start to understand how all the entities relate to each other in the application. And from there they can start to expand: they will either start to also instrument their infrastructure and their middleware, in the cloud, cloud PaaS or on-prem middleware, or they will start to instrument more systems. Then at some stage they will automate. They will say, okay, I've got my CI/CD pipeline, and now that I understand how to deploy an agent, I can fully automate that, so that on the next release, the next deployment, the software ships and gets activated already instrumented; it will start reporting as soon as it's up and running. So I would say it's a fairly gradual journey. There is no big day-zero thing where you have to invest nine months before you can get started. And it is very straightforward, I think, for anybody who's a sysadmin, an SRE, a developer; these are concepts that you get pretty quickly.
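The entity idea described above, any component that reports telemetry, can be sketched with plain data structures. The event shapes and names below are invented for illustration and are not New Relic's actual data model:

```python
from collections import defaultdict

# A mixed stream of telemetry, as it might arrive from agents or OpenTelemetry
telemetry_stream = [
    {"entity": "checkout-api", "kind": "service",  "metric": "latency_ms", "value": 42},
    {"entity": "checkout-api", "kind": "service",  "metric": "latency_ms", "value": 38},
    {"entity": "host-7",       "kind": "host",     "metric": "cpu_pct",    "value": 91},
    {"entity": "orders-db",    "kind": "database", "metric": "queries_s",  "value": 310},
]

# Group the stream by the component that reported it: each key is an entity
entities = defaultdict(list)
for point in telemetry_stream:
    entities[(point["kind"], point["entity"])].append(point)

for (kind, name), points in sorted(entities.items()):
    print(f"{kind} {name}: {len(points)} data point(s)")
```

Once telemetry is grouped this way, relationships between entities (which service runs on which host, which host backs which database) are what let the platform map the architecture end to end.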

Jane  

Yeah, because I was going to ask, is this kind of incremental approach better than a big bang? But it sounds like the answer is yes, or indeed that nobody does big bang; everybody is too flexible.

Greg  

Well, sometimes you are in a big bang situation, when companies have already acquired, I would say, the observability muscle; they have athletes in their teams who are already observability-ready, and who want to go into a big tool consolidation, for instance. They say, okay, we've had that logs data lake in an Elastic somewhere, and we know that it's becoming ineffective in terms of performance and is costing us a lot in infrastructure and in people maintaining it. So the time has come to start migrating our logs. Some will do it in a gradual way, and some will want to do that blitz night where the ELK stack is gone, you know. So it's a matter of how you want to drive your projects as well. But most important is, if you want to start small, if you want to go incremental, it is absolutely possible.

Adam  

So you mentioned tool consolidation there. And of course, within the traditional monitoring sphere, there are, as you previously mentioned, a lot of different tools and utilities that are used within areas like storage, networking, infrastructure, for very specific tasks. Can a focus on observability help cut down tooling costs by consolidating some of these elements and allowing companies to, you know, cancel subscriptions and get rid of some of the more redundant tools?

Greg  

Yeah, I think there again it's all going to be up to how the CIO or CTO wants to drive this, but I think you always need to be balanced. If you start to consolidate into the one big platform, a number of people will get frustrated; some of their usual ways of working and tools will be gone. And also, some of those tools might be specialised enough that they really bring value for very specific tasks. So it's not a case of let's remove every tool on the planet and just have everybody logged into an observability platform. It is going to be: what is the critical data that I absolutely want to have in a single pane of glass, that I want to be able to correlate, so that I understand that this log entry relates to that transaction that runs on that host, and therefore I can understand what's going on? Then, if some guy really needs a deep probe on a Nagios instance to monitor some of their network components in a specific way, why remove that? Maybe the cost of consolidating everything would be higher than letting some of those components stay and just forward data to the observability platform. So I think the big decision is: we want to make sure that all the telemetry flows into one place, so that everybody gets access to this end-to-end view. Then, what are the tools which are creating cost, or too much fragmentation of the culture and the way of working, that should be eliminated? And maybe then we select those few tools that will stay as specialist tools, potentially forever.

Jane  

So obviously everybody's sort of interested in DevOps and the cloud and all this kind of thing, what we spoke about towards the first part of this episode. But the reality is that most large businesses have at least some on-premises infrastructure. So is there a difference between how observability should be approached for these two different kinds of environments?

Greg  

So, yes and no. Obviously, with the larger cloud providers, everything that touches the cloud platform as a service, the PaaS components, essentially all of that telemetry will be collected by the cloud provider itself. And then that telemetry will get pushed to the observability platform through APIs or a firehose of that telemetry. So somehow, all the bottom layer, infrastructure and middleware, you don't have to instrument yourself, because it's instrumented by default by the cloud providers. But if you want to get visibility into your application and software, you still need to deploy some agent or some OpenTelemetry to instrument the software in the container or the virtual machine. So you still have some instrumentation to do, but not at the bottom infrastructure and middleware layer. On premise, well, you're going to have to deploy infrastructure agents, and you're going to have to deploy middleware integrations, so that you can capture the actual stack that you're operating on. But again, that can be completely automated after a while, so it's not so much effort. So it's very similar, but when you go to the cloud, you get kind of pre-baked instrumentation for both. Also, as you mentioned, a lot of companies deploy to the cloud, and I would say the majority of companies today do have some level of deployment to the cloud, and a lot will stay hybrid. Companies want to have an observability capability that is independent from where they deploy. So if you deploy your code in containers, whether those containers go to Azure, GCP or AWS, or they go to your on-prem, you want to maintain a certain level of independence from your underlying infrastructure. And this is where an observability platform can be very helpful. Let's take the case where you are migrating from on premise to the cloud.
If you already have the ability to baseline the behaviour of your application on premise, and if you can make use of the same observability platform as you migrate to the cloud, then you understand whether you are actually improving or degrading your SLOs and your SLAs in terms of reliability and performance. So it's really a very powerful tool to de-risk some of your migrations, and also to make that baselining of the performance of your apps completely independent from which cloud or infrastructure you run on.
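The migration de-risking outlined here amounts to computing the same SLO on both sides of the move and comparing. A minimal sketch, with an invented 300 ms latency SLO and made-up samples:

```python
def slo_attainment(latencies_ms, slo_ms=300):
    """Share of requests that met the latency SLO."""
    return sum(1 for l in latencies_ms if l <= slo_ms) / len(latencies_ms)

# Invented request latencies (ms) before and after a trial migration
on_prem_baseline = [120, 180, 250, 310, 140, 200, 290, 260]
cloud_canary     = [100, 150, 420, 130, 510, 160, 140, 480]

print(f"on-prem: {slo_attainment(on_prem_baseline):.1%} of requests within SLO")
print(f"cloud:   {slo_attainment(cloud_canary):.1%} of requests within SLO")
# A drop after migration is a signal to investigate before cutting over fully.
```

Because the same measurement runs on both environments, the comparison is independent of where the workload happens to be deployed, which is the point being made above.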

Adam  

So Greg, for organisations who are maybe using more traditional monitoring, and metrics, how can these companies get started with improving the observability of their IT?

Greg  

So we love to think of observability as helping in four key areas. One is whether you want to increase your speed and agility, the velocity: I want to deploy more frequently and faster. It can be customer experience: I want to improve the end-to-end customer experience. It can be the foundational reliability, uptime and performance: I realise that my performance is not where it should be. Or it can be about scale: I would like to scale my architecture and my teams, and I want to be more cost-effective and effective overall. So we like to first start with: what are you trying to solve as a problem? You don't go to observability just because it's the latest keyword that you should have, right? So where is it going to make the most difference? And how are you going to drive your teams to really hone in? What are those indicators that you want to capture? What are you trying to improve? Because it's a good change agent if you make a big difference in the first few days, because you're already solving an issue, and then everybody feels good about it. Very often, for less mature or more traditional organisations, it starts with reliability, uptime and performance. So how can I detect incidents much faster? How can I understand them faster? And ultimately, how can I resolve them faster? Traditionally, again, start with an environment which has issues, which is painful, and which is in production. Don't start with some staging environment, because it will tell you nothing; use it where you have pain, and show the difference it makes in terms of visibility, understanding and speed. And usually, as I say, this is a game changer for teams. Once they have seen that, once they have realised what it provides in terms of understanding, they will not go back, and they will start to deploy more.
Uptime, reliability and performance will have a very positive effect, because not only will you start to fix your incidents faster, but you will see your technical debt, you will see what is causing those issues. So you will start to fix that technical debt, and your number of incidents is going to start to drop. All of a sudden you have fewer incidents, and you spend much less time fixing them. So it's breathing time, it's productive time for everybody: I can start to make my architecture more resilient, I can start to develop more innovation. And the morale in the teams goes up, because they stop fighting each other; they work together to solve issues. So usually the impact of this first phase is huge, just in terms of productivity and collaboration, and actual results in terms of SLOs, SLAs, mean time to repair, and number of incidents. Then you scale up. Once you've done that, you've created alerts and dashboards that give you a sense of how your stack behaves, and you start to ask those questions about unknowns: I see that the latency of that transaction is starting to go up; could it be related to the change in infrastructure I made last week? So let's look at the underlying infrastructure. Oh yes, it's interesting, it's consuming more CPU. And so you can interrogate all the telemetry that you have, and you can ask those unknown questions that you had not predetermined, and they allow you to put a diagnosis on anything in your stack. Very important. So you move to proactive performance.

Adam  

So Greg, what kind of advantages can observability have for businesses outside of the realm of IT and technology?

Greg  

Observability, as we have seen it evolve over the last few years, is really a business conversation as well. And progressively, as you have those dashboards, we see more and more product teams and business teams who realise the power of real time, because a key characteristic of observability is that it gives you real-time visibility, not just on your stack, but also on the behaviour of your software. And if you have a digital business, the behaviour of your software is the behaviour of your business. What is the latency, the abandonment rate? What are the demographics of some transactions? What are the customer journeys that customers are going through? Where do they stop their journeys? What can I follow in real time, in terms of the performance of my business? And it's very easy to add business attributes to an observability platform. Let me give you an example. You run a very large coffee chain worldwide with tens of thousands of points of sale. You can integrate observability into all the points of sale, so you can follow the behaviour of a point of sale in your shops, infrastructure and software, but you can also very easily capture the average amount of the basket, the revenue and sales, the type of loyalty card of your customers. And all of a sudden, on the same dashboard, you're not only following issues around CPU, RAM or transaction throughput, you're starting to follow the health of your business. And you can attribute any degradation in sales to your IT stack degrading, or maybe it's something out there in the street: there are some works preventing customers from reaching the shop. And you can make those decisions. If you're an airline, you can actually track whether your bookings and ticket sales are tracking compared to forecast in real time, just by capturing that in your transactions in the observability platform.
So what we're seeing is that the next generation of companies using observability are using it to manage their business in real time. It will be complementary to their BI and analytics. The BI analytics cube is very useful for making deep-dive, complex queries, but it doesn't give you that real-time sense of your business, which can be crucial when it's Black Friday, when you're running a large marketing campaign, when you are in your busiest day of the year. You want to know what's going on exactly right now, and a lot of business teams are using it now to do just that.
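Attaching business attributes to transaction telemetry, as in the point-of-sale example above, can be sketched as follows. Attribute names like `basket_value` and `loyalty_tier`, and all the data, are invented for illustration:

```python
# Transactions carrying both technical and business attributes (invented data)
transactions = [
    {"shop": "london-01", "latency_ms": 180, "basket_value": 7.40, "loyalty_tier": "gold"},
    {"shop": "london-01", "latency_ms": 210, "basket_value": 3.10, "loyalty_tier": None},
    {"shop": "paris-04",  "latency_ms": 950, "basket_value": 4.80, "loyalty_tier": "silver"},
]

# One rollup feeds both views: revenue for the business, latency for IT
by_shop = {}
for t in transactions:
    s = by_shop.setdefault(t["shop"], {"revenue": 0.0, "worst_latency_ms": 0})
    s["revenue"] += t["basket_value"]
    s["worst_latency_ms"] = max(s["worst_latency_ms"], t["latency_ms"])

for shop, s in sorted(by_shop.items()):
    flag = "  <- degrading, may explain a sales dip" if s["worst_latency_ms"] > 500 else ""
    print(f"{shop}: revenue {s['revenue']:.2f}, worst latency {s['worst_latency_ms']} ms{flag}")
```

Because the business figures ride on the same events as the technical ones, a sales dip can be attributed (or not) to a degrading stack on a single dashboard, which is the point being made in the interview.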

Jane  

Well, unfortunately, that's all we have time for this week. But thank you once again, Greg, for joining us.

Greg  

Thank you very much for having me.

Adam  

You can find links to all of the topics we've spoken about today in the show notes, and even more on our website www.itpro.co.uk.

Jane  

You can also follow us on Twitter, where we are @ITPro, as well as Facebook, LinkedIn, and YouTube.

Adam  

Don't forget to subscribe to the IT Pro Podcast wherever you find podcasts to never miss an episode. And if you're enjoying the show, leave us a rating and a review. We'll be back next week with more analysis from the world of IT. Until then, goodbye. 

Jane  

Bye.
