How Photobox unified observability to slash incident response times

(Image credit: Photobox)

Legacy systems and fragmented tooling can create major visibility challenges for IT teams. Without a unified view of workflows and metrics, it becomes difficult to quickly triage and remediate incidents or optimize applications.

Photobox found itself in this position after years of accumulating technical debt across its e-commerce platform. The photo printing service relied on a mix of custom and off-the-shelf software tied together with complex integrations; a monolithic Perl application handled order processing, while Kubernetes and serverless architecture powered newer microservices and front ends.

Observability data was siloed across various monitoring tools that lacked integration. Site reliability engineers routinely spent three or four hours correlating logs and metrics to track down issues.

Institutional knowledge bottlenecks meant only one or two people could effectively troubleshoot problems as they arose. In short, change was imperative.

What Photobox was looking for in a new system

Photobox knew its fragmented monitoring tools were no longer fit for purpose. As Alex Hibbitt, engineering director, explains: “It was taking us three to four hours per incident just to track down where a problem lay in our stack, rather than actually start taking action.”

The goals were clear: reduce incident response times, democratize access to observability data, and optimize applications. But with a mix of legacy and cloud-native systems, finding a unified solution was easier said than done.

In conversation with
Alex Hibbitt

Hibbitt has been a part of the Photobox family since 2019, joining as a principal site reliability engineer, and was promoted to engineering director in January 2023. He’s a technologist with a proven history of creating high-performing international teams.

"We needed somebody who could instrument the Perl application, and finding a solution that can work well with Perl was challenging," says Hibbitt, adding that seamless Kubernetes and serverless support were also must-haves.

The evaluation criteria focused on three core capabilities: observability for legacy Perl workloads, Kubernetes, and serverless. The ideal solution would provide unified visibility without requiring extensive custom instrumentation. After assessing its options, Dynatrace emerged as the top contender, but Photobox needed proof it could handle its heterogeneous environment. So it launched an ambitious six-month pilot.

The importance of planning and partnerships

Photobox undertook an extensive pilot to evaluate Dynatrace's capabilities across its complex ecosystem. The goal was ambitious: instrument the entire customer journey from initial web request to backend database call.

"We effectively set out to try and instrument the entire end-to-end customer journey across all of our different technologies and across all of our different platforms," Hibbitt explains, adding the Dynatrace implementation delivered quick wins with Kubernetes and serverless workloads. "Within the first month of our six-month period, we had pretty much all of our Kubernetes workloads instrumented and all of our serverless infrastructure implemented.”

But legacy integrations took more work. The core Perl monolith required custom bindings to enable instrumentation. As Hibbitt notes: "The other stream was the traditional Perl monolith, Babel, as we call it. That took a bit longer because we had to work with Dynatrace's implementation teams and actually build out some new bindings between Perl and C++."
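The general pattern behind such bindings is a foreign function interface: the dynamic language declares the native function's signature so arguments and return values are marshalled correctly across the boundary. As an analogy only (in Python rather than Perl, and calling the system C library rather than Photobox's code), the idea can be sketched with `ctypes`:

```python
import ctypes
import ctypes.util

# Load the system C library (find_library resolves the platform-specific path).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the foreign function's signature so the runtime marshals
# arguments and return values correctly -- the same concern any
# Perl-to-C++ binding has to solve.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"observability"))  # -> 13
```

Perl's equivalent mechanisms (XS or FFI modules) follow the same declare-then-call shape, which is why the work is tractable but still labor-intensive for a large monolith.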

While much of the work was technically straightforward, the pilot underscored the importance of planning and partnerships. Photobox worked closely with Dynatrace engineers to handle the intricacies of its hybrid environment.

The six-month timeline allowed sufficient room for customization beyond out-of-the-box capabilities. By the pilot's end, Photobox had achieved comprehensive observability spanning its customer journey, the proof point it needed to proceed with Dynatrace.

What unification gave Photobox

The six-month investment in Dynatrace delivered measurable improvements across Photobox's environment, and the unified monitoring data enabled faster troubleshooting and remediation.

“We saw a massive decrease in that three to four hours' time that I mentioned to initially detect the problem. We got into the engagement with the hypothesis that we would be able to decrease the time. We were shocked by how much we actually managed to decrease our average mean-time-to-resolution (MTTR) compared with pre-Dynatrace days. It was something like 70 to 80%.”
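In concrete terms, cutting a three-to-four-hour MTTR by 70 to 80% puts detection-to-resolution in roughly the 36-to-72-minute range (illustrative arithmetic on the figures quoted above, not Photobox telemetry):

```python
# Rough arithmetic behind the quoted 70-80% MTTR reduction.
baseline_hours = (3.0, 4.0)   # pre-Dynatrace MTTR range
reduction = (0.70, 0.80)      # quoted improvement range

best_case_minutes = baseline_hours[0] * (1 - reduction[1]) * 60   # 3h cut by 80%
worst_case_minutes = baseline_hours[1] * (1 - reduction[0]) * 60  # 4h cut by 70%

print(f"{best_case_minutes:.0f} to {worst_case_minutes:.0f} minutes")  # 36 to 72 minutes
```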

With full-stack visibility, engineers could quickly pinpoint and resolve issues like memory leaks that had persisted undetected for years. Dynatrace also enabled proactive application optimization based on precise performance metrics.

"By having Dynatrace deployed into that front end, not only were we able to see the memory leak, it was declaring exactly what chunk of code was triggering it, how many customers were being impacted by it, and what their experience was like," Hibbitt says.

For Photobox, the outcomes surpassed expectations. Dynatrace delivered the unified observability needed to streamline operations across its hybrid landscape of legacy and cloud-native applications. The improvements also came without inflating costs. Consolidating monitoring tools delivered a cost-neutral implementation after factoring in licensing.

Ensuring a successful monitoring transformation

Photobox's Dynatrace journey yielded important insights for any organization pursuing improved observability. "Don't take these things lightly. The Dynatrace implementation, all things considered, was a relative breeze, but it was still a huge piece of work," Hibbitt advises.

Digital transformations like this have an expansive scope spanning technology, processes, and culture, and Photobox ensured success by assembling diverse, cross-functional teams and planning for organizational change management. "Make sure you build out your teams effectively. Make sure they're not siloed; make sure that they're cross-discipline and multifunctional," says Hibbitt.

The technical implementation also required moving beyond out-of-the-box capabilities. Emitting custom metrics enabled more tailored observability in Photobox's environment.

"There was a lot of that. But there was also a need for us to supplement that with our own choice of low-level emissions of data," notes Hibbitt.
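The idea of emitting custom metrics is that application code pushes business-level signals an agent cannot infer on its own. As a hypothetical sketch of the general technique (Photobox's actual metrics go through Dynatrace's ingestion APIs, not this class), here is a minimal StatsD-style emitter over UDP:

```python
import socket

class MetricEmitter:
    """Minimal StatsD-style emitter for custom application metrics.

    Illustrative only: metric names and the local endpoint are assumptions,
    not Photobox's actual instrumentation.
    """

    def __init__(self, host: str = "localhost", port: int = 8125) -> None:
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _format(self, name: str, value: float, kind: str) -> bytes:
        # StatsD wire format: <metric>:<value>|<type>
        return f"{name}:{value}|{kind}".encode()

    def count(self, name: str, value: int = 1) -> None:
        self.sock.sendto(self._format(name, value, "c"), self.addr)

    def timing_ms(self, name: str, value: float) -> None:
        self.sock.sendto(self._format(name, value, "ms"), self.addr)

# Business-level signals a platform agent can't see by itself:
emitter = MetricEmitter()
emitter.count("orders.submitted")
emitter.timing_ms("print_render.duration", 412.0)
```

Because the packets are fire-and-forget UDP, emitting metrics adds negligible latency to the hot path, which is one reason this pattern is common for supplementing agent-collected data.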

Finally, the power of modern platforms like Dynatrace's AIOps requires new technical skills, and Photobox invested time in training its staff to leverage advanced analytics. "We had never considered what impact having such a complicated ecosystem would have on how people could diagnose issues and on the reliance that we would need on an AIOps-powered tool," Hibbitt explains.

For those eyeing similar journeys, Photobox's experience highlights the importance of cross-team collaboration, customization, and skill building to realize the full value of new observability platforms.

Photobox's situation reflects the challenges many enterprise IT organizations face from fragmented visibility and monitoring sprawl. Mergers, technical debt, and legacy systems often create complex environments that lack unified observability.

By implementing Dynatrace across its hybrid landscape, Photobox was able to streamline operations and deliver major improvements. Incident response times were slashed by 70 to 80%, while long-standing performance issues were finally resolved. Applications also became more reliable without costs rising.

Achieving these outcomes required thoughtful planning and execution across technology, processes, and culture. Photobox's six-month validation pilot ensured Dynatrace could handle end-to-end monitoring before full adoption. And the proof point gave the firm confidence to proceed.

Rene Millman

Rene Millman is a freelance writer and broadcaster who covers cybersecurity, AI, IoT, and the cloud. He also works as a contributing analyst at GigaOm and has previously worked as an analyst for Gartner covering the infrastructure market. He has made numerous television appearances to give his views and expertise on technology trends and companies that affect and shape our lives. You can follow Rene Millman on Twitter.