How can businesses handle data sprawl?

Poor data observability opens the door to security vulnerabilities


Data sprawl is the uncontrolled – and often undocumented – spread of data across an organization’s systems, platforms, locations and formats. This includes on-premises infrastructure, user devices, SaaS applications and cloud providers.

But we’re probably doing data an injustice describing the issue as such, says Tim Nelms, senior director analyst at Gartner, as the principal challenge is with content sprawl. “Content – unstructured data – represents 80% of corporate information,” he points out.

Whether framed as content or data, this sprawl most commonly arises when organizations rapidly adopt cloud services that enable different teams or departments to generate, store and use data independently, without an effective process for filtering out duplicate and unneeded data.

“In an effort to be agile and make use of modern technologies as fast as possible, this often happens without central oversight or data governance,” explains Tom van Aardt, chief data officer (CDO) advisor at consultancy firm BML. “The result is a fragmented data landscape that increases risks related to security, compliance and inefficiency, and makes it difficult to discover, trust or leverage data effectively.”

This is in addition to a steady buildup of data in recent years due to the widespread adoption of multiple cloud platforms, decentralized procurement of tools, shadow IT and the rapid growth of unstructured data across collaboration platforms. Throw into the mix the race to use AI and advanced analytics and you’re left with duplicated, fragmented and poorly catalogued data, he notes.

The pandemic must take some of the responsibility for the exponential growth in data, as this was a key reason for the rapid roll-out of digital workplace technologies that in some cases overlooked recommended governance and lifecycle controls in favor of speedy implementation.

“Microsoft 365 alone has gone from 13 million active monthly users to over 330 million today. Most of those deployments happened quickly,” Nelms notes.

The more data the better, no?

In an era when data is one of the world’s most valuable commodities, it’s fair to ask: what’s the harm in having so much of it? Well, the real issue isn’t the volume of data, but the lack of visibility and observability around it.

This opens up the organization to security vulnerabilities, as potentially sensitive data could be sitting in places it shouldn’t. “This could be on end-user devices, in a cloud storage bucket that has misconfigured security, in poorly protected transactional systems,” says Brent Ellis, principal analyst at Forrester. “While traditional ransomware encryption is still a threat, exfiltration and the movement of data outside of the perimeter that a business can control means it is threatened by risks to compliance, IP theft or loss of customer confidence.”

While businesses can use programs specifically designed to detect data sprawl, it’s most common for them to discover they have an issue when a security incident occurs. According to security firm Rubrik, enterprises are facing a cloud security crisis fueled by data sprawl, with consequences of breaches including reputational damage, compliance-related fines and forced leadership changes.

It can also significantly hamper incident response and recovery when the worst occurs, by obscuring where sensitive and critical data resides, delaying breach detection, complicating forensic investigations and slowing containment efforts, notes van Aardt.

Data sprawl just magnifies the overall burden of recovery, Ellis agrees, adding that more generally, in terms of operations, data sprawl means longer back-ups and higher storage costs – essentially eating up time and money that could be better used elsewhere.

How can businesses handle data sprawl?

In today’s complex hybrid and multi-cloud environments, full control over all data is rarely achievable. It’s more realistic, therefore, to focus on effective mitigation rather than complete elimination of data sprawl.

“One might have a goal of complete control, but companies must be agile and the environments where data must be used will change over time. The big thing to establish is a culture that respects data and understands that the business has to manage it responsibly,” Ellis advises.

One area of focus should be improving the observability of your data, which can be achieved through a mix of tools and practices. Start by implementing a data strategy, setting up data governance and then deploying tools to support both, advises van Aardt.

“Following on from having a solid strategy and governance framework, businesses can then improve observability of data across hybrid or multi-cloud ecosystems by implementing centralized data inventory tools, standardizing metadata and tagging practices and using cloud native monitoring and logging services integrated into a unified data catalogue or governance platform.”
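In practice, a tagging standard is only useful if it is enforced. As a rough illustration of what that enforcement could look like – the tag names and resource identifiers here are hypothetical, not drawn from any particular platform or vendor – a minimal audit that flags resources missing required metadata might be sketched as:

```python
# Illustrative sketch: audit a resource inventory against a required tag set.
# Tag names are hypothetical examples of a metadata standard, not a real schema.
REQUIRED_TAGS = {"owner", "data_classification", "retention"}


def missing_tags(resources: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Report, per resource, which required tags are absent or left empty."""
    report = {}
    for name, tags in resources.items():
        gaps = {tag for tag in REQUIRED_TAGS if not tags.get(tag)}
        if gaps:
            report[name] = gaps
    return report


# Example inventory: one fully tagged resource, one with gaps.
inventory = {
    "s3://finance-exports": {
        "owner": "finance",
        "data_classification": "confidential",
        "retention": "7y",
    },
    "share/tmp": {"owner": ""},
}
```

Real deployments would pull the inventory from a governance platform or cloud APIs rather than a hand-built dictionary, but the principle – a single required-tag policy checked uniformly across every location data lives – is the same.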

Building standardized ways of deploying infrastructure across cloud and on-prem is an important step, so that you can use consistent tooling to understand where data is and what policies are applied to it, continues Ellis. “Exceptions occur, but those should be controlled, with clear practices for making sure that data in a new SaaS vendor, for instance, is kept in compliance with internal data controls. This may mean negotiating with vendors to get data visibility that’s not currently available as part of the service.”

“Enhancing observability also requires automated data discovery, classification and lineage tracking alongside clear ownership models and policy enforcement to ensure visibility, compliance and control over all data assets – regardless of where they reside,” adds van Aardt.
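To make the idea of automated discovery and classification concrete, here is a deliberately simplified sketch. The patterns and labels are illustrative only – production tools use far more sophisticated detection than two regular expressions – but it shows the basic shape: scan each data location, tag it with the kinds of sensitive content it appears to hold, and roll the results into an inventory.

```python
import re
from dataclasses import dataclass

# Hypothetical classification patterns. Real discovery tools combine many
# detectors (regex, dictionaries, ML models); these two are purely illustrative.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


@dataclass
class Finding:
    location: str  # where the data lives (path, bucket, share...)
    label: str     # which sensitive-data category was detected
    matches: int   # how many hits in this location


def classify(location: str, text: str) -> list[Finding]:
    """Tag one blob of text with the sensitive-data labels it appears to contain."""
    findings = []
    for label, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            findings.append(Finding(location, label, len(hits)))
    return findings


def build_inventory(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each data location to the classification labels found there."""
    return {
        location: [f.label for f in classify(location, text)]
        for location, text in sources.items()
    }
```

Feeding it a few mock locations shows how sprawl becomes visible: a location tagged `email` or `card_number` that sits outside sanctioned storage is exactly the kind of finding an ownership model and policy enforcement would then act on.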

The human factor

While automation and AI can also help support the management of data sprawl, it’s important not to overlook the human side of the equation. Strong leadership and ownership from the top are required to build a successful culture of data management and it’s important to break down silos and foster communication and collaboration across IT, security and business teams.

Control over data sprawl looks set to become even more important as near-future regulations are likely to demand greater transparency on where data resides, how it’s processed and who has access.

“Ultimately, evolving regulations will force businesses to embed privacy-by-design and security-by-default principles into their data strategies, turning regulatory compliance from a checkbox exercise into a core operational discipline,” van Aardt concludes.

Keri Allan

Keri Allan is a freelancer with 20 years of experience writing about technology and has written for publications including the Guardian, the Sunday Times, CIO, E&T and Arabian Computer News. She specialises in areas including the cloud, IoT, AI, machine learning and digital transformation.