What can companies learn from object storage pioneers?


The shift to the cloud is encouraging enterprises to rethink their options on storage. According to a June 2019 study from IHS Markit, 56% of organisations said they plan to increase investment in object storage, putting it ahead of unified storage at 51%, storage-area networks at 48% and network-attached storage at 36%. Most object storage is in the cloud, with popular examples including AWS S3, Azure Blob Storage and Google Cloud Platform (GCP) Cloud Storage.

But shifting to a new storage architecture at the same time as a cloud migration is not entirely painless.

At the beginning of the decade, Moneysupermarket.com, the online consumer comparison and information site for financial services, was using a combination of SQL databases and a SAS analytics environment. By 2014, it had moved to AWS for website hosting and data analytics, including the use of S3 object storage and the Vertica data warehouse. In May 2019, it moved its data and analytics to GCP, using the BigQuery data warehouse and Cloud Storage object storage. The website itself remains on AWS.

Harvinder Atwal, Chief Data Officer at MoneySuperMarket, tells IT Pro: "One of the good things about the cloud is the initial learning curve is very shallow: it's easy to start. But then you get to the point where it's very much steeper and you need to understand some of the complexities involved."

One example of those complexities is the introduction of object lifecycle policies. The idea is to define rules that manage objects throughout the time the organisation needs them: moving them to cheap long-term storage such as AWS Glacier, for example, or expiring them altogether. Getting these rules right from the outset can save costs.

"That's one of the things that maybe we should put a little more effort into from the very beginning," Atwal says.

Other advice for those moving to object storage in the cloud includes avoiding biting off more than the team can chew.

"I would not do the migration all in one go," Atwal says. "I think the bigger project and the more money and resources it uses, the more likely it is to fail. I would encourage people to think of their use case and application and build a minimal viable product around that."

It's worth getting advice about the transition from independent third parties, which the cloud platform vendors can recommend. For example, Moneysupermarket.com used a consultancy called DataTonic to help with its transition to Google Cloud Platform.

Lastly, there can be a cultural change in store for the IT department, Atwal says. "The IT function can be very traditional in its thinking around how you use data. They think you must cleanse it, put it into a relational schema and only then can users access it. But with data today, the value in analytics comes from actually being able to use data from many sources and join them together, and IT has to learn to ditch its historic mindsets."

Nasdaq, the technology-focused stock exchange, began working with AWS in 2012. It stores market, trade and risk data on the platform using S3 and Glacier. Raw data is uploaded to Amazon S3 throughout the trading day; a separate system running in the cloud converts it into Parquet files and places them in their final S3 location. This way, the system can scale elastically to meet the demands of market fluctuations. Nasdaq also uses Amazon Redshift Spectrum to query data in support of billing and reporting, and Presto and Spark on Elastic MapReduce (EMR) or Athena for analytics and research.
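Nasdaq has not published the code behind that pipeline, but the general pattern of converting a raw file to Parquet and placing it in its final S3 location can be sketched with pandas, pyarrow and boto3. The file names, bucket and prefix below are assumptions made purely for illustration.

```python
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read a raw intraday file (hypothetical name) and convert it to Parquet
raw = pd.read_csv("trades_2019-06-03.csv")
table = pa.Table.from_pandas(raw)
pq.write_table(table, "trades_2019-06-03.parquet", compression="snappy")

# Place the Parquet file in its final S3 location (hypothetical bucket and
# prefix), partitioned by trade date so query engines can prune what they scan
s3 = boto3.client("s3")
s3.upload_file(
    "trades_2019-06-03.parquet",
    "example-market-data",
    "parquet/trade_date=2019-06-03/trades.parquet",
)
```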

"Migrating to Amazon S3 as the 'source of truth' means we're able to scale data ingest as needed as well as scale the read side using separate query clusters for transparent billing to internal business units," says Nate Sammons, assistant vice president and lead cloud architect at Nasdaq.

But getting the scale of analytics solutions right for the problem has been a challenge, he says. "We currently operate one of the largest Redshift clusters anywhere, but it's soon to be retired in favour of smaller purpose-specific clusters. Some of the custom technologies we developed [in the early days] have since been retired as cloud services have matured. Had technologies like Amazon Redshift Spectrum existed when we started, we would have gone straight to Amazon S3 to start with, but that was not an option."

The advantage of using S3, though, was that it made the organisation less concerned about individual machine outages or data centre failures, Sammons says. "If one of the Amazon Redshift Spectrum query clusters fails, we can just start another one in its place without losing data. We don't have to do any cluster resizing and we don't require any CPU activity on the query clusters to do data ingest."
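The decoupling Sammons describes, where compute can be thrown away and restarted because the data of record stays in S3, is the same pattern a serverless engine such as Athena (which Nasdaq also mentions using) exposes. Below is a minimal sketch of a query run directly against Parquet files in S3 via boto3; the database, table and output bucket names are invented for illustration.

```python
import time
import boto3

athena = boto3.client("athena")

# Start a query against a table defined over Parquet files in S3
# (database, table and results bucket are hypothetical)
query = athena.start_query_execution(
    QueryString="SELECT symbol, SUM(volume) AS total FROM trades GROUP BY symbol",
    QueryExecutionContext={"Database": "market_data"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes; the compute is managed by the service,
# so there is no cluster to size, resize or restart if something fails
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

results = athena.get_query_results(QueryExecutionId=query_id)
```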

Rahul Gupta, IT transformation expert at PA Consulting, says those exploiting object storage in the cloud should know that its apparent scalability and elasticity do not remove the need for some basic housekeeping on data.

"A lot of people feel storage is cheap, so they build systems with vast amounts of data and think the impact on cost is not that great. They push the data into S3, or an equivalent, and then once it's in there, they feel that they can impose structure on the data, which is not the right thing to do," he says.

He says that by understanding data structure upfront and creating governance such as role-based access controls, organisations will not have to revisit the architecture as the data grows.
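One concrete form that governance can take is a bucket policy granting read access on a curated prefix only to a specific role. The sketch below shows the idea using boto3; the account ID, role name and bucket are hypothetical.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical policy: only the analytics-readers role may read objects
# under the curated/ prefix of the data lake bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AnalyticsReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-readers"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/curated/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="example-data-lake", Policy=json.dumps(policy))
```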

The fact that so many organisations are moving storage to the cloud does not mean they all get the same value from the transition. The considerable investment in cloud infrastructure, storage and analytics applications will offer the greatest returns to those who understand the storage lifecycle upfront, create governance rules around access and understand their data structure from the outset.