What is database sharding and how does it benefit enterprise IT?

Implemented properly, database sharding can radically improve performance and scalability


As enterprises grow their operations and customer base, it becomes harder to keep on top of the amount of data that needs to be stored and managed. In particular, a traditional single database server can become a bottleneck when many users try to save, access, and retrieve data; in extreme circumstances, the system might crash or deliver a degraded user experience.

Database sharding is the process of splitting a large database into multiple smaller, independent, and manageable databases distributed across multiple machines or nodes within an enterprise network. It exists to alleviate the problems that come with storing enterprise datasets, including big data.

“Picture a massive library split into separate rooms,” says Paul Done, field CTO Modernization at MongoDB, one of the largest database companies to implement sharding.

“Instead of cramming every book into one room, books are spread across different rooms. Finding what you need remains quick as the number of books increases over time, and there isn’t a single room that becomes overcrowded.”

A typical database server stores data in tables of rows and columns. Each shard contains a unique, non-overlapping subset of those rows; the shards remain interrelated, and together they make up the larger database.

There are three components of shards:

  • Logical shard: a set of data that has been systematically partitioned into smaller units - the shard we just defined.
  • Physical shard: the actual server or machine that “physically” holds one or more logical shards.
  • Shard key: a field that determines how data is divided and distributed across shards.
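A minimal Python sketch can tie these three components together. The host names and two-shard layout below are hypothetical, with Customer_ID playing the role of the shard key:

```python
# Minimal sketch of sharding's three components (hypothetical data and hosts).

# Logical shards: systematically partitioned subsets of the data.
logical_shards = {0: {}, 1: {}}

# Physical shards: the machines that hold logical shards (stand-ins here).
physical_hosts = {0: "db-node-a.example.com", 1: "db-node-b.example.com"}

def shard_for(customer_id: int) -> int:
    """Shard key -> shard: Customer_ID decides where the row lives."""
    return customer_id % len(logical_shards)

def insert(row: dict) -> str:
    """Store a row in its logical shard; return the physical host it landed on."""
    shard = shard_for(row["customer_id"])
    logical_shards[shard][row["customer_id"]] = row
    return physical_hosts[shard]

host = insert({"customer_id": 1000, "name": "Jack", "city": "London"})
```

Real systems route through a coordinator rather than a dictionary lookup, but the division of labor between key, logical shard, and physical shard is the same.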

Business benefits of sharding

Databases can be scaled vertically or horizontally. Vertical scaling expands the capabilities of a single database server by adding more memory, storage, and CPU power, sometimes with parallel processing. A vertically scaled database can handle large volumes of data, but only up to the limits of one machine.

Horizontal scalability: Sharding scales databases horizontally. Adding more shards spreads data volume and memory load across machines and raises transaction throughput, which suits big data and intensive deep learning workloads.

Cost-cutting: Enterprises can end up spending a significant portion of their IT budgets on extra storage and computational power. “For firms, database sharding is far more cost-effective than upgrading one large server, which at some point isn’t possible to go any further with,” Done explains.

Improved response time: However much data a single database server stores, it can only process a limited number of queries at once. By splitting the database into smaller parts, each query searches fewer rows and columns, so response times improve drastically, making it easier for enterprises to retrieve data.

Fault tolerance: In a monolithic database, a hardware malfunction, software downtime, or even planned maintenance can limit database operations and availability, creating a single point of failure (SPOF): a total service outage that compromises business continuity. Sharding is built on a shared-nothing architecture, where each shard is an independent block of data. If a shard’s host node fails, the rest of the database remains functional for enterprise operations.

How to perform database sharding

Database sharding isn’t a silver bullet, nor is it the method to use in every circumstance. “Sharding methods should be selected based on specific enterprise needs,” says Radoslaw Szulgo, senior product manager at database company Percona. In addition, the choice of shard keys affects the performance and scalability of database management systems.

To understand sharding, let’s consider a customer dataset with the fields Customer_ID, Name, Age, and City.

| Entry number | Customer ID | Name | Age | City |
|---|---|---|---|---|
| 1 | 1000 | Jack | 35 | London |
| 2 | 1001 | Kim | 28 | Seoul |
| 3 | 1002 | Nick | 43 | Edinburgh |
| 4 | 1003 | Olivia | 22 | Mumbai |
| 5 | 1004 | George | 31 | San Diego |
| 6 | 1005 | Lily | 26 | Tokyo |
| 7 | 1006 | James | 30 | Chicago |
| … | … | … | … | … |
| 5000 | 5999 | Carol | 32 | Berlin |

Dynamic sharding

Also known as range-based sharding, dynamic sharding splits database rows into a range of values, based on any field from the database. By setting a predefined range, data from the larger database is split into shards.

The system starts with fewer shards. As entries increase, multiple shards appear. When a single shard becomes too large or busy for user queries, it automatically splits into two or more.

For example, shard A contains all customers from 1 to 5000. When shard A becomes busy, it splits into two shards, A1 containing customer entries from 1 to 2500, and A2 containing customer entries from 2501 to 5000.

The automatic splitting property and quick implementation make dynamic sharding an ideal choice for enterprises in industries with fast-moving, large-scale information such as healthcare and finance.
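The range lookup and the shard A split described above can be sketched in Python. The boundaries and entry numbers mirror the example in the text, but the data structures are illustrative only:

```python
import bisect

# Sketch of range-based (dynamic) sharding.
# `bounds` holds the sorted lower bound of each shard's range.
bounds = [1]                            # one shard covering entries 1 and up
shards = {1: list(range(1, 5001))}      # "shard A": entries 1..5000

def shard_for(entry: int) -> int:
    """Return the lower bound of the range a given entry falls into."""
    return bounds[bisect.bisect_right(bounds, entry) - 1]

def split(low: int, mid: int) -> None:
    """Split one busy shard into two, as shard A -> A1 + A2 in the text."""
    rows = shards.pop(low)
    shards[low] = [r for r in rows if r < mid]    # A1: entries 1..2500
    shards[mid] = [r for r in rows if r >= mid]   # A2: entries 2501..5000
    bisect.insort(bounds, mid)

split(1, 2501)   # shard A becomes A1 (1-2500) and A2 (2501-5000)
```

Production systems trigger the split automatically on size or load thresholds; here it is invoked by hand to keep the mechanics visible.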

Hashed sharding

Hashed sharding is also known as algorithmic sharding because it applies an algorithm, known as a hash function, to the shard key in each row. The value the hash function returns determines which shard each entry is assigned to.

For example, let’s suppose that the customer ID is the shard key. The hashing function can be Customer_ID % 4.

| Customer ID | Shard assignment |
|---|---|
| 1000 | 0 |
| 1001 | 1 |
| 1002 | 2 |
| 1003 | 3 |
| 1004 | 0 |
| 1005 | 1 |
| 1006 | 2 |
| 5999 | 3 |

There will be four shards, numbered 0, 1, 2, and 3. The table shows that customer IDs 1000 and 1004 are assigned to shard 0, 1001 and 1005 to shard 1, and so on.
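Using the Customer_ID % 4 function from the example, a few lines of Python reproduce these assignments:

```python
# Hashed (algorithmic) sharding sketch using the example's hash function:
# shard = Customer_ID % 4, giving four shards numbered 0-3.
NUM_SHARDS = 4

def shard_for(customer_id: int) -> int:
    """Apply the hash function to the shard key to pick a shard."""
    return customer_id % NUM_SHARDS

customer_ids = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 5999]
assignments = {cid: shard_for(cid) for cid in customer_ids}
# 1000 and 1004 both land in shard 0; 5999 lands in shard 3
```

Note that a plain modulus forces most data to move if the shard count changes, which is why real systems often layer consistent hashing on top of this idea.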

“The most common sharding method we’ve seen at enterprises is ‘hash-based’ sharding because it reliably offers uniform data distribution,” Szulgo says.

Geolocation sharding

Geolocation sharding, used by large-scale multinational corporations, uses location as the shard key, and the physical shards themselves are geographically distributed, each placed in the region it serves.

| Name | City | Shard |
|---|---|---|
| Jack | London | Europe |
| Kim | Seoul | Asia |
| Nick | Edinburgh | Europe |
| Olivia | Mumbai | Asia |
| George | San Diego | North America |
| Lily | Tokyo | Asia |
| James | Chicago | North America |
| Carol | Berlin | Europe |

As data is stored near the customer location, latency and compliance risks remain low. Due to extensive use in certain regions, some shards can become overloaded while others in less active locations remain near empty. These ‘data hotspots’ limit the ability of sharded databases to perform load balancing and run fast queries.

To get around this, enterprises can further divide popular locations into distinct shards, or fine-tune the distribution of shard keys.
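A rough Python sketch of geolocation sharding follows, with a hypothetical city-to-region mapping and a simple hotspot check over the sample customers:

```python
from collections import Counter

# Geolocation sharding sketch: the city -> region mapping is illustrative.
CITY_TO_SHARD = {
    "London": "Europe", "Edinburgh": "Europe", "Berlin": "Europe",
    "Seoul": "Asia", "Mumbai": "Asia", "Tokyo": "Asia",
    "San Diego": "North America", "Chicago": "North America",
}

def shard_for(city: str) -> str:
    """Route a row to the regional shard nearest its customer."""
    return CITY_TO_SHARD[city]

rows = [("Jack", "London"), ("Kim", "Seoul"), ("Nick", "Edinburgh"),
        ("Olivia", "Mumbai"), ("George", "San Diego"), ("Lily", "Tokyo"),
        ("James", "Chicago"), ("Carol", "Berlin")]

# Hotspot check: count rows per regional shard to spot overloaded regions.
load = Counter(shard_for(city) for _, city in rows)
```

With this sample the load is roughly balanced (Europe 3, Asia 3, North America 2); a heavily skewed counter would signal a region worth subdividing into further shards.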

Shard keys should have high cardinality - that is, many unique values - to facilitate even distribution of data. Another factor is the frequency of the shard key: how often each value appears in the database. The frequency of a shard key’s values should be neither too high nor too low, but balanced.
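Both properties can be checked before committing to a shard key. The following sketch, over a small hypothetical sample, compares the cardinality and frequency skew of two candidate keys:

```python
from collections import Counter

# Evaluating candidate shard keys on a hypothetical sample of rows.
rows = [
    {"customer_id": 1000, "city": "London"},
    {"customer_id": 1001, "city": "Seoul"},
    {"customer_id": 1002, "city": "London"},
    {"customer_id": 1003, "city": "London"},
]

def key_stats(field: str):
    """Return (cardinality, share of rows held by the most frequent value)."""
    values = [row[field] for row in rows]
    counts = Counter(values)
    cardinality = len(counts)
    top_share = counts.most_common(1)[0][1] / len(values)
    return cardinality, top_share
```

Here customer_id has cardinality 4 with each value appearing 25% of the time (evenly distributed), while city has cardinality 2 and "London" accounts for 75% of rows, a hotspot risk if used alone.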

Enterprise strategy: sharding with partitioning and replication

Data replication maintains identical copies of data across multiple physical nodes in the enterprise network. When replication is implemented with sharding, each shard is a replica set, so that if one copy of a shard becomes unavailable the database remains functional. Because a shard’s replicas run on multiple nodes with the same schema and stay consistent, data can be retrieved from another replica.

Data partitioning is another database management strategy that divides data into manageable segments, known as partitions. The difference between partitioning and sharding is that partitioning runs on the same database server while shards run on different servers and machines. Enterprises implement partitioning for databases that handle a high volume of read operations, whereas sharding is used for write operations.

Done properly, partitioning and sharding can work in tandem to radically improve performance. But enterprises need to consider their data architecture and implementation carefully, lest they accidentally overcomplicate systems and introduce data silos.

“Frequent queries and cross-system requests require intricate coordination, sometimes leading to ungoverned local copies and security risks,” explains Tom Peirson-Webber, VP Engineering at Harbr Data.

“The solution is architectural – centralizing access and governance while keeping storage distributed.”

Enterprises also need to consider whether their database is large enough to be sharded. Big data, on the order of multiple terabytes or petabytes, is a strong candidate for sharding, because vertical scaling would eventually exhaust bandwidth and increase latency. For smaller databases, however, the overhead of managing multiple complex database partitions can burn CapEx for little benefit.

Database sharding can lower latency radically, but the architecture needs a query router to check multiple shards when answering queries. At higher volumes, the router is burdened with communicating with multiple shards and collecting their data, which can slow queries and concentrate workload on a single physical node.
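A toy Python sketch illustrates this scatter-gather cost; the shard contents are hypothetical. A query that cannot use the shard key must fan out to every shard, while a shard-key lookup touches exactly one:

```python
# Query router sketch: three shards keyed by customer_id % 3 (illustrative).
shards = {
    0: [{"customer_id": 1002, "city": "London"}],
    1: [{"customer_id": 1000, "city": "London"}],
    2: [{"customer_id": 1001, "city": "Seoul"}],
}

def query(predicate):
    """Without the shard key, the router must ask every shard (fan-out)."""
    results = []
    for shard_rows in shards.values():
        results.extend(r for r in shard_rows if predicate(r))
    return results

def get_by_id(customer_id: int):
    """With the shard key, the router contacts exactly one shard."""
    rows = shards[customer_id % len(shards)]
    return [r for r in rows if r["customer_id"] == customer_id]

londoners = query(lambda r: r["city"] == "London")  # touches all 3 shards
```

The more queries take the fan-out path, the more the router becomes the bottleneck described above, which is another reason the shard key should match the most common access pattern.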

Growing databases demand high computational power and bring operational overhead and increased costs; implemented well, database sharding improves data storage patterns and contributes to effective big data management.

Venus Kohli
Freelance writer

Venus is a freelance technology writer specializing in IT, quantum physics, and electronics, among other technical fields. She holds a degree in Electronics and Telecommunications Engineering from Mumbai University, India.

With years of experience in writing for global media brands and IT companies, she enjoys translating complex content into engaging stories. When she’s not writing about the latest IT trends, Venus can be found tracking enterprise trends or the newest processor in town.