What is database sharding and how does it benefit enterprise IT?

Implemented properly, database sharding can radically improve performance and scalability


As enterprises grow their operations and customer base, it becomes harder to keep on top of the amount of data that needs to be stored and managed. In particular, a traditional single database server can become a bottleneck when many users try to save, access, and retrieve data; in extreme circumstances, the system might crash or deliver a degraded user experience.

Database sharding is the process of splitting a large database into multiple smaller, independent, and manageable databases distributed across multiple machines or nodes within an enterprise network. It exists to alleviate the problems that come with storing enterprise datasets, including big data.

“Picture a massive library split into separate rooms,” says Paul Done, field CTO Modernization at MongoDB, one of the largest database companies to implement sharding.

“Instead of cramming every book into one room, books are spread across different rooms. Finding what you need remains quick as the number of books increases over time, and there isn’t a single room that becomes overcrowded.”

A typical database server stores data in tables of rows and columns. Each shard contains a unique, non-overlapping subset of those rows; the shards remain interrelated, and together they make up the larger database.

There are three components of shards:

  • Logical shard: a set of data that has been systematically partitioned into smaller units - the shard we just defined.
  • Physical shard: the actual server or machine that “physically” holds one or more logical shards.
  • Shard key: a field that determines how data is divided and distributed across shards.
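A minimal Python sketch can tie these three components together. The host names and two-shard layout below are hypothetical, with Customer_ID playing the role of the shard key:

```python
# Minimal sketch of sharding's three components (hypothetical data and hosts).

# Logical shards: systematically partitioned subsets of the data.
logical_shards = {0: {}, 1: {}}

# Physical shards: the machines that hold logical shards (stand-ins here).
physical_hosts = {0: "db-node-a.example.com", 1: "db-node-b.example.com"}

def shard_for(customer_id: int) -> int:
    """Shard key -> shard: Customer_ID decides where the row lives."""
    return customer_id % len(logical_shards)

def insert(row: dict) -> str:
    """Store a row in its logical shard; return the physical host it landed on."""
    shard = shard_for(row["customer_id"])
    logical_shards[shard][row["customer_id"]] = row
    return physical_hosts[shard]

host = insert({"customer_id": 1000, "name": "Jack", "city": "London"})
```

Real systems route through a coordinator rather than a dictionary lookup, but the division of labor between key, logical shard, and physical shard is the same.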

Business benefits of sharding

Databases can be scaled vertically or horizontally. Vertical scaling expands the capabilities of a single database server by adding more memory, storage, and CPU power, sometimes with parallel processing. A vertically scaled database can handle large volumes of data, but only up to the limits of one machine.

Horizontal scalability: Sharding scales databases horizontally. Adding more shards spreads data volume and memory load across machines and raises transaction throughput, which suits big data and intensive deep learning workloads.

Cost-cutting: Enterprises can end up spending a significant portion of their IT budgets on extra storage and computational power. “For firms, database sharding is far more cost-effective than upgrading one large server, which at some point isn’t possible to go any further with,” Done explains.

Improved response time: However much data a single database server stores, it can only process a limited number of queries at once. By splitting the database into smaller parts, each query searches fewer rows and columns, so response times improve drastically, making it easier for enterprises to retrieve data.

Fault tolerance: In a monolithic database, a hardware malfunction, software downtime, or even planned maintenance can limit database operations and availability, creating a single point of failure (SPOF): a total service outage that compromises business continuity. Sharding is built on a shared-nothing architecture, where each shard is an independent block of data. If a shard’s host node fails, the rest of the database remains functional for enterprise operations.

How to perform database sharding

Database sharding isn’t a silver bullet, nor is it the method to use in every circumstance. “Sharding methods should be selected based on specific enterprise needs,” says Radoslaw Szulgo, senior product manager at database company Percona. In addition, the choice of shard keys affects the performance and scalability of database management systems.

To understand sharding, let’s consider a customer dataset with the fields Customer_ID, Name, Age, and City.

| Entry number | Customer ID | Name | Age | City |
|---|---|---|---|---|
| 1 | 1000 | Jack | 35 | London |
| 2 | 1001 | Kim | 28 | Seoul |
| 3 | 1002 | Nick | 43 | Edinburgh |
| 4 | 1003 | Olivia | 22 | Mumbai |
| 5 | 1004 | George | 31 | San Diego |
| 6 | 1005 | Lily | 26 | Tokyo |
| 7 | 1006 | James | 30 | Chicago |
| … | … | … | … | … |
| 5000 | 5999 | Carol | 32 | Berlin |

Dynamic sharding

Also known as range-based sharding, dynamic sharding splits database rows into a range of values, based on any field from the database. By setting a predefined range, data from the larger database is split into shards.

The system starts with fewer shards. As entries increase, multiple shards appear. When a single shard becomes too large or busy for user queries, it automatically splits into two or more.

For example, shard A contains all customers from 1 to 5000. When shard A becomes busy, it splits into two shards, A1 containing customer entries from 1 to 2500, and A2 containing customer entries from 2501 to 5000.

The automatic splitting property and quick implementation make dynamic sharding an ideal choice for enterprises in industries with fast-moving, large-scale information such as healthcare and finance.
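The range lookup and the shard A split described above can be sketched in Python. The boundaries and entry numbers mirror the example in the text, but the data structures are illustrative only:

```python
import bisect

# Sketch of range-based (dynamic) sharding.
# `bounds` holds the sorted lower bound of each shard's range.
bounds = [1]                            # one shard covering entries 1 and up
shards = {1: list(range(1, 5001))}      # "shard A": entries 1..5000

def shard_for(entry: int) -> int:
    """Return the lower bound of the range a given entry falls into."""
    return bounds[bisect.bisect_right(bounds, entry) - 1]

def split(low: int, mid: int) -> None:
    """Split one busy shard into two, as shard A -> A1 + A2 in the text."""
    rows = shards.pop(low)
    shards[low] = [r for r in rows if r < mid]    # A1: entries 1..2500
    shards[mid] = [r for r in rows if r >= mid]   # A2: entries 2501..5000
    bisect.insort(bounds, mid)

split(1, 2501)   # shard A becomes A1 (1-2500) and A2 (2501-5000)
```

Production systems trigger the split automatically on size or load thresholds; here it is invoked by hand to keep the mechanics visible.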

Hashed sharding

Hashed sharding is also known as algorithmic sharding because it applies an algorithm, known as a hash function, to the shard key in each row. The value the hash function returns determines which shard each entry is assigned to.

For example, let’s suppose that the customer ID is the shard key. The hashing function can be Customer_ID % 4.

| Customer ID | Shard assignment |
|---|---|
| 1000 | 0 |
| 1001 | 1 |
| 1002 | 2 |
| 1003 | 3 |
| 1004 | 0 |
| 1005 | 1 |
| 1006 | 2 |
| 5999 | 3 |

There will be four shards, numbered 0, 1, 2, and 3. The table shows that customer IDs 1000 and 1004 are assigned to shard 0, 1001 and 1005 to shard 1, and so on.
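Using the Customer_ID % 4 function from the example, a few lines of Python reproduce these assignments:

```python
# Hashed (algorithmic) sharding sketch using the example's hash function:
# shard = Customer_ID % 4, giving four shards numbered 0-3.
NUM_SHARDS = 4

def shard_for(customer_id: int) -> int:
    """Apply the hash function to the shard key to pick a shard."""
    return customer_id % NUM_SHARDS

customer_ids = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 5999]
assignments = {cid: shard_for(cid) for cid in customer_ids}
# 1000 and 1004 both land in shard 0; 5999 lands in shard 3
```

Note that a plain modulus forces most data to move if the shard count changes, which is why real systems often layer consistent hashing on top of this idea.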

“The most common sharding method we’ve seen at enterprises is ‘hash-based’ sharding because it reliably offers uniform data distribution,” Szulgo says.

Geolocation sharding

Geolocation sharding, used by large-scale multinational corporations, uses location as the shard key, and the physical shards themselves are geographically distributed, each placed in the region it serves.

| Name | City | Shard |
|---|---|---|
| Jack | London | Europe |
| Kim | Seoul | Asia |
| Nick | Edinburgh | Europe |
| Olivia | Mumbai | Asia |
| George | San Diego | North America |
| Lily | Tokyo | Asia |
| James | Chicago | North America |
| Carol | Berlin | Europe |

As data is stored near the customer location, latency and compliance risks remain low. Due to extensive use in certain regions, some shards can become overloaded while others in less active locations remain near empty. These ‘data hotspots’ limit the ability of sharded databases to perform load balancing and run fast queries.

To get around this, enterprises can further divide popular locations into distinct shards, or fine-tune the distribution of shard keys.
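A rough Python sketch of geolocation sharding follows, with a hypothetical city-to-region mapping and a simple hotspot check over the sample customers:

```python
from collections import Counter

# Geolocation sharding sketch: the city -> region mapping is illustrative.
CITY_TO_SHARD = {
    "London": "Europe", "Edinburgh": "Europe", "Berlin": "Europe",
    "Seoul": "Asia", "Mumbai": "Asia", "Tokyo": "Asia",
    "San Diego": "North America", "Chicago": "North America",
}

def shard_for(city: str) -> str:
    """Route a row to the regional shard nearest its customer."""
    return CITY_TO_SHARD[city]

rows = [("Jack", "London"), ("Kim", "Seoul"), ("Nick", "Edinburgh"),
        ("Olivia", "Mumbai"), ("George", "San Diego"), ("Lily", "Tokyo"),
        ("James", "Chicago"), ("Carol", "Berlin")]

# Hotspot check: count rows per regional shard to spot overloaded regions.
load = Counter(shard_for(city) for _, city in rows)
```

With this sample the load is roughly balanced (Europe 3, Asia 3, North America 2); a heavily skewed counter would signal a region worth subdividing into further shards.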

Shard keys should have high cardinality - that is, many unique values - to facilitate even distribution of data. Another factor is the frequency of the shard key: how often each value appears in the database. The frequency of a shard key’s values should be neither too high nor too low, but balanced.
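Both properties can be checked before committing to a shard key. The following sketch, over a small hypothetical sample, compares the cardinality and frequency skew of two candidate keys:

```python
from collections import Counter

# Evaluating candidate shard keys on a hypothetical sample of rows.
rows = [
    {"customer_id": 1000, "city": "London"},
    {"customer_id": 1001, "city": "Seoul"},
    {"customer_id": 1002, "city": "London"},
    {"customer_id": 1003, "city": "London"},
]

def key_stats(field: str):
    """Return (cardinality, share of rows held by the most frequent value)."""
    values = [row[field] for row in rows]
    counts = Counter(values)
    cardinality = len(counts)
    top_share = counts.most_common(1)[0][1] / len(values)
    return cardinality, top_share
```

Here customer_id has cardinality 4 with each value appearing 25% of the time (evenly distributed), while city has cardinality 2 and "London" accounts for 75% of rows, a hotspot risk if used alone.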

Enterprise strategy: sharding with partitioning and replication

Data replication maintains identical copies of data across multiple physical nodes in the enterprise network. When replication is implemented with sharding, each shard is a replica set, so that if one copy of a shard becomes unavailable the database remains functional. Because a shard’s replicas run on multiple nodes with the same schema and stay consistent, data can be retrieved from another replica.

Data partitioning is another database management strategy that divides data into manageable segments, known as partitions. The difference between partitioning and sharding is that partitioning runs on the same database server while shards run on different servers and machines. Enterprises implement partitioning for databases that handle a high volume of read operations, whereas sharding is used for write operations.

Done properly, partitioning and sharding can work in tandem to radically improve performance. But enterprises need to consider their data architecture and implementation carefully, lest they accidentally overcomplicate systems and introduce data silos.

“Frequent queries and cross-system requests require intricate coordination, sometimes leading to ungoverned local copies and security risks,” explains Tom Peirson-Webber, VP Engineering at Harbr Data.

“The solution is architectural – centralizing access and governance while keeping storage distributed.”

Enterprises also need to consider whether their database is large enough to be sharded. Big data, on the order of multiple terabytes or petabytes, is a strong candidate for sharding, because vertical scaling would eventually exhaust bandwidth and increase latency. For smaller databases, however, the overhead of managing multiple complex database partitions can burn CapEx for little benefit.

Database sharding can lower latency radically, but the architecture needs a query router to check multiple shards when answering queries. At higher volumes, the router is burdened with communicating with multiple shards and collecting their data, which can slow queries and concentrate workload on a single physical node.
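A toy Python sketch illustrates this scatter-gather cost; the shard contents are hypothetical. A query that cannot use the shard key must fan out to every shard, while a shard-key lookup touches exactly one:

```python
# Query router sketch: three shards keyed by customer_id % 3 (illustrative).
shards = {
    0: [{"customer_id": 1002, "city": "London"}],
    1: [{"customer_id": 1000, "city": "London"}],
    2: [{"customer_id": 1001, "city": "Seoul"}],
}

def query(predicate):
    """Without the shard key, the router must ask every shard (fan-out)."""
    results = []
    for shard_rows in shards.values():
        results.extend(r for r in shard_rows if predicate(r))
    return results

def get_by_id(customer_id: int):
    """With the shard key, the router contacts exactly one shard."""
    rows = shards[customer_id % len(shards)]
    return [r for r in rows if r["customer_id"] == customer_id]

londoners = query(lambda r: r["city"] == "London")  # touches all 3 shards
```

The more queries take the fan-out path, the more the router becomes the bottleneck described above, which is another reason the shard key should match the most common access pattern.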

Growing databases demand high computational power and bring operational overhead and increased costs; implemented well, database sharding improves data storage patterns and contributes to effective big data management.

Venus Kohli
Freelance writer

Venus is a freelance technology writer specializing in IT, quantum physics, and electronics, among other technical fields. She holds a degree in Electronics and Telecommunications Engineering from Mumbai University, India.

With years of experience in writing for global media brands and IT companies, she enjoys translating complex content into engaging stories. When she’s not writing about the latest IT trends, Venus can be found tracking enterprise trends or the newest processor in town.