IEEE Transaction Publication – October 2010


Sharding Theory, Framework and Practice

A. Peter, S. Kumar, D. Vachani, IEEE Members

Abstract—This document defines a fundamental theory, framework, and implementation practices of sharding. The research conducted here focuses on growth, performance, scalability, manageability, and availability of different types of data sets.

Index Terms—Sharding, Distributed Systems, Enterprise Architecture

I. INTRODUCTION

There is a data explosion that began with the democratization of the Internet, and Moore's Law does not apply to it. Google and other big-data storage and processing companies have acknowledged this trend; Eric Schmidt, the CEO of Google, has openly acknowledged that existing hardware cannot handle the volume of data collected from the Internet. If you are reading this paper, most likely you are bombarded with data that needs to be managed with low-cost commodity hardware strung together with smart code. [1]

The traditional precursor of sharding is partitioning, a logical division of data within a single host sharing common disk I/O, CPU, and RAM. Recently, database partitioning has started utilizing multiple nodes to spread out data processing, but it requires partitioning, indexing, and other data-management overhead on a regular basis. This paper simplifies the classification and spreading of data, which we call the sharding mechanism.

The data distribution is classified into:
- Uniform
- Group
- Random
- Distinct

The sharding mechanism is classified into:
- Vertical
- Range
- Hash Key
- Directory Lookup

The fundamental goal of sharding is to manage the Data Volume (DV) to Response Time (RT) ratio, which is the key indicator of the sharding heuristic. Traditional scaling methods follow a strict linear function (y = a + bx), whereas the sharding method follows a power function (y = ax^b). [2]

Fig. 1. The dotted line shows the traditional growth of the Data Volume (DV) to Response Time (RT) ratio. The solid line shows the DV:RT ratio with the sharding methods proposed in this paper: the RT can be kept fairly constant while the DV increases. This is the fundamental goal of sharding.
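To make the two growth models concrete, the sketch below evaluates both functions at a few data volumes. It is a minimal illustration; the constants a and b are assumed values chosen only to show the shapes of the curves, not measurements from any system.

```python
# Illustrative comparison of the two growth models: traditional scaling
# follows the linear function y = a + b*x, while sharding follows the
# power function y = a*x**b. With 0 < b < 1 the power curve flattens,
# so response time stays nearly constant as data volume grows.
# The constants below are assumptions for illustration only.

def rt_traditional(dv, a=5.0, b=0.8):
    """Linear model: response time grows proportionally with data volume."""
    return a + b * dv

def rt_sharded(dv, a=5.0, b=0.3):
    """Power model: response time grows sub-linearly for 0 < b < 1."""
    return a * dv ** b

for dv in (10, 100, 1_000, 10_000):  # data volume, arbitrary units
    print(f"DV={dv:>6}  traditional RT={rt_traditional(dv):>8.1f}  "
          f"sharded RT={rt_sharded(dv):>6.1f}")
```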

Whenever the data volume is too large to handle all the records in one table or database, sharding is required to meet reporting SLAs and load performance.

II. WHAT IS SHARDING?

Sharding is a database technique where you break up a big database into many smaller ones. It is the process of splitting up a database or table across multiple machines to improve the scalability of an application. The justification for database sharding is that, after a certain scale point, it is cheaper and more feasible to scale a site horizontally by adding more machines than to grow it vertically by adding beefier servers.

There are many advantages to this sharding approach. The total number of rows in each table is reduced. This reduction in data size reduces index size, which generally improves search performance.


Database shards can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables distribution of the database over a large number of commodity machines, and spreading the database across multiple machines improves performance significantly. Scaling out on commodity hardware is relatively easy, and linear growth can be planned for as the data volume increases. In addition, if the database shard is based on some real-world segmentation of the data (e.g., customers vs. contacts), then it may be possible to infer the appropriate shard membership.

III. COMMON SHARDING SCHEMES

There are a number of different schemes one could use to decide how to break an application database up into multiple smaller databases. Below are four of the most popular schemes used by various large-scale Web applications today.

A. Vertical Sharding

A simple way to segment your database is to organize the tables by their specific data retrieval needs. An efficient sharding method in this scenario is to physically separate the tables across multiple machines.

Fig. 2. This illustration shows Vertical Sharding and lookup.

Ex: Contacts table -- Shard 1
    User table -- Shard 2
    Preference table -- Shard 3
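Since vertical sharding assigns whole tables to shards, routing reduces to a static table-to-shard map. Below is a minimal sketch of that idea; the table names follow the example above, and the host names are hypothetical.

```python
# Vertical sharding routing: each whole table lives on exactly one
# shard, so a static map is all the routing logic required.
# Host names are hypothetical placeholders.

VERTICAL_SHARD_MAP = {
    "contacts":    "shard1.db.example.com",
    "users":       "shard2.db.example.com",
    "preferences": "shard3.db.example.com",
}

def shard_for_table(table_name: str) -> str:
    """Return the database host that owns the given table."""
    return VERTICAL_SHARD_MAP[table_name]

print(shard_for_table("users"))  # -> shard2.db.example.com
```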

B. Range Based Sharding

In situations where the entire data set for a single feature or table still needs to be further subdivided across multiple servers, it is important to ensure that the data is split up in a predictable manner. One approach to ensuring this predictability is to split the data based on value ranges that occur within each entity.

Fig. 3. This illustration shows Range Based Sharding and lookup.

Ex. User table (User Name A - G) -- Shard 1
    User table (User Name H - O) -- Shard 2
    User table (User Name P - Z) -- Shard 3
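A minimal sketch of this range routing follows, assuming the A-G / H-O / P-Z split from the example above; the shard names are placeholders.

```python
# Range-based routing: the first letter of the user name decides the
# shard, mirroring the A-G / H-O / P-Z split in the example above.

RANGES = [
    ("A", "G", "shard1"),
    ("H", "O", "shard2"),
    ("P", "Z", "shard3"),
]

def shard_for_user(user_name: str) -> str:
    """Route a user to the shard whose range covers its first letter."""
    first = user_name[0].upper()
    for low, high, shard in RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no range covers {user_name!r}")

print(shard_for_user("Kumar"))  # 'K' falls in H-O -> shard2
```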

C. Hash Key Based Sharding

With this approach, each entity has a value that can be used as input to a hash function whose output is used to determine which database server to use. The advantage over range based sharding is a broader distribution of the sharding key.

Fig. 4. This illustration shows Hash Key Based Sharding, which achieves a broader distribution by applying a hash function to the sharding values.

Ex. User table - hash(user name {Aa…Gz}) -- Shard 1
    User table - hash(user name {Ha…Oz}) -- Shard 2
    User table - hash(user name {Pa…Zz}) -- Shard 3
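A minimal sketch of hash key routing follows. MD5 is used here because Section V suggests it as the hash; the shard count is an assumed parameter.

```python
# Hash key routing: a hash of the sharding key is reduced to a shard
# number, spreading adjacent key values across all shards instead of
# clustering them within one range.

import hashlib

NUM_SHARDS = 3  # assumed shard count

def shard_for_key(value: str) -> int:
    """Hash the value and reduce it to a shard number in 0..NUM_SHARDS-1."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for name in ("Alice", "Bob", "Harry", "Peter"):
    print(name, "->", shard_for_key(name))
```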

D. Directory Based Sharding

A loosely coupled approach to this problem is to create a lookup service which knows your current partitioning scheme and abstracts it away from the database access code. This means the lookupReflector() method actually hits a web service or a database which stores and returns the mapping between each entity key and the database server it resides on.


This loosely coupled approach means you can perform tasks like adding servers to the database pool or changing your partitioning scheme without having to impact your application. Consider the previous example, where there are ten servers and the hash function is a modulo operation. Say we want to add five database servers to the pool without incurring downtime. We can keep the existing distribution function, add these servers to the pool, and then run a loader script that loads data to the new servers and/or redistributes the existing data.

Fig. 5. This illustration shows Directory Based Sharding.

Ex. User table (User Name T, O, N) -- Shard 1
    User table (User Name B, Y, S) -- Shard 2
    User table (User Name A, P, Z, M) -- Shard 3
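A minimal sketch of directory-based routing follows. The in-memory dict stands in for the web service or database behind lookupReflector(); the mapping mirrors the example above and is otherwise hypothetical.

```python
# Directory-based routing: an explicit lookup table maps each entity
# key to its shard. Because the mapping lives in data rather than in
# code, re-sharding is a directory update, not an application change.

DIRECTORY = {  # mirrors the T,O,N / B,Y,S / A,P,Z,M example above
    "T": "shard1", "O": "shard1", "N": "shard1",
    "B": "shard2", "Y": "shard2", "S": "shard2",
    "A": "shard3", "P": "shard3", "Z": "shard3", "M": "shard3",
}

def lookup_reflector(user_name: str) -> str:
    """Return the shard that the directory assigns to this entity key."""
    return DIRECTORY[user_name[0].upper()]

DIRECTORY["M"] = "shard4"          # move all 'M' users to a new shard
print(lookup_reflector("Miller"))  # -> shard4
```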

IV. SHARDING FRAMEWORK

This framework illustrates how data is loaded into the cluster and retrieved back for consumption. In data loading, the full data set is split into multiple shard-specific data files using one of the sharding methods. The shard-specific data files are then loaded into their respective shards. Then, to reflect the data changes, the central data mapping is updated. The sharded data is distributed to the geo-distributed cluster using one of the following methods (a code sketch of this loading flow follows Fig. 7):
1. Geo-distribution based on replication (Fig. 6).
2. Geo-distribution based on dual load (Fig. 7).

Geo Distribution Based on Replication: The secondary cluster is located in a different geographic location, and data is replicated to it using native database replication. The illustration in Figure 6 depicts this approach.

Fig. 6. This illustration shows how data is loaded into the shards, replicated using database methods, loaded into the cluster, and loaded into the geo cluster based on replication techniques.

Geo Distribution Based on Dual Load: Instead of using native replication, data is loaded individually into each cluster. The illustration in Figure 7 depicts this approach.

Fig. 7. This illustration shows how data is loaded into the shards and then dual-loaded directly into each geo cluster, without relying on native replication techniques.
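Below is a minimal sketch of the loading flow described in this section: split the full data set into shard-specific batches, load each batch into its shard, and update the central mapping last. The function names and the loading callbacks are hypothetical.

```python
# Framework loading flow sketch: split -> load shards -> update mapping.
# load_shard and update_mapping are hypothetical callbacks standing in
# for bulk-load jobs and the central data-mapping store.

from collections import defaultdict

def split_into_shard_batches(records, shard_for_record):
    """Split the full data set into one shard-specific batch per shard."""
    batches = defaultdict(list)
    for record in records:
        batches[shard_for_record(record)].append(record)
    return batches

def load_cluster(batches, load_shard, update_mapping):
    """Load each batch into its shard, then publish the new mapping."""
    for shard, batch in batches.items():
        load_shard(shard, batch)     # e.g., a bulk-load into the shard
    update_mapping(sorted(batches))  # central mapping is updated last

# Toy run: route records by first letter, print instead of loading.
batches = split_into_shard_batches(
    ["Alice", "Harry", "Peter"], lambda r: f"shard{ord(r[0]) % 3 + 1}")
load_cluster(batches,
             lambda s, b: print("load", s, b),
             lambda shards: print("mapping updated for", list(shards)))
```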

V. SHARDING PRACTICE

To solve sharding performance problems such as poor response time or congestion due to high data volume, we use the methods discussed in the previous sections to collect the necessary data and analyze it in order to choose the correct sharding strategy.


TABLE I
DATA AND SHARDING COMPATIBILITY MATRIX

Sharding     Uniform   Group   Random   Distinct
Vertical        X
Range           X*        X       X
Hash Key        X*        X       X
Directory       X*                           X

X* = could be used in combination with Vertical Sharding.

A. Vertical Sharding Equation
Input: Multiple big tables.
Decision: Move the tables into their respective shards.
Data Structure: Uniform Distribution.
Outcome: Reduced overload on a single host.
Scenario: Multiple big tables in a single database with high demand for storage and concurrent queries. The tables are physically separated into individual shards.

B. Range Sharding Equation
Input: Single big table and a custom range.
Decision: Distribute the data set into multiple shards based on the custom range.
Data Structure: Group Distribution.
Outcome: Distributed processing of a single table.
Scenario: Single big table in a single database with high demand for storage and concurrent queries.

C. Hash Key Sharding Equation
Input: Single big table.
Decision: Split the table into multiple shards with a broader range, based on a hash key applied to a value, e.g., Sharding Key = Numeric(Base64(MD5(Value))).
Data Structure: Random Distribution.
Outcome: Reduced overload due to one big table.
Scenario: Single big table in a single database with high demand for storage and concurrent queries.

D. Directory Sharding Equation
Input: Single or multiple big tables.
Decision: Split the tables into multiple shards based on the lookup table.
Data Structure: Distinct Distribution.
Outcome: Reduced overload due to one or more big tables.
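The hash key equation above can be read as a pipeline: MD5 digest, Base64 encoding, numeric reduction, then a modulo to pick the shard. A minimal sketch follows; the shard count is an assumed parameter.

```python
# Sharding Key = Numeric(Base64(MD5(Value))), reduced modulo the number
# of shards. NUM_SHARDS is an assumption for illustration.

import base64
import hashlib

NUM_SHARDS = 3

def sharding_key(value: str) -> int:
    digest = hashlib.md5(value.encode("utf-8")).digest()  # MD5(Value)
    encoded = base64.b64encode(digest)                    # Base64(...)
    numeric = int.from_bytes(encoded, "big")              # Numeric(...)
    return numeric % NUM_SHARDS                           # shard number

print(sharding_key("Peter"))  # deterministic value in 0..NUM_SHARDS-1
```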

VI. DATA RETRIEVAL

The Lookup Reflector is a layered approach used to identify the location of data in the shards, retrieve the data sets, and format them for end-user consumption. The functional areas of the layers are defined as follows:

TABLE II
THE DATA RETRIEVAL LAYERS

Layer 1 - Un-Wrapper: This layer takes the incoming request and validates the request format and data boundaries. It also validates the security credentials.
Layer 2 - Segregator: Based on the request parameters, it checks the sharding keys against the mapping to identify the location of the data set.
Layer 3 - Executioner: This layer executes the queries against the identified shard. It also handles exception conditions.
Layer 4 - Formatter: This layer formats the result set to match the expected response format, e.g., XML, JSON, or a Java bean.
Layer 5 - Responder: This is the final layer, which sends the formatted result set to the requester.
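A minimal sketch of the five layers wired together as a pipeline follows. Each stage is a stub: the request fields, shard map, and query result are hypothetical placeholders, not the paper's implementation.

```python
# The five data-retrieval layers as a pipeline of stubs.

import json

def un_wrapper(request):
    """Layer 1: validate request format, data boundaries, credentials."""
    assert "key" in request and "token" in request
    return request

def segregator(request, shard_map):
    """Layer 2: check the sharding key against the mapping."""
    return shard_map[request["key"][0].upper()]

def executioner(shard, request):
    """Layer 3: execute the query against the identified shard."""
    return {"shard": shard, "key": request["key"]}  # stand-in result set

def formatter(result, fmt="JSON"):
    """Layer 4: format the result set (XML, JSON, Java bean, ...)."""
    return json.dumps(result) if fmt == "JSON" else str(result)

def responder(payload):
    """Layer 5: send the formatted result set back to the requester."""
    print(payload)

shard_map = {"P": "shard3"}
request = un_wrapper({"key": "Peter", "token": "..."})
responder(formatter(executioner(segregator(request, shard_map), request)))
```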

VII. CONCLUSION

Sharding is an art of sculpting your data orientation. There is no right or wrong approach; each approach has its pros and cons. The data can take different forms of distribution, and by combining one or more sharding methods we can achieve the growth, performance, scalability, manageability, and availability needed for a successful sharding deployment. The approach discussed in this paper provides a simple theory, framework, and practical application. Happy sharding.


REFERENCES

[1] D. Tynan, Google: Brace Yourselves for the Data Explosion. Belmont, CA: Wadsworth, August 2010.
[2] Powell and Baker, Management Science, 2nd ed. Hoboken, NJ: John Wiley & Sons, 2009, p. 43.
[3] H. Poor, An Introduction to Signal Detection and Estimation. New York: Springer-Verlag, 1985, ch. 4.
[4] P. Alexander, "An approach to graphs of linear forms," unpublished.
