27 Parallel Databases
Dr R. Baskaran
CENTRALIZED DATABASES
In a centralized database management system, all data is maintained at a single site, and the processing of individual transactions is assumed to be essentially sequential. One of the most important trends in databases is the increased use of parallel evaluation techniques and data distribution.
PARALLEL DATABASES – Introduction
A parallel database system seeks to improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries. Although data may be stored in a distributed fashion in such a system, the distribution is governed solely by performance considerations.
While parallelism is motivated by performance considerations, several distinct issues motivate data distribution:
- Increased Availability: If a site containing a relation goes down, the relation continues to be available if a copy is maintained at another site.
- Distributed Access to Data.
- Analysis of Distributed Data.
ARCHITECTURE FOR PARALLEL DATABASES
In a shared-memory architecture, multiple CPUs are attached to an interconnection network and can access a common region of main memory. Because this architecture is closest to a conventional machine, many commercial database systems have been ported to shared-memory platforms with relative ease. Communication overhead is low, because main memory can be used for this purpose, and operating system services can be leveraged to utilize the additional CPUs.
In a shared-memory system, memory contention becomes a bottleneck as the number of CPUs increases. In the shared-disk architecture, each CPU has a private memory and direct access to all disks through an interconnection network. Shared disk suffers from the same interference problem as shared memory, because large amounts of data are shipped through the interconnection network.
- The basic problem with the shared-memory and shared-disk architectures is interference.
- As more CPUs are added, existing CPUs are slowed down because of increased contention for memory accesses and network bandwidth.
- Even an average 1 percent slowdown per additional CPU means that the maximum speed-up is a factor of 37, and adding further CPUs actually slows the system down (the short calculation under Speed-up and Scale-up below makes this concrete).
- A system with 1,000 CPUs is only about 4 percent as effective as a single-CPU system.
- This observation motivated the development of the shared-nothing architecture.
- The shared-nothing architecture is widely considered the best choice for large parallel database systems.
- Each CPU has local main memory and disk space, but no two CPUs can access the same storage area.
- All communication between CPUs is through a network connection.
- Shared nothing requires extensive reorganization of DBMS code.
- It provides linear speed-up: the time taken for operations decreases in proportion to the increase in the number of CPUs and disks.
- It provides linear scale-up: performance is sustained if the number of CPUs and disks is increased in proportion to the amount of data.
SPEED-UP and SCALE-UP
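To see where the factor of 37 and the 4 percent figure come from, here is a minimal sketch (an illustration, not from the source), assuming the simple model in which each additional CPU imposes an average 1 percent slowdown on every CPU, so n CPUs deliver an effective speed-up of n × 0.99^(n−1):

```python
def effective_speedup(n_cpus: int, slowdown: float = 0.01) -> float:
    # Assumed model: each additional CPU slows every CPU by an average
    # of 1 percent, so n CPUs deliver n * 0.99**(n - 1) times the
    # throughput of a single CPU.
    return n_cpus * (1.0 - slowdown) ** (n_cpus - 1)

best = max(range(1, 2001), key=effective_speedup)
print(f"peak speed-up: {effective_speedup(best):.1f}x at {best} CPUs")  # ~37x, near 100 CPUs
print(f"1000 CPUs: {effective_speedup(1000):.3f}x a single CPU")        # ~0.044x, about 4 percent
```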
PIPELINED PARALLELISM
- A relational query execution plan is a graph of relational algebra operators, and the operators in a graph can be executed in parallel.
- If one operator consumes the output of a second operator, we have pipelined parallelism.
- If not, the two operators can proceed essentially independently.
- Pipelined parallelism is limited by the presence of blocking operators, that is, operators (such as sorting or aggregation) that produce no output until they have consumed all their input (see the sketch after this list).
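As an illustration (not from the source), a minimal sketch using Python generators: the scan and selection operators are pipelined, passing each tuple downstream as soon as it is produced, while the sort operator blocks because it cannot emit anything until it has seen its entire input.

```python
def scan(table):
    # Pipelined: yields tuples one at a time, so downstream operators
    # can start before the scan finishes.
    yield from table

def select(rows, predicate):
    # Pipelined: each qualifying tuple is passed on immediately.
    for row in rows:
        if predicate(row):
            yield row

def sort(rows, key):
    # Blocking: sorted() must consume the whole input before any
    # output can be produced, which limits pipelined parallelism.
    yield from sorted(rows, key=key)

employees = [("Ann", 31), ("Bob", 19), ("Cho", 24)]
plan = sort(select(scan(employees), lambda r: r[1] >= 20), key=lambda r: r[1])
print(list(plan))  # [('Cho', 24), ('Ann', 31)]
```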
Data-Partitioned Parallel Evaluation
- We can evaluate each individual operator in a query plan in a parallel fashion.
- The key to evaluating an operator in parallel is to partition the input data; we can then work on each partition in parallel and combine the results.
- This approach is called data-partitioned parallel evaluation (a sketch follows this list).
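A minimal sketch of the idea, assuming a simple selection operator and Python's standard multiprocessing module (the names parallel_select and evaluate_partition are illustrative, not from the source): the input is partitioned, each partition is evaluated by a separate worker, and the partial results are combined.

```python
from multiprocessing import Pool

def evaluate_partition(partition):
    # Apply the operator (here, a selection age = 20) to one partition.
    return [row for row in partition if row["age"] == 20]

def parallel_select(rows, n_workers=4):
    # Partition the input, evaluate each partition in parallel,
    # then combine the per-partition results.
    partitions = [rows[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(evaluate_partition, partitions)
    return [row for partial in partials for row in partial]

if __name__ == "__main__":
    data = [{"name": f"emp{i}", "age": 18 + i % 5} for i in range(20)]
    print(parallel_select(data))  # the four tuples with age = 20
```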
Data Partitioning
- Partitioning a large data set horizontally across several disks enables us to exploit the I/O bandwidth of the disks by reading and writing them in parallel.
- We can assign tuples to processors in a round-robin fashion, we can use hashing, or we can assign tuples to processors by ranges of field values (all three schemes are sketched in code after this list).
- If there are n processors, the i’th tuple is assigned to processor i mod n in round-robin partitioning.
- In hash partitioning, a hash function is applied to a tuple to determine its processor.
- In range partitioning, tuples are sorted (conceptually), and n ranges are chosen for the sort key values so that each range contains roughly the same number of tuples; tuples in range i are assigned to processor i.
- Round-robin partitioning is suitable for efficiently evaluating queries that access the entire relation.
- If only a subset of the tuples (e.g., those that satisfy the selection condition age = 20) is required, hash partitioning and range partitioning are better than round-robin partitioning because they enable us to access only those disks that contain matching tuples.
- If range selections such as 15 < age < 25 are specified, range partitioning is superior.
- Range partitioning can lead to data skew; that is, partitions with widely varying numbers of tuples across processors or disks.
- Skew causes processors dealing with large partitions to become performance bottlenecks.
- Hash partitioning has the additional virtue that it keeps data evenly distributed even if the data grows and shrinks over time.
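A minimal sketch of the three partitioning schemes (the function names and the age-based key are illustrative assumptions, not from the source; a real system would also use a hash function that is stable across runs):

```python
import bisect

def round_robin_partition(tuples, n):
    # The i-th tuple is assigned to processor i mod n.
    return [tuples[i::n] for i in range(n)]

def hash_partition(tuples, n, key):
    # A hash of the partitioning attribute selects the processor.
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)
    return parts

def range_partition(tuples, boundaries, key):
    # boundaries holds pre-chosen split points on the sort key; a tuple
    # whose key falls in range i goes to processor i.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        parts[bisect.bisect_left(boundaries, key(t))].append(t)
    return parts

ages = [20, 35, 42, 19, 27, 51, 33, 24]
employees = [(f"emp{i}", age) for i, age in enumerate(ages)]
print(round_robin_partition(employees, 3))
print(hash_partition(employees, 3, key=lambda t: t[1]))
print(range_partition(employees, boundaries=[25, 40], key=lambda t: t[1]))
```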
SUMMARY
- Conventional vs Parallel Database Systems.
- Architecture of Parallel Systems
  - Shared Memory
  - Shared Disk
  - Shared Nothing
- Data Partitioning Techniques
  - Round-Robin Partitioning
  - Hash Partitioning
  - Range Partitioning