12 Query Optimization
When the relational model was first launched commercially, one of the major criticisms often cited was inadequate performance of queries. Since then, a significant amount of research has been devoted to developing highly efficient algorithms for processing queries.
There are many ways in which a complex query can be performed, and one of the aims of query processing is to determine which one is the most cost effective. In first generation network and hierarchical database systems, the low-level procedural query language is generally embedded in a high-level programming language such as COBOL, and it is the programmer’s responsibility to select the most appropriate execution strategy. In contrast, with declarative languages such as SQL, the user specifies what data is required rather than how it is to be retrieved. This relieves the user of the responsibility of determining, or even knowing, what constitutes a good execution strategy and makes the language more universally usable. Additionally, giving the DBMS the responsibility for selecting the best strategy prevents users from choosing strategies that are known to be inefficient and gives the DBMS more control over system performance.
There are two main techniques for query optimization, although the two strategies are usually combined in practice. The first technique uses heuristic rules that order the operations in a query. The other technique compares different strategies based on their relative costs and selects the one that minimizes resource usage. Since disk access is slow compared with memory access, disk access tends to be the dominant cost in query processing for a centralized DBMS, and it is the one that we concentrate on exclusively in this chapter when providing cost estimates.
Query Processing – The activities involved in parsing, validating, optimizing, and executing a query.
Query optimization – The activity of choosing an efficient execution strategy for processing a query.
An important aspect of query processing is query optimization. As there are many equivalent transformations of the same high-level query, the aim of query optimization is to choose the one that minimizes resource usage. Generally, we try to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query (Selinger et al., 1979). However, resource usage may also be viewed as the response time of the query, in which case we concentrate on maximizing the number of parallel operations (Valduriez and Gardarin, 1984). Since the problem is computationally intractable with a large number of relations, the strategy adopted is generally reduced to finding a near optimum solution (Ibaraki and Kameda, 1984).
Both methods of query optimization depend on database statistics to evaluate properly the different options that are available. The accuracy and currency of these statistics have a significant bearing on the efficiency of the execution strategy chosen. The statistics cover information about relations, attributes, and indexes. For example, the system catalog may store statistics giving the cardinality of relations, the number of distinct values for each attribute, and the number of levels in a multilevel index.
Keeping the statistics current can be problematic. If the DBMS updates the statistics every time a tuple is inserted, updated, or deleted, this would have a significant impact on performance during peak periods. An alternative, and generally preferable, approach is to update the statistics on a periodic basis, for example nightly, or whenever the system is idle. Another approach taken by some systems is to make it the users’ responsibility to indicate when the statistics are to be updated.
For example, consider the query that finds all Managers who work at a London branch, which can be expressed in SQL as:

SELECT *
FROM Staff, Branch
WHERE Staff.branchNo = Branch.branchNo AND
(position = ‘Manager’ AND city = ‘London’);

Three equivalent relational algebra queries corresponding to this SQL statement are:
(1) σ(position=‘Manager’) ∧ (city=‘London’) ∧ (Staff.branchNo=Branch.branchNo)(Staff × Branch)
(2) σ(position=‘Manager’) ∧ (city=‘London’)(Staff ⋈Staff.branchNo=Branch.branchNo Branch)
(3) (σposition=‘Manager’(Staff)) ⋈Staff.branchNo=Branch.branchNo (σcity=‘London’(Branch))
For the purposes of this example, we assume that there are 1000 tuples in Staff, 50 tuples in Branch, 50 Managers (one for each branch), and 5 London branches. We compare these three queries based on the number of disk accesses required. For simplicity, we assume that there are no indexes or sort keys on either relation, and that the results of any intermediate operations are stored on disk. The cost of the final write is ignored, as it is the same in each case. We further assume that tuples are accessed one at a time (although in practice disk accesses would be based on blocks, which would typically contain several tuples), and main memory is large enough to process entire relations for each relational algebra operation.
The first query calculates the Cartesian product of Staff and Branch, which requires (1000 + 50) disk accesses to read the relations, and creates a relation with (1000 * 50) tuples. We then have to read each of these tuples again to test them against the selection predicate at a cost of another (1000 * 50) disk accesses, giving a total cost of: (1000 + 50) + 2*(1000 * 50) = 101 050 disk accesses
The second query joins Staff and Branch on the branch number branchNo, which again requires (1000 + 50) disk accesses to read each of the relations. We know that the join of the two relations has 1000 tuples, one for each member of staff (a member of staff can only work at one branch). Consequently, the Selection operation requires 1000 disk accesses to read the result of the join, giving a total cost of: 2*1000 + (1000 + 50) = 3050 disk accesses
The final query first reads each Staff tuple to determine the Manager tuples, which requires 1000 disk accesses and produces a relation with 50 tuples. The second Selection operation reads each Branch tuple to determine the London branches, which requires 50 disk accesses and produces a relation with 5 tuples. The final operation is the join of the reduced Staff and Branch relations, which requires (50 + 5) disk accesses, giving a total cost of:
1000 + 2*50 + 5 + (50 + 5) = 1160 disk accesses
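The arithmetic behind these three totals is easy to check. The following sketch (ours, written in Python purely for illustration) recomputes the estimates from the cardinalities assumed above:

# Rough cost model for the three equivalent queries, using the assumptions above:
# tuple-at-a-time disk accesses, intermediate results written to and re-read from
# disk, and the cost of writing the final result ignored.
n_staff, n_branch = 1000, 50        # cardinalities of Staff and Branch
n_managers, n_london = 50, 5        # tuples satisfying each Selection predicate

# (1) Cartesian product, then a Selection over the 1000 * 50 intermediate tuples
cost_1 = (n_staff + n_branch) + 2 * (n_staff * n_branch)

# (2) Join on branchNo (1000 result tuples, one per member of staff), then a Selection
cost_2 = (n_staff + n_branch) + 2 * n_staff

# (3) Selections first, then a Join of the two reduced relations
cost_3 = (n_staff + n_managers) + (n_branch + n_london) + (n_managers + n_london)

print(cost_1, cost_2, cost_3)       # 101050 3050 1160
print(round(cost_1 / cost_3))       # 87 -- the improvement factor discussed next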
Clearly the third option is the best in this case, by a factor of 87:1. If we increased the number of tuples in Staff to 10 000 and the number of branches to 500, the improvement would be by a factor of approximately 870:1. Intuitively, we may have expected this as the Cartesian product and Join operations are much more expensive than the Selection operation, and the third option significantly reduces the size of the relations that are being joined together. We will see shortly that one of the fundamental strategies in query processing is to perform the unary operations, Selection and Projection, as early as possible, thereby reducing the operands of any subsequent binary operations.
There are two choices for when the first three phases of query processing can be carried out. One option is to dynamically carry out decomposition and optimization every time the query is run. The advantage of dynamic query optimization arises from the fact that all information required to select an optimum strategy is up to date. The disadvantages are that the performance of the query is affected because the query has to be parsed, validated, and optimized before it can be executed. Further, it may be necessary to reduce the number of execution strategies to be analyzed to achieve an acceptable overhead, which may have the effect of selecting a less than optimum strategy.
The alternative option is static query optimization, where the query is parsed, validated, and optimized once. This approach is similar to the approach taken by a compiler for a programming language. The advantages of static optimization are that the runtime overhead is removed, and there may be more time available to evaluate a larger number of execution strategies, thereby increasing the chances of finding a more optimum strategy. For queries that are executed many times, taking some additional time to find a more optimum plan may prove to be highly beneficial. The disadvantages arise from the fact that the execution strategy that is chosen as being optimal when the query is compiled may no longer be optimal when the query is run. However, a hybrid approach could be used to overcome this disadvantage, where the query is re-optimized if the system detects that the database statistics have changed significantly since the query was last compiled.
Alternatively, the system could compile the query for the first execution in each session, and then cache the optimum plan for the remainder of the session, so the cost is spread across the entire DBMS session.
Query decomposition is the first phase of query processing. The aims of query decomposition are to transform a high-level query into a relational algebra query, and to check that the query is syntactically and semantically correct. The typical stages of query decomposition are analysis, normalization, semantic analysis, simplification, and query restructuring.
(1) Analysis
In this stage, the query is lexically and syntactically analyzed using the techniques of programming language compilers (see, for example, Aho and Ullman, 1977). In addition, this stage verifies that the relations and attributes specified in the query are defined in the system catalog. It also verifies that any operations applied to database objects are appropriate for the object type. For example, consider the following query:
SELECT staffNumber FROM Staff WHERE position > 10;
This query would be rejected on two grounds:
(1) In the select list, the attribute staffNumber is not defined for the Staff relation (should be staffNo).
(2) In the WHERE clause, the comparison ‘>10’ is incompatible with the data type of position, which is a variable-length character string.
On completion of this stage, the high-level query has been transformed into some internal representation that is more suitable for processing. The internal form that is typically chosen is some kind of query tree, which is constructed as follows (a minimal sketch of one possible representation is given after the list):
- A leaf node is created for each base relation in the query.
- A non-leaf node is created for each intermediate relation produced by a relational algebra operation.
- The root of the tree represents the result of the query.
- The sequence of operations is directed from the leaves to the root.
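As an illustration only, a query tree of the kind just described could be represented with a structure along the following lines; the node classes and the tree for query (3) from the earlier example are our own sketch, not a prescribed internal format:

from dataclasses import dataclass

# Leaf nodes hold base relations; non-leaf nodes hold a relational algebra
# operation applied to one or two child subtrees; the root is the query result.
@dataclass
class Relation:          # leaf node: a base relation
    name: str

@dataclass
class Select:            # unary node: Selection with a predicate
    predicate: str
    child: object

@dataclass
class Join:              # binary node: Theta join on a join condition
    condition: str
    left: object
    right: object

# Query (3) from the earlier example: a Selection on each relation, then a Join.
tree = Join(
    condition="Staff.branchNo = Branch.branchNo",
    left=Select("position = 'Manager'", Relation("Staff")),
    right=Select("city = 'London'", Relation("Branch")),
)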
(2) Normalization
The normalization stage of query processing converts the query into a normalized form that can be more easily manipulated. The predicate (in SQL, the WHERE condition), which may be arbitrarily complex, can be converted into one of two forms by applying a few transformation rules (Jarke and Koch, 1984):
- Conjunctive normal form A sequence of conjuncts that are connected with the ∧ (AND) operator. Each conjunct contains one or more terms connected by the ∨ (OR) operator. For example:

(position = ‘Manager’ ∨ salary > 20000) ∧ branchNo = ‘B003’

A conjunctive selection contains only those tuples that satisfy all conjuncts.
- Disjunctive normal form A sequence of disjuncts that are connected with the ∨ (OR) operator. Each disjunct contains one or more terms connected by the ∧ (AND) operator.
For example, we could rewrite the above conjunctive normal form as:

(position = ‘Manager’ ∧ branchNo = ‘B003’) ∨ (salary > 20000 ∧ branchNo = ‘B003’)

A disjunctive selection contains those tuples formed by the union of all tuples that satisfy the disjuncts.
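To make the two normal forms concrete, a symbolic-logic library can perform the conversions mechanically. The sketch below uses sympy (our choice for illustration; a DBMS applies equivalent rewrite rules internally), with the three terms of the example predicate represented as propositional symbols:

from sympy import symbols
from sympy.logic.boolalg import to_cnf, to_dnf

# p: position = 'Manager', q: salary > 20000, r: branchNo = 'B003'
p, q, r = symbols('p q r')

predicate = (p | q) & r            # the conjunctive normal form shown above

print(to_cnf(predicate))           # (p | q) & r -- already in CNF
print(to_dnf(predicate))           # (p & r) | (q & r) -- the disjunctive rewrite shown above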
(3) Semantic analysis
The objective of semantic analysis is to reject normalized queries that are incorrectly formulated or contradictory. A query is incorrectly formulated if components do not contribute to the generation of the result, which may happen if some join specifications are missing. A query is contradictory if its predicate cannot be satisfied by any tuple. For example, the predicate (position = ‘Manager’ ∧ position = ‘Assistant’) on the Staff relation is contradictory, as a member of staff cannot be both a Manager and an Assistant simultaneously. However, the predicate ((position = ‘Manager’ ∧ position = ‘Assistant’) ∨ salary > 20000) could be simplified to (salary > 20000) by interpreting the contradictory clause as the boolean value FALSE. Unfortunately, the handling of contradictory clauses is not consistent between DBMSs.
Algorithms to determine correctness exist only for the subset of queries that do not contain disjunction and negation. For these queries, we could apply the following checks:
(1) Construct a relation connection graph (Wong and Youssefi, 1976). If the graph is not connected, the query is incorrectly formulated. To construct a relation connection graph, we create a node for each relation and a node for the result. We then create edges between two nodes that represent a join, and edges between nodes that represent the source of Projection operations.
(2) Construct a normalized attribute connection graph (Rosenkrantz and Hunt, 1980).
If the graph has a cycle for which the valuation sum is negative, the query is contradictory. To construct a normalized attribute connection graph, we create a node for each reference to an attribute, or constant 0. We then create a directed edge between nodes that represent a join, and a directed edge between an attribute node and a constant 0 node that represents a Selection operation. Next, we weight the edges a → b with the value c, if it represents the inequality condition (a ≤ b + c), and weight the edges 0 → a with the value −c, if it represents the inequality condition (a ≥ c).
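The contradiction test in step (2) amounts to detecting a cycle whose total weight is negative. One standard way to do this is Bellman–Ford-style relaxation; the sketch below is our own illustration, with the graph supplied as a list of weighted directed edges:

def has_negative_cycle(nodes, edges):
    # edges: list of (u, v, weight) for directed edges u -> v.
    # Returns True if some cycle has negative total weight, i.e. the
    # corresponding set of inequality conditions cannot all be satisfied.
    dist = {n: 0 for n in nodes}              # start all distances at 0
    for _ in range(len(nodes) - 1):           # standard Bellman-Ford relaxation
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # if any edge can still be relaxed, a negative-weight cycle exists
    return any(dist[u] + w < dist[v] for u, v, w in edges)

# Hypothetical predicate (a <= b + 1) AND (b <= a - 3): the edges a -> b (weight 1)
# and b -> a (weight -3) form a cycle of weight -2, so the predicate is contradictory.
print(has_negative_cycle(['a', 'b'], [('a', 'b', 1), ('b', 'a', -3)]))   # True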
(4) Simplification
The objectives of the simplification stage are to detect redundant qualifications, eliminate common subexpressions, and transform the query to a semantically equivalent but more easily and efficiently computed form. Typically, access restrictions, view definitions, and integrity constraints are considered at this stage, some of which may also introduce redundancy. If the user does not have the appropriate access to all the components of the query, the query must be rejected. Assuming that the user has the appropriate access privileges, an initial optimization is to apply the well-known idempotency rules of boolean algebra, such as:
p ∧ (p) ≡ p                  p ∨ (p) ≡ p
p ∧ false ≡ false            p ∨ false ≡ p
p ∧ true ≡ p                 p ∨ true ≡ true
p ∧ (~p) ≡ false             p ∨ (~p) ≡ true
p ∧ (p ∨ q) ≡ p              p ∨ (p ∧ q) ≡ p
For example, consider the following view definition and query on the view:
CREATE VIEW Staff3 AS
SELECT *
FROM Staff
WHERE branchNo = ‘B003’;

SELECT staffNo, fName, lName, salary, branchNo
FROM Staff3
WHERE (branchNo = ‘B003’ AND salary > 20000);
As discussed in Section 6.4.3, during view resolution this query will become:
SELECT staffNo, fName, lName, salary, branchNo FROM Staff WHERE (branchNo = ‘B003’ AND salary > 20000) AND branchNo = ‘B003’;
and the WHERE condition reduces to (branchNo = ‘B003’ AND salary > 20000). Integrity constraints may also be applied to help simplify queries. For example, consider the following integrity constraint, which ensures that only Managers have a salary greater than Rs.20,000:
CREATE ASSERTION OnlyManagerSalaryHigh
CHECK ((position <> ‘Manager’ AND salary < 20000)
OR (position = ‘Manager’ AND salary > 20000));

and consider the effect on the query:
SELECT * FROM Staff WHERE (position = ‘Manager’ AND salary < 15000);
The predicate in the WHERE clause, which searches for a manager with a salary below Rs.15,000, is now a contradiction of the integrity constraint so there can be no tuples that satisfy this predicate.
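One way to picture this check is as a satisfiability test on the query predicate conjoined with the constraint. The following sketch is purely illustrative: it uses the z3 solver and models position = ‘Manager’ as a boolean for simplicity, whereas a real DBMS performs this reasoning in its own semantic analyzer:

from z3 import Int, Bool, Solver, And, Or, Not, unsat

salary = Int('salary')
is_manager = Bool('is_manager')      # stands in for position = 'Manager'

# The OnlyManagerSalaryHigh assertion, as stated above
constraint = Or(And(Not(is_manager), salary < 20000),
                And(is_manager, salary > 20000))

# The query predicate: position = 'Manager' AND salary < 15000
query = And(is_manager, salary < 15000)

s = Solver()
s.add(constraint, query)
print(s.check() == unsat)            # True: no tuple can satisfy the query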
(5) Query restructuring
In the final stage of query decomposition, the query is restructured to provide a more efficient implementation.
Heuristic Approach to Query Optimization
Transformation Rules for the Relational Algebra Operations
(1) Conjunctive Selection operations can cascade into individual Selection operations (and vice versa).
(2) Commutativity of Selection operations.
(3) In a sequence of Projection operations, only the last in the sequence is required.
(4) Commutativity of Selection and Projection.
(5) Commutativity of Theta join (and Cartesian product).
(6) Commutativity of Selection and Theta join (or Cartesian product).
(7) Commutativity of Projection and Theta join (or Cartesian product).
(8) Commutativity of Union and Intersection (but not Set difference).
(9) Commutativity of Selection and set operations (Union, Intersection, and Set difference).
(10) Commutativity of Projection and Union.
(11) Associativity of Theta join (and Cartesian product).
(12) Associativity of Union and Intersection (but not Set difference).
Heuristical Processing Strategies
(1) Perform Selection operations as early as possible (a sketch illustrating this on a simple query tree follows the list).
(2) Combine the Cartesian product with a subsequent Selection operation whose predicate represents a join condition into a Join operation.
(3) Use associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations are executed first.
(4) Perform Projection operations as early as possible.
(5) Compute common expressions once.
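To make strategy (1) a little more concrete, the following sketch (our own, with a deliberately simplified tuple-based representation of the query tree and of attribute bookkeeping) rewrites a Selection over a Join into a Join whose left or right operand has already been restricted, whenever the predicate refers to attributes of only one operand:

# Query tree as nested tuples:
#   ('rel', name, attrs) | ('select', pred_attrs, pred, child) | ('join', cond, left, right)
def attributes(node):
    if node[0] == 'rel':
        return node[2]
    if node[0] == 'select':
        return attributes(node[3])
    return attributes(node[2]) | attributes(node[3])     # join: union of both sides

def push_selection_down(node):
    # Rewrite sigma_p(R join S) as (sigma_p(R)) join S when p mentions only R's attributes
    if node[0] == 'select' and node[3][0] == 'join':
        pred_attrs, pred = node[1], node[2]
        _, cond, left, right = node[3]
        if pred_attrs <= attributes(left):
            return ('join', cond, ('select', pred_attrs, pred, left), right)
        if pred_attrs <= attributes(right):
            return ('join', cond, left, ('select', pred_attrs, pred, right))
    return node

staff = ('rel', 'Staff', {'staffNo', 'position', 'branchNo'})
branch = ('rel', 'Branch', {'branchNo', 'city'})
# sigma position='Manager'(Staff join Branch) becomes (sigma position='Manager'(Staff)) join Branch
query = ('select', {'position'}, "position = 'Manager'",
         ('join', "Staff.branchNo = Branch.branchNo", staff, branch))
print(push_selection_down(query))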
Cost Estimation for the Relational Algebra Operations
A DBMS may have many different ways of implementing the relational algebra operations. The aim of query optimization is to choose the most efficient one. To do this, it uses formulae that estimate the costs for a number of options and selects the one with the lowest cost. In this section we examine the different options available for implementing the main relational algebra operations. For each one, we provide an overview of the implementation and give an estimated cost. As the dominant cost in query processing is usually that of disk accesses, which are slow compared with memory accesses, we concentrate exclusively on the cost of disk accesses in the estimates provided. Each estimate represents the required number of disk block accesses, excluding the cost of writing the result relation.
Many of the cost estimates are based on the cardinality of the relation. Therefore, as we need to be able to estimate the cardinality of intermediate relations, we also show some typical estimates that can be derived for such cardinalities. We start this section by examining the types of statistics that the DBMS will store in the system catalog to help with cost estimation.
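As a small illustration of how such statistics feed the estimates, the classic uniform-distribution estimate for an equality Selection, and the block count used to cost a linear search, might be computed as follows (the names nTuples, bFactor, and nDistinct are our own shorthand for the usual catalog statistics):

import math

# nTuples: cardinality of relation R; bFactor: blocking factor (tuples per block);
# nDistinct: number of distinct values of attribute A in R.
def n_blocks(n_tuples, b_factor):
    return math.ceil(n_tuples / b_factor)                 # blocks occupied by R

def eq_selection_cardinality(n_tuples, n_distinct):
    # Uniform-distribution assumption: each value of A is equally likely, so an
    # equality predicate A = c is expected to select about nTuples/nDistinct tuples.
    return math.ceil(n_tuples / n_distinct)

# Example: Staff with 1000 tuples, 30 tuples per block, 10 distinct positions.
print(n_blocks(1000, 30))                     # 34 blocks: cost of a linear search of Staff
print(eq_selection_cardinality(1000, 10))     # 100 tuples expected for position = c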
Query optimization is a well-researched field and a number of alternative approaches to the System R dynamic programming algorithm have been proposed. For example, Simulated Annealing searches a graph whose nodes are all alternative execution strategies (the approach models the annealing process by which crystals are grown, by first heating the containing fluid and then allowing it to cool slowly). Each node has an associated cost and the goal of the algorithm is to find a node with a globally minimum cost. A move from one node to another is deemed to be downhill (uphill) if the cost of the source node is higher (lower) than the cost of the destination node. A node is a local minimum if, in all paths starting at that node, any downhill move comes after at least one uphill move. A node is a global minimum if it has the lowest cost among all nodes. The algorithm performs a continuous random walk, accepting downhill moves always and uphill moves with some probability, trying to avoid a high-cost local minimum. This probability decreases as time progresses and eventually becomes zero, at which point the search stops and the node with the lowest cost visited is returned as the optimal execution strategy. The interested reader is referred to Kirkpatrick et al. (1983) and Ioannidis and Wong (1987).
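A toy version of the random walk just described might look as follows; the cost function, neighbour generator, and cooling schedule here are placeholders chosen for illustration, not the particular formulation used by Ioannidis and Wong (1987):

import math
import random

def simulated_annealing(start, neighbours, cost, temp=100.0, cooling=0.95, inner=20):
    # Random walk over execution strategies: always accept downhill moves, accept
    # uphill moves with a probability that shrinks as the temperature falls, and
    # remember the cheapest strategy visited.
    current = best = start
    while temp > 1e-3:                        # stop once uphill moves become (almost) impossible
        for _ in range(inner):
            candidate = random.choice(neighbours(current))
            delta = cost(candidate) - cost(current)
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                current = candidate
                if cost(current) < cost(best):
                    best = current
        temp *= cooling                       # cooling schedule: lower the temperature
    return best

# Toy usage: "strategies" are join orders (permutations); the cost is a stand-in.
def neighbours(order):
    swaps = []
    for i in range(len(order) - 1):
        s = list(order)
        s[i], s[i + 1] = s[i + 1], s[i]
        swaps.append(tuple(s))
    return swaps

cost = lambda order: sum(i * size for i, size in enumerate(order))   # placeholder cost model
print(simulated_annealing((1, 2, 3, 4), neighbours, cost))           # tends towards (4, 3, 2, 1)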