8 Data Manipulation Transparencies

In all the queries we’ve seen so far, the rows in the results table have not been ordered in any way. SQL just retrieved the rows in the order in which it found them in the table. The ORDER BY clause allows you to impose an order on the query results.

 

You can use ORDER BY with one or more column names to specify the ordering of the query results. For example, to list student’s records in alphabetical order by surname:

 

SELECT * FROM STUDENTS ORDER BY SURNAME ;

 

 

The ORDER BY clause only affects the manner in which these rows are displayed by SQL. If there are NULL values in the ORDER BY column then they appear either at the beginning or at the end of the list depending on your dialect of SQL. This query listed the student’s rows alphabetically by SURNAME, in ascending order. This is the default. We can explicitly specify the ordering by using the ASC (for ascending) and the DESC (for descending) keywords. If we had used DESC in the previous query:

 

SELECT * FROM STUDENTS d ORDER BY SURNAME DESC ;

SURNAME FIRST_NAME D_O_B STUDENT_NO DEPT_NO YEAR

 

 

The students are now listed in reverse alphabetical order. Note that ASC is optional. If neither DESC OR ASC is specified then ASC is assumed to be in effect. You can use ORDER BY with more than one column. In this case, SQL will use the first column as the primary ordering field, the second column as the secondary and so on. In our STUDENTS table for example, to list the student’s records by departments and within each department by surname:

 

SELECT * FROM STUDENTS ORDER BY DEPT_NO, SURNAME ;

 

 

Notice that the rows in the results table are now ordered by the DEPT_NO field. This is the primary ordering field.Within each department, the students are displayed in alphabetical order by SURNAME. This is the secondary ordering field. Although you can use as many ordering fields as you like in the ORDER BY clause, the ANSI/ISO standard requires that the columns used in the ORDER BY clause are also displayed in the results table. This means that they must be specified in the SELECT clause, either explicitly by name, or implicitly by using the asterisk. This ANSI/ISO requirement is not enforced by all SQL dialects but it is a good idea to adhere to it anyway for portability reasons. You have seen how to order results rows by using column names in the ORDER BY clause. What if you don’t know what the column name is? Such situations are not as remote as you might think. For example calculated columns and aggregate functions cannot be referred to by their column name. To overcome this, ORDER BY also accepts column number values. For example, we can list the contents of the DEPARTMENTS table in allocated budget order either by specifying ORDER BY BUDGET or by specifying the column number:

 

SELECT DEPT_NO, DEPT_NAME, BUDGET

FROM DEPARTMENTS

ORDER BY 3 ;

 

The first column specified in the SELECT clause is always column 1. Subsequent columns have numeric values according to where they are specified in SELECT and not where they occur in the table itself. This applies to calculated columns as well:

 

SELECT DEPT_NAME, (BUDGET * 2.25)

FROM DEPARTMENTS

ORDER BY 2 ;

DEPT_NAME

——————– ———–

Medicine                        775687500

Physical Sciences         526500000

Industrial Law               877500

Management Studies   28237500

Arts & Humanities        8471250

Engineering                  65025000

 

The rows in a table are elemental pieces of information that you can use to base your decisions on. Very often, the data that you need can be found directly in one or more columns. But sometimes, the data is based on the values of all the rows in the table.

 

For example, if you need to know the average mark in the exams table, you must add up the marks for all the students, then divide that value by the number of students in the table. ANSI/ISO SQL provides five functions, known as aggregate functions which can be used to summarize data in tables. These functions operate on the table data and produce a single value as output.

The five ANSI/ISO functions are:

 

COUNT() outputs the number of rows or column values that would be selected by the query. The function does not actually list any of the rows, but only a value denoting the total number of rows or column values that the query retrieves.

 

SUM() outputs the sum total of all the column values that are addressed by the query.This function can only be used with numeric type columns.

 

VG() outputs the average (arithmetic mean) of the column values addressed by the query. As with the SUM() function, AVG() can only be used with numeric type columns.

 

MIN() outputs the minimum, the smallest, column value from those that are addressed by the query.

 

MAX() outputs the maximum, the largest, column value from those that are addressed by the query.

 

Aggregate functions can be used in the select list just like regular columns with the following provisions: You cannot nest aggregate functions and you cannot mix regular columns and aggregate functions in the same query.

 

There are two different versions of the COUNT() aggregate function that ANSI/ISO allows. The first counts and lists the number of non-NULL values in a particular column.

 

The second counts and displays the total number of rows that would be retrieved by a query. These two versions of COUNT() differ only in the arguments that are passed to them. Let’s use COUNT() to count the number of data values in a column. To find out how many students have been assigned to a department in the STUDENTS table:

 

SELECT COUNT(DEPT_NO)

FROM STUDENTS;

COUNT(DEPT_NO)

————–

  In our case, all the students are assigned a department number and the number output by the query is the same as the number of students there are in the table. If this were not the case, ie. if there were NULL values in the DEPT_NO field for some of the student’s rows, then these rows would not appear in the COUNT() function’s total. To count the number of different values in a column, the column name must be preceded by the DISTINCT keyword. For example, to look at the number of different departments that are represented in the DEPT_NO field of the STUDENTS table:

   SELECT COUNT(DISTINCT DEPT_NO)

FROM STUDENTS;

COUNT(DEPT_NO)

————–

 

The output from this query is 5 because there are five different department number values in this column. The ANSI/ISO standard states that DISTINCT must be used with column names in the COUNT() function, most commercial versions of SQL relax this requirement and leave it up to the user to use DISTINCT or not.

 

As pointed out earlier, the COUNT() function can also be used to count rows in a table as well as column values. To do this, COUNT() must be used with an asterisk. To count the number of rows in the EXAMS table:

 

SELECT COUNT(*)

FROM EXAMS;

COUNT(*)

——–

19

 

The COUNT(*) total includes all the rows addressed by the query, including NULL and duplicate rows. If we are only interested in knowing the number of exams taken by a particular student, we would have to use the WHERE clause to retrieve those rows that

 

we are interested in:

SELECT COUNT(*)

FROM EXAMS

WHERE STUDENT_NO = 1 ;

COUNT(*)

———

3

 

The SUM() aggregate function calculates the sum total of the values in a column. The parameter passed to SUM() must be the name of the column either by itself or used in a scalar expression. The data in the columns used by SUM() must be numeric of type such as integer, decimal etc. Let’s use SUM() to find the total expenditure on staff pay:

 

SELECT SUM(PAY)

FROM LECTURERS ;

SUM(PAY)

——–

261270

 

This query adds up all the values in the PAY column and lists the final total. The output of SUM() (and also the other aggregate functions that deal with numeric type data) is usually of the same data type as the column data but sometimes, the result is of greater precision then the column data.

 

You can use scalar expressions as parameters to the aggregate functions. The following query adds 1500 to each lecturer’s pay and calculates the sum total:

 

SELECT SUM(PAY), SUM(PAY + 1500)

FROM LECTURERS ;

SUM(PAY) SUM(PAY+1500)

———- ————-

261270 270270

 

In this simple example, we could have calculated this value by adding 1500 x 6 = 9000 to the SUM(PAY) value. Scalar expressions are most useful when you want to look at say, the total expenditure on pay for a percentage increase in salary for each lecturer.

 

For example, this query finds the total expenditure on pay if we increase each lecturer’s salary by 7.5%:

 

SELECT SUM(PAY), SUM(PAY * 1.075)

FROM LECTURERS ;

SUM(PAY) SUM(PAY*1.075)

———– ————–

261270 280865

 

The AVG() function calculates the average or arithmetic mean of the values in a column. AVG() can only be applied to numeric type columns and outputs a numeric value. SQL calculates the average by adding up all the values in the column, then dividing the total by the number of values. As an example, the following query calculates the average pay for a lecturer:

 

SELECT AVG(PAY)

FROM LECTURERS ;

AVG(PAY)

———

43545

 

We can also selectively calculate averages. This query finds the average mark obtained by students in a particular subject:

 

SELECT AVG(MARK) FROM EXAMS WHERE SUB_NO = 5 ;AVG(MARK)

———

55

 

The MIN() function finds the smallest value in a column of data.

 

MIN() can operate on string and numeric data types as well as non-ANSI types such as date and time. For example, to find the earliest date when a lecturer joined the staff:

SELECT MIN(JOINED)

FROM LECTURERS ;

MIN(JOINED)

————-

03-28-1960

 

Most dialects of SQL treat earlier dates and times as being less than later dates and times. So to find the last date when a lecturer joined the staff:

 

SELECT MAX(JOINED)

FROM LECTURERS ;

MAX(JOINED)

————

03-25-1990

 

MIN() and MAX() both allow you to use scalar expressions as well as column names as parameters. For example, if the average pass mark for all subjects was found to be 58%, then this query finds the lowest difference in percentage points between this mark and the exam marks:

 

SELECT MIN(58 – MARK)

FROM EXAMS ;

MIN(58 – MARK)

————–

-31

 

The query comes up with the answer of -31 because the highest mark in EXAMS is 89 and 58 – 89 = -31. This result may not be what you expected and serves to illustrate an important point. You need to be careful when wording your queries to ensure that they do what you intend them to do.

 

The aggregate functions described in the previous section have been used to produce grand totals. Values output by them are just like the totals that appear at the end of each column listing in a report. You can also use these functions to output sub-total values. The GROUP BY clause of the SELECT statement lets you split up the values in a column into subsets. The aggregate functions are then applied to these subsets instead of the column as a whole. For example, in the EXAMS table, we could find theaverage mark obtained by the students by:

 

SELECT AVG(MARK)

FROM EXAMS ;

AVG(MARK)

———

55

 

This value is not very informative as the exams were sat by students of all abilities. It would be more meaningful to get the average mark for each student. This can be obtained by using the GROUP BY clause:

 

SELECT STUDENT_NO, AVG(MARK)

FROM EXAMS

GROUP BY STUDENT_NO ;

 

his query first groups the rows in the EXAMS table by the values in STUDENT_NO. The AVG() function then operates on each group. The average values output are thus the averages for the exams taken by individual students.

 

Queries using the GROUP BY clause are known as grouped queries.

 

All the rules for using the ANSI/ISO functions that we have looked at also apply to grouped queries.

 

The only difference being that in grouped queries, the DBMS applies the functions to each group individually rather than to the column as a whole. You can also get the same results by running several queries with

a WHERE clause. For example, to find the average mark for a student:

 

SELECT STUDENT_NO, AVG(MARK)

FROM EXAMS

WHERE STUDENT_NO = 1 ;

STUDENT_NO AVG(MARK)

———– ———

1 65

 

By changing the 1 value in the predicate, we could calculate the average for different students.

 

GROUP BY can be used with multiple fields. For example, in the SUBJECTS table, to find the highest pass mark for each department/credits combination:

 

SELECT SUB_NAME, DEPT_NO, CREDITS, MAX(PASS)

FROM SUBJECTS

GROUP BY DEPT_NO, CREDITS ;

SUB_NAME DEPT_NO CREDITS MAX(PASS)

—————- ——- ——- ———

Engineering Drwg 1 1 71

Mathematics 1 2 65

Electronics 1 3 71

English Lit 2 1 60

Basic Accounts 3 1 67

Marketing 3 2 56

Industrial Law 4 2 52

Organic Chemistry 5 3 57

Anatomy 6 1 74

Physiology 6 3 78

 

You cannot use aggregate functions in the WHERE clause of a SELECT statement. This means that you cannot use WHERE to selectively eliminate data that does not interest you from the results of aggregate queries. For example, in the query that we used to find the average mark for each student, if we are only interested in averages that are above 56%, then SQL won’t let you use the following query because it uses AVG() in the WHERE clause:

 

SELECT STUDENT_NO, AVG(MARK)

FROM EXAMS

WHERE AVG(MARK) > 56

GROUP BY STUDENT_NO ;

 

Error 67: Aggregate function used in WHERE.

 

The HAVING clause performs a similar function to WHERE in that it eliminates groups from the results table. Thus to list only those students where the average is above 56%:

 

SELECT STUDENT_NO, AVG(MARK)

FROM EXAMS

GROUP BY STUDENT_NO

HAVING AVG(MARK) > 56 ;

STUDENT_NO AVG(MARK)

———– ———-

1  62

3  70

6  74

8  62

 

The field referenced by HAVING can not have more than one value for each group. This means that in practice HAVING can only reference aggregate functions and columns that are used in GROUP BY.

 

So far, we’ve been looking at queries that retrieve data from single table at a time. Single table queries are useful but they do not exploit the full power of the SQL language. SQL is a relational database query language and as such, one of it’s most important features is it’s ability to retrieve information from several different related

 

tables. In relational database terms, this process is called a join. The tables to be joined are named in the FROM clause of the SELECT with each table name separated by a comma. The relationships between the tables in a join are defined by the predicate in

 

the WHERE clause. The predicate can refer to any column from the joined tables to form the relations. For example, to list the names of all the lecturers and the subjects that they teach:

 

SELECT LECTURERS.SURNAME, SUBJECTS.SUB_NAME FROM LECTURERS, SUBJECTS

WHERE LECTURERS.DEPT_NO = SUBJECTS.DEPT_NO ;

 

Of course, this join assumes that all the lecturers are multi-skilled in that each is able to teach all the subjects in one particular department. Notice the column naming convention we have used The column names in this query are prefixed by the name of the table that the column is part of. If all the columns in the joined tables had unique names, then the table prefix would not have been required. In our university example though, there is a column called DEPT_NO in both the LECTURERS and the SUBJECTS tables. In this case we must use LECTURERS.DEPT_NO and SUBJECTS.DEPT_NO to distinguish between the columns. Generally, it is good to get into the habit of using the table name prefixto specify columns. As your queries get more and more complex, it may not always be clear to the reader which column you mean if the table prefix is not used. In the last query we did not have to tell SQL how to retrieve the data from the tables, instead, we merely specified what data we wanted to see. The actual tables themselves might have been stored on disks located at different sites. SQL shields the user from these technicalities in that you do not have to know how to get at the data or even where it is. You only have to specify the data to get at. When processing a query with a join, SQL looks at all the possible combination of rows from the tables in the join and uses the criteria defined in the predicate to add or omit the rows from the results table.