
Dr R. Baskaran

CLUSTERING

 

The data mining techniques described in this session are used to find meaningful patterns in data. These patterns are not always immediately forthcoming. Sometimes this is because there are no patterns to be found. Other times, the problem is not the lack of patterns, but the excess. The data may contain so much complex structure that even the best data mining techniques are unable to coax out meaningful patterns.

 

When mining such a database for the answer to some specific question, competing explanations tend to cancel each other out. As with radio reception, too many competing signals add up to noise. Clustering provides a way to learn about the structure of complex data, to break up the cacophony of competing signals into its components. When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply.

 

If someone were asked to describe the color of trees in the forest, the answer would probably make distinctions between deciduous trees and evergreens, and between winter, spring, summer, and fall. People know enough about woodland flora to predict that, of all the hundreds of variables associated with the forest, season and foliage type, rather than say age and height, are the best factors to use for forming clusters of trees that follow similar coloration rules. Once the proper clusters have been defined, it is often possible to find simple patterns within each cluster. “In winter, deciduous trees have no leaves, so the trees tend to be brown” or “The leaves of deciduous trees change color in the autumn, typically to oranges, reds, and yellows.” In many cases, a very noisy dataset is actually composed of a number of better-behaved clusters. The question is: how can these be found? That is where techniques for automatic cluster detection come in, to help see the forest without getting lost in the trees.

 

When data mining techniques are classified as directed or undirected, automatic cluster detection is described as a tool for undirected knowledge discovery. In the technical sense, that is true: the automatic cluster detection algorithms simply find structure that exists in the data without regard to any particular target variable. In contrast, most data mining tasks start out with a pre-classified training set, which is used to develop a model capable of scoring or classifying previously unseen records.

 

In clustering, there is no pre-classified data and no distinction between independent and dependent variables. Instead, clustering algorithms search for groups of records—the clusters—composed of records similar to each other. The algorithms discover these similarities. It is up to the people running the analysis to determine whether similar records represent something of interest to the business—or something inexplicable and perhaps unimportant. In a broader sense, however, clustering can be a directed activity because clusters are sought for some business purpose. In marketing, clusters formed for a business purpose are usually called “segments,” and customer segmentation is a popular application of clustering.

 

Automatic cluster detection is a data mining technique that is rarely used in isolation because finding clusters is not often an end in itself. Once clusters have been detected, other methods must be applied in order to figure out what the clusters mean. When clustering is successful, the results can be dramatic: One famous early application of cluster detection led to our current understanding of stellar evolution.

 

K-Means Clustering:

 

The K-means algorithm is one of the most commonly used clustering algorithms. The “K” in its name refers to the fact that the algorithm looks for a fixed number of clusters, which are defined in terms of the proximity of data points to one another. The version described here was first published by J. B. MacQueen in 1967. For ease of explanation, the technique is illustrated using two-dimensional diagrams. Bear in mind that in practice the algorithm is usually handling many more than two independent variables. This means that instead of points corresponding to two-element vectors (x1, x2), the points correspond to n-element vectors (x1, x2, . . . , xn). The procedure itself is unchanged.

 

Three Steps of the K-Means Algorithm

 

In the first step, the algorithm randomly selects K data points to be the seeds. MacQueen’s algorithm simply takes the first K records. In cases where the records have some meaningful order, it may be desirable to choose widely spaced records, or a random selection of records. Each of the seeds is an embryonic cluster with only one element. This example sets the number of clusters to 3.

 

The second step assigns each record to the closest seed. One way to do this is by finding the boundaries between the clusters, as shown geometrically in Figure 1. The boundary between two clusters is the set of points that are equally close to each cluster. Recalling a lesson from high-school geometry makes this less difficult than it sounds: given any two points, A and B, all points that are equidistant from A and B fall along a line (called the perpendicular bisector) that is perpendicular to the line connecting A and B and halfway between them. In Figure 1, dashed lines connect the initial seeds; the resulting cluster boundaries, shown with solid lines, are at right angles to the dashed lines. Using these lines as guides, it is obvious which records are closest to which seeds. In three dimensions, these boundaries would be planes, and in N dimensions they would be hyperplanes of dimension N – 1. Fortunately, computer algorithms easily handle these situations. Finding the actual boundaries between clusters is useful for showing the process geometrically. In practice, though, the algorithm usually measures the distance of each record to each seed and chooses the minimum distance for this step. For example, consider the record with the box drawn around it. On the basis of the initial seeds, this record is assigned to the cluster controlled by seed number 2 because it is closer to that seed than to either of the other two. At this point, every record has been assigned to exactly one of the three clusters centered around the original seeds.

The third step is to calculate the centroids of the clusters; these now do a better job of characterizing the clusters than the initial seeds. Finding the centroids is simply a matter of taking the average value of each dimension for all the records in the cluster. The centroids then serve as the seeds for the next iteration: steps two and three are repeated until the cluster boundaries stop moving.

 

 

Figure 1: Height and weight of group of teenagers
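
To make the three steps concrete, here is a minimal sketch of the algorithm in Python with NumPy. The toy data, the random seeding, and the convergence test are illustrative choices rather than anything prescribed above, and empty-cluster handling is omitted for brevity.

    import numpy as np

    def k_means(points, k=3, max_iter=100, seed=None):
        """Minimal K-means: seed, assign, re-center, repeat until stable."""
        rng = np.random.default_rng(seed)
        # Step 1: pick K records at random to serve as the initial seeds.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each record to the closest seed (Euclidean distance).
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its assigned records.
            new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):   # boundaries stopped moving
                break
            centers = new_centers
        return labels, centers

    # Toy data loosely echoing Figure 1: (height in inches, weight in pounds).
    teens = np.array([[62, 110], [64, 120], [66, 140],
                      [68, 150], [70, 170], [72, 190]], dtype=float)
    labels, centroids = k_means(teens, k=3, seed=42)
    print(labels)
    print(centroids)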

 

Similarity and Distance:

 

Once records in a database have been mapped to points in space, automatic cluster detection is really quite simple—a little geometry, some vector means. The problem, of course, is that the databases encountered in marketing, sales, and customer support are not about points in space. They are about purchases, phone calls, airplane trips, car registrations, and a thousand other things that have no obvious connection to the dots in a cluster diagram.

 

Clustering records of this sort requires some notion of natural association; that is, records in a given cluster are more similar to each other than to records in another cluster. Since it is difficult to convey intuitive notions to a computer, this vague concept of association must be translated into some sort of numeric measure of the degree of similarity. The most common method, but by no means the only one, is to translate all fields into numeric values so that the records may be treated as points in space. Then, if two points are close in the geometric sense, they represent similar records in the database. There are two main problems with this approach:

 

  • Many variable types, including all categorical variables and many numeric variables such as rankings, do not have the right behavior to be properly treated as components of a position vector.
  • In geometry, the contributions of each dimension are of equal importance, but in databases, a small change in one field may be much more important than a large change in another field.

 

Similarity Measures and Variable Type:

 

Geometric distance works well as a similarity measure for well-behaved numeric variables. A well-behaved numeric variable is one whose value indicates its placement along the axis that corresponds to it in our geometric model. Not all variables fall into this category. For this purpose, variables fall into four classes, listed here in increasing order of suitability for the geometric model.

  • Categorical variables
  • Ranks
  • Intervals
  • True measures

Categorical variables only describe which of several unordered categories a thing belongs to. For instance, it is possible to label one ice cream pistachio and another butter pecan, but it is not possible to say that one is greater than the other or judge which one is closer to black cherry. In mathematical terms, it is possible to tell that X ≠ Y, but not whether X > Y or X < Y.

Ranks put things in order, but don’t say how much bigger one thing is than another. The valedictorian has better grades than the salutatorian, but we don’t know by how much. If X, Y, and Z are ranked A, B, and C, we know that X > Y > Z, but we cannot define X − Y or Y − Z.

Intervals measure the distance between two observations. If it is 56° in San Francisco and 78° in San Jose, then it is 22 degrees warmer at one end of the bay than the other.

 

True measures are interval variables that measure from a meaningful zero point. This trait is important because it means that the ratio of two values of the variable is meaningful. The Fahrenheit temperature scale used in the United States and the Celsius scale used in most of the rest of the world do not have this property. In neither system does it make sense to say that a 30° day is twice as warm as a 15° day. Similarly, a size 12 dress is not twice as large as a size 6, and gypsum is not twice as hard as talc though they are 2 and 1 on the hardness scale. It does make perfect sense, however, to say that a 50-year-old is twice as old as a 25-year-old or that a 10-pound bag of sugar is twice as heavy as a 5-pound one. Age, weight, length, customer tenure, and volume are examples of true measures.

 

Geometric distance metrics are well-defined for interval variables and true measures. In order to use categorical variables and rankings, it is necessary to transform them into interval variables. Unfortunately, these transformations may add spurious information. If ice cream flavors are assigned arbitrary numbers 1 through 28, it will appear that flavors 5 and 6 are closely related while flavors 1 and 28 are far apart.
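
One standard remedy, not covered above but widely used, is to give each category its own 0/1 indicator (“one-hot” encoding), so that every pair of distinct categories is exactly the same distance apart. A small sketch of the problem and the fix, using the ice cream example:

    import numpy as np

    flavors = ["pistachio", "butter pecan", "black cherry"]

    # Arbitrary integer codes: distances reflect the labeling, not the flavors.
    codes = {"pistachio": 5, "butter pecan": 6, "black cherry": 28}
    print(abs(codes["pistachio"] - codes["butter pecan"]))   # 1  -> looks "close"
    print(abs(codes["pistachio"] - codes["black cherry"]))   # 23 -> looks "far"

    # One-hot encoding: every pair of distinct flavors is equidistant.
    one_hot = {f: np.eye(len(flavors))[i] for i, f in enumerate(flavors)}
    print(np.linalg.norm(one_hot["pistachio"] - one_hot["butter pecan"]))   # sqrt(2)
    print(np.linalg.norm(one_hot["pistachio"] - one_hot["black cherry"]))   # sqrt(2)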

 

Manhattan Distance

 

Another common distance metric gets its name from the rectangular grid pattern of streets in midtown Manhattan. It is simply the sum of the distances traveled along each axis. This measure is sometimes preferred to the Euclidean distance because the distances along each axis are not squared, so it is less likely that a large difference in one dimension will dominate the total distance.
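
A brief sketch comparing the two metrics on a pair of points chosen to make the difference visible (the values are illustrative):

    import numpy as np

    def euclidean(a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):
        # Sum of the distances traveled along each axis, with no squaring.
        return np.sum(np.abs(a - b))

    a = np.array([0.0, 0.0])
    b = np.array([9.0, 1.0])   # large gap on one axis, small gap on the other
    print(euclidean(a, b))     # ~9.06: the squared 9-unit gap supplies 81/82 of the total
    print(manhattan(a, b))     # 10.0: each axis contributes in simple proportion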

 

Number of Features in Common:

 

When the preponderance of fields in the records are categorical variables, geometric measures are not the best choice. A better measure is based on the degree of overlap between records. As with the geometric measures, there are many variations on this idea. In all variations, the two records are compared field by field to determine the number of fields that match and the number of fields that don’t match. The simplest measure is the ratio of matches to the total number of fields.

 

In its simplest form, this measure counts two null or empty fields as matching. This has the perhaps perverse result that everything with missing data ends up in the same cluster. A simple improvement is to not include matches of this sort in the match count. Another improvement is to weight the matches by the prevalence of each class in the general population. After all, a match on “Chevy Nomad” ought to count for more than a match on “Ford F-150 Pickup.”
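
Here is a sketch of such an overlap measure with both improvements folded in. The record layout, field names, and weighting scheme are hypothetical; note that with weights the score is no longer bounded by 1.

    def similarity(rec_a, rec_b, weights=None):
        """Ratio of matching fields to comparable fields for categorical records."""
        matches, total = 0.0, 0
        for field in rec_a:
            a, b = rec_a[field], rec_b[field]
            if a is None or b is None:   # skip missing data: two nulls do not match
                continue
            total += 1
            if a == b:
                # Weight matches on rare values more heavily, if weights are given.
                matches += weights.get(a, 1.0) if weights else 1.0
        return matches / total if total else 0.0

    r1 = {"make": "Chevy Nomad", "state": "NY", "color": None}
    r2 = {"make": "Chevy Nomad", "state": "CA", "color": None}
    rare = {"Chevy Nomad": 3.0}        # a rare value counts for more when it matches
    print(similarity(r1, r2))          # 0.5: one match out of two non-null fields
    print(similarity(r1, r2, rare))    # 1.5: the rare-make match is weighted up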

 

Scaling for Consistency:

 

Here are three common ways of scaling variables to bring them all into comparable ranges:

  • Divide each variable by the range (the difference between the lowest and highest value it takes on) after subtracting the lowest value. This maps all values to the range 0 to 1, which is useful for some data mining algorithms.
  • Divide each variable by the mean of all the values it takes on. This is often called “indexing a variable.”
  • Subtract the mean value from each variable and then divide it by the standard deviation. This is often called standardization or “converting to z-scores.” A z-score tells you how many standard deviations away from the mean a value is.

Normalizing a single variable simply changes its range. A closely related concept is vector normalization, which scales all variables at once. This too has a geometric interpretation. Consider the collection of values in a single record or observation as a vector. Normalizing them scales each value so as to make the length of the vector equal one. Transforming all the vectors to unit length emphasizes the differences internal to each record rather than the differences between records.
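
A short sketch of the three scaling methods listed above, plus vector normalization for contrast; the numbers are arbitrary:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0, 20.0])    # one variable across five records

    # 1. Range scaling: subtract the minimum, divide by the range -> values in [0, 1].
    range_scaled = (x - x.min()) / (x.max() - x.min())

    # 2. Indexing: divide by the mean, so 1.0 means "average".
    indexed = x / x.mean()

    # 3. Z-scores: subtract the mean, then divide by the standard deviation.
    z = (x - x.mean()) / x.std()

    # Vector normalization scales one record across all variables instead:
    record = np.array([3.0, 4.0])
    unit = record / np.linalg.norm(record)      # length 1: [0.6, 0.8]

    print(range_scaled, indexed, z, unit, sep="\n")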

 

Use Weights to Encode Outside Information

 

Scaling takes care of the problem that changes in one variable appear more significant than changes in another simply because of differences in the magnitudes of the values in the variable. What if we think that two families with the same income have more in common than two families on the same-size plot, and we want that to be taken into consideration during clustering? That is where weighting comes in. The purpose of weighting is to encode the information that one variable is more (or less) important than the others.

 

A good place to start is by standardizing all variables so each has a mean of zero and a variance (and standard deviation) of one. That way, all fields contribute equally when the distance between two records is computed. We suggest going further. The whole point of automatic cluster detection is to find clusters that make sense to you. If, for your purposes, whether people have children is much more important than the number of credit cards they carry, there is no reason not to bias the outcome of the clustering by multiplying the number of children field by a higher weight than the number of credit cards field.

 

After scaling to get rid of bias that is due to the units, use weights to introduce bias based on knowledge of the business context. Some clustering tools allow the user to attach weights to different dimensions, simplifying the process. Even for tools that don’t have such functionality, it is possible to have weights by adjusting the scaled values. That is, first scale the values to a common range to eliminate range effects. Then multiply the resulting values by a weight to introduce bias based on the business context. Of course, if you want to evaluate the effects of different weighting strategies, you will have to add another outer loop to the clustering process.
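
Putting the two stages together, here is a sketch of a distance function that scales first and then applies business-chosen weights. All field names, statistics, and weights are invented for illustration.

    import numpy as np

    def weighted_distance(a, b, means, stds, weights):
        """Standardize first to remove unit effects, then bias by weights."""
        az = (a - means) / stds
        bz = (b - means) / stds
        return np.linalg.norm(weights * (az - bz))

    # Fields: [income, plot size in acres, number of children].
    means   = np.array([60_000.0, 0.25, 1.2])
    stds    = np.array([25_000.0, 0.10, 1.1])
    weights = np.array([3.0, 1.0, 5.0])      # income and children matter more to us

    fam1 = np.array([80_000.0, 0.25, 2.0])
    fam2 = np.array([80_000.0, 0.50, 2.0])   # differs from fam1 only in plot size
    fam3 = np.array([40_000.0, 0.25, 0.0])   # differs in income and children
    print(weighted_distance(fam1, fam2, means, stds, weights))  # 2.5
    print(weighted_distance(fam1, fam3, means, stds, weights))  # ~10.3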