34 WHY DO WE NEED TO PRE PROCESS THE DATA?
Dr R. Baskaran
WHY DO WE NEED TO PREPROCESS THE DATA?
Much of the raw data contained in databases is unpreprocessed, incomplete, and noisy.For example, the databases may contain:Fields that are obsolete or redundant. Missing values Outliers.Data in a form not suitable for data mining models. Values not consistent with policy or common sense.To be useful for data mining purposes, the databases need to undergo preprocessing, in the form of data cleaning and data transformation. Data mining often deals with data that hasn’t been looked at for years, so that much of the data contains field values that have expired, are no longer relevant, or are simply missing. The overriding objective is to minimize GIGO: to minimize the “garbage” that gets into our model so that we can minimize the amount of garbage that our models give out. Dorian Pyle, in his book Data Preparation for Data Mining , estimates that data preparation alone accounts for 60% of all the time and effort expanded in the entire data mining process.
DATA CLEANING
To illustrate the need to clean up data, let’s take a look at some of the types of errors that could creep into even a tiny data set, such as that in Table 1. Let’s discuss, attribute by attribute, some of the problems that have found their way into the data set in Table 1. The customer ID variable seems to be fine. What about zip?
Let’sassumethatweareexpectingallofthecustomersinthedatabasetohavethe usual five-numeral U.S. zip code. Now, customer 1002 has this strange (to American eyes) zip code of J2S7K7. If we were not careful, we might be tempted to classify this unusual value as an error and toss it out, until we stop to think that not all countries use the same zip code format. Actually, this is the zip code of St. Hyancinthe, Quebec, Canada, so probably represents real data from a real customer. What has evidently occurred is that a French-Canadian customer has made a purchase and put their home zip code down in the field required. Especially in this era of the North American Free Trade Agreement, we must be ready to expect unusual values in fields such as zip codes, which vary from country to country. What about the zip code for customer 1004? We are unaware of any countries that have four-digit zip codes, such as the 6269 indicated here, so this must be an error, right? Probably not. Zip codes for the New England states begin with the numeral 0.
Unless the zip code field is defined to be character (text) and not numeric, the software will probably chop off the leading zero, which is apparently what happened here. The zipcodeisprobably 06269, which refers to Storrs, Connecticut, home of the University of Connecticut. The next field, gender, contains a missing value for customer 1003. The income field, which we assume is measuring annual gross income, has three potentially anomalous values. First, customer 1003 is shown as having an income of Rs.10,000,000 per year. Although entirely possible, especially when considering the customer’s zip code (90210, Beverly Hills), this value of income is nevertheless an outlier, an extreme data value. Certain statistical and data mining modeling techniques do not function smoothly in the presence of outliers;
Table 1: Can You Find Any Problems in This Tiny Data Set?
Poverty is one thing, but it is rare to find an income that is negative, as our poor customer 1004 has. Unlike customer 1003’s income, customer 1004’s reported income of -Rs.40,000 lies beyond the field bounds for income and therefore must be an error. It is unclear how this error crept in, with perhaps the most likely explanation being that the negative sign is a stray data entry error. However, we cannot be sure and should approach this value cautiously, attempting to communicate with the database manager most familiar with the database history. So what is wrong with customer 1005’s income of Rs.99,999? Perhaps nothing; it may in fact be valid. But if all the other incomes are rounded to the nearest Rs.5000, why the precision with customer 1005? Often, in legacy databases, certain specified values are meant to be codes for anomalous entries, such as missing values. Perhaps 99999 was coded in an old database to mean missing. Again, we cannot be sure and should again refer to the “wetware.”
Finally, are we clear as to which unit of measure the income variable is measured in? Databases often get merged, sometimes without bothering to check whether such merges are entirely appropriate for all fields. For example, it is quite possible that customer 1002, with the Canadian zip code, has an income measured in Canadian dollars, not U.S. dollars.
The age field has a couple of problems. Although all the other customers have numerical values for age, customer 1001’s “age” of C probably reflects an earlier categorization of this man’s age into a bin labeled C. The data mining software will definitely not like this categorical value in an otherwise numerical field, and we will have to resolve this problem somehow. How about customer 1004’s age of 0? Perhaps there is a newborn male living in Storrs, Connecticut, who has made a transaction of Rs.1000.
More likely, the age of this person is probably missing and was coded as 0 to indicate this or some other anomalous condition (e.g., refused to provide the age information). Of course, keeping an age field in a database is a minefield in itself, since the passage of time will quickly make the field values obsolete and misleading. It is better to keep date-type fields (such as birthdate) in a database, since these are constant and may be transformed into ages when needed.
The marital status field seems fine, right? Maybe not. The problem lies in the meaning behind these symbols. We all think we know what these symbols mean, but are sometimes surprised. For example, if you are in search of cold water in a rest room in Montreal and turn on the faucet marked C, you may be in for a surprise, since the C stands for chaud, which is French for hot. There is also the problem of ambiguity. In Table 1, for example, does the S for customers 1003 and 1004 stand for single or separated? The transaction amount field seems satisfactory as long as we are confident that we know what unit of measure is being used and that all records are transacted in this unit.
HANDLING MISSING DATA
Missing data is a problem that continues to plague data analysis methods. Even as our analysis methods gain sophistication, we continue to encounter missing values in fields, especially in databases with a large number of fields. The absence of information is rarely beneficial. All things being equal, more data is almost always better. Therefore, we should think carefully about how we handle the thorny issue of missing data. To help us tackle this problem, we will introduce ourselves to a new data set, the cars data set, originally compiled by Barry Becker and Ronny Kohavi of Silicon Graphics, and available at the SGI online data repository at www.sgi.com/tech/mlc/db. The data set, also available on the book series Web site accompanying the text, consists of information about 261 automobiles manufactured in the 1970s and 1980s, including gas mileage, number of cylinders, cubic inches, horsepower, and so on.
A common method of handling missing values is simply to omit from the analysis the records or fields with missing values. However, this may be dangerous, since the pattern of missing values may in fact be systematic, and simply deleting records with missing values would lead to a biased subset of the data. Further, it seems like a waste to omit the information in all the other fields, just because one field value is missing. Therefore, data analysts have turned to methods that would replace the missing value with a value substituted according to various criteria. Insightful Miner offers a choice of replacement values for missing data:
1. Replace the missing value with some constant, specified by the analyst.
2. Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).
3. Replace the missing values with a value generated at random from the variable distribution observed.
IDENTIFYING MISCLASSIFICATIONS
Let us look at an example of checking the classification labels on the categorical variables, to make sure that they are all valid and consistent. One of the functions of Insightful Miner’s missing values node is to display a frequency distribution of the categorical variables available. For example, the frequency distribution of the categorical variable origin, where Insightful Miner’s missing values node is applied to the cars data set, classes: India, USA, France, US, Europe, and Japan. However, two of the classes, USA and France, have a count of only one automobile each. What is clearly happening here is that two of the records have been classified inconsistently with respect to the origin of manufacture. To maintain consistency with the remainder of the data set, the record with origin USA should have been labeled US, and the record with origin France should have been labeled Europe.
METHODS FOR IDENTIFYING OUTLIERS
Outliers are extreme values that lie near the limits of the data range or go against the trend of the remaining data. Identifying outliers is important because they may represent errors in data entry. Also, even if an outlier is a valid data point and not in error, certain statistical methods are sensitive to the presence of outliers and may deliver unstable results. Neural networks benefit from normalization, as do algorithms that make use of distance measures, such as the k-nearest neighbor algorithm.
DATA TRANSFORMATION
Variables tend to have ranges that vary greatly from each other. For example, if we are interested in major league baseball, players’ batting averages will range from zero to less than 0.400, while the number of home runs hit in a season will range from zero to around 70. For some data mining algorithms, such differences in the ranges will lead to a tendency for the variable with greater range to have undue influence on the results.
Therefore, data miners should normalize their numerical variables, to standardize the scale of effect each variable has on the results. There are several techniques for normalization, and we shall examine two of the more prevalent methods. Let X refer to our original field value and X ∗ refer to the normalized field value.