33 Predictive Data Mining
Dr R. Baskaran
The goal of data mining is to produce new knowledge that the user can act upon. It does this by building a model of the real world based on data collected from a variety of sources which may include corporate transactions, customer histories and demographic information, process control data, and relevant external databases such as credit bureau information or weather data. The result of the model building is a description of patterns and relationships in the data that can be confidently used for prediction.
To avoid confusing the different aspects of data mining, it helps to envision a hierarchy of the choices and decisions you need to make before you start:
• Business goal
• Type of prediction
• Model type
• Algorithm
• Product
At the highest level is the business goal: what is the ultimate purpose of mining this data? For example, if you are seeking patterns in your data to help you retain good customers, you might build one model to predict customer profitability and a second model to identify customers likely to leave (attrition).
Your knowledge of your organization’s needs and objectives will guide you in formulating the goal of your models.
The next step is deciding on the type of prediction that’s most appropriate: (1) classification: predicting into what category or class a case falls, or (2) regression: predicting what number value a variable will have (if it’s a variable that varies with time, it’s called time series prediction). In the example above, you might use regression to forecast the amount of profitability, and classification to predict which customers might leave. These are discussed in more detail below.
Now you can choose the model type: a neural net to perform the regression, perhaps, and a decision tree for the classification. There are also traditional statistical models to choose from such as logistic regression, discriminant analysis, or general linear models. The most important model types for data mining are described in the next section, on DATA MINING MODELS AND ALGORITHMS.
Many algorithms are available to build your models. You might build the neural net using backpropagation or radial basis functions. For the decision tree, you might choose among CART, C5.0, Quest, or CHAID. Some of these algorithms are also discussed in DATA MINING MODELS AND ALGORITHMS, below. When selecting a data mining product, be aware that they generally have different implementations of a particular algorithm even when they identify it with the same name. These implementation differences can affect operational characteristics such as memory usage and data storage, as well as performance characteristics such as speed and accuracy. Other key considerations to keep in mind are covered later in the section on SELECTING DATA MINING PRODUCTS.
Many business goals are best met by building multiple model types using a variety of algorithms. You may not be able to determine which model type is best until you’ve tried several approaches.
Some terminology
In predictive models, the values or classes we are predicting are called the response, dependent or target variables. The values used to make the prediction are called the predictor or independent variables.
Predictive models are built, or trained, using data for which the value of the response variable is already known. This kind of training is sometimes referred to as supervised learning, because calculated or estimated values are compared with the known results. (By contrast, descriptive techniques such as clustering, described in the previous section, are sometimes referred to as unsupervised learning because there is no already-known result to guide the algorithms.)
Classification
Classification problems aim to identify the characteristics that indicate the group to which each case belongs. This pattern can be used both to understand the existing data and to predict how new instances will behave. For example, you may want to predict whether individuals can be classified as likely to respond to a direct mail solicitation, vulnerable to switching over to a competing long-distance phone service, or a good candidate for a surgical procedure. Data mining creates classification models by examining already classified data (cases) and inductively finding a predictive pattern. These existing cases may come from an historical database, such as people who have already undergone a particular medical treatment or moved to a new long-distance service. They may come from an experiment in which a sample of the entire database is tested in the real world and the results used to create a classifier. For example, a sample of a mailing list would be sent an offer, and the results of the mailing used to develop a classification model to be applied to the entire database. Sometimes an expert classifies a sample of the database, and this classification is then used to create the model which will be applied to the entire database.
Regression
Regression uses existing values to forecast what other values will be. In the simplest case, regression uses standard statistical techniques such as linear regression. Unfortunately, many real-world problems are not simply linear projections of previous values. For instance, sales volumes, stock prices, and product failure rates are all very difficult to predict because they may depend on complex interactions of multiple predictor variables. Therefore, more complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary to forecast future values.
The same model types can often be used for both regression and classification. For example, the CART (Classification And Regression Trees) decision tree algorithm can be used to build both classification trees (to classify categorical response variables) and regression trees (to forecast continuous response variables). Neural nets too can create both classification and regression models.
Time series
Time series forecasting predicts unknown future values based on a time-varying series of predictors. Like regression, it uses known results to guide its predictions. Models must take into account the distinctive properties of time, especially the hierarchy of periods (including such varied definitions as the five- or seven-day work week, the thirteen-“month” year, etc.), seasonality, calendar effects such as holidays, date arithmetic, and special considerations such as how much of the past is relevant.
DATA MINING MODELS AND ALGORITHMS
Now let’s examine some of the types of models and algorithms used to mine data. Most products use variations of algorithms that have been published in computer science or statistics journals, with their specific implementations customized to meet the individual vendor’s goal. For example, many vendors sell versions of the CART or CHAID decision trees with enhancements to work on parallel computers. Some vendors have proprietary algorithms which, while not extensions or enhancements of any published approach, may work quite well. Most of the models and algorithms discussed in this section can be thought of as generalizations of the standard workhorse of modeling, the linear regression model. Much effort has been expended in the statistics, computer science, artificial intelligence and engineering communities to overcome the limitations of this basic model. The common characteristic of many of the newer technologies we will consider is that the pattern-finding mechanism is data-driven rather than user-driven. That is, the relationships are found inductively by the software itself based on the existing data rather than requiring the modeler to specify the functional form and interactions. Perhaps the most important thing to remember is that no one model or algorithm can or should be used exclusively. For any given problem, the nature of the data itself will affect your choice of models and algorithms. There is no “best” model or algorithm. Consequently, you will need a variety of tools and technologies in order to find the best possible model.
Neural networks
Neural networks are of particular interest because they offer a means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions. (Actual biological neural networks are incomparably more complex.) Neural nets may be used in classification problems (where the output is a categorical variable) or for regressions (where the output variable is continuous).
A neural network starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer. Each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables.
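The layered structure described above can be sketched as a forward pass through a small, fully connected network. This is only an illustration, not any product's implementation: the layer sizes, the weight values, and the choice of a sigmoid activation are all assumptions made for the example.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one fully connected layer.

    weights[j][i] is the weight on the connection from input node i
    to node j of this layer, so every input feeds every node.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


def network_forward(predictors, layers):
    """Pass the predictor values through each layer in turn."""
    activations = predictors
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 predictors -> 2 hidden nodes -> 1 output node.
hidden_layer = ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])
output_layer = ([[1.2, -0.7]], [0.05])

prediction = network_forward([1.0, 0.5, -1.0], [hidden_layer, output_layer])
```

Here `prediction` is a one-element list holding the value of the single response variable. Training (for example, by backpropagation) would adjust the weights, which are simply fixed by hand in this sketch.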
THE DATA MINING PROCESS
Process Models
Recognizing that a systematic approach is essential to successful data mining, many vendor and consulting organizations have specified a process model designed to guide the user (especially someone new to building predictive models) through a sequence of steps that will lead to good results.
SPSS uses the 5A’s (Assess, Access, Analyze, Act and Automate) and SAS uses SEMMA (Sample, Explore, Modify, Model, Assess). Recently, a consortium of vendors and users consisting of NCR Systems Engineering Copenhagen (Denmark), Daimler-Benz AG (Germany), SPSS/Integral Solutions Ltd. (England) and OHRA Verzekeringen en Bank Groep B.V. (The Netherlands) has been developing a specification called CRISP-DM (Cross-Industry Standard Process for Data Mining). CRISP-DM is similar to process models from other companies including the one from Two Crows Corporation. As of September 1999, CRISP-DM is a work in progress. It is a good start in helping people to understand the necessary steps in successful data mining.
The Two Crows Process Model
The Two Crows data mining process model described below is derived from the Two Crows process model discussed in the previous edition of this document, and also takes advantage of some insights from CRISP-DM.
Keep in mind that while the steps appear in a list, the data mining process is not linear — you will inevitably need to loop back to previous steps. For example, what you learn in the “explore data” step may require you to add new data to the data mining database. The initial models you build may provide insights that lead you to create new variables.
The basic steps of data mining for knowledge discovery are:
1. Define business problem
2. Build data mining database
3. Explore data
4. Prepare data for modeling
5. Build model
6. Evaluate model
7. Deploy model and results
Let’s go through these steps to better understand the knowledge discovery process.
1. Define the business problem. First and foremost, the prerequisite to knowledge discovery is understanding your data and your business. Without this understanding, no algorithm, regardless of sophistication, is going to provide you with a result in which you should have confidence. Without this background you will not be able to identify the problems you’re trying to solve, prepare the data for mining, or correctly interpret the results. To make the best use of data mining you must make a clear statement of your objectives. It may be that you wish to increase the response to a direct mail campaign. Depending on your specific goal, such as “increasing the response rate” or “increasing the value of a response,” you will build a very different model. An effective statement of the problem will include a way of measuring the results of your knowledge discovery project. It may also include a cost justification.
2. Build a data mining database. This step, along with the next two, constitutes the core of the data preparation. Together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps as you learn something from the model that suggests you modify the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire knowledge discovery process!
The data to be mined should be collected in a database. Note that this does not necessarily imply a database management system must be used. Depending on the amount of the data, the complexity of the data, and the uses to which it is to be put, a flat file or even a spreadsheet may be adequate. In general, it’s not a good idea to use your corporate data warehouse for this. You will be better off creating a separate data mart. Mining the data will make you a very active user of the data warehouse, possibly causing resource allocation problems. You will often be joining many tables together and accessing substantial portions of the warehouse. A single trial model may require many passes through much of the warehouse. Almost certainly you will be modifying the data from the data warehouse. In addition you may want to bring in data from outside your company to overlay on the data warehouse data or you may want to add new fields computed from existing fields. You may need to gather additional data through surveys. Other people building different models from the data warehouse (some of whom will use the same data as you) may want to make similar alterations to the warehouse.
However, data warehouse administrators do not look kindly on having data changed in what is unquestionably a corporate resource. One more reason for a separate database is that the structure of the corporate data warehouse may not easily support the kinds of exploration you need to do to understand this data. This includes queries summarizing the data, multi-dimensional reports (sometimes called pivot tables), and many different kinds of graphs or visualizations.
Lastly, you may want to store this data in a different DBMS with a different physical design than the one you use for your corporate data warehouse. Increasingly, people are selecting special purpose DBMSs which support these data mining requirements quite well. If, however, your corporate data warehouse allows you to create logical data marts and if it can handle the resource demands of data mining, then it may also serve as a good data mining database.
The tasks in building a data mining database are:
a. Data collection
b. Data description
c. Selection
d. Data quality assessment and data cleansing
e. Consolidation and integration
f. Metadata construction
g. Load the data mining database
h. Maintain the data mining database
You must remember that these tasks are not performed in strict sequence, but as the need arises. For example, you will start constructing the metadata infrastructure as you collect the data, and modify it continuously. What you learn in consolidation or data quality assessment may change your initial selection decision.
a. Data collection. Identify the sources of the data you will be mining. A data-gathering phase may be necessary because some of the data you need may never have been collected. You may need to acquire external data from public databases (such as census or weather data) or proprietary databases (such as credit bureau data).
b. Data description. Describe the contents of each file or database table.
c. Selection. The next step in preparing the data mining database is to select the subset of data to mine. This is not the same as sampling the database or choosing predictor variables. Rather, it is a gross elimination of irrelevant or unneeded data. Other criteria for excluding data may include resource constraints, cost, restrictions on data use, or quality problems.
d. Data quality assessment and data cleansing. GIGO (Garbage In, Garbage Out) is quite applicable to data mining, so if you want good models you need to have good data. A data quality assessment identifies characteristics of the data that will affect the model quality. Essentially, you are trying to ensure not only the correctness and consistency of values but also that all the data you have is measuring the same thing in the same way. There are a number of types of data quality problems. Single fields may have an incorrect value. For example, recently a man’s nine-digit Social Security identification number was accidentally entered as income when the government computed his taxes! Even when individual fields have what appear to be correct values, there may be incorrect combinations, such as pregnant males. Sometimes the value for a field is missing. Inconsistencies must be identified and removed when consolidating data from multiple sources. Missing data can be a particularly pernicious problem. If you have to throw out every record with a field missing, you may wind up with a very small database or an inaccurate picture of the whole database. The fact that a value is missing may be significant in itself. Perhaps only wealthy customers regularly leave the “income” field blank, for instance. It can be worthwhile to create a new variable to identify missing values, build a model using it, and compare the results with those achieved by substituting for the missing value to see which leads to better predictions.
Another approach is to calculate a substitute value. Some common strategies for calculating missing values include using the modal value (for nominal variables), the median (for ordinal variables), or the mean (for continuous variables). A less common strategy is to assign a missing value based on the distribution of values for that variable. For example, if a database consisted of 40% females and 60% males, then you might assign a missing gender entry the value of “female” 40% of the time and “male” 60% of the time. Sometimes people build predictive models using data mining techniques to predict missing values. This usually gives a better result than a simple calculation, but is much more time-consuming. Recognize that you will not be able to fix all the problems, so you will need to work around them as best as possible. It is far preferable and more cost-effective to put in place procedures and checks to avoid the data quality problems — “an ounce of prevention.” Usually, however, you must build the models you need with the data you now have, and avoidance is something you’ll work toward for the future.
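The substitution strategies just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: missing values are represented by `None`, the function names are invented for the example, and `median_low` is used for ordinal data so that the substitute is always a value that actually occurs.

```python
import random
import statistics


def missing_indicator(values):
    """New variable that records which entries were missing."""
    return [v is None for v in values]


def impute_mode(values):
    """Nominal variables: substitute the most common observed value."""
    fill = statistics.mode([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_median(values):
    """Ordinal variables: substitute the (low) median observed value."""
    fill = statistics.median_low([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_mean(values):
    """Continuous variables: substitute the mean observed value."""
    fill = statistics.mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_from_distribution(values, rng):
    """Draw each substitute at random from the observed values, so the
    substitutes follow the observed proportions (e.g. 40% / 60%)."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


# Hypothetical columns with gaps.
incomes = [30000.0, None, 50000.0, 70000.0]
genders = ["male", "female", None, "male"]

income_was_missing = missing_indicator(incomes)
incomes_filled = impute_mean(incomes)
genders_filled = impute_from_distribution(genders, random.Random(0))
```

Keeping `income_was_missing` as a predictor alongside the filled column makes it possible to compare both approaches on the same data.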
e. Consolidation and integration. The data you need may reside in a single database or in multiple databases. The source databases may be transaction databases used by the operational systems of your company. Other data may be in data warehouses or data marts built for specific purposes. Still other data may reside in a proprietary database belonging to another company such as a credit bureau. Data integration and consolidation combines data from different sources into a single mining database and requires reconciling differences in data values from the various sources. Improperly reconciled data is a major source of quality problems. There are often large differences in the way data are defined and used in different databases. Some inconsistencies may be easy to uncover, such as different addresses for the same customer. Others are harder to resolve because they are subtle. For example, the same customer may have different names or, worse, multiple customer identification numbers. The same name may be used for different entities (homonyms), or different names may be used for the same entity (synonyms). There are often unit incompatibilities, especially when data sources are consolidated from different countries; for example, U.S. dollars and Canadian dollars cannot be added without conversion.
f. Metadata construction. The information in the Dataset Description and Data Description reports is the basis for the metadata infrastructure. In essence this is a database about the database itself. It provides information that will be used in the creation of the physical database as well as information that will be used by analysts in understanding the data and building the models.
g. Load the data mining database. In most cases the data should be stored in its own database. For large amounts or complex data, this will usually be a DBMS as opposed to a flat file. Having collected, integrated and cleaned the data, it is now necessary to actually load the database itself. Depending on the DBMS and hardware being used, the amount of data, and the complexity of the database design, this may turn out to be a serious undertaking that requires the expertise of information systems professionals.
h. Maintain the data mining database. Once created, a database needs to be cared for. It needs to be backed up periodically; its performance should be monitored; and it may need occasional reorganization to reclaim disk storage or to improve performance. For a large, complex database stored in a DBMS, the maintenance may also require the services of information systems professionals.
3. Explore the data. See the DATA DESCRIPTION FOR DATA MINING section above for a detailed discussion of visualization, link analysis, and other means of exploring the data. The goal is to identify the most important fields in predicting an outcome, and determine which derived values may be useful.
In a data set with hundreds or even thousands of columns, exploring the data can be as time-consuming and labor-intensive as it is illuminating. A good interface and fast computer response are very important in this phase because the very nature of your exploration is changed when you have to wait even 20 minutes for some graphs, let alone a day.
4. Prepare data for modeling. This is the final data preparation step before building models. There are four main parts to this step:
a. Select variables
b. Select rows
c. Construct new variables
d. Transform variables
a. Select variables. Ideally, you would take all the variables you have, feed them to the data mining tool and let it find those which are the best predictors. In practice, this doesn’t work very well. One reason is that the time it takes to build a model increases with the number of variables. Another reason is that blindly including extraneous columns can lead to incorrect models. A very common error, for example, is to use as a predictor variable data that can only be known if you know the value of the response variable. People have actually used date of birth to “predict” age without realizing it.
While in principle some data mining algorithms will automatically ignore irrelevant variables and properly account for related (covariant) columns, in practice it is wise to avoid depending solely on the tool. Often your knowledge of the problem domain can let you make many of these selections correctly. For example, including ID number or Social Security number as predictor variables will at best have no benefit and at worst may reduce the weight of other important variables.
b. Select rows. As in the case of selecting variables, you would like to use all the rows you have to build models. If you have a lot of data, however, this may take too long or require buying a bigger computer than you would like. Consequently it is often a good idea to sample the data when the database is large. This yields no loss of information for most business problems, although sample selection must be done carefully to ensure the sample is truly random. Given a choice of either investigating a few models built on all the data or investigating more models built on a sample, the latter approach will usually help you develop a more accurate and robust model.
You may also want to throw out data that are clearly outliers. While in some cases outliers may contain information important to your model building, often they can be ignored based on your understanding of the problem. For example, they may be the result of incorrectly entered data, or of a one-time occurrence such as a labor strike. Sometimes you may need to add new records (e.g., for customers who made no purchases).
c. Construct new variables. It is often necessary to construct new predictors derived from the raw data. For example, forecasting credit risk using a debt-to-income ratio rather than just debt and income as predictor variables may yield more accurate results that are also easier to understand. Certain variables that have little effect alone may need to be combined with others, using various arithmetic or algebraic operations (e.g., addition, ratios). Some variables that extend over a wide range may be modified to construct a better predictor, such as using the log of income instead of income.
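As a small sketch of the two derived predictors mentioned above (a debt-to-income ratio and the log of income), assuming each customer is held as a dictionary with hypothetical `debt` and `income` fields:

```python
import math


def add_derived_predictors(record):
    """Return a copy of a customer record with two derived predictors added."""
    if record["income"] <= 0:
        # Both the ratio and the logarithm are undefined for non-positive income.
        raise ValueError("income must be positive")
    derived = dict(record)
    derived["debt_to_income"] = record["debt"] / record["income"]
    derived["log_income"] = math.log(record["income"])
    return derived


# Hypothetical customer record.
customer = {"debt": 12000.0, "income": 48000.0}
enriched = add_derived_predictors(customer)
```

The original fields are kept in the copy, so a model can be built with or without the derived variables and the results compared.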
d. Transform variables. The tool you choose may dictate how you represent your data, for instance, the categorical explosion required by neural nets. Variables may also be scaled to fall within a limited range, such as 0 to 1. Many decision trees used for classification require continuous data such as income to be grouped in ranges (bins) such as High, Medium, and Low. The encoding you select can influence the result of your model. For example, the cutoff points for the bins may change the outcome of a model.
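The three transformations mentioned above (exploding a categorical variable, scaling to the range 0 to 1, and binning) might look like the following sketch. The category names and the income cutoffs are assumptions chosen for illustration only.

```python
def explode_category(value, categories):
    """Represent one categorical value as a list of 0/1 indicator variables."""
    return [1 if value == category else 0 for category in categories]


def scale_to_unit_range(values):
    """Rescale numeric values linearly so they fall between 0 and 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - low) / (high - low) for v in values]


def bin_income(income, low_cutoff=30000, high_cutoff=80000):
    """Group a continuous income into Low / Medium / High.

    The cutoffs are hypothetical; moving them changes the encoding.
    """
    if income < low_cutoff:
        return "Low"
    if income < high_cutoff:
        return "Medium"
    return "High"


region_flags = explode_category("west", ["east", "west", "north"])
scaled_ages = scale_to_unit_range([10, 20, 30])
default_bin = bin_income(50000)
shifted_bin = bin_income(50000, high_cutoff=40000)
```

With the default cutoffs an income of 50,000 falls in the Medium bin, but lowering the upper cutoff to 40,000 moves the same customer into High, which is how the choice of cutoff points can change the outcome of a model.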
5. Build the model. The most important thing to remember about model building is that it is an iterative process. You will need to explore alternative models to find the one that is most useful in solving your business problem. What you learn in searching for a good model may lead you to go back and make some changes to the data you are using or even modify your problem statement.
Once you have decided on the type of prediction you want to make (e.g., classification or regression), you must choose a model type for making the prediction. This could be a decision tree, a neural net, a proprietary method, or that old standby, logistic regression. Your choice of model type will influence what data preparation you must do and how you go about it. For example, a neural net tool may require you to explode your categorical variables. Or the tool may require that the data be in a particular file format, thus requiring you to extract the data into that format. Once the data is ready, you can proceed with training your model. The process of building predictive models requires a well-defined training and validation protocol in order to ensure the most accurate and robust predictions. This kind of protocol is sometimes called supervised learning. The essence of supervised learning is to train (estimate) your model on a portion of the data, then test and validate it on the remainder of the data. A model is built when the cycle of training and testing is completed. Sometimes a third data set, called the validation data set, is needed because the test data may be influencing features of the model, and the validation set acts as an independent measure of the model’s accuracy.
Training and testing the data mining model requires the data to be split into at least two groups: one for model training (i.e., estimation of the model parameters) and one for model testing. If you don’t use different training and test data, the accuracy of the model will be overestimated. After the model is generated using the training database, it is used to predict the test database, and the resulting accuracy rate is a good estimate of how the model will perform on future databases that are similar to the training and test databases. It does not guarantee that the model is correct. It simply says that if the same technique were used on a succession of databases with similar data to the training and test data, the average accuracy would be close to the one obtained this way.
Simple validation. The most basic testing method is called simple validation. To carry this out, you set aside a percentage of the database as a test database, and do not use it in any way in the model building and estimation. This percentage is typically between 5% and 33%. For all the future calculations to be correct, the division of the data into two groups must be random, so that the training and test data sets both reflect the data being modeled.
After building the model on the main body of the data, the model is used to predict the classes or values of the test database. Dividing the number of incorrect classifications by the total number of instances gives an error rate. Dividing the number of correct classifications by the total number of instances gives an accuracy rate (i.e., accuracy = 1 – error). For a regression model, the goodness of fit or “r-squared” is usually used as an estimate of the accuracy.
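The simple validation procedure can be sketched as follows: a random split into training and test sets, then an error rate, an accuracy rate, and an r-squared computed on the test predictions. The function names, the 20% test fraction, and the sample labels are assumptions for illustration.

```python
import random


def split_train_test(rows, test_fraction, seed):
    """Randomly divide rows into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(actual, predicted):
    """Incorrect classifications divided by the total number of instances."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def r_squared(actual, predicted):
    """Goodness of fit for a regression model: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


# Hold out a hypothetical 20% of 100 rows as the test database.
train_rows, test_rows = split_train_test(range(100), test_fraction=0.2, seed=0)

# Hypothetical actual and predicted classes for four test cases.
actual_classes = ["A", "B", "A", "B"]
predicted_classes = ["A", "B", "B", "B"]
error = error_rate(actual_classes, predicted_classes)
accuracy = 1.0 - error
```

The test rows play no part in building the model; they are used only to score the predictions afterwards.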
In building a single model, even this simple validation may need to be performed dozens of times. For example, when using a neural net, sometimes each training pass through the net is tested against a test database. Training then stops when the accuracy rates on the test database no longer improve with additional iterations.
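The stopping rule described above can be sketched as follows. The per-pass accuracy figures are hypothetical and the function name is an assumption for illustration; a real tool would obtain each figure by scoring the test database after every training pass.

```python
def last_improving_pass(test_accuracy_by_pass):
    """Return the index of the last training pass that improved test accuracy.

    Training is assumed to stop at the first pass whose accuracy on the
    test database is no better than the best seen so far.
    """
    best_pass = 0
    best_accuracy = test_accuracy_by_pass[0]
    for pass_index in range(1, len(test_accuracy_by_pass)):
        accuracy = test_accuracy_by_pass[pass_index]
        if accuracy <= best_accuracy:
            break  # no improvement: stop training here
        best_pass, best_accuracy = pass_index, accuracy
    return best_pass


# Hypothetical test-set accuracy after each of five training passes.
accuracies = [0.61, 0.70, 0.74, 0.73, 0.78]
stop_after = last_improving_pass(accuracies)
```

In this example training would stop after the third pass (index 2), even though a later pass happens to score higher. Tools may differ in how long they wait before concluding that accuracy has stopped improving.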
Cross validation. If you have only a modest amount of data (a few thousand rows) for building the model, you can’t afford to set aside a percentage of it for simple validation. Cross validation is a method that lets you use all your data. The data is randomly divided into two equal sets in order to estimate the predictive accuracy of the model. First, a model is built on the first set and used to predict the outcomes in the second set and calculate an error rate. Then a model is built on the second set and used to predict the outcomes in the first set and again calculate an error rate. Finally, a model is built using all the data. There are now two independent error estimates which can be averaged to give a better estimate of the true accuracy of the model built on all the data. Typically, the more general n-fold cross validation is used. In this method, the data is randomly divided into n disjoint groups. For example, suppose the data is divided into ten groups. The first group is set aside for testing and the other nine are lumped together for model building. The model built on the 90% group is then used to predict the group that was set aside. This process is repeated a total of 10 times as each group in turn is set aside, the model is built on the remaining 90% of the data, and then that model is used to predict the set-aside group. Finally, a model is built using all the data. The mean of the 10 independent error rate predictions is used as the error rate for this last model.
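A minimal sketch of n-fold cross validation follows. To keep it self-contained, the "model" is a hypothetical stand-in that always predicts the most common class in its training data; any real model-building routine with the same shape could be passed in instead. The data set is also invented for the example.

```python
import random
from collections import Counter


def train_majority_classifier(rows):
    """Hypothetical stand-in for a real model: always predict the
    most common class seen in the training rows."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority


def n_fold_error(rows, n_folds, train, seed=0):
    """Estimate the error rate by n-fold cross validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows into n disjoint groups.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Build the model on every group except the one set aside.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)
        wrong = sum(1 for features, label in held_out if model(features) != label)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / n_folds


# Hypothetical data: 100 customers, one in four of whom leaves.
data = [((i,), "leave" if i % 4 == 0 else "stay") for i in range(100)]

estimated_error = n_fold_error(data, n_folds=10, train=train_majority_classifier)
final_model = train_majority_classifier(data)  # built using all the data
```

The averaged fold error is reported as the error estimate for `final_model`, which is the model actually kept.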
Bootstrapping is another technique for estimating the error of a model; it is primarily used with very small data sets. As in cross validation, the model is built on the entire dataset. Then numerous data sets called bootstrap samples are created by sampling from the original data set. After each case is sampled, it is replaced and a case is selected again until the entire bootstrap sample is created. Note that records may occur more than once in the data sets thus created. A model is built on this data set, and its error rate is calculated. This is called the resubstitution error. Many bootstrap samples (sometimes over 1,000) are created. The final error estimate for the model built on the whole data set is calculated by taking the average of the estimates from each of the bootstrap samples.
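The bootstrap procedure as described in the paragraph above can be sketched like this: each sample is drawn with replacement, a model is built on it, and its error on that same sample is averaged across samples. The majority-class "model", the data, and the number of samples are assumptions for illustration; practical bootstrap estimators often refine this by also scoring cases left out of each sample.

```python
import random
from collections import Counter


def train_majority_classifier(rows):
    """Hypothetical stand-in for a real model: always predict the
    most common class seen in the training rows."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority


def bootstrap_sample(rows, rng):
    """Draw len(rows) cases with replacement, so a case may repeat."""
    return [rng.choice(rows) for _ in rows]


def bootstrap_error(rows, train, n_samples=200, seed=0):
    """Average the error rates of models built on bootstrap samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_samples):
        sample = bootstrap_sample(rows, rng)
        model = train(sample)
        # Error of the model on the sample it was built from.
        wrong = sum(1 for features, label in sample if model(features) != label)
        errors.append(wrong / len(sample))
    return sum(errors) / n_samples


# Hypothetical small data set: 20 customers, one in four of whom leaves.
data = [((i,), "leave" if i % 4 == 0 else "stay") for i in range(20)]

final_model = train_majority_classifier(data)  # built on the entire data set
estimated_error = bootstrap_error(data, train_majority_classifier)
```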
Based upon the results of your model building, you may want to build another model using the same technique but different parameters, or perhaps try other algorithms or tools. For example, another approach may increase your accuracy. No tool or technique is perfect for all data, and it is difficult if not impossible to be sure before you start which technique will work the best. It is quite common to build numerous models before finding a satisfactory one.
6. Evaluate and interpret the model.
a. Model validation. After building a model, you must evaluate its results and interpret their significance. Remember that the accuracy rate found during testing applies only to the data on which the model was built. In practice, the accuracy may vary if the data to which the model is applied differs in important and unknowable ways from the original data. More importantly, accuracy by itself is not necessarily the right metric for selecting the best model. You need to know more about the type of errors and the costs associated with them.
b. Confusion matrices. For classification problems, a confusion matrix is a very useful tool for understanding results. A confusion matrix (Figure 9) shows the counts of the actual versus predicted class values. It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong. The following table is a sample confusion matrix. The columns show the actual classes, and the rows show the predicted classes. Therefore the diagonal shows all the correct predictions. In the confusion matrix, you can see that our model predicted 38 of the 46 Class B’s correctly, but misclassified 8 of them: two as Class A and six as Class C. This is much more informative than simply telling us an overall accuracy rate of 82% (123 correct classifications out of 150 cases).
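The arithmetic behind these figures can be checked with a short Python sketch. Only the Class B column (38 correct, 2 predicted as A, 6 predicted as C) and the totals (123 correct out of 150) come from the text; the remaining cell counts are illustrative assumptions chosen to be consistent with those totals, and the helper names are hypothetical.

```python
CLASSES = ["A", "B", "C"]

# matrix[predicted][actual]: rows are predicted classes, columns are actual.
# Only the actual-B column is taken from the text; other cells are assumed.
matrix = {
    "A": {"A": 45, "B": 2, "C": 3},
    "B": {"A": 10, "B": 38, "C": 2},
    "C": {"A": 4, "B": 6, "C": 40},
}


def total_cases(m):
    """Sum of every cell in the matrix."""
    return sum(m[p][a] for p in CLASSES for a in CLASSES)


def accuracy(m):
    """Correct predictions lie on the diagonal (predicted == actual)."""
    correct = sum(m[c][c] for c in CLASSES)
    return correct / total_cases(m)


def actual_count(m, actual):
    """Column total: how many cases really belong to this class."""
    return sum(m[p][actual] for p in CLASSES)


def misclassified(m, actual):
    """Where the cases of one actual class went when predicted wrongly."""
    return {p: m[p][actual] for p in CLASSES if p != actual}


if __name__ == "__main__":
    print(f"overall accuracy: {accuracy(matrix):.2f}")
    print(f"actual Class B cases: {actual_count(matrix, 'B')}")
    print(f"Class B errors by predicted class: {misclassified(matrix, 'B')}")
```

Reading down a column, as misclassified does, shows which classes a model confuses with each other, which is the detail an overall accuracy figure hides.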
7. Deploy the model and results. Once a data mining model is built and validated, it can be used in one of two main ways. The first way is for an analyst to recommend actions based on simply viewing the model and its results. For example, the analyst may look at the clusters the model has identified, the rules that define the model, or the lift and ROI charts that depict the effect of the model. The second way is to apply the model to different data sets. The model could be used to flag records based on their classification, or assign a score such as the probability of an action (e.g., responding to a direct mail solicitation). Or the model can select some records from the database and subject these to further analyses with an OLAP tool. Often the models are part of a business process such as risk analysis, credit authorization or fraud detection. In these cases the model is incorporated into an application. For instance, a predictive model may be integrated into a mortgage loan application to aid a loan officer in evaluating the applicant. Or a model might be embedded in an application such as an inventory ordering system that automatically generates an order when the forecast inventory levels drop below a threshold.
The data mining model is often applied to one event or transaction at a time, such as scoring a loan application for risk. The amount of time to process each new transaction, and the rate at which new transactions arrive, will determine whether a parallelized algorithm is needed. Thus, while loan applications can easily be evaluated on modest-sized computers, monitoring credit card transactions or cellular telephone calls for fraud would require a parallel system to deal with the high transaction rate.
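A rough back-of-the-envelope check makes the trade-off concrete: the arrival rate multiplied by the processing time per transaction gives the number of processors that must be busy at once. The sketch below is only an illustration; the function name and every workload figure in it are hypothetical, not measurements from any real system.

```python
MS_PER_HOUR = 3_600_000


def processors_needed(transactions_per_hour, ms_per_transaction):
    """Smallest whole number of processors that keeps up with the workload.

    Uses integer arithmetic (ceiling division) to avoid rounding surprises.
    """
    busy_ms_per_hour = transactions_per_hour * ms_per_transaction
    return max(1, -(-busy_ms_per_hour // MS_PER_HOUR))


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to illustrate the contrast.
    loans = processors_needed(transactions_per_hour=200, ms_per_transaction=500)
    cards = processors_needed(transactions_per_hour=18_000_000, ms_per_transaction=2)
    print(f"loan scoring: {loans} processor(s)")
    print(f"card fraud monitoring: {cards} processor(s)")
```

The estimate ignores queueing delays and bursts in arrivals, so a real system would need headroom beyond this minimum.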
When delivering a complex application, data mining is often only a small, albeit critical, part of the final product. For example, knowledge discovered through data mining may be combined with the knowledge of domain experts and applied to data in the database and incoming transactions. In a fraud detection system, known patterns of fraud may be combined with discovered patterns. When suspected cases of fraud are passed on to fraud investigators for evaluation, the investigators may need to access database records about other claims filed by the claimant as well as other claims in which the same doctors and lawyers were involved.
Model monitoring. You must, of course, measure how well your model has worked after you use it. However, even when you think you’re finished because your model works well, you must continually monitor the performance of the model. Over time, all systems evolve. Salespeople know that purchasing patterns change over time. External variables such as inflation rate may change enough to alter the way people behave. Thus, from time to time the model will have to be retested, retrained and possibly completely rebuilt. Charts of the residual differences between forecasted and observed values are an excellent way to monitor model results. Such charts are easy to use and understand, not computationally intensive, and could be built into the software that implements the model. Thus, the system could monitor itself.
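A self-monitoring check of this kind can be sketched simply. The sketch assumes a control-chart style rule: residuals from a baseline period set limits at three standard deviations around their mean, and later residuals outside those limits are flagged. That rule, the function names, and the toy numbers are assumptions made for illustration; the text itself only calls for charting the residuals.

```python
from statistics import mean, stdev


def residuals(forecast, observed):
    """Residual = observed value minus forecasted value, case by case."""
    return [o - f for f, o in zip(forecast, observed)]


def control_limits(baseline_residuals, width=3.0):
    """Limits at `width` standard deviations around the baseline mean."""
    center = mean(baseline_residuals)
    spread = stdev(baseline_residuals)
    return center - width * spread, center + width * spread


def out_of_control(new_residuals, limits):
    """Positions of residuals that fall outside the control limits."""
    low, high = limits
    return [i for i, r in enumerate(new_residuals) if r < low or r > high]


if __name__ == "__main__":
    # Toy residuals from a period when the model was known to work well.
    baseline = [1, -1, 2, -2, 0, 1, -1, 0]
    limits = control_limits(baseline)

    # Toy forecasts and observations from a later period.
    later = residuals(forecast=[100, 100, 100], observed=[101, 99, 110])
    flagged = out_of_control(later, limits)
    print(f"limits: {limits[0]:.2f} to {limits[1]:.2f}; flagged positions: {flagged}")
```

A run of flagged residuals, or residuals drifting steadily in one direction, would be a signal to retest and possibly retrain the model.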