Data Mining – Introduction

Dr R. Baskaran


 

Databases today can range in size into the terabytes — more than 1,000,000,000,000 bytes of data. Within these masses of data lies hidden information of strategic importance. But with so many trees, how do you draw meaningful conclusions about the forest? The newest answer is data mining, which is being used both to increase revenues and to reduce costs. Innovative organizations worldwide are already using data mining to locate and appeal to higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to error or fraud.

 

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. The first step in data mining is to describe the data — summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look for potentially meaningful links among variables (such as values that often occur together). Collecting, exploring and selecting the right data are critically important parts of this process. But an action plan cannot be derived from data description alone. You must build a predictive model based on patterns determined from known results, and then test that model on results outside the original sample. A good model should never be confused with reality (you know a road map isn’t a perfect representation of the actual road), but it can be a useful guide to understanding your business. The final step of the data mining process is to empirically verify the model.
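The describe, model, verify sequence above can be sketched in a few lines of Python. The customer records and the simple income-threshold "model" below are invented purely to illustrate the workflow, not a real mining algorithm:

```python
import statistics

# Hypothetical known results: (income in thousands, responded to offer?)
history = [(22, 0), (35, 1), (41, 1), (18, 0), (52, 1), (29, 0), (47, 1), (25, 0)]

# Step 1: describe the data with summary statistics.
incomes = [inc for inc, _ in history]
print("mean income:", statistics.mean(incomes))
print("std deviation:", round(statistics.stdev(incomes), 2))

# Step 2: build a model from known results (a naive threshold rule:
# predict a response when income is at least the mean responder income).
train, holdout = history[:6], history[6:]
threshold = statistics.mean(inc for inc, resp in train if resp == 1)

# Step 3: verify the model on results outside the original sample.
correct = sum((inc >= threshold) == bool(resp) for inc, resp in holdout)
print(f"hold-out accuracy: {correct}/{len(holdout)}")
```

A real project would replace the threshold rule with a proper algorithm, but the sequence (summarize, fit on one sample, verify on another) is the same.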

 

For example, suppose that from a database of customers who have already responded to a particular offer, you have built a model predicting which prospects are likeliest to respond to the same offer. Can you rely on this prediction? Don’t simply assume so — send a mailing to a portion of the new list and see what results you get.

 

Data mining is just a tool, not a magic wand. It does not sit in the database watching what happens and send you e-mail when it spots an interesting pattern. Nor does it eliminate the need to know your business, to understand your data, or to understand analytical methods. Data mining assists business analysts with finding patterns and relationships in the data — it does not tell you the value of those patterns to the organization. Furthermore, the patterns uncovered by data mining must be verified in the real world.

 

Remember that the predictive relationships found via data mining are not necessarily causes of an action or behavior. For example, data mining might determine that males with incomes between Rs.50,000 and Rs.65,000 who subscribe to certain magazines are likely purchasers of a product you want to sell. While you can take advantage of this pattern, say by aiming your marketing at people who fit it, you should not assume that any of these factors cause them to buy your product.

 

To ensure meaningful results, it’s vital that you understand your data. The quality of your output will often be sensitive to outliers (data values that are very different from the typical values in your database), irrelevant columns or columns that vary together (such as age and date of birth), the way you encode your data, and the data you leave in and the data you exclude. Algorithms vary in their sensitivity to such data issues, but it is unwise to depend on a data mining product to make all the right decisions on its own.

 

Data mining without proper guidance will not automatically discover solutions. Rather than setting the vague goal “improve the response to my direct mail solicitation,” you might use data mining to find the characteristics of people who (1) respond to your solicitation, or (2) respond AND make a large purchase. The patterns data mining finds for those two goals may be very different. Although a good data mining tool shelters you from the intricacies of statistical techniques, it still requires you to understand the workings of the tool you choose and the algorithms on which it is based. The choice of data mining tool and optimization algorithm will affect the accuracy and speed of your models.

 

Data mining will not replace a business analyst or manager; rather, it gives them a powerful new tool to improve the job they are doing. Any company that knows its business and its customers is already aware of many important, high-payoff patterns that its employees have observed over the years.

 

Data mining and data warehousing

 

Data to be mined is first extracted from an enterprise data warehouse into a data mining database or data mart. There is some real benefit if your data is already part of a data warehouse. The data mining database may be a logical rather than a physical subset of your data warehouse, provided that the data warehouse DBMS can support the additional resource demands of data mining. If it cannot, then you will be better off with a separate data mining database.

Figure 1: Data mart extracted from a Data Warehouse

 

Data mining, machine learning and statistics

 

Data mining takes advantage of advances in the fields of machine learning, artificial intelligence (AI) and statistics. Researchers in these disciplines have long worked on problems of pattern recognition and classification, and have made great contributions to the understanding and application of neural nets and decision trees.

 

Data mining does not replace traditional statistical techniques; rather, it is an extension of them. The development of most statistical techniques was based on elegant theory and analytical methods that worked quite well on the modest amounts of data being analyzed. The increased power and lower cost of computers, coupled with the need to analyze enormous data sets with millions of rows, have allowed the development of new techniques based on a brute-force exploration of possible solutions.

 

New techniques include relatively recent algorithms like neural nets and decision trees, and new approaches to older algorithms such as discriminant analysis. By virtue of bringing to bear the increased computer power on the huge volumes of available data, these techniques can approximate almost any functional form or interaction on their own. Traditional statistical techniques rely on the modeler to specify the functional form and interactions.

 

The key point is that data mining is the application of these and other AI and statistical techniques to common business problems in a fashion that makes these techniques available to the skilled knowledge worker as well as the trained statistics professional. Data mining is a tool for increasing the productivity of people trying to build predictive models.

 

A key enabler of data mining is the major progress in hardware price and performance. The dramatic 99% drop in the price of computer disk storage in just the last few years has radically changed the economics of collecting and storing massive amounts of data. At Rs.10/megabyte, one terabyte of data costs Rs.10,000,000 to store.

 

The drop in the cost of computer processing has been equally dramatic. Each generation of chips greatly increases the power of the CPU, while allowing further drops on the cost curve. This is also reflected in the price of RAM (random access memory), where the cost of a megabyte has dropped from hundreds of dollars to around a dollar in just a few years. PCs routinely have 64 megabytes or more of RAM, and workstations may have 256 megabytes or more, while servers with gigabytes of main memory are not a rarity.

 

While the power of the individual CPU has greatly increased, the real advances in scalability stem from parallel computer architectures. Virtually all servers today support multiple CPUs using symmetric multi-processing, and clusters of these SMP servers can be created that allow hundreds of CPUs to work on finding patterns in the data. Advances in database management systems to take advantage of this hardware parallelism also benefit data mining. If you have a large or complex data mining problem requiring a great deal of access to an existing database, native DBMS access provides the best possible performance. The result of these trends is that many of the performance barriers to finding patterns in large amounts of data are being eliminated.

 

Data mining applications

 

Data mining is increasingly popular because of the substantial contribution it can make both to increasing revenue and to controlling costs. Many organizations are using data mining to help manage all phases of the customer life cycle, including acquiring new customers, increasing revenue from existing customers, and retaining good customers. By determining the characteristics of good customers (data profiling), a company can target prospects with similar characteristics. By profiling customers who have bought a particular product, it can focus attention on similar customers who have not bought that product (cross-selling). By profiling customers who have left, a company can act to retain customers who are at risk of leaving (reducing churn or attrition), because it is usually far less expensive to retain a customer than to acquire a new one.

 

Data mining offers value across a broad spectrum of industries. Telecommunications and credit card companies are two of the leaders in applying data mining to detect fraudulent use of their services. Insurance companies and stock exchanges are also interested in applying this technology to reduce fraud. Medical applications are another fruitful area: data mining can be used to predict the effectiveness of surgical procedures, medical tests or medications. Companies active in the financial markets use data mining to determine market and industry characteristics as well as to predict individual company and stock performance. Retailers are making more use of data mining to decide which products to stock in particular stores (and even how to place them within a store), as well as to assess the effectiveness of promotions and coupons. Pharmaceutical firms are mining large databases of chemical compounds and of genetic material to discover substances that might be candidates for development as agents for the treatment of disease.

 

Successful data mining

 

There are two keys to success in data mining. The first is stating the problem precisely: a focused statement usually results in the best payoff. The second is using the right data. Whether you choose from the data available to you or buy external data, you may need to transform and combine it in significant ways. The more the model builder can “play” with the data (building models, evaluating results, and working with the data some more in a given unit of time), the better the resulting model will be. Consequently, the degree to which a data mining tool supports this interactive data exploration is more important than the algorithms it uses. Ideally, the data exploration tools (graphics/visualization, query/OLAP) are well integrated with the analytics or algorithms that build the models.

 

DATA DESCRIPTION FOR DATA MINING

 

Summaries and visualization

 

To build a predictive model, it is essential to understand the data first. This typically means gathering numerical summaries (descriptive statistics such as averages, standard deviations and so forth), looking at the distribution of values, and producing cross tabulations (pivot tables) for multi-dimensional data. Data can be continuous, having any numerical value (e.g., quantity sold), or categorical, fitting into discrete classes (e.g., red, blue, green). Categorical data can be further defined as either ordinal, having a meaningful order (e.g., high/medium/low), or nominal, that is, unordered (e.g., postal codes). Data visualization is most often the step that leads to new insight and success. Histograms and box plots are among the most common and useful graphical displays of data, showing distributions of values. To examine the relationships between two or three variables at a time, we may prefer scatter plots of pairs of variables.
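As a sketch, Python's standard library is enough for such first-pass summaries. The sales quantities and colour values below are invented purely for illustration:

```python
import statistics
from collections import Counter

# Hypothetical continuous variable: quantity sold per day.
quantity_sold = [12, 15, 11, 30, 14, 13, 16, 12]
print("mean:", statistics.mean(quantity_sold))
print("std deviation:", round(statistics.stdev(quantity_sold), 2))
print("median:", statistics.median(quantity_sold))  # less affected by the outlier 30

# Hypothetical nominal variable: product colour.
colours = ["red", "blue", "red", "green", "red", "blue"]
print("frequencies:", Counter(colours))  # a one-dimensional frequency table
```

Comparing the mean with the median already hints at the outlier; a histogram or box plot would make it visible at a glance.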

 

Visualization works because it exploits the broader information bandwidth of graphics as opposed to text or numbers. It allows people to see the forest and zoom in on the trees. Patterns, relationships, exceptional values and missing values are often easier to perceive when shown graphically, rather than as lists of numbers and text. The problem in using visualization stems from the fact that models have many dimensions or variables, but we are restricted to showing these dimensions on a two-dimensional computer screen or paper. For example, we may wish to view the relationship between credit risk and age, sex, marital status, own-or-rent, years in job, etc. Consequently, visualization tools must use clever representations to collapse n dimensions into two. Increasingly powerful and sophisticated data visualization tools are being developed, but they often require people to train their eyes through practice in order to understand the information being conveyed. Users who are color-blind or who are not spatially oriented may also have problems with visualization tools.

 

Clustering

 

Clustering divides a database into different groups. The goal of clustering is to find groups that are very different from each other, and whose members are very similar to each other. Unlike classification (see Predictive Data Mining, below), you don’t know what the clusters will be when you start, or by which attributes the data will be clustered. Consequently, someone who is knowledgeable in the business must interpret the clusters. Often it is necessary to modify the clustering by excluding variables that have been employed to group instances, because upon examination the user identifies them as irrelevant or not meaningful. After you have found clusters that reasonably segment your database, these clusters may then be used to classify new data. Some of the common algorithms used to perform clustering include Kohonen feature maps and K-means. Don’t confuse clustering with segmentation. Segmentation refers to the general problem of identifying groups that have common characteristics. Clustering is a way to segment data into groups that are not previously defined, whereas classification is a way to segment data by assigning it to groups that are already defined.
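The K-means idea mentioned above can be sketched on one-dimensional toy data (the points, k = 2, and the naive initialisation are all illustrative; real implementations handle many dimensions and choose starting centres more carefully):

```python
# Toy K-means: repeatedly assign each point to its nearest centre,
# then move each centre to the mean of the points assigned to it.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centres = [points[0], points[-1]]  # naive initialisation with k = 2

for _ in range(10):  # a few iterations are enough to converge here
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
        clusters[nearest].append(p)
    centres = [sum(c) / len(c) for c in clusters]

print(centres)  # the two group means, roughly [1.5, 8.5]
```

Note that the algorithm only finds the groups; deciding whether "low values" and "high values" are a meaningful business segmentation remains the analyst's job.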

 

Link analysis

 

Link analysis is a descriptive approach to exploring data that can help identify relationships among values in a database. The two most common approaches to link analysis are association discovery and sequence discovery. Association discovery finds rules about items that appear together in an event such as a purchase transaction. Market-basket analysis is a well-known example of association discovery. Sequence discovery is very similar, in that a sequence is an association related over time.

 

Associations are written as A → B, where A is called the antecedent or left-hand side (LHS), and B is called the consequent or right-hand side (RHS). For example, in the association rule “If people buy a hammer then they buy nails,” the antecedent is “buy a hammer” and the consequent is “buy nails.” It’s easy to determine the proportion of transactions that contain a particular item or item set: simply count them. The frequency with which a particular association (e.g., the item set “hammers and nails”) appears in the database is called its support or prevalence. If, say, 15 transactions out of 1,000 consist of “hammer and nails,” the support for this association would be 1.5%. A low level of support (say, one transaction out of a million) may indicate that the particular association isn’t very important — or it may indicate the presence of bad data (e.g., “male and pregnant”).

 

To discover meaningful rules, however, we must also look at the relative frequency of occurrence of the items and their combinations. Given the occurrence of item A (the antecedent), how often does item B (the consequent) occur? That is, what is the conditional predictability of B, given A? Using the above example, this would mean asking “When people buy a hammer, how often do they also buy nails?” Another term for this conditional predictability is confidence. Confidence is calculated as a ratio: (frequency of A and B)/(frequency of A).

 

Let’s specify our hypothetical database in more detail to illustrate these concepts:

Total hardware-store transactions: 1,000

Number which include “hammer”: 50

Number which include “nails”: 80

Number which include “lumber”: 20

Number which include “hammer” and “nails”: 15

Number which include “nails” and “lumber”: 10

Number which include “hammer” and “lumber”: 10

Number which include “hammer,” “nails” and “lumber”: 5

We can now calculate:

 

Support for “hammer and nails” = 1.5% (15/1,000)

Support for “hammer, nails and lumber” = 0.5% (5/1,000)

Confidence of “hammer → nails” = 30% (15/50)

Confidence of “nails → hammer” = 19% (15/80)

Confidence of “hammer and nails → lumber” = 33% (5/15)

Confidence of “lumber → hammer and nails” = 25% (5/20)

 

Thus we can see that the likelihood that a hammer buyer will also purchase nails (30%) is greater than the likelihood that someone buying nails will also purchase a hammer (19%). The prevalence of this hammer-and-nails association (i.e., the support is 1.5%) is high enough to suggest a meaningful rule.
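The support and confidence figures above follow directly from the transaction counts; as a sketch:

```python
# Counts from the hypothetical hardware-store example above.
total = 1000
count = {
    ("hammer",): 50,
    ("nails",): 80,
    ("lumber",): 20,
    ("hammer", "nails"): 15,
    ("hammer", "nails", "lumber"): 5,
}

def support(itemset):
    # Fraction of all transactions that contain the item set.
    return count[itemset] / total

def confidence(antecedent, itemset):
    # (frequency of A and B) / (frequency of A).
    return count[itemset] / count[antecedent]

print(support(("hammer", "nails")))                  # 0.015, i.e. 1.5%
print(confidence(("hammer",), ("hammer", "nails")))  # 0.3, i.e. 30%
print(confidence(("nails",), ("hammer", "nails")))   # 0.1875, i.e. about 19%
```

The asymmetry of confidence is visible in the code: both rules share the same numerator (the joint count of 15) but divide by different antecedent counts.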