3 Design of Learning System

Welcome to the e-PG Pathshala Lecture Series on Machine Learning. In this module we discuss the design of a learning system. Please note that we do not go into the details but give an overview of the design process.

Learning Objectives:

The learning objectives of this module are as follows:

To understand the various steps in the design of a learning system
To understand how to design a system that learns a model from data
To know about issues of feature selection and evaluation of a learning system

3.1 Basic Procedures in the Design of a Learning System

The steps in the design of a learning system can be listed as follows:

Choose the training set X and how to represent it.
Choose exactly what is to be learnt, i.e. the target function C.
Choose how to represent the target function C.
Choose a learning algorithm to infer the target function from the set.
Find an evaluation procedure and a metric to test the learned function.

3.2 Design Cycle

The design cycle is shown in Figure 3.1. The first step is the collection of data. The next step is the selection of features. This is an important step that can affect the overall learning effectiveness. In most cases, prior knowledge about the input data and what is to be learned is used in selecting appropriate features. The third step is model selection, which is essentially the selection of a model that will be used to fit the training data. Here again, prior knowledge about the data can be used to select the model. Once the model is selected, the learning step fine-tunes the model by selecting its parameters. Finally, the evaluation and testing step checks that the model with the selected parameters fits the data and also generalizes well.

3.3 Definition of Learning

Before we proceed, let us understand the meaning of learning in this context. We will explain learning using the example of the handwritten character recognition problem. In this scenario we can define the problem as follows:

Task T: Recognizing hand-written characters (as given in Figure 3.2)

Performance measure P: percentage of characters correctly classified

Training experience E: a database of hand-written characters with their classifications

Figure 3.2 Handwritten Characters

This example will be used throughout this module to explain the design steps.
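
The performance measure P defined above can be computed directly once the true and predicted classifications are available. The following is a minimal sketch; the function name `percent_correct` and the sample labels are illustrative assumptions and are not part of the original problem definition.

```python
def percent_correct(true_labels, predicted_labels):
    """Return the percentage of characters classified correctly (measure P)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * hits / len(true_labels)


# Hypothetical classifications for five handwritten characters.
true_labels = ["a", "b", "q", "q", "z"]
predicted_labels = ["a", "b", "q", "g", "z"]

# Four of the five characters are classified correctly.
accuracy = percent_correct(true_labels, predicted_labels)
print(accuracy)
```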

3.4 Details of the Design of a learning system

1. Collection of Data: As already explained, the first step is the collection of the data D = {d1, d2, ..., dm, ..., dn}, where each data point represents the input data and the corresponding output (Figure 3.3).

2. Feature Selection: Feature Selection is essentially the process of selecting relevant features for use in model construction. The selection of features depends on the learning problem as given in Example 3.1.

Example 3.1 Feature Selection

The students of a class have different attributes associated with them. Examples of such attributes include marks, native place, height etc. If the learning required is to find the association between native place and height, the marks feature should not be selected.

Feature selection can be done in two ways. The first is to reduce the number of attributes considered for each data point. This type of feature selection is called dimensionality reduction, as shown in Figure 3.4. The second is to reduce the number of data points considered, where the original D = {d1, d2, ..., dm, ..., dn} is reduced to D = {d1, ..., dm} with m < n (Figure 3.5).

Figure 3.4 Reduction of Attributes

Figure 3.5 Reduction of Data Points
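
The two kinds of reduction described above can be sketched on the student data of Example 3.1. The records, attribute names and helper functions below are hypothetical and chosen only for illustration. Keeping the first m points is also a simplifying assumption; in practice the points to keep might be chosen at random.

```python
# Hypothetical student records: each data point is a dict of attributes.
students = [
    {"marks": 82, "native_place": "Chennai", "height_cm": 165},
    {"marks": 74, "native_place": "Madurai", "height_cm": 172},
    {"marks": 91, "native_place": "Chennai", "height_cm": 158},
    {"marks": 66, "native_place": "Salem", "height_cm": 180},
]


def select_attributes(data, keep):
    """Reduce the number of attributes: keep only the named ones."""
    return [{name: point[name] for name in keep} for point in data]


def select_points(data, m):
    """Reduce the number of data points: keep the first m of the n points."""
    if not 0 < m <= len(data):
        raise ValueError("m must be between 1 and the number of data points")
    return data[:m]


# To relate native place and height, the marks attribute is dropped.
fewer_attributes = select_attributes(students, ["native_place", "height_cm"])

# Reduce n = 4 data points to m = 3.
fewer_points = select_points(students, 3)
```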

3. Model Selection: The next step is the model selection where we select a model that would most likely fit the data points. A linear model is one of the simplest models we should try to fit to the data. A model (its hypothesis) has a set of parameters; for example, a and b, the slope and the intercept in the simple linear model, shown in Figure 3.6.

$$y = ax + b + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$$

An error function needs to be optimized. A simple example of an error function is the mean squared error given below:

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2$$

In this function, n is the number of data points, yi is the actual output, and f(xi) is the output predicted by the function selected by the model (Figure 3.7). We will go into the details in later modules.

Figure 3.7 The Error Function
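
As a small worked illustration of the mean squared error, the sketch below evaluates it for a fixed candidate function. The data values and the candidate f(x) = 2x + 1 are assumptions made up for this example.

```python
def mean_squared_error(xs, ys, f):
    """Average of the squared differences between actual and predicted outputs."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def f(x):
    """Hypothetical candidate model with slope a = 2 and intercept b = 1."""
    return 2.0 * x + 1.0


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.5, 5.0, 6.0]   # actual outputs y_i

# Residuals are 0, 0.5, 0 and -1, so the error is (0.25 + 1) / 4.
error = mean_squared_error(xs, ys, f)
print(error)
```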

4. Learning: The learning step involves finding the values of the parameters that minimize the error (Figure 3.8).

5. Application: The final step is to apply (evaluate) the learned model, that is, to predict the outputs y for new, hitherto unseen inputs x using the learned function f(x).
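
For the simple linear model, the learning and application steps can be sketched as follows. The closed-form least-squares solution used here is one way to minimize the mean squared error and is an assumption of this sketch, since the module does not prescribe a particular optimization method. The data values are made up.

```python
def fit_line(xs, ys):
    """Return the slope a and intercept b that minimize the mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Learning step: estimate the parameters from hypothetical training data.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.1, 2.9, 5.2, 6.8]
a, b = fit_line(train_x, train_y)


def f(x):
    """The learned function, with the parameters now fixed."""
    return a * x + b


# Application step: predict the output for an input not seen during training.
prediction = f(4.0)
print(a, b, prediction)
```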

3.5 Processing Data

The data given to the learning system may require a lot of cleaning. Cleaning involves getting rid of errors and noise and removal of redundancies.

Data Pre-processing: Data Pre-processing is another important process for effective learning. Pre-processing techniques include renaming, rescaling, discretization, abstraction, aggregation and introducing new attributes.

Renaming or relabeling is the conversion of categorical values to numbers. However, this conversion may be inappropriate when used with some learning methods. Example 3.2 shows such cases, where the numbers impose an order on the values that is not warranted.

Example 3.2 Relabelling
Categorical Values	Conversion to Numbers	Remarks
High, Normal, Low	2,1,0	Right
True, False, Unknown	2,1,0	Wrong
Red, Blue, Green	2,1,0	Wrong
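
The contrast in Example 3.2 can be sketched in code. Ordered categories are mapped to integers that keep their order. For unordered categories, this sketch uses binary indicator vectors (often called one-hot encoding), an alternative that the module does not discuss and that is shown here only as an assumption about one way to avoid imposing an order. The function names are illustrative.

```python
def relabel_ordinal(values, ordered_categories):
    """Map ordered categories to integers that preserve their order."""
    codes = {category: rank for rank, category in enumerate(ordered_categories)}
    return [codes[value] for value in values]


def relabel_one_hot(values, categories):
    """Map unordered categories to binary indicator vectors (no implied order)."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


# Low < Normal < High is a genuine order, so 0, 1, 2 is appropriate.
levels = relabel_ordinal(["High", "Low", "Normal"], ["Low", "Normal", "High"])

# Colours have no natural order, so each colour gets its own position.
colours = relabel_one_hot(["Red", "Green"], ["Red", "Blue", "Green"])
print(levels, colours)
```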

Rescaling, also called normalization, is the transformation of continuous values to some range, typically [-1,1] or [0,1]. Discretization or binning involves the conversion of continuous values to a finite set of discrete values. Another technique is abstraction, where categorical values are merged together. In aggregation, actual values are replaced by values obtained with summary or aggregation operations, such as the minimum value, maximum value, average, etc. Finally, new attributes that define a relationship between existing attributes are sometimes introduced. An example is replacing the weight and height attributes by a new attribute, obesity-factor, which is calculated as weight/height. These pre-processing techniques should be used only when they do not adversely affect the learning.
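
Several of these techniques can be illustrated in a few lines. The sketch below assumes min-max rescaling, which is one common way to map values to a range, and uses made-up heights, weights and bin edges. The function names are illustrative.

```python
import bisect


def rescale(values, low=0.0, high=1.0):
    """Min-max rescaling of continuous values to the range [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_min == v_max:
        raise ValueError("cannot rescale constant values")
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]


def discretize(values, edges):
    """Binning: bin i holds the values v with edges[i-1] <= v < edges[i]."""
    return [bisect.bisect_right(edges, v) for v in values]


heights_cm = [150.0, 160.0, 170.0, 190.0]
weights_kg = [60.0, 64.0, 68.0, 95.0]

unit_range = rescale(heights_cm)                    # values in [0, 1]
symmetric_range = rescale(heights_cm, -1.0, 1.0)    # values in [-1, 1]
height_bins = discretize(heights_cm, [160.0, 175.0])  # three bins: 0, 1, 2
average_height = sum(heights_cm) / len(heights_cm)    # aggregation by average
obesity_factor = [w / h for w, h in zip(weights_kg, heights_cm)]  # new attribute
```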

3.5.1 Data biases

It is important to watch out for data biases. For this, we need to understand the data source. It is very easy to derive “unexpected” results when data used for analysis and learning are biased (pre-selected). The results or conclusions derived for pre-selected data do not hold for general cases (Example 3.3).

Example 3.3 Risks in pregnancy study

Survey: The sample survey on risks in pregnancy was sponsored by DARPA at various military hospitals. The study was conducted on a large sample of pregnant women.

Conclusion: The factor with the largest (statistically significant) impact on reducing risks during pregnancy is the pregnant woman being single. That is, the conclusion is that single women have the least risk. What is wrong with this conclusion?

3.6 Feature Selection

Sometimes the size (dimension) of a sample can be enormous. The selection of features requires prior knowledge about the characteristics of the input data. A typical example is document classification, where the document corpus can be represented by 10,000 different words and the data are counts of occurrences of these words. Such a data collection entails the learning of too many parameters but may not provide enough samples to justify the estimation of the parameters of the model.

Feature selection reduces the feature set by removing input features; this is called dimensionality reduction. One method is to replace the raw inputs with a smaller set of derived features. Another is to extract the relevant inputs using a measure such as mutual information. Principal Component Analysis (PCA) is a method that mathematically reduces the dimension of the feature space. In document classification, yet another method is to group (cluster) similar words using a suitable similarity measure and to replace each group of words with a group label.
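
As an illustration of selecting relevant inputs with mutual information, the sketch below scores two word features against document classes and keeps the informative one. The tiny data set, the word names and the selection threshold are assumptions made up for this example. The mutual information is estimated from simple counts.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in bits) between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) * p(y)) written in terms of raw counts.
        ratio = count * n / (feature_counts[x] * label_counts[y])
        mi += (count / n) * math.log2(ratio)
    return mi


# Hypothetical corpus: four documents, with 1 if the word occurs, else 0.
labels = ["sport", "sport", "politics", "politics"]
word_presence = {
    "goal": [1, 1, 0, 0],   # occurs only in sport documents
    "the": [1, 0, 1, 0],    # occurs equally often in both classes
}

scores = {word: mutual_information(values, labels)
          for word, values in word_presence.items()}

# Keep only the words whose score exceeds an (arbitrary) threshold.
selected = [word for word, score in scores.items() if score > 0.5]
print(scores, selected)
```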

3.7 Model Selection

The next important step in the design of the learning system is model selection. Again, prior knowledge about the data collection helps in effective model selection; however, only an estimate can be made. Initial data analysis and visualization can help to make a good guess about the form of the distribution or the shape of the function. Independences and correlations in the data collection can help in selecting a model. The over-fitting problem may arise, especially in the presence of bias and variance. Over-fitting is the problem of selecting a function that exactly fits the training data but does not generalize, so that it cannot make good predictions about unseen data. In other words, a model over-fits if it fits particularities of the training set such as noise or bias.
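
Over-fitting can be demonstrated on a small constructed example. In the sketch below, six training points lie on the line y = 2x + 1 with alternating noise of +1 and -1. A polynomial that passes exactly through every training point is compared with a fitted straight line. The data, the noise pattern and the choice of an interpolating polynomial as the over-fitting model are all assumptions made for illustration.

```python
def lagrange_interpolant(xs, ys):
    """Return the polynomial that passes exactly through every (x, y) pair."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


def fit_line(xs, ys):
    """Return the least-squares straight line as a function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def mse(f, xs, ys):
    """Mean squared error of f on the given points."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
train_y = [2.0 * x + 1.0 + e for x, e in zip(train_x, noise)]

# Unseen inputs with noise-free outputs from the underlying line.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [2.0 * x + 1.0 for x in test_x]

complex_model = lagrange_interpolant(train_x, train_y)
simple_model = fit_line(train_x, train_y)

complex_train, complex_test = mse(complex_model, train_x, train_y), mse(complex_model, test_x, test_y)
simple_train, simple_test = mse(simple_model, train_x, train_y), mse(simple_model, test_x, test_y)
print(complex_train, complex_test, simple_train, simple_test)
```

The interpolating polynomial has zero training error because it also fits the noise, yet its error on the unseen points is much larger than that of the straight line.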

3.7.1 Avoiding Over-fitting

One method to avoid over-fitting is to ensure that there is a sufficient number of examples in the training set. Another technique, which will also be used later for evaluation, is the hold-out method. In this method we hold some data out of the training set, train (fit) the model on the remaining training data, and finally use the held-out data for fine-tuning the model.

Another important mathematical technique is regularization, which is the process of introducing additional information in order to prevent over-fitting. This information is usually in the form of a penalty for complexity. A model should be selected based on the Occam’s razor principle (proposed by William of Ockham), which states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory. In other words, the simplest hypothesis (model) that fits almost all the data is preferred over more complex ones; there is therefore an explicit preference for simple models.
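
A penalty for complexity can be sketched for the simple linear model. The version below adds a squared (L2) penalty on the slope to the sum of squared errors. This particular penalty is one common choice and is an assumption here, as the module does not specify the form of the penalty. The data values are made up.

```python
def fit_line_l2(xs, ys, penalty):
    """Minimize sum((y - a*x - b)**2) + penalty * a**2 over a and b."""
    if penalty < 0:
        raise ValueError("penalty must be non-negative")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx + penalty == 0:
        raise ValueError("slope is undefined for constant x without a penalty")
    a = sxy / (sxx + penalty)   # the penalty shrinks the slope towards zero
    b = mean_y - a * mean_x     # the intercept is not penalized
    return a, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

a_plain, b_plain = fit_line_l2(xs, ys, penalty=0.0)   # ordinary least squares
a_reg, b_reg = fit_line_l2(xs, ys, penalty=5.0)       # simpler, flatter model
print(a_plain, a_reg)
```

With a penalty of zero the result is the ordinary least-squares line; a larger penalty expresses a stronger preference for the simpler (flatter) model.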

3.8 Evaluation

Evaluation methods range from a simple single split of the data to more complex methods that split the data in several different ways.

3.8.1 Hold out Method

As already discussed, the simplest evaluation method is the holdout method. In this method the data is divided into training and test data sets. Typically 2/3 of the data is used as the training set and the other 1/3 is used as the testing set (Figure 3.10 (a) & Figure 3.10 (b)).

If we want to compare the predictive performance of two different learning methods on a classification or regression problem, we compare their errors on the test data set and choose the method with the smaller testing error, since this indicates a better generalization error.
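
The split used by the holdout method can be sketched as follows. Shuffling before splitting and the fixed random seed are assumptions of this sketch; the function name `holdout_split` is illustrative.

```python
import random


def holdout_split(data, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the data and split it into training and test sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Twelve hypothetical data points: 2/3 for training, 1/3 for testing.
data = list(range(12))
train_set, test_set = holdout_split(data)
print(len(train_set), len(test_set))
```

Each learning method being compared would be fitted on `train_set` and its error measured on `test_set`.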

3.8.2 Complex Methods

The complex methods use multiple train/test sets based on various random re-sampling schemes such as cross-validation, random sub-sampling, and bootstrap (Figure 3.11).

What changes with the sampling method is the block of Figure 3.11 that generates the multiple training and test sets. In random sub-sampling, the simple holdout method, with a random split of the data into 70% for training and 30% for testing, is repeated k times. In k-fold cross-validation, the data is divided into k disjoint groups; each group in turn is used for testing while the rest of the data is used for training. Typically 10-fold cross-validation is used. Leave-one-out cross-validation is the special case in which k equals the size of the data D. In the bootstrap, a training set of size N, equal to the size of the data D, is drawn by sampling with replacement. These concepts are shown in Figure 3.12.

Figure 3.12 Sampling Methods
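
The three re-sampling schemes can be sketched as generators of index sets. The function names and seeds are illustrative. Using the points never drawn by the bootstrap as its test set is an assumption of this sketch, since the text above specifies only how the bootstrap training set is formed.

```python
import random


def random_subsampling(n, k, train_fraction=0.7, seed=0):
    """Repeat a random holdout split k times; yield (train, test) index lists."""
    rng = random.Random(seed)
    cut = round(n * train_fraction)
    for _ in range(k):
        indices = list(range(n))
        rng.shuffle(indices)
        yield indices[:cut], indices[cut:]


def k_fold(n, k):
    """Split indices into k disjoint folds; each fold is the test set once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of data points")
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def bootstrap(n, seed=0):
    """Draw n training indices with replacement; undrawn indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


subsamples = list(random_subsampling(10, k=3))
five_fold = list(k_fold(10, 5))
leave_one_out = list(k_fold(4, 4))   # k equal to the size of the data
boot_train, boot_test = bootstrap(10)
```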

3.9 Illustrative Example of the Process of Design

We use handwritten character recognition (Figure 3.13) as an example to illustrate the design issues and approaches.

We explain learning to perform a task from experience. Therefore, let us first understand what is meant by a task. A task can often be expressed through a mathematical function. In this case the input can be x, the output y, and w the parameters that are “learned”. In the case of classification, the output y will be discrete (e.g. class membership, posterior probability, etc.). For regression, y will be continuous. For character recognition the task is as shown in Figure 3.14.

The following are the steps in the design process for character recognition.

Step 0: Let us treat the learning system as a black box (Figure 3.15). Here we assume that the input is a set of handwritten characters and the output is the letter q.

Step 1: Next we collect Training Examples (Experience). Without examples, our system will not learn as we are learning from examples (Figure 3.16).

Step 2: Representing Experience

The next step is to choose a representation for the experience/examples. In our example the sensor input can be represented by an n-dimensional vector, called the feature vector, X = (x1, x2, x3, …, xn). We can assume a 64-d vector to represent the 8x8 matrix of pixels (Figure 3.17).

In order to represent the experience, we need to know what X is. Therefore we need a corresponding vector D, which records our knowledge (experience) about X. The experience E is a pair of vectors E = (X, D). Now the question is how to represent D. Assuming our system is to recognise the 10 digits only, D can be a 10-d binary vector in which each element corresponds to one of the digits (Figure 3.18).
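
The representation of the experience E = (X, D) can be sketched as follows. The row-by-row flattening order, the function names and the sample image are assumptions made for illustration.

```python
def flatten_image(pixels):
    """Turn an 8x8 matrix of pixel values into a 64-dimensional feature vector X."""
    if len(pixels) != 8 or any(len(row) != 8 for row in pixels):
        raise ValueError("expected an 8x8 matrix of pixels")
    return [value for row in pixels for value in row]


def encode_digit(digit):
    """Return the 10-dimensional binary vector D with a 1 at the digit's position."""
    if digit not in range(10):
        raise ValueError("digit must be an integer from 0 to 9")
    return [1 if i == digit else 0 for i in range(10)]


# Hypothetical image: a vertical stroke in column 3, labelled as the digit 1.
image = [[1 if col == 3 else 0 for col in range(8)] for _ in range(8)]

X = flatten_image(image)   # 64-d feature vector
D = encode_digit(1)        # 10-d binary vector
E = (X, D)                 # one item of experience
```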

Step 3: Choose a Representation for the Black Box

The next step is to choose a representation for the black box. Here we need to choose a function F to approximate the black box, so that for a given X the value of F gives the classification of X.

Step 4: Learning/Adjusting the Weights

We need a learning algorithm to adjust the weights such that the experience from the training data can be incorporated into the system, where experience E is represented in terms of input X and expected output D. The function F(X) would be modified with weights W to obtain the learned output L as given in Figure 3.20.

Step 5: Use/Test the System

After learning is completed, all parameters are fixed and an unknown input X can be presented to the system for which the system computes its answer according to the function F(W,X) (Figure 3.21).
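
Steps 3 to 5 can be sketched with one possible choice of F. The module does not specify the form of F or the learning algorithm, so the linear scoring function and the perceptron-style update rule below are assumptions made purely for illustration, as is the tiny three-input, three-class data set standing in for the 64-d inputs and 10 digit classes.

```python
def F(W, X):
    """Black box: one linear score per class; the highest score wins (one-hot L)."""
    scores = [sum(w * x for w, x in zip(row, X)) for row in W]
    best = scores.index(max(scores))
    return [1 if k == best else 0 for k in range(len(W))]


def learn(examples, n_inputs, n_classes, epochs=10):
    """Adjust W with the rule W[k] += (D[k] - L[k]) * X until outputs match."""
    W = [[0.0] * n_inputs for _ in range(n_classes)]
    for _ in range(epochs):
        mistakes = 0
        for X, D in examples:
            L = F(W, X)               # learned output for this input
            if L != D:                # compare with the expected output
                mistakes += 1
                for k in range(n_classes):
                    for i in range(n_inputs):
                        W[k][i] += (D[k] - L[k]) * X[i]
        if mistakes == 0:             # all training examples are reproduced
            break
    return W


# Hypothetical experience: pairs (X, D) with 3 inputs and 3 classes.
experience = [
    ([1.0, 0.0, 0.0], [1, 0, 0]),
    ([0.0, 1.0, 0.0], [0, 1, 0]),
    ([0.0, 0.0, 1.0], [0, 0, 1]),
]

W = learn(experience, n_inputs=3, n_classes=3)

# Use/test: the weights are now fixed and an unknown input is classified.
answer = F(W, [0.1, 0.9, 0.2])
print(answer)
```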

Summary

In this module we have:

Explained the basic steps in the design of a learning system
Outlined how a function is chosen to fit the data and the parameters are tuned to minimize the error
Discussed some methods of evaluation
Explained the Design Process using an example

