5 Types of Learning II
Welcome to the e-PG Pathshala Lecture Series on Machine Learning. In the previous module we discussed one of the types of learning – the supervised learning methodology. In this module we will discuss the other types of learning methods.
Learning Objectives:
The learning objectives of this module are as follows:
• To know about Unsupervised, Semi-supervised, and Reinforcement Machine Learning methods
• To understand the need and applications of these Machine Learning methods
• To understand the differences and complexity involved in applying these methods and where they are useful.
5.1 Unsupervised learning
Unsupervised Learning is an important technique of machine learning because most big datasets are without labels. Unsupervised learning tries to find hidden structure in unlabelled data, where there is no error or reward signal to evaluate a potential solution since the data is unlabelled. The basic idea of this learning method is to create an internal representation of the input, e.g. a representation of extracted features, and then learn the patterns in the input without any specified output.
5.1.1 Basic Flow of the Unsupervised Learning Method
The basic flow of an unsupervised learning system is shown in Figure 5.1. The input to the system could be a set of training data which could be text, images or audio. Let us assume that in our case the input is a set of documents. Each document has been pre-processed and 5 features have been extracted and represented. Remember there is no label (or required output) associated with the input. This input representation is given to the machine learning algorithm, which builds a predictive model based on the characteristics of the input data. Now a new unknown document (test data) is given, which is similarly pre-processed to obtain its feature representation. This is given to the predictive model, which then gives a likelihood, a cluster label or a better representation of the test data.
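A minimal sketch of this flow in Python is given below, assuming scikit-learn is available; the toy documents, the TF-IDF feature representation and the choice of K-Means as the learning algorithm are illustrative assumptions rather than part of the module.

# A minimal sketch of the unsupervised flow described above.
# Assumptions: scikit-learn is available; the toy documents and the
# choice of K-Means with 2 clusters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

train_docs = [
    "machine learning finds patterns in data",
    "clustering groups similar data points",
    "stock markets rise and fall with the economy",
    "investors watch share prices daily",
]

# Pre-processing: extract a feature representation (TF-IDF vectors).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Build a predictive model from the unlabelled training data.
model = KMeans(n_clusters=2, random_state=0, n_init=10)
model.fit(X_train)

# A new, unknown document (test data) is pre-processed the same way
# and the model returns a cluster label for it.
test_doc = ["share prices of technology companies"]
X_test = vectorizer.transform(test_doc)
print("Cluster label for the test document:", model.predict(X_test)[0])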
5.1.2 Unsupervised Learning – Variations
There are many variations of the unsupervised learning method based on what function or problem the unsupervised learning approach is trying to solve.
- Clustering – This is the task of grouping a set of objects based on their features in such a way that objects in the same group (called a cluster) exhibit more feature similarity than objects in other groups (clusters). We will discuss clustering in detail in section 5.2
- Probability distribution estimation – This is the task of trying to find the underlying probability distribution from which the samples are drawn.
- Finding association (in features) – Association rule learning is a method designed to discover interesting relations between features.
- Dimensionality reduction is the mapping of data to a lower dimensional space such that uninformative variance in the data is discarded. In other words, dimensionality reduction produces a compact low-dimensional encoding of a given high-dimensional data set; it thus finds a subspace representing the data (see the sketch after this list). Methods of dimensionality reduction include Factor Analysis and Principal Component Analysis (PCA). Associated with this task is Independent Component Analysis or Dictionary Learning, which finds (small) sets of factors for the observations.
- Density estimation in statistics is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is assumed to be the density according to which a large population is distributed and the data can be assumed to be a random sample from that population. Here, the typical methods include Gaussian mixture model (GMM) and graphical models.
- Sequence analysis tries to find a latent causal sequence for observations. In bioinformatics, sequence analysis can be used to assign function to genes and proteins by studying the similarities between the compared sequences. Sequence matching, also called sequence alignment, and sequence segmentation are important subtasks of sequence analysis. Important techniques used include the Hidden Markov Model (for discrete states) and the Kalman filter (for continuous states).
- Novelty detection is the identification of new or unknown data that a machine learning system has not been trained with and was not previously aware of with the help of either statistical or machine learning based approaches. Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen. It is a useful technique in cases where an important class of data is under-represented in the training set.
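As a concrete illustration of the dimensionality reduction variation mentioned above, the sketch below projects a small high-dimensional dataset onto two principal components; the synthetic data, the number of components and the use of scikit-learn's PCA are assumptions made only for illustration.

# A hedged sketch of dimensionality reduction with PCA; the synthetic
# 10-dimensional data and the target of 2 components are assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples in 10 dimensions, where most variance lies in 2 directions.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)  # compact low-dimensional encoding

print("Original shape:", X.shape)      # (100, 10)
print("Encoded shape:", X_low.shape)   # (100, 2)
print("Variance explained:", pca.explained_variance_ratio_.sum())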
5.1.3 Components of Unsupervised Learning Task
The unsupervised learning task can follow different paths depending on the specific application and the data available (Figure 5.2). The raw data can be processed to obtain the distance or similarity matrix; this can undergo either non-hierarchical or hierarchical clustering, and the results can then be validated. Another path is when the distance or similarity matrix is subjected to dimensionality reduction before clustering takes place. Alternatively, the raw data can directly undergo dimensionality reduction, be graphically represented and then be validated. The other paths shown in Figure 5.2 are also possible.
5.1.4 Applications of Unsupervised Learning
In applications where the data is large or labeling the data is a tedious and time consuming task, unsupervised learning is used since, as we know, the input data given to the learning system need not specify the output (label) for every input sample of the training data. However, this does not mean that an insufficient amount of unlabeled data is enough. Some of the typical applications that use unsupervised learning are listed below:
- Face detection, object detection and recognition – here the goal of unsupervised learning is to discover the underlying probability distributions of sensory inputs. The identification of an object in an image nearly always depends on the physical causes of the image rather than the pixel intensities.
- Image segmentation
- Multimedia event detection
- Economic and commercial applications
5.2 Clustering – an Unsupervised Approach
For explaining the concept of unsupervised learning we consider clustering (Figure 5.3). In Figure 5.3 (a) we see data points represented graphically. In Figure 5.3 (b) the points are grouped to form two clusters, while in Figure 5.3 (c) they are grouped to form three clusters.
Clustering is about finding a set of prototypes representing the data. Typical methods include K-Means and Spectral Clustering. Hierarchical clustering is a method of clustering which seeks to build a hierarchy of clusters (Figure 5.4). As we know, clustering is an unsupervised learning task; in other words, there is no target value to shoot for. Instead, groups of “similar” data points that are “dissimilar” from other groups are discovered. Thus clustering finds similarity groups in the given data, called clusters. Essentially, during clustering we partition the data into groups (clusters) that satisfy the constraints that points in the same cluster should be similar and points in different clusters should be dissimilar. Items can be clustered based on feature similarity, or words based on similarity of usage. Examples include identifying Internet communities, identifying celestial objects, and clustering geographical regions based on population density (Figure 5.5).
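A minimal K-Means sketch in the spirit of Figure 5.3 is given below; the synthetic 2-D points, the three groups and the choice of scikit-learn's KMeans are illustrative assumptions.

# A minimal K-Means sketch: group synthetic 2-D points into 3 clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three loose groups of 2-D points (assumed toy data).
points = np.vstack([
    rng.normal(size=(30, 2)) * 0.5 + [0, 0],
    rng.normal(size=(30, 2)) * 0.5 + [4, 0],
    rng.normal(size=(30, 2)) * 0.5 + [2, 3],
])

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(points)
print("Cluster labels of the first 10 points:", kmeans.labels_[:10])
print("Cluster prototypes (centroids):")
print(kmeans.cluster_centers_)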
5.2.1 Hard and Soft Clustering
Based on the cluster assignment, clustering is divided into two types, namely:
- Hard Clustering: assigning a data point to only one cluster with maximum similarity.
- Soft / Fuzzy Clustering: assigning a data point to one or more clusters depending on likelihood of belonging to that cluster.
We will study these clustering approaches in detail in the module on clustering and clustering algorithms.
5.2.2 What is Clustering for?
Now the next question is: “What is the need for clustering? Is it useful in real life or in actual business?” Yes, clustering is one of the most commonly used machine learning techniques in situations where there is a large amount of data that has not been labeled. Today clustering plays a major role in many Big Data analytics and business intelligence applications. Let us look at some examples to illustrate this point.
Example 1 – Size based Clustering: Clustering groups people of similar sizes together to make “small”, “medium” and “large” T-shirts. This is necessary because tailor-making a shirt for each person would be too expensive, while a one-size-fits-all approach does not satisfy the customer since one size does not fit all.
Example 2 – Targeted Marketing: In marketing, clustering segments customers according to their similarities. This allows the marketing department to carry out targeted marketing, a very important application especially in today’s context of online shopping.
Example 3 – Topic Hierarchy: Given a collection of text documents, clustering organizes these documents based on their content similarities, thus producing a topic hierarchy. In the area of online newspapers, this allows the agency to send appropriate articles, based on the topic hierarchy, to interested subscribers.
5.2.3 The Clustering Process
Now let us study the actual process of clustering. The various steps are as follows:
5.2.3.1 Pattern Representation
The first step is the representation of the patterns or features of the data points. Patterns are conventionally represented as multidimensional vectors, where each dimension is a single feature. The design issues include the number of classes and the number of available patterns such as circles, ellipses, squares, etc. An associated aspect is feature selection, where in some cases wrappers and filters are used for the selection. Another aspect is feature extraction, which can even involve the production of new features obtained by summarizing the existing features. This can be done using dimensionality reduction techniques like Principal Component Analysis (PCA).
5.2.3.2 Distance measures
The next step is the definition of an appropriate pattern proximity measure. This is an important step since we want clusters of instances that are similar to each other but dissimilar to others. Therefore there is a need for a similarity measure, also called a distance measure.
The distance measure usually depends on the type of attributes. For nominal attributes, the distance measure is given (Figure 5.6) as:
In the continuous case, we denote distance with: dist(xi, xj), where xi and xj are data points (vectors). We can define the distance between the points xi and xj using the general Minkowski measure given below (Figure 5.7) where h is a positive integer.
In the continuous case, common distance measures such as the Euclidean distance can be derived from the Minkowski measure; the Mahalanobis distance is another widely used measure. When h = 2, we get the Euclidean distance (Figure 5.8), which can be used for compact isolated clusters.
When h = 1 we get the Manhattan distance (Figure 5.9).
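Assuming the Minkowski form dist(xi, xj) = (Σ_k |xi_k − xj_k|^h)^(1/h) referred to in Figure 5.7, the short sketch below computes it for general h and shows the Euclidean (h = 2) and Manhattan (h = 1) special cases; the example vectors are assumptions for illustration.

# Minkowski distance and its h = 2 (Euclidean) and h = 1 (Manhattan) cases.
import numpy as np

def minkowski(xi, xj, h):
    # dist(xi, xj) = (sum_k |xi_k - xj_k|^h)^(1/h)
    return np.sum(np.abs(xi - xj) ** h) ** (1.0 / h)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])

print("Euclidean (h=2):", minkowski(xi, xj, 2))  # sqrt(9 + 4 + 0) ≈ 3.606
print("Manhattan (h=1):", minkowski(xi, xj, 1))  # 3 + 2 + 0 = 5.0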
5.2.3.3 Aspects of Clustering
Once the representation of the patterns and the definition of an appropriate pattern proximity (distance) measure are decided, the actual clustering is carried out using an appropriate algorithm. There are basically two types of clustering algorithms: partitioned clustering and hierarchical clustering. We will discuss these methods in detail later on in the modules on clustering. The quality of a clustering result depends on the algorithm, the distance (similarity or dissimilarity) function, and the application. The quality of clustering increases when the inter-cluster distance is maximized and the intra-cluster distance is minimized.
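The rough sketch below illustrates this quality criterion: it computes mean intra-cluster distances and the inter-cluster (centroid) distance for two hand-made clusters, which are assumptions used only for illustration.

# Quality intuition: good clusters have small intra-cluster distances
# and a large inter-cluster distance. The two clusters are hypothetical.
import numpy as np

cluster_a = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]])
cluster_b = np.array([[5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])

centroid_a, centroid_b = cluster_a.mean(axis=0), cluster_b.mean(axis=0)

intra_a = np.linalg.norm(cluster_a - centroid_a, axis=1).mean()
intra_b = np.linalg.norm(cluster_b - centroid_b, axis=1).mean()
inter = np.linalg.norm(centroid_a - centroid_b)

print("Mean intra-cluster distances:", intra_a, intra_b)  # small
print("Inter-cluster (centroid) distance:", inter)        # large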
5.2.3.4 Data Abstraction
The next step is the data abstraction process where each cluster is abstracted and represented as a cluster profile. Two of the most common data abstraction (DA) techniques are sampling and clustering. In clustering, data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns (centroid). In document retrieval and information filtering, millions of patterns with a dimensionality of more than 100 are clustered to achieve data abstraction. It is unknown in many cases how well the selected abstraction represents the whole dataset or how reliable the patterns discovered based on this abstraction are.
5.2.3.5 Cluster Validation
Evaluating clustering quality is very hard because the correct clusters are not known beforehand. Some of the methods used are user inspection, studying centroids and spreads, or examining rules from a decision tree. For text documents, one can read some documents in each cluster. Other measures include Entropy, which is often used to quantify the disorder of a set of data, and Purity, which measures the extent to which a cluster contains only one class of data.
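Assuming the true class of each clustered item is known for evaluation purposes, the sketch below computes purity and (weighted) entropy for a small hypothetical cluster assignment; the cluster contents are invented for illustration only.

# Purity and entropy of a clustering, given true class labels for evaluation.
import math
from collections import Counter

# cluster id -> true class labels of the items assigned to that cluster
clusters = {
    0: ["sports", "sports", "sports", "politics"],
    1: ["politics", "politics", "sports"],
}
total = sum(len(members) for members in clusters.values())

# Purity: fraction of items belonging to the majority class of their cluster.
purity = sum(max(Counter(m).values()) for m in clusters.values()) / total

# Entropy: size-weighted average of each cluster's class entropy.
entropy = 0.0
for members in clusters.values():
    n = len(members)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(members).values())
    entropy += (n / total) * h

print("Purity:", purity)    # higher is better (maximum 1.0)
print("Entropy:", entropy)  # lower is better (0 means pure clusters)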
5.3 Semi-supervised learning
When the training data has only a few desired outputs (labels), unlabelled data used together with the small amount of labelled data can produce a considerable improvement in learning. One of the reasons that motivates the use of semi-supervised learning is the fact that labeling is expensive, difficult, and sometimes unreliable (for example, in segmentation applications). Sometimes multiple experts are needed, which can prove to be expensive. On the other hand, unlabeled examples are easy to obtain in large numbers (e.g. web pages, text documents, etc.).
5.3.1 Semi-supervised Learning
Semi-supervised classification can be transductive, that is, it predicts the labels of the given unlabelled data using the labelled data. We know that inductive learning, in contrast, involves learning a prediction function covering the whole sample space. Almost every learning problem has a semi-supervised counterpart: for clustering the semi-supervised counterpart is constrained clustering, while for ranking it is semi-supervised ranking.
5.3.2 When does Semi-supervised Learning work?
In order to make any use of unlabelled data, semi-supervised learning must assume some structure to the underlying distribution of the data. Semi-supervised learning algorithms make use of the following assumptions:
- Smoothness assumption – If two points x1, x2 are close, then so should be the corresponding outputs y1, y2
- Cluster assumption – If points are in the same cluster, they are likely to be of the same class.
- Manifold assumption – The (high-dimensional) data lie (roughly) on a low-dimensional manifold.
Now let us see how unlabeled data can be helpful (Figure 5.10). The top panel shows a decision boundary we may learn when we see only one positive (white circle) and one negative (black circle) sample. The bottom panel shows a decision boundary we might learn, given a collection of unlabelled data (gray circles). Figure 5.11 shows an example where unlabelled examples help to learn decision boundaries that are circular.
5.3.3 Semi-Supervised Learning Methods
Now that we have seen how semi-supervised learning techniques work, let us describe some common methods. These include:
- Generative Model: This is classification with additional information on the marginal density or, alternatively, clustering with additional information.
- Low density separation: This implements the low-density separation assumption, for example by maximizing the margin of the decision boundary with respect to the unlabelled points.
- Graph based methods: This builds on the concept of manifold assumption.
- Change of Representation: This method first performs an unsupervised step on all the data, ignoring the labels, to obtain a new representation. It then ignores the unlabeled data and performs plain supervised learning on the labelled data using this new representation, as sketched below.
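The sketch below illustrates the change-of-representation idea under stated assumptions: PCA is an assumed choice of unsupervised step, logistic regression an assumed choice of supervised learner, and the data is synthetic.

# Change of representation: fit PCA on ALL data (labels ignored), then
# train a plain supervised classifier only on the small labelled subset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labelled = rng.normal(size=(20, 50))       # few labelled samples
y_labelled = rng.integers(0, 2, size=20)     # their labels
X_unlabelled = rng.normal(size=(500, 50))    # many unlabelled samples

# Unsupervised step on all data (labels ignored).
pca = PCA(n_components=5).fit(np.vstack([X_labelled, X_unlabelled]))

# Plain supervised learning in the new representation.
clf = LogisticRegression().fit(pca.transform(X_labelled), y_labelled)
print("Predicted labels for some unlabelled points:",
      clf.predict(pca.transform(X_unlabelled[:5])))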
5.4 Reinforcement Learning
Supervised (inductive) learning where labeled data is available is the simplest type of learning. But how can an agent learn behaviors when it doesn’t have a teacher (or labelled samples) to tell it how to perform? The agent has a task to perform and takes action but at some later point, it gets feedback telling it how well it did on performing the task. The agent performs the same task over and over again. This type of learning is called reinforcement learning. There is positive reinforcement for tasks done well while there is negative reinforcement for tasks done poorly. This concept is illustrated in Figure 5.13.
Examples of such learning are best explained with game playing: when playing chess the reward comes at the end of the game, whereas for games such as ping-pong there is a reward for each point scored. In essence it is a trial-and-error learning paradigm with rewards and punishments. In other words, we learn about the system from its behavior and control it with minimal feedback. This paradigm has been inspired by behavioral psychology. The goal is to get the agent to act in the world so as to maximize its rewards. In this context the agent has to figure out what it did that earned it the reward/punishment. This is known as the credit assignment problem.
Reinforcement learning approaches can be used to train computers to do many tasks, such as playing backgammon or chess, or controlling robot limbs. By simulating rewards from sequences of actions, this method can be used in decision making (robots, chess machines). The method can learn actions so as to maximize the payoff. It is useful when there is not much information in a payoff and the payoff is often delayed. We can learn from reinforcement or (occasional) rewards; this is the most general form of learning. An example is an agent that learns how to play Backgammon by playing against itself; it gets a reward at the end of each game. We only get feedback in the form of how well we are doing (not what we should do); therefore there is no supervised output, only a delayed reward. Examples where reinforcement learning can be used include a robot cleaning a room and recharging its battery, robot soccer, deciding how to invest in shares, modeling the economy through rational agents, learning how to fly a helicopter, and scheduling planes to their destinations.
5.4.1 Basic Reinforcement Model
The basic reinforcement learning model consists of:
- a set of environment states
- a set of actions
- rules of transitioning between states;
- rules that determine the scalar immediate reward of a transition; and
- rules that describe what the agent observes.
5.4.2 The Agent-Environment Interface
Figure 5.14 shows how the agent and environment interact at discrete time steps t = 1, 2, 3, …, k. The agent observes the state at step t, s_t ∈ S (the set of all states), and produces an action at step t, a_t ∈ A(s_t), where A(s_t) is the set of all available actions. When it performs the action it receives the resulting reward r_t+1 ∈ ℜ and moves to the resulting next state s_t+1. The action taken by the agent influences the state of the world, which in turn determines its reward. The agent’s goal is to maximize the reward it receives in the long run. If the sequence of rewards received after step t is denoted r_t+1, r_t+2, r_t+3, …, then the question arises as to the precise aspect of this sequence that we wish to maximize. In general, we seek to maximize the expected return, E{R_t}, for each step t. In episodic tasks the interaction breaks naturally into episodes, e.g., plays of a game or trips through a maze. The return is given by the equation shown in Figure 5.15 (in the undiscounted episodic case, R_t = r_t+1 + r_t+2 + … + r_T), where T is the final time step at which a terminal state is reached, ending an episode.
In the example shown in Figure 5.16 the reward is -1 if the agent is not at the top of the hill, and the return is minus the number of steps needed to reach the top of the hill. The return is therefore maximized by minimizing the number of steps to reach the top.
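A minimal tabular Q-learning sketch in the spirit of this example is given below: the agent receives -1 per step until it reaches a terminal "top of the hill" state, so maximizing return means minimizing the number of steps. The chain environment, the reward scheme and the learning parameters are all assumptions for illustration, not the setup of Figure 5.16.

# Tabular Q-learning on a toy "climb the hill" chain of states.
import random

N_STATES = 6            # states 0..5; state 5 is the terminal "top of the hill"
ACTIONS = [-1, +1]      # move one step left or right along the chain
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # assumed learning parameters

# Action-value table, initialised to zero.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for episode in range(500):
    s = 0                                          # start at the bottom
    while s != N_STATES - 1:                       # until the terminal state
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)  # transition rule
        r = -1.0                                   # -1 until the top is reached
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        # Q-learning update.
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy should move right (+1) in every non-terminal
# state, i.e. minimize the number of steps and thereby maximize the return.
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)])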
5.4.3 Issues of Reinforcement Learning
Some of the complications associated with reinforcement learning are that the outcome of the actions taken by the agent may be uncertain, the agent may not be able to perfectly sense the state of the world, and the reward may be stochastic or delayed (e.g. finding food in a maze). In addition, there may be no clue (model) as to how the world responds to the agent's actions, no clue (model) of how rewards are paid off, and the world itself may change while learning is taking place. Another important question is how much time the agent needs to explore uncharted territory before it can exploit what has been learned.
Summary
- Outlined different types of Machine Learning methods available.
- Discussed the need for these Machine Learning methods and their application areas.
- Discussed the complications involved in using these methods.
You can view the video on Types of Learning II.
Web Links
- http://www.ee.columbia.edu/~vittorio/UnsupervisedLearning.pdf
- https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
- http://www-users.cs.umn.edu/~cherian/ppt/MachineLearningTut.pdf
- http://www.astroml.org/sklearn_tutorial/general_concepts.html
- http://www.acad.bg/ebook/ml/MITPress-%20SemiSupervised%20Learning.pdf
- http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.511.8881
- https://www.ics.uci.edu/~welling/teaching/271fall09/RL271-f09.ppt