36 Machine Learning

Bhushan Trivedi

epgp books

 

Introduction

 

We have referred to Machine Learning a few times in earlier modules. In fact, Machine Learning is related not only to AI but to many other disciplines, and is probably becoming a discipline of its own. In layman's terms, machine learning is “a computerized system which can learn on its own, or which can infer from data on its own”. Another definition is “a program which looks at patterns in data and tries to derive intelligence out of it”. A more detailed and precise definition is probably not possible; the examples that we provide will give a better idea. Interestingly, machine learning addresses problems of exactly that type: harder to define but easier to provide examples of. For example, identifying a mail as spam, identifying a packet as malicious, or deciding from an image whether the person in it is bored or happy. There are many similar cases where the problem is either ill-defined or changes too dynamically to be defined precisely. To deal with them we need ML. ML solutions look at many provided examples, for instance sad and happy faces, spam and non-spam mails, malicious and non-malicious packets, and learn from the features of the data how to relate a typical set of features to a typical category.

 

ML is gaining popularity like never before, and there are currently many attempts to use it in various fields. Speech recognition (learning who is speaking and what is being spoken), vehicular control (learning traffic and road conditions and helping vehicles choose a proper path to a given destination), deep learning (which enables further learning from images and other inputs), astronomical structure classification, self-driving cars and healthcare are a few domains being explored.

 

Machine Learning

 

In earlier modules we looked at how one can define a problem and solve it using AI techniques, represent knowledge and mimic experts. The idea of knowledge extension and learning has also been presented a few times. We discussed that knowledge must be inserted into the system, and that contextual information such as ontology-based information must also be provided for more human-like reasoning. The knowledge must be represented in a way that lets the search algorithms respond in real time. Not only the domain knowledge but also the meta knowledge which helps in search and problem solving (heuristics) needs to be represented. For all but trivial problems, the knowledge to be inserted into the system is huge. The knowledge is not only to be inserted but also manipulated; one would like to search it, find answers to queries, and convert it into other forms which can be used by other systems for reasoning.

 

If the complete information must be fed into the system by humans, several problems arise. First, the process is tedious and error prone, and thus quite slow. Second, the problem of keeping that knowledge current through a knowledge update process still remains; it is impossible to update knowledge in real time using manual methods. Third, even when such data is somehow collected, it is too large to be processed using any manual or conventional computerized method.

 

Thus it is imperative that machines somehow learn on their own and develop ‘machine intelligence’ which can help them solve problems and also improve their problem solving skills over a period of time. ML helps find patterns in such data and ‘makes the data talk’ for decision making. For example, a behaviour-based scanner might look at network behaviour and learn to detect attacks better over a period of time. That means that when it is provided with enough samples of attacks, it can differentiate between normal network behaviour and network behaviour under attack with greater accuracy. The more samples it sees, the better the detection.

 

A real expert does not confine himself to what he learns in the beginning; he continues to learn and sharpen his skills. An intrusion detection program, similarly, should be able to discover new detection methods apart from improving the existing ones. For that, the program must be able to explore the world, for example by looking at attack samples and detection tricks from specialist websites or expert blogs. Similarly, a spam checker must be able to learn to identify spam mails better over time. As a general rule for ML problems, the system becomes more and more accurate with more and more examples. ML is used both for gathering knowledge and for updating it.

 

In most such cases the amount of data available is so huge that it is impossible to deal with it manually. Take network logs, for example: it is impossible for an administrator (or anybody assigned the job) to look at all those logs in real time and take decisions. Spam filters likewise deal with humongous amounts of text derived from the large number of mails coming in from outside. ML solutions are useful for such cases: they derive patterns and generate meaning which experts can then verify. For example, when an ML algorithm marks a packet as suspicious, the admin can verify it. One such case is the doctoral work of a student of the author, who designed a model that aggregates attack alerts and correlates them into a much smaller set of descriptions, which in turn helps the administrator make better decisions. That process looks at the piles of records generated by many IDS sensors over multiple networks and aggregates them with intelligent correlation. Such a process not only reduces false positives and negatives but also improves the administrator’s efficiency, as he has to look only at consolidated reports and not raw alerts.

 

ML is also useful for domains which are not explored sufficiently for us to have explicit algorithms and information with which to build solutions. Most such fields provide datasets (intrusion detection is one such example; diagnosing genetic causes of diseases is another). A dataset contains a huge set of examples derived from the domain. Researchers use these datasets to learn about patterns and their relation to the various aspects they would like to test their algorithms against. One such example is the Ph.D. work of another research scholar who worked under the author; he used the KDD Cup dataset for his behaviour-based intrusion detection algorithm1. The medical domain has many such datasets for researchers to explore, to find patterns for detecting diseases, understanding the spread of killer diseases, diagnosing them, and even curing them using various medicines and methods. Medical datasets are much bigger and have a very large number of features from which to detect patterns and find solutions. In fact, ML has made possible many solutions that would be impossible for humans to achieve in conventional ways. Genome mapping is one such example: it was done with the help of many nations, took 13 years to complete, and cost about $3 billion between 1990 and 2003. Now, with the advent of machine learning algorithms, a company in Pune is able to repeat the same process in three days using conventional servers.

 

1 Not only do such datasets simplify the process of testing, they also lend authenticity to the work when the results are published using such well-known datasets.

 

The process of learning

 

When ML programs attempt to learn automatically, the process includes two different phases: first, collecting knowledge from various sources (if a dataset is not available) and, second, assimilating that knowledge into a proper form for reasoning by generating useful patterns from the data. The learning process also takes help from induction and generalization; we have already seen examples of these two ideas in previous chapters.

 

Let us take the example of a spam filter which detects whether a mail is spam or not. First, let us see whether this really is a machine learning problem by trying to define what makes a mail spam. You will quickly conclude that it is really hard. If you try listing the characteristics of spam, the list grows long in no time, and even then it cannot clearly differentiate spam from non-spam. Take another example, detecting a malicious packet, and more or less the same situation arises. How can one define a malicious packet? There are some example patterns, like a packet with the same sender and receiver addresses (an attack called the land attack), a packet with many slashes (the slash attack), or a packet which tries to access a parent directory of the current directory without permission (the directory traversal attack), and so on, but defining maliciousness precisely is almost impossible. Similarly, a mail containing words like Offer, Discount or Free indicates spam, but it is hard to turn that into a definition. These indications are not foolproof either; a mail containing all these words might still be a genuine mail. One must look at the content of the mail in more detail, and what to look for, and when, is very hard to pin down.

 

The learning process can only happen in one way: provide as many examples as one can and let the program learn from those examples. Spam filters try labelling mails into two classes, spam and non-spam. Once they do so, they expect the user’s decision for verification. When the user marks a mail originally categorized as spam as non-spam, or vice versa, the program not only changes the label but learns from that example to improve further. If you have observed carefully, spam filters have improved considerably in the last five years; this is because they keep on learning and they have huge data to learn from. Let us take an example from another domain: a program that attempts to learn from an image whether the person in it is happy or sad. How can one define “happy” or “sad” in terms of visual information? Again, it is really hard. What the program can do is find some patterns in the image and relate them to being happy or sad, based on the expert’s input, continuously, until it becomes reasonably good at estimating.

 

2 The human genome mapping project aims at finding patterns in genes which carry possibilities of diseases inherited from the parents. The case of the Hollywood actress Angelina Jolie became well known after she took a preventive measure against breast cancer, the risk of which genome mapping predicted to be about 85%.

 

The ingredients of the machine learning process

 

One researcher of ML describes three ingredients of the machine learning process.

 

First is the problem which needs machine learning. Though it might be hard to characterize this problem completely, one must have a sufficient number of examples to illustrate its attributes and make it adequately clear to the problem solver. Let us reiterate that if we can define a problem precisely and provide a clear-cut algorithm for it, it is not a case for machine learning.

 

Second is a yardstick of some sort with which one can measure the performance of the program. One such measure is the number of false negatives (attacks which remain undetected, spam mails allowed into the inbox, images of a happy person which the program fails to mark as happy) and false positives (non-attacks classified as attacks, normal mails blocked as spam, images of a person who is not happy marked as happy). Another measure is the confidence level: the system can respond that attack X is really happening with some confidence value Y, which is called the confidence level. The higher the confidence level for correct identifications, the better the performance; the more attacks detected with a high confidence level, the better the system.
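
To make the yardstick concrete, the short sketch below counts false negatives and false positives from predicted and actual labels; the labels and data are invented purely for illustration.

# A minimal sketch (invented data) counting false negatives and false
# positives for a spam-filter style classifier.
actual    = ["spam", "spam", "normal", "normal", "spam", "normal"]
predicted = ["spam", "normal", "normal", "spam", "spam", "normal"]

false_negatives = sum(1 for a, p in zip(actual, predicted)
                      if a == "spam" and p == "normal")   # spam that slipped through
false_positives = sum(1 for a, p in zip(actual, predicted)
                      if a == "normal" and p == "spam")   # normal mail blocked as spam

print("false negatives:", false_negatives)
print("false positives:", false_positives)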

 

Third is the set of examples provided to the system, sometimes called training instances or training experiences. Good, representative and complete examples are essential for learning.

 

The examples, if picked properly, can uncover patterns which differentiate a malicious packet from a non-malicious one, or spam from non-spam. The process derives some ‘rules’ for this differentiation and starts using them. It is quite possible that further examples lead to modifying or even scrapping those rules. Such rules are accumulated continuously over a period of time, and a reasonably good approximation may be reached after that learning phase.

 

Supervised and Unsupervised learning

 

The process of learning can be categorized into two types: supervised and unsupervised. Supervised learning happens with an expert looking after the process. The system is allowed to decide the outcome and the expert comments on whether the output is right or wrong. For example, when an IDS detects a packet as malicious, the expert might agree or disagree with the outcome; the expert’s response determines further learning by the system. When the user overrides the system’s decision to label a mail as spam, or marks as spam a mail the filter considered normal, he modifies the decision made by the system, which is an example of supervised learning. Supervised learning has the advantage of learning from failures. Sometimes supervised learning has only a pass-fail response, as in both of the above cases, while other methods such as neural networks may provide a response which also indicates how far the system went wrong. We have also seen that neural networks help the system improve even when an identification is correct but marginal.

 

Unsupervised learning works without the expert’s feedback. It takes inputs and checks for similar patterns, grouping inputs with similar patterns together and keeping dissimilar patterns apart. Thus it groups similar inputs into separate categories. Interesting examples are classifying attacks into various categories, identifying facial images indicating different moods, or sorting given mails into official, personal, promotional and so on, and labelling them accordingly. Such a process is elementary at the beginning of any classification task: we must know ‘how many classes’ before we can define classes and label inputs into them. Even when we have an idea about the number of classes (for example we only need to label a packet as malicious or not, or a mail as spam or not), we still need to group the inputs into two groups even when we have no clear idea which input patterns indicate what. Unsupervised learning helps us there. For example, when we provide all packets to an unsupervised learning process, it might classify them into some classes based on the patterns they contain. We may end up with just two groups of packets with different characteristics, and we may then decide which group is more likely to represent malicious packets. Studying those classes (their attributes and content, packet header values and so on) may throw more light on how one can classify packets.

Similar techniques are applied in many other domains. One researcher from Pune discovered patterns that classify cell images in a way that makes it easier to determine whether a cell is cancerous or not. The first part of that process, which determines criteria that can divide cells into two classes, is an example of unsupervised learning. This example also illustrates that unsupervised learning is sometimes useful as a prerequisite to supervised learning: once the attributes indicating cancer were found in the first phase, the second phase used supervised learning to determine whether a sample cell is cancerous, taking the biopsy report and matching it with the result of the algorithm. The same process can be applied to determining the learning deficiencies of school children. Details about students along with their learning outcome values are input, and the processing determines patterns which relate typical attributes (for example food habits, atmosphere in the house or genetic characteristics) to one another. Once such patterns are identified and students with similar patterns are grouped, one can see whether it is possible to relate the groups to their learning deficiencies.

 

Training, testing and generalization

 

Unlike unsupervised learning, supervised learning requires a training set of examples, a testing set, and some process to make sure generalization happens. We discussed all these issues while covering neural networks. One interesting method used by many researchers is called N-fold validation. In this process, the user divides the input set into N partitions (folds). N-1 folds are used as training instances (the program learns from the supervisor’s inputs on them) and the remaining fold is used for testing. Then a different fold is held out: the 1st to (N-2)th folds together with the Nth fold are used for training and the (N-1)th fold is used for testing. The process continues until every fold has been used once for testing. Testing on unseen inputs is critical to determine whether the solution is properly generalized, so once the N-fold validation is completed, many systems are additionally tested with completely unseen inputs.
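
A minimal sketch of the fold-splitting loop is given below; the ‘model’ here is just a trivial majority-class predictor standing in for whatever learner is actually trained, and the data is invented.

# N-fold validation sketch: each fold is held out once for testing while
# the remaining folds are used for training; the learner is a placeholder.
def n_fold_validation(labels, n=5):
    fold_size = len(labels) // n
    scores = []
    for i in range(n):
        test_idx = list(range(i * fold_size, (i + 1) * fold_size))
        train_labels = [labels[j] for j in range(len(labels)) if j not in test_idx]
        # "training": the placeholder model just remembers the majority label
        majority = max(set(train_labels), key=train_labels.count)
        # "testing": score the placeholder model on the held-out fold
        correct = sum(1 for j in test_idx if labels[j] == majority)
        scores.append(correct / len(test_idx))
    return scores

labels = ["attack" if x % 3 == 0 else "normal" for x in range(20)]   # dummy labels
print(n_fold_validation(labels, n=5))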

 

Usually each input consists of some attributes and their values. The training instances allow the system to update its state after each input so that it classifies that input better next time (as we studied in modules 9, 10 and 11 on neural networks). We represented the input as an n-dimensional vector there, which is more or less true for ML in general3. The attribute values can be continuous or discrete. For example, a port number is a discrete integer value between 0 and 65535, while network load can be any (continuous) value between 0 and 1.

3 Neural networks are one of the most popular techniques used for machine learning.
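
As a small illustration of such an input, the snippet below represents a single network packet as a point in feature space; the attribute names and values are invented for the example.

# Illustrative only: a packet as an n-dimensional feature vector mixing
# discrete and continuous attributes.
packet = {
    "dst_port": 80,          # discrete attribute (0 to 65535)
    "packet_size": 512,      # discrete attribute (bytes)
    "network_load": 0.37,    # continuous attribute between 0 and 1
    "same_src_dst_ip": 0,    # boolean flag (a land attack symptom)
}
feature_vector = [packet["dst_port"], packet["packet_size"],
                  packet["network_load"], packet["same_src_dst_ip"]]
print(feature_vector)        # the point this packet occupies in feature space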

 

In the spam filtering case, the process happens like this. Each mail which users have classified as spam is given as input and the system is asked to decide whether it is spam. For that, we can use the BPNN we discussed in module 10. The input layer receives the words of the mail and a few other important features (such as whether the mail contains poor English, uses URLs completely different from those mentioned in the anchors, or contains invisible fields to collect information users are not aware of, and so on). If the system comes back with ‘not spam’, it needs to improve its decision making and is trained to do so; if it correctly identifies the mail as spam, it is fine. This is done for each input and repeated until all inputs are correctly identified.
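
The sketch below captures the flavour of this training loop, using a single-layer perceptron as a much simpler stand-in for the BPNN of module 10; the word list, the mails and the labels are all invented for illustration.

# Supervised training loop for spam detection; a perceptron stands in for
# the BPNN, and every word list and mail below is made up.
suspicious_words = ["offer", "discount", "free", "winner"]

def features(mail_text):
    words = mail_text.lower().split()
    return [1.0 if w in words else 0.0 for w in suspicious_words]

training = [("free discount offer just for you", 1),        # 1 = spam
            ("meeting agenda for tomorrow", 0),              # 0 = normal
            ("you are a winner claim your free prize", 1),
            ("please find the report attached", 0)]

weights, bias, rate = [0.0] * len(suspicious_words), 0.0, 0.1
for _ in range(20):                       # repeat until the inputs are identified
    for text, label in training:
        x = features(text)
        predicted = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        error = label - predicted         # the supervisor's correction
        weights = [w + rate * error * xi for w, xi in zip(weights, x)]
        bias += rate * error

print(weights, bias)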

 

The idea behind supervised learning is to distill features from the training instances and successfully use them to correctly identify the testing data.

 

One method of machine learning is to take the inputs one after another, look at each attribute of an input, and assume an n-dimensional space where n is the number of attributes. Each input then becomes a point in that space, placed according to the values of its attributes. This n-dimensional space is sometimes called the feature space, and placing inputs as points in it is one of the methods of machine learning. In a way it is a method of finding out which inputs look similar to each other, by placing them in the space determined by the features and measuring the distances between them.
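
A tiny sketch of the feature-space idea follows: each labelled input is a point, and a new input is judged by its Euclidean distance to those points; the labels and numbers are invented.

# Feature-space sketch: similarity judged by Euclidean distance.
import math

def distance(a, b):
    # Euclidean distance between two points in feature space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

known_inputs = {                  # already-labelled points (invented values)
    "normal packet": [0.1, 0.20, 0.0],
    "land attack":   [0.9, 0.10, 1.0],
}
new_input = [0.8, 0.15, 1.0]
nearest = min(known_inputs, key=lambda k: distance(known_inputs[k], new_input))
print("most similar to:", nearest)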

 

Another method is to discriminate between inputs based on the attributes they contain. Some attributes are critical in differentiating between inputs. For example, finding which symptoms in a packet relate to which attack makes it possible to discriminate and suggest that the packet carries attack X or Y. Similarly, when a patient complains about a stomach problem, it helps the doctor determine that the case is more likely to be typhoid than malaria.

 

The first method, based on the feature space, is called the generative model, while the second, based on discriminating features, is called the discriminative model. For a new example, if we are using a generative model, we first place the input in the feature space and then assess how near it lies to a typical class, and thus the probability that the example belongs to that class. If we are using the discriminative model, we use the symptoms (a kind of boundary) to decide whether the packet indicates input type A, type B, or neither. In the generative model we stress attributes for labelling the data, while in the discriminative case we have boundaries that decide where the data belongs; we use the discriminating features of the input to determine the class the element belongs to.

 

Both models can be used in practice. In fact, some researchers have proposed solutions which combine the two: the vicinity to a typical class (or classes) is determined, and discriminating attribute values are also judged to ascertain the result further.

 

Naïve Bayesian classifier

 

Bayes’ theorem, which we have seen earlier, is an excellent example of a generative model. Let us recap the character recognition example. Each character’s probability is decided feature by feature using Bayes’ formula: for example, whether a dot (one feature) is present or not, or whether a horizontal line (another feature) is present or not and at which position. A system which uses Bayes’ formula to determine the class an element belongs to is known as a Bayesian classifier. The classifier looks at the features of the input and determines the probability of the input belonging to each class; here each class represents a character. Thus if a character belongs to class a, it is the character a; if it belongs to class p, the character is p, and so on. Eventually membership is decided by the highest probability value: if the character has the highest probability of belonging to class k, it is decided to be k. This is an example of a generative model. If there are n features that we consider for a character, those features are checked for the input character and, based on them, a place for that character in the feature space is decided. The character class nearest to the input in that space is taken to be the actual character.

 

An important assumption in this example is that each feature we consider is independent of the others: for example, there is no dependency between a dot being observed and a horizontal line being observed. (There is, however, a dependency between a dot being observed and the character being either j or i.)
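
A minimal sketch of this computation follows: under the independence assumption, the score of a class is its prior multiplied by the per-feature likelihoods; the characters, features and probability values are invented.

# Naive Bayes sketch: P(class | features) is proportional to
# P(class) * product over features of P(feature | class).
priors = {"i": 0.5, "l": 0.5}                  # two candidate characters
likelihood = {                                 # P(feature present | character), invented
    "i": {"dot": 0.90, "vertical_line": 0.80},
    "l": {"dot": 0.05, "vertical_line": 0.90},
}
observed = {"dot": True, "vertical_line": True}

scores = {}
for c in priors:
    score = priors[c]
    for f, present in observed.items():
        p = likelihood[c][f]
        score *= p if present else (1 - p)     # independence assumption
    scores[c] = score

print(max(scores, key=scores.get), scores)     # class with the highest probability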

 

Suppose we would like to test whether our product is popular with a typical class of customers, and we try testing social media. We can use a Bayesian classifier to decide the probability that a given post is related to our product and, if so, whether it favours our product or not. That decision might be based on some attributes we have provided (for example, when somebody dislikes the colour, the argument is against our product; when somebody is unhappy with a delay in receiving the product, it is not about liking or disliking the product; when somebody is happy that the item works fine even after its scheduled lifetime, it is an argument in our product’s favour, and so on). Looking at such references, we can decide which features of our product are liked or disliked by which type of customers, and thus determine the popularity of our product with respect to a typical customer segment and typical features.

 

Hidden Markov Model (HMM)

 

The problem with the Bayesian classifier is that we cannot use it when the problem is not completely visible. For example, in a card game at a casino, the player plays without complete knowledge of the problem state. In this case there are two types of parameters: the first type are those whose values are known to the player (the cards he holds, for example); the second type are those whose values are hidden and not available to him (the cards that the other players hold). It is also important that the hidden and the available parameters are related (for example, the cards that the other players hold cannot include the cards the player currently holds).

 

The HMM has two components: states, which are not observable (and account for the parameters that are not visible), and observations, which are generated by those hidden states. Each hidden state is responsible for generating one observation. The system changes from one state to another continuously and produces observations. The idea is to observe the sequence and predict the right set of labels, or states, responsible for that set of observations. For example, we might hear a few bars of music and figure out which keys were pressed: the key-pressing sequence is not available to us, but the observations generated from it (the music) are. Similarly, in a speech recognition system we have the spoken sound with us (as part of a statement) but we do not have the statement itself (the sequence of words which were read out to generate that sound). Another example is an address received as a paragraph, with the system given the job of determining which part of the paragraph describes which part of the address: the text received is the observation, and the city, state, pin code and street name are the labels. The output assigns a portion of text (for example India) a label (for example country).

 

The idea is to somehow link the observations with the labels. The HMM assumes that the labels are dependent on each other but the observations are not: an observation depends only on its associated label, which influences the generation of that observation. For example, our observation of the spoken sound “e ai” in a speech depends only on the actual word “AI” which was spoken. The word itself (the label) might depend on the words spoken before or after it (the labels before or after in the sequence), but the observation “e ai” depends only on the word AI and no other word.

 

Like neural networks, an HMM requires training, and like neural networks it can be trained in both supervised and unsupervised ways. The supervised method uses both label sequences and observation sequences for learning, while the unsupervised method has only the observation sequences.

 

HMM is also an example of a generative model. In the speech recognition case, for example, the output is the word chosen by the algorithm based on how well the observations match the potential classes. In other words, for a given observation sequence, the task is to find the most probable sequence of states.
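
Finding that most probable state sequence is usually done with the Viterbi algorithm; the sketch below runs it on a toy model whose states, observations and probabilities are all invented.

# Viterbi sketch on a toy HMM: recover the most probable hidden state
# sequence for an observation sequence; every number here is invented.
states = ["word_AI", "word_eye"]
start  = {"word_AI": 0.6, "word_eye": 0.4}
trans  = {"word_AI":  {"word_AI": 0.7, "word_eye": 0.3},
          "word_eye": {"word_AI": 0.4, "word_eye": 0.6}}
emit   = {"word_AI":  {"e ai": 0.8, "aye": 0.2},
          "word_eye": {"e ai": 0.3, "aye": 0.7}}
observations = ["e ai", "aye", "e ai"]

# best[t][s] = probability of the best state path ending in state s at time t
best = [{s: start[s] * emit[s][observations[0]] for s in states}]
back = [{}]
for t in range(1, len(observations)):
    best.append({})
    back.append({})
    for s in states:
        prob, prev = max((best[t - 1][p] * trans[p][s] * emit[s][observations[t]], p)
                         for p in states)
        best[t][s], back[t][s] = prob, prev

last = max(best[-1], key=best[-1].get)         # trace the best path backwards
path = [last]
for t in range(len(observations) - 1, 0, -1):
    last = back[t][last]
    path.insert(0, last)
print(path)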

 

Concept learning

 

Semantic nets, scripts, CD and MOPs are examples that illustrate relations between concepts. There are languages like Prolog which can express such conceptual relationships, as in the following snippet of a Prolog rule.

 

mama(X, Y) :- mother(X, Z), brother(Z, Y).4

 

The idea of concept learning derives from the idea of “learning from examples”. Remember our earlier discussion, in module 1 and later, about looking at examples and learning the concept of a car. Concept learning is about taking examples of something, learning its attributes, and learning to identify that concept. It is quite possible that similar instances were seen in the past but not recognized; now they can be.

 

Concept learning is more than drawing conclusions from straightforward rules like the one above. For example, suppose you want to learn how employees are seriously motivated to stay with the organization, and you consider some incentive strategies such as a foreign trip a year, monetary incentives for completing work on time, an additional holiday, or a paid vacation. You may then declare some of these motivational strategies to be the best for your organization. Is your decision about the best set of policies correct? That is not straightforward to determine. In other words, have we learned the concept of how one can retain an employee in a company? If we determine a set of policies which does exactly what we wanted, namely help retain employees, we have learned that concept. This concept, obviously, is quite complex compared to the one we started with in the Prolog example. 5

 

4 Here also we face the limitations of a language. How can one define such a relation in such a crude manner? Unfortunately we have no way of representing a relation like Mama (Bajarangi, Shahida) using this rule. You would probably agree that this is a much stronger relation than what we can define using a language.

 

This process is a little complicated, as it also depends on factors related to the employee: for example whether he or she is married, what the spouse does, whether there are kids and what their preferences are, and so on. The motivational factors which work for one type of employee may not be adequate for another, and one must have a clear relationship between those factors. The problem is that the incentives are designed only from the HR manager’s perspective; he has no direct idea of the employees’ interests. What he can try is to take some attributes assumed to contribute to the willingness of employees to stay, and derive a function which can say whether an employee is going to stay or not. The interesting point is that the HR manager might proceed based on his understanding of some attributes of employees (whether an employee is married, his age, his career graph so far, and so on) and his own understanding of how those attributes relate to his retention strategy (one foreign trip a year for employees with more than three years of service, free school fees for two kids, one dinner with the spouse every month, and so on). Now, if he is able to improve the retention period based on his strategy, we say he has learned the concept of retention. Devising such a function is not straightforward (one of the author’s students earned a Ph.D. by providing an automated solution to part of this problem) and requires learning. The idea is quite similar to our discussion of neural networks: the HR manager must be able to derive the weights associated with the various attributes and optimize the retention policies to match most of the employees’ interests.

 

In the above case, the HR manager may start with the assumption that one foreign trip a year for an employee with three or more years of service will increase retention by at least 30%. Such an assumption is called a hypothesis in the domain of concept learning. There are many such hypotheses that we can frame. A hypothesis can also be tested against the data we have about our past employees (an example of a dataset), so we can assess whether our strategy was successful on them or not.

 

The problem with this approach is that our conclusion is biased if the sample data does not represent the complete universe closely. It is quite possible that some important parameter about the employee is missing from the list of attributes; when we do not have such parameters, we will never be able to learn that relation and our hypothesis cannot be correct. When some employees are misclassified (our hypothesis says they should be retained but they are not, or it says they will not be retained but they are), the concept has not been learned properly. In fact, what the HR manager is trying to do is to find the right formula for retaining employees. He might consider a set of potential candidate strategies (called the hypothesis space H in the literature) and try to fit one or more of them to the actual data. If the actual (correct) hypothesis is not part of H, it is impossible for him to find it.

 

5 The management domain contains a large number of such concept learning problems, like how customers come to like a product, how one can win an election, and so on.

 

One method that extends this simple approach is to use a decision tree based on the observations we have, to work out the preferences of an employee. For example, you may start by asking what his age is; if his age is above 60, the next question is chosen accordingly and most other attributes become irrelevant. This is smarter, as it is more precise and useful. The hard part is to learn the chain of questions to ask and how to choose the next question (based on the answers so far); many methods we have studied so far can be used to manage this part. In computer science this method is called a ‘decision tree’, though it is known by other names as well. Many algorithms for building such decision trees also order the questions properly to make sure the tree is minimal, so that an instance is labelled correctly with the minimum number of node traversals (by asking the minimum number of questions).
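
The sketch below shows the chain-of-questions idea as a tiny hand-written decision tree for the retention example; the questions, their order and the outcomes are invented, not learned from data.

# A hand-written decision tree for the retention example; in practice the
# questions and their order would be learned, not fixed by hand.
def likely_to_stay(employee):
    if employee["age"] > 60:                       # first question
        return employee["years_in_company"] > 10   # most other attributes now irrelevant
    if employee["married"]:                        # next question depends on the answer
        return employee["paid_family_vacation"]
    return employee["foreign_trip_per_year"]

print(likely_to_stay({"age": 35, "married": True,
                      "paid_family_vacation": True,
                      "foreign_trip_per_year": False,
                      "years_in_company": 4}))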

 

As this approach uses differentiating rules to decide the class an input belongs to, it is of the discriminative type.

 

Clustering

 

So far our examples have had sample data and sample input-output pairs. For instance, the HR manager has the employees’ attributes and also knows how long each employee was retained, and the idea is to determine the relation between the two. That is not always possible. Suppose a researcher is given the job of finding different types of attacks from different types of packets. Many attributes of the packets are reported (including packet loss, delivery status, round trip time, average packet size, sender’s IP address, receiver’s IP address, sender’s and receiver’s port numbers, and so on). Unlike the earlier case, the researcher has no idea or presumption about which attributes contribute to a given attack and which do not. In fact, he does not even know how many types of attacks are described by the dataset.

 

Thus the problem is not confined to finding the relation between a set of attributes (retransmission time and so on) and a known attack (for example a denial of service attack), but extends to segregating the input packets into groups describing specific attacks. For example, one exercise carried out by one of the author’s students was to divide input packets into three classes: one describing land attacks, another describing slash attacks, and a third describing neither.

 

The solution looks at the attributes of the packets and somehow groups packets with similar attributes together. The process is quite similar to the one scientists use to classify species. This is basically an unsupervised learning approach.

 

Many methods for unsupervised learning exist. One very popular method is the k-means clustering algorithm. The idea is to represent all packets as points in an n-dimensional space (where n is the number of attributes) and find clusters of nodes, each describing a single attack. The user assumes there are k types of attacks, so the algorithm tries to form k groups from the input packets based on the distances between the nodes. We end up with k groups (each describing one attack) and some packets belonging to each group. One important consideration is that each packet belongs to precisely one group, every packet takes part in the grouping process, and none is left out6.

 

The algorithm works like this.

 

1. Begin with k empty clusters.

2. Represent each packet as a point, placed according to the values of its attributes.

3. Assign each point to one of the k clusters at random.

4. Compute a central point (centroid) for each cluster.

5. Loop

a. For each node, find the nearest group (measured by the distance between that node and the centroid of each group) and reassign the node to that group.

b. For each group, recompute the centroid from the changed group members.

c. Compute the total distance of the nodes of each group from their centroid.

d. If the new total distance is smaller than the previous total by more than a specified amount, continue looping; otherwise stop.

 

The algorithm thus clusters nodes that are near each other, and stops when the groups are formed such that each node is part of its nearest group.
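
A compact sketch following the steps listed above is given below; the points are invented and only two attributes are used so that the example stays short.

# k-means sketch: random assignment, centroid computation, and repeated
# reassignment of each point to its nearest centroid.
import random

def kmeans(points, k, iterations=20):
    random.seed(0)
    assignment = [random.randrange(k) for _ in points]      # step 3: random assignment
    for _ in range(iterations):
        centroids = []                                       # steps 4 and 5b: centroids
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if not members:                                  # keep empty clusters alive
                members = [random.choice(points)]
            centroids.append([sum(v) / len(members) for v in zip(*members)])
        # step 5a: move each point to the group with the nearest centroid
        assignment = [min(range(k), key=lambda c: sum((pi - ci) ** 2
                          for pi, ci in zip(p, centroids[c])))
                      for p in points]
    return assignment, centroids

points = [[0.10, 0.20], [0.15, 0.10], [0.90, 0.80], [0.85, 0.90], [0.50, 0.50]]
print(kmeans(points, k=2))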

 

Deep Learning

 

We have already studied an important machine learning method, neural networks and the BPNN. In recent years a lot of work has been done in that direction and a new branch called deep learning has emerged.

 

We have seen that multiple hidden layers are required when the problem is complicated and each feature at a higher level is described as a collection of multiple features at a lower layer. An interesting example is the Deep Genomics project at the University of Toronto, started more than a decade ago. The group has designed methods to learn variations in the genetic structure of people and relate them to diseases. They use machine learning methods they developed to find patterns in the huge dataset they have generated from patients, covering how cells process genomes and produce biomolecules, and they are able to predict genetic variations and relate them to diseases7. They used a deep learning process here, with many layers of neural network to accommodate millions of features. Their algorithm trains each layer one after another and makes sure that each layer contributes to the decision making process.

 

6 An interesting problem here is the multi-class attack, in which one packet indicates multiple attacks and therefore belongs to multiple groups. One of the author’s Ph.D. students has obtained a patent describing a solution to this multi-class attack problem. It is imperative that the intrusion detection system identify all the attacks so that all of them, and not just one, can be thwarted.

7 They have also made their project available free for non-commercial users. The project is named SPIDEX.

 

The conventional backpropagation and other algorithms are found to be slow when applied to very high volumes of data. The other problem is that the middle layers fail to contribute to the decision making process after a few iterations, defeating the advantage of finer feature extraction. Newer algorithms, which train each layer one after another and feed one layer’s output as the input to the next, eliminate both problems. Deep learning is still in its infancy, and many researchers are testing their algorithms on large datasets to check their validity and effectiveness.

 

Summary

 

We looked at machine learning in this last module. Machine learning based solutions are gaining popularity for problems like speech recognition, spam filtering and so on. Machine learning allows knowledge extracted from various sources to be made available to AI programs automatically; it is important both for obtaining knowledge and for continuously updating it. Many domains now offer huge datasets to researchers, who try to find patterns and relate those patterns to specific characteristics. ML works by picking up examples and learning useful patterns from them. If an expert verifies the output, the learning is called supervised, otherwise unsupervised. One needs a problem, a performance measure, and the right set of examples to machine learn. The process of generalization, by which the machine-learned system solves problems it has not seen before, is also important. The Bayesian classifier, HMM and concept learning are different examples we have seen in this module. We also looked at the deep learning process, which extends our basic neural network model by training each layer separately and making sure each minute feature of the input helps in the decision making process.
