This afternoon I am going to share my thoughts on the use of neural networks as a soft computing tool. Basically I am working on the use of neural networks, genetic algorithms and so on from the point of view of data classification, that is, mapping numbers to symbols, whereas you would probably be working more on estimating continuous variables from the numerical data available to you. But the techniques can be adapted to regression without much change. My test bed is images from remote sensors, that is, space-borne imaging sensors that collect images of the Earth's surface, and for hydrological applications I suppose remotely sensed images are a very important input. So what are artificial neural networks? The word neural relates to the biological nervous system that exists in our brain, and when we say artificial, it means it is something related but not exactly the same. It is a computational model, and when we say network, it means there are several elements which are interconnected in a particular manner. The computing that happens in neural networks is highly parallel. It is said that our brain has several billion neural elements, neurons, each one of which is connected to tens of thousands of other elements. That is how it is a very large parallel computing network. Each element is very simple in its capabilities and very slow compared to the electronic processors that we use, yet the sheer number and the fact that thousands of them are connected in parallel makes the human brain a very powerful computing system. More important, the behavior of the neural network is learned rather than pre-programmed. We take different actions under different circumstances even if the same inputs are presented; therefore it is a learning and adaptive system that we are talking about. The computational equivalents of neural networks are implemented on standard computers as well as on large parallel machines. Why study artificial neural networks? What is the motivation? The first thing is that we want to understand the functioning of the brain and our behavior under different circumstances. We also want to use artificial neural networks as an engineering problem-solving environment for problems of data classification as well as regression. As Professor Hildo mentioned, neural networks are today used by virtually all branches of engineering and science, from face recognition to the usage pattern of credit cards. In fact, neural networks became famous after the Visa credit card company used them for analyzing the pattern of use of credit cards, basically as a learning or data mining tool to distinguish normal use from fraudulent use. There are a large number of other applications, and the one we focus on today is data classification. So if we want to define classification, it is the task of assigning a set of data elements, which could be numeric as well as symbolic, to a set of labels, or mapping them to categories, in such a way that the accuracy of this mapping is as high as possible, or in other words the cost of this assignment is minimum. Many of the ideas that we are going to talk about are borrowed from the field of pattern recognition. Now in the beginning I said that our neural system works by learning and not by pre-programmed methods. So what do we mean by learning?
So if I want to give a definition to learning, it is the process of extracting useful knowledge from the data presented, and here we are talking of computational learning methods. So essentially we are trying to understand the relationship between the data and the useful knowledge. Once a system has learned this relationship, that system will be able to perform whatever task it learned more efficiently compared to its state prior to learning. Like the use of a computer, driving a car, interpreting images, solving equations: there are any number of examples one can think of where learning is involved. So compared to the state the system was in before learning, after learning the system becomes more efficient at performing whatever task it was learning. In how many ways can learning happen? One obviously is using supervised methods, that is, when you have a teacher, either domain-specific knowledge or a human analyst, to guide the learning process. The second is unsupervised learning, when such a teacher or domain-specific knowledge is not available. In such a case the learning is essentially exploration of the data space to understand what kind of structures or groupings occur within the data. The most common example is clustering: when you have a large data set one can look at the way the data elements are getting grouped together. What is the test of the learning? The most important test is that when the system is presented with new data, if it has learned satisfactorily, the new data should be properly interpreted. This aspect is known as generalization. So once any classification tool has learned the relationships between the data and the knowledge that is to be gained from it, any new data presented to it should be properly interpreted. A common example: suppose we solve some problems in class for the students; that is the learning phase for the students. If the learning is satisfactory, in the examination when you present them with some questions they should be able to solve them, even though the questions they see in the examination are not the ones they were exposed to during the regular lectures. The same applies to various types of recognition. Once we recognize one kind of, say, automobile, we should be able to recognize a different company's automobile even though we have not seen it before. So learning is the forward phase, and generalization is the reverse phase where we have to apply the knowledge available to understand the data. Now when we are talking of data, what do we have exactly in mind? We are talking of what are called features or attributes which characterize the input data. For example, if we are talking of a satellite image, we are talking of the values measured by the sensors in different wavelengths of the electromagnetic spectrum, based on which an interpreter may be able to interpret the data as water bodies or, say, open areas, fields, forest and so on. The features that are presented to the classifier should be rich enough to capture all the information necessary to extract the knowledge, and in various applications one would use different types of features, say in face recognition or speech recognition, in civil engineering applications, in power systems applications.
There are various types of features that the classification system would use for learning, and one of the tasks the classification system has to perform prior to learning is to extract the most important features so that we can perform the task with maximum efficiency and maximum accuracy. We should not have too few features or too many; each of these is in itself a major area of research, and often we may have to perform a feature reduction to remove less useful features so that we can perform the task with greater accuracy as well as speed. Before we go on to neural networks we will look at a simple statistical classifier. Whenever we are performing the classification task, the available data at our disposal may or may not be completely informative, or even complete for that matter. So often we can only estimate the likelihood of the data elements belonging to various classes, in which case classification is essentially computing the conditional probability of various classes given the data. The data is coming from our sensors, field measurements and so on, based on which we have to estimate the class to which these data elements should be mapped, and for mathematical tractability very often we make a very simplistic assumption that the data is Gaussian distributed. What is the meaning of that? The meaning is that every class can be characterized by some mean response with a certain spread around the mean. This is a very convenient assumption, and the Gaussian distribution also has elegant properties for mathematical tractability. You have it in one dimension and in multiple dimensions, so every class can be characterized by a mean response, or mean values, with a spread around it; the spread is characterized by the standard deviation. So we have an expression for the multivariate Gaussian, and the likelihood that we are estimating is in terms of the given data. So what is the probability of a given class given the data? Very often it is very difficult to estimate directly, therefore we take the alternative route of Bayesian estimation. In supervised learning particularly, it would be possible to find some samples for each class so that the conditional probabilities of the data given the class can be estimated. So this quantity can be estimated by the process of what is called training, or by collecting prototypes or samples; training data, there are various terms used for this purpose. So the conditional probability of a class given the sample can be estimated in terms of the conditional probability of the data given the class, together with the prior probabilities. As you can see, this term here is what is popularly known as the Mahalanobis distance between the data vector and the corresponding class mean; the sigma i is the class covariance matrix. Now here we are assuming two things: one is that the data can be modeled by a multivariate normal distribution, and secondly that all the data variables are of similar range, so that we can compute these distance metrics, Mahalanobis distance, Euclidean distance and so on. If one variable has a range in hundreds or thousands and another variable is, say, between 0 and 1, these distance metrics will not make sense because the numerically larger variables will dominate the others. So when we are mixing data of various statistical distributions, a more appropriate method for classification would be to choose the kind of method which does not depend on the distributions. Neural networks are one such distribution-free classifier.
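To make the statistical classifier concrete, here is a minimal sketch in Python (NumPy) of the Gaussian maximum-likelihood rule just described: per-class means and covariances are estimated from training samples, and a new vector is assigned to the class with the highest Gaussian log-likelihood, which involves the Mahalanobis distance. Equal prior probabilities are assumed for simplicity, and the function names and data are purely illustrative.

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate per-class mean and covariance from labelled training samples."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return params

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mean
    return d @ np.linalg.inv(cov) @ d

def classify(x, params):
    """Assign x to the class with the highest Gaussian log-likelihood
    (equal priors assumed): smallest Mahalanobis distance plus log|Sigma| term."""
    best, best_score = None, -np.inf
    for c, (mean, cov) in params.items():
        score = -0.5 * mahalanobis_sq(x, mean, cov) - 0.5 * np.log(np.linalg.det(cov))
        if score > best_score:
            best, best_score = c, score
    return best

# usage with made-up two-class training data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_classes(X, y)
print(classify(np.array([2.5, 2.5]), params))  # most likely class 1
```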
As I said before, neural networks are inspired by the processes in the human brain. Just to make a comparison between human and computational strengths: we are very good at dealing with noisy data; even if we cannot see something very clearly, or if, say, there is a disturbance when you are watching television, we can still see and understand, we can process the data presented to us. We are good at dealing with unknown data, data presented to us for the first time. We are massively parallel, because the number of processing elements is very large, and fault tolerant: even if we see only part of the data, like a part of a person's face, we can still recognize the person, or we do not see the face but we see how a person is walking, and that itself can be adequate for us to recognize them. We also can easily adapt to the circumstances, and we store not so much the data as the knowledge. As opposed to this, computers struggle when they are presented with noisy data or in recognizing new data; compared to the parallelism that we have in the brain, the number of computers that can be connected together is very small; fault tolerance is available to some extent but not a lot; and computers can store data very efficiently but not so much the knowledge. The knowledge-based systems that one can think of store very primitive knowledge. This is the structure of the biological neuron; in the next slide we will see its computational equivalent. The neurophysiologists have given us a model wherein a neuron is supposed to have various dendrites which carry the signals coming from other neurons, and each of these signals is scaled by what is called a synapse, that is, they are multiplied by some weights. All these inputs then get collected in the central core of the neuron, called the nucleus, where the scaled inputs get added up, and if the resultant magnitude is large enough the neuron will fire, that is, it will send a signal through its output channel; if the input signal is weak then nothing goes out through this neuron. So this is the biological neuron, and here is its computational equivalent: the various types of data that we collect are presented to the artificial neuron, these inputs get multiplied by certain weight factors, they get summed, and the sum passes through a nonlinearity or activation function, most often nonlinear, and the result goes out to connect to other artificial neurons. The inputs are known, whatever data we have collected; what kind of processing should happen here we can decide; what we do not know are the weights with which we scale up or scale down the inputs. So essentially learning in the computational sense is nothing but estimating the correct weights so that the given inputs get mapped to the correct output; we will see this repeatedly in the coming slides. The components of the artificial neuron essentially consist of weights, a summation element, an activation function, and also a bias or what is also called an offset. A number of inputs get scaled, then they get summed, and this passes through a pointwise nonlinear function which then produces the output of the neuron. So this is our artificial neuron; the computation that happens is essentially a summation followed by a nonlinear transformation. Now we need to have learning mechanisms so that the inputs x and the output y are meaningfully related.
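A minimal sketch of the artificial neuron just described: inputs scaled by weights, summed together with a bias, then passed through a nonlinear activation. A sigmoid activation is assumed here, and the numbers are illustrative only.

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum of the inputs plus a bias,
    passed through a nonlinear activation (sigmoid here)."""
    s = np.dot(w, x) + b             # summation element
    return 1.0 / (1.0 + np.exp(-s))  # activation squashes the sum to (0, 1)

# example: three inputs scaled by weights, then summed and squashed
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron(x, w, b=0.2))
```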
So if the inputs, say, are to be mapped to class 1 then the weights should be of a particular configuration, and if the inputs are to be mapped to class 2 then they should have an alternate configuration. So the weights should be adapted in such a way that the result gets adjusted based on the way the inputs are changing. This is the process of learning, where a neuron adjusts its weights, because the weights are the only thing unknown. What kind of nonlinear transformations are we talking about? One could be what is called a hard limiter: for input magnitudes below a limit there is no output, and once the input crosses a certain limit we have a certain output. So one state can be called 0 and another state can be called 1; there is an abrupt jump beyond a certain value. It could also be a more gradual transformation, in the sense that the output increases linearly with the input, but this is not very useful because a linear transformation is effectively no transformation. The other nonlinear transformations, which have continuous derivatives, have been found to be of a lot of practical use in neural network applications. One of the most common is called the sigmoidal function, which varies from 0 to 1 over the range minus infinity to plus infinity. So whatever be the input range, it gets squashed to the range 0 to 1; that is why it is also called a squashing function. We have Gaussian functions also; one popular example is the radial basis function. Neural net activity, that is research in neural nets, was really boosted by the development of what is called the perceptron model back in 1958, before I was born, by an American scientist called Frank Rosenblatt at Cornell. A number of these neurons are connected together in a layer, and perceptron learning essentially updates the weights which are scaling up or scaling down the inputs. The learning process follows an updating procedure where, if the desired output is greater than the computed output, we increase the weights, and if the desired output is less than the computed output, we decrease the weights. The simple perceptron learning rule attracted a lot of attention, and many applications were shown to be possible using it. By connecting a number of these neurons in a processing layer, many pattern classification applications were demonstrated. The updates to the weights come in terms of a gain factor and the inputs; as I said, we increase the weights or decrease the weights according to the difference between the computed output and the desired output. The assumption here is that it is a supervised learning rule, where we have at least a small sample of data available for which the corresponding classes are known. A related model is the adaptive linear element (ADALINE) network proposed by another scientist, Bernard Widrow, at Stanford. As far as the neural network is concerned, the knowledge relating the data and the classes lies in the weights. So once we can estimate the weights, it means we are able to learn and apply the classifier. There are a large number of examples, that is, two-class classification problems, which were demonstrated to be successfully classified using these networks, such as the Boolean operations AND, OR, NOT and so on.
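Here is a small illustrative implementation of the perceptron learning rule just described, assuming a hard-limiter output and the Boolean AND function as the training data; the variable names and parameter values are my own choices for the sketch.

```python
import numpy as np

def train_perceptron(X, d, epochs=100, eta=0.1):
    """Perceptron learning: hard-limiter output, weights nudged by
    eta * (desired - computed) * input whenever the output is wrong."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1 if np.dot(w, x) + b > 0 else 0  # hard limiter
            w += eta * (target - y) * x           # increase if desired > computed
            b += eta * (target - y)
    return w, b

# the Boolean AND function is linearly separable, so the rule converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, d)
print(w, b)
```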
So by estimating the weights we could apply any of these inputs and get the corresponding classes. These are in general called linearly separable problems or linearly separable functions; that is, whatever be the dimension of the data space, a separating surface of one dimension less can separate the data into the various classes. So in a two-dimensional plane the classifier would be a line: one side of the line will be class one, the other side will be class two, and even if you have multiple classes the classifier can still handle them, so long as it is possible to separate the data into different zones by these linear discriminants, or by hyperplanes in a multi-dimensional space. Essentially, computing the equations of these hyperplanes is the task of classification, so that we are confining the data into different zones according to the discriminant functions or the hyperplanes. The problem comes when the data is not linearly separable. We have an example here: it is category one when both features of the data are high or both are low, and if one is high and the other is low it is the second class. This is the Boolean exclusive OR operation, or XOR operation. One can find practical situations: we say like poles repel and opposite poles attract, and that is exactly the situation here; two north poles or two south poles repel each other, one north pole and one south pole attract each other. So we have a two-class problem which cannot be handled by the simple perceptron classifier. Incidentally, in the late 60s a pair of very famous MIT professors very strongly criticized the whole perceptron research itself, and that set back progress in neural networks so much that for the next 15 years very little happened. And since MIT people were considered next to God, if they objected, the funding agencies also stopped funding. Then another group of scientists in the mid 80s showed that the same perceptron elements, configured differently, could handle any classification problem, whether the data was linearly or non-linearly separable. Before we come to that, just a pictorial representation of what we mean by learning: we are essentially adjusting the weights so that we are moving towards a state where the error associated with the classification is a minimum. In other words, if we can find the gradient of the error function, we move along the negative gradient and come towards a minimum of the error function. So we are trying to look for the minima of the error function. The risk, however, is that we do not know whether we will reach this minimum or that minimum. So the updates to the weights happen according to the gradient of the error function: whatever the gradient magnitude is, the negative of it, with a scale factor, is added to the original weights. Now, as I was saying, it was shown that the perceptrons, instead of one layer, if one had an extra layer in between before connecting to the output, were perfectly capable of classifying really any type of data. There were mathematical proofs to say that a neural network of this configuration, where you have what is called an input layer with an intermediate layer or a hidden layer before the output, could in principle solve any classification problem. The problem was that many people had conceptualized such a structure, but they could not come up with a learning rule to train a network of this type.
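The gradient-descent idea just described, where the negative of the gradient scaled by a gain factor is added to the current weights, can be sketched as follows. The quadratic error surface here is a toy example purely for illustration.

```python
import numpy as np

def gradient_descent_step(w, grad_fn, eta=0.01):
    """One learning step: move the weights along the negative gradient of the error."""
    return w - eta * grad_fn(w)

# toy error surface E(w) = (w0 - 3)^2 + (w1 + 1)^2 with gradient [2(w0 - 3), 2(w1 + 1)]
grad = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
w = np.array([0.0, 0.0])
for _ in range(200):
    w = gradient_descent_step(w, grad)
print(w)  # converges towards the minimum at (3, -1)
```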
So a group of scientists led by Professor David Rumelhart came up with a learning rule, after which there was no looking back. Really, from the late 80s neural net applications have grown at an exponential pace. So essentially from the inputs we have the first level, the inputs to the intermediate layer, whose outputs go to the output layer, and the output of this final neuron is the output of the overall network. The intermediate set of neural elements comprises what is called the hidden layer. What do we mean by hidden? The end user looks at the inputs that are being provided to the network and observes the output coming out; the intermediate neurons are in essence hidden from the end user. That is why the term hidden layer or hidden neurons is used, essentially to describe those elements which are neither at the input nor at the output. So the process still remains the same: how to compute the link weights which connect the different neurons. We still have to minimize the error between the desired output and the computed output; this philosophy remains the same, and we will skip the equations. Basically what we are trying to achieve is to compute the update for our link weights, which is a function of the derivative of the error, which in turn is computed in terms of the output that the network provides; and the activation function that I mentioned earlier, the one with the nice derivative, makes it easier to come up with the learning rules. So even for the multi-layer neural network it is possible to update the weights in this manner. However, the complexity is that we know the desired output and can compare it with the computed output at the final layer of the network, but at the hidden layers we only know the computed output and it is difficult to say what the desired output should be. So how to adjust the weights of the intermediate neurons is a challenging task. In the forward pass, whatever inputs are provided to the neural network get scaled and added, and they pass from left to right, from the input layer progressively towards the output layer; the learning, or the adjustment of the weights, happens in the backward phase, that is, the error between the desired and the computed outputs is propagated backwards, and thus comes the famous name of the error back-propagation algorithm. The BP algorithm threw open the neural network domain to hundreds of applications in all areas of engineering and finance. So wherever we start estimating the weights, we move progressively towards a point where the error of classification, that is the difference between the desired and the computed outputs, goes to a minimum. As we can see, the weights cannot come in one step; one has to estimate them over several cycles. Starting with some initial weights, which may be taken from simple pseudo-random number generators, they are iteratively corrected, and in the process the error curve also goes down. There may be phases when the error hardly changes and then it may move; there may be an initial quite rapid fall in the error, and then for a long time the error may remain unchanged, then it may fall again. This kind of behavior is pretty common. The risk, however, is that one may stop at a local minimum of the error and not the global minimum. This is a problem with all gradient-based estimators.
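A compact, illustrative version of error back-propagation for a network with one hidden layer, trained on the XOR problem mentioned earlier. The learning rate, number of hidden neurons and epochs are arbitrary choices for the sketch, and convergence is not guaranteed for every random initialization, as the lecture itself points out.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_mlp(X, D, n_hidden=4, eta=0.5, epochs=10000, seed=0):
    """Error back-propagation for a network with one hidden layer.
    The forward pass runs left to right; the output error is propagated
    backwards to give the updates for both layers of weights."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1, 1, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-1, 1, (n_hidden, D.shape[1])); b2 = np.zeros(D.shape[1])
    for _ in range(epochs):
        # forward pass
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        # backward pass: delta terms use the sigmoid derivative y * (1 - y)
        d_out = (D - Y) * Y * (1 - Y)
        d_hid = (d_out @ W2.T) * H * (1 - H)
        # weight updates proportional to the back-propagated error
        W2 += eta * H.T @ d_out;  b2 += eta * d_out.sum(axis=0)
        W1 += eta * X.T @ d_hid;  b1 += eta * d_hid.sum(axis=0)
    return W1, b1, W2, b2

# the XOR problem, which a single-layer perceptron cannot learn
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_mlp(X, D)
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2))  # outputs close to 0, 1, 1, 0
```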
The back-propagation rule as it is formalized has no way of escaping a local minimum. By chance it may come to the global minimum, or it may stop at a local minimum. There are of course lots of thumb rules for jumping over a local minimum and going on to the global minimum, but it requires a considerable amount of trial and error. One issue in using a neural network of such a structure is: for a given problem, how do we come up with the configuration of the network? The number of inputs is in our control, the number of outputs is obviously in our control, but for what goes in between, that is the hidden elements, there is no theory to precisely quantify the number of hidden layers or hidden nodes or hidden neurons that should be there for a given problem. There is some theory available to compute the capacity of a given network, but a theory to find the most appropriate network for a given problem is still not available. So one may have to go through a few cycles to come up with the best possible configuration; one should not have too few elements in the hidden section nor too many. Where we start learning, that is with what initial weights we start, also has some role in achieving success with neural networks trained by the back-propagation algorithm: either we may get trapped in a local minimum of the error surface or we may reach the global minimum. So genetic algorithms or other global optimization methods can be employed to have a smart initialization of the network weights; genetic algorithms are one option, and there may be others also, like simulated annealing. Another issue is that when we are updating the weights there can be the problem of oscillation: the error may go down in one iteration and go up again, and accordingly the weights also will be changing drastically from iteration to iteration. To dampen these oscillations one often adds what is called a momentum term, which minimizes the difference between the weights over consecutive iterations. So the user has to specify the value of this momentum term as well as the gain, or the scale factor, which multiplies the derivative of the error function. Essentially the correction to the weights comes from the derivative of the error with respect to the weights. The scale factor and the momentum term are both user specified. If the gain factor is very large the weights change quite substantially; if the momentum term is very large the rate of change of the weights may be a bit slow. So depending on how the network is learning, sometimes one alters the values of the gain factor and the momentum in between. There are lots of thumb rules when one is dealing with neural networks. The two most common problems: one is that during the learning phase the error does not come down at all; it remains substantially high even after a very long learning time, and that is problem number one. Problem number two is that the network may perform well with the samples or prototypes that you present, but once it is given new data it fails to generalize. These are the two problems which confront the user of neural networks. So as I said, there are several thumb rules. If the error is not decreasing sufficiently, what should be done? One can train it for a longer time, or change the initial weights completely and repeat the process, or one can alter the learning rate and the momentum. These are some of the things which can change the rate of convergence during the learning phase.
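A sketch of the weight update with a momentum term as described: the current change is the scaled negative gradient plus a fraction (the momentum) of the previous change. The toy error surface and the values of the gain and momentum are illustrative assumptions.

```python
import numpy as np

def update_with_momentum(w, grad, prev_delta, eta=0.05, alpha=0.9):
    """One weight update with momentum: the change is the scaled negative
    gradient plus a fraction of the previous change, damping oscillations."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

# toy elongated error surface E(w) = 10*w0^2 + 0.1*w1^2, where plain
# gradient descent tends to oscillate along the steep w0 direction
grad_fn = lambda w: np.array([20 * w[0], 0.2 * w[1]])
w, prev = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    w, prev = update_with_momentum(w, grad_fn(w), prev)
print(w)  # close to the minimum at (0, 0)
```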
If the problem is very complex, that is what we are trying to learn is very complex, sometimes we need more nodes, more neurons, in the hidden section. If the problem is simple and you have used too many nodes in the hidden layer, then also the learning can be very slow. So one has to have a certain amount of understanding of the problem space for configuring the network, and depending on the situation one may have to take a variety of approaches to handle the problem. Learning but not generalizing is more often the case. If your training samples are adequate you will be able to learn quickly, but if the network is not properly trained then one gets into the problem of poor generalization, that is, unknown new samples will not be classified accurately. Very often the neural network suffers from the problem of what is called over-learning or over-training, which can give rise to this problem, so one has to know when to actually stop. As I said, my test data comes from satellite images, so I have a land use land cover classification example based on the supervised classification approach. The user has to identify the classes, provide the training data, that is some pixels from the images and their corresponding classes, and then configure the network. The dimensionality of the input data gives the input layer size, and the number of classes defines the output layer size, so those are easy to determine; but how many elements should be in the hidden section, how many layers, how many elements in each layer, this you can say is partly a science and partly an art in image classification, along with the specification of the gain and momentum terms. Most often, at least in image classification applications, no more than two hidden layers would suffice, and often one hidden layer is also enough in many cases. Now there is one issue which we learned from experience, particularly when we were working with the satellite images: the training data should not be presented class by class. We had, say, a few hundred pixels for every class, and initially we presented all samples for class one, trained the network, then all samples of class two, and so on. What happened in the process was that, after all the samples were presented, whichever class was presented at the end, the network was able to generalize well for that class, but it performed very poorly on the other classes. So what we had to do in such a case was to shuffle all the training samples so that the training data was presented to the network in a random order, so that the network was essentially able to remember the relation between the training data and the corresponding class, but not the order in which the samples were presented. Okay, we have a typical example of a satellite image with different agricultural areas: this is the Ganges river, the town of Varanasi nearby, some wheat fields here, rice and masoor, some forest in the south. So this is a typical satellite image and the corresponding classification: when you assign all the pixels to various classes you can color code the result and display it nicely as a colorful picture. We can now quantify the area under various categories by counting the pixels and multiplying by the ground area of each pixel, and if you produce such images over different dates you can monitor the change that is happening on the ground.
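Returning to the point about presentation order: here is a minimal sketch of shuffling the training samples each epoch so the network sees the classes interleaved rather than one class at a time. The `train_step` callback stands in for whatever per-sample weight update is used and is purely illustrative.

```python
import numpy as np

def epochs_with_shuffling(X, y, n_epochs, train_step, seed=0):
    """Present the training pixels in a random order each epoch instead of
    class by class, so the network learns the pixel-to-class relation
    rather than the order of presentation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        order = rng.permutation(len(X))
        for i in order:
            train_step(X[i], y[i])  # one weight update per sample

# usage sketch with a dummy update that just records the order of labels seen
seen = []
epochs_with_shuffling(np.zeros((6, 4)), np.array([0, 0, 1, 1, 2, 2]),
                      n_epochs=1, train_step=lambda x, c: seen.append(c))
print(seen)  # classes appear interleaved, not grouped
```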
This is one of the most important applications of satellite images, that is monitoring what is happening on the ground periodically, and one of the follow-ups after classification is to estimate the error. Out of your reference data, how many of the elements are correctly classified? An error matrix, or what is called a confusion matrix, would give you an indication of how well the network has been able to generalize after learning; this comes from test samples which were not part of the training samples. Another example where the task was to identify boundaries and line-like features: one image was from Bombay, another from Bangalore, and we have all the line-like structures well extracted by the neural net. This is in Bangalore; this is from a more recent satellite which gives a lot more detail compared to the older satellites, so you can see it like a collection of objects, buildings or a swimming pool, and the marshy vegetation near the Thane Creek. This also is classified, and we have different classes like roads and buildings and water and so on. So the problem here is mapping the data into classes by classification, and without changing the framework much, one could use the same structure for estimating any continuous variable in a regression framework. As I was saying earlier, if we can initialize the network weights in a more efficient and smart way instead of randomly, one may be able to achieve faster convergence and also better accuracy. Genetic algorithms can be used in one such case, whereby we start with populations of neural net weight sets, and using the genetic operators, that is by applying crossover and mutation, one could update the population. The fitness function in genetic algorithms is application dependent; in the case of neural net weight estimation the objective function is based on how accurate the weight set would be in classifying my training data. So the fitness of a weight set would be just the accuracy of classification of my training data: the higher the accuracy, the higher the fitness of the weight set. The advantage of genetic algorithms is that we start with a large number of candidate solutions, and we can hope that at least one of them would be able to reach the global optimum, that is the global minimum of the error function or the global maximum of the accuracy. So if I show you the flow diagram of what happens: we initialize the population, that is we start with maybe 50 or 100 randomly initialized neural net weight sets, we perform crossover and mutation according to some probabilities, after that we assess the fitness, and based on the fitness we drop some of the poorly performing weight sets, select the better performing ones, and repeat the process.
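Finally, a rough sketch of the genetic-algorithm loop just outlined: a population of candidate weight vectors undergoes crossover and mutation, fitness is whatever classification accuracy a weight vector achieves on the training data (supplied here as a user-provided function), and the fitter individuals are carried forward. The population size, probabilities and selection scheme are illustrative choices, not the exact procedure used in the work described.

```python
import numpy as np

def evolve_weight_sets(fitness, n_weights, pop_size=50, generations=100,
                       p_cross=0.8, p_mut=0.05, seed=0):
    """Evolve a population of neural-net weight vectors by crossover and
    mutation; `fitness(w)` should return the classification accuracy the
    weight vector w achieves on the training data."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1, 1, (pop_size, n_weights))
    for _ in range(generations):
        children = pop.copy()
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:               # single-point crossover
                cut = rng.integers(1, n_weights)
                children[i, cut:], children[i + 1, cut:] = \
                    pop[i + 1, cut:].copy(), pop[i, cut:].copy()
        mask = rng.random(children.shape) < p_mut    # mutation: small perturbation
        children[mask] += rng.normal(0, 0.1, mask.sum())
        # assess fitness and keep the best individuals from parents and children
        combined = np.vstack([pop, children])
        scores = np.array([fitness(w) for w in combined])
        pop = combined[np.argsort(scores)[-pop_size:]]
    return pop[np.argmax([fitness(w) for w in pop])]
```

The returned weight vector can then serve as the smart initialization for back-propagation, in place of a purely random start.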