Namaste. Over the last few modules we have used the TensorFlow API to build models for image classification and regression problems. One of the important data types that we frequently encounter in practice is structured data, that is, data stored in databases or in tabular format in a CSV file. In this session we will study how to build models for structured data using the TensorFlow API. We will use Keras to define the model, and feature columns as a bridge to map from columns in a CSV to features for model training.

In this exercise we will load the CSV using pandas, which is a library for handling structured data in Python. We will build an input pipeline to batch and shuffle the rows using the tf.data library. Then we will map from columns in the CSV to features used to train the model using feature columns, and finally we will build, train and evaluate a model with the tf.keras library.

Let us look at the dataset for this exercise. We are going to use a dataset provided by the Cleveland Clinic Foundation for Heart Disease. The idea is to build a model to predict whether a patient will suffer from heart disease. There are several hundred rows in the CSV file. Each row describes a patient and each column describes an attribute of the patient. You can see that the dataset has a mix of data types across its columns: most of the columns are numerical, but there are also a few categorical columns. The final variable is the target, which holds the classification label. We use label 1 if the patient has heart disease, otherwise we use label 0.

We will first install the scikit-learn package, which we will use for splitting the data into training and test sets. Next we will install TensorFlow 2.0 and import tensorflow along with associated modules like feature_column and layers. From scikit-learn we will use the train_test_split function. Apart from that, as usual, we will use the numpy and pandas libraries for manipulating the data. We are using pandas for the first time in this course; pandas is a Python library that is useful for handling structured data. Let us run this particular cell.

Now we will download the dataset and read it with pandas. We will read the CSV file directly from the URL using the pandas read_csv function and print the top 5 rows of the dataframe. You can see the top 5 rows on the screen: the second patient in this record has heart disease, while the rest of the patients appear to be healthy. You can also see attributes like age, sex, cholesterol level and other associated attributes in the table.

Next, now that we have loaded the dataset in memory, we will split it into train, validation and test sets. We will use the train_test_split function from scikit-learn to split the data and print statistics about the training, validation and test examples. We have 193 training examples, 49 validation examples and 61 test examples. The test data will never be exposed to the model during training. So model building will happen on the 193 training samples, we will validate the model on the 49 samples, and we will finally test its performance on the 61 examples. We have loaded our dataset into a pandas dataframe, and now we will wrap the dataframe with tf.data.
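As a rough sketch of the steps narrated so far: the imports, the download, and the train/validation/test split. The URL shown is the one used in the public TensorFlow structured-data tutorial that this exercise follows; if your notebook uses a different source, substitute it.

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# Read the Cleveland heart disease CSV directly from a URL into a dataframe.
URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
print(dataframe.head())  # inspect the top 5 rows

# Split into train, validation and test sets.
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
```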
This will enable us to use feature columns as a bridge to map from the columns in the dataframe to features in the model. If you were working with a very large CSV file, you could also use tf.data to read it from disk directly; since we are dealing with a small dataset, we are not going to cover that functionality in this tutorial.

Let us look at the df_to_dataset method that we are using for wrapping the dataframes with tf.data. In this function, we first copy the dataframe so that any changes we make are not persisted in the original. Next we remove the target column, the column that contains the classification label, and store it in a labels array. Then we create a dataset from tensor slices; the tensor slices are created from the dictionary representation of the dataframe and the labels column. We shuffle the dataset if required: we pass a flag for shuffling, and if the flag is true we shuffle the dataset. Finally, we batch the tensors into batches of the size specified by the batch_size variable and return the batched dataset, which will later be consumed during model training.

Now let us convert the dataframes to datasets and see whether the conversion is as per our expectation. In this small piece of code we use a batch size of 5 and convert the training, validation and test sets into datasets. Now that we have created our input pipeline, let us call it to see the format of the data it returns. We have used a small batch size of 5 to keep the output readable. Let us run this code to see what the dataset returns. You can see the list of features printed from this statement. Then you can see 5 values from the age column, and we also have 5 values from the target. Look at the shape and data type of these tensors: both of them are vectors containing exactly 5 elements, and the data they contain is a 32-bit integer.

TensorFlow provides many types of feature columns. In this section we will create several types of feature columns and demonstrate how they transform a column from the dataframe. We create a utility method that creates a feature column and transforms a batch of data.

Let us look at the first type of feature column, for numeric attributes. A numeric column is the simplest type of column. It is used to represent real-valued features. When we use this column, our model will receive the column value from the dataframe unchanged; numeric values are passed as they are to the model. We use feature_column.numeric_column to convert numeric columns into feature columns. Age is a numeric attribute, so we use feature_column.numeric_column to transform age. Let us look at what it returns. You can see that this utility method returned a vector containing 5 elements; because we used a small batch size of 5, we have 5 examples printed on the screen. In our dataset we have seen earlier that most of the columns are numeric.
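Here is a sketch of the df_to_dataset helper and the demo utility described above, continuing from the earlier imports. The structure follows the public TensorFlow tutorial; treat the details as illustrative.

```python
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()       # avoid mutating the caller's dataframe
    labels = dataframe.pop('target')   # separate out the classification label
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)          # batch the tensors for training
    return ds

batch_size = 5  # small batch size so the printed output stays readable
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

# A utility that applies a feature column to one example batch.
example_batch = next(iter(train_ds))[0]

def demo(fc):
    feature_layer = layers.DenseFeatures(fc)
    print(feature_layer(example_batch).numpy())

demo(feature_column.numeric_column('age'))  # numeric values pass through unchanged
```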
Now, we also have some columns that are categorical, and we cannot feed non-numeric data to TensorFlow. Hence, we need to convert the non-numeric data into numbers. Let us look at some of the ways in which we can use feature_column to convert non-numeric data into feature columns.

The first option we have is called a bucketized column. Here, instead of feeding a number directly to the model, we split its values into different categories based on numerical ranges. Consider the age of a person. Instead of representing age as a numeric column, we could split the age into several buckets using bucketized_column. A bucketized_column represents age as a one-hot vector based on which range the value falls into. In this case there are 11 ranges: the first range is for all ages below 18, then 18 to 25, 25 to 30, and so on up to 60 to 65, and finally more than 65. Let us look at how the first 5 numeric ages are converted into a bucketized column. You can see that where the age is 60, the 10th value is 1; where the age is 65, the 11th value is 1. If you count, there are 11 values in each vector; we use a vector of length 11 to represent the ranges because there are 11 ranges based on the boundaries we selected, and you can see how the other values are represented in the same way.

In this dataset there are also several categorical features represented by strings like fixed, normal or reversible. Again, we cannot feed strings directly into the model; we must first map them to numeric values. We have already seen one-hot encoding as one possible way to encode string values. In this case we will use a categorical vocabulary column as a way to represent a string as a one-hot vector. The vocabulary can be passed either as a list or loaded from a file; depending on which you use, there are two methods: categorical_column_with_vocabulary_list and a similar one, categorical_column_with_vocabulary_file.

Let us look at converting a categorical column with a vocabulary list. Here thal is an attribute which has 3 possible values: fixed, normal and reversible. We use feature_column.categorical_column_with_vocabulary_list to convert the thal attribute into a one-hot encoded feature. We first create the categorical column with the vocabulary list, and then pass that feature column to indicator_column to get a one-hot encoded representation. Let us look at the one-hot encoding of the thal column. In each row, exactly one position is 1 and the rest of the positions are 0; this is exactly how one-hot encoding works, as you may recall from the previous classes.

Here, most of the categorical attributes do not have many distinct values, but in real life you will often encounter datasets with a large number of strings in a column, and that could happen for multiple columns. TensorFlow also provides mechanisms to encode columns with a large number of possible string values.
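Continuing the same sketch, here are the bucketized and vocabulary-list columns just described, using the demo utility from before. The bucket boundaries are the ones named in the lecture.

```python
# Bucketize age with the boundaries described above: 11 ranges in total.
age = feature_column.numeric_column('age')
age_buckets = feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)

# One-hot encode the thal column from its known 3-value vocabulary.
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
demo(thal_one_hot)
```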
There are a few mechanisms, like embedding columns or hashed columns, that help us convert a column with a large number of strings into numbers. What happens if we have thousands of values per category? We do not want to use one-hot encoding here, because that representation would be extremely sparse. Instead of one-hot encoding, we use what is called an embedding column. An embedding column represents the data as a lower-dimensional dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding is a parameter that must be tuned. In this example we are creating an embedding for the thal column and using 8 dimensions. Let us see how it works. Compare this with the one-hot encoding, where exactly one value was 1 in each row: here you can see a dense representation in which thal is converted into 8 values, and each value can be any number, positive or negative.

Apart from the embedding column, we have another way to represent a categorical column: the hashed feature column. The central idea here is hashing. This feature column calculates a hash value of the input and then selects one of the buckets to encode the string. When using this column, we do not need to specify a vocabulary, and we can choose to make the number of buckets significantly smaller than the number of actual categories to save space. One important thing to note while using this technique is its downside: there may be collisions, in which different strings get mapped to the same bucket. In practice, however, this scheme works quite well for some datasets. Let us convert thal into numbers using hash buckets. We use categorical_column_with_hash_bucket to convert thal into hash buckets of size 1000, and then convert the hashed representation into an indicator column. We will again get a one-hot-encoding-like representation after converting the feature column into an indicator column. You can see that we now have a one-hot encoding with a vector of 1000 entries; only one of them will be 1, based on the bucket ID into which the value was hashed.

In the past we also saw that in order to construct complex decision boundaries we often need to cross columns. Combining features, commonly known as feature crossing, is a popular way to build complex decision boundaries. After crossing the features, we create a new feature that is the cross of the two original features. In this case we cross two columns: the age buckets that we created through bucketization, and the thal value. The crossed column does not build the full table of all possible combinations, because that could be very large and take a lot of memory. Instead, a hashed column is used to hash the values coming out of the crossed column. You can see here that we simply use crossed_column to cross the two columns, and you have to specify the hash bucket size. After crossing the values in the columns, hashing is carried out automatically based on the bucket size you specify here.
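A sketch of the three columns just discussed, continuing from the earlier definitions of thal and age_buckets; the dimension and bucket sizes are the ones mentioned in the lecture.

```python
# Embedding column: a dense 8-dimensional representation of thal.
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)

# Hashed feature column: no vocabulary needed, 1000 hash buckets.
thal_hashed = feature_column.categorical_column_with_hash_bucket(
    'thal', hash_bucket_size=1000)
demo(feature_column.indicator_column(thal_hashed))

# Crossed column: cross the age buckets with thal. The combinations are
# hashed into the specified number of buckets rather than materializing
# the full cross-product table.
crossed_feature = feature_column.crossed_column(
    [age_buckets, thal], hash_bucket_size=1000)
demo(feature_column.indicator_column(crossed_feature))
```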
Then we can convert the hashed representation of the crossed feature to an indicator representation using the indicator_column command. You may recall that indicator_column is used to create a one-hot encoded representation of a feature.

We have now studied a few methods to convert non-numeric features into numbers: one-hot encoding using a vocabulary list or a vocabulary file, and hashing and embedding techniques to convert strings into numbers. For numerical attributes we looked at numeric columns, and we also used bucketized_column to split a number into various buckets and obtain a bucketized representation of that column. We also looked at how to construct feature crosses and represent the crossed features using crossed_column.

Next we will choose the columns to use for training a model. Here we select a few columns somewhat arbitrarily to train our model. If your aim is to build an accurate model, you should take a larger dataset and think carefully about which features are most meaningful for your model, and then include only those features or construct meaningful features from the given representation. We define feature_columns as a list to hold the features that we are going to use. For all the numeric columns we use the numeric_column function to construct numeric feature columns. Then we construct a bucketized feature column for age based on the boundaries given here. We construct an indicator feature column and an embedding feature column for thal, and we cross the age buckets with thal to construct a crossed feature column.

Now that we have defined our feature columns, we will use a DenseFeatures layer to input them to our Keras model. The DenseFeatures layer takes the feature columns as input. Let us create a model based on the feature columns defined earlier. First we will create a baseline model with logistic regression. The logistic regression model is constructed with tf.keras.Sequential, in which we specify the feature layer followed by an output layer with one unit and a sigmoid activation. Once we define the logistic regression model, we compile it with the Adam optimizer. We use binary cross-entropy as the loss because we are solving a binary classification problem, and we will use accuracy as the metric to track during training. Finally we will find the loss and accuracy of the model on the test data.

Let us run this code: we construct the model and compile it, and you can see that the model is being trained. After 5 epochs the model has an accuracy of 71 percent on the training set, and the validation accuracy is slightly higher at 77 percent. Let us look at the accuracy on the test set: we got an accuracy of 75 percent. So this is our baseline model. Let us now try to build a neural network model.

It is always a good idea to build a baseline model. For classification problems, logistic regression serves as a good baseline; if you are solving a regression problem, always start with a linear regression model as the baseline. A baseline model also helps us understand what kind of performance we can obtain just using the data that is given to us.
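Here is a sketch of the feature column selection and the logistic regression baseline. The exact list of numeric column names is an assumption based on the dataset's standard schema, and the training batch size of 32 is the usual choice in the tutorial this follows; adjust both to match your notebook.

```python
# Collect the chosen feature columns.
feature_columns = []

# Numeric columns (assumed names from the heart disease schema).
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
    feature_columns.append(feature_column.numeric_column(header))

# Bucketized, indicator, embedding and crossed columns from earlier.
feature_columns.append(age_buckets)
feature_columns.append(thal_one_hot)
feature_columns.append(thal_embedding)
feature_columns.append(feature_column.indicator_column(crossed_feature))

# Re-create the input pipelines with a training-sized batch.
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

# Baseline: logistic regression = feature layer + one sigmoid output unit.
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=5)

loss, accuracy = model.evaluate(test_ds)
print('Test accuracy:', accuracy)
```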
We can then use a bunch of strategies to improve the performance, and doing this baselining also helps us understand how much each new strategy improves the performance of the model.

Let us build a neural network model and look at its structure. We had a logistic regression model as the baseline, and now we have a neural network model. In the logistic regression model we had a bunch of feature columns and exactly one output, with sigmoid as the activation function, which gives us the probability of a patient having heart disease. In the neural network, we take the same features and set up a network with 2 hidden layers, each containing 128 units, and finally a single output node. The output node uses a sigmoid activation, and the hidden layers use ReLU as the activation. So the architecture is a feed-forward neural network with 2 hidden layers of 128 units each with ReLU activations, and an output layer with a sigmoid activation. We use Adam as the optimizer, binary cross-entropy as the loss because we are solving a binary classification problem, and we track accuracy as a metric. After compiling the model we run a training loop with train_ds as the training data and val_ds as the validation data, and we train for 5 epochs (a sketch of this model is given at the end of the section).

Let us define, compile and train the model. You can see that we get an accuracy of about 75 percent at the end, and the validation accuracy is still slightly higher. Let us look at the model summary. The model summary helps us see what kind of model we have set up, and we can also see the number of parameters of the model. The total number of parameters in the model is quite large: about 148k parameters. Let us evaluate the accuracy of the model on the test data. We get an accuracy of 75 percent on the test data.

You will typically see the best results with deep learning on much larger and more complex datasets. When working with a small dataset like this one, we recommend using other classifiers like a decision tree or random forest as a strong baseline. The goal of this exercise was to demonstrate the mechanics of working with structured data, so that you have some idea of how to proceed when you start working on your own. The best way to learn more about classifying structured data is to try it yourself on some dataset. I would strongly encourage you to find a structured dataset and apply the concepts that we studied in this session. To improve the accuracy, think carefully about which features to include in the model and how they should be represented.

In this module we learnt how to build ML models for structured data with the TensorFlow API. We built a logistic regression model followed by a neural network model for prediction of heart disease in a patient. We also learnt how to read features from structured data and convert them into feature columns. You are now equipped with a lot of potent tools to build your own machine learning models for a variety of data types. I hope you had fun learning these concepts. See you in the next module. Thank you.
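For reference, here is the sketch of the neural network model described above, reusing the feature_layer and datasets from the previous sketch; as before, treat it as illustrative rather than the exact notebook code.

```python
# Feed-forward network: two ReLU hidden layers of 128 units each,
# followed by a single sigmoid output unit.
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=5)

model.summary()  # inspect the layers and parameter counts

loss, accuracy = model.evaluate(test_ds)
print('Test accuracy:', accuracy)
```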