First of all, thank you very much to the organizers for inviting me here. Today I'm going to talk about applied deep learning. So who am I? I'm working as a data scientist in Berlin. I love machine learning, I'm a Kaggle junkie, and my research interest is automatic machine learning. I like handling text data a lot, and of course, I like big data. So before we start with applied deep learning, we all want to know: what is deep learning, right? Deep learning, for me and I think for most of you, is now a buzzword. If you want a job as a data scientist, you must put deep learning in your resume, and you're going to get selected for an interview. To me, deep learning is neural networks, all kinds of neural networks: not shallow, but large neural networks. And deep learning helps us remove the manual feature extraction step. When you were dealing with images ten years back, you were using features like SIFT or SURF. You don't need them anymore, because deep learning takes care of that. Many people might think it's a black box: "Oh, we are not going to use neural networks, it's a black box." It's not, and we'll see why. So a bit of history is always good, right? We should learn how deep learning machines have evolved, how ConvNets have evolved. In 1989, we had something like this, with convolutional layers and sub-sampling, then convolution again, and so on. Then in 2012 we had max pooling, which improved ImageNet accuracy a lot. But we are humans, we always want something more, so we need to go deeper with neural networks. Then Google came up with GoogLeNet, which doesn't even fit on my screen, with a lot of inception modules, a lot of convolution, softmax, and pooling layers. And from 2012 to 2016, there have been hundreds of papers discussing neural network techniques. So we'll go one by one, OK?
And then you'll be like this bored cat if I start talking about all the deep learning techniques of the last four years. So let's start with something interesting: what can deep learning do? It can classify images into different categories: butterfly, dog, volleyball. People love classifying dogs versus cats. And R-CNNs can classify different parts of an image, and they're very fast. And also: is it a muffin or is it a dog? It's difficult even for humans, right? Don't eat the dog. Or maybe a pug, or a loaf of bread. Deep learning can also be used in natural language processing with text data, for sentiment analysis; there's been a lot of research in that field. And also for speech processing: if you're using Siri and it comes up with stupid jokes, that's machine learning and deep learning. It's everywhere these days. So the question is: how can I implement my own deep nets? By the end of the talk, if you haven't implemented your own deep nets yet, I think you will be able to do so. To implement a network on your own, once you know what kind of deep network you're planning to build, you decompose it into smaller parts, implement the layers one by one using Theano or something similar, and then start the training. But there is something more, called fine-tuning: you can save yourself some time and fine-tune a pre-trained network. For fine-tuning, most of the time you need to convert the data; that's one of the steps. The next step is the network definition: what does your network look like? Do you want to fine-tune AlexNet, GoogLeNet, VGG? And you need to define a solver that trains the net. I'm telling you all this because I'm going to talk about Caffe, developed by the Berkeley Vision and Learning Center, and Keras, which is a Python library.
First, I'll start with Caffe. How many of you have used Caffe? How many of you know about Caffe? Great. So why should we use Caffe? It's very fast, it's open source, and it's modular. And it's expression-based, so even if you don't know much coding, you can implement your networks in Caffe very, very easily. And it's developed by a community, which is the best part: there's always something new implemented in Caffe. For Caffe, you need to convert your data. Caffe uses LMDB, which is a kind of database for images; if you have a lot of images and you want to train faster, you need to convert your data first. You need to define a network, which is a prototxt file (I'll come to that), and a solver, as I said earlier, which is also a prototxt. And then you train the network, with or without pre-trained weights. Prototxt is actually very, very simple. Here is one of the solvers. You define where the net definition is stored, which is the network prototxt or train prototxt; here it references train_val.prototxt. Then you have the number of test iterations, and after how many iterations you want to test your network on the validation set, which is the test interval. Display is one, so after every iteration you see the results. You can define the average loss window, the learning rate, what kind of learning rate policy you're using, and other parameters like gamma and momentum. The snapshot prefix is where you want to store your model: after every 100 or 1,000 iterations you want to save it somewhere, so you give a location there. And the best part of Caffe is that it can run on a GPU or a CPU; you just change it to CPU if you don't have a GPU. It's going to be slow, though. Then we come to the train prototxt. In the train prototxt, you define a network name; here it's LeNet. Then you define each layer's name and what kind of input you're using.
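A minimal sketch of such a solver prototxt, with the fields described above; all values here are illustrative placeholders, not the exact file from the talk:

```
net: "train_val.prototxt"        # where the network definition lives
test_iter: 100                   # batches per validation run
test_interval: 500               # validate every 500 training iterations
display: 1                       # print the loss every iteration
average_loss: 40                 # smoothing window for the displayed loss
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
max_iter: 10000
snapshot: 1000                   # save the model every 1,000 iterations
snapshot_prefix: "models/mynet"  # where snapshots are stored
solver_mode: GPU                 # change to CPU if you have no GPU
```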
Here the input is a data layer, with the shape of the input, and then you just carry on writing the layers. A convolutional layer whose bottom is data, so the data flows into the convolutional layer. Then you have the number of outputs associated with it, the kernel size, the stride, and what kind of weight and bias initialization you want. In a similar way, you have pooling layers; here it's using max pooling with a kernel size of two and a stride of two. After that, you can keep adding as many convolutional and pooling layers, or whatever layers you want. At the end you have something like an inner product layer, which is a dense layer, with ten outputs; in this case it was training on MNIST data, so ten outputs. And then softmax, because you want probabilities. Now, training a net using Caffe is very, very complicated and requires a lot of coding... and that's it, actually: you install Caffe, ask it to train, tell it where your solver is, and it will start training. Then you wait and see awesome results. But somebody might say: fine-tuning. OK, I don't want to write my network from scratch. So what's fine-tuning? Google has developed GoogLeNet, there's been a lot of research going on, and they publish what kind of convolutional neural nets they are using. You can just get the weights and fine-tune those weights for your data. So here I'm going to talk about how to fine-tune using GoogLeNet; this is one of the use cases you will see. Why GoogLeNet? Because it has Google in its name, and Google is always awesome. It won the ILSVRC challenge, which is the ImageNet challenge. And it's very complicated, so always go for complicated stuff: you will learn a lot. And Caffe has something called the Model Zoo.
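The train prototxt walked through above might look roughly like this; the layer sizes and filler choices are illustrative, in the spirit of the LeNet example, not a copy of the actual slide:

```
name: "LeNet"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 64 dim: 1 dim: 28 dim: 28 } }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"          # data flows into the convolutional layer
  top: "conv1"
  convolution_param {
    num_output: 20        # number of output feature maps
    kernel_size: 5
    stride: 1
    weight_filler { type: "xavier" }     # weight initialization
    bias_filler { type: "constant" }     # bias initialization
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param { pool: MAX kernel_size: 2 stride: 2 }
}
layer {
  name: "ip1"
  type: "InnerProduct"    # the dense layer at the end
  bottom: "pool1"
  top: "ip1"
  inner_product_param { num_output: 10 }   # ten classes for MNIST
}
layer {
  name: "prob"
  type: "Softmax"         # softmax, because you want probabilities
  bottom: "ip1"
  top: "prob"
}
```

Training is then one command, something like `caffe train --solver=solver.prototxt`, with an extra `--weights=bvlc_googlenet.caffemodel` when fine-tuning from pre-trained weights.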
From the Model Zoo you can download all these pre-trained networks and their weights, and then you can play with them, including GoogLeNet; you can get it from Caffe's GitHub. The next thing is one of the use cases. This was actually a challenge; as I said, I like taking part in machine learning challenges: distinguishing between a honey bee and a bumble bee. The training data set had somewhere around 4,000 images, with 1,000 images in the test set, so it's a very small data set, with 79% positive and 21% negative samples. The evaluation metric was area under the ROC curve. So, can anybody distinguish between a honey bee and a bumble bee? It's very difficult for me. I just hate bees; they hate me too. But yeah, it's tougher than distinguishing between a dog and a cat. Why? Let's see. I started with a very simple model, which is not so simple at all: three convolutional layers, three pooling layers, dropouts, two hidden layers with 2,000 nodes each, and two outputs, because it's only honey bee or bumble bee. The last layer uses binary cross-entropy, the log loss. As for the number of epochs: OK, here the plot only shows 50, but I trained it for 100 epochs, and the error was not going down. An error of more than 0.4 is very high for binary classification. So what should I do? I thought, OK, let's try fine-tuning GoogLeNet. For fine-tuning, you have to create training and test files; since the data set is very small, the training and test files are going to be text files. I'll show you. Get the prototxt files from the Model Zoo, of course, where all the network definitions and solvers are, and modify them according to your needs. First run it as it is; if you don't see any improvement compared to your previous networks, then fine-tune: modify the parameters and run the Caffe solver, which is, again, "very complicated". The training file looks something like this.
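A minimal Keras sketch of the kind of baseline model described above (three conv and pooling blocks, dropout, two 2,000-node hidden layers, two outputs); the input size, filter counts, and activations are assumptions, since the talk does not give them:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Three conv + max-pool blocks, then two 2,000-node dense layers with
# dropout and a two-way softmax (honey bee vs. bumble bee).
# Input resolution and filter counts are illustrative guesses.
model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(2000, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2000, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),   # two classes
])
# With two softmax outputs, categorical cross-entropy is the two-class
# log loss the talk refers to.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```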
You have all the images and the labels, separated by a space. It's very easy to produce: if you have a folder with a lot of images, you can just write a simple Python script. Here I'm taking 10% of the samples for validation; you should always hold out a validation set. And that's it, actually. The next thing, after downloading the GoogLeNet prototxt files, is of course to change them. I modified the net to use an image data layer, since the data set is very small and I'm using text files listing the images, and I changed the batch size. It was using LMDB before; since I'm not using LMDB anymore, I need to define the batch size, height, and width. And if you want to test, you have a test file similar to the train file. I'm not showing everything that I changed here; you can find that on my GitHub. Then I changed the dropout ratio. If you're changing a layer, you need to change the layer name: from loss3/classifier it becomes loss_x, or whatever you want it to be. And of course the number of outputs: from 1,000 outputs to two, because we have only two classes. Similarly, you need some changes in the solver. One of them is that you need to define where your training prototxt is now. If you want to change the number of test iterations or the test interval, you can do that too. I set display to 1, because I always want to monitor what's happening, increased the average loss window, and decreased the learning rate a bit. I kept the rest of the parameters as they were. Some more changes: I changed gamma, I changed the weight decay, and I set where the snapshots get stored. So it's very simple, very easy. And that's it. Now you have the modified prototxt, and you want to see how it performs on your data set; a very difficult task. So you train it, and the only extra parameter here is the weights.
You downloaded this model from Caffe's GitHub, and you just need to supply the weights, because you want to fine-tune those weights. So OK, we fine-tuned; now we want to know, did it really help? Somebody might say GoogLeNet already has all the classes you want. It doesn't have these. These are the t-SNE projections of the data initialized with random weights: we cannot separate the green and blue points, honey bee and bumble bee. When I used GoogLeNet as it is, I got something like this: you can see a small cluster forming here, and some clusters there, but it's still not separable. And this is the fine-tuned result, so you can see what kind of accuracy you will get here: I got an AUC of 0.997. To see how GoogLeNet performs internally, you have nine such layers, and you can extract the output from each one of them. You can do it using Caffe and Python; it's very easy, and the code is available. You see something like this: the red line is random weights, blue is pre-trained, and the other is fine-tuned. Even at the early layers the fine-tuned accuracy is pretty high, more than 80%, and it keeps increasing, achieving somewhere around 95%; the pre-trained and random weights don't come close. And yeah, I've already discussed this: why fine-tune? Because it's faster. If training your neural network from scratch would cost you days, weeks, or months, it's better to use a pre-trained network if it can solve your task. It's better most of the time, because researchers from all over the world have invested so much time in it. Why reinvent the wheel? If somebody is offering you something, just take it and use it. Now the question is: how do we train a deep net in Python? For Python it's also very simple. I'm using Keras, so I'll talk about Keras. Actually, you can use Caffe too; it has a Python interface.
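Projections like the ones described can be made with scikit-learn's t-SNE; here the "features" are random stand-ins for the layer activations that would really come from Caffe's Python interface, so this only shows the mechanics:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for features extracted from a network layer for 200 bee images
# (the real features would come out of Caffe's Python interface).
rng = np.random.RandomState(0)
features = rng.randn(200, 64)

# Project to 2-D for plotting; perplexity 30 is a common default.
projection = TSNE(n_components=2, perplexity=30.0,
                  random_state=0).fit_transform(features)
# projection has one 2-D point per image, ready for a scatter plot
```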
You can use TensorFlow, Theano, Lasagne, Keras (which is my favorite), Neon by Nervana Systems, and lots more. If you really want to look into the deep learning landscape, it looks something like this; this is from June. You can see the top libraries ranked by forks: TensorFlow, Caffe, and Keras. After this, the second use case I'm going to talk about is classification of search queries. At the place where I work, I'm classifying a lot of search queries, and for that I'm using deep neural networks. I won't go into a lot of detail, but queries can be classified into three different categories. Navigational: you know where you want to go. If you search for Walmart, of course you want to go to the Walmart website; for Amazon, the Amazon website. Then you have transactional queries: you want to buy something. Say you want to buy an iPhone 6s, that becomes a transactional query, or you want to book a hotel in Bangalore. And then you have informational queries: "Who is the president of the United States?", or something like that, or just searching for Bangalore or Kolkata. Transactional queries can be further divided into four classes: awareness, evaluation, decision, and retention. In the awareness phase, the customer is making himself aware of the product; in evaluation, he's evaluating, checking out the reviews; then he's trying to make a decision when he's planning to buy, which is the decision phase. And retention is something like "I have a broken iPhone screen." So I came up with this idea of representing queries as images: all search queries can be represented as images. This is the image of the query "David Vier". How is this image formed? It helps if you know about word vector embeddings.
When you search for "David Vier" on Google, you can extract all the title words from the results, create word vectors for each of them, and stack them together; that gives you an abstract image. Similarly, for "apple juice" as a search query, you get images like these. But what happens now? I don't see much difference between "Guild Wars" as a search query, which is a game, and "apple juice". The image could be anything, but we can tackle that. There are two kinds of machine learning models you can use here. One is convolutional neural networks on the images directly: you take the whole image, put it into a convolutional neural net, and see how it performs. Or you can do something really cool like this: you can extract random crops. I won't tell you exactly how to combine them, because I'm still researching that, but one very simple way would be averaging, and then you train a convolutional neural network on it. And implementing neural networks with Keras is very simple. Here I'm implementing a multi-layer perceptron. It's a sequential network: you define the model as a Sequential model, then keep adding layers. A dense layer, then dropout, and finally as many outputs as you have classes. It uses categorical cross-entropy, which is the multi-class log loss, and an optimizer. Then you just fit the model: X is your data, y is your labels, plus the number of epochs. And that's it; so simple. Similarly, you can design a convolutional neural net: you just add convolutional layers instead of dense layers and proceed the same way, adding as many layers as you want. At the end, you have to flatten, and then use softmax or sigmoid, depending on your task.
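The Keras multi-layer perceptron described above can be sketched like this; the layer widths, feature count, and class count are placeholder values, and `tensorflow.keras` stands in for whichever Keras backend you have installed:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 100, 4   # illustrative sizes

# A multi-layer perceptron as a Sequential model: keep adding layers.
model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(n_classes, activation="softmax"),
])
# Categorical cross-entropy is the multi-class log loss; Adam optimizes it.
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Fit: X is your data, y your one-hot labels, epochs the number of passes.
X = np.random.rand(64, n_features)
y = keras.utils.to_categorical(np.random.randint(n_classes, size=64),
                               n_classes)
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```

For the convolutional variant you would swap the `Dense` layers for `Conv2D` and `MaxPooling2D` layers, `Flatten` at the end, and keep the rest the same.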
As a result, a very fast framework for classifying search queries was implemented. It's very simple; the only thing you need to take care of is the amount of data. How will you classify 800 million search queries? With neural networks you can do online training and keep updating the weights, so these kinds of models improve every day, and they outperform traditional methods. And yeah, of course, there are papers and code. If you want to know more about it, you can go to the PyData Berlin website, where I gave this talk, and see the full talk, because it's around 40 minutes. So far we have approached problems like classifying images or classifying search queries; search queries are converted to images, so it's basically images again. But what happens when any kind of data comes to you? When I say any ML problem, it's mostly tabular data, the data you see every day in most tasks. I came up with this framework last year, which is pretty simple, not very complicated. Here, the pink lines represent the most common path that is followed, and the blue lines are for different types of data, like categorical or text data. Of course, you split the data first, and you have an evaluator function where you send the validation set. If you have numerical data, you just don't do anything and go straight to model selection and hyperparameter selection. If you have categorical variables, you either convert them to labels or binarize them. And similarly, if you have text data, you can use some kind of text transformer; I like TF-IDF, plus some kind of decomposition like SVD, if required. In the end, this framework gives you the best model with the best hyperparameters. This is one of the papers, and you can read about it in more detail if you want to. Similarly, this year, since everybody is talking about neural networks, we want to know how to optimize neural networks.
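The text branch of that framework (TF-IDF, then an SVD decomposition, then an ordinary model) can be sketched as a scikit-learn pipeline; the example texts, labels, and component count are toy values:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy query corpus with binary labels (e.g. transactional vs. informational).
texts = ["cheap flight to berlin", "buy iphone 6s online",
         "who is the president", "book hotel in bangalore"] * 5
labels = [0, 0, 1, 0] * 5

# TF-IDF captures the text, TruncatedSVD compresses the sparse matrix,
# and a plain model does the classification.
clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=5),
    LogisticRegression(),
)
clf.fit(texts, labels)
```

In the real framework, model selection and hyperparameter selection would then run on top of features like these.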
This year, I came up with a similar framework, simpler than the previous one. It looks something like this, very similar to that one. If I have a data set, I split it into training and validation, and then I identify the types of features and stack the features. One very important step when training neural networks is normalization of the features: you can do z-scoring or you can do log scaling, and whatever different feature normalization techniques you have implemented, you can put them all into the feature normalizer. Then you have a network architecture selector, which is very simple, very basic; it uses the same kinds of methods we talked about twenty years back, when neural networks came into existence and people discussed how to design them. And in the end, of course, you get the neural network with the best validation score. It might not be the best possible model, but nobody cares about a 1 to 2 percent increase if you're working in industry. All these frameworks, unfortunately, are not open source yet, but I'm working on that; maybe in a couple of months they'll be open source, and everybody will be able to use them. Selecting a neural network architecture is pretty simple. Well, these are my tips; this is how I go. You can always use the SGD or Adam optimizer; Adam converges fast, so you can train a number of different models and compare their performance. Always start low, with a single layer of 100 to 500 neurons. Use batch normalization, which is very effective, and ReLU, rectified linear units. Then, of course, dropout, because you don't want your network to overfit. And then you observe the validation score, and maybe you are not happy about it.
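The two normalization options mentioned (z-scoring and log scaling) are easy to sketch in numpy; the guard for zero variance is my addition:

```python
import numpy as np

def z_score(x):
    """Standardize a feature column to zero mean and unit variance."""
    std = x.std()
    # Guard against constant columns, which would divide by zero.
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def log_scale(x):
    """Compress heavy-tailed non-negative features; log1p keeps 0 at 0."""
    return np.log1p(x)

col = np.array([1.0, 10.0, 100.0, 1000.0])   # a heavy-tailed toy column
z = z_score(col)
compressed = log_scale(col)
```

A feature normalizer in the framework would simply hold a collection of such transforms and pick one per feature.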
Or if your boss is not happy about it, then you add new layers and increase to 1,200 or 1,500 neurons, with a very high dropout, around 50%. And if even then everything fails and your evaluation metric shows a very bad score, go for a very, very big network. Of course, first buy good hardware, and then go for a very big network: 8,000 to 10,000 neurons, with a very high dropout, 60% to 80%. And if that doesn't work, go for random forests. Since I'm talking about automatic machine learning and how these kinds of frameworks select the best models, I'll also talk a little bit about the AutoML challenge. This challenge lasted for one and a half years, with a Tweakathon phase, final phases, and an AutoML phase where your models are run on new data sets; a number of phases. The frameworks I showed you, for selecting neural network architectures and for approaching any kind of machine learning problem, are what I used for this challenge, and they performed really well. These are the results: Final 1 was a CPU phase, and Final 4 was also a CPU phase. I participated in only two phases of the competition; after the first one, I kind of forgot about it. In Final 4, the framework performed very well and got third position. And this is the GPU phase. If you look at the neural network architectures I built, they are very, very simple; other people built really complicated networks, and they could not win the competition. But it's similar in industry: if you're not working in research, don't invest a lot of time in optimizing neural networks. A 1 to 2 percent increase won't make your boss happy if you invest a month in it. So this was the GPU result. And what do we have now? We have a partially automated framework. I say partially automated because right now it works only on tabular data, and you can have a lot of different types of data sets.
This system gives very comparable results, and sometimes it beats automated systems like hyperopt. And how were this neural network framework and the other framework designed? When you have dealt with hundreds of data sets, you know what kinds of machine learning models will work and what kind of neural network you can design; so, knowledge from past data. And of course there's a future in this, because right now I'm working on auto-tuning of convolutional neural networks, which I think should be big. This was also discussed in much more detail at PyData Paris, a 20 to 30 minute talk, so you can watch that, separate out the different parts of the framework, and see how the whole framework is designed. And it's all in Python. So, final words for this talk: I don't think deep learning is magic or a black box. It gives you feature importance, and you have to explore a lot of things when designing deep nets for specific data sets. And as I said, a 1% to 2% improvement, I don't think that matters in industry. Thank you very much. Hi. Hi. In your experience, is there an architecture that's best suited for temporal models? Temporal models? Yeah. No, this doesn't work for those. OK, thanks. What do you mean by the temporal nature of text? You mean sequence-to-sequence learning, right? No, I'm not using sequence-to-sequence learning. You can convert; let me get back to that slide. Here's how I'm combining these: you have a search query, and then you have the top results for that search query, and all the results have a title. These titles are converted to word2vec representations and stacked together. That's it. And if you just want to train on the search queries themselves, you can use word2vec embeddings for each word in the query and then train an LSTM.
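The stacking of title-word vectors into a query "image", as just described, can be sketched like this; the vector dimension, padding scheme, and function name are my assumptions (real word2vec vectors would be, say, 300-dimensional):

```python
import numpy as np

def query_to_image(title_words, embeddings, dim=8, max_words=10):
    """Stack word vectors of result-title words into a 2-D 'image'.

    `embeddings` maps a word to a 1-D vector; unknown words become zero
    rows, and the matrix is zero-padded to a fixed height so every query
    yields the same shape. All sizes here are toy values.
    """
    rows = [np.asarray(embeddings.get(w, np.zeros(dim)))
            for w in title_words[:max_words]]
    rows += [np.zeros(dim)] * (max_words - len(rows))
    return np.stack(rows)          # shape: (max_words, dim)

# Toy usage: two title words with known vectors, one unknown word.
vecs = {"apple": np.ones(8), "juice": np.full(8, 0.5)}
img = query_to_image(["apple", "juice", "recipe"], vecs)
```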
I was wondering if there is some grid-search-based pipeline method, like the one we have in scikit-learn. Do you have something like that in Caffe, or is there a way to do similar brute-force parameter optimization? Well, I don't use grid search, and I'm not sure if Caffe has one; maybe it does. Most of the hyperparameter tuning that I do comes from experience, so you know which parameters to tune and which parameters to leave. And grid search is very slow; I would prefer something like Bayesian optimization. Can you please suggest some ways of evaluating trained models? One is t-SNE, which you already mentioned; say when you have a larger number of classes, like more than a thousand. Well, t-SNE is not for evaluation; it's just a representation. What you see here is on the validation set, and you can see that it separates nicely. If you want to evaluate your model, you of course split your data into training and validation sets, train your network on the training set, and then see what the error is on the validation set. I prefer that way. The types of problems you've attacked are fairly well-cropped image problems, or text-based problems that do not include sequencing or the temporal nature of the data. I was wondering whether there are any pre-processing steps you would suggest; you got very good results. What I'm trying to see is: are there any pre-processing steps you would suggest so you can convert more complex images, or different kinds of text data, into a type of data where you can get good results using the techniques that you showed? Pre-processing steps for images or text? Both. Both, OK. Yeah, it depends on what your subject is. When you're talking about images and you're classifying different objects...
If you're classifying a bottle, and it's in front of a laptop and a lot of other things, then you need to center it. Or sometimes you can do some kind of rotation. But most of the time, I'm not doing these steps. You really can't apply that in batch, right? For example, if a bottle is in front of a laptop, not all of your data will be bottles in front of laptops; they will all have different sorts of rotational issues. And if it's not images: in terms of text, whether it's temporal text or natural language, or, you did mention search queries, but if it is a different type of document that you are analyzing, or if you've handled text other than the kinds you've mentioned here, did you end up using certain preprocessing steps that can be applied in a batch way over the data you analyzed? In batch? Preprocessing text is fast enough, right? Why do you need batches for it? What you can do if you want to clean text is remove stop words; these kinds of things. And when you're dealing with search, then you have to take care of HTML tags. And then TF-IDF, most of the time, captures everything. Hello. You just said that it's not worth fighting for one to two percent accuracy, and that time is probably more important. I'm sorry, I didn't get you. You just said that it's not worth fighting for one to two percent accuracy; it's basically a trade-off between time and accuracy, right? In some industries it does matter, like the lending industry, or when you're trading a huge portfolio of money; one to two percent saves you millions, which matters. So what is your strategy there? Because the only way then is to get into research mode, or into a mode where you have to come up with your own algorithms, or maybe a fine-tuning algorithm that hasn't been applied. What is your strategy there? Yeah, so first I'll ask what kind of industry it is.
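The text cleaning steps mentioned in the answer (strip HTML tags, drop stop words, then hand off to TF-IDF) can be sketched like this; the stop-word list and function name are toy placeholders:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}  # tiny toy list

def clean_text(raw):
    """Strip HTML tags, lowercase, and drop stop words before TF-IDF."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)            # remove HTML tags
    tokens = re.findall(r"[a-z0-9]+", no_tags.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```

For example, `clean_text("<b>The</b> price of the iPhone")` yields `"price iphone"`, which can then be fed into a TF-IDF vectorizer.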
If we have to deal with a lot of data every second, like millions of samples every second, then I won't even use deep neural networks; they are too slow. I'll go for something like logistic regression, which is simple and easy. And if you don't need any kind of online evaluation, if the model can run offline and taking several hours or several days is OK, then if one to two percent accuracy is equal to a million dollars, yeah, of course, do it. My question was: do you have a framework for that, a framework in mind? Because the research that academics do is kind of not documented very well, whereas you've documented the ML framework very well, right? So do you have a research framework for when you're fighting for that one to two percent accuracy? The framework that I described gives you a direction: it gives you very basic, simple models that you can use, and the parameter search space is very limited, coming from knowledge of other data sets. If it has seen similar data sets previously and has used parameters in a certain range, then it will use them, and it will give you a model. And then you can continue manually. It won't give you the best model in the world, but it will give you a direction; it will give you a good model. That's it. The last one to two percent accuracy is not dealt with by the framework. We have two more questions. Hi, thanks for the lovely talk, big fan. Anyway, as you said, with the network, if there is batch normalization, is it necessary to have dropout as well? Batch normalization is a normalization technique, and dropout you use for regularization, so yes, you should. Actually, I was referring to the original paper: they say that dropout is a regularization technique which can also be implemented in an implicit way by batch normalization. That's why I'm asking.
Thanks. Is it? OK, might be; I don't know about that. Thanks. Last question. Hi. It might be kind of a stupid question, but going back to the fine-tuning example: say I have a binary classification problem for images, exactly what you did. What you're suggesting is to go with AlexNet or something that is very well trained, and just replace the last softmax layer and train your network on that. Is that roughly the technique you followed: just remove the last layer, put in a layer with random weights, and train it again? Is that what you did for your bumble bee and honey bee training? No, I changed a lot of things there. Ideally, that could just be a starting point, right? Yeah. If you don't want to fine-tune, if you just want to see how the network performs, then just remove the last layer and modify it from 1,000 classes to two or three or however many classes you have. And you might need to change to softmax or sigmoid, depending on your problem. OK. And how slow is this compared to your original model? Because the original model didn't have a lot of layers. Can we take this question offline? Abhishek will be available for the birds-of-a-feather session at 3:15, so if you have more questions, you can talk to him then. Thanks, Abhishek. Thank you very much.