I would like to thank the ODSC organizers for giving me this opportunity. As this is a beginner session, I am not going to expect any knowledge of deep learning, machine learning, or AI, but I'll try to keep it reasonably technical, so anyone with an engineering background should be able to easily follow what I'm going to talk about. My name is Murthy. I will introduce the International School of Engineering and what we do towards the end; I'll get directly into the topic. If you look at the literature of deep learning, even if you don't actually practice it, you must have been hearing a lot about the cool things happening in the field. Particularly in computer vision: around five or six years back, when we were following traditional machine learning practices, we had something like 28 percent error on ILSVRC, while an expert human was at 5 percent. After 2016 I stopped tracking it, because it became boring in terms of accuracy improvements; in 2016 it was around 3.06 or 3.05 with a bunch of ensemble models. That's impressive: in five years, from 28 percent error to 3 percent error, where an expert human is at 5 percent. There was one fascinating paper I was looking at on how vision can influence evolution. Around 500 million years ago, the rate at which species evolved saw an exponential change. Evolutionary biologists researched this and found that until then, species didn't have vision. Early life forms would float in the water; if food came near, they would eat it, else they died. But once life developed vision, creatures started going to their food, and the food started running away from them. A whole ecology developed, and the growth of species took off. So many people in AI believe that our ability to do computer vision now is a similar point for artificial intelligence: just as natural vision propelled the growth of natural intelligence, artificial vision, they believe, will propel the growth of artificial intelligence. No question about it, in the past five years we saw some really fantastic things happen. Similarly, both Google and Microsoft announced, I think around six months back, that they are getting close to human-level performance in speech understanding and language translation. With text we still have some more of the journey to make, but growth has been impressive; not as impressive as vision, but fairly impressive. Now, the question I want to address today is: how are we able to do this? Why is it that we have seen this kind of growth in our abilities? What made it possible? Yes, technology and math are one part. We now have much better gradient descent algorithms, much better GPUs, serverless computing. That allows us to work on large volumes of data and apply more exciting mathematics. But that's only one part of the game. What I really want to focus on is a fundamental paradigm shift that came about in the practice of machine learning. Today we do machine learning differently, a whole lot differently in certain areas, compared to how we were practicing it four or five years back. I want to emphasize that. Those changes happened both at an architectural level, the paradigm shifts, and at an engineering level. Given the time, I want to focus a bit more on architecture.
Actually, when I wrote the synopsis of my talk, I thought I would focus more on engineering, but for this group I thought architectural paradigm shifts would be more interesting. So what I'm going to talk about is slightly different from the synopsis; given the time, I had to pick one, and I picked architecture. The focus in the next 30 to 35 minutes is on what we are doing differently, architecturally, compared to five or six years back. Fundamentally: representations. Today we think of data very differently from how we were thinking before. Architecture has become a lot more inclusive and Lego-block-like; I'll talk about that. Transfer learning has made models, for businesses, no longer problem-specific scientific solutions but assets that organizations build. And I'll talk about problem mapping: what was solved until now as a structured data problem suddenly becomes an image mining problem; we are mapping problems from one domain to another a lot more effectively. And of course, because of deep learning, we are forced to look into the model very differently from the way we were looking at it until now. I'll talk about each of these in some detail. So: contemporary architectural paradigms and how they differ from classical machine learning. Let's start with data representations. How did we treat data five or six years back? Say I have a bunch of numeric attributes in my model. Before I fit my random forest or logistic regression or whatever, what did I do? I normalized it. I log-transformed it. I did a PCA. There were some mathematical feature engineering techniques people were using, and if you used something like PCA, you transformed one set of variables into another set; you used either this or that, not both. That was all we were doing. But of late, since Geoffrey Hinton introduced the restricted Boltzmann machine and Andrew Ng popularized autoencoders, people started asking: can we let the net engineer more features? Can I actually have the net design features for me? Between 2011 and 2012, the strategy I'm going to describe was extremely popular. In fact, the standard recipe for winning a Kaggle competition was: take the given data, run it through a bunch of autoencoders to get additional features, mix the created features with the original features, and run a linear regression with good regularization. You'd be in the top five. The recipe is not as popular anymore, but I still end up using it once in a while and do see some improvement. What exactly do we do here? An autoencoder is an unsupervised neural network where I try to predict my input: I have a bunch of inputs, and I predict the input back. What is so great about that? In the first versions of the autoencoder, I try to predict the input with fewer nodes than the input has. So if I have five input attributes, I'll try to predict those five back through a hidden layer of only four nodes. Now, for today's discussion, let's not think of a neural net as a bunch of perceptrons connected together; that's one way of looking at it, and it gives us the nonlinearity. What I want you to start looking at is the neural net as a tool that generates slightly better representations than the original representation.
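To make that concrete: here is a minimal sketch of the five-in, four-node-bottleneck autoencoder just described, assuming tensorflow.keras; the data is a random stand-in, and in practice you would repeat the trick, feeding the learned code into the next autoencoder in the stack.

```python
# A toy stand-in for five numeric attributes; in practice, your real columns.
import numpy as np
from tensorflow.keras import layers, models

X = np.random.rand(1000, 5)

inp = layers.Input(shape=(5,))
code = layers.Dense(4, activation="relu", name="bottleneck")(inp)  # condensed representation
out = layers.Dense(5, activation="linear")(code)                   # reconstruct the input

autoencoder = models.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)         # the target is the input itself

# The learned 4-dimensional code becomes a set of new, nonlinear features.
encoder = models.Model(inp, autoencoder.get_layer("bottleneck").output)
X_enriched = np.hstack([X, encoder.predict(X)])                    # 5 original + 4 engineered
```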
Every layer is a slightly better representation of the previous layer. If you take that view, the autoencoder suddenly becomes interesting. If you just look at it as predicting the input, it doesn't look useful: I already know the input. But if I realize that in the process I am forcing the net to learn a more condensed representation, and forcing that representation to be good enough to predict the input back, then things get exciting. And those representations are nonlinear combinations, not linear ones, so they are genuinely new, independent features that I can use in my model. So people started stacking autoencoders: I build from these five down to four and store the weights. Then I build another autoencoder with that layer as the input and try to predict it back with a slightly smaller number of nodes, and I keep adding cyclically. So from my original five, I have now created nine more features, all numeric attributes. Most of you know Myntra. Their chief data scientist at the time shared a wonderful use case of this. Myntra, being an e-commerce company, wanted to add to their revenue, so they decided to go for private labels. The way you normally design a private label is you hire models — the real models, not the data models — do the ramp walks, try out new designs, all that. But Myntra, being a technology company, just wasn't equipped to do that. Nowadays, when you can't find a person for a job, you say: hire a data scientist and let them figure it out. So that's exactly what they did: hire a few data scientists and let them design new clothes. These guys, obviously, being data scientists, don't design clothes. So they looked at the past year of all the tops they sold; some sold well and some didn't. Then they sat down and asked: what sells a top? It could be the collar, the neck, the cut of the sleeves, the color, the pocket, whatever. So they humanly identified some 20-25 attributes. These are my X's, and the Y is whether it sold well or not; let me build a regression model to predict it. They got some accuracy, not very satisfactory. Then they sent the data through an autoencoder stack like this, generated another 75 to 100 additional attributes, and built the model again. This time the accuracies were very satisfactory, and they productionized the system. Interestingly, their top-selling private brand, Moda Rapido, is completely governed by an AI system designed on these models. I don't know what tweaks they made in the past couple of years, but the original version was: take a bunch of attributes given by humans, run them through a neural net, and let it generate more. That's a very interesting paradigm shift for solving numeric problems. Now, coming to categorical attributes: we represent them very differently now, thanks to Google's 2013 word2vec. Google did a fantastic branding job with it. The underlying work started in the 1980s with singular value decompositions; people would show that similar words all cluster together in one location, and they would publish it in a very academic way. Google came along and said exactly the same thing, but they said: king minus man plus woman is queen. The world sat up and said, wow, this is so cool. But 'king minus man plus woman is queen,' if you rearrange it, is just 'king is to man as queen is to woman,' which is what others had been saying for the past 30 years; nobody found it interesting then. Suddenly, this framing caught on.
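For the curious, the famous analogy is a one-liner if you assume gensim and the widely shared pretrained Google News vectors (the file name below is the commonly distributed artifact, not something bundled with gensim):

```python
from gensim.models import KeyedVectors

# Path to the widely shared pretrained Google News vectors (a large download).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# king - man + woman ~ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expect something like: [('queen', 0.71)]
```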
But in 2013, when this was applied to words, somehow the machine learning community either decided not to try it on categorical variables or didn't want to discuss it much. Still, a silent revolution started there: embedding every categorical attribute. Unfortunately, talking about structured business data is not very exciting for academic research. If I differentiate a cat from a dog with 99.8% accuracy, I get published. If I talk about forecasting demand with 5% more accuracy — that boring stuff, you go and do it. Probably that's why these things don't get the attention they deserve. Around three or four months back, I was on the West Coast on a panel. My co-panelist was from Google, so I asked him: what is one cool trick you can teach me? He said: if you don't know it already, embedding categorical variables. So what exactly is it? Let's say your database has 25 nations. How do you represent them — how do you feed them to a neural net? One way is to put them in ascending order: one, two, three, up to 25. What's the problem with that? You are telling the model that USA is 25, Uganda is probably 24, and Canada is what, two, based on lexicographic order. You are telling the model that USA is a lot more similar to Uganda than to Canada. You are confusing the model unnecessarily. Nobody does that; people realized it. But what the machine learning community chose to do instead is: I will give a dummy representation. I will call USA 10000, Uganda 01000, Canada 00100. Now what you are telling the model is that all countries are equidistant from each other. That's almost as bad as telling it USA is closer to Uganda than to Canada. But for 75 years, machine learning was practiced that way. Very, very rarely — in my 15 years of consulting, from 2000 to 2015, maybe on two or three occasions — did I see business users actually sit down and come up with a sensible way of assigning values: you know what, I think USA should be 5.4 and Uganda 2.1. It's not scalable; as humans, our brains are not wired to do that. So with categorical attributes, you either ignored them or you dummified them. That was the strategy until 2014, when Google showed us: you can actually embed them. And now people are putting a categorical attribute into a neural net and learning the embeddings — learning a vector representation of the categorical attribute that actually preserves most of the properties you are looking for. If you are not practicing this, I strongly advise you to go back and redo all your machine learning models that have categorical attributes, with embeddings. It's worth it. You see the improvement right in front of you, the architecture suddenly becomes so simple, and the embeddings the nets learn become so exciting. Unfortunately, you don't find much published literature on this, simply because structured data is not sexy; there's no other reason I can think of. One application I found really interesting was a Kaggle competition in 2015 or 2016 — I don't remember exactly — for a German grocery chain. First, though, here is roughly what embedding a categorical attribute looks like in code.
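This is a minimal sketch, assuming tensorflow.keras; the 25-country attribute and the target are made up for illustration:

```python
import numpy as np
from tensorflow.keras import layers, models

n_countries = 25
country_ids = np.random.randint(0, n_countries, size=(1000, 1))  # integer codes, not one-hot
y = np.random.rand(1000, 1)                                      # some target to learn from

inp = layers.Input(shape=(1,))
emb = layers.Embedding(n_countries, 4, name="country_embedding")(inp)  # country -> 4-dim vector
x = layers.Flatten()(emb)
x = layers.Dense(16, activation="relu")(x)
out = layers.Dense(1)(x)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="mse")
model.fit(country_ids, y, epochs=5, verbose=0)

# Learned vectors: countries that behave similarly in the data end up close together,
# instead of being forced to be equidistant as with dummies.
country_vectors = model.get_layer("country_embedding").get_weights()[0]  # shape (25, 4)
```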
In that competition, people had to predict how many products a store sells in a day — you have to forecast demand. I'm sure many of you here have worked on demand forecasting. How many of you have? Good — a smaller number than I thought. But how many of that number solved demand forecasting without a time series? Unheard of. Demand forecasting is time series, right? These guys, in 2015-16, just wanted to prove a point: that embedding is powerful. They solved demand forecasting as a causal problem, and the attributes they took into consideration are laughable at best. The ID of the store — the ID of the store is the first thing we delete in data preprocessing, right? 1,115 store IDs were taken as an input. State ID is another input. ID of the day, third input. ID of the week, fourth input. The only one I would have used in my traditional thinking is whether they were running a promotion on that day. Other than that, everything was an ID. They took six or seven such IDs, embedded them, and got the representations. It's a simple net: one-hot dummy encoding goes in, the next layer learns the embedding, and a softmax sits on top. The beauty is this: it was a German competition, and one of the IDs they took was the state ID. Now, the net was able to learn a vector representation that more or less mapped the geography of Germany. This is massive. They didn't feed in those values; they started with dummy variables, and the net came up with a vector representation that figured out, from the sales data alone, how German states are organized. So if you do your categorical embeddings correctly, it is obviously going to enhance your model; you are not confusing the model anymore. That's where we are with categorical attributes: we learn them differently now. Now, CNNs were obviously the starting point of all these deep neural networks. Those of you familiar with CNN architectures know you have a bunch of convolutional layers followed by a set of fully connected layers. Again, if we understand it the way we should: initially it's the pixels, then each layer learns a slightly better representation, then better, then better still, and the last layer is the best possible representation of the image. So nowadays I see many people solving an image problem once, learning these nodes, and then using them as the representation of the original image. The FC3 here is used as the best possible vector. That will normally be, say, 100 or 1,000 dimensions, whereas the original image is 100,000-plus dimensions — and those 100 dimensions are the best possible representation of the original image. So for all your future problems, you just store that as the representation. By the way, just an anecdote here. If you refer to any paper or any Kaggle blog, they may call it FC7. The rock star of deep learning, as many of you know, is Geoffrey Hinton: when a physicist has a problem, he goes to Einstein; when a data science guy has a problem, he goes to Geoffrey Hinton. When AlexNet came out of Hinton's group in 2012, it had eight layers, and they talked about the vector representation of its last-but-one layer.
They called that fully connected layer, FC7, the vector representation of the image. Later, nets got deeper — 18, 19, 150 layers, whatever — and people were taking the 999th layer, but out of sheer respect for Geoffrey Hinton, the last-but-one layer is still called FC7. Even if you have 18 or 20 or 25 layers, the last-but-one is always FC7; it's a mark of honor for the grand old man. So when you see a paper with an 18-layer net saying "I took out FC7 for my vectorization," don't pick the seventh layer; pick the last-but-one. Just something one should be aware of. And encoder-decoder architectures gave us a way to learn vector representations of phrases. Words already have their embeddings, but now for a sentence, a phrase, a paragraph, I get a nice vector representation too. So the point here is: five years back, I was probably doing nothing with images, using a one-hot vector representation for words and no representation at all for a paragraph of text, log transformations for numerics, and ignoring categoricals. Today, architects don't think like that: give me your data, whatever you have; let me run it through these black boxes, get vector representations, and then feed those into my model. That's a very fundamental and extremely interesting architectural paradigm shift. Again, I am not going to focus today on cutting-edge research — I'm not qualified, actually; I'm much more involved in engineering and applying these things in business. So I am only going to talk about things that businesses have started implementing — things I have personally implemented, or where I have seen people learn representations and use them in models for tangible business benefit. Now, the second major architectural change is inclusive, Lego-block architectures, where I really believe that architecting a business solution is finally back in the hands of tech people and business people, not just the scientists. Support vector machines are great — even today, when you look at the math, it's romantic: the way the kernel trick was imagined, the way the quadratic optimization was imagined. Beautiful. But trust me, that is one of the things that took machine learning away from business. And I saw this transformation happen in front of my eyes. In 2000, when you went to an analytics company for consulting, the pitch was: I have 30 PhDs. How many do you have? I have 20 — so you lose. Analytics was the work of a center of excellence filled with 25 idiosyncratic PhDs. I am one myself, so I think I can really talk about it; the idiosyncrasy is an offshoot of the thesis you write. So it was expensive and very difficult to understand. Now, on the other hand, when I go to organizations, they say: you know what, I had this bunch of interns, they went through three months of training, and they just beat an industry benchmark. From science, it became engineering, and most of the time now, what PhDs are called in for is: tell me what the hell they just did. The job has changed slightly. Anyway, one of the reasons this has become reality is the way we now connect and put things together without new science. There are these ten things that have been developed and shown to really work well; how do I put them together to solve my problem?
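As a concrete example of one such ready-made block: pulling the last-but-one-layer vector out of a pretrained CNN. A sketch assuming tensorflow.keras and the bundled VGG16 weights; "photo.jpg" is a placeholder:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights="imagenet")                         # full net, FC layers included
featurizer = Model(inputs=base.input,
                   outputs=base.get_layer("fc2").output)  # the "FC7": last-but-one layer

img = image.load_img("photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

vector = featurizer.predict(x)   # shape (1, 4096): the image as a reusable feature vector
```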
So here is a contemporary, state-of-the-art classification and regression architecture. Recently, someone asked me to build a recommendation engine. Normally, for a recommendation engine, you do collaborative filtering. This time, I just wanted to solve it as a supervised problem. I said: give me whatever images the product has, give me whatever reviews the users have written, and of course some categorical attributes, which I embedded. So: the FC7 of a CNN over the product images, review context from an RNN, and the numeric attributes autoencoded. Each one of these is a black box. I fed them into the model, and I was able to get state-of-the-art results with a fairly simple architecture. Putting them together was very easy, because all of these I just picked up from the market — I mean, open source. Pinterest actually published a nice blog about structured data; that's where I got the inspiration while building this model. Their point — and this is not a Pinterest-only thing — was that the real advantage of deep learning on structured data is not improved accuracy, as in vision; it is the simplification of the model. It just becomes so easy and so doable, suddenly. And here is a Kaggle competition winner; this came from Yoshua Bengio's group. The inputs are the longitude and latitude of a taxi's location: five points from the first part of the trip and five from the last, and you have to predict where the taxi is going. Again, a very inclusive architecture: numerical attributes, categorical attributes embedded, fed into a one-layer neural net. It won the Kaggle competition. Kaggle winners are generally a mess of an architecture — the Netflix guys rejected their prize winner because it had 600 submodels. Imagine winning a Kaggle competition in 2016 or '17 with an architecture as simple as this. Unbelievable. That's the power of these Lego blocks. Another very interesting Lego block is the LSTM itself. See, context has always been a problem for data scientists: how do you represent context better? In the 80s, people were doing HMMs, where the next word is a function of the past two words, and you construct a large transition probability matrix. What is the probability that "Trump" follows "Donald"? What is the probability that "Duck" follows "Donald"? So the next time I see "Donald," I predict: this is the probability it's "Trump," this is the probability it's "Duck." Massive transition probability matrices. Imagine going even second order with a vocabulary of 100,000: that's a 100,000-cubed matrix, 10 to the power 15 entries — tera is 10 to the 12, peta is 10 to the 15 — really big, even by today's standards. So that's one limitation. The second one: "the clouds are in the ___" — fill it in. Most likely "sky." How do you get that? OK, if I drop all the useless words, the answer is within one or two words of the context, and I'll find it. But there are slightly more complicated problems where hidden Markov models didn't work. In the 2008-2010 timeframe, conditional random fields became popular. People said: it's not just the past two words; I will look at every feature I can think of. Did it start with a capital letter? Was there a question mark at the end? Is the word before it a proper noun? You can add as many features as you want. People went crazy and added 100,000 features, and it became non-scalable.
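Before going on to context, here is roughly how the Lego blocks of that recommendation engine snap together — a sketch assuming tensorflow.keras, with all dimensions, vocabulary sizes, and input names made up for illustration:

```python
from tensorflow.keras import layers, models

img_vec = layers.Input(shape=(4096,), name="image_fc7")      # precomputed CNN features
review = layers.Input(shape=(200,), name="review_tokens")    # integer-coded review text
cat = layers.Input(shape=(1,), name="category_id")           # one categorical attribute
num = layers.Input(shape=(9,), name="numeric_autoencoded")   # autoencoded numeric features

text = layers.Embedding(20000, 64)(review)
text = layers.LSTM(64)(text)                                 # review context vector

cat_vec = layers.Flatten()(layers.Embedding(500, 8)(cat))    # learned category embedding

merged = layers.Concatenate()([img_vec, text, cat_vec, num]) # snap the blocks together
x = layers.Dense(128, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(x)               # e.g., will the user buy?

model = models.Model([img_vec, review, cat, num], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```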
Now, to see why context is genuinely hard, just read this: I met a French girl, Leonie, a few months back. We really like each other. I was worried about how my conservative folks would accept her, but I was sure they would love her once they met her, so I arranged for them to meet last night. My mom instantly fell in love with Leonie. Dad liked her too. Wow, I'm so excited — and I put all those exclamation marks in the real writing. All that is left is to meet her parents. So I'm booking my tickets to ___. Where am I going? France. How did you do that? How do I even code a computer to know this context? The machine learning community, for 50 years, kept doing what it was doing: coming up with feature after feature, complicating things so much that nobody understood them. What would normally happen in college five years back is you take a complete machine learning course; then you want to get into NLP, so you take NLP courses; you want to get into computer vision, so you take those courses, where you learn Fourier transforms and all that stuff. But LSTM, on the other hand, took a Lego-block approach. They asked: how did the human brain do that France thing? How many words did I just say in that story? Don't go back — you don't remember most of them, right? One of the best ways of learning is forgetting whatever is not needed. So they said: OK, I have my running context, I have some intermediate context — I'll talk about what that is — and I have my new input. Let a model learn to forget. Forgetting seems to be very important, so let me build a neural net that just specializes in forgetting. Then let me build one more model: obviously I'm learning with every input, but whatever I learn is very different from what you learn from the same input, because our contexts are different. She may find it very hot; I may say, compared to last summer, this is very cold. Because of our contexts, we learn two different things. So let me have one more neural net figure out what to learn. OK, now I know many things, but to solve a problem, I extract only a small part of what I know; let me have one more neural net figure out what to extract. That's the intermediate context that gets created. That's it. A Lego block, as opposed to manually sitting and writing thousands of features. I was personally involved in an entity extraction project on financial data. If you have a sentence like "Richards paid Richards and Co at Richards" — the first Richards is a person, Richards and Co is a company, and "at Richards" is a place. What rules will you write? We actually sat for six months — two PhDs, one domain expert, and one really good engineer — and got around 98% accuracy. Last year, when I started experimenting with LSTMs, I went back to the client and said: can you give me the same data? No money charged this time; I just want to experiment. Two interns, in 15 days, with tagged data, got the same, in fact slightly better, accuracy. Wonderful Lego blocks.
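A sketch of what such an off-the-shelf tagger might look like, assuming tensorflow.keras; the sizes and tag set are made up, and real systems often add more on top:

```python
from tensorflow.keras import layers, models

vocab_size, n_tags, max_len = 20000, 4, 50   # tags: PERSON, COMPANY, PLACE, OTHER

inp = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, 64, mask_zero=True)(inp)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)           # context both ways
out = layers.TimeDistributed(layers.Dense(n_tags, activation="softmax"))(x)   # one tag per token

tagger = models.Model(inp, out)
tagger.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Trained on tagged sentences like "richards paid richards and co at richards"
#   -> PERSON O COMPANY COMPANY COMPANY O PLACE
```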
And here is another fantastic architectural marvel: the encoder-decoder. The encoder summarizes the context, and the decoder unrolls it into the output; this is what people use for translation. A very influential idea here is the attention model: if your sentence is very long, another small model decides, for this output, to take mostly the first three words; for the second output, the next three. So one model figures out a very good representation of English; another figures out how much of that should be supplied at each step; and the third figures out, given a very good representation of English, how to translate it out. Now, how am I doing with time? Someone said they would wave their hands. Ten minutes — oh, I thought I had a lot more. OK, I'll go a bit quickly then. Just to give you an example: recently I was working with a retail firm that wanted sentiment analysis. "Huggies are more comfortable than Pampers, but are more expensive." What they wanted to extract from that is: Huggies — comfort, positive; cost, negative. So you have to do two things here: extract the attributes of Huggies that carry sentiment, and the sentiment on each attribute. Earlier, a domain expert had to sit and write out the attributes; then they did all kinds of topic models to extract more; then some really clever algorithm: look at five words before and five words after, measure the density of positive and negative words, put it in a support vector machine, and classify the sentiment as positive or negative. 41% accuracy — you call it accuracy, but it is what it is. So I said: why can't I treat this as a translation problem? "Huggies are more comfortable than Pampers, but are more expensive" is a very expressive language. "Huggies, comfort, good, cost, bad" is Arnold Schwarzenegger language. So it could be a nice translation. We tagged 150,000 reviews and set it up: 71% with an off-the-shelf model, a three-layer stacked LSTM. From 41% to 71% — no kidding. I was shocked. Sometimes you don't believe what happened; something must have gone terribly wrong — or terribly right, in this case — because you also have to explain why you are doing so well. So that's another very Lego-block-ish approach. And the face detector — I thought that was a cute example, but I'll skip through it quickly. So Lego blocks are another major paradigm shift: deep learning practitioners now let the net do most of the work, give as much to the net as possible, and put multiple blocks together to get a solution. Then there is transfer learning, which people figured out, I think, mostly by serendipity: the net learns general features, like edges, in the lower, initial layers, and as the layers get deeper, the features it learns get more complicated; towards the last layers, you really learn faces and so on. So: general to specific, and that is what transfer learning exploits. If my net learns general features first and specific features later, can I build one net and then, whenever I want a new model, just chop off the last two layers and add a new layer, so that only the specific features have to be learned in those last layers? The general features already learned can be reused. That's the whole idea. So with a much smaller data set I can build the next model. This is what will make — if it hasn't already made — deep learning a darling for businesses. Until now, business was baffled by the fact that the first problem needs six months, the second problem needs six months, the third problem needs six months. Now we are saying: the first problem, yes, will take six months; I'll hire ten interns and tag 150,000 rows of data. But the second problem I'll solve with 120,000.
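A minimal sketch of that chop-and-retrain recipe, assuming tensorflow.keras and the bundled VGG16 weights; the ten-class new task and the small dataset are placeholders:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications.vgg16 import VGG16

base = VGG16(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                                   # keep the general, reusable features frozen

x = layers.Dense(128, activation="relu")(base.output)    # new, task-specific layers
out = layers.Dense(10, activation="softmax")(x)

model = models.Model(base.input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(small_x, small_y, ...)   # a much smaller tagged set can now be enough
```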
By the time you come to the seventh problem, maybe all I need is 10,000 records. So suddenly, the productionization of models — models as assets — becomes a possibility, and that's a huge thing for business. Transfer learning is what is making it possible. Now, here is a very interesting thing that I learned through failure. One of the HR managers asked me to build a scoring engine for deciding whom to recruit. She told me very specifically: our organization in the past has hired men preferentially over women. I want to change that; there's a decision at the CXO level that we should change that. So please ensure your model is not biased by gender. I said, okay, cool. I told my guys: look, we should not use photos, because the net will learn from them, and we should not use gender as a categorical input either — no input with any inclination towards gender. So we dropped the photo of the candidate — not that it would really have helped in figuring out whether they would work out or not — and we removed gender from the overall system, built it, and productionized it. After a month, I get a furious call from her. She said: I told you not to make it gender-sensitive. I said: I didn't; I never input gender. She said: come on. And she showed me two resumes that were more or less identical in every qualification except gender. My model gave the higher score to the man. How did that happen? I went back and started looking at the internal nodes, to see which nodes light up differently for these two. And I realized that one of the inputs I took was the correspondence the candidates had with us before the job: I gave that as text input and developed the context using an RNN, as I mentioned — take text, take images, I was going the same way. And a few nodes there were gender-sensitive: from the writing style, the net was able to figure out the gender. When you build a linear model, what you give is what you get. When you build a deep model, know that it can learn things you don't want it to. Google very nicely published "king minus man plus woman is queen" — that is great. But the same word2vec learned that father is to doctor as mother is to nurse. And this wasn't trained on movie blogs; they used articles from Google News, where you would expect people to be less gender-biased. But maybe whenever an article talked about a doctor, it said "he" around it more often, and whenever it talked about a nurse, it said "she" more often. So the model learned that and started assigning genders. Be extremely careful about what is happening inside your network. Previously, people didn't care. This happens elsewhere too: Microsoft puts up a bot and suddenly it turns racist; Facebook's agents start developing their own language. These things happen very, very rarely, but when they happen, as a paradigm you have to be aware that your net is learning things that you don't want it to learn.
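How do you actually look inside? One simple way, sketched here assuming a trained tensorflow.keras model; the layer name and the two resume inputs are placeholders:

```python
import numpy as np
from tensorflow.keras.models import Model

def layer_activations(model, layer_name, x):
    """Return one internal layer's output for a given input."""
    probe = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)
    return probe.predict(x)

# a = layer_activations(model, "dense_2", resume_a)
# b = layer_activations(model, "dense_2", resume_b)
# np.argsort(np.abs(a - b).ravel())[::-1]   # which nodes light up differently?
```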
The last one I'll talk about very briefly — three more minutes; I'm doing all right, yeah. One great thing happening because of all this is that people are beginning to map problems from one domain into another. Recently I was looking at a patent where someone had solved fraud detection. We all solve fraud detection as an anomaly detection problem: lots of good data, very little bad data; learn as much as you can about the distribution of the good data, and whatever gets a low probability, you raise an alert. But this guy said: my ability to understand images is now so much better, so why not capture the mouse movements of the person going through my website, store them as an image, and then figure out whether there is any difference between the mouse movements of someone committing fraud and someone doing a regular transaction? He actually did image analysis for fraud detection and got state-of-the-art accuracy. AlphaGo — DeepMind's system, the one that beat the human Go champion — takes as input images of Go boards, winning and losing. Recently, I was working with an e-commerce site. One of their daily jobs is to look at competitor sites and collect prices. Typically, you take the text and write regular expressions: find a number with a rupee or dollar sign before it, pull it out, and there's your price. But as a human, when we look at a page, we don't read the entire text to find the price; we know it will be here, here, or here. So maybe it's an image problem. I just took JPEG screenshots and fed them to a convolutional neural net regression: input is the image, output is the price. State-of-the-art accuracy. So mapping problems from one domain to another is a major architectural shift you need to cultivate in this field; because representations are now so easy to get, these things are feasible. So, to summarize, as an architect today: one, know that we represent data very differently from classical machine learning. Two, know that we architect as engineers, not as scientists: put things together to solve the problem; don't invent things to solve the problem. Three, build models as assets: whatever you build now, think about how you can reuse it for other problems, via transfer learning. Four, know that your net may silently be learning things you don't know about, or is sensitive to things you never thought of. As I said, nowadays I'm called in when there is a problem; that's when I really have to look into the net. One medical imaging company — I had run a five-day program on convolutional neural nets for them, and at the end they said, we'll build a medical image detector. I asked, do you need my help? They said, no, we think we can do it. They went ahead and built it, at 92% accuracy. After three months, they called me and said: suddenly the accuracy fell to 82%. What happened? I looked at pairs of images: why is it calling this one a kidney, and this one, also a kidney, it is not? It took me 15 days to go back and figure out that the OS of the medical imaging system had been updated — the collector was updated — and that was causing the problem: the way pixels were stored was slightly different. So look into the net. Look into the net. Don't let it be a black box; it cannot be. And the last one: try to map problems from domain to domain. My sales team would shoot me down if I didn't show the next two slides. I represent the International School of Engineering, and we are data science specialists. We offer consulting, training, and hiring solutions. And these are our customers. Thank you.