Thank you. It's a wonderful opportunity to be able to come here and share some thoughts. In our student days, one of the things we did when asked to write a two-page essay was this: you have enough content for one page, and then for the next page you begin to wonder what to write. So we used the "in other words" technique, which is to begin the next page with "in other words," right? And you fill up two pages. The challenge for me now is that we've had a fantastic set of speakers preceding me, and I'll try my best not to do the "in other words," right? So I'll try to take a different perspective on what cloud is for Yahoo, and maybe sometimes be a little bit at odds with the people who preceded me, but that's okay. This is an intellectual discussion, so there is room for that as I proceed. Excellent. That's great. I'm known to do that. Good opening. Okay, a quick run-through: a two-second definition of what cloud means for Yahoo. A little bit on what data centers in the cloud mean — Yahoo was recently awarded the prize for designing the best, or the greenest, data center, so some quick thoughts on that. And why do we need a cloud infrastructure? Everybody talks about cloud; it all seems very cloudy. It should really be called sunshine software or sunshine computing, but everybody calls it cloud — and what does it mean to have clouds? Then some case studies, right? Case studies are very important. Let me begin with a quick little story. There's an airplane about to take off, and there were ten CEOs on board, going to attend a quote-unquote CEO enclave, right? Just before the plane took off, there's an announcement that said: hmm, there's no pilot in this airplane. A little bit of an upset. That's okay — no pilot; we've had pilotless drones spraying fertilizers all over the place. It's okay. Then there's another announcement that said: oh, by the way, the software that runs this airplane is the one written by your companies.
Now we had nine people bailing out, right? My grandmom is not well, my PTA meeting is this afternoon — all kinds of reasons, and nine of them bail out. One person remains very, very confident, absolutely unshaken. So people ask: what makes you stay? You seem very, very confident about what's going on. "Oh, my company wrote that software — the plane won't even take off." An absolute sense of confidence. What I want to do is be confident in a very different way, which is: how do we do open source? Which means we get a lot of code written by people outside. How does open source work in production? Does it really make sense? Case studies are important because I want to leave you with the message that we eat our own dog food, right? Basically, what we develop, we use within Yahoo to make a heck of a lot of money. Fine. Quick architecture — we won't spend much time on it; there was a good session yesterday about what the architecture is. And key challenges: that's the thing I really want to focus on. For all the bright minds like you here: what are the challenges in distributed computing? Right? Is it real? Where can we take this? If you can pick up a couple of the challenges I stress and give us answers, wonderful. Q&A, obviously. Okay, so quickly. Deepak started by making a very profound statement: cloud computing is not new. I fully agree with him. Cloud computing is not new. Right? When you typed www.yahoo.com — for those people who have gray hair, if you were to paint the picture from about ten years ago — you would first get the leg, and then the right arm, and then all of it comes together. It's cloud. You had no idea what was happening: www.yahoo.com, a news site — the request went out, spawned so many processes, routers routed them to different places, and you got your result back. A fantastic example of cloud. So what's making cloud more realistic? What's making cloud more valuable for us?
We have started to trust it more, right? From just getting content, we are now willing to do transactions in the cloud, on Amazon. When you buy a book, Amazon uses a very interesting concept of how you manage transactions. Somebody asked a question on transactions, and I'll be happy to chat with the gentleman who asked it — a very novel way of looking at what transactions are. We have started to trust them more. And why? Actually, the reason is extremely simple, and I think Deepak mentioned it: cheaper bandwidth. He alluded to it very correctly. The first paper in distributed computing — when was it published, '72? Right? So distributed computing as a concept is not new. But cheaper bandwidth is what is making it fantastically usable, right? Very, very usable, because, as he was mentioning, they can put their data centers in a place where the power station is right next door. Talk of being cheap, and of not losing energy in transmission when you send electricity over distance, right? A very nice way to do it. Yet you can connect as if your computer is in the next room, and continue to do all the work you wanted to do. That's what is basically driving it. So cloud computing essentially becomes fire and forget: submit a job, get the result. Which, again, is not new at all, right? Go back to the story of how that picture of the human body got painted: if I had my picture on the internet ten years ago, the shoe would have come first, the nail would have come next, but it all got stitched together. So it's not new. Why should data centers be cloud? Right? The very simple reason is scalability. A couple of speakers alluded to it: a lot of data.
The one thing that's becoming common these days is data, data everywhere, right? And the more people come and search on the Yahoo site, the more data we generate, because we want to find out what people do. So when you generate more data, you have to be able to scale pretty seamlessly. But I don't want to order a machine expecting that something will happen three months from now, get the machines delivered six months after the event has actually happened, then spend another month provisioning them, and then another month installing all the software, right? I want to be able to use it when I need it. That's what is driving the cloud, in a way: the explosion of information that needs to be captured, and the intelligence we need to gather out of it, all pretty seamlessly, right? As if I open the tap, the water flows, and I pay for it. You want computing like that. So that's what is actually driving the cloud: the data explosion, and the fact that you want to do it cheap. You can always buy a 99.9999%-availability machine, right? But we'd have to pay all the money we have in the bank to buy machines like that to run Yahoo. We can't — unaffordable, simply unaffordable. So it's basically the ROI: doing computing cheaply and effectively for the large data sets that you generate. So let's look at each of these in a little more detail. We'll hit them as we go by. You can look at scalability in two different ways. If you have a lot more machines, you can do things faster. At Yahoo, what we believe is: if you have a lot more machines, you can do a lot more things much faster than your competitors. That's where there's a little bit of a controversy in how we at Yahoo look at cloud. We're not looking at it as a cost-saving proposition at all, and I'll give you some case studies to explain that.
We look at it as the ability for us to innovate, to do things faster than the rest. Simple mathematics: you can do things log(n) years faster. If you have 1,000 machines, you can do things three years faster than your competition. An interesting study: a lot of the time we spend in our data centers, we spend on routine work — software installs, getting quotations, provisioning — and we run applications only for the remaining 30%. So if I have all my engineers working on writing software, without worrying about provisioning and installing software and all that, how much more productive can I make my engineers? That is the huge advantage of going to these infrastructures: they essentially remove all the intermediate steps you need to get your job done. Professor Janakiram is going to talk about a very, very interesting challenge. You have conflicting demands: you have enterprises that begin with private clouds, and then, when a peak or burst happens — the demand spikes pretty unpredictably — you want to be able to go from your private cloud to public clouds. The professor is going to talk about that very interesting topic. I'll give you a quick little example — not exactly what the professor will talk about — but then 26/11 happened. All of the servers in Bombay got loaded. Everybody was going to yahoo.com, the news site. All that traffic — I don't know how many of you looked at Yahoo during that time, but if you noticed, there was a little bit of performance degradation, not a whole lot, a little bit — but we moved a lot of the traffic that was coming in, beyond what we could handle in Bombay, to Singapore. That got overloaded. And then we routed the traffic to Sunnyvale.
In a way, if you look at the Bombay center as Yahoo India's private cloud, what we were attempting to do was get more resources to satisfy all the users who come to the Yahoo site for their news — which is what makes you come back. Now, tell me a way to do that without using clouds. I'll buy the answer. And I'll come work for you, actually. So what's coming down the pike? All of us — those in the industry, and those students who will get to industry or to an academic institution sometime, depending on your interest — have all these nice little web interfaces which tell you about everything in your company. Now, you have a web interface for that; why would you want to do something different for your clients? So why don't you manage everything in a cloud? That is another thing that will drive this: when you, as an individual working for a company, have all these nice little web interfaces, any business you eventually run will also get into the cloud. That's another thing that integrates both these pieces. And when you do something like that — when you move your business to the cloud — there's nothing like closing the doors. I can keep anybody out if I close this door; that doesn't happen in a cloud. Anybody can come at any time, completely unpredictably. You cannot say only 15 people can shop on my Amazon.com today. No way. So the very fact that you have to go to the web to survive — and when you go to the web, you can't close your doors and can't predict how many people will come in — will essentially force you to an architecture that is very, very scalable. Seamlessly. Cloud. The only choice, at least so far. Why cloud at Yahoo? We have a very simple business model. It's amazing how simple our business model is: we want users who come to our site to have a fantastic experience. Nothing but a fantastic experience. Very simple business model. But that should hold irrespective of what events happen.
That's a challenge. And the challenge is very technological: how do I give the best experience to my users? You want to be able to scale one node at a time. You want to be able to scale a data center at a time — Bombay to Singapore to the US. How do you do that? The architecture that supports that, unfortunately or fortunately, is the cloud. That's the only way to do it. So let me list a few use cases for why this is very, very important for Yahoo. We have millions of users who come to our sites. We have lots of data that we create when people do search. We have so much content that we get when we provide news — a lot of content from many different sources: Reuters.com, ABC.com, PQR.com. We integrate all this news. We enrich this news. So the amount of data we produce is humongous, and very unpredictable in terms of how people will use it. Very unpredictable in terms of how we will get it: if there's a big event, we get a lot more data a lot faster than on a normal day like this. Normal — yeah, whatever that means. So cloud, for us, is the only answer. We need to do this very cost-effectively. We have to be multi-tenant, which means providing for any number of users coming in. And really, with all that data, we want to be able to experiment. Before we started using clouds, our researchers couldn't find out what happened three and a half months ago, because that's all we could store. If we wanted to find out how well the sari or shirt shopping season went this Deepavali versus last Deepavali, we couldn't do it, because we couldn't store the data. But now we can. We'll talk a little bit about how in a couple of slides. We use cloud computing, and we build clouds on very much commodity hardware. What that does for us is provide storage which is fairly cheap, so you can pretty much store data thinking that at some point you will use it.
The earlier paradigm was: you store data only if you can justify its use. You first tell me that you need this data 15 minutes from now, and therefore I'll store it. With clouds, the paradigm has become very different, at least for Yahoo, which is: store it. It's cheap — store it. Make 15 Xerox copies, because copying is practically free. So we store those copies, and what that eventually does is give our researchers the ability to go look at the data and build excellent statistical models. For those of you who have been using Yahoo Mail recently, or the Yahoo front page, there are significant enhancements, right? Spam: Yahoo is doing an excellent job of putting all the spam into the spam folder, and the accuracy is very high because of the cloud. The fact that, on the front page, the news moves and you can see which items are being read by most people — that is done by the cloud within Yahoo. It's all real time. We look at how people click. You and you clicked on the same news item — crowds behave similarly, right? Which is why, when somebody throws one stone at McDonald's, all of us throw the same stone. Crowds behave similarly. So we expect that the fourth person who comes in will probably also want this news item, and therefore we put that news up instead of making them click for it. One mouse click — every click saved for you is many clicks saved across many users, right? So all that. You can only do this with very unpredictable peaks — the business model we can't get away from, unfortunately, right? So what are we doing? We are building private clouds. We are not doing any public cloud, for two reasons. This year is a big year for Yahoo because we are spending a significant amount of time on security. We're trying to solve the security problem for our clouds, and hopefully after that we'll be able to make them public. But we don't have good enough answers for security yet.
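The "store it, it's cheap, make copies" paradigm is essentially block replication, the idea behind HDFS-style storage. Here is a minimal Python sketch of why a few cheap copies make commodity disks safe — the function names and placement policy are my own illustration, not actual HDFS code:

```python
import random

def place_blocks(blocks, nodes, replication=3):
    """Place each block on `replication` distinct nodes (HDFS defaults to 3 copies)."""
    return {b: random.sample(nodes, replication) for b in blocks}

def readable_after_failure(placement, failed):
    """A block is still readable if at least one replica sits outside the failed set."""
    return {b: any(n not in failed for n in replicas)
            for b, replicas in placement.items()}

nodes = ["node%d" % i for i in range(10)]
placement = place_blocks(["blk1", "blk2", "blk3"], nodes)
# Kill the node holding blk1's first replica: every block stays readable,
# because each block still has at least two live copies elsewhere.
status = readable_after_failure(placement, failed={placement["blk1"][0]})
```

With three replicas, any single-node failure leaves at least two copies of every block — which is why "15 Xerox copies" on cheap hardware can beat one expensive 99.9999%-availability machine.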
The problem with cloud security is, as in the old paradigm, the weakest link in the chain: as your data gets transmitted, if there's one point where the security is weak, it doesn't matter how secure the cloud itself is, right? We haven't solved the end-to-end problem, and we are spending a significant amount of effort this year to solve it. Hopefully after that we'll make it public. Hadoop is our platform. It's completely open source, and the reason we do open source is because we believe there's a phenomenal amount of ideas out there. I work a lot with universities — basically, I go steal their ideas and put them into Hadoop, right? I don't have many of my own, so I just go steal ideas. Open source provides a phenomenal opportunity to gather all this collective intelligence and contribute it to make a product, right? So I'll spend some time on these slides to show how we are using this in Yahoo. A quick note: these are the four segments we look at when we talk about cloud, and Hadoop specifically — for those of you who attended the talk yesterday, this will be familiar. It's a batch-processing framework. A lot of the number crunching and data processing that we do happens in Hadoop. Please read about this. As I said, everything I'm talking about here is open source; even our design for Hadoop is open source. If you go to apache.org and search for the word Hadoop, everything that we do is there, including our design, including how we have done Hadoop. Here is a classic case study of how cloud actually helps. The New York Times wanted to get all their archival images from a past period converted so that people could search them. A classic example. They got machines from EC2 and used Hadoop as the platform to do the processing. Got it done, right? Otherwise, would you do this by investing in a lot of machines? You would not. That's where I was talking about the ROI, right? Sometimes you just don't plan, but you need to get things done.
And that's where cloud comes in very handy, because it's cheap and easy to use. That slide our friend showed was very nice, right? If you look at the last line he showed, the flat one — that's a true benefit of cloud. The costs have all been amortized, so you can continue to use it, as opposed to buying a machine and then deciding how to amortize it. So the cool thing about Hadoop is that it's planned around failures. We have written the software assuming that machines, like human beings, will fail. Let me give you a quick little story. Two people were working on a road. One person was digging holes, and the next person was coming along and filling them in, and they kept moving on. A gentleman walking by looked at this: a lot of work being done, but absolutely unproductive, right? Very, very unproductive. So he goes and asks them: hey, what are you doing? One digs, the other fills — nothing meaningful seems to get done. Oh, the reason is, we are actually three people: the person who plants the trees called in sick today, right? Failure wasn't planned for. Hadoop is not like that, right? We actually planned for the fact that a machine can fail. A machine can call in sick, and the task will still get done, right? That's the key thing. MapReduce itself was pioneered by Google. What we have done is take that paradigm and build a platform infrastructure on top of it, so that one worker not showing up doesn't kill the job, right? I would encourage you to read about the architecture behind Hadoop, purely for the simple reason that it was designed assuming failures, which is quite unusual, right? I mean, you don't plan for bugs when you write software; you just say it happens. Of course it's job security, but you say it happens, right? You don't plan on putting in bugs — but in Hadoop, at least, we've planned for failure, and therefore jobs execute speculatively.
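The MapReduce paradigm referred to above can be sketched in a few lines. This is not Hadoop's Java API — just an in-memory Python illustration of the map → shuffle → reduce pipeline, using the canonical word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    # Run the mapper over every input record; each call emits (key, value) pairs.
    return chain.from_iterable(mapper(r) for r in records)

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between the phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    # Apply the reducer once per key over all of that key's values.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count: mapper emits (word, 1), reducer sums the counts.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

docs = ["the cloud is big", "the cloud is cheap"]
result = reduce_phase(shuffle(map_phase(docs, mapper)), reducer)
# result["the"] == 2, result["big"] == 1
```

Because each mapper call and each reducer call is independent, the framework is free to rerun any of them on another machine — which is exactly what makes the "worker called in sick" recovery possible.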
If a machine is going bad, and we know it's going bad, we'll take the task and put it onto something else. If a hard disk reads really slowly, from one job to the next, it goes on the watch list. And if it continues to behave that badly, the task is moved to a different machine, and the whole machine actually comes off the rack — we will not send any more jobs to it. We'll blacklist those machines. So, an interesting way to design software. How many of you know Scrum, or the agile methodology? Fail fast, fail often, right? Classic. Hadoop has two components: MapReduce and HDFS. There was a big talk yesterday about what they mean, so I'm not going to spend a lot of time on it. Again, for those of you who are interested, everything is downloadable from the Apache site, so please take your time to read it and give us feedback. We'd love it. So, on to challenges and learnings — lots of opportunities. What does this teach us? Distributed computing, as I said, is not new, but there are lots of challenges as we try to make this workable as a general system. Somebody asked a question on transaction systems: Amazon does a little bit of that when you give your credit card; they have a very wonderful model of how the transaction actually happens. What have we learned? Scaling. Right? Major problem. Even though we have those 25,000 machines, scaling is still an issue. How do we scale? The problems you encounter when you move from 100 to 1,000 machines are very different from the problems you'll encounter when you go from 1,000 to 10,000. It's amazing: all the machines are the same, but why do they behave differently? I have no answer. Maybe one of you can provide the answer. Right? Availability. Handling failures.
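The straggler handling described above — watch-listing slow tasks and blacklisting flaky machines — can be illustrated with a toy policy. The thresholds and function names here are hypothetical, not Hadoop's actual heuristics:

```python
def find_stragglers(task_runtimes, slowdown=1.5):
    """Flag tasks running much slower than the average of their peers; the
    framework would launch a speculative duplicate of each on another node,
    and whichever attempt finishes first wins."""
    avg = sum(task_runtimes.values()) / len(task_runtimes)
    return [t for t, rt in task_runtimes.items() if rt > slowdown * avg]

def blacklist(node_failures, max_failures=3):
    """Stop scheduling work on nodes that keep failing tasks."""
    return {node for node, fails in node_failures.items()
            if fails >= max_failures}

# A task taking 40s while its peers take ~10s is a speculation candidate,
# and a node with four failed tasks comes off the rotation.
stragglers = find_stragglers({"map-1": 10, "map-2": 11, "map-3": 40})
bad_nodes = blacklist({"nodeA": 1, "nodeB": 4})
```

The design choice is the one the talk emphasizes: rather than trusting any one machine, cheaply duplicate suspicious work and let the healthy hardware win.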
Handling failures is something we have done fairly well in Hadoop, because we have programmed with the paradigm that, hey, machines are like human beings: we fail, they fail. A very nice way to relate. And data diversity challenges: when you come to Yahoo, you leave a ton of data, all in very, very different formats. Wouldn't it be great for me to know that you came as a Mail user and left a trace, so that when you come next time to look at finance news, I can give you a piece of information relevant to you? It would be fantastic to integrate all this data and see you as one user. Right now we can't. Right? Huge challenges in the kinds of data we get, and in how to integrate them and get meaning out of them. Growth challenges — people have addressed this already. And how do we make this seamless if we want to cloud-burst, as the professor will discuss in the following session? Do we have enough hooks in place, and the right APIs, to do it? Right? How do you find bugs in distributed systems? If a machine failed 10,000 miles away and your job didn't run, then you know you're distributed. But can you go fix that machine? Can you even find out that the machine is failing? Right? Challenging problems. Adoption challenges — this is for cloud overall. Can I go to Azure, or to Google App Engine, and say that I want to stay with them? What happens if, for some reason, as he mentioned, there's a falling-out and we want to go to a different vendor? And users — we need them, but have you ever heard a user say "I have enough space"? When you give free space, it always fills up, doesn't it? That's the problem we face. There's a lot of storage, but everything gets filled up, because everybody thinks they need everything, always. So how do you handle these kinds of challenges — the problem of plenty? You have a lot of storage.
You have a lot of data. But then, how do you manage it? Our QE: we get input from people like you, who actually write code and put it into Hadoop. We have to do quality control on that, make sure it works, and then run our search on it. Right? You contribute software which we can actually put into production code and use to run our search. Nothing can be more open than that. But there are huge challenges in how we do it. How do we scale? How do we test repeatably? How can we make tests reliable? Very, very interesting challenges in distributed testing. You can pick any one of these, and it's a research area in itself. There are enough challenges if you just pick scale: how do you test for 25,000 nodes in a test environment that is not 25,000 nodes? Right? Can you simulate at all, and if you can, how much can you simulate? That is a problem in itself. This is where we want to get to: we want to move the curve up, where we can do things quickly. And there's a huge challenge in testing, given that it's highly distributed and given that we get software from external sources — anybody can write software and put it into Apache, right, because of its open-source nature. Checkpointing parallel applications: very, very interesting, challenging research problems. This is where I run to the universities to get some ideas. Right? How do we schedule a job? If there's a problem, how do we reschedule? We are not even good at scheduling now — how do we schedule optimally? We don't know. Then talk about rescheduling. Right? Performance modeling. Energy-based optimizations: if all six cores of a machine are not being used, how can I shut off two cores? We are working with Intel on that, so that at the processor level we can determine how to keep cores powered on or off. Right? A very, very interesting set of challenges.
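The energy-based optimization mentioned above — shutting off idle cores — ultimately comes down to a policy decision like the sketch below. This is a hypothetical policy of my own for illustration; the actual Yahoo/Intel work would sit at the processor and firmware level, not in application Python:

```python
def cores_to_power_down(core_utilization, idle_threshold=0.05, keep_alive=1):
    """Pick cores whose recent utilization is near zero, but always keep
    at least `keep_alive` cores running to absorb incoming work."""
    idle = sorted(c for c, u in core_utilization.items() if u < idle_threshold)
    busy = len(core_utilization) - len(idle)
    # If too few cores are busy, hold back some idle cores as the reserve.
    spare = max(0, keep_alive - busy)
    return idle[: len(idle) - spare] if spare else idle

# Six-core machine with two nearly idle cores -> power those two down.
down = cores_to_power_down({0: 0.9, 1: 0.7, 2: 0.8, 3: 0.01, 4: 0.02, 5: 0.6})
```

Even a toy like this shows why the problem is interesting research: the policy has to trade energy saved against the latency of waking a core when an unpredictable load spike arrives.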
An extremely interesting set of challenges. Performance problems: how do we understand external failures? We have become fairly good at that, but not fully. We can begin to understand machine failures and essentially blacklist those machines. But there are a lot of other things that can go wrong. The network — we don't know. How do we find that out? How do we prevent data from flowing from this machine to that machine over this network, because this network may go bad? We don't know. Right? So, are there ways to find out if something is going bad? A research problem in itself. There are different kinds of challenges for providers — hardware, software, quality of service. How do I guarantee quality of service? Somebody asked a question about banking: unless we provide quality-of-service guarantees, ICICI Bank won't come to a cloud. Right? We are far from doing that. This is a very interesting area for research: coming up with models of workloads that can be expected to run predictably, so you can run those models on the clouds and have these real-time customers come onto the cloud. For us as users, there are different sets of challenges. I need to be able to burst out and contract. And I also need the ability to go to a different provider. Right? Do I have the right kind of APIs? Am I stuck if I go to Azure? Am I stuck if I go to Hadoop? Am I stuck if I go to Google? Do I have APIs that let me move across? So our view, really, is this: at least within Yahoo, it's not about saving money. For us, it's about driving innovation. We built Hadoop, a platform on top of commodity hardware, so that we can have very cheap storage and huge computing power to take the data that we store for long periods, compute over it, and let our research people come up with predictive models that can be used very reliably — and hence provide an excellent experience to you as users.
So that's it: we think it's more about driving innovation than about money. It may eventually get there, over a period of time — that graph he showed, it may eventually get there — but at least right now we view this purely as innovation: our ability to do things much faster than we otherwise could.

Question from the audience: I have one scenario; please share your thoughts on it. Suppose I have an organization with 3,000 desktops, and within my organization there are many projects going on, many using third-party tools and applications — there is a licensing issue with those — but at night all 3,000 of my desktops just shut down. So, using Hadoop, what are the possibilities for efficient utilization of these 3,000 desktops?

I'll give you a quick answer now; maybe we can chat offline later. The key thing to remember is that Hadoop uses the MapReduce paradigm, so the fundamental requirement is this: if the application you are trying to run does not conform to the MapReduce paradigm, you can't use Hadoop — at least for now. We are working to provide other parallel-processing interfaces, like MPI, as part of Hadoop. That's the key thing. Typically, desktop applications are mostly pieces that people work on individually. I don't know what your infrastructure is — whether they are all virtual machines, or all plain desktops, or whether you simulate something on top of them — but if they are plain desktops, it becomes a little hard. What you can do is this: if those machines can be combined to run other applications that are map-reducible, then you can use Hadoop. But right now, for your situation, I don't think Hadoop is the answer. Thank you.