Welcome, everybody. I hope you're having an engaging time at the summit. We did the keynote earlier today, and this session will go into much more detail, and at the end we'll probably have time for a few questions. Here with me are a few folks. One of them doesn't have his name on the slide, so I'll introduce him first: Jeff Gibbons, the director of engineering responsible for building and running the cloud. This is Kalyan; he's responsible for the engineering effort, contributions, and so on. And this is James Downs, the principal architect. He's the one who built the cloud for Walmart, and he was the champion who got OpenStack into Walmart, which is a big challenge. He got it in, and he's the one who can tell you his horror stories from Folsom to Grizzly, to a not-so-painful Havana, to an even less painful Juno. And hopefully in the next release, Kilo, there'll be no pain. But then where's the fun in no pain, right?

So I'm going to start with a brief history of Walmart, and we've already talked about some of this in the keynote. We are a retailer, a large retailer, and also a large technology user. We built the largest private satellite network back when the internet wasn't around yet. We always look at solving our big problems by looking for creative solutions and technologies, and if we find one, we go all out. We don't shortchange anything; any technology we adopt, we go big in it.

Our story is, again, very simple. We'll talk about our challenge. Then we will talk about the vision and strategy for getting everybody aligned on solving the problem with technology. Then we'll talk about the execution, which a lot of times we forget about; most good strategies fail with bad execution, so execution is a critical component. And obviously, what is the promise of OpenStack, and how good is it? Given that you're all here in Vancouver, I don't think we'll have to spend too much time on the capability of OpenStack.

Our challenge is physical scale. We're a large retailer with a lot of shelf space: billions of items on hundreds of millions of square feet of shelf space, so huge scale. 250 million customers step into our stores or visit our digital properties, so it's huge. Digital scale: about ten digital properties, I think now it's about eleven, and we keep adding more. We have about 1.5 billion page views, with 21% growth. And more than that, I think this is the frontier where we are growing faster than anybody else. We are bringing in new products; there's exponential growth, and we keep bringing in new products. We brought out Savings Catcher, which I talked about earlier today. These are the kinds of new benefits we want to give our customers, who already come to Walmart for the everyday low prices; they'll get some more. E-commerce 3.0 is where digital and physical are the same. There is no separate concept of a fulfillment center; every shelf is a fulfillment opportunity. Every store, from a neighborhood store to a Supercenter to Sam's Club to whatever physical properties we have, combined with the digital fulfillment centers across the globe, that's the promise of e-commerce 3.0. It's any product, anywhere, any time: delivered to your home, delivered to your office, picked up from the store, picked up from kiosks, picked up from lockers. Any product, anywhere.
In terms of what we needed to solve this problem: obviously, we didn't want to spend too much money, didn't want to pay an arm and a leg. Obviously, we wanted it to be secure, scalable, highly available, and distributed. And our developers and engineers wanted agility and the on-demand nature of infrastructure. Most of the time, when new applications are built and delivered, you don't know how successful they're going to be, and we wanted to be able to iterate over whatever we produced. And that's where our vision and strategy came in: being the technology arm of Walmart, our team got together, heads down, led by James Downs, to build the vision and strategy for the cloud.

So the vision and strategy, really, was that we were in the middle of rebuilding the e-commerce platform, and we needed something to accelerate the company in ways that Walmart had never used before. Walmart's a big company; we've used traditional IT services the whole time. Cloud potentially brought something else to the picture. We really needed a more agile way to deploy things, we needed self-service, and we needed automation, and all of those things come together. When I first came to Walmart, deploying a VM was a series of tickets, a series of people okaying even having a VM, and sometimes it took as much as a week to get a VM. So in order to get any of the newer things we wanted to deploy even working (well, that is cloud, right?), cloud was the first thing that we really needed. Two things go with that as well: you have to have your developers doing self-service, and you have to use automation so that the self-service and the cloud infrastructure can do what they need to do.

And so let's talk a bit about elasticity, and the need for an elastic cloud. Amandeep mentioned it earlier: holiday is 10x more traffic than the rest of the year, so we have really elastic workloads. One of the things we want to be able to do is offer perf and QA environments, dev environments, staging environments, places for developers to run their code, as well as production. If you do it all in a cloud, you can turn off the staging environments or the performance environments and instead use that capacity for something like holiday. These are some of the things we're still working on, because we need to answer some of the security and separation problems. For example, how do you keep your QA applications from talking to your production databases? If you don't do the separation and security pieces properly, you have things happen like QA talking to production databases, which becomes a big cleanup problem.

One of the other problems that we have right now, and I think our guy who does capacity is somewhere in here, is that we don't have a good chargeback or showback model. We're not really set up for it within the company, and so developers pretty much use more and more capacity without any sort of downward pressure on their usage. We're working on some good chargeback and showback models that allow you to say to a VP or to an application group: hey, your application is costing us X amount, I'll just say millions of dollars, it's probably not that much, to run your application.
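To make the showback idea above a bit more concrete, here is a minimal sketch of pulling per-tenant usage out of Nova's simple tenant usage API and pricing it. The endpoint, credentials, and unit rates are made-up placeholders, not Walmart's numbers; it assumes admin access and the classic password-based novaclient constructor of that era.

```python
# Rough showback sketch: pull per-tenant usage from Nova's simple tenant
# usage API and price it with made-up unit rates. Credentials, endpoint,
# and rates are placeholders; admin access is assumed.
from datetime import datetime, timedelta
from novaclient import client

nova = client.Client("2", "admin", "secret", "admin",
                     "http://keystone.example.com:5000/v2.0")

VCPU_HOUR_RATE = 0.02     # hypothetical internal rate per vCPU-hour
RAM_GB_HOUR_RATE = 0.005  # hypothetical internal rate per GB-hour

end = datetime.utcnow()
start = end - timedelta(days=30)

# detailed=False returns one aggregate usage record per tenant
for usage in nova.usage.list(start, end, detailed=False):
    cost = (usage.total_vcpus_usage * VCPU_HOUR_RATE
            + (usage.total_memory_mb_usage / 1024.0) * RAM_GB_HOUR_RATE)
    print("tenant %s: %.1f vCPU-hours, estimated $%.2f this month"
          % (usage.tenant_id, usage.total_vcpus_usage, cost))
```

A report like this is the raw material for that "your application is costing us X" conversation; the hard part is agreeing on the rates, not collecting the numbers.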
And I think that's what really makes some of the cloud models work: yes, it's elastic, yes, you can use what you need, but if there's no downward pressure on your usage, the cloud models don't really work. It's just VM sprawl, like we've dealt with forever, right? The other thing that we really needed to address was Walmart scale. We're moving all of our applications from existing platforms onto cloud platforms, and this meant we were going to have the kind of exponential growth that we saw in the slides; actually, I think we reuse that slide. But we needed to know that we could scale beyond a rack of gear, two racks of gear, some dev environments. We had to go into much, much larger environments, right? We needed to hit production scale. And we dealt with the questions of: okay, this is an open source project, and we're bringing a large open source project into a big enterprise, and there are some scary things about adopting open source, right?

Bringing bigger open source projects to Walmart is something that we've begun to be able to do, and OpenStack proving itself at holiday scale this past year means that there's a solid base for Walmart to deploy cloud on, and that's OpenStack. So thank you, guys. That bet from the beginning, that OpenStack was going to make it, paid off, and we took holiday traffic last year. And thanks to all the other Walmart guys here who helped make that happen as well.

Another reason that OpenStack is important to us, no matter what flavor of it we use, is that giving back is a Walmart value. There are a couple of items here showing ways that Walmart has given back to the community. We're not contributing a lot of code yet, but that is one of our goals. We have contributed a lot of code to open source, mostly in our mobile area, and some of our platform teams have things they'll announce in a quarter or two. OpenStack gives us, and gives operations people, a way to contribute back to the community. So that's one of the reasons that we chose OpenStack: it was something we could do, and that's not readily available with very many other cloud operating systems.

That brings us to a little bit about how we did this. You can draw a gigantic Visio diagram and plunk all of the cloud technologies onto it. By the way, I stole this slide; it's been making the rounds somewhere, and nobody seems to know who it's from. If anybody knows who made it, I'd love to give attribution. Anyway, you can plunk all the cloud technologies onto a big Visio diagram, but you can't deploy that Visio diagram from day one, right? All the pieces aren't there, and we don't even know what we're going to do with some of the pieces. So the idea here is that we deployed as little as possible that was going to make our customers happy, and for us that meant starting with a compute cloud. The other thing that we found along the way is that you start talking to developers, or you talk to our developers anyway, and you ask: what do you want? What will you use in a cloud? What services do you want? And we started hearing: well, it would be great if we had volume services, because then we could make a volume, configure it the way we wanted to, clone it, and attach it to all of our VMs. If you look back a couple of slides, you see that automation is something we have to get them using, or we can't really deploy cloud applications. So we started with a cloud deployment that didn't have any storage.
So I'm going to have to go ahead and disagree with the SolidFire keynote earlier: we have a successful cloud that has no storage services in it right now. And the reason, simply, is that we wanted to keep people from cheating on automation. One of the next big projects we are doing is storage, and solving some of those issues.

The other big issue that we have here, and this is for Simon, is that capacity planning is tricky. We had so much pent-up demand that we would bring on new regions of capacity and they would fill up almost faster than we could onboard people into Keystone. And that goes back to the problem of: well, how do you plan capacity? How do you manage what people are using? What we're seeing is that quotas are only part of the picture, right? In trying to provide capacity to people, giving them exactly the quota that they need today and following the sort of resource model you would use for physical hardware doesn't work the same way. The quota is used up like that, right? And then they want more, so you're always adding more capacity. That's awesome if you're a public provider, because you charge X cents per minute for everything that all of your users are using. In a private cloud in an enterprise, that's horrible, because now you've just used up all your capacity for the whole year, your budget's gone, and what are you going to do for the next quarter, right? We're still developing some of those models. The chargeback/showback model, I think, is very important: you need to be able to give people some way to have downward pressure on what they've got and what they're using, and to show a value for what they've got. But additionally, dealing with quotas, and going back to elasticity, you have to have enough extra capacity that people have the feel of elasticity, even if you don't have unlimited hardware, and you don't, right? Even the public cloud providers don't have unlimited capacity. What you have is the feeling of capacity, and we're still working out those models in a world where everyone uses as much of everything as they can possibly get.
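One way to see the "quotas are only part of the picture" problem is to compare what you have promised through quotas against what physically exists. This is a hedged sketch under the assumption of admin credentials and placeholder endpoints, not how Walmart actually does its capacity planning.

```python
# Hedged sketch: compare the sum of tenant vCPU quotas against the actual
# vCPUs in a region, to see how far "promised" capacity exceeds physical
# capacity. Credentials, endpoint, and admin access are assumptions.
from keystoneclient.v2_0 import client as ks_client
from novaclient import client as nova_client

AUTH_URL = "http://keystone.example.com:5000/v2.0"  # placeholder

keystone = ks_client.Client(username="admin", password="secret",
                            tenant_name="admin", auth_url=AUTH_URL)
nova = nova_client.Client("2", "admin", "secret", "admin", AUTH_URL)

# vCPUs promised to tenants via their quotas
promised = sum(nova.quotas.get(t.id).cores for t in keystone.tenants.list())
# vCPUs that actually exist on the hypervisors
physical = sum(h.vcpus for h in nova.hypervisors.list())

print("vCPUs promised via quotas: %d, physical vCPUs: %d (%.1fx committed)"
      % (promised, physical, float(promised) / physical))
```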
And I like this slide because it's the opposite of the slide that Amandeep had earlier. Amandeep was talking about taking the mess that we had of all the different applications running in all of our data centers, and to a large degree we didn't know the interactions of the different pieces in the old platform. In the new platform, we do know what the pieces are; it's all service-oriented, it's all clean deployments. But what we're looking at in this slide is the idea that we've had to train our developers, our operations people, and everybody used to running the traditional existing platform to say: look, you don't know where that VM is, and you kind of don't want to know where it is. You don't know what piece of hardware it's on, and you kind of don't want to know what piece of hardware it's on. Oh, and by the way, when that hypervisor fails, it's up to you to make sure that your application keeps running. So in many ways, the traditional IT world of physical hardware, and even of VMware where you are doing specific placements, turns into: well, you've got to let go of some pieces. And that's one of the hardest things, I think, especially for operations people, right? Operations people are kind of control freaks, right? I'm one of them. And you have to, to a certain degree, say: I can be okay with not knowing exactly everything at every moment.

So the first thing that people say is: well, we need a CMDB, right? We need a record of where all the VMs are. And my challenge to the operations team has always been: you don't need to know where everything was an hour ago. You need to know where something is right now, when you're trying to track down a problem. And that's where the APIs come in: you don't need a database that tells you where every VM is; you need an API that you can call that tells you where that misbehaving VM is right now. So, you know, in a pets world, you kill it off... sorry. Wow, good thing I don't own any pets, right? We don't like some of those pets very much, right? In a cattle world, you just replace it. I'm going to turn it over to Kalyan.
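Here is a minimal sketch of that "ask the API, not a CMDB" idea: given the name of a misbehaving VM, ask Nova where it is right now. The credentials, endpoint, and VM name are placeholders, and it assumes admin access plus the extended server attributes that expose the hypervisor hostname.

```python
# Minimal sketch of "ask the API, not a CMDB": given the name of a
# misbehaving VM, ask Nova where it is running right now. Assumes admin
# credentials and the extended server attributes extension being enabled.
from novaclient import client

nova = client.Client("2", "admin", "secret", "admin",
                     "http://keystone.example.com:5000/v2.0")

def where_is(vm_name):
    # all_tenants lets an admin see every project's VMs, not just their own
    for server in nova.servers.list(search_opts={"name": vm_name,
                                                 "all_tenants": 1}):
        host = getattr(server, "OS-EXT-SRV-ATTR:hypervisor_hostname",
                       "unknown")
        print("%s (%s) is on hypervisor %s right now"
              % (server.name, server.id, host))

where_is("checkout-app-042")  # hypothetical VM name
```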
Hey, so I know I'm going to need to talk about pain, about when we move from one version of OpenStack to another version of OpenStack, and I'm sure all of you have heard a lot about scale. In my short journey at Walmart, my six months in, scale is what amazed me and actually took me aback. And very fortunately, all of our execution team was able to attend the OpenStack Summit, and they're sitting here with us in Vancouver, so it's been a great experience. So, pain from upgrades, and at scale it's more pain; this is where all of our execution comes in. OpenStack is going through a six-month release cycle and we're a bit behind; we currently have our production on Juno. But for us it is very, very important to lock down on calendar dates. Thanksgiving, Black Friday, and Cyber Monday are super critical; on those dates, we cannot fail. We have to plan releases and schedules accordingly so that we don't disrupt that cycle. So regardless of the community's cadence and the contribution work, engineering and execution have a very, very unique challenge in that sense. All of our infrastructure initiatives follow agile methodologies and two-week sprint cycles. And we really have three pillars within the whole execution: the OpenStack engineering team, the cloud build team, and the cloud operations team. These three pillars are primarily responsible for delivering the newer versions of the OpenStack clouds on time, before they hit the platform and application teams and before they get them into their hands. All the tools that have been written internally are primarily in Java, and our long-term strategy and plan is to eventually move at least some of them to Python. There was a mention earlier of about 40% of enterprise applications still being on .NET and everything else; the storefront and e-commerce side is a bit different in the Walmart world, but we do have a lot of .NET presence in the tools as well. For our private cloud data centers, we have several; I'm not allowed to give specific numbers, but two of them are largely, entirely on OpenStack, and these are now almost on the verge of getting everything upgraded onto Juno. Last holiday, the 2014 holiday, was run completely on Havana, and we're there. Our engagement is with several partners, all the way from hardware to the tools side of things. One of our vendors that supplies the generic hardware actually does an express factory roll-out for us.

We do burn-in tests extensively, and it's a roll-in process into these data centers where we get all the racks assembled and delivered on site. Last holiday we deployed 2,500 nodes with half a dozen to a dozen folks in the organization, so that's a tremendous feat that we completely leveraged our partners for. The newest servers that we've got are 50% cheaper than what we've traditionally been investing in. Ever since we embarked on OpenStack and generic hardware, we've had that 50%, these servers are two times more powerful, and we've decreased our cost per managed node, our total operating expense divided by the total number of servers, by around 300% right now. And the intent is to keep getting the next generation of servers from our vendors on generic hardware. On the image front, we are currently using Ubuntu images for our hypervisors and controllers; the guest images that the application teams use are primarily CentOS and Red Hat. Flavor sizes have been a contentious issue; several applications require different sizes of VMs, so we offer small up to what we call 3XL. This is essentially the compute and memory footprint that each application requests based on its capacity and how it puts through those requests. All these VMs are of different sizes, and this comes in through OneOps, which is our PaaS layer. It used to be a 90-day average turnaround time, and we have now taken that down to being able to provision 90 machines per minute. That's a fantastic turnaround that the application teams are already seeing, and they're benefiting largely from it. This is literally farm to table if need be, and we try to keep it that way, and we carry a little bit of extra capacity during the holiday. So there's a lot of capacity planning that goes on to make sure that we don't break anything during our holiday and our critical dates and time zones. We can grow application workloads on demand. We can patch the nodes for security. We can bring VMs down if we suspect any of them have become vulnerable or been breached. We can move across data centers, we can fail nodes, we can take off the workloads, and we keep our applications balanced between data centers for fault tolerance.

In our history, I think we've come a long way. As James mentioned, Folsom, or was it Grizzly? We started with Folsom. And I'm sorry, we actually have three nodes still running Essex, but there's no workload on them and we just have to get our capacity manager to turn them off. He's in the audience; that's why I'm saying it right here. So there are older builds, maybe on what we call legacy hardware, which is still fairly new, but the execution, engineering, and operations teams are very, very quick, with seven-day break-fixes, so they get things up and running fairly quickly. I'm happy to say they are still running, and our production oversubscription is a little over a two-to-one ratio, so we do overprovision. Most of the workloads in our two all-OpenStack data centers are production workloads; there may be a small footprint of dev and QA VMs in there. For 2014, we had one tenant in the entire elastic cloud, and we want to move other tenants in. You keep hearing about other markets that Walmart has already acquired across continents and several other countries, and the plan of record for 2015 is to bring several of those other markets into the OpenStack clouds as well.
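As a purely illustrative aside on the small-to-3XL flavor ladder mentioned above, this is roughly what defining such a ladder looks like through the Nova API; the vCPU, RAM, and disk numbers here are invented for the sketch, not Walmart's actual flavor definitions.

```python
# Illustrative only: one way to define a small-to-3XL flavor ladder with
# python-novaclient. The sizes are made up, not Walmart's real flavors.
from novaclient import client

nova = client.Client("2", "admin", "secret", "admin",
                     "http://keystone.example.com:5000/v2.0")

FLAVOR_LADDER = {           # name: (vcpus, ram_mb, disk_gb) -- hypothetical
    "small":  (1,  2048,  20),
    "medium": (2,  4096,  40),
    "large":  (4,  8192,  80),
    "xl":     (8, 16384, 160),
    "3xl":   (16, 49152, 320),
}

for name, (vcpus, ram, disk) in FLAVOR_LADDER.items():
    nova.flavors.create(name=name, ram=ram, vcpus=vcpus, disk=disk)
```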
So the eventual goal, in the near future, and how far out that near future is, is something we'll probably get back to you on and you'll hear more from us, is to be able to offer data center as a service. DC as a service is what we would like to eventually be able to offer to our markets. This slide is essentially the IaaS, PaaS, and SaaS layers. The IaaS team is essentially the three pillars that I talked about earlier, offering all these workloads. All the RESTful APIs were originally written in Ruby, and right now we're standardizing our infrastructure components: DNSaaS, LBaaS, et cetera. OneOps was acquired in 2013 for the PaaS layer, so they are responsible for provisioning the VMs for the application teams and offering all of the capacity. Going forward, in-place upgrades are what we're mostly focusing on. We did not do an in-place upgrade from Havana to Juno, and we skipped Icehouse, but the plan of record is to hopefully have a seamless in-place upgrade going from Juno into Kilo in our production data centers. Havana, again, ran Black Friday and Cyber Monday, 1.5 billion page views, with zero downtime during the 2014 holiday; I think that's a tremendous feat. We invested very heavily in Swift for the 2014 holiday, and our current capacity is around a petabyte of Swift storage. And 70% of our traffic came through mobile this year, the 2014 holiday season, so the mobile applications were doing fantastic. This is the third iteration, and for the Kilo release we also plan to introduce SDN when we bring it into production. So there we are: six DCs currently, 14 regions. These are actually old numbers; I think we are way over them right now, with 100,000-plus production cores. I think we are at 25 or 26 tenants currently, and the hypervisors have also increased considerably, to the order of thousands. That's it; that's all I have.

We don't have any more content, but we are open to questions. So I think, James, you were going to tell us why we went with OpenStack and not the other options, and that was the promise of OpenStack; I think that's a slide which we missed. So I mentioned a little bit about this earlier, but one of the things that was mentioned in the keynotes is true for us at Walmart as well: we have a lot of vendors. We buy a lot of things from a lot of different hardware providers, and we have different needs in different parts of the organization. So the underlying pieces that make up the different parts of our cloud are going to be heterogeneous. There's no way that Walmart will be able to run one single unified set of hardware that makes up the cloud. That being said, there is a set of APIs, which just happens to be open source, that thousands of developers have written, that we have access to, and that we're running in production, and that's called OpenStack. That wasn't necessarily part of the initial vision, because nobody really saw that in 2012. It was a good bet, based on looking at the size of the community even then, that OpenStack had a momentum that nobody else had available, and it's a bet that has proven itself out. So whether we're running a cloud in the stores, or in the clubs, or in GEC, that's global e-commerce, it is a unified API, and OpenStack gives that to us unlike anything else that's available.
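A small sketch of what that "one unified API" means in practice: the same client code can be pointed at every region, with only the region name changing. The region names, credentials, and endpoint below are placeholders, not Walmart's real topology.

```python
# Sketch of the "one API, many data centers" point: identical client code
# runs against every region; only region_name changes. All names here are
# placeholders, not Walmart's actual regions or credentials.
from novaclient import client

REGIONS = ["dc-east-1", "dc-west-1", "ecomm-1"]  # hypothetical region names

for region in REGIONS:
    nova = client.Client("2", "admin", "secret", "admin",
                         "http://keystone.example.com:5000/v2.0",
                         region_name=region)
    hypervisors = nova.hypervisors.list()
    servers = nova.servers.list(search_opts={"all_tenants": 1})
    print("%s: %d hypervisors, %d instances"
          % (region, len(hypervisors), len(servers)))
```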
Another thing, James: I think we wanted to talk about our biggest challenges, and I'm going to give you a hint, it's in the networking area and it's one of your pet projects. Sure. So Walmart runs kind of traditional data centers, as you might imagine, and we have kind of traditional enterprise networking, as you might imagine, and maybe somebody can guess how things are separated on the network. VLANs, right, exactly. So what did we do? In 2012, we were still looking at nova-network, and there weren't a lot of sophisticated features there. So we sort of did what Walmart had done in the rest of its networking: we did a flat network for the cloud as well, and I don't think that this is an uncommon story. The problem is that now you've got tens of thousands of VMs on various flat networks, and the security people start to freak out, right? There's no real way to coordinate a tenant network from one cloud to another cloud, separating traffic from different zones. By the way, we cheated a little bit on that, because we just didn't put PCI in the cloud, right? So we avoided that whole gigantic elephant in the room. In order to allow our auditors and security people to have some peace of mind, the networking model is going to have to evolve; it's going to have to be a lot more sophisticated. And I think we've got our security guys to the point where they admit that you can't really just put a physical firewall or a PAN device of some sort between any two things in the cloud; I think they agree that that's not a workable thing. So where do you go from there? Neutron has made giant strides from the Quantum days to the Neutron days. And one of the next big challenges is: how do we deploy SDN in an existing cloud? You have the PaaS layer that Kalyan mentioned, and the PaaS layer has to understand how to compose interesting network architectures for whatever application you want to run. If you have a virtualized database tier, you probably only want it to talk to its own application tier, and it needs to talk to a load balancer. So those things are the next challenges that we have. The PaaS layer has to understand it, and interestingly enough, your developers have to understand it as well, because the application has to take into account the fact that the networking works differently from the flat network that everyone's gotten used to. You can't just SSH into every single VM, so your access methods are different. If you want to secure things for PCI, maybe you can't SSH into any of them at all anymore, and then how do you get into them to diagnose logs or problems, or restart things? So those are the next big challenges: what are we going to do with SDN?

So this is a challenge from a very strategic point of view. I think we've solved some of the big challenges on the compute side, and we're maybe working on the storage side, but this is the area where our biggest pain point is, and this is an area where I think some of us are very, very scared, not so much you. So this is an area we'll be looking into in the future; we're already working on a bunch of POCs and some things of our own.
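To make that tier-isolation idea concrete, here is a hedged sketch using plain Neutron security groups, where the database tier only accepts traffic from members of its own application tier's group. This is illustrative only, not the SDN design Walmart will land on; the group names and port are assumptions.

```python
# Hedged sketch of the tier-isolation idea: a database security group that
# only accepts MySQL traffic from members of the app tier's security group.
# Plain Neutron security groups, not a full SDN design; names and the port
# are illustrative.
from neutronclient.v2_0 import client

neutron = client.Client(username="admin", password="secret",
                        tenant_name="demo",
                        auth_url="http://keystone.example.com:5000/v2.0")

app_sg = neutron.create_security_group(
    {"security_group": {"name": "app-tier", "description": "app VMs"}})
db_sg = neutron.create_security_group(
    {"security_group": {"name": "db-tier", "description": "database VMs"}})

# Only members of app-tier may reach db-tier, and only on the DB port.
neutron.create_security_group_rule(
    {"security_group_rule": {
        "security_group_id": db_sg["security_group"]["id"],
        "direction": "ingress",
        "protocol": "tcp",
        "port_range_min": 3306,
        "port_range_max": 3306,
        "remote_group_id": app_sg["security_group"]["id"]}})
```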
I think another challenge I'm going to ask Jeff here about is the recent challenge which we had last week and how we performed on it; it's the poisonous thing I'm talking about. Yeah, so you're talking about VENOM. So as you know, VENOM came out last weekend. It was a zero-day, so we were quick to work with the community as well as with a vendor to get a fix, and our team quickly worked and actually deployed it within hours and removed that issue that was out there. So great job to our team for doing that, as well as to the community for releasing something so quickly. And again, here's the promise of OpenStack: we were able to do that, I think within office hours, with our entire cloud rebooting all the VMs. A lot of the credit also goes to the application developers, who have, not all but most of them, built applications which are cloudy in nature, so we were able to move traffic from one DC to another DC, take an entire DC down all at the same time, and we didn't have any noticeable outage. We had hiccups here and there, but that's the benefit of having high availability: you might have some applications not working once in a while, but the rest of the applications can take care of the traffic.
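For a sense of what a remediation like that can look like mechanically, here is a rough sketch of draining a hypervisor through the Nova API so the host can be patched and rebooted: disable scheduling on the host, then live-migrate its VMs away. Walmart's actual tooling isn't public; the hostname, credentials, and simplified wait logic here are placeholders.

```python
# Rough sketch of a drain step like the one described for the VENOM fix:
# disable scheduling on a hypervisor, then live-migrate its VMs away so
# the host can be patched and rebooted. Hostnames, credentials, and the
# missing wait/poll logic are simplified placeholders.
from novaclient import client

nova = client.Client("2", "admin", "secret", "admin",
                     "http://keystone.example.com:5000/v2.0")

def drain(hypervisor_hostname):
    # Stop new VMs from landing on this host while we work on it.
    nova.services.disable(hypervisor_hostname, "nova-compute")
    servers = nova.servers.list(
        search_opts={"host": hypervisor_hostname, "all_tenants": 1})
    for server in servers:
        # Let the scheduler pick a target host; block migration because
        # no shared storage is assumed here.
        server.live_migrate(host=None, block_migration=True,
                            disk_over_commit=False)
    # In real life you would poll each VM until it reports a new host.

drain("compute-0042.example.com")  # hypothetical hypervisor name
# ...patch qemu/kvm on the host, reboot it, then re-enable nova-compute.
```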
I think those were the things which we wanted to cover, unless you think we missed some other things. And we've got maybe four or five minutes, so maybe a couple of questions. Two questions? From the mic here, if you would, please. I'll take the question and we'll repeat it. So the question is: what kind of hardware are we using? James, you want to? We've done what is called a single SKU. So it's a standard block, and we have now divided that into ones that are applicable for applications, for databases, and for storage. I don't know if we'll tread down the same path when we revisit the network and SDN and everything else. The intent is not to be locked into a vendor, right? And if you really think about it, that's actually infeasible, when you look at the markets that are out there globally, where you get the hardware, having quick SLAs and seven-day turnarounds, and different regions having different market dates and everything else. So we do not want to standardize on a specific vendor or a specific set of hardware. We want to keep it as heterogeneous as possible, and what we want to define is blocks: a standard SKU for applications, a standard SKU for databases, and a standard SKU for storage.

I'm sorry, you'll have to repeat that, I can't hear. Are we working with the Open Compute Project in any way? I think I can answer that. We are looking at different Open Compute options, but we are in between a completely Open Compute standard and enterprise hardware, and for our use cases I think that's probably the next iteration. We started in the beginning with, I think, five SKUs, and we've now narrowed it down to three SKUs, which is basically one SKU where just the storage changes up or down; everything else is the same. Next, we still have a lot of redundant components in our systems, so the next system design is probably going to get rid of some of those components, and then I think the last step would be to go to Open Compute. And honestly, Open Compute requires a scale where we're almost there, but not completely there. So I think very soon, as more and more properties and more and more applications from the Walmart stores and everything come in, we'll be going to Open Compute.

Which distribution and which configuration management tool do we use? Yeah, so we're using the Rackspace private cloud distribution, and the deployment is all based on the community Ansible playbooks. We're still sort of playing with some of the configuration pieces. Previously we have used Chef, and we use Puppet in our PCI environment for auditing and that sort of thing. For the pieces that don't work well in Puppet, or sorry, in Ansible, I think we'll be investigating SaltStack, and that'll sort of keep us on a single language platform. There's a session later talking about whether OpenStack should allow non-Python; for now, I think it's an opportunity to sort of unify the language across the things that we're doing.

One last quick question; we have like 45 seconds. Go ahead. You mentioned tools; what other tools have we developed or enhanced around OpenStack? So, one, we acquired a company for the PaaS layer, which manages a lot of our on-demand provisioning and deployment for application developers. On our engineering side, we have... 50 seconds. Okay, that was wrong. OneOps.com is our PaaS layer, so you can take a look at it there. The other piece is that we've developed some things that make running the cloud a little bit more doable, operating the cloud from an operator's point of view. There are different ways of looking at the same problem: developers look at it one way, from their application they want fast deployment; as operators, we want to be able to reboot, restart, and move capacity here and there. So we built some tools on that side, and then obviously we have monitoring tools and a lot of other tools. Thank you very much for coming.