Okay, good deal. Hopefully... yeah, there we go. Steve Eastum here, Director of Web Architecture for BestBuy.com. With me today: my name is Joel Crabb, and I'm the Chief Architect for BestBuy.com. Joel is going to cover a lot of the background about BestBuy.com, what we're doing as a platform, and where we're headed with cloud overall, and then I'm going to pick up on our particular use case with the CDC.

Apologies a little in advance: if you went to the keynote, we are reusing some of the slides. Coming from Best Buy, I first have to tell you a little bit about Best Buy. We're the world's largest multi-channel retailer, as well as the 11th-largest e-commerce site. We had about 1.6 billion visitors last year, and Reward Zone is one of the largest rewards programs that exists. And what else? Oh yeah, we have outstanding staff who will tell you unbiased things about all the great devices we sell: impartial and knowledgeable advice, competitive prices, and the ability to shop anywhere you want. That's the spiel from Best Buy itself; now let's get to the good stuff.

We started looking at cloud in about 2010, and what we started with was the need for a disaster recovery site. So we created what we call DR Lite, a very lightweight site that does browse, search, inventory, store lookup, and store locations. The reason for that is that a lot of the traffic we get on BestBuy.com is really people going to the site to see if things are available in a store and then going to the store. Even if BestBuy.com is having an outage, we still want to support all the people coming through the site to find out what's in the store. So this was a big thing for us. We built it out, we put it on a cloud vendor, and we can pull it up in about ten minutes. It sits cloud-resident all the time, and we really just have to elastically scale it out whenever we have an outage or an outage window for BestBuy.com; not that we ever want those, but it does happen occasionally.

Next, there are a few smaller properties that are also in the cloud. My Reward Zone is probably the best known; that's completely cloud-resident except for a little bit of database on the back end. And then what happened kind of naturally was that many teams started to use cloud for testing. As in most companies, infrastructure is always a little bit tough: procurement and getting things actually built out in data centers takes longer than most teams are willing to wait, and many teams just started using cloud resources instead.

So, you saw a slide earlier: we have a cloud rearchitecture under way. We're in the process of rearchitecting BestBuy.com into a new e-commerce platform. As part of that, a very large amount of our traffic is what we call just browse and search: people coming to the site looking at stuff, searching for products.
That's upwards of 90% of our traffic, and as you can see from this graph (this is just pulled off Wolfram Alpha), our traffic spike around the Thanksgiving time frame is about seven times our normal traffic. So as we looked at our traffic, it was quite obvious that elastically scaling our browse and search platform, and only using that many resources during Thanksgiving, for one week of the year, is much better done in a cloud than built out in our data center, which is what we had done every year before starting to move the architecture toward the cloud. And, as we're saying today for the first time, we actually served about 25% of BestBuy.com traffic during holiday last year, from about July onwards, off of our cloud architecture.

So, at a really high level, what are we doing at Best Buy on the browse layer? We're putting a global traffic manager in front of multiple clouds. We're going with multiple cloud vendors, really trying not to get locked into any particular vendor, and multiple vendors because vendors do occasionally fail and we don't want to be tied to just one to serve our browse architecture. We can transfer between cloud vendors through the global traffic manager at any time. A lot of traffic still comes back to our data center: all the commerce traffic, all the secure traffic, is still served from the data center.

A slightly lower-level view of what we're doing in the cloud: it's pretty similar to the Samsung view, if you were here for the last presentation. We have cloud load balancers in front, a web application tier (Tomcats in general), and a very highly scaled service aggregation tier. We take a huge amount of traffic; last year, estimates were that we were the number-three traffic site during holiday, so we have traffic similar to the biggest players in e-commerce. The main point of the service aggregation tier is to be really highly scaled, with heavy caching, and to serve up disparate services: 30 to 50 services that we're calling for any given product detail page, put together and served back out to the page in less than about a second.

Then, as you see, we've reflected this architecture across multiple clouds, multiple vendors, multiple availability zones, and multiple regions for the increased reliability and scalability you get from that, and we still have a lot of data back in our data center. The one thing you do see at the bottom is our product data: the product catalog is one thing we've scaled out to be cloud-resident as well. It actually replicates out from our data center into all the different cloud regions you're seeing. We serve about 90% of the product detail page from the cloud right now, and we're trying to get to 100% this year.
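(For the technical folks: the pattern that aggregation tier uses, fanning out to many backing services in parallel, caching aggressively, and assembling one response, is easy to sketch. The service names, URLs, timeout, and cache TTL below are made-up placeholders, not Best Buy's actual implementation.)

```python
# Minimal fan-out/aggregation sketch: call several services in parallel, cache results,
# and assemble one payload for the page render. All endpoints here are hypothetical.
import concurrent.futures, json, time, urllib.request

SERVICES = {
    "price":   "http://price.internal/api/v1/price/{sku}",
    "reviews": "http://reviews.internal/api/v1/summary/{sku}",
    "stock":   "http://inventory.internal/api/v1/availability/{sku}",
}
_cache = {}          # (service, sku) -> (expires_at, payload)
TTL_SECONDS = 60

def _fetch(name, url):
    with urllib.request.urlopen(url, timeout=0.5) as resp:   # keep the whole page under ~1s
        return name, json.load(resp)

def aggregate(sku):
    results, to_fetch = {}, []
    for name, template in SERVICES.items():
        hit = _cache.get((name, sku))
        if hit and hit[0] > time.time():
            results[name] = hit[1]                            # served from cache
        else:
            to_fetch.append((name, template.format(sku=sku)))
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
        futures = [pool.submit(_fetch, n, u) for n, u in to_fetch]
        for future in concurrent.futures.as_completed(futures):
            name, payload = future.result()
            _cache[(name, sku)] = (time.time() + TTL_SECONDS, payload)
            results[name] = payload
    return results                                            # handed to the render tier
```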
So what does this mean for customers, which is really where it boils down to? We think it means a better experience, if you describe a better experience as a cleaner product detail page. The one on the right here is the new product detail page we put in along with our cloud architecture; the one on the left is the old product detail page, served by our data center. We rolled this out over time: category by category we started to put in the new architecture and move it to the cloud, so we came onto the cloud very slowly. We started at zero percent, went to one or two percent, and then, as we ramped up over the months before holiday this year, we got all the way up to about 25% of our traffic being served off our cloud.

The best part for us is that this page is just significantly faster than the old one. The old page had a lot of post-render JavaScript: as the page loaded in your browser, it would go off and call 50, 60 third parties to fill in the rest of the page as it built. As you all know, that's not very reliable; the vendors supplying the third-party content don't necessarily have the SLA we want from them, and that's why the load time could range anywhere from about seven seconds to thirty. It really depends on your particular connection and whether the vendors are working well that day. What we did as part of this cloud program is pull all that data that was being filled in post-render, serve it through our servers, and render the entire page to you at once. We get a much more consistent experience, usually about two to two and a half seconds, and that leads to better consumer views of our pages and happier people, and we all like that at Best Buy.

What we're getting to today, though: going back to all of our teams using various vendors' clouds for their testing, it became very chaotic. At any time we might have 40 different teams working in parallel across BestBuy.com, and it all rolls up into one massive build right now, so they all have to integrate at some point. Because our lower environments really weren't up to snuff, a lot of those teams just moved to the cloud; they couldn't get their things provisioned in our integration environments. The integration environments were failing; with 40 teams trying to use four or five integration environments, it just doesn't work. So they all went off and built their own stuff, and you got really serious inconsistencies in what they were building. It got really expensive too, because teams would build up an environment and then just kind of walk away from it, and it would sit out there clocking a couple of cents an hour, which all adds up when you've got 40 teams doing it.

So we decided we needed a solution to this problem, and the solution we came up with is what we now call our continuous delivery cloud. We built an OpenStack cloud that we let all of these teams use as tenants, giving them basically free access to build their own test environments. On top of that, we tried to remediate the inconsistencies by giving them the ability to deploy something very close to a BestBuy.com production environment, with the click of a button, into the environments we had for each team.
So now every team is working on a consistent architecture and consistent infrastructure as they do their individual testing, and when they finally get to integration testing we no longer have the problem of teams that haven't tested their own stuff before testing it in conjunction with everybody else. Steve is going to talk about what we actually built, and this is where I hand it over to him.

All right. I'm going to go back in the wayback machine to when we started looking at this, in mid-2011. We ran a value stream mapping (VSM) exercise looking at how much we spent on a major release; VSM is a lean manufacturing concept that Toyota has used successfully. We looked at what the friction points were, and they were around data and a lot of manual work: a lot of communication, teams asking other teams, "Hey, can you load this data for me? Can you move this build out to this set of infrastructure?" A large part of those costs were discrepancies, regressions that found their way in, and availability problems. That's what happens: the more hands doing things manually, the more trouble you get.

We also needed a way to get more parallel development. At the time we had three QA swim lanes, and all the teams were trying to integrate in that QA environment, not earlier, so it was really about finding a way to get earlier integration and get things happening earlier and better. And then, just looking at the cost of infrastructure: our cost structure was about $20,000 per provisioned VM, and it could be even higher for fully managed with monitoring, the whole nine yards, just for a VM on infrastructure we'd put in place. That couldn't continue; we knew we couldn't keep scaling that way.

At that time we did not know we were going to get the green light for the new platform Joel has been talking about, which we've been rolling out and will continue to roll out. You'll hear about it and some of the new things coming on our quarterly analyst calls; I can't get into too much detail, I'll let our CEO talk about that. And we didn't know we were going to need that rapid pace: more automation, more APIs, more teams able to run in parallel. We needed all of that to get the new platform out the door by holiday last year.

So what does our continuous delivery cloud provide? It's an innovation catalyst: if teams are looking at launching some new cache technology, some new NoSQL platform, whatever they need, they can jump ahead and go do it themselves. We also ran a little mini product team, still running today, and they built a push-button development environment in Jenkins called Omnitank. I don't know about the name, but it's an awesome name. Developers can fire up their test environment on demand: they just fill out a form and up comes a test environment. They get their app server, they get their web server, they get everything they need, and they get DNS names based on the name they give it, so it's really an integrated thing.
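(To make that concrete: under the hood, a push-button job like that boils down to a few OpenStack API calls. Below is a minimal sketch using python-novaclient of roughly that era; the credentials, endpoint, names, image, and flavor are placeholders, the exact client constructor varies by novaclient version, and this is an illustration rather than Omnitank itself.)

```python
# Rough sketch of what a push-button environment job might do: boot an app server
# and a web server from known images and report their status.
from novaclient.v1_1 import client   # Essex-era import path; newer versions differ

nova = client.Client("jenkins-svc", "secret", "team-tenant",
                     "http://keystone.cdc.example:5000/v2.0")   # placeholder endpoint

def boot(name, image_name, flavor_name):
    image = nova.images.find(name=image_name)
    flavor = nova.flavors.find(name=flavor_name)
    return nova.servers.create(name, image, flavor)

env = "team-a-feature-123"                     # would normally come from the Jenkins form
app = boot(env + "-app", "ubuntu-tomcat", "m1.medium")
web = boot(env + "-web", "ubuntu-apache", "m1.small")

for server in (app, web):
    print(server.name, server.status)          # a real job polls until ACTIVE, then registers DNS
```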
Then the more advanced teams that want to build their own Omnitank, build their own PaaS and launch it, have the full API-driven capabilities of OpenStack there for them. And again, as on the last slide, the whole goal in web development, especially parallel development, is to understand your dependencies on other teams earlier; if you wait too late in the game you get into regressions and deadlines slip, and that's not a good thing. Finally, the more automated you are, the better things are going to go. If people are in there touching things by hand, you're guaranteed to have failures.

Architecture. For the techies out there, we'll get deep into the architecture we use in the CDC. The whole goal was scalability, just like we need to scale BestBuy.com, in all three parts: compute, storage, and network, with a really horizontal-scale approach. When we were looking at building out the CDC a little over a year ago, Ubuntu 12.04 was actually still in beta, and so was Essex. They came to me: hey, build this stuff on betas? As a product owner I was okay with it. I felt comfortable with Essex because I'd been on the mailing list, and the kind of testing we're trying to get to, the people running OpenStack were already doing: automated testing and a continuous integration world. So I felt comfortable they had really good test coverage going into that release. We went with a beta of Ubuntu 12.04 and a beta of Essex, and we've upgraded along the way to get past beta, but it's been rock solid. Even on the beta, going up to a hundred VMs within a week or so worked really well.

Again, we use Glance with the typical file-backed, copy-on-write images. We support most images of CentOS and Ubuntu; our production environments run a lot of Red Hat, primarily for the support around Java, an actually supported stack if you need to open a ticket with a vendor. We haven't had a lot of that, but we do it. And we have a standard Keystone setup.

Our scale-out storage: we'll get into this in lessons learned, but we actually did not launch originally on Ceph. We launched on a different product and pivoted off of it a few months in; we'll cover what our experience was there. What we use is a four-terabyte shared file system for the actual images. Once you pick an image, a certain flavor of an image, it goes into the base directory; those of you who run OpenStack know this. What we decided to do was push that block device out to every host and use OCFS2 on it as a shared file system. What that gives us is that the more often a given image is launched, the faster it loads, because every single host is reading it and caching it in the file buffer cache, so we get quick reads on those base images. That means really fast launch times; we're not copying big files around. The first instance that launches a given image pays the price, but after that nobody else has to. As we get into the numbers in more detail, you'll see how important that is: we launch a lot of VMs, run automated tests, and tear them back down, and if you don't have really fast launches you're copying big files around, which is not a good thing.

Then on every host server we have a one-terabyte file system on a block device that all the actual instance data is written to, so all the writes go to that one volume and all the reads come off the shared volume. With Ceph we also use the S3 gateway, so teams that have images, big files, or anything else that belongs in typical object storage can put it out there. And we use iSCSI targets for Nova volumes, but a lot of our stuff is just on the base image; teams will spin up larger base images, and most everything runs on the base image.
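(Here's a minimal sketch of that read/write split: shared, cached base images plus per-host write volumes. The paths and hashing scheme are assumptions for illustration; this mimics the behavior described above rather than reproducing Nova's actual image-cache code.)

```python
# Sketch of the launch behavior described above: the first launch of an image copies it
# into a shared (OCFS2-backed) base directory; later launches on any host read the warm,
# cached copy, and all per-instance writes land on a local volume.
import hashlib, os, shutil

SHARED_BASE = "/var/lib/nova/instances/_base"   # mounted on the shared 4 TB OCFS2 volume
LOCAL_INSTANCES = "/var/lib/nova/instances"     # per-host 1 TB volume; all writes go here

def ensure_base_image(image_path):
    """Return the shared cached copy of an image, copying it only on first use."""
    key = hashlib.sha1(image_path.encode()).hexdigest()
    cached = os.path.join(SHARED_BASE, key)
    if not os.path.exists(cached):               # only the very first launcher pays the copy
        shutil.copyfile(image_path, cached)
    return cached

def launch_instance(instance_id, image_path):
    base = ensure_base_image(image_path)
    # A real hypervisor creates a copy-on-write overlay; a plain copy stands in for it here.
    disk = os.path.join(LOCAL_INSTANCES, instance_id, "disk")
    os.makedirs(os.path.dirname(disk), exist_ok=True)
    shutil.copyfile(base, disk)                  # reads hit the shared cache, writes stay local
    return disk
```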
Scale-out network: it's kind of an interesting deal. Our network team had already put in a good core of Nexus gear, Nexus 5K, 7K, 2K, all that stuff. All we had to do (we'll get into funding) was provide, or buy, the 2K fabric extenders. We use three 1-gig interfaces per host: one for the fixed network, one for the floating network, and one for storage, so all the storage reads and writes go over a separate interface. Our sizing for the internal space is a /18 for the fixed network and a /22 for the floating network, and we have a quota around how many addresses each team can have. Right now, and this will be part of our roadmap slide, we're just running the multi-NIC FlatDHCP kind of world. We know that won't get us full scalability, so we're looking into Grizzly.
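(For a sense of what that sizing means, here is the plain subnet arithmetic on those two ranges; the prefixes shown are examples, and only the /18 and /22 sizes come from the talk.)

```python
# How many addresses the fixed /18 and floating /22 mentioned above actually provide.
import ipaddress

fixed = ipaddress.ip_network("10.0.0.0/18")      # example prefix; only the /18 size is real
floating = ipaddress.ip_network("10.1.0.0/22")   # example prefix; only the /22 size is real

print(fixed.num_addresses)      # 16384 addresses in a /18, before network/broadcast/gateway
print(floating.num_addresses)   # 1024 addresses in a /22
```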
Server hardware: rack-mounted commodity. Our first purchase was 2-socket, 24-core machines, and by the time we bought the second rack a month or two later, for less money we got even more cores: 2-socket servers with 16 cores per socket. Our latest build-out is some 1U servers just for compute; I'll have a picture here in a second. It's all 1-gig networking. The current servers in the two racks we have are modular: you can plug 2.5-inch hard drives into them and make them storage servers if you want. And then, getting into the Ceph distributed storage, some people have coined the term RAIN, a redundant array of inexpensive nodes, versus RAID. We use 10K SAS drives; for the cost, the amount of space you get, and the size of the servers we're buying, the 10K SAS drives were the right mix for us.

This is what one of our racks looks like; I don't know which one it was, it could have been the first or the second. The servers at the top with the one little red dot: that's one hard drive, just for the base OS. The four at the bottom are the storage nodes that have the sixteen SAS drives in them.

Bootstrap: Crowbar. It's a wise investment in my opinion, from what we've learned. It provides us with bare-metal install; it has an ISO image and a PXE setup, so any time we plug a new node in, it shows up, you drag and drop it, and up comes a storage node or a compute node. It also does config management for the host servers (we'll get into a bit more about that), and base monitoring comes with it, Nagios, Ganglia, that kind of thing.

Costs. Everybody's always interested in costs; ours was not really about cost, but it's good to have a cost model and work on it. Roughly $81K a rack for hardware, plus roughly $10K for labor from the network team and the rack-and-stack guys, so the total cost per rack is around $91K. That's a big difference from $20,000 per provisioned VM just to put stuff in the data center.

Config management: we license Opscode Private Chef and offer it to all the teams that use the CDC. They can have their own tenant, and we have automation around that, which gives us scale; we like the multi-tenancy. We've been running and operating on Chef for a long time; go back to that 2010 use case with our DR Lite site, which I think was on Chef 0.8, and boy, how things have changed. And then Chef Solo. We'll talk about this in lessons learned, but we don't like going in and tweaking the Crowbar stuff a whole lot. We're going to look at Crowbar again as part of our next steps on the roadmap, but for now we use Crowbar for the initial install and let it sit there and apply the policies it put in place. If we're rolling out something like monitoring, we use Chef Solo a lot: we check all our stuff in and run Chef Solo scripts. We'll get into that.

Jenkins: push-button Jenkins from Jenkins. I think we were in a presentation yesterday about this and it caught somebody's eye. The idea is that you give a team the master Jenkins; they can go in, fill out a form, push a button, and they get their own Jenkins. That's something we've moved to lately. We've tried all different models, dedicated slaves for teams, all sorts of things, but it's gotten to the point where, if you're a power-user kind of team, why not give you push-button Jenkins so you have your own?

I've used knife-openstack, and we still use it for some things; it obviously uses the fog library. But we recently developed our own tool called Ginsu. It helps us manage our Chef dependencies, it integrates with our Git repository, and the key thing is that we've extended the API functionality beyond what knife-openstack provides, things like volume support. We will open-source this; we have approval to open-source about anything Chef right now, so you'll see Ginsu out there. We have a GitHub site for our Best Buy Chef work; I don't have the link up here, but I can get it. Ginsu isn't out yet, so I didn't want to go too deep into it.
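(As an example of the kind of API coverage in question, here is what creating and attaching a volume looks like through python-novaclient of that era. The names, size, and device path are placeholders; this illustrates the API surface, it is not Ginsu itself.)

```python
# Illustrative only: create a volume and attach it to a running instance,
# the sort of call that knife-openstack did not cover at the time.
from novaclient.v1_1 import client   # Essex-era import path

nova = client.Client("user", "secret", "team-tenant",
                     "http://keystone.cdc.example:5000/v2.0")   # placeholder endpoint

server = nova.servers.find(name="team-a-feature-123-app")            # placeholder instance
volume = nova.volumes.create(size=200, display_name="team-a-data")   # 200 GB data volume

# Attach it to the instance; the guest sees it as a new block device.
nova.volumes.create_server_volume(server.id, volume.id, "/dev/vdb")
```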
That covers that. Monitoring: cloud monitoring, both for our production clouds and internally. It's a unique kind of challenge when you're looking at this scalability and elasticity: instances come up, they're gone, they move around a lot, so the traditional monitoring techniques I'm very well aware of start to fall down in this world. We use a lot of the same tools in production as well, but in our CDC we use Sensu, collectd, and Graphite, with a custom dashboard. We'll get into each of those here.

Sensu: not the prettiest baby out there for the GUI, but it's about the API, not the GUI. Self-registering, JSON configs, and an expanding community; I think it's an awesome tool. collectd we use for systems collection; it has good performance and it's mature, it's been around a while. Graphite: scalable graphing that fits our distributed computing world. You feed it data; it's graphing as a service, if you will. Orbitz released it a few years back at one of the Java conferences, and I've used it at several companies now. It's easy to input data into the Carbon backend, and it has a lot of functions.

This is a picture of one of our nodes; we'll get into some of the things we've changed or tweaked in Essex. We can take hosts out of the pool, do operations on them, test a patch, whatever, and then put them back in the pool. You can see this host going back up into the pool recently.

This is our dashboard. In reality it's one screen, but it doesn't look good on these slides, so I've broken it in two. Starting at the top left is our master Jenkins-as-a-service; the little red spikes are when something is down. We roll out packages all the time, somebody wants a new version of Groovy or something, so a lot of that is us deploying new packages for Jenkins. The next one down is whether the OpenStack dashboard is up, the next is Ceph, our storage layer, Ceph RADOS, and then the API status. This is modeled after the AWS or Rackspace status pages; in fact "status" is the URL for it. On the top right is active instances running; in this case we're right around 494. It goes up and down, it can move by a hundred or a hundred and fifty during a day, depending on who's doing what. Total instances created over the last year: 14,700. It was funny: when we first got the dashboard built we were almost to ten thousand, and we said, hey, we should have a party at ten thousand. By the time we got around to talking about it we were well past ten thousand, so okay, pick another number, too late.
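(That dashboard is fed from Graphite, and "easy to input data" really is the point: Carbon's plaintext protocol is just metric path, value, and timestamp over a socket. A minimal sketch; the host name and metric path are placeholders.)

```python
# Push a single data point to Graphite's Carbon backend using its plaintext protocol.
import socket, time

CARBON_HOST, CARBON_PORT = "graphite.cdc.example", 2003   # placeholder host, default port

def send_metric(path, value, timestamp=None):
    line = "%s %s %d\n" % (path, value, timestamp or int(time.time()))
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=2)
    try:
        sock.sendall(line.encode())
    finally:
        sock.close()

# For example, the "active instances" number shown on the dashboard:
send_metric("cdc.nova.instances.active", 494)
```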
Roadmap. This is the roadmap for the CDC; I'm not speaking for BestBuy.com overall, just the CDC. I'm going to be looking at hardware options. I was pleasantly surprised to see that Facebook has a guy here talking about Open Compute; he's back in the corner somewhere, and it was actually quite cool. We've also talked to HP about their recent Moonshot; there are other vendors, I just mention a couple. We're really looking at how we can pack more gear into less space with less power, which is what everybody's trying to do. Bootstrap: re-look at Crowbar, or look at what was in Matt Ray's presentation yesterday. I don't know if Matt's here, but he called out a few things there; "Opscode for OpenStack" was his presentation. Expansion of our current cloud: we talked about compute, and we have additional compute we're building out. And then we're going to be adding a second instance of the CDC. We don't know what we'll call it yet; we have to find some new and tricky name for it, because that third subdomain is "cdc" right now, so we need a new name.

Then we're looking at 10-gig networking. We talked about using three 1-gig connections today to split the traffic; we really want to get to commodity 10-gig as we move forward. OpenStack upgrade: we're probably going to skip the Folsom release. Honestly, I haven't even had time to look at it; what we're on has been stable, and we'll look ahead to Grizzly. There's a lot of good stuff there, part of it being Cinder and the RBD-backed instances and volumes. We're also looking at a storage software upgrade; we've had that on the docket for a couple of iterations and kind of paused on it. We're going to look at Bobtail, and I know somebody's presenting about that while we're here. And Quantum, Open vSwitch, SDN: trying to get to software-defined networking and away from the flat DHCP networking. We know that once you hit roughly 1,500 machines in a VLAN you'll run into trouble; that's too much broadcast traffic.

What it does for us: unfortunately, if you were at the keynote this is a rehash, but for those of you who weren't, we're really trying to get to a developer-driven culture. More self-service, fewer of the teams you'd go open a ticket with, and really removing the blame game, so teams aren't saying, "I can't make my date because the environments weren't good enough." Get to the point where it's a self-service world and the teams are taking the initiative themselves. Parallel development, which we've talked about a couple of times, teams free to innovate, and reducing the cycle time: that's what it's all about.

Okay, lessons learned; we've touched on these a few times. Again, we don't mess with Crowbar a whole lot: once you initially run the packages and the server comes up, you drag it into a category and launch it. We're not going into the barclamps and the recipes and using that Chef to manage our deployments going forward, though we'll be looking at that. Storage: we first launched on a file-based cloud storage product. I think Red Hat has since bought the company; it's Gluster. We learned a few lessons there. I think Gluster is a good product, and it has some unique features that we liked, but the one thing we didn't like, and you don't see it in testing, is what happens with really large images. Say a team runs a NoSQL cluster with 200-gig images, and you have a storage server problem of some kind, so you take the server out of the pool and put it back in. While it's out of the pool, everything's fine; it handles an outage fine. It's when you try to re-sync that all the file systems go read-only, and the big ones take a long time, we're talking four hours. So we really didn't like the recovery behavior. Someone could come up to me and say, "Well, you did it wrong," and that's all great, I would love to hear that, but I'll tell you what, we just did not have the time. We had to pivot off of it and move to something a little more self-healing, where experiencing a failure and getting back on your feet didn't take teams down.
So again, most everybody was fine; it was just the one team with the 200-gig root images that was affected for a while. That was a lesson learned. It took us a while to pivot and implement Ceph, one host at a time.

Then tools. We talked about knife-openstack; I think knife-openstack is fine and people are using it, it just doesn't have the full support yet. I think Matt even talked about that yesterday; it's an evolutionary thing. That's why we've moved to our Ginsu setup, where we can easily add support for new APIs, like volumes. The team started moving that way, and we'll see what happens with the community.

OpenStack has been pretty darn stable; we talked about that in the keynote. Essex has been really solid, everything we expected it to be. The other one is kind of funny: if you're having trouble and everyone in the whole department logs in to the dashboard, you'll crash the API. You DDoS it yourself; we kind of learned that, kind of funny. And then kernel updates: lots of little tweaks here and there. We'd be on a certain version of Ubuntu with a certain kernel and something would hit us; maybe we'd be uploading images, or the resize command would hit some weird bug. For the most part, though, everything we ever looked at already had a community patch by the time we found it, and it was already out in an Ubuntu update, so no big deal there, just time.

And our changes to Essex: we put in a filter for the scheduler that allows hosts to be taken out of the pool while still letting you launch on that host; you just have to have the right metadata when you launch. Dashboard enhancements: we added a bunch of links and RSS feed integration with our support blog. And some small bug fixes, like around volume create. And that's pretty much it. Hopefully everybody absorbed that; I went pretty fast, so I don't know if I'm ahead of time or behind.
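(On that scheduler change: a host-maintenance filter like the one just described is a small piece of code. Here is a minimal sketch in the style of Nova's later BaseHostFilter interface; the class name, the metadata key, and the exact hook signature are assumptions, the Essex-era interface differed slightly, and this is an illustration of the idea rather than Best Buy's actual patch.)

```python
# Sketch of a scheduler filter that skips hosts marked for maintenance unless the
# launch request explicitly opts in via a scheduler hint.
from nova.scheduler import filters   # interface shown matches post-Essex Nova

MAINTENANCE_HOSTS = {"compute-07", "compute-12"}   # in practice this would be dynamic

class MaintenanceHostFilter(filters.BaseHostFilter):
    """Pass in-service hosts, or out-of-pool hosts that the request asked for."""

    def host_passes(self, host_state, filter_properties):
        if host_state.host not in MAINTENANCE_HOSTS:
            return True                            # normal, in-pool host
        hints = filter_properties.get("scheduler_hints") or {}
        # Only land on a pulled host if the caller supplied the right metadata.
        return hints.get("allow_maintenance_host") == "true"
```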
Do you want to come up as well? Come on up and we'll do Q&A. Anybody have questions? Yes; actually, do you want to step to the mic so everybody can hear? Okay.

[Audience question, off mic, about use of the CDC beyond BestBuy.com.] Yes, do you want to cover that? So, we actually talked about this a little yesterday too. We're building what we need for BestBuy.com right now; we're both dot-com guys. We have a lot of connections with the enterprise, and Best Buy as a whole is looking at all sorts of things for cloud. Right now we can't really go into that too much, but we feel like what we've built is world-class, world-leading, and we're publicizing it to our executives and to anybody who wants to hear about it. We actually have teams from the enterprise side that use our CDC for their own development. So we're going about it as: we built it, we're using it, we're putting it out there, and if anyone wants to use it we let them. Eventually people come back to us with the question, "Hey, how'd you guys do that?", which is already starting.

Yeah, cool, back here. You want the mic? Yeah, that's probably wise; at the last meeting no one was happy because no one could hear, and there was nearly a riot about the mic. [Question:] On your total cost of ownership: you said there were 40 teams, and a bunch of them were working in the cloud, putting stuff out in, I'm assuming, AWS or Azure, and then you brought something back. How do you compare the two? The comparison I saw was $20,000 per VM, which I think was your own in-house implementation, versus what you have now, which is still a private cloud. So how did you think about full outside public cloud versus private?

I can take this one. A couple of interesting things. The $20,000 per provisioned VM is not cloud; that is data center virtualization kind of stuff. Costs for the cloud are interesting; the problem is that everyone spun up clouds. All sorts of product owners across the company spun up cloud test environments, and it's tough to know how much they were spending, or even, as Joel said, there's a bunch of machines out there and who knows what happened to them. It was floating through people's corporate cards, up through the balance sheet somewhere. This has been a way of getting away from that, and that spend is hard to track, so I don't really have a great answer for you on public cloud costs, because it was a hodgepodge of people doing it every which way. What we're trying to do is standardize here and really reduce that overall number. It was high enough that executives were getting mad about the rolled-up cost; that is true.

[Audience question, off mic, about capacity.] Yeah, so we started with one rack, the 12 servers at 24 cores, and then we added a second rack of 12 servers at 32 cores, so we're somewhere in the 700 to 800 core range. We think we can sustain in the low thousands of VMs; we were up there in the mid hundreds, and it'll burst higher than what you saw. Does that answer what you were looking for? Oh, and sorry, the follow-up: we have about 40 terabytes of online storage, counting the redundancy, across block storage and object storage.

[Question:] You talk about developer-driven teams, and one of the biggest impediments to private cloud deployment is often the disconnect between developers and IT. IT has a certain set of functions and parameters it needs to operate within, and developers are looking for a whole different set. To what extent, and how, did you make sure you brought developers into this, so that what you were building was actually something they really wanted to use, but would also meet your needs?

I can talk to that; it's interesting, and I'll let you follow up as well. I have a couple of teams. One is the automation tools team; that picture didn't make it into the final keynote, but some of the guys are here in the room. That team actually engineered and deployed the OpenStack, they deployed Jenkins, and they deploy a lot of different tools for different teams. Then I also have a product owner for cloud browse, which is our render tier for BestBuy.com, so I have teams that use those services. It's kind of interesting being the product owner: you've got one team using it and another team deploying it. So again, we are our biggest customer. We've also pulled in a lot of other teams across Best Buy; we have our EIP and our security team using it.
We have enterprise teams using it as well. So we've driven a lot of the features out of our own ecosystem, within our product development org, and then, as Joel said, we're driving it into the rest of the company nicely. Yeah, we had about five or six teams that were basically the early adopters of the cloud, and I remember some of those meetings: they pushed back hard on these guys when the cloud wasn't performing for them. From their point of view as developers: hey, you just gave me this cloud, you took away my external cloud environments, but your internal cloud is not scaling, or it's going down, or whatnot. So there were a couple of months there with a lot of back-and-forth; I actually wouldn't say contentious, but a lot of back-and-forth between the teams about "this is what we need and we're not getting it," and the automation team, the team that built the cloud, going back and fixing all those little pieces as it went along. So it's not super clean, but if you have teams that are willing to work together, it actually works out great.

Right here, yeah. I'll repeat the question: he wanted me to clarify on the storage. We use Ceph for the distributed block storage, and object storage as well, and then we use LVM iSCSI targets. Here's what it is: if you're launching a VM from a base image, no matter how big it is, 200 gig, 300 gig, like those teams I talked about, it's going to be launched on, and written to, copy-on-write, that one-terabyte volume, the block device that's on each host. Every host has a one-terabyte volume, and basically all the writes go there. We also have a Ceph block device, but with OCFS2 on it so it's a shared block device, almost like a shared SAN where multiple nodes can read or write and there's locking involved. It works out really well, because all of these nodes have the base image directory mounted from that OCFS2 volume, all the reads happen right there, and they never impact each other. Writes to it only happen when a new image and flavor combination shows up somewhere and gets copied out there. Does that make sense? And then lastly, the Nova volumes, where you go in and want to attach a volume to do a backup or something: that's not using the Ceph stuff, that's just a couple of nodes with LVM, like a redundancy thing. Anyway, cool.

Someone else, right here. [Audience question, off mic, about how OpenStack was chosen.] We did look around; we're going back about two years now, to mid-2011. If you go back to the mailing list you'll see I was out there asking people about Diablo, and I ran a small proof of concept on Diablo myself, as an architect, just looking at how it worked.
I mean, there were other clouds; trying to remember back then, I guess there was vCloud or some stuff from VMware. I would say the key thing is that, at the time we chose it, OpenStack was the best option that was API-driven, the closest thing to a public cloud API at that time, and we didn't want to go to something that was halfway there, like a vCloud or something like that. We wanted something API-driven that would give us a RESTful interface and could scale; we figured we could take it from there as an engineering team. And no, we didn't do an RFP, no big formal RFP; we just went out, looked at the products, and picked one, because we did it ourselves.

[Question:] So this being a dev/test environment use case, besides price, are you looking at any metrics that show this is a success, for example reduced tickets, more features per development cycle, faster development cycles? Yeah, those last ones: actually launching your stuff on time, and then getting more releases and a faster release cycle. That's what it's about, but I'll let Joel answer. Yeah, so a couple of things. We're not really formally tracking metrics, but we have gone from a monthly release cycle to a two-week release cycle, and a lot of that is because all these teams can go off, create their own test environments, and test their code. There are lots of other factors going into these things, as you probably know, but we think that's a big win in itself. We also haven't really advertised that this cloud exists; it's been working on word of mouth, and we went from zero tenants to probably 30 to 40 within just a couple of months. Teams were hearing about this and just coming to us. So no formal metrics, though.

Over here; I think he's got the mic. Okay, actually, yeah, let him, we'll be nice. [Question:] What do you do for lifecycle management? We saw the picture of the zombies; do you have a comment on that? Yeah, we just started a spring-cleaning effort; I'm looking over here at Shane, he's one of the lead guys. For most of our stuff we're trying to get the teams to use Jenkins, to move it up a level: fire up your stuff, fire up your Selenium tests, and tear it down. That's how we ended up with that 14,000 number in one year; obviously a good chunk of the community is using that rapid launch-it, run-your-tests, tear-it-down model. We probably have about a 20% range of teams that aren't able to do that or have stuff left around. One thing we do is require every team that goes into the cloud to give us an email address, a Skype ID, that kind of thing, so if they have a few instances that just sit there for a while, we'll go through and send an email: hey, what's going on here? But some teams are in that 20% that run for a long time. It's not easy for them to build and tear down, because they have things like NoSQL clusters, they're using those two- or three-hundred-gig volumes, and loading their data is not easy. Are you really going to tear those kinds of platforms down and rebuild them from scratch all the time? So those teams tend to run longer. Thanks.
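(A sweep like that spring-cleaning effort is straightforward to script against the Nova API. A minimal sketch; the age threshold, credentials, and notification step are assumptions, not the actual Best Buy tooling.)

```python
# Flag long-running instances so their owners can be emailed about them.
from datetime import datetime, timedelta
from novaclient.v1_1 import client   # Essex-era import path; placeholder credentials below

nova = client.Client("ops", "secret", "admin-tenant",
                     "http://keystone.cdc.example:5000/v2.0")

MAX_AGE = timedelta(days=30)   # arbitrary threshold for a zombie candidate

def stale_instances():
    now = datetime.utcnow()
    for server in nova.servers.list(search_opts={"all_tenants": 1}):
        created = datetime.strptime(server.created, "%Y-%m-%dT%H:%M:%SZ")
        if now - created > MAX_AGE:
            yield server

for server in stale_instances():
    # A real sweep would look up the tenant's registered email or Skype ID and notify them.
    print("stale:", server.name, server.created, server.tenant_id)
```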
Okay, actually, let's go back here; I can repeat the question. Yeah, I can talk about that. [Audience question, off mic, about how environments and releases flow through the pipeline.] Let me think; that's probably the best segue into it. We're using this for automated testing, a continuous-integration kind of world. Even for something like cloud browse, we have tools from Akamai that help us simulate the Edge Side Includes, the fragments for the page, so we're testing, prototyping, and integrating there. Then we go out to a staging, pre-production environment, and then on toward production. The key thing, as in Jez Humble's book, is that we're using the same WAR, the same artifact: it starts very early on and then goes all the way up through production. That's a recent change for our legacy platform, which, as Joel said, we've gotten up to two releases a month from one, but on our cloud platform we've done it from the beginning. You start with the same archive; we use Artifactory as the artifact repository, and that same artifact gets pushed through the different environments and eventually makes it to production. The other nice thing is the global traffic manager we showed in the slide. It actually helps us release our new platform in production, because we can turn off a cloud instance and take all traffic off it in about ten minutes, deploy everything, run through our testing and final regression testing, and then turn traffic back on. If something goes wrong, you turn traffic back off, re-mix traffic, and get a patch out there really fast. Is that what you were getting at? Anything I missed? Okay.

Yeah. So the question was, what are we doing for networking? Sorry, I forgot to repeat it. Today, as I said on the networking slide, it's flat DHCP networking, so it's using ebtables and iptables and the flat networking, the underlying network that the hypervisor, KVM in our case, provides. In the future, as part of our roadmap, we're going to be looking at 10-gig on the physical network side, commodity 10-gig, and getting to some sort of overlay, some sort of vSwitch. We know we can go pretty far with what we're doing, but you get up to the 1,500 to 2,000 VM range and you'll have too much broadcast traffic when nodes enter and leave.

Yes? So, are you asking about infrastructure support, is that the layer you mean, or both? Yeah, so we have an infrastructure support vendor that we use for infrastructure monitoring of our cloud environments. They're not fully integrated into the private cloud yet; they're in the public cloud. We'll be looking at putting their service in; they run a small agent, and then they have monitoring automation and ticketing and things like that. We're not doing that now, but it's part of the plan. And, you know, I don't know that... we were in a session yesterday where I kind of asked that question: well, whose distribution did you use? I don't know if you know about Crowbar, but it pulls things down, you point it to a branch or a trunk, and then it builds that up.
So I don't know that we're looking for any particular vendor stack, as long as it works and we know the status of it and can patch it. [Question:] Do you actually see your customers asking you for an SDN solution with network visibility, or are you doing it just as part of the scale? Okay, so you're getting at what's beyond what they can do themselves. When a team launches, they've got the private networking, the fixed addresses, and then they get a set of floating addresses, so they can put, say, their proxy or their web host on the floating network. Beyond that we've had a bit of churn here and there: a team needs a VPN connection or something similar, they need firewall rules, something like that. We did come up with a pretty unique thing for firewall rules. If teams are talking to other parts of the company, they can go in and check in to Jenkins the ports and ranges they need connectivity to. Then we have a process of taking that and requesting the changes, and once those changes are in place the job goes green; it actually tests the ports. So that's going to help. Does that get at it a little bit? You're asking why we're looking at SDN, what the driving factor is? Oh, scale, primarily scale. You cannot just keep pushing broadcasts up to 1,500, 2,000, whatever; we don't know that there's a magic number, but you get up into the thousands, with VMs launching all the time like we do, and we're going to have trouble, and we are going to need to scale. What else? Okay, awesome. Thank you. Thank you.
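(On that port-check job: verifying that a requested firewall change is actually in place is just a connectivity test per port, exactly the sort of thing a Jenkins job can run and go green or red on. A minimal sketch; the hosts and ports are placeholders, not the checked-in files the talk describes.)

```python
# Test that requested ports are reachable; exit non-zero so a Jenkins job stays red
# until the firewall change is actually in place.
import socket, sys

# In the process described above, these would come from the ports/ranges a team checked in.
REQUESTED = [("erp.internal.example", 8443), ("mq.internal.example", 5672)]

def port_open(host, port, timeout=3):
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

failures = [(h, p) for h, p in REQUESTED if not port_open(h, p)]
for host, port in failures:
    print("NOT reachable: %s:%d" % (host, port))
sys.exit(1 if failures else 0)
```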