All right, let's get started. Welcome, everyone. Thanks for coming out, and thanks for sticking around. I think we've got a pretty cool story to tell you here today. We're going to give you a rundown of how we built out our platform team, give you a perspective from that side, and cover what we did to start onboarding all the teams to ultimately get ready for a pretty cool experience: our first holiday season on Pivotal Cloud Foundry.

My name is Nick Boeimer. I'm the engineering manager for the platform team. I've been with the Dick's organization for almost five years now. Before that I spent about nine years doing information security and compliance work, so it was a pretty interesting jump going from that world to this one. I've been learning a lot, and I've been in this role for about a year now.

Hi everyone, my name is Samir Hashmi. I'm a senior platform architect at Pivotal. I come from an app-dev background; as a Java developer I've been developing applications from JSPs and servlets to microservices. I've been with Pivotal for the last year, working with large enterprises like Dick's Sporting Goods, and my main role has been to enable them and onboard them to Pivotal Cloud Foundry. So far it has been an amazing journey. The transformation we have done in the last 11 months is incredible to see, and that is what we'll be walking you through today.

Quickly going through the agenda: we have divided our talk into two key sections. In the first section we talk about the struggles and challenges that were faced within the organization and how those things changed after adopting Cloud Foundry. In the second part we'll talk about the holiday season: first, how we prepared for it and how the platform operated at scale, and second, the key incidents where we had a few mini heart attacks and tackled them without impacting any business applications. At the end we'll have a quick retro and next steps, cover the key lessons we learned, and take questions.

So, jumping into the transformation and onboarding story. To paint a picture of how things were before we began this transformation: the old way of working really resulted in siloed groups. There was not a whole lot of collaboration across those groups, and it was really a lot of throwing things over the wall, not really owning what you were creating. Coming from a security background, I was pretty familiar with the security patch process and what it took to actually accomplish those tasks through the month-to-month activities we had to complete. We had to coordinate with the network engineering team, the Windows engineering team, and the Unix engineering team; we had to find the downstream applications that were affected by these patches; and then the security team would be involved. All of this was bringing a lot of people to the table on a month-to-month basis and burning a lot of time and energy trying to accomplish something that should be relatively straightforward.

Our change management cycles were pretty long as well. We had a pretty rigorous change advisory board that would meet on a weekly basis and approve or reject changes.
There was all sorts of paperwork associated with that, and it was really a hindrance to product teams moving forward with solutions that would ultimately provide value back to our customers.

Infrastructure was also sized beyond what we really needed on a day-to-day basis, except for maybe five days out of the year. That infrastructure was running hot throughout the year, so we were burning a lot of money, a lot of maintenance, a lot of time from teams managing that equipment, and that cost kept building. For five days out of the year we had exactly what we needed, but for the other 360 days we had three, four, five times as many resources as we actually needed. It was a huge waste of money.

We also had a buy culture. Living in the siloed environments we had throughout DSG, you'd find teams looking for solutions to problems by reaching out to vendors themselves, potentially buying solutions to problems that could have been addressed once across the organization, solving them for this team and that team at the same time. Instead you ended up with one solution for this team and a different solution for that team, two vendors potentially competing with each other to provide the same service. A good example of this was our address verification system, one of the components recently transitioned over to the PCF environment. I think there were nine different address verification systems making sure we could get product from point A to point B with a correct address: nine different systems doing the same thing. We had a pretty cool opportunity with our shipping team to build an on-prem solution on PCF that solved that problem and did away with those unnecessary systems.

Fast forward to where we are today: it's a much better environment, at least over the last 11 months that we've been doing this. We really wanted to focus on a couple of things. We wanted to refocus our efforts on our customers and ensure they were receiving value out of what we were delivering. We wanted to leverage the technology we were creating to provide services customers actually wanted, and in a timely manner, not something where, three or four weeks after a request comes in, something is delivered and they have to go check whether it actually achieves what they were looking to accomplish. We wanted to drive those efficiencies down into our back-end processes as well, so that what we were doing day to day in the office wasn't wasting time in getting that information and those solutions out to the customer.

Paul Gaffney joined the organization about a year ago and brought with him a pretty strong vision to change the way we do things. I give him credit, but I think the real credit goes to the teams that were able to adopt what he was presenting. We had, like I said, the old way of working in that siloed organization, and a bunch of teams rose to the occasion and really adopted what he was after. Through that, we were able to implement some pretty cool things.
We were able to isolate components of our e-commerce environment in time for our holiday season, which we'll get to in a little bit, and if you caught the Wall Street Journal piece, he was pretty excited to share that we're going to continue to move some of those things into our PCF environment. I think we have the right team to do that.

He also brought the Cloud Foundry idea to Dick's Sporting Goods. It was just an option, a way of working. He had one team do a proof of value to test whether this was actually going to work, and they were able to deliver a result in six weeks, I think, from the time of conception.

And to your point, this proof-of-value project was crucial at that point in time, because that was the point where the developers started to see the real value of Cloud Foundry: how they could push their applications without even thinking about load balancers or SSL or DNS routes, and how they had a marketplace where they could self-provision the services they wanted to use, whether RabbitMQ or MySQL. That was one of the critical points where they saw the value.
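To make that concrete, here's a minimal sketch of the kind of self-service flow being described, using standard cf CLI commands. The service names, plans, and app name are hypothetical placeholders, not the actual Dick's Sporting Goods setup.

```bash
# Hypothetical developer self-service flow on PCF (assumes the cf CLI is
# installed and you are logged in and targeting your org/space).

# Browse the services the platform team has published to the marketplace.
cf marketplace

# Self-provision backing services; the service/plan names here are made up
# and will differ depending on which tiles the platform team has installed.
cf create-service p-mysql db-small orders-db
cf create-service p-rabbitmq standard orders-queue

# Push the app; routing, load balancing, and TLS termination are handled by
# the platform, so the developer configures none of that.
cf push orders-service -p target/orders-service.jar

# Bind the services and restage so credentials are injected via VCAP_SERVICES.
cf bind-service orders-service orders-db
cf bind-service orders-service orders-queue
cf restage orders-service
```

The point of the flow is that everything above is developer self-service; nothing requires a ticket to another team.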
One of the other things we focused on from day one was how to measure success. For us, success meant understanding where the teams stood right now, getting those numbers as baselines, doing a deep value stream analysis on them, and then, once Pivotal Cloud Foundry had been implemented, seeing what success looked like afterwards. How long does it take to apply a security patch? What is the mean time to repair? There's a slide later in the deck where Nick will walk through those quick wins.

One of the other focus points was to eliminate all the manual steps. As Nick said earlier, there were manual steps and a lot of handoffs between developers and operators whenever they had to release any features. The idea was to automate everything we could, not only for the developers so they could release their features, but also for the platform team so they could evolve the platform as a product. And for that we needed a platform team, so that was the first thing created within Dick's Sporting Goods, in early May, and some product marketing was done as well. So Nick, can you tell us how you came up with the name?

Yeah, so we're a sporting goods company, and we root a lot of our analogies in the sporting environment; the name comes from the root word for stadium. In the same way that baseball players all come to the stadium to play their game, we wanted to create an environment that development teams could come to and play theirs. It really gave us a central place to come together, and they can do some pretty cool things there. And true to that definition, the platform became the one place that could take on all the workloads running within the organization.

One of the other things done amazingly well by the leadership team was that they handpicked one expert from each domain, for example one from DevOps, one from networking, one from security, and combined all seven of those personas into one single team, so that if there was an issue on the security side, that one person could resolve it instead of reaching out across the organization and delaying the fix. As of today, within 11 months, that one platform team of seven people is running seven foundations on different IaaSes, and those seven foundations are serving hundreds of developers across 40 product teams, with the idea of growing beyond that. Keep in mind that at that point in time these seven platform engineers were new to Cloud Foundry, and they not only had to implement an enterprise-grade solution within the organization but also onboard those 40 product teams. I'd like to say that was a piece of cake; it was anything but easy.

We recognized pretty early on the importance of automation. We made the transition from siloed teams to balanced product teams, and that gave us the route for how this whole environment was going to be built out. From a platform perspective, it was extremely important to automate the onboarding process so that those 40 or 45 teams being spun up so quickly had the same experience and the same opportunity to roll out their new solutions. We spent a lot of time doing that. We wanted to make sure that across those seven environments the teams were getting the same access everywhere, so they could make the decision to deploy to this IaaS or that IaaS. We use tools like Vault and Concourse, and those are all built into the same automation, so we literally kick off a script, provide an Active Directory group, and that provisions the entire base of the foundation and everything the teams are going to need to be onboarded to our foundations.
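We can't show the actual onboarding script, but a minimal sketch of the idea might look like the following. Everything here is illustrative: the quota values and naming convention are invented, and the Active Directory piece is elided because role assignment from AD groups would typically happen through UAA/LDAP group mappings rather than the cf CLI.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a team-onboarding script: given a team name and an
# AD group, provision the org, spaces, and quota a new team needs. Assumes
# the cf CLI is logged in as an admin; in a setup like the one described,
# this would run per foundation from Concourse with credentials from Vault.
set -euo pipefail

TEAM_NAME="$1"   # e.g. "shipping"
AD_GROUP="$2"    # e.g. "DSG-PCF-Shipping" (mapped to roles via UAA/LDAP)

# Create a per-team quota; the numbers are placeholders.
cf create-quota "${TEAM_NAME}-quota" -m 100G -i 4G -r 100 -s 50 \
  --allow-paid-service-plans

# Create the org, attach the quota, and lay down a standard set of spaces.
cf create-org "${TEAM_NAME}"
cf set-quota "${TEAM_NAME}" "${TEAM_NAME}-quota"
for space in dev test prod; do
  cf create-space "${space}" -o "${TEAM_NAME}"
done

echo "Org ${TEAM_NAME} provisioned; map AD group ${AD_GROUP} to roles in UAA."
```

Running the same script against every foundation is what keeps the access and experience identical across all seven environments.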
Early on we conducted a lot of one-on-one reviews. It was new for us, so it gave us a chance to reiterate and relearn what we were supposed to already know at that point, month one, but it also gave us a chance to share that knowledge with other teams. As the teams began to grow, we went from 10 to 15 to 30 teams and ultimately to 40, and we started seeing a shift in the collaboration between the teams: they were helping each other, and there was a community through Confluence where people were producing all sorts of documentation for other teams to use. It was really making our life easier from this perspective, so we could focus on building the next great thing from the platform side.

While Nick was conducting these one-on-ones with specific product teams, we thought it was very important to conduct trainings and workshops for the entire developer base, whether on the Java side or the .NET side. The idea behind these trainings and workshops was to make sure that DSG developers not only knew how to write a microservice and how to use cloud-native design patterns like circuit breakers, config servers, and service discovery, but also understood the real background: what happens when a cf push happens, what the staging process is, and how the droplet is created; if they want to ship their application logs out to a log aggregator system, how to use Loggregator for that; and what the four levels of high availability are. The concepts of application instances and service instances were totally new to them, and the concept of autoscaling was also new, so we taught them how to use the autoscaler and how to configure it right, which was a very important factor going into the holiday season.
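As one example from those workshops, getting application logs out to an external aggregator on Cloud Foundry is just a user-provided syslog drain; Loggregator forwards everything the app writes to stdout and stderr. The endpoint and app name below are made-up placeholders.

```bash
# Tail recent logs straight from Loggregator (handy for demos and debugging).
cf logs orders-service --recent

# To forward logs to an external aggregator, create a user-provided service
# with a syslog drain URL (host/port here are placeholders) and bind it.
cf create-user-provided-service orders-logs \
  -l syslog-tls://logs.example.internal:6514
cf bind-service orders-service orders-logs
cf restage orders-service   # the drain takes effect on restage/restart
```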
On this slide there's a chart showing the growth in the number of application instances at Dick's Sporting Goods. May is when the platform was set up. The first bump came from June to July, and that was when we were conducting all those workshops and one-on-ones. The next bump was from October to November, when we asked the application teams to do their load testing and scale out their applications for the holiday season. We don't get a lot of credit for the steps after November, because there was actually a scale-down event after the holiday: we were able to leverage a lot of that autoscaling and lower the number of instances and the infrastructure needed to support the environment, since we were no longer running at holiday peak scale. So you see slight steps there, and there was probably even greater growth between November and December. But the even better story in all of this is that, prior to having this in our environment, there was a freeze period toward the end of the year. We could not touch systems, and the idea of deploying new things to the foundation, or to any environment, was almost unheard of. So seeing that continued step growth, and the larger growth that happened between November and December, is a pretty cool story to tell about the changes that have happened since adopting Cloud Foundry.

This next slide might look a little familiar; Jason presented something similar in a previous talk. We're seeing a lot of the same productivity gains in the teams, and even as we onboard more teams, we're seeing increases in productivity: a dev-to-ops ratio of 120 to 7, and next to no time spent doing security patches. Me being the security guy, I'm going to make sure I point that out.

This is a small snapshot of what releases looked like the old way. It involved getting a lot of people into a room, sometimes on nights or weekends, and it was probably not the best work environment to enjoy. Things have certainly changed after implementing Cloud Foundry: now the developers deploy their features without any downtime, and the platform team pushes all the upgrades and patches, again with canary-style deployments using BOSH. We took a picture of that as well, and now they do deployments like this. This was a real event: our very own Nick, the platform owner, pushed a major upgrade of PCF from 2.1 to 2.2, while Rick, an engineering manager for multiple product teams with multiple workloads running on the foundation, didn't get a single alert or notification that a patch or upgrade was going on. That was another benefit: you get your weekends back with this.
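The canary behavior mentioned here is built into BOSH's update flow: each job is rolled one canary instance at a time before the rest, so a bad upgrade halts early instead of taking out the fleet. A hedged sketch with the BOSH v2 CLI follows; the environment and deployment names are placeholders, and on PCF this is typically driven through Ops Manager rather than run by hand.

```bash
# Watch instance health before the upgrade (placeholder env/deployment names).
bosh -e prod -d cf instances --ps

# Deploy the new manifest. BOSH honors the manifest's `update` block, e.g.:
#   update:
#     canaries: 1          # upgrade one instance per job first
#     max_in_flight: 1     # then roll the remaining instances one at a time
#     canary_watch_time: 30000-90000
# so traffic keeps flowing while instances are replaced one by one.
bosh -e prod -d cf deploy cf.yml

# Confirm everything came back healthy.
bosh -e prod -d cf instances --ps
```

Because only one instance of a job is down at any moment, app teams see no downtime and need no notification, which is exactly what happened during the 2.1 to 2.2 upgrade.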
Then there's our value summary. When we looked at adopting this technology, we looked at what it was going to do for developer productivity. We wanted to make sure that what we were putting out there into DSG for these teams to use, they were going to be able to leverage and get value back from what they were working on. We spent some time looking at operational efficiency, the number of operators it would take to support this, and I think we see a pretty immediate return there, knowing that we've got a platform team of seven operators versus what we would traditionally need for a normal environment. And on infrastructure and cost avoidance, there's a huge opportunity for us as we continue to offload some of the legacy, monolithic applications that exist in our current environment and break down some of the reliance we have on third-party vendors; we're going to start seeing costs shift down there as well.

And that brings us to the holiday. We did all this work in about five months; from the time we started to the time the holiday season hit was only a five-month period. Our holiday period is probably one of the most critical times of the year. We do a significant portion of our year's business during it, so any hiccups, any problems, result in some fun conversations that we would like to avoid having, forever.

So what did we do to prepare? This was new for us; this was the first time we were going through the holiday season as a platform team. We focused first on the IaaS infrastructure, the things that lived outside of PCF but were critical to our environment, mainly the load balancers. We met with teams and made sure those were configured correctly. Then we moved into scaling the PCF components: the Diego cells, the log aggregators, the Gorouters. I'll get into how we failed miserably there later; we learned a lot in that process. Then we reviewed the Gorouter configuration, making sure they were sized appropriately and could handle what we thought might be reasonable traffic through the holiday. PCF Healthwatch alerts were also configured; we wanted to make sure those were tuned appropriately. We integrate all of our monitoring through Microsoft Teams, and we wanted to get rid of any alerts that might be noise to us; we wanted whatever came to us to be actionable, so we weren't wasting time looking at things that didn't make sense. And then we expanded quotas. I think a lot of teams were unsure of what to expect on PCF going through the holiday. They knew what it was like in the old environment, but we were taking wild guesses: yeah, sure, I need 500 gigabytes of quota, or I need this, or I need that. We were taking guesses, and we guessed all right.
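Part of why guessing "all right" was good enough is that expanding a quota when a team's holiday estimate runs hot is a one-liner. A hedged example with placeholder names and sizes:

```bash
# Inspect the quota currently assigned to a team's org (names are placeholders).
cf org shipping          # shows the org's assigned quota, among other details
cf quota shipping-quota  # shows the quota's current limits

# Raise the limits in place; every org bound to this quota picks up the new
# headroom immediately, with no app restarts required.
cf update-quota shipping-quota -m 500G -i 8G -r 200 -s 100
```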
While all that was going on at Dick's Sporting Goods, on the Pivotal side, as a balanced account team, we reached out to multiple product teams within Pivotal. The idea was to gather the best practices, the right configuration, and the key components to monitor when we hit that high peak and all those requests come in. We talked to the Diego, Loggregator, PCF Metrics, Healthwatch, and Gorouter teams, and even the Cloud Controller team, just to make sure we were following all the best practices. The output of that was a ten-page document with more than 30 contributors from the Pivotal side, which was shared with Dick's Sporting Goods to confirm: these are the right things, these are the right metrics, these are the right monitors to watch.

The other thing we did was create a checklist for the development teams, and we circulated it to each dev team. A few of the items on that checklist: the first was how to manage scaling. The very first thing we said was to manually scale your applications right before the peak and then let the autoscaler kick in, so that we wouldn't hit any performance issues, even if only for four milliseconds. Another item was to make sure they implemented circuit breakers correctly: that they had their fallback mechanisms, and that if something failed they showed a proper error message. Another was to load test and performance test their applications very well, because that turned out to be a critical point when we got to the Cyber 5 days.
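The "manually scale before the peak" item from that checklist boils down to a couple of cf commands run ahead of the traffic spike. A sketch with a hypothetical app name and instance counts:

```bash
# Ahead of the peak, pin the app well above its steady-state instance count
# so the first wave of traffic never waits on new containers starting up.
cf scale search-app -i 12

# Memory and disk changes (unlike -i alone) trigger a restart, so if they
# are needed, do them well before the event; -f skips the restart prompt.
cf scale search-app -m 2G -k 1G -f

# Verify the running instances before the event.
cf app search-app
```

The autoscaler then only has to handle the delta above that pre-scaled floor, within whatever limits the team configured.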
The last point to mention here: it was great to see the Cloud Foundry community coming together when the Dick's Sporting Goods platform team and the T-Mobile platform team exchanged notes with each other. T-Mobile is an experienced Cloud Foundry user; they have been using it for their iPhone release launches, and we wanted to pick their brains about the key things they saw when they hit that kind of spike. That was very helpful during those five days.

Talking about the workloads on Cloud Foundry: Dick's Sporting Goods operates three main e-commerce sites, dickssportinggoods.com as well as Golf Galaxy and Field & Stream. Whenever you hit an e-commerce site, there are a few key components behind it, for example search and the product display page, and those were running on PCF at that point in some capacity. Search is another success story in itself: it was a labs project developed within Pivotal Labs, originally meant to run in stores, and it was also ready to take traffic on PCF. These are just a few of the applications listed here, but there were multiple applications ready to take traffic at that point. And then it was showtime.

So what actually happened during the holiday? This one was interesting for us when we looked back at it. When we first started out, we wanted to make sure that going into the holiday we would come out of it with information about everything that went on, so the team took to documenting everything: all the support calls, all the tickets that were opened, anything changed on the fly, so that when we go back next year and try to figure out what needs to be sized differently, we have a good baseline for what a holiday season looks like.

So, Thanksgiving Day and Cyber 5. Let me explain Cyber 5 real quick: it's the period of time between Thanksgiving and Cyber Monday. As I mentioned earlier, we do a significant portion of our business through those five days. Thanksgiving Day was really a warm-up to the main event. We had an on-call rotation; it wasn't ideal, it was kind of an older way of doing things, and we're focused on improving that for future Cyber 5s, but we had seven people around the clock looking at these things, and it turned out to be a great learning experience. As you know, we had only been on the platform for five months at this point, so anything we could get out of it, any feedback from the system, from the environment, from the logs, about issues we were running into, was critically important to understanding how this whole thing was going to work. We did identify a few misconfigurations early on, and we were happy to figure those out on Thanksgiving as opposed to Black Friday or Cyber Monday. A lot of teams had autoscaling enabled and bound to their applications, but they didn't have rules configured, so we saw pretty significant spikes in CPU utilization. We were able to catch those early, and it was a good learning opportunity for the development teams too, so they're aware of it for next year.
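That "autoscaler bound but no rules" misconfiguration is checkable and fixable from the CLI if the PCF App Autoscaler CLI plugin is installed. Treat this as a sketch: the exact command names vary by plugin version, the app name and thresholds are placeholders, and the same rules can also be set through Apps Manager.

```bash
# Requires the PCF App Autoscaler CLI plugin; command names vary by version.

# List autoscaling state and any configured rules for the app.
cf autoscaling-apps
cf autoscaling-rules search-app

# An enabled autoscaler with instance limits but *no* rules never scales,
# which is how apps can end up pinned at minimum instances under load.
cf update-autoscaling-limits search-app 4 20    # min 4, max 20 instances
cf create-autoscaling-rule search-app cpu 20 70 # scale out above 70% CPU,
                                                # scale in below 20%
cf enable-autoscaling search-app
```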
We also did some scaling of components that, as I'll show you on the next slide, we should have scaled further. On Black Friday we had our first support call, and we realized that some of the Healthwatch reporting was unreliable. At roughly 11:30 we had a gap in any logs coming over, and we thought we had lost the foundation. Being new to it, we didn't realize that it was the logging components that had fallen over, and Pivotal support was awesome about helping us through it. We scaled a couple of things we had missed previously and were able to get back on track. What's pretty cool about this piece is that we were able to make all those changes in the middle of the holiday, on one of the busiest days of the year, while the site was taking production traffic. I didn't work in e-commerce before, but I feel like that's an incredible feat to accomplish during one of the most important times of your year.

We ran into some issues with PCF Metrics as well. Again, we worked with support on this, and I think we identified some pretty good opportunities to improve documentation and improve our runbook for how we go into the holiday. There were a couple of components in there that failed and just had to be scaled a bit, and we did some interesting things to tweak how that environment was running. We were able to get through those five days without too much of a problem.

All in all, it was a pretty successful holiday. We saw a thousand-plus orders a minute at peak, and it raised significant confidence in the platform. A couple of people came to me after the holiday and commented that they felt pretty good about how things were running and how their apps were performing, and I think that's going to help with the trend we saw earlier as app instances continue to get deployed to the foundations. The key results coming out of this: there was zero customer impact, and we did not have any downtime for the applications sitting on Cloud Foundry, which was a huge win for us in our first year on the platform. The team did an awesome job working through those five days without complaining too much. Partnering with Pivotal was really awesome; the support team, and Scott here, were a huge help. They've really helped transform how we build and run software at Dick's Sporting Goods.

Yes, so quickly going through the retro we had: no doubt it was a very successful event, with no business impact and no application impact, but there were a lot of key learnings out of it. The very first thing we saw was that we have to plan earlier. Last year we started in September and felt we were already behind in the process, so this time we'll plan earlier; we'll reach out to all the product teams, making sure they work through their checklists well ahead of time. As Nick said, we reached out to the application teams about the autoscaling configuration, but a few of the teams were still lagging behind, and that's why we saw some of those CPU spikes. So we'll review the checklist again and walk through it with all the dev teams, making sure they're doing the load tests and the performance tests well ahead of time. Most importantly, we never accounted for the volume of logs coming out of their applications, and that was a key factor in PCF Healthwatch and the Loggregator components falling over. We also plan to do more fire drills and implement chaos engineering, so that if some component goes down, there is no impact on any other microservices, because this year we'll be expecting more microservices and more products on the platform. And then the next thing we need to address is active-active foundations. We were so early in the game that when we were transitioning things over to the foundations, we still had the fallback plan of moving back to the legacy hardware, the legacy environments. Ideally we'd move away from that; we need active-active foundations so we can support peak load during those critical times of the year.

That's our story. We'd love to take any questions if anyone has any.

Sorry, let me restate the question: when developers are doing load testing, do we have any mechanisms for them to make us aware? So, we work pretty closely with all those teams. We have Microsoft Teams channels where teams are actively communicating and letting us know whenever they're doing things. We also have a sizable non-production environment, away from production, and everything is Terraformed and automated as far as how those environments are deployed, so they do that testing in a lower environment, but against apps that are identical to what's in production. Sure.

Next question: that zero-customer-impact number was very impressive, congratulations. Outside of Healthwatch and tools like Loggregator, are there any special tools Dick's is using to monitor the infrastructure? Thank you, we're proud of it. We're working through that. Healthwatch was definitely our MVP going into the holiday, and I think we were focused on getting things up and running and stable going into the holiday, and less on monitoring, which hurt us a little bit in the end. But we're looking at integrating with Prometheus right now; we're doing some work with Pivotal support to get that up and running. We also have a whole CRE organization, customer reliability engineering, that's been stood up, and they're going to be helping the product teams integrate their logging with whatever solution we land on; I think we're considering ELK. We're trying to figure out where all those logs are going to land, but we need something more than 14 days' worth of app logs so teams can actually troubleshoot their apps. Thank you, sure.

Anybody else? Great, thanks a lot. Thanks, guys.