Okay, thank you, Colin. It's always difficult to follow Maria, who, by the way, has more commits on BOSH than anybody, including you and Dmitriy. So I hope not to disappoint after Colin, but we're certainly going to talk about something different.

Let me first say that some of this is our experience at IBM. I'll be honest, some of it is pretty dumb and some of it is difficult: things that we learned as we operated in production. And when I say dumb, I mean that we were learning. Hopefully some of it will resonate with you. We have 30 minutes, lots to cover, and hopefully there will be some questions.

Let me first introduce my colleague Zushi; that's my best Mandarin. He goes by Matt, so I won't mangle it too badly, I hope. He's part of the development team in China that we work with. Obviously you know what the "I" in IBM stands for, so we have people all over the world; the China team is huge, he's part of that team, and he's going to help me with this. Let's get started.

We have one intro slide that may be boring to you, and then we'll get into the meat of it. The intro slide is: what is Bluemix? You probably already know, but let me be clear about this. First, it's a certified Cloud Foundry PaaS, and it is the largest installation of Cloud Foundry right now. That's the public version, but we also have private and dedicated versions. Dedicated means you get a slice of the IaaS that we use, called SoftLayer, and then you get your own Bluemix for your enterprise. Private means you get an installation inside your own company. Various companies have similar things; Pivotal has its own offerings, for example.

We announced that we have one million registered users. Does that mean that all one million are actively developing on Bluemix right now?
I don't know; maybe, I hope. But we're certainly adding about 20,000 per month or more; that's the public data that we announced. There are over a hundred thousand apps running right now, hopefully not all hello world. And one thing I know for sure is that we have more than 500 services in our catalogue, which easily beats anybody else. I'll challenge anybody to compare the services that we have. We have a huge number of services, and some of them are pretty unique. Yeah, I know, we've got to fix that; you can keep updating me, Dr Nic, anytime.

Bluemix runs on SoftLayer. It's a challenge, I'll tell you; a lot of the problems you'll see are challenging. It's really good to have the resources that we have in BOSH, people like Maria, and Dmitriy for instance, who helped me fix problems there that even inside IBM very few people know about. So it's a challenge, but SoftLayer is also a very good alternative IaaS to AWS if you're considering it. There are some issues, but if you're looking at that, let me know. OpenStack is also part of what we use right now. And finally, as I mentioned, we have a worldwide development team.

Then we'll get into the meat of it, which is a top-ten list, sort of like David Letterman when he was still on. To help me, Matt will announce each one of these, kind of like a synopsis, and I'll get into the details. He doesn't speak a lot of English, so he'll do it in Chinese.

Okay. Number ten: change request process. [Matt gives the synopsis in Chinese.] Does someone here know Chinese? Great, a lot of you, so you can check him. Okay, great. All right.
So what we found is that there's a need for a tightly controlled change request process. The reason we need that is because we have such a large team, all over the world. So when Dmitriy releases a new stemcell and you update, we have to get it in there. Or a new release of Cloud Foundry, for instance: how do you get those fixes and changes in? A change request process, we found, is important.

Of course, it has its bads. It's slow: from the time we know there's a change that we need to apply to actually executing it, that's a problem. Sometimes it requires meetings very late at night, because we've got to get everybody on the team. If you work with teams in China and all over the world, sometimes you have meetings either very early in the morning, especially for those of us on the US West Coast, or late at night. So I pretty much have two shifts of work: at 10 p.m. I talk to these guys, because it's their midday. The good thing is that it limits propagation of your changes, so if you have problematic changes, it contains them.

The lessons learned: use some kind of tool to alleviate the fact that you're going to have time zone differences for very large operations, and make sure that you still coordinate with a change request process. Most people are probably already doing this, but if you're not, you should consider it.

Okay, number nine: audits. [Matt gives the synopsis in Chinese.] So what we found is that you have to have some level of audit for the health of the system, because as the system goes into steady state, you might want to check on it. The problem, of course, is that those audits can be very manual, and automating some of them can be difficult.

The good thing is, if you don't know Tony from Pivotal, you should talk to him. Is he here? He's here somewhere.
I saw him. You should talk to him, because he'll teach you a lot about how to operate BOSH. One of the things he showed me about two years ago, when I started participating in BOSH, is canary-based deployment. If you don't understand that, you should talk to him. That gives you a set of audits early on in your deployment, but as the system goes into steady state, that's the kind of audit I'm talking about here.

So the lessons learned: have some tools, like what we built, which is called IBM Doctor. I'm not trying to sell it to you; I don't think that tool is available for people to use, but we use it internally. What it does is give us one entry point where we can see all of the different deployments. There are hundreds of Bluemix deployments, and with this tool you can oversee everything, sort of like an uber-view. It also allows you to look at the logs, and even apply some audit rules if you need to, like checking to see whether people did update their stemcell, for instance. Now, if you're doing single, small deployments, the BOSH tool set is probably good enough for you; if you're not, that's where you need something more.

Number eight: logging. [Matt gives the synopsis in Chinese.] One of the things that I found out as I started working with the team, and I guess it's not new, is that you have to have access to logs. The problem, of course, is that with large deployments you get log rotation, so you lose logs.
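On the log-rotation point, one mitigation can be sketched: sweep each job's rotated log files into a single archive for shipping to central storage before logrotate discards them. This is an illustrative sketch, not IBM's actual tooling; the only thing assumed from BOSH is the per-job log layout under a directory like `/var/vcap/sys/log/<job>/`.

```python
import pathlib
import tarfile

def bundle_rotated_logs(log_root, out_path):
    """Collect every job's rotated log files (*.log.1) under log_root
    into one gzipped tarball, so they can be shipped off-box before
    local log rotation deletes them."""
    root = pathlib.Path(log_root)
    with tarfile.open(out_path, "w:gz") as tar:
        for f in sorted(root.glob("*/*.log.1")):
            # keep the job-name directory in the archive path
            tar.add(f, arcname=str(f.relative_to(root)))
    return out_path
```

A cron job or a co-located BOSH job could call something like this on each VM and push the tarball to central storage; the point is only that the sweep has to happen before rotation throws the file away.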
You could try to say, okay, let me increase the size of the disk and things like that, but it's not always easy when you're talking at the scale of hundreds of deployments. So having a strategy that, as you scale, propagates those logs into one place and gives you access to them is important. The other thing is that keeping all of these logs is expensive, and it takes time and planning, but you've got to do it. And separating certain logs is also very important. One of the things we found out as we started doing a new CPI is that being able to separate the CPI log from everything else can be problematic, so make sure you have ways to do this early on, before it becomes an issue.

The good thing is that there is the log aggregator, which allows you to stream all the logs, but you still don't have a way to save them; you just have access to the stream, and you have to save it yourself. So the lesson learned is to introduce some kind of tooling early to deal with the fact that you're going to have a lot of logs. We don't use such a hosted tool at IBM ourselves; I don't even have an account. But I've used one when I was working with Pivotal, and it's actually a pretty good tool. At IBM we have our own solution as well, which is pretty good too, but I don't know if it's being sold, so talk to me. There are other tools too, definitely. It's important to start thinking about this, because when the house starts burning, it's not the time to start introducing tools like that. So make sure you have it.

Number seven: complain. [Matt gives the synopsis in Chinese.] So I searched for this one and Drake came up. I actually like Drake a lot, but he does complain a bit, so maybe that's what it is. This one is about bosh-init. I worked on bosh-init; I actually made some of the first commits to it.
I was part of that, but there are some problems, and I think a lot of them are being addressed. Matt may be able to explain this if you have questions, but there are issues, like trying to recreate an existing director using bosh-init. And certainly there were a lot of frequent updates as the tool got released and grew, and that caused problems for us. The good thing is that it's a single binary, a good binary, and in general pretty easy to use. It also introduced external CPIs: if you know the history of BOSH, as bosh-init got introduced, that was part of the change to external CPIs.

The thing here is that the tool is under active development. I spent a lot of time with the team; if you spend a week at Pivotal working with Dmitriy and the rest, you might think things are moving slowly, but it actually moves extremely fast. Within a month they add features. I visited China six times last year, and I would go and come back, and it was as if I had never talked to DK at all, because there were so many new features. That's the process that goes on right now, so you have to plan for the fact that things are moving very, very quickly. Like BOSH 2.0, for instance, and pretty soon BOSH 2.1 and 2.2: plan for that.

Number six: custom releases. [Matt gives the synopsis in Chinese.] This fits quite well with what Maria was talking about: she showed you how to build a release and so on. It can be difficult; she has a lot of expertise, so she made it look easy. I'm not trying to say that you shouldn't do your own release. Actually, I'm telling you that you should do your own release, and this is somewhere I used my experience in BOSH, and certainly advice from Dmitriy, to essentially migrate all our custom software to custom releases.
Whether you like it or not, big companies such as IBM will have custom software, and good luck trying to convince Joe Schmo somewhere who's building that custom software that he shouldn't be doing it, because that's his livelihood. They're part of the corporation, they've been using it, they sell it to customers, so you can't convince them, and I don't think you should try. I think you should just give them a path to add that custom software.

The bad thing is to bake it into your stemcell. Really bad idea. We did do that for a long time, and we fixed it, but it was a tedious process; it was very hard. Using co-located releases was an excellent feature for us and allowed us to move away from this notion of creating our own stemcell with custom software in it. We ended up creating our own releases instead.

So, the basic lessons learned: you should trust the BOSH team and just use the stemcell that they build. Now, if you have your own IaaS, like SoftLayer in our case, you'll create your own stemcell, and because of that we can control the speed of it. But one thing that we worked very hard on, again following the BOSH team's advice, is not to add anything else. Our stemcell is pretty much the stock one, built for SoftLayer in our pipeline, which is public, and it's essentially just adding one file, because we configure the agent differently than everybody else. That's it. We have absolutely no additional bits that are different; everything else gets added as releases. That significantly solved things for us, but it was a long process to get there. So if you're in that same boat, make sure that you do not add anything else. Trust the Russian; he'll give you good software.

Okay, number five: no PowerDNS. [Matt gives the synopsis in Chinese; sorry, he can't translate "PowerDNS".] So this is number five, and it's actually not for PowerDNS.
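An aside on the co-located releases point from number six: in manifest terms it just means one instance group pulling jobs from more than one release, instead of baking custom bits into the stemcell. A sketch, where the custom release and job names are made up and the exact schema depends on your BOSH version:

```yaml
releases:
- name: cf
  version: latest
- name: acme-monitoring   # hypothetical in-house custom release
  version: 1.2.0

instance_groups:
- name: dea
  instances: 4
  jobs:
  - name: dea_next              # job from the upstream cf release
    release: cf
  - name: monitoring-agent      # custom job co-located on the same VMs
    release: acme-monitoring
```

The custom software rides along on the stock stemcell and gets versioned, compiled, and rolled out by BOSH like everything else.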
It's against it. So here's the thing: when you are running BOSH at scale, you can't really have fixed IPs for every single node. Maybe you could, but it would be expensive, and it's also not a good strategy. So a dynamic IP solution works, and it's important for you to consider it. But of course, once you start using dynamic IPs, you get the problem that you need DNS. Now, if the IaaS you have provides DNS, that's great; otherwise you use something like PowerDNS. But that's a problem, because it's a SPOF, a single point of failure. It will go down, and when it goes down, your whole deployment is pretty much dead until you revive it or fix it. So that's a big problem, and it's not very easy to make PowerDNS HA. I think there were some teams at IBM that tried it, and if you look around the web you might see people trying it. I mean, you could try it, and if you have a solution I'd love to hear it, and I'm sure other people would as well. But definitely, just get rid of it.

Now, how do you get rid of it? Some IaaSes, like AWS I think, provide some level of DNS that's HA; if you use those, that's great. In our case, we don't have a good PowerDNS solution that's HA, so what we've been doing, and I'm actually working right now with the BOSH team, with Tyler and the team, on this, is to add a solution to get rid of PowerDNS, essentially a solution for DNS. If you're interested in this, we should chat. But basically the lesson learned is: think long and hard, in general, about not having any non-HA jobs or nodes. Because as much as you think, well, I'll protect it, I'll pay somebody to just monitor that one node,
it will fail. And when it fails, you're going to have to deal with the consequences, and in this case, with PowerDNS, it will be a bad consequence. So we can talk after, or maybe in the Q&A I can tell you a little more about the solution we're working on.

Number four: security updates. [Matt gives the synopsis in Chinese.] So, one thing that you should all know, like a certain presidential candidate, is that the internet has a lot of evildoers. Do not trust anything on the internet, pretty much; you should question everything. What does that mean? It means there are CVEs all the time, pretty much every day: security vulnerabilities that you've got to fix. The bad news is that rolling out those security updates can be very costly, especially when you have thousands of nodes; it takes time and effort, and things fail. So how do you deal with security?

Well, there's good news here. The good news is that the BOSH team, and I've seen this, having been part of the team, and if you go to Pivotal you'll see it, now has a security czar. Essentially, every time there's a CVE you'll see tons of discussion, and if you're on the private mailing list, you'll see some of those discussions very early. Chip gets involved in that as well, so the Foundation knows, and then all the different people involved learn about those. The BOSH team then essentially dedicates a pair, or maybe even more than a pair, to address the CVE. You can talk to Tyler;
he's the leader of the team that essentially addresses the CVE, and typically that results in a new stemcell. What I'm seeing now, and I don't want to quote anybody, is about a weekly new stemcell, or maybe not a full new stemcell, but new updates that you have to address. Now, if you're paranoid about security, then you're thinking, okay, I need to update very frequently. But if your update takes you a day, or a significant amount of time or effort, then it's a problem. So what do you do about that?

What we found is that working with the IaaS can be a very good thing. We do two things. One of them I actually learned in a bit more detail yesterday, talking to our Bluemix manager Fabio: he uses IBM endpoint-management software to essentially push patches to the different stemcells, the different nodes. Certainly you could also use BOSH to update, and that's certainly a way to do it for smaller installations; we do that too. But since we own the IaaS, that's part of how we're able to do this; if you're running on top of Amazon, you can't really do that.

The second piece is the reloading of the OS. This guy here and a few others at IBM came up with a way, and we had to, to apply stemcells where, instead of recreating the machine, you keep the machine and reload the operating system with the new one. That allows you to update all your nodes much faster. Of course, it means that you have to change your CPI; it's a specific addition to the CPI. At first this looked like a bad idea, with lots of discussion with our Russian friend and other people, and now it looks like maybe it's a good idea. And there's another reason for it too: if you have disks with a lot of data, you can actually keep that data; you don't have to reload it.
So it has some advantages from that perspective. There is actually a track right now in Dmitriy's backlog to allow this, to have OS reload as a first-class feature. That would help us significantly. Right now we do it independently of that feature, because we do it in the CPI, but hopefully it will become a broader feature. It definitely helped us; without it, we probably wouldn't be able to scale Bluemix.

Okay, number three: multi-BOSH deployment. [Matt gives the synopsis in Chinese.] This is one that I'm sort of new to, and we haven't applied it, but I figured it's a lesson learned because we need it. The issue is that you can start with one BOSH director and one deployment, and then keep making it bigger and bigger and bigger. That works, but obviously it leads to some level of bottleneck, and while it's easy, it's going to slow your growth.

One of the things that I found out, and I'll credit DK again, is that you can do multiple deployments, and even within a single environment you can divide things into smaller pieces. In our case, for instance, we have thousands of nodes that are just running the apps: the DEAs, or if you're looking at Diego, the cells running the rep. If you have thousands of those, you want to divide them into smaller pieces, so that you can try new updates on them independently of the whole. That helps significantly, because when you're doing a BOSH update you can roll it out piecemeal. Now, obviously it's not going to require less time than the overall BOSH update that you would normally do, but it will be more manageable. I'm pretty much repeating what Dmitriy, Fabio, and I discussed yesterday, but that's certainly something we were considering; we haven't applied it yet.
So let's be fair about this: I'm mentioning it because I feel it's something we should have done early on. But from a computer science perspective, it's simple: it's divide and conquer. Don't let something grow so big that it becomes difficult to manage; break it into small pieces, and BOSH supports that, and that's the important thing. It's definitely about breaking your deployment down into multiple deployments, using what I call multi-BOSH deployment; maybe there's a better term for it, like multi-release deployment or multi-manifest deployment, I don't know, but break your deployment into smaller chunks. Because even if your IaaS is fast, say Amazon, and you're doing one-minute deployments per node, when rolling you probably can't do more than 10 or 20 at a time, if you want a rolling deployment. Then you start multiplying that by the fact that things will probably fail, and the fact that you have live customers, so you don't want to break everything at once. You can see already that having smaller deployments, where you can control each one even if you've already tested in a test environment, helps you, because it reduces the risk and makes things a little easier to manage when there are problems. Exactly, that's exactly right. So that's why it's just an idea right now; we haven't done this, we're in the process of looking at it. But I'm sure other people, maybe if you talk to Tony outside, maybe Pivotal, are already doing that.

Number two: 100% success. [Matt gives the synopsis in Chinese.] All right, so if I say 100% success, you should be worried: how is he able to get 100% success? The point is that there is never 100% success. The reason I say this is because, talking to the teams, both abroad and in the US, it becomes a cultural thing.
This is probably one of the more fun ones. It turns out that every single deploy, every single update, never fully succeeds. Once the system got large, there was never a case where you do a deploy and at the end it goes all the way through and reports 100% success. Never. There's always some failure somewhere. So what we essentially realized is that it's okay. It's okay that it fails. Don't feel bad because it's not 100% successful, because the deploy didn't go all the way to the end with everything good. That's not going to happen when the system gets big, so just deal with it; be happy.

The other good thing is that the deployment is still usable most of the time. Even though a job failed, or a VM failed to be created, you can still continue, and you can pick it up from where you left off. The tool essentially has that built in, so take advantage of it. Don't feel bad, don't panic; that's just a fact of life. It's like the fact that something in the cloud will fail and you just have to deal with it: design for failure. So the lesson learned is to trust the tool; things will fail, and it never works 100% of the time, at least on SoftLayer. I don't know about Amazon, maybe it's better, but my guess is no, because I've used Amazon in the past and I know they fail too. But maybe they're better, I don't know. Is this being recorded?

Okay, number one: backup. [Matt gives the synopsis in Chinese.] All right, so I left this one for last, as number one, but it's one where most people would say: obviously, dummy, you should back up. Well, the reality is that sometimes backups take such a long time that things fail in between those backups, and what do you do about that? This is a real scenario that happened, and it involved me. So I'm planning my vacation; when was it,
yeah, last year, around October. In the middle of the night, Chris Ferris calls me: we need you, buddy. I'm like, I'm going on vacation. Yeah, but one of the environments had completely failed. And then of course I get on the phone before my flight. Turns out the director database is dead. Okay, wow, cool. So, more calls, but I'm getting on my flight. So I'm on my flight; Alex, my manager, gets involved, he's here. Then, okay, more calls, and they give me a synopsis of what it is, and we got some log traces. Now, of course, I make sure I have my laptop and all the BOSH code, and during my flight I'm furiously looking through the BOSH code to figure out what happened. And then I see that in the BOSH agent there's a place where it formats the disks. It's pretty clear that's what's happening, and I can see the logic; maybe something went wrong. So I get on the phone when I reach my midpoint, I think in Dallas, with a bunch of people and tell them, okay, I think I know what it is; I feel kind of confident that I know what it is.

And of course, what do you do? You call Dmitriy. It's Saturday, and believe it or not, he's amazing like this: he will pretty much help everybody. So I got on the phone with him and we went through the code, and he agreed with me, but he's better, because he identified the reason why it got into that code path. It turns out that there's a special condition where, if the BOSH agent thinks your disk is not properly set up, it will set it up: it will basically format it and reconfigure it and so on. So that was the cause of the issue, and then we found out, going back, why that happened. But that doesn't change the reality: you can identify the issue, but the patient is dead. What do you do? You've got to revive it. That wasn't your case, though;
you're good. So I've got only a few minutes, but I'll tell you the gist of the story. I get to my vacation, one week, and pretty much every day we have calls, with all levels of the corporation; even a VP jumps on the calls every now and then to check what's going on. So every day we're trying to work, and these guys, of course, are working overtime, because they're working at odd hours, twelve hours or something like that from here.

So it turns out what ended up happening is that we lost the disk where the director database was, and we had zero backup. And the deployment is live, customers are using it, but we don't have a BOSH director talking to it, so we can't do anything to that deployment. It's just alive, but if anything happens to it, we can't use BOSH. So we had to figure out a way to recreate the director, and recreate the database. So I've told you all the bad things; there's a good thing in there.
So, the good thing, and I'll tell you what we did because it's sort of a lessons learned, is that there is a backup capability. It existed even at the time this happened, but it wasn't very fast, and that prevented us from using it. We've since, and I think the BOSH team in general has since, improved it. So now if you use `bosh backup`, it will actually do a backup much faster, because I think it avoids the blobstore, which is what slowed it down before. The other thing is that a lot of IaaSes have disk snapshots, so you should definitely consider that. Talking to Tony, for instance, I know that Pivotal does snapshots, because they use AWS a lot, and AWS has really good snapshots. So you want to consider using whatever backup the IaaS provides; that can be an additional backup on top of what you have.

But the lesson learned, in addition to backups, which you should always do, including your phone and everything else, all data that you have, is the notion of a dummy CPI. The guys in China and the team figured out that the way we could reconstruct the database was to replay a deployment. But instead of replaying a real deployment, where we would create VMs anew and create disks anew, we just repopulate the database. And that's what we did: we reconstructed the director database that way. So I would highly recommend this at some point. I don't know, I have to talk to Dmitriy and see if it's interesting enough for us to contribute; it might be too specific to us, so maybe it doesn't make sense, but it's certainly an idea and an approach that worked for us. As I said, I put it as number one because it's the most catastrophic, and it certainly impacted me directly the most. But hopefully it's one that you never have to do, because you should all be backing up, right?
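To make the dummy-CPI idea concrete, here is a minimal sketch. The real CPI is a process that receives JSON requests from the director, with methods like `create_vm`, `has_vm`, and `attach_disk`; a dummy version answers from a recording of the live environment and makes no IaaS calls, so replaying the deployment repopulates a fresh director database while the real VMs keep running. The method names follow the BOSH CPI contract; everything else here, including the recording format and IDs, is invented for illustration, and this is not IBM's actual implementation.

```python
# agent_id -> the live VM the director lost track of, captured from
# the running environment before the replay (format is hypothetical)
RECORDED_VMS = {
    "agent-0001": "softlayer-vm-101",
    "agent-0002": "softlayer-vm-102",
}

def handle_cpi_request(request):
    """Answer one CPI call from the recording instead of the IaaS."""
    method = request["method"]
    args = request.get("arguments", [])
    if method == "create_vm":
        # Hand back the ID of the VM that already exists; the director
        # records it in its database as if it had just been created.
        return {"result": RECORDED_VMS[args[0]], "error": None}
    if method == "has_vm":
        return {"result": args[0] in RECORDED_VMS.values(), "error": None}
    # Anything with side effects (attach_disk, reboot_vm, ...) is a no-op.
    return {"result": None, "error": None}
```

Run a normal `bosh deploy` against a fresh director wired to this CPI and the deploy "succeeds" without touching anything, leaving the database describing the live VMs.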
Everybody, shake your heads. All right, with that I want to thank you, and we may have some time for questions. Here's where we have to work as a team: we need sufficient questions to cover until I can find the next speaker. Plenty of questions, please. Matt may finish with the lightning talks later on; that's our plan. Any questions? Ah, here.

[Audience question.] Well, yeah, you'd probably want to have one part that's deploying the Cloud Controller and all the different pieces that are important, and then the things that you have to have multiple copies of, like the workers essentially. Because when we look at Bluemix and I say we have a thousand nodes, I mean that 900-odd of them are just DEA nodes. That's right, and I guess part of it is strategizing for that, because what ends up happening is that you feel very confident with one big file containing your entire deployment. You keep doing it and it keeps growing, because it's easy: you just increase a number and then you're happy. But that leads you into a state, like where we are right now with some of our deployments, where we definitely have to break it down, because it's not manageable anymore.

[Audience question.] Yeah, you can't do anything, that's right. Well, we should definitely raise that. Where is Dmitriy? Is he here somewhere? Maybe he's outside. But we should definitely raise that, because I think part of the issue is that when you're doing an update, for instance, you can't do anything else. Now, you could stop it and then resume, but that's the reality. And I guess for IaaSes that are super fast, maybe that's not an issue, but for us it is, because, for example, we did an update to the London deployment and it took more than 18 hours to go through. Now, it's a large deployment;
like I said, I think we have the largest deployments. I think part of it is to deal with what Dr Nic mentioned, which is that you do canary deployments for single deployments, to make sure that when you're applying something, there's a canary in the coal mine that will tell you if something goes wrong. And when you have something this large, imagine if each one of you were a node; our deployments are much bigger than that. Imagine I have to give a secret to each one of you: that's the challenge of making sure each of you got the secret correctly. If I divide you into smaller teams, then I can give one team the secret, and potentially in parallel have Matt give another team the secret, and we can roll it out.

So no, it's not the network connection; the network connection is not the impact here. It just takes time to take each node down, go through the process of recreating, reloading, applying the stemcell, restarting jobs, everything. If you've ever done a non-trivial BOSH deployment, it takes time. You could see that Maria was doing a relatively simple one and it took some time; now scale that up to a typical CF deployment, and then a live production deployment, and it just takes time. The good thing is that the tool works, so BOSH will keep doing its stuff, and you just need to monitor it.

[Audience question.] Of course, we use Concourse a lot here too. I think the frequency of updates for us is a little slower than Pivotal's; those are the two environments that I know. I mean, not that I know Pivotal's details, but since I spent a lot of time there, it's public information that I know. I would say we try to deploy as frequently as we can, based on making sure that whatever got released works for us.
So I know CF gets released every week and then Bosch has some kind of a cycle We don't we're not in the bleeding edge But there are other environments inside IBM that are in the bleeding edge And then those are typically test environments and then those Depending on what happened there then we migrate because as I mentioned to you, we have lots of different custom software So we have additional releases that we have to test. So making sure that those work well and so on Maybe one more We have time. Okay. One more question. Make it good. Yeah That's a good question. I think at first we made the mistake of not Necessarily requiring and that's a huge mistake because now we are stuck with Lots of release that lots of services that are not Bushified and we're going through the process of bushifying them now because we have so many Uh, 500 I think uh, it's fair to say that there's a there's a significant number that are already bushified and we're doing quite well But some that are not and we have to go through the process. So I would say, um, we got work to do from that perspective Nothing that I can talk about I mean, I don't want to make this a sales pitch. I like I dr. Nick was mentioning. There's tons of different tools I mean, certainly if you're an IBM customer, then you should just ask about that but Generally, you're right, you know that You start feeling it very quickly and then figuring out what the solution is You know, it can be difficult It's it would be better to use a solution that already people are using and of maintaining and works with cf and so on Uh, but as far as I know, I'm not sure if it's if it's a offering at this point Thank you everybody. Thank you very much