Hello everybody, and thank you very much for coming to our talk. My name is Nima Kaviani. I'm an engineer with IBM and also a committer to Cloud Foundry's Diego. And my name is Adrian Zankich. I'm an engineer at Pivotal; I was formerly the anchor of the infrastructure team, which maintains consul-release, and I'm the current anchor of the MySQL Galera team.

Okay, so the title of our talk is Unlocking Diego, and we're going to talk about how we identified some of the flakiness that we had in Cloud Foundry, and how we improved robustness and reliability by fixing the problems we found, especially around our use of Consul. I think Eric's talk made it a lot easier for us to give this one, because of the great overview he did of Diego. But to give you some background: Diego is the runtime and the scheduler for Cloud Foundry. Essentially, the role of Diego is to receive jobs from the Cloud Controller, run those jobs, monitor them, and make sure they keep running. The important point about Diego is that it's a distributed system by nature, which means that, like any other distributed system, its components need to be discoverable and highly available. In Cloud Foundry, to provide that discoverability and high availability, we decided to use Consul, mostly because Consul has features built for exactly those things. But as we did further analysis of Cloud Foundry deployments, and as we learned more about how Consul works, we realized that Consul may not necessarily be the best choice for Cloud Foundry, and we decided to transition away from it. Before I tell you how we transitioned away, I'm going to hand it to Adrian, who's going to talk about what Consul is and how Consul and Cloud Foundry have problems with one another.

Yeah, so some of you may not know what Consul actually is and what it does in the system; you'll just see three extra VMs popping up in your deployment that you have to have and that are notoriously broken. What Consul is, is a clustered system that offers discoverability through DNS. It allows system components to register routes with the cluster, which other components can then query to discover one another. So you can find where your UAA is, or your database, or Diego, whatever. It provides health checks on those services, so that if a node dies, its route gets deregistered and you can't be sent to a dead node anymore. It offers a key-value store for storing whatever you want in there, and it's based on the Raft algorithm; we're going to talk more about Raft shortly, for people who don't know what that is.

So how do we use Consul in Cloud Foundry? As I just mentioned, we use it mainly for service discovery: different components register themselves, and other components can find them. We use it as a system state store, so some metadata is kept in Consul's key-value store. And it's also used, primarily by Diego and a few other components, for distributed locking, so that we can achieve the active-passive availability model that Eric mentioned in his previous talk. Basically, what this means is: say you have n instances of a system component and you want exactly one of them to be the active node, maybe a web server that's accepting traffic.
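To make that distributed-locking idea concrete, here's a minimal Go sketch of how a component could contend for a Consul lock using Consul's official client library. The key name and the surrounding plumbing are illustrative, not Diego's actual code.

```go
package main

import (
	"fmt"
	"log"

	consulapi "github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent (default 127.0.0.1:8500).
	client, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// A lock tied to a KV key; whichever instance holds it is "active".
	// The key name here is hypothetical.
	lock, err := client.LockKey("v1/locks/my_component_lock")
	if err != nil {
		log.Fatal(err)
	}

	// Lock blocks until this instance becomes the holder. The returned
	// channel closes if the lock is ever lost (session expiry, partition).
	lostCh, err := lock.Lock(nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("this instance is now active")

	<-lostCh // lock lost: stand down and become passive again
	fmt.Println("lock lost; standing down")
}
```

All the passive instances block on that same Lock call, which is what gives you the failover behavior described next.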
So that active node is accepting traffic, and if it dies or goes away for some reason, a passive node can then take the lock, and any component that wants to talk to that system will discover the new holder through Consul. You only ever go to the active node; you only talk to one at a time.

So why is Consul an issue for Cloud Foundry? It comes down to a piece of software we all know and love called BOSH. BOSH and Consul do not play well with each other. BOSH does a lot of things very, very well: it handles stateless services amazingly well and gives us everything we love about automatic operating-system upgrades, software upgrades, all that kind of stuff. However, one area where BOSH could use some improvement is stateful services, and even more so clustered stateful services. Consul is a double whammy of things that don't work well with BOSH, and we've seen a lot of pain around orchestrating it.

The reason it's so painful is Raft. Consul uses the Raft algorithm, and if you don't know what Raft is, that's okay, I'm going to talk briefly about it. Raft is a consensus algorithm: consensus is achieved when multiple servers agree on a given leader, and Raft achieves that consensus in a fault-tolerant way. Consul ships its own implementation of Raft.

So here's a typical Raft cluster that has not been initialized yet: five nodes that don't know about or talk to each other. When no one knows about anyone else, nodes heartbeat to each other, and when one of them finds that there's no leader, it nominates itself as a candidate, sends a candidate message to all the other nodes, and starts an election. Every node that receives that candidate message responds with a vote for the first candidate it sees. When there are enough votes in the cluster to reach quorum, as they call it, a leader gets elected. All the nodes then agree on that single leader, and the leader is the one that replicates log entries to every other follower in the cluster. This goes on over the lifetime of the cluster: if a leader dies or goes away, Raft is supposed to pick up where that leader left off, elect a new one, and keep going forward, and since all of your log messages were hopefully replicated everywhere, the cluster can just pick up and keep going.
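Since the election rule is the heart of Raft, here's a toy Go sketch of just that rule: a strict majority of votes wins the term. This is purely illustrative and is nothing like Consul's actual implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const clusterSize = 5

// Each node grants at most one vote per term, to the first candidate that asks.
func requestVote(voted []bool, voter int) bool {
	if voted[voter] {
		return false
	}
	voted[voter] = true
	return true
}

func main() {
	// Election timeouts are randomized so nodes rarely become
	// candidates at the same instant.
	timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
	fmt.Printf("node 0 saw no heartbeat for %v; becoming candidate\n", timeout)

	voted := make([]bool, clusterSize)
	voted[0] = true
	votes := 1 // a candidate always votes for itself
	for voter := 1; voter < clusterSize; voter++ {
		if requestVote(voted, voter) {
			votes++
		}
	}

	quorum := clusterSize/2 + 1 // 3 out of 5
	if votes >= quorum {
		fmt.Printf("won with %d/%d votes (quorum %d); now leader\n", votes, clusterSize, quorum)
	} else {
		fmt.Println("no quorum; start a new term and try again")
	}
}
```

The important property is that two disjoint majorities cannot exist in the same term, and that guarantee is exactly what becomes painful operationally when a cluster partitions, as we'll see next.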
So one of the problems with Raft, Consul, and Cloud Foundry is what's called a split brain. A split brain in a Raft cluster is when you have a partition between one set of nodes and another. Say you're doing a deployment and something fails, or there's a network isolation, whatever: one set of nodes may elect a leader and talk among themselves, and another set may elect a completely different leader. Then you have two separate sets of data and two separate leaders where there should be one unified view. I don't want to go too deep into how Consul works, but basically there are client agents on each VM and a central server cluster, and each client talks to the cluster to get routes to everybody else and to register new routes. So if half of your system is talking to one side of the cluster and the other half is talking to the other side, one half of your system can't reach the other half, because it simply cannot even get the routes to that other side, and this is a big problem in Cloud Foundry.

A really common situation for split brains, and for all of this breaking down, is during BOSH deploys or network isolations. What happens is BOSH wants to roll a VM, which coincidentally is always the leader, right? So it takes down the leader; Raft should elect a new leader, and the cluster should be happy. Most of the time that happens, but sometimes the leader goes down, the remainder of the cluster takes too long to elect a new leader, the previous leader comes back up knowing it used to be the leader, sees that there's no leader, and tries to participate in another election; a bunch of bad stuff happens. And sometimes when a leader comes back up, its data might be ahead of or behind where the current cluster is. If you've ever orchestrated a Consul cluster in Cloud Foundry, you will know that dreaded "log not in sync" error message, which basically means: go in there, wipe out all the data, stop everything, and start it all back up from scratch. That is what happens when Consul gets into a situation where it can't trust its data anymore. Consul is a consistent system, and when it no longer knows who the source of truth is, you're just completely screwed at that point.

All right, so next we're going to talk about how Consul failures affect Cloud Foundry, and in particular Diego, because Diego is one of the primary users of Consul in the Cloud Foundry ecosystem. Let's start with an overview of Diego. I'm going to be brief about it, because I think Eric did a great job, but essentially: when the Cloud Controller wants to run a job, it contacts the BBS, or bulletin board system, in Diego, which is the front door for communication with all the other components in Diego, and sends it the job information as well as the resource requirements. The BBS passes that information to the auctioneer, which is the scheduler in the environment. The auctioneer receives the job and resource information and consults all the cell VMs about their available resources in order to decide where to schedule the job. Once a cell is identified, the auctioneer communicates with the rep, the process representing that cell VM, and passes the job information to it. The rep then launches a container, pulls the code for the job into that container, starts it, monitors its execution, and reports back to the BBS. Another key component in Diego is the route-emitter. Once the job is up and running, the route-emitter takes the route to that job and makes it available to the outside world by communicating it to the Gorouter; that's how an application, once you push it to Cloud Foundry, becomes accessible from outside.

So Diego uses Consul for high availability, for service discovery, and as a data store. As Adrian mentioned, the high-availability model implemented in Diego is an active-passive model, where the active instance in a family of components is the one receiving and handling requests, and all the passive instances sit there idle, waiting for the active instance to go down so one of them can take over.
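From a client's point of view, the Consul side of that discovery is just a DNS lookup. Here's a minimal Go sketch; "bbs.service.cf.internal" is the conventional domain for the active BBS, but treat the exact name as deployment-specific.

```go
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// The active instance registers under a well-known name; everyone
	// else resolves it. Consul's health checks mean the lookup only
	// returns instances that are alive and hold the lock.
	addrs, err := net.LookupHost("bbs.service.cf.internal")
	if err != nil {
		log.Fatalf("no active BBS discoverable: %v", err)
	}
	fmt.Println("active BBS at:", addrs[0])
}
```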
Service discovery is done in such a way that the active instance in a family of components registers itself under a domain name. When any other component in Diego wants to contact a given component, it uses that domain name, and the name resolves to the address of the active instance of that family of components, whether it's the BBS or the auctioneer or anything else. Out of all the components in Diego, the BBS, the auctioneer, the reps, and the route-emitter are the ones that heavily utilize Consul. The BBS, auctioneer, and route-emitter use Consul for service discovery and for the distributed lock behind the active-passive high-availability model, and the rep uses Consul to provide information about the cell it manages, including metadata about the resources it has available.

So if Consul fails and goes down, the way it affects the BBS is that no active instance of the BBS is discoverable anymore, so essentially no other component can talk to Diego. The DNS record fails to update, because we use Consul DNS to provide the DNS service and the discovery, so that becomes unavailable and Diego becomes unreachable. When Consul fails for the auctioneer, the auctioneer cannot schedule any new jobs, because again there is no active instance of the auctioneer that can take over that responsibility. When Consul becomes unavailable for the rep, the rep cannot provide information about the cell it manages to the rest of the system; if all the reps lose communication with Consul, you essentially have no cells available on which to schedule new jobs. But probably the most important of all these components is the route-emitter, because, as I mentioned, its responsibility is to make your jobs, your applications, available to the outside world by constantly refreshing the routes that make those processes reachable. If the route-emitter goes down, the routes to the applications expire, Cloud Foundry and Diego start dropping them, and you essentially lose all communication with your applications. That's quite catastrophic for Cloud Foundry, because it fails at its primary job, which is running those applications.

So in order to solve these problems, we decided it was probably wiser for Cloud Foundry to replace Consul, and we started looking at the individual components to decide how we actually wanted to replace it. For the BBS and the auctioneer, if you remember, I mentioned that we used Consul for service discovery and for availability. For the high availability and the distributed-locking mechanism that we had implemented on top of Consul, we decided to move to a relational database and use it to provide that distributed locking. The good thing about using a relational database is that its availability is not based on a Raft algorithm, so we avoid all the split-brain and other issues that come up when BOSH has to manage a Raft-based system. So we implemented Locket, which is a service that runs on top of a relational database and manages all the distributed locking for the components.
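To sketch the idea of a lock on a relational database, here's a minimal Go example built on a hypothetical single-row-per-lock table. This is not Locket's actual schema or API, just the shape of the technique: claim a row, then keep re-claiming it before the TTL lapses.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any relational driver works; Postgres shown
)

// Hypothetical schema (not Locket's real one), seeded once at setup:
//   CREATE TABLE locks (name TEXT PRIMARY KEY, owner TEXT, expires_at TIMESTAMPTZ);
//   INSERT INTO locks VALUES ('bbs', '', now());

const ttl = 15 * time.Second

// claimLock succeeds only if we already own the lock or it has expired.
func claimLock(db *sql.DB, name, owner string) (bool, error) {
	res, err := db.Exec(`
		UPDATE locks
		   SET owner = $1, expires_at = now() + ($2 * interval '1 second')
		 WHERE name = $3
		   AND (owner = $1 OR expires_at < now())`,
		owner, ttl.Seconds(), name)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err // true: we are (still) the active instance
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/diego?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	for range time.Tick(ttl / 3) {
		active, err := claimLock(db, "bbs", "bbs-instance-0")
		if err != nil {
			log.Print(err)
			continue
		}
		log.Printf("active=%v", active) // passive instances just keep retrying
	}
}
```

Whichever instance last claimed the unexpired row is the active one; if it crashes and stops heartbeating, the row expires and a passive instance's next claim succeeds.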
With that in place, the active instance can register itself as the active instance, and all the other passive instances can wait and listen for the active instance to potentially go down so one of them can take over. The important thing is that for service discovery we are still using Consul DNS; that hasn't gone away yet, but the long-term plan is to replace it with BOSH DNS, and Adrian is going to talk later in this talk about how BOSH DNS solves some of these problems.

For the route-emitter, I think Eric also covered this a bit, but the old architecture we had for Cloud Foundry was a single instance of the route-emitter managing routes for all the applications in Cloud Foundry, and that was essentially a single point of failure. Once that route-emitter went down, all the routes would start dropping and we would lose routability to all the apps. So we moved to a more distributed architecture, away from a global route-emitter to local route-emitters. In this new architecture, a local instance of the route-emitter is deployed alongside the rep on each cell in the deployment. Because we've split the responsibility for updating routes across multiple route-emitters, if a cell or its route-emitter goes down, you lose routability only to the instances of your application deployed on that cell. If you've followed the best practices for deploying applications on Cloud Foundry, which involve having multiple cells and multiple instances of your application, then hopefully you have other instances running on other cells whose route-emitters are still available, and your apps remain reachable. So we've moved away from that single point of failure by distributing route-emitters across the cells.
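The refresh loop a local route-emitter runs is conceptually simple. Here's a hedged Go sketch of the idea: periodically re-register the routes for locally running app instances with the router over NATS. The "router.register" subject matches the Cloud Foundry routing convention, but the message fields and timing here are simplified and illustrative.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Simplified registration message for one app instance on this cell.
	register := []byte(`{"uris": ["myapp.example.com"], "host": "10.0.16.5", "port": 61001}`)

	// Refresh well inside the router's pruning interval; routes that are
	// not refreshed before their TTL expires get dropped by the router.
	for range time.Tick(20 * time.Second) {
		if err := nc.Publish("router.register", register); err != nil {
			log.Print(err)
		}
	}
}
```

Because the router prunes any route that isn't refreshed within its TTL, routes for a dead cell simply age out on their own, which is exactly what makes the per-cell emitters safe: losing one emitter loses only that cell's routes.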
For the reps, as I also mentioned, we used Consul to provide metadata about the cells, and we realized we could use the Locket mechanism to report that same presence information. So the reps also use Locket, registering their presence and providing their metadata to the rest of the Diego components through a relational database rather than through Consul. With that, I'll hand it back to Adrian, who's going to talk about service discovery and BOSH DNS.

Yeah, so as Nima was mentioning, we're moving a lot of the locks and things like that into relational databases, which are a little easier to orchestrate, but we're still using Consul for service discovery. At this point we've decoupled the active-passive failover from Consul, but service discovery still lives there for now.

So, looking ahead, the BOSH team is working on a new feature called BOSH DNS. This is going to be BOSH-native DNS; you can find it in the cloudfoundry organization on GitHub, in a repo called dns-release. It's still an alpha release and a work in progress, but it's the direction we're going. A lot of people might be wondering, why are we writing our own DNS server, why are we doing all this? One of the core problems we're facing with Consul is that we use it primarily as a service-discovery platform, but due to its consistency guarantees and its reliance on Raft, it's hard to orchestrate, and it may not actually be what you want for service discovery. We're using a consistent system for service discovery, so if that consistent system goes down for whatever reason, we completely lose routes and service discovery. What you want instead is a more available system, and that's what BOSH DNS aims to be.

With BOSH DNS, each VM in your deployment, or across deployments, already knows about all of the VMs in your deployment through the BOSH director: the director gives each instance information about all the other nodes, so the instance can serve DNS records about them and they can talk to each other. The reason this is a better option for Cloud Foundry is that if your BOSH director goes down, or you have a network partition or some catastrophic failure in part of your deployment, anything can happen, right? Every node has its own local copy of the records, which never gets wiped out unless the BOSH director specifically gives it an update. So say half of your deployment gets partitioned, or you lose an AZ, anything like that: your components can still talk to the nodes they know about. And BOSH DNS will eventually have health awareness, so it will know, hey, I can't talk to this other VM anymore, I know it's there or used to be there, I'll just stop trying for now, but at least I still have information about what's left in my deployment and who I can talk to. So that's the goal behind BOSH DNS. We're eventually going to roll this out; it's still very much a work in progress, but you can check it out at that Git repo.
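Here's a minimal Go sketch of what querying a co-located BOSH DNS server could look like. The link-local listen address (169.254.0.2:53) and the record name below are assumptions about the release as it evolves; check dns-release for the real details.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// Ignore the system resolver; ask the local BOSH DNS server,
			// assumed here to listen on a link-local address.
			return d.DialContext(ctx, network, "169.254.0.2:53")
		},
	}

	// Hypothetical record name: <instance-group>.<network>.<deployment>.bosh
	addrs, err := r.LookupHost(context.Background(), "bbs.default.cf.bosh")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(addrs)
}
```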
So, to wrap this up: through orchestrating and deploying Cloud Foundry, we've determined that Consul is not necessarily a good fit for Cloud Foundry. I want to take this opportunity to say that we've worked very closely with HashiCorp and the Consul team, and they've been doing a great job. Consul is a really awesome piece of software; just because it's not a good fit for Cloud Foundry doesn't mean it's bad or doesn't work. Consul works fantastically; we just have problems orchestrating it in our environments. So if you need a service-discovery platform with distributed locking, that kind of thing, and maybe you have a different type of architecture, Consul might work very well for you; I recommend checking it out. It's super fast, it's a great piece of software, and I'd just like to thank the Consul team for all the work they've done.

So, as Nima mentioned, we are moving to simpler technology in our platform. We used to have an etcd cluster in addition to a Consul cluster, so we used to run two separate, completely different Raft clusters, and if you were still using an older-style Diego release, you even had a third Raft cluster. So in a Cloud Foundry deployment you could have three different Raft clusters all at the same time, which, if anybody here has operated a Cloud Foundry, you know is not a fun time. Instead, we're switching over to relational databases, which we've been using for a very long time and know how to orchestrate well. There are many different options, from Cloud Foundry-internal databases to things like RDS; those are a lot more stable, and we're using Locket on top of those SQL systems. BOSH DNS, which I talked about, is going to replace Consul DNS; we're still using Consul for DNS currently, but in the future that will likely not be the case. That will give us a more available service-discovery system within Cloud Foundry that should be more resilient to partial system outages, make the deployment story a little easier, and let you remove three VMs from your Cloud Foundry deployment that were just sitting there kind of dying all the time.

We're going to publish a blog post about this talk that goes into more detail: the changes in Diego, why we're making those changes, common issues with Consul and how we got around them, stuff like that. We'll post that blog post and tweet it out, do all that kind of jazz. You can check us out on Twitter, those are our Twitter handles; you can follow us or not, no big deal, no pressure. And that's it, thanks. We can take some questions.

Yeah, so the first version of BOSH DNS is exactly what you're talking about: it would write records into the hosts file and periodically refresh them, but you couldn't really query the director for "give me the records in this zone," or health information, or anything like that. BOSH DNS is actually going to be a DNS server that you can co-locate on a job. The BOSH agent will lay down a file that has information about all the other nodes in your deployment, and about other deployments on your system, and the server binds to a local IP address on the loopback interface. You can then ask it for DNS records, and there's a sort of query syntax where you can say, give me the records for nodes in this zone, things like that. So currently it's the hosts file, but eventually it won't write into hosts anymore, and it will be a fully functioning DNS server.

I think it really comes down to how the database is managed. The problem with Consul is that with a Raft-based system it's a lot easier to get into a split partition, whereas if you use something like RDS as your backing data store, you don't even need to worry about the split-brain case, because BOSH is generally not fiddling around with it; it's managed outside the Cloud Foundry ecosystem. And the type of consistency it provides ensures that once the active instance writes to the database and locks its presence there, it's going to be there, and the other components are not going to take over. Or, yes, exactly.
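Connecting that answer back to the earlier hypothetical lock table: the consumer side is just a read, and everyone who reads sees the same single active instance. A minimal Go sketch, under the same assumed schema as before:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/diego?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Look up which instance currently holds the unexpired lock row.
	var owner string
	err = db.QueryRow(
		`SELECT owner FROM locks WHERE name = $1 AND expires_at > now()`,
		"bbs").Scan(&owner)
	if err == sql.ErrNoRows {
		log.Fatal("no active instance registered") // expired or never claimed
	} else if err != nil {
		log.Fatal(err)
	}
	fmt.Println("active instance:", owner)
}
```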
Also, if your database goes down in Cloud Foundry, you've got a lot of problems anyway, and orchestrating a database is a lot easier than orchestrating Raft clusters. So you're absolutely right: if your database goes down, you're going to have problems, but that's also the case today; if your database goes down, you're going to have a bad time no matter what happens. This is just one more thing that goes bad when your database goes down. So, you know, there you go. Yeah, maybe we can take this outside after the session. Are there any more questions? I think that's it. Thank you very much. Yeah, thank you.