I'm going to take you on my personal journey with Cloud Foundry. I've been involved with Cloud Foundry for a number of years, and I help companies install, configure, and get started with Cloud Foundry. I'm a consultant for Pivotal. I've worked both here in EMEA and I currently work in the US. And throughout these engagements, especially with this getting-started experience, I typically find I come across the same set of problems and challenges again and again. So I sat down and started to write a book for O'Reilly, just to capture the technical concerns and considerations around installing and configuring Cloud Foundry. And I've got some early releases here at the end, if anyone's interested. This is about half the book, although it's almost finished. Some way into that journey of writing about Cloud Foundry, at the 2014 CF Summit I was introduced to this guy, Onsi. And Onsi took the Cloud Foundry community through some of the changes with the elastic runtime, some of the challenges with the old elastic runtime, and how Diego fixed a number of these problems. Now, this presentation is still on YouTube. It's a really, really good presentation, so if anyone hasn't seen it, I'd strongly recommend that you go and have a look. But just to summarise what Onsi talked about: he talked about the tight coupling of the components and some of the orchestration challenges that the Cloud Controller had, because the Cloud Controller had too much responsibility and too much control over what was going on. He also talked about these triangular dependencies and this poor separation of concerns between some of the components. And this led to a real challenge. I'm just going to switch. Does this work still? Well, cool. So he talked about this poor separation of concerns, and this led to a challenge that Cloud Foundry was hard to reason over, hard to test, and also hard to manage the existing features. But more importantly at the time, it was hard to add new features, and this was a real problem, because back in 2014 Cloud Foundry was just about the app and only supported one type of workload. And it was very platform-specific, meaning it had Linux-specific code hardwired throughout. And so Diego was pitched as this ground-up rewrite of the elastic runtime in Go. In September last year, in 2015, the first GA version of Diego was released. This was the first version that supported zero-downtime deploys. Now Diego is fantastic for the Cloud Foundry community, but as an author trying to write about Cloud Foundry, this was really problematic. It meant I had to go away and refactor 50 pages very quickly, and then, if that wasn't bad enough, Dmitriy came out with BOSH 2.0 earlier this year and it meant I had to refactor another three chapters. So writing about stuff is hard, but when the technology is changing as fast as Cloud Foundry, it becomes even more problematic. And as I was delving into the weeds of Cloud Foundry and really understanding the detail, I stepped back and asked myself this question: why should I care? Not about Cloud Foundry.
I think Cloud Foundry is fantastic, but the detail of Cloud Foundry, the detail of how Diego works, why should I care? And at the back of my mind I had the developer, and thinking as a developer, why should you care about this stuff? Why should you care about Cloud Foundry internals, its abstractions, and how it works? Again, borrowing from Onsi, we have this fantastic developer mantra: here is my app, run it on the cloud for me, I don't care how. And by the developer, I'm not just talking about the person who cuts code, I'm talking about the entire team responsible for developing and running that application. They by and large think of Cloud Foundry as this black box. They push their application, Cloud Foundry does its thing and returns a routable endpoint. So they don't really need to peel back the covers and understand what's going on. And for anyone who's been working with Cloud Foundry for some time, this is the highest level of abstraction that we're used to: developers think about their work, they think about their apps, their tasks, their services. They don't really need to be mindful of what container orchestration you're using or what version of Linux you're running on. All of those are platform concerns, and that undifferentiated heavy lifting is really what the platform is there to take care of for you.

So it got me thinking back to this book. The book isn't really for developers. Cloud Foundry is so easy to use as a developer that if you were going to capture those developer concerns, a book like that would be really small. And so, in true agile fashion, I decided to actually start small and I wrote such a book, so I've got some of these around as well if anyone's interested. This isn't really targeting the developer; rather, it's about the developer experience of delivering agile software with velocity. In my naivety, I thought that small book would take about a week, but six months later I was back to the original book, and I was able to frame my question much better: as an operator, why should I care about Cloud Foundry's abstractions? And this is a really important question to ask, not just for an author trying to understand his audience, but for the Cloud Foundry community as a whole: why should you care about Diego and the internals? And to answer this, you need to understand the context of another question, and that's why distributed systems are so hard in the first place.

So to level set, in case anyone's completely new to this stuff: distributed systems involve a number of components, and they're networked or connected together. And because there are many moving parts, they become inherently complex, because you need to orchestrate them. Halfway through Onsi's original talk, he had this comment, almost an off-the-cuff throwaway comment, but it really resonated with me: within a distributed system, it's hard to have an accurate picture of the world at all times. And when you look at Cloud Foundry, it's a multi-user, multi-component environment; it has a number of microservices; things can fail, usage patterns change, autoscaling can kick in, self-healing can kick in. All of these things can change state, and they need to be tracked and potentially responded to. Moreover, there are just inherent challenges with any distributed system. So for example, heterogeneity, the ability to support multiple different environments. Cloud Foundry manages this really well.
It has a BOSH CPI for supporting different IaaS layers like AWS or vSphere or Azure. We have buildpacks to support that polyglot programming environment, so you can support Ruby, you can support Java and all the languages that run on the JVM, you can support Node, and so on. And it also has this marketplace for the middleware services like Redis and RabbitMQ and MySQL. But all those components need to be extensible; you need to be able to extend the platform. And again, a great example of this is route services: instead of the marketplace just being about the middleware components, you now have services that sit on the route to the application. And again, new CPIs are coming out all the time, with Photon and GCP, and you can also extend the buildpacks, with other buildpacks like the TomEE buildpack coming out. So extensibility is really key.

You also need transparency. Now, a lot of people look at transparency as everything being open and available, but in a distributed system I feel it actually means the opposite. Instead of exposing all the complexity and all the gory details, what distributed systems need to do is hide that level of complexity and make it transparent to the end user, so the end user has a really simple way of interacting with that system.

Distributed systems also need stable APIs; they need well-defined APIs that are stable. An example of a good stable API is effectively USB: it's been around for ages. A bad API would be like the new iPhone headphone jack; it means you have to update your earphones and everything else. So those APIs should be as stable as possible. This gives each component the ability to be independent and decoupled, and that's really valuable, because then you can plug and play those components. You can swap them out, replace them, and the system as a whole carries on working. Just to delve into the details of Cloud Foundry for a second, a great example of stable APIs is the cell. If you look at a Linux cell, applications run on cells; in Linux, they run in runC-backed containers. Containers are managed by Garden, the cell itself is managed by the Rep, and the Rep talks to Diego. Now, there's a decoupled interaction between Garden and the Rep, but there's a strong contract between them. Garden isn't participating directly in that wider distributed system; it only talks to the Rep. You could make breaking changes to the Garden API, but because it's coupled to the Rep, you would also have to change the Rep. By keeping the Garden API stable, you get really strong advantages. For example, you can swap out the back end, which used to be Garden-Linux and is now backed by runC. If there's a CVE, all you have to do is upgrade Garden, and we make a promise to the community that if you're on Diego version X and above, all you need to do is redeploy Garden and you get all the later security fixes. You don't have to change the Rep or any other component, because those stable APIs are really valuable.

Distributed systems also need to support concurrency. When you have many different users, whether they're users of your application or different developers, competing to view and update the same piece of data, the system needs to handle that in a robust way, and things like dynamic routing and distributed locks really help with dealing with those concurrency challenges. So the list of complexity just keeps on growing.
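To make that idea of a stable contract concrete, here's a minimal sketch in Go. This is not the real Garden API; the interface and type names are invented purely for illustration. The point is just that when the Rep depends only on a small, stable interface, the container back end can be swapped or patched without touching the Rep.

```go
package main

import "fmt"

// ContainerBackend is a hypothetical stand-in for a stable, Garden-style
// container contract: the Rep only ever programs against this interface.
type ContainerBackend interface {
	Create(handle string) error
	Run(handle string, path string, args ...string) error
	Destroy(handle string) error
}

// runCBackend is one illustrative implementation; the real back ends
// (Garden-runC, Garden-Windows) are of course far richer than this.
type runCBackend struct{}

func (runCBackend) Create(handle string) error { fmt.Println("runC: create", handle); return nil }
func (runCBackend) Run(handle, path string, args ...string) error {
	fmt.Println("runC: run", path, args, "in", handle)
	return nil
}
func (runCBackend) Destroy(handle string) error { fmt.Println("runC: destroy", handle); return nil }

// Rep depends only on the interface, so swapping or patching the back end
// (say, for a CVE) requires no change here.
type Rep struct {
	backend ContainerBackend
}

func (r *Rep) StartInstance(handle, command string) error {
	if err := r.backend.Create(handle); err != nil {
		return err
	}
	return r.backend.Run(handle, command)
}

func main() {
	rep := &Rep{backend: runCBackend{}} // plug in any backend that honours the contract
	_ = rep.StartInstance("app-instance-0", "/tmp/lifecycle/launcher")
}
```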
There are things like security and self-healing, and monitoring and logging and user management. When you're running a distributed system, especially in a cloud environment, there are many components that go into making that system stable and reliable. So, in essence, they're hard to build and complex in nature, and it's because distributed systems like Diego are complex that there's a lot of value in really understanding them. For the Cloud Foundry operator, there are at least two reasons why you should be concerned about the internals.

The first reason you need to care is because you want to debug your app. If a user is accessing an endpoint, a route that Cloud Foundry serves for an application, and for some reason that route goes away, there could be any number of failures causing that to happen, and without understanding the system, you don't know where to debug. So, for example, your container could have died, or your app may have crashed in the container. The Rep on the cell may have died, so it looks like your cell is offline, or the router may have died, causing the dynamic routing table not to get updated. Any one of these failures could cause that effect, and it's important to understand the flow of communication between the components in the system to really understand what's going on.

The second reason why you need to care about the internals is because you need to establish resiliency. Components like Consul and etcd and MySQL use Raft-style consensus, which means they need a quorum, and to have a quorum you need an odd number of nodes. If you just have a single node and that node dies, you've lost your service. If you have two nodes and a single node dies, you still don't have more than 50% remaining, and therefore the system as a whole will shut down and may need manual intervention to restart. So ideally you need three nodes, preferably in three different AZs, so if you lose a node or you lose an AZ, your system still has quorum and remains operational. Other components, like the Brain, run multiple instances for resiliency, but they can only work in isolation, so they need to establish things like distributed locks to ensure they're not trampling over each other. So again, understanding the system helps you build in the right level of resiliency.

So how does Diego deal with some of these distributed system challenges? The more I looked at Diego, the more I realised it actually has a very elegant approach, and I've loosely termed these Diego's abstractions, and I'm going to take you through some of the high-level ones. Diego itself is a subsystem to Cloud Foundry, so things like Cloud Foundry's Cloud Controller and Cloud Foundry's Loggregator access Diego as a client. Diego is comprised of a number of independent components, and each component hosts a set of microservices scoped to the boundary of that component. And as I mentioned, the important thing to understand is how work flows through the system. Actually understanding the technical implementation of the BBS, and what backs the BBS, is less important; the engineers can swap out etcd for MySQL, for example, and by and large the system should operate in the same way. So Diego is a decoupled infrastructure, and a number of different components are responsible for the orchestration. It handles this through a separation of concerns between these components, and it uses abstraction layers as work flows through the system.
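The quorum arithmetic is easy to check with a tiny Go sketch. This isn't part of any Cloud Foundry component and the function names are mine; it just shows why one or two nodes can't survive a failure but three can.

```go
package main

import "fmt"

// quorum returns the minimum number of healthy nodes a Raft-style cluster
// of the given size needs: a strict majority.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// survivesFailures reports whether the cluster still has quorum after the
// given number of node (or AZ) failures.
func survivesFailures(clusterSize, failures int) bool {
	return clusterSize-failures >= quorum(clusterSize)
}

func main() {
	for _, size := range []int{1, 2, 3} {
		fmt.Printf("cluster of %d: quorum=%d, survives one failure: %v\n",
			size, quorum(size), survivesFailures(size, 1))
	}
	// cluster of 1: quorum=1, survives one failure: false
	// cluster of 2: quorum=2, survives one failure: false
	// cluster of 3: quorum=2, survives one failure: true
}
```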
And this gives the benefit that work can be agnostic, so Diego can now support a richer set of workloads, and it's also container agnostic, both in the image format you deploy to Cloud Foundry and also on the back end, in terms of how you run your applications.

The first abstraction I want to cover is subsystems. And again, this may seem like I'm going off at a tangent, but actually subsystems are really important to Cloud Foundry. Diego has been encapsulated in its own BOSH release, its own code base. And this is really valuable because it's become independently deployable as a standalone system. Standalone systems typically offer a pattern of reuse. Now, I don't know of anyone outside the Cloud Foundry ecosystem using Diego, so why is it so important to have an independent code base? It gives a couple of things. When you start to decompose, conceptually at least, a monolith into subsystems, it allows each individual subsystem to grow and evolve as it needs to. This increases developer velocity, but it also has a really valuable side benefit of testability. The release engineer or the platform operator can take a more granular approach to testing because they have a number of different subsystems. So for example, if you deploy the latest version of Diego but you back it with an old version of Postgres, and you then migrate that version of Postgres to a new version, will Diego still be able to interact with it? So in addition to just having backwards-compatible APIs, you can now take a more granular approach to actually migrating disparate components, so it gives a richer testing cycle. But in addition, for the individual teams it allows for isolated testing, so the owner of a subsystem can focus on just their subsystem. They can have assurance that the teams they depend on have done their due diligence, and they don't have to worry too much about the downstream teams. It means that their view of the distributed system is way smaller, which then automatically allows them to move a lot faster.

The second abstraction is workload abstraction. Most people know that Cloud Foundry has moved beyond just supporting applications to the more generic terms of long-running processes and tasks. A long-running process, an LRP, is Diego's view of your application; it's desired to just keep on running. Diego has a conceptual view of a desired LRP, and there's an actual process running in a container known as your actual LRP. If you scale that process, you then have multiple actual LRPs for your desired LRP. Now, LRPs have a life cycle. I'm not going to dig too much into Diego's state machine, but these life cycles pretty much map to the behaviour you'd expect: if you see your application running, it's likely to be in the running state; if you see it crashed, it's likely to be in the crashed phase of the life cycle. But the reason why life cycles are important to understand is that if you go in and query Diego's database, the BBS, and you see your application in the unclaimed state, it could well be that you haven't provisioned enough resources for your cells. So by understanding something about the life-cycle states, it can help you troubleshoot as well.
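Here's a small, hypothetical Go sketch of the desired-versus-actual LRP idea. The struct and field names are invented, not Diego's actual BBS models; it just shows the platform recording what you asked for, tracking what's actually running and in which state, and working out the gap.

```go
package main

import "fmt"

// DesiredLRP captures what the user asked for: "keep N instances of this app running".
type DesiredLRP struct {
	ProcessGUID string
	Instances   int
}

// ActualLRP is one concrete instance, with a life-cycle state such as
// "UNCLAIMED", "CLAIMED", "RUNNING" or "CRASHED".
type ActualLRP struct {
	ProcessGUID string
	Index       int
	State       string
}

// missingInstances compares desired with actual and reports how many
// instances are not currently running (the gap a scheduler has to close).
func missingInstances(desired DesiredLRP, actuals []ActualLRP) int {
	running := 0
	for _, a := range actuals {
		if a.ProcessGUID == desired.ProcessGUID && a.State == "RUNNING" {
			running++
		}
	}
	if desired.Instances > running {
		return desired.Instances - running
	}
	return 0
}

func main() {
	desired := DesiredLRP{ProcessGUID: "my-app", Instances: 3}
	actuals := []ActualLRP{
		{ProcessGUID: "my-app", Index: 0, State: "RUNNING"},
		{ProcessGUID: "my-app", Index: 1, State: "CRASHED"},
		{ProcessGUID: "my-app", Index: 2, State: "UNCLAIMED"}, // e.g. no cell has enough capacity
	}
	fmt.Println("instances to start:", missingInstances(desired, actuals)) // 2
}
```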
So on to tasks. Tasks are guaranteed to run at most once; they're effectively a singleton, they should always terminate, and they should have a finite running time. Tasks could be anything from a one-off script like a batch job or a cron job, or maybe you're doing some ETL-style processing. Applications themselves can call out and run one-off tasks, so applications can now use this task abstraction. But the other important feature of the task abstraction is that Cloud Foundry itself uses tasks: as an end user you ask Cloud Foundry to run your application, your application goes through the staging process, and that staging request gets encapsulated as an internal Diego task.

So in addition to tasks and LRPs, applications, or actual LRPs, also have this app life cycle. The app life cycle is responsible for building and running and keeping your app running, so there are three distinct phases: there's a builder, which builds your application into a droplet; there's a launcher, which runs your application; and there's also a health check that runs alongside the application in the container. Application life cycles aren't anything new, they've been around for some time. In the old world they were packaged, as buildpacks, on the runner, the DEA, the VM that used to be responsible for running your applications, and you can start to get a sense of this tight coupling between what ran, how it worked, and also where it ran. So this tight coupling was one of the core challenges originally. Buildpacks still exist, but they're part of this more generic execution model, so now they're not coupled to the cell, meaning the cell is significantly lighter weight. And also, because they're decoupled, you have this pluggability: you can have buildpack life cycles, Docker life cycles, Windows life cycles, and if you have a different type of workload in the future you can just add that into the mix (there's a rough sketch of this pluggable life-cycle idea below). So the way it works is, when you need to use something like a buildpack, it gets injected into the cell, so it's decoupled and it's pluggable.

So looking at containers, and I've touched on this before in a previous talk, but I still see a lot of confusion out there in terms of what Cloud Foundry can actually support. In terms of the file system, Cloud Foundry supports Docker images, and it also supports the ability for you to push your application and allow Cloud Foundry to build it into a droplet plus a stack. So the file system is what you actually run; the thing responsible for building and running that isolated process, that container management, is Garden, and Garden can be backed by Linux or it can currently be backed by Windows. But really, when we talk about containers, anything which adheres to the Garden API can be plugged in, so if you want to run on runV or you want to run on bare metal, conceptually all of that can be plugged into that Garden API.

So Diego itself has what we loosely call an action abstraction, and this is fairly complicated, but all the pieces within Diego are totally independent, and that means they can solve the task at hand in isolation. They can choose to have unique expressions of the problem as it flows through the Diego system, and they can choose the right expression depending on the problem being solved.
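And here's that rough sketch of the pluggable life-cycle idea, written as a hypothetical Go interface. The real Diego life cycles (buildpack, Docker, Windows) are separate binaries injected into the cell rather than a Go interface like this, so treat it purely as an illustration of the plug-and-play shape.

```go
package main

import "fmt"

// AppLifecycle is a hypothetical interface for a pluggable app life cycle:
// something that knows how to build an app, launch it, and health-check it.
type AppLifecycle interface {
	Build(appSource string) (image string, err error)
	Launch(image string) error
	HealthCheck(image string) bool
}

// buildpackLifecycle stages source code into a droplet and runs it on a stack.
type buildpackLifecycle struct{}

func (buildpackLifecycle) Build(src string) (string, error) {
	fmt.Println("staging", src, "with buildpacks into a droplet")
	return src + ".droplet", nil
}
func (buildpackLifecycle) Launch(image string) error     { fmt.Println("launching droplet", image); return nil }
func (buildpackLifecycle) HealthCheck(image string) bool { return true }

// dockerLifecycle skips staging and launches a prebuilt Docker image directly.
type dockerLifecycle struct{}

func (dockerLifecycle) Build(src string) (string, error) { return src, nil } // image is already built
func (dockerLifecycle) Launch(image string) error        { fmt.Println("launching docker image", image); return nil }
func (dockerLifecycle) HealthCheck(image string) bool    { return true }

// runApp shows how the rest of the platform stays agnostic to the life cycle in use.
func runApp(lc AppLifecycle, appSource string) error {
	image, err := lc.Build(appSource)
	if err != nil {
		return err
	}
	if err := lc.Launch(image); err != nil {
		return err
	}
	fmt.Println("healthy:", lc.HealthCheck(image))
	return nil
}

func main() {
	_ = runApp(buildpackLifecycle{}, "my-java-app")
	_ = runApp(dockerLifecycle{}, "registry/my-image:latest")
}
```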
Now there's a lot of information there, so let's visualise that. As you cf push as an end user, your app goes to the Cloud Controller; the Cloud Controller then passes it into this bridge layer called the CC bridge; that then goes to Diego's database; it goes on to an auction; and ultimately that request ends up as a scheduled process running in a container. As work flows through the system, it starts really coarse-grained, as this big, boulder-like thing, and it gets broken down until finally it's something very granular: a scheduled process. Now, if everything within Diego were concerned with this scheduled process, it would be incredibly brittle and it would be very complicated for the end user to interact with. So by adopting the ability for each component to have its own view of the work, its own expressions of that work, and its own abstractions, it allows for this plug-and-play model where you can take things out and replace them, but it also allows for this really easy, transparent client interaction. So as this workload gets kicked off, as the Cloud Controller interacts with Diego, it's imperative: it can say "run this" and then leave Diego to do its thing. Diego knows about the desired state, and it tries to rectify the desired state with the actual state and run that workload, so beyond that point the Cloud Controller doesn't need to be overly concerned with how Diego does its job.

So the Cloud Controller, which is Cloud Foundry, and Diego effectively have two different views of the world. Diego's view is very generic, and Cloud Foundry's view is very specific to Cloud Foundry, and this bridge component in the middle allows the translation between the two. For example, when you stage your application, Cloud Foundry knows that the app needs to be staged, but Diego turns that into a generic set of staging tasks and LRPs and so on. And the uploader as well: Cloud Foundry knows that droplets should be uploaded to its blob store and downloaded from its blob store, but the upload action in Diego is only concerned with "upload a file to this URL". It doesn't have to know that the URL is the blob store; it's Cloud Foundry's concern to tell Diego about that.

So Diego's BBS, which is effectively Diego's API, uses something called composable actions, and this is the next layer down from tasks and LRPs. We talked about tasks and LRPs; they get translated into Diego's composable actions, things like a download action to download a droplet, a run action, and so on. And again, these are imperative actions, they're an instruction set, you get this tree of actions, and these instructions end up being the things that actually run on the cell, sorry, in the container. So how do LRPs and tasks translate to these composable actions? Let's take the scenario where you're staging an app, or you're running a staged app. You will have three composable actions: a download action to download the droplet from the CC blob store, another download action to download the app life-cycle binaries, and then a run action to run all of this in a container. The important thing is that this is another level of granularity beyond the LRPs, but it's still describing the desired activity, not the implementation. As I mentioned, implementation details like the fact that you're downloading from the blob store all still come from the Cloud Controller.
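To illustrate the composable-action idea, here's a minimal, hypothetical Go sketch. The types are invented and much simpler than Diego's real action definitions, but they show the same shape: an imperative instruction tree, two downloads followed by a run, described as desired activity rather than implementation.

```go
package main

import "fmt"

// Action is a hypothetical composable action: an imperative instruction
// that can be described (and, on a cell, executed).
type Action interface {
	Describe() string
}

// DownloadAction says "fetch this artifact from this URL to this path";
// it doesn't know or care that the URL happens to be the CC blob store.
type DownloadAction struct {
	From string
	To   string
}

func (d DownloadAction) Describe() string { return "download " + d.From + " -> " + d.To }

// RunAction says "run this command inside the container".
type RunAction struct {
	Path string
	Args []string
}

func (r RunAction) Describe() string { return fmt.Sprintf("run %s %v", r.Path, r.Args) }

// SerialAction composes child actions into a tree executed in order.
type SerialAction struct {
	Actions []Action
}

func (s SerialAction) Describe() string {
	out := "serially:"
	for _, a := range s.Actions {
		out += "\n  " + a.Describe()
	}
	return out
}

func main() {
	// Roughly the shape of "run a staged app": download the droplet,
	// download the app life-cycle binaries, then launch.
	plan := SerialAction{Actions: []Action{
		DownloadAction{From: "https://blobstore.example.com/droplets/my-app", To: "/home/vcap/droplet"},
		DownloadAction{From: "https://blobstore.example.com/lifecycle", To: "/tmp/lifecycle"},
		RunAction{Path: "/tmp/lifecycle/launcher", Args: []string{"app"}},
	}}
	fmt.Println(plan.Describe())
}
```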
So the last abstraction I'm going to talk about is closed feedback loops, and again this is key to keeping that distributed system in sync. The CC bridge has two core components to offer this closed feedback loop: the ability to handle domain freshness, and this is done by nsync, and it's also done by the TPS. The Cloud Controller doesn't speak to Diego directly; it tells the CC bridge about the desired app, and then nsync tells Diego to go and make that so. So Diego has this view of the world: it knows what the Cloud Controller wants, it knows the desired state, and it goes through its auction process to make that a reality. At that point everything's good and Diego is running the required number of apps, and if an app crashes, because Diego knows the desired state, it can bring another instance of that app back. But just stopping there isn't enough, because this is a distributed system, it's eventually consistent, and the Cloud Controller's view of the world may change. Maybe it wants something different, but it sends messages to Diego and one of those messages gets lost. So there's this nsync bulker that periodically checks the Cloud Controller's required state, its desired state, against what's actually running over in Diego, and then, using the Cloud Controller as the authority, the nsync bulker can give Diego a new view of the world, a new desired state. There's also this TPS listener and watcher as well, and this gives the Cloud Controller the ability to query Diego and find out what's going on; it also listens for any crash events and feeds those back into the Cloud Controller. So this closed feedback loop is essential for getting that domain freshness in place, and we see this model again and again: the Brain-to-Rep interaction works in a similar way, where the auction passes information over and then the converger looks at that state and rectifies anything that's missing.

So again, that's a whistle-stop tour through the abstractions. We've talked about subsystems, we've talked about workloads and life cycles, container abstractions, composable actions, and feedback loops. These are Diego's abstractions for dealing with that challenge of complexity. Diego has a layered approach, and that layered approach allows each component to work in isolation, and it also affords this pattern of reuse. Ultimately that gives you better user transparency, and with each subsystem, whether it's the Diego subsystem with respect to Cloud Foundry or the subsystems within Diego, you get this iterative development, this development velocity, and better testability, and you also get this plug-and-play capability as well. So that's it, thank you very much.