Okay. Hello everybody. Thank you for coming today. We are Ronak and Carlo from Rakuten. Rakuten is a big Japanese company that started out in 1999 and has been expanding globally over the last 15 years. It has a lot of services, but it mostly focuses on e-commerce, and it has acquired many companies around the world over the years; you might know Ebates and Viber. We are here today to talk about our experience with Cloud Foundry over the last five years. We will start with that, then focus on what we are doing right now to update our deployment, and then on what we plan to do in the future, at least for the next few years.

So let's start by actually describing what we do with our Cloud Foundry deployment. It's an internal platform as a service for our developers. We already presented this platform a few years back, here at CF Summit in 2013, because at the time we had just forked the platform to implement some of the features that were required at Rakuten. Some of them were eventually implemented in what would become v2 as well; some were never accepted, and to support those use cases we had to keep diverging from the v2 development branch. This allowed us to reach quite a sizable deployment: at one point we were probably the second biggest v1 deployment in the world. It has been running for five years with a team of seven people doing pretty much everything, from user support to operations to development and architecture.

In these years we learned quite a lot about what is good to do with this platform and what is not. The first thing: don't try to make everything fit in. Sometimes an application will have unique requirements, sometimes for good reasons, sometimes not. Don't yield to the "oh, my snowflake is so unique, you need to support it" narrative. In most cases it's the application that needs to be adapted to the platform, not the other way around. Getting a good corporate champion to back you up is very important to make sure this goes the right way. Otherwise what might happen is that you end up forking, and we learned the hard way that this is not a good idea, because there is too much value in what the community is doing, too much momentum. Don't ever think about doing that. Try to build everything you need either on top or on the side. If you're building on top, keep things as neat and lean as possible and stick to public APIs, so that things will not break in the long term.

Another thing, somewhat obvious, is that engineering time does not scale. Every single thing that you leave behind, every single manual step that you keep doing, is going to come back and bite you in the long term. So don't yield to the temptation of turning your cattle into pets, because this really does not scale and you end up with snowflakes everywhere. This doesn't just apply to provisioning; it applies even more to the way you collect, aggregate, and store logs and metrics. Knowing what your platform is doing is vital. You need to have this information in a well-defined place where you can access it immediately, go through it, and correlate events coming from different components. If you don't do that, you are going to spend a lot of time hunting around for the information you're looking for.
Keep in mind, when you design your log collection system, that what works for 100 VMs will probably not work for 5,000. So make sure you build it in such a way that you can swap components in and out quickly, and eventually also run them in parallel, because that allows you to try out new things without breaking your current capabilities. Also, one funny bit: don't share your monitoring system with your users, because otherwise you risk losing visibility exactly when you need it the most. This actually happened to us once, so don't try that. Another important thing to notice is that when you reach this scale, things fail all the time. Assuming in your design or in your components that things are going to work is a mistake at any level, because it's going to come back and bite you, again, at the worst possible time. As an example, we had a log pipeline that was built on the assumption that logs must never be lost under any circumstances. But then, if some of the components at the end of the pipeline start misbehaving or slowing down, the backpressure ripples all the way back and eventually you have problems even in your applications.

So we have had this deployment running for five years and it has served us very well. It may have sounded like we had many problems; there have been problems, but there have also been many success stories, and it has provided tremendous value. But we realized last year that we had to catch up with upstream. It was not a question of whether we should do it, only that we had to do it sooner or later. Now Ronak is going to talk a bit about how we are actually doing this in v2.

So once we started the journey towards the second version of the deployment, the first thing we had to decide was the infrastructure and provisioning side: how we would deliver our internal software tools and everything else. To stay close to upstream we chose BOSH for provisioning, and we use Concourse for delivering our internal tools. On the infrastructure level, most Cloud Foundry deployments use a single infrastructure; we have divided our deployment across vSphere and OpenStack at the same time. The reason is that back then our OpenStack deployment had a limitation where Cinder did not properly support what BOSH needed, until a few days ago, when we got this functionality working properly on OpenStack. So we are now thinking of moving all the components onto OpenStack. When I say living on multiple clouds, it means deploying a single CF on two different infrastructures. We have a BOSH director with the vSphere CPI, which is responsible for provisioning the stateful components of Cloud Foundry, like NFS, Postgres, and etcd, and our other internal tools. It is also responsible for deploying the OpenStack BOSH director, which in turn provisions the stateless components of Cloud Foundry on top of OpenStack. Normally it's easy to provision Cloud Foundry with a single manifest, but in our case we had to make some changes to create two intermediate manifests, one for each infrastructure. We target them individually while sharing the properties between them, deploying the stateful components first, followed by the stateless components. A rough sketch of what that split could look like is below.
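To make the intermediate-manifest idea a bit more concrete, here is a minimal, purely hypothetical sketch in Python. It assumes a combined deployment manifest in which every job carries a made-up `lifecycle` field marking it as `stateful` or `stateless`; the script splits it into two per-infrastructure manifests that share the same `properties` block. File names, field names, and the tagging convention are illustrative assumptions, not our actual tooling.

```python
# Hypothetical helper: split one combined CF manifest into two
# intermediate manifests, one per IaaS, that share the same properties.
# The "lifecycle" tag on each job is an assumption made for illustration.
import copy
import yaml  # pip install pyyaml

def split_manifest(path):
    with open(path) as f:
        combined = yaml.safe_load(f)

    manifests = {}
    for target, kind in (("vsphere", "stateful"), ("openstack", "stateless")):
        m = copy.deepcopy(combined)
        # Keep only the jobs meant for this infrastructure.
        m["jobs"] = [j for j in combined.get("jobs", []) if j.get("lifecycle") == kind]
        m["name"] = f"{combined['name']}-{target}"
        # The shared properties block is carried over unchanged to both manifests.
        m["properties"] = combined.get("properties", {})
        manifests[target] = m
    return manifests

if __name__ == "__main__":
    for target, manifest in split_manifest("cf-combined.yml").items():
        out_path = f"cf-{target}.yml"
        with open(out_path, "w") as f:
            yaml.safe_dump(manifest, f)
        print(f"wrote {out_path} with {len(manifest['jobs'])} jobs")
```

The two resulting manifests would then be targeted at the vSphere and OpenStack directors separately, stateful components first, as described above.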
Okay, so with the previous version of our deployment we had three different deployments, like most people do: dev, stage, and prod. But for a team of seven people that was too much; the overhead of managing three separate deployments was too high. So we came up with this: we give our internal users three different logical environments within a single Cloud Foundry deployment. To achieve it we use three different DEA groups and three different router groups, which are responsible for handling the traffic for each group of applications. Elastic pools are still on the roadmap and will be coming soon, but in the meantime we had to come up with something, so we did what we call a stack hack. There is a property for the rootfs on the DEA where you can change the name of the stack itself, and when you push an application you can specify that stack to place your application on a DEA supporting it. So we created DEA pools on three different networks, with the stacks named after development, staging, and production. When users want to put an application on dev, they just add one extra property to their application manifest, setting the stack to dev, stage, or prod, and the application ends up on the corresponding DEAs (there is a small sketch of what this looks like from the user's side at the end of this part). Provisioning the DEAs on three different networks also gives us isolation, both at the security and at the network level.

This is just an overview of how we deploy the whole platform and which components are in there. With the previous version, as Carlo mentioned, we forked the upstream, which was a bad idea and we won't be doing that again. So we sync up with the upstream BOSH releases, while for the internal BOSH releases, like parts of our logging and metrics pipeline and some user-facing Cloud Foundry plugins, we use Concourse for shipping these internal releases. When we deploy, not on production but on the pre-production environment, it's always good to analyze the behavior of all your components. So we collect as many metrics as possible and, during the deployment, we check the patterns in the graphs for pretty much all the components, followed by a Serverspec run, using BOSH errands, for the individual components. Individual components can work fine on their own, as expected, but sometimes the functional integration between two components can break even when both components are working fine. So here we bring in the integration tester for checking our subsystems, which is again a set of BOSH errand jobs checking the communication between our APIs and our internal tools, followed by the acceptance and smoke tests on the pre-prod environment, to check the uptime of the user-facing functionality and how the platform behaves in case of failures and disasters.
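Before moving on, here is a rough illustration of the stack hack from the user's side. It is a small, hypothetical Python wrapper that rewrites the `stack` field in an application manifest to the chosen logical environment and then shells out to `cf push`; the `dev`/`stage`/`prod` names follow the stacks described above, but the script itself is just a sketch, not our actual tooling.

```python
# Hypothetical wrapper around "cf push" that targets one of the three
# logical environments by setting the application's stack (the stack hack).
import subprocess
import sys
import yaml  # pip install pyyaml

VALID_ENVS = ("dev", "stage", "prod")  # stack names configured on the DEA pools

def push(manifest_path, env):
    if env not in VALID_ENVS:
        raise SystemExit(f"unknown environment {env!r}, expected one of {VALID_ENVS}")

    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    # Pin every application in the manifest to the DEA pool whose
    # rootfs stack name matches the chosen environment.
    for app in manifest.get("applications", []):
        app["stack"] = env

    patched_path = f"{manifest_path}.{env}.yml"
    with open(patched_path, "w") as f:
        yaml.safe_dump(manifest, f)

    subprocess.run(["cf", "push", "-f", patched_path], check=True)

if __name__ == "__main__":
    push(sys.argv[1], sys.argv[2])  # e.g. python push_env.py manifest.yml dev
```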
Now Carlo will tell you more about the other features of the next version of the platform. So, we put a lot of care into designing a new log aggregation and collection pipeline whose primary goal was to completely decouple the producers from the consumers. Every single event, log, and metric generated by any of the VMs we deploy (and we deploy all of them with BOSH) is sent to Kafka: for syslog we use the omkafka plugin, for collectd the write_kafka plugin, and so on. For the application logs and the application metrics, we wrote a component that pulls from the firehose and sends to a per-application topic on Kafka. On the other side, on the right side of the slide, we have all the consumers. We archive logs on blob storage using Secor. We have an ELK stack, and an InfluxDB plus Grafana stack, for logging and monitoring internally, for operational purposes, just for our team. We use Riemann for alerting and complex event processing. All of these components had to be able to scale; that was one requirement we set. That includes components like Riemann that are not naturally able to scale: the Riemann instances themselves are fully stateless in our case, but some of the metrics and events you monitor are stateful, so it's important that the events for one specific component always end up on the Riemann instance that is tasked with monitoring that specific component. What we came up with is actually a pretty clever solution: we redirect the metrics to the proper Riemann instance by using Kafka message keys and the natural partitioning that Kafka gives you. There is a small sketch of this idea just below.
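As a concrete illustration of that Kafka trick, here is a minimal sketch, assuming the kafka-python client and a made-up topic name. The idea is to key each message by the identity of the emitting component, so that Kafka's default hash partitioning keeps all of that component's events on one partition, which is then read by exactly one Riemann instance. Broker addresses, the topic, and the event schema are all illustrative assumptions.

```python
# Hypothetical producer sketch: route each metric to a stable Kafka
# partition by keying on the emitting component, so that all events for
# one component are always consumed by the same Riemann instance.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],  # placeholder brokers
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_metric(job, index, name, value, timestamp):
    # The key is the component identity (job/index). Kafka's default
    # partitioner hashes the key, so the same component always maps to
    # the same partition, and therefore to the same Riemann consumer.
    key = f"{job}/{index}"
    producer.send("platform-metrics", key=key, value={
        "job": job,
        "index": index,
        "metric": name,
        "value": value,
        "timestamp": timestamp,
    })

emit_metric("router", 3, "latency_ms", 12.4, 1466000000)
producer.flush()
```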
Another thing we set out to do is to make sure that we really monitor everything that moves. We collect metrics at all levels, starting from the system, for every component, both the CF components and everything else: if you have Java components, pull everything from JMX; if you have NGINX, pull everything from its monitoring endpoint; and so on and so forth. We also monitor all the systems we depend on, so that we can quickly isolate where a problem is originating from. For example, in the last years we had issues with DNS: at a certain point our DNS system started responding erratically, and the only way we could have caught that fast was by monitoring that DNS system for wrong answers. So we set out to make sure we know this in advance, so we can quickly pinpoint the source of a problem. And then we also capture end-to-end metrics that capture the behavior of more than one system. Partly from a passive point of view: for example, just capturing the latency of requests to a particular application, which is easy. But we also trigger events that we know should take a certain amount of time, and make sure that amount of time stays constant. For example, we have a job that runs every five minutes and pushes an application to all the environments, and we know that that number should stay constant within about 10 seconds, because the application is always the same. So these numbers shouldn't change, either transiently or over the long term.

In v2 we built all of our Rakuten-specific features on top of Cloud Foundry; we basically have nothing on the side. Some of these will be made open source soon, for example the log access part, because we think there is real value in that for the community. Some others are so specific to Rakuten that open-sourcing them makes no sense, but we can talk if you're interested. Moving forward, what we are planning to do next, as soon as we finish migrating our users from the current deployment to the new one, is to target Azure, because one of our requirements is to enable burst-to-cloud scenarios. That is the first scenario; we actually have others as well. Better reliability, so that we have another data center to fall back on, and eventually better latency and performance for our users. We also need to integrate all the service providers: Azure first, and then OpenStack. We have an internal OpenStack team that is working on Trove, and we need to integrate that in order to provide those services to our users. We want to provide HTTP/2 termination as an initial phase, until HTTP/2 is fully supported everywhere, so that applications immediately get part of the benefits of HTTP/2: some of the benefits require support from the application, but part of them should be available right away just by enabling it on the reverse proxy. We are looking into certificate auto-provisioning, meaning users can push their certificates and have them installed everywhere on the load balancers, so that SSL termination is handled automatically. Eventually we want to do Let's Encrypt integration, so that even if you don't have a proper certificate we create a real one for you, and even testing works seamlessly. On auto-scaling, it's not just the application auto-scaling side; it's actually the VM auto-scaling side we are mostly interested in, because it allows us to lower our workload, and that's one of our long-term goals: we want to be able to scale this up as much as possible without increasing the workload on the team, because the team cannot scale that much; it's not easy to find people who can work on these things. And then, eventually, when elastic clusters become available, we are going to look into how to make this work across multiple data centers, so that users can push their application once, have it deployed to multiple data centers, and, via integration with the global load balancer that we have at Rakuten (and eventually also in Azure, or whatever we end up using), have the traffic steered to the right data center automatically.

That's a little bit of what we are planning to do. There is also something we have run into in our contributions to the platform where it would be very nice to have feedback from all of you, if you want, and from the Foundation. Mostly it comes down to missing documentation: there are conventions that are not really well documented, and this causes problems when you open a PR, because obviously they will complain. Many of the jobs are not really designed for collocation; actually, they explicitly warn you that they don't care about it. So knowing at least how to make things on your side work nicely with the other components would be very good in the long term. What we often miss on the BOSH side is the ability to know exactly which VMs are going to be redeployed before you actually deploy, and also the ability to preview how templates will render; that would simplify the job of some of the people on the team. Then there is one big complaint that we get from our users: logs are not always single lines. And about losing logs: it's okay to lose logs, but you should at least know that you are losing them. Finally, we have internal use cases that would benefit greatly from having a way to hook into the Cloud Controller API and perform complex validation and complex authorization of certain operations.
So that's it. We overran our time by a little bit. If you have any questions you can either ask us now or catch us later, anytime. Thank you.