Hello everybody. My name is Oresonuszko and I'm from Rakuten. Rakuten is a big Japanese company, established in 1997, running multiple services and businesses all around the world, mostly focused on finance and e-commerce. In recent years we have been expanding globally and acquiring new companies, some of them very recognizable; Viber is a good example.

Our PaaS is a platform-as-a-service for our internal users, our engineers. It's based on open source, and we already presented this platform at CF Summit in 2013. At that time we had forked the platform because we needed some additional features which were required in Rakuten. Some of them were implemented upstream over the years, others were never accepted, but in order to support our use case we had to keep diverging from upstream. This allowed us to reach a pretty sizable deployment; at a certain point in time we had probably the second biggest Cloud Foundry deployment in the world. All of this was run by a team of seven engineers dealing with a wide variety of tasks: architecture, design and planning, operations, user support, security, and so on.

We learned a lot of things during these years. The first important thing we noticed is: don't try to make everything fit in. Sometimes an application has unique requirements, sometimes for good reasons, but most often not. Of course you need to support your users in moving to your platform, but not at all costs, because it's the application that needs to be adapted to the platform, not the other way around. Having a corporate champion backing you up helps you make sure that things are going in the right direction. Accept from the very beginning that not every application fits your platform; otherwise you might end up forking.

Forking is a very bad idea, please never do that. There is too much momentum, too much value coming from the community. If you want to introduce a feature, try to upstream it first. If you can't, stalk the PMs until they agree. If that still doesn't work, then try to build it either on top or on the side. If you build on top, try to stick to the standard APIs, minimize the number of connecting points between your component and Cloud Foundry, and keep all of them as neat and lean as possible.

Another thing is that engineers' time does not scale. Every manual step that you leave behind is going to come back and bite you, usually at the worst possible moment. When you create your operational tooling, avoid shortcuts; those lead to snowflakes, and snowflakes are really bad. Poor integration with your provisioning API can cause even simple updates, like upgrading CPU, memory, or storage, to take weeks instead of hours.

This also applies to your monitoring system. Build it in such a way that you can access all required logs and metrics from one well-defined place, from where you can correlate events fast. Otherwise you will spend a lot of time logging in to particular VMs and searching for possible causes of errors. Also, whatever works for a few hundred VMs doesn't necessarily work for five thousand VMs, so build your monitoring system in such a way that you can swap components in and out quickly and scale it along with your Cloud Foundry deployment. And, kind of a funny thing: don't share your command view console with your users, otherwise you might lose visibility when you need it the most. That's something that happened to us once.
Assuming that your system is going to work flawlessly all the time, without any interruptions or errors, is a mistake at any level, because if you look at a long enough time window, you will notice that your system is slightly failing all the time. At scale, things break daily. For example, we had a log pipeline built around an agent, with the confident assumption that logs were never going to be lost. But what happens when a component at the end of the pipeline starts misbehaving? You can end up with problems everywhere, all the way back at the source.

Listening to me, you might be under the impression that we had a lot of problems with our platform. In fact we had some, but in general it was a very successful platform which provided tremendous value to our users and customers. Regardless of that, we were aware of the fact that we needed to catch up with upstream. So at the beginning of 2015 we started working on Cloud Foundry version 2, and we managed to bring it into production at the beginning of this year. Currently applications are being migrated, and we are working on this.

If you take a closer look at this slide, you will notice that our deployment differs from the standard one. The reason is that we have one logical Cloud Foundry deployment running on top of two separate clouds. We have a VMware BOSH Director with the VMware CPI, which is responsible for providing the persistent components - etcd, Elasticsearch, mainly databases. On the other hand, we have an OpenStack BOSH Director, which is responsible for providing the stateless components, mainly DEAs and routers. As a consequence, we have two different manifests; we target those BOSH Directors individually, and between the two manifests we have some shared properties. The reason for this split was that at the time we were designing it, OpenStack didn't provide a proper answer for persistent storage - there were issues with Cinder. That has since been fixed, so we would probably now be able to move our whole deployment to OpenStack.

There is one more special thing in our deployment. Most users, when they deploy their Cloud Foundry instances, deploy three separate environments: dev, staging, and prod. For us, a team of seven people, that was a little too much; it generated too much overhead. So we implemented all three environments - dev, staging, and prod - in one single Cloud Foundry deployment. To achieve this, we split the routers into three groups (dev, staging, and prod), and the DEAs likewise into dev, staging, and prod. In addition, we collocated the gorouters with nginx, acting as a layer-7 reverse proxy enforcing ACLs. So, for example, traffic which is going to end up on staging always passes through the staging router group. This ensures network isolation, resource separation, and security.
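To give a flavor of that idea, here is a minimal sketch in Go of a layer-7 ACL sitting in front of a router group. Our real implementation is nginx collocated with the gorouters; the CIDR ranges and the gorouter address below are hypothetical placeholders, not our actual configuration.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical networks allowed to reach this router group (e.g. staging).
var allowedCIDRs = []string{"10.10.0.0/16", "192.168.56.0/24"}

// allowed reports whether the client IP falls inside one of the permitted networks.
func allowed(remoteAddr string) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	for _, cidr := range allowedCIDRs {
		_, network, err := net.ParseCIDR(cidr)
		if err == nil && network.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	// The collocated gorouter for this group; the address is an assumption.
	target, err := url.Parse("http://127.0.0.1:8081")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	// Reject traffic from outside the environment's networks before it
	// ever reaches the router group.
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !allowed(r.RemoteAddr) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", handler))
}
```

The point is simply that a request reaches an environment's routers only if it originates from a network allowed for that environment.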
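As for the shared properties between the two manifests I mentioned, something along these lines can check that the two halves stay in sync. This is a hypothetical sketch rather than our actual tooling; the file names and the list of shared property paths are made up for illustration.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"reflect"

	yaml "gopkg.in/yaml.v2"
)

// Hypothetical property paths that both manifests must agree on.
var shared = []string{"nats", "loggregator_endpoint"}

// props loads the top-level "properties" section of a BOSH manifest.
func props(path string) map[interface{}]interface{} {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}
	var manifest struct {
		Properties map[interface{}]interface{} `yaml:"properties"`
	}
	if err := yaml.Unmarshal(data, &manifest); err != nil {
		log.Fatal(err)
	}
	return manifest.Properties
}

func main() {
	// File names are hypothetical: one manifest per BOSH Director.
	vmware := props("vmware-manifest.yml")       // stateful half
	openstack := props("openstack-manifest.yml") // stateless half

	ok := true
	for _, key := range shared {
		if !reflect.DeepEqual(vmware[key], openstack[key]) {
			fmt.Printf("shared property %q differs between manifests\n", key)
			ok = false
		}
	}
	if !ok {
		os.Exit(1)
	}
}
```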
How do we deploy? First of all, we use BOSH, because we want to stick with upstream. As our CI tool we chose Concourse, because we found it very convenient; it was built with Cloud Foundry in mind. We use Concourse for deploying our internal BOSH releases, as well as for our CF plugins. Initially we deploy with BOSH Lite and run some basic tests there. Then we deploy to a pre-prod environment, where we collect more data - many different metrics - to observe how it behaves. Finally, we deploy to prod and rerun the same tests. During the deployment we use BOSH errands for single-component smoke tests, and we also use errands for more advanced tests which involve more than one component - checking whether several components together provide the expected feature.

We put a lot of care into designing our log pipeline in Cloud Foundry v2. The main goal for us was to decouple consumers from producers, and we managed to do so by using Kafka. Kafka is the central point: on the left-hand side we have producers, on the right-hand side we have consumers. Our policy is to collect everything - whatever moves, whatever produces any kind of data, whether it's a metric or a log line, it ends up in Kafka. We even wrote a component which pulls data from the Firehose and puts it into Kafka, one topic per application. The broker stores all of this data for three days, so every log we can imagine is there for three days and we can pull any kind of data we want. For further processing we use, for example, an ELK pipeline; we also use Riemann, InfluxDB, and PagerDuty for alerting purposes, and we push our logs to a blobstore. That's the basic overview of our log pipeline.

As I told you, we collect as many logs and metrics as possible, but this is not enough. For example, we realized that we also have to watch some other components, mostly external ones which are not in our scope but which we depend on - for example, DNS. We had some issues with DNS in our previous deployment: at some point DNS started behaving unpredictably and responding erratically. If we had had this check implemented, we would have been able to catch the issue much faster. We also perform some more advanced tests, which include end-to-end metrics and active checks. For example, we have a job which pushes an application every few minutes; because this application does not change, the deployment time should be constant.
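A minimal sketch of such an active check, assuming the cf CLI is installed and already authenticated; the app name, path, and interval are hypothetical.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	for {
		start := time.Now()
		// Push the same unchanging canary app; any drift in push time
		// signals a platform problem. App name and path are hypothetical.
		cmd := exec.Command("cf", "push", "canary-app", "-p", "./canary")
		if err := cmd.Run(); err != nil {
			log.Printf("push failed: %v", err)
		} else {
			// In our pipeline this duration would be shipped off and
			// alerted on; here we just log it.
			log.Printf("push took %s", time.Since(start))
		}
		time.Sleep(5 * time.Minute)
	}
}
```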
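The DNS check I mentioned can be as simple as resolving a well-known name on a timer and alerting when it fails or slows down. A sketch using only the Go standard library; the hostname is a placeholder.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

func main() {
	var r net.Resolver
	// Hypothetical internal name that should always resolve quickly.
	const name = "api.cf.example.com"

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		start := time.Now()
		addrs, err := r.LookupHost(ctx, name)
		cancel()
		if err != nil {
			log.Printf("DNS check failed after %s: %v", time.Since(start), err)
		} else {
			log.Printf("resolved %s to %v in %s", name, addrs, time.Since(start))
		}
		time.Sleep(30 * time.Second)
	}
}
```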
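And for the Firehose-to-Kafka component, a heavily simplified sketch of the idea, using the open-source noaa consumer and the sarama Kafka client. The endpoints, subscription ID, and token are placeholders, and a real nozzle would also handle reconnects, other envelope types, and proper serialization.

```go
package main

import (
	"crypto/tls"
	"log"

	"github.com/Shopify/sarama"
	"github.com/cloudfoundry/noaa/consumer"
	"github.com/cloudfoundry/sonde-go/events"
)

func main() {
	// Traffic controller URL and OAuth token are hypothetical placeholders.
	firehose := consumer.New("wss://doppler.cf.example.com:443",
		&tls.Config{InsecureSkipVerify: true}, nil)
	msgs, errs := firehose.Firehose("kafka-nozzle", "oauth-token")

	producer, err := sarama.NewSyncProducer([]string{"kafka.example.com:9092"}, nil)
	if err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case env := <-msgs:
			if env == nil || env.GetEventType() != events.Envelope_LogMessage {
				continue // a real nozzle also forwards metrics, etc.
			}
			lm := env.GetLogMessage()
			// One Kafka topic per application, derived from the app GUID.
			_, _, err := producer.SendMessage(&sarama.ProducerMessage{
				Topic: "app-" + lm.GetAppId(),
				Value: sarama.ByteEncoder(lm.GetMessage()),
			})
			if err != nil {
				log.Printf("produce failed: %v", err)
			}
		case err := <-errs:
			log.Printf("firehose error: %v", err)
		}
	}
}
```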
So those are our Rakuten-specific features. Most of them are probably not applicable to you, but some of the less Rakuten-specific ones might be; log access, for example, could be useful to you, so we are thinking about open-sourcing it.

What's next in Rakuten? We are working on burst-to-cloud scenarios; first we want to target Azure, but this is not the only option we are considering. We also have many different services - for example, we have an OpenStack team working on Trove - and we want to integrate service brokers with those services. We also want to enable SSL certificate auto-provisioning: instead of filling in complicated forms and going through a complicated workflow, a user could just push an application along with a certificate and have it installed on the load balancers automatically. We are also thinking about auto-scaling - by that I mean mainly VM auto-scaling, because our team can't scale that much, so we are thinking about different ways to offload our team. And once elastic pools are available, we are going to enable our users to target their application once and then deploy it on multiple clouds.

There are some areas in which we are expecting, of course if you want, some feedback from your side. Most of our concerns are related to missing documentation and unknown standards; this also applies to job collocation, which requires some additional tweaks to work properly. When you open a PR, people usually complain, so if there were standards in place, it would be much faster and easier to discuss the issues. It would also be nice to have the possibility to preview rendered BOSH job templates before the job gets deployed, because right now this is very time-consuming. Some of our users complain about multi-line logs, and we as the PaaS team would like to be able to reliably notify our users when they are losing logs, which sometimes happens. We are also interested in hooks: most of our organizations have specific requirements that have to be applied somehow when an application is pushed to Cloud Foundry, and by hooking into the API we would be able to simplify this process and make our lives easier.

That's all I have. Thank you for listening. If you have any questions, feel free to drop me an email or just catch me later - I think I'm big enough to be easily found somewhere here. Thank you.