All right, welcome to this talk about principles and strategies of data service automation. A few words about myself: my name is Julian Fischer, I'm the guy on the right side, and I'm CEO of a company called anynines. We are a consultancy with a focus on Cloud Foundry and Kubernetes operations as well as data service automation. So I'm trying to share some of the experience we've collected over the past four or five years automating the life cycle of a little more than half a dozen data services, and to present some of the strategies here with you.

The first thing I would like to point out, before we come to the principles themselves, is that when you approach the automation of a data service, a database, a message queue or whatever, it is very important that you give yourself a mission, because there are so many things you can automate and so many ways you can do it. I've seen it a couple of times when customers approached this topic that they ended up in a very big chaos, because different teams held the responsibility for particular data services, and in the end every data service was a little different, with a different operational model. That created a lot of effort to maintain and operate them.

A mission is meant to give you motivation and strength, strength to survive setbacks, and also to provide a compass effect, enabling autonomous decisions within your team. I'm just giving you one example; your mission statement could look different. At anynines we decided to strive for fully automating the entire life cycle of a wide range of data services, to run on cloud-native platforms, across infrastructures, at scale. There are already a few constraints mentioned here that will change the way we approach the automation, and that will give us direction. Its main purpose is to narrow down those endless possibilities and provide us navigational means for design decisions.

When we now look at the principles, you will see that if you change the mission statement significantly, it may have an impact on the principles as well. Most of them, however, are quite generic, so you should be able to translate them to other data service automation missions as well.

The first and most important principle, which is also reflected in the title of the talk, is that we want to have a mantra: automate, automate, automate. Why is the automation of data services so important? From the experience of consulting many customers on their journey of digital transformation, we've seen that the adoption of application platforms usually comes with ignoring the data service topic in the first phase. There are data stores such as Oracle databases somewhere in the enterprise, and in the beginning people are very busy with learning how to do agile development, how to do microservice architectures and so on. But over time you realize that such an application platform only works when there is a counterpart on the data service side: something that provides, just for data services, the same user experience a platform like Cloud Foundry provides for applications.
Once you see that, it becomes clear that automation is very important and that data services need to experience the same automation as applications. In modern application development you are looking at microservice architectures, where monolithic applications have been split into application systems. More applications exist, and subsequently more data services exist, as the local autonomy of those teams leads to decisions that are more specific to a particular service: one service could have a relational database, another a document database, and so on. The consequence is that nowadays there are many more applications than before, and these applications have more data services and more data service instances. As we still aim to improve the overall speed of innovation, we have to reduce the operational friction that comes with that increasing number of applications and data services. Automation therefore becomes a competitive advantage among platforms, and that is why the automation of data services matters.

Referring back to the mission statement, there was the phrase "fully automate the entire life cycle". This is what full life cycle management actually refers to: it comprises the challenge of automating everything a database administrator would do. The reason this has to be pointed out is that there are a lot of automation solutions out there that claim to be data service automation solutions but ignore second-day operations. It's one thing to have, for example, a MySQL database up and running; we did that with Chef seven years ago. But in contrast to application containers, your databases have state, and the life cycle of a database usually spans years, so you have to guide that database through very different scenarios, such as a failing host underneath, patch-level releases, minor releases and even major releases.

So if you look into the question of what the life cycle of a data service actually is, the first thought is about the life cycle of a single database, or data service instance. In the terminology of the Service Broker API it is called a service instance, and at some point you will create such a service instance. This diagram is not exact, but it intends to express that the engineering effort of creating a service instance is not all you have to think about, because there are other life cycle operations, such as the various possible changes to an existing service instance, that will eat up a lot of your engineering effort as well. If you start automating a data service, it's wise to iteratively increase the depth of the automation: start with the low-hanging fruits, those life cycle operations that are most important and most frequently requested. After a while this may look like this: you have covered the life cycle of a data service instance, creating a data service in clustered or single versions. With Postgres, for example, you could have a single-VM Postgres plan and a clustered Postgres plan, both in different sizes. You could have means to go through version updates, activate or deactivate Postgres plugins, create backups and restore them. This is the life cycle automation of a service instance.
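To make the Service Broker API terminology a bit more tangible, here is a minimal sketch, in Go, of what a provisioning call looks like on the wire according to the Open Service Broker API. The broker URL, the credentials and the service and plan identifiers are placeholders; in practice a platform such as Cloud Foundry issues this call for you when a user runs cf create-service.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ProvisionRequest mirrors the body of an Open Service Broker API
// provisioning call: PUT /v2/service_instances/{instance_id}.
type ProvisionRequest struct {
	ServiceID        string `json:"service_id"`
	PlanID           string `json:"plan_id"`
	OrganizationGUID string `json:"organization_guid"`
	SpaceGUID        string `json:"space_guid"`
}

func main() {
	body, _ := json.Marshal(ProvisionRequest{
		ServiceID:        "postgresql-service-id",    // placeholder IDs from the broker's catalog
		PlanID:           "postgresql-cluster-small", // placeholder
		OrganizationGUID: "org-guid",
		SpaceGUID:        "space-guid",
	})

	// accepts_incomplete=true signals that asynchronous provisioning
	// (e.g. waiting for VMs to come up) is acceptable to the caller.
	url := "https://broker.example.com/v2/service_instances/my-instance-id?accepts_incomplete=true"
	req, _ := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Broker-API-Version", "2.13")
	req.SetBasicAuth("broker-user", "broker-password") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// 201 means the instance was created synchronously,
	// 202 means provisioning is still in progress.
	fmt.Println("broker responded with", resp.Status)
}
```

The same endpoint family carries the rest of the instance life cycle: a PATCH to the same URL for plan changes and upgrades, and a DELETE for deprovisioning.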
This service instance life cycle automation covers very essential parts of what we are talking about here, but I have to point out that it is not everything. A few years back, and I've been giving talks about this topic for four years now, I would have said we are done here. We've been taught otherwise, because if you look into the total effort you have to spend on this topic, you will see that there are other aspects that consume significant time: the general release management, the delivery of the automation releases into the platform environments, and then, in the end, the automation of the life cycle itself.

So look at a value chain like this one; we're using Postgres as an example. On the left side there's an open-source database, Postgres in this case, but it could be MongoDB, Redis or RabbitMQ. Whenever there's an upstream change, we would like to have an automation release shortly after. That automation release still needs to be shipped into the target environments, and from there platform users can use it to create service instances. Life cycle management then means guiding that whole system, that whole delivery chain, through the life cycle of the particular data service. But we come back to that a bit later.

If you start automating, it's kind of obvious that the effort will differ depending on the choice of your data service, as some data services have been created with manual operations in mind. Postgres, for example, is a data service with a lot of legacy in its DNA, because it emerged at a time when automated data service operation was not a thing. Consequently it's a bit more complicated to automate than a service such as Elasticsearch, which is inherently more modern and, maybe also because of its nature, easier to operate and therefore easier to automate.

So there are decision factors you should be investigating when collecting candidates for your platform, and they are numerous. This list is far from complete, but it demonstrates the complexity of such a decision. In most cases you will also look at legacy use cases, applications that demand that you automate legacy services. Postgres is a good example: as every platform needs a relational database management system, you are looking into Postgres and MySQL for sure. This leads us to the data service categories you will have to cover: document databases, caches, key-value stores, message queues and so on. All these services will differ in how robust they are, whether they are clusterable, how they handle replication and so on. So once you've picked your data services, that choice largely determines how much effort you will spend on automation. From experience I can tell you that there's a vast difference between one end of the spectrum and the other. I also believe that over time data services will become easier to automate, as automation will have a certain impact on the design of data services in the long run. You can see that with Postgres, which has recently introduced client-side failover, a big step that makes automation easier, for example.

Another principle when automating data services is that you should design for scale, and the reason is that platforms such as Cloud Foundry require a certain scale to be economically viable.
When doing platform consulting, we've seen that most of the organizations we've been talking to have made investments of millions into building such a platform, because it simply takes a lot of effort to move a group of people to adopt such a new thing and change the way they develop software. In an organization it's very expensive to do that. So a platform that's economically viable has, after a while, hundreds if not thousands or tens of thousands of applications; otherwise it's just a very, very expensive hobby. For that reason, the data service automation should keep up that standard: whatever Cloud Foundry is for applications, the solution we are looking for is that for data services.

Now, it's a fact that if you want to do something at scale, that changes the optimal technical solution for it. I've been doing software development for quite a while, and while this seems obvious, people tend to ignore that it is the case for software too. So I prepared this little funny comparison: three different technical solutions to do one thing, which is barbecuing sausages. It's just the scale that differs. The one on the left and the one on the right do the same thing at different scales, and it's pretty much the same for your data service automation.

When I talk about the scale of a platform, we are talking about thousands or tens of thousands of data service instances. If your automation would not allow you to do that, then you should maybe rethink your mission statement, because this is basically what a platform is designed to do for applications, and most likely you will need that for data services as well. But the number of service instances alone is a particularly bad metric to describe the scale of a solution, because if you think back to the value chain we've just seen, there are many influencing factors: the frequency of application writes and reads on a data service instance, the amount of data written to it, the number of coexisting service instances, the data service types being available, the number of environments you're deploying to, and the number of infrastructures you're deploying to. We can simplify that and say there are certain categories of scale, and the most important ones are those that describe the service instances themselves, those that describe the service broker and its automation, and those around release management and delivery.

If you look back at the diagram where the entire value chain is shown, one of the aspects is that in a platform, a user should at any time have the possibility of on-demand self-service: if you wake up at three o'clock in the morning and you want to create a Postgres database, there you go. The dominant pattern for this today is the on-demand provisioning of dedicated service instances, where such a service instance is represented by a virtual machine, a container, or a set of virtual machines or containers. This guarantees horizontal scalability, in contrast to shared clusters, where you have a fixed set of virtual machines which you then slice up into service instances. On-demand provisioning scales up to the boundaries of your infrastructure, but it is not architecturally limited in the number of service instances. Obviously, when talking about cloud solutions, you also want vertical scalability: you can stop an application and just restart it as a bigger instance, and you want to do the same with your database, with the difference that you have to ensure the data is still there after scaling up, and you want to minimize the downtime. So much for service instance scalability.
Now, if you provide a platform to your users: you've seen the impact of the NoSQL movement over the years, in that more database types have become popular, and with the splitting of applications from monoliths into microservice-based architectures, the choice of particular specialized data services is put upon local teams. So it is likely that your users will demand more and more data services over time, as these may solve certain kinds of problems very well. This also requires that, when you automate data services, you ensure you can automate a larger set of data services efficiently.

Once you have those automation releases, you have to ensure that you ship them into the customer environments. Just to give you an understanding: we deliver data services into many different organizations, and each customer has its very own 300-page handbook on security. So this delivery pipeline is highly customer-specific and includes pen tests, security tests, vulnerability scans and so on. While all this has to be part of your delivery pipeline, because the customer expects it from you, you still have to be very quick in adjusting it to the customer's needs and keep up the pace, so that this delivery pipeline does not slow down the delivery cadence of your overall solution.

What I'm saying is: you have to get your release management right, because delivering fast is very important. Look at this scenario. In the past there was a DBA who had access to the upstream source code of, for example, Postgres, and could apply a patch to a database directly. Now we create a Postgres automation release and ship it to the customer environment, and then it is up to the platform user to update their service instance. So we have three things to do. First of all, the automation release should be available shortly after the upstream change. In our case, for example, this is done entirely automatically: if Postgres 9.4.2 is released as the successor of 9.4.1, a few minutes later we have an automation release that triggers the execution of a comprehensive test suite, to ensure that the contract between the data service and our automation is still intact. Second, you have to ship that to the customer as fast as you can. Minimizing that time requires having a framework, a template, for such a delivery pipeline, which is then customized with the specifics of a certain customer. Even a single customer usually has half a dozen or more of those Cloud Foundry environments, for example, so this is already a matter of scalability, and you have to be quick with it. And third, the platform users now have to update their service instances. Once you have installed the new version of Postgres with your new automation into the cloud environment, you still have hundreds of those database instances, held locally by different users, maybe customers of the customer, who need to be encouraged to update their service instances, and to do that fast, because you want to deliver security patches fast, right? To minimize that time you have to provide your users with means to do that easily: for example, they just perform a cf update-service and the update takes place automatically.
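To illustrate the first of those three steps: the concrete CI tooling is customer- and vendor-specific, but the watcher idea itself is simple. Here is a deliberately simplified Go sketch; latestUpstreamVersion and triggerPipeline are hypothetical stand-ins for whatever release feed and pipeline system you actually use.

```go
package main

import (
	"fmt"
	"time"
)

// latestUpstreamVersion stands in for querying the upstream project's
// release feed (for example its tags). Hardcoded here for illustration.
func latestUpstreamVersion() string {
	return "9.4.2"
}

// triggerPipeline stands in for kicking off the CI pipeline that builds
// the automation release and runs the contract test suite against the
// new upstream version.
func triggerPipeline(version string) {
	fmt.Println("building and testing automation release for", version)
}

func main() {
	known := "9.4.1"
	for {
		// Whenever a new upstream version appears, build and test
		// a matching automation release right away.
		if v := latestUpstreamVersion(); v != known {
			triggerPipeline(v)
			known = v
		}
		time.Sleep(10 * time.Minute) // polling interval is arbitrary
	}
}
```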
Another strategy, and this is basically essential to any cloud operation, is the approach of rebuilding things instead of fixing them. This is not new. The Linux kernel, for example, in the early days had a lot of code for failure recovery, and some developer said: why don't we just take out that entire code, which saves us about 50% of the code base, throw a kernel panic and shut the whole thing down, so that somebody has to restart that server? If you're running Linux today, you will still see kernel panics. So it's a strategy that works today as well.

Coming back to data services, it's not as easy. While Cloud Foundry will just restart your application somewhere in the cluster when an instance dies, that's not as simple with your data service, because you have state, and that state needs to be stored somewhere. So one of the more essential strategies in cloud operations today is to separate the life cycle of the ephemeral virtual machine where your database is running from the persistent disk where the database stores its data. That is meaningful because the machine where your database is executed is a cheap off-the-shelf server, or at least not a high-end server anymore, whereas the storage server is very specific, surely has hardware redundancies to a certain degree, and has a very different service level and availability than the host machine where your database runs. So look at the virtual machine that contains your database: the virtual machine goes away, but your persistent disk prevails. It can be gracefully unmounted, a new virtual machine can be created, you remount the persistent disk, and there you go. We are using BOSH to automate the creation of service instances, and this is something that happens within minutes. If you're using Kubernetes, the restart of containers is even faster, so this can happen within seconds.
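To spell out the order of operations in that recovery flow, here is a minimal Go sketch against a hypothetical infrastructure client. The interface is made up for illustration only; BOSH and Kubernetes implement this pattern for you.

```go
package recovery

import "fmt"

// Infrastructure is a hypothetical stand-in for an IaaS client.
// The point is the order of operations, not the concrete API:
// the persistent disk outlives the VM, so recovery means
// re-creating the ephemeral part and re-attaching the durable part.
type Infrastructure interface {
	DetachDisk(diskID string) error
	CreateVM(template string) (vmID string, err error)
	AttachDisk(vmID, diskID string) error
	StartDatabase(vmID string) error
}

// RecoverInstance replaces a failed database VM while preserving its data.
func RecoverInstance(iaas Infrastructure, diskID string) error {
	// 1. Make sure the disk is no longer bound to the dead VM.
	if err := iaas.DetachDisk(diskID); err != nil {
		return fmt.Errorf("detach disk: %w", err)
	}
	// 2. Re-create the ephemeral VM from the same template.
	vmID, err := iaas.CreateVM("postgres-node")
	if err != nil {
		return fmt.Errorf("create vm: %w", err)
	}
	// 3. Re-attach the persistent disk: the data is still there.
	if err := iaas.AttachDisk(vmID, diskID); err != nil {
		return fmt.Errorf("attach disk: %w", err)
	}
	// 4. Bring the database process back up on the new VM.
	return iaas.StartDatabase(vmID)
}
```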
Be aware, though, that you shouldn't place your data service workloads on a Kubernetes cluster just yet. We are currently investigating deploying data services to Kubernetes, but we wouldn't recommend it until Kubernetes has solved the I/O isolation issue it currently has: you cannot ensure that two database nodes, that is two database instances co-located on the same Kubernetes node, won't affect each other's I/O performance, which is something very relevant for a database.

Now, all this automation is very interesting and very nice; we've been through the creation of a database and so on. But look: your users now have the possibility to create a thousand Postgres clusters, MongoDB clusters, RabbitMQ clusters, just like that. Imagine you have hundreds of customers who are using that, and your infrastructure has a hiccup, or for some other reason a larger number of clusters fail simultaneously. You still need to ensure this on-demand service. So how can you do that? RabbitMQ, for example: if you use RabbitMQ wrongly, you can easily kill such a cluster. So how would you recover? Whatever the data service, it will always be important to have such a recovery as a last resort. So if you want to automate data services, you have to get your backup strategy right. Going back to the mission statement, where we want to automate a larger set of data services, you have to find a way to deal with the heterogeneity, as the backup procedures of different data services can be vastly different. By providing a unified backup API you can abstract from this heterogeneity at some point, and then settle on more generic recovery logic, for example for recovering data service instances. This makes the existence of a backup framework necessary, where you have data-service-specific plugins on one side, storage-specific plugins on the other, and in between a filter chain that does compression and encryption.
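Our concrete framework is not public, but as a sketch of the shape such a unified backup API can take, here are hypothetical Go interfaces: a data-service-specific plugin on one side, a storage plugin on the other, and a filter chain in between, reduced to compression here for brevity.

```go
package backup

import (
	"compress/gzip"
	"io"
)

// ServicePlugin hides how a particular data service produces and
// consumes backups (pg_dump for Postgres, an RDB snapshot for Redis, ...).
type ServicePlugin interface {
	Backup() (io.ReadCloser, error) // stream out a consistent backup
	Restore(from io.Reader) error   // load a backup into the instance
}

// StoragePlugin hides where backup artifacts live (S3, Swift, NFS, ...).
type StoragePlugin interface {
	Store(name string, from io.Reader) error
	Fetch(name string) (io.ReadCloser, error)
}

// Run streams a backup through the filter chain into storage. Only a gzip
// filter is shown; an encryption filter would wrap the stream the same way.
func Run(name string, svc ServicePlugin, store StoragePlugin) error {
	src, err := svc.Backup()
	if err != nil {
		return err
	}
	defer src.Close()

	pr, pw := io.Pipe()
	go func() {
		gz := gzip.NewWriter(pw)
		_, err := io.Copy(gz, src)
		gz.Close()
		pw.CloseWithError(err)
	}()
	return store.Store(name, pr)
}
```

The payoff of this abstraction is that the recovery logic on top never needs to know whether it is restoring a Postgres cluster or a RabbitMQ node.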
Now, it's a pretty interesting time, because on the one hand we have Cloud Foundry, which allows us to run applications; underneath we have BOSH, which can also be used to provision data services; and at the same time we see Kubernetes becoming more and more popular. I would bet that at some point the isolation problem mentioned earlier will go away, and Kubernetes will be a fair choice for data service automation. So we should be prepared to ship into different automation formats, including BOSH releases, PCF tiles and Kubernetes Helm charts, for example. This raises a question, and it's more of a rhetorical one, because at this point you have to somehow organize the answer to it: what is actually shared across the automation behind BOSH releases, PCF tiles and Kubernetes Helm charts? The answer is your operational model. Once you have solved the magic of organizing a Postgres cluster, its cluster management, its failover, its leader election and the propagation of a new master node, this logic will basically stay the same, because the underlying principles are the same. Whether you are restarting a container or a VM is pretty much the same thing, and I bet you could even use the same test suite, as long as you abstract the actual deployment.

We use open-source databases because we believe in an open standard between the application and the database: you can get them from everywhere, which prevents lock-in. The automation language, the language you as a platform user speak to trigger the creation of a database, should be an open standard as well, and the Open Service Broker API does a really good job here. This avoids vendor lock-in: as long as you have Postgres on the left side and the Open Service Broker API on the right side, you can exchange the automation in the middle.

Looking at this mission of having a broader set of data services, one of the things that ensures a fast release cadence is to never touch the upstream source code, with the exception of temporary hotfixes that have to be monitored as long as they are applied. You wouldn't do that in the long term, because the release cadence becomes slower as you have to maintain the manual harmonization between your patch and upstream changes. That's why the only sustainable way of changing the source code is a pull request against the upstream project.

Another strategy, and this is a major one, is that you should solve any issue at the framework level if possible. You shouldn't approach the automation of a broader set of data services without having a framework that contains solutions to the most common issues. For example, the on-demand provisioning of dedicated instances is something we've put into a framework. Whenever we automate a new service, we can rely on that functionality, and all we have to do is create a BOSH release and a component that manages credentials, as well as a backup-and-restore plugin. That boils down the automation of a new service to around four to eight weeks, which is a pretty nice thing to have. The reason this is possible is that for any problem we have, we always ask: is this a framework-level problem or a data-service-specific problem? Only if it cannot be solved at the framework level, or there is great benefit in solving it for a specific data service, do we solve it specifically.

One of the issues you can, for example, solve at the framework level is protecting your data service instances against such a trivial thing as a full disk. You would be surprised how many data services respond to that event with horrific failure and data loss. So a parachute mechanism that prevents that is one of the things you can build once for all data services and then reapply.
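A minimal sketch of such a parachute, assuming Linux disk statistics and a hypothetical makeReadOnly hook: for Postgres that hook could, for instance, flip the instance to read-only transactions; for other services, whatever stops writes gracefully.

```go
package parachute

import (
	"log"
	"syscall"
	"time"
)

// usedFraction reports how full the filesystem behind path is.
// syscall.Statfs is Linux-specific, which matches the VMs and
// containers the data services typically run on.
func usedFraction(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks)
	free := float64(st.Bavail)
	return 1 - free/total, nil
}

// Watch checks the data disk periodically and pulls the parachute
// before the disk runs completely full. makeReadOnly is a hypothetical
// hook that stops writes in a service-specific, graceful way.
func Watch(dataDir string, threshold float64, makeReadOnly func() error) {
	for range time.Tick(30 * time.Second) {
		used, err := usedFraction(dataDir)
		if err != nil {
			log.Printf("parachute: cannot stat %s: %v", dataDir, err)
			continue
		}
		if used >= threshold {
			log.Printf("parachute: disk %.0f%% full, stopping writes", used*100)
			if err := makeReadOnly(); err != nil {
				log.Printf("parachute: failed to stop writes: %v", err)
			}
			return
		}
	}
}
```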
Obviously, we've seen that the release cadence along the entire value chain is very important to us. Without release tests, how would you deal with the uncertainties of those upstream changes? How would you detect that the contract is broken between the Postgres version, the new one you just picked up, and your automation release? Trivial things such as a configuration parameter that has changed, or a default value that has been introduced, will invalidate some of your assumptions, and especially when dealing with clusters there's a lot of complexity. So you should be able to test those things and provide a comprehensive test suite that walks each release through the life cycle of such a database. For example, we extensively test the reintegration of a failed master in Postgres, just to give you one example. This gives you the confidence that a new automation release you have created for an upstream change will be working when you deliver it. More than that, you should also take whatever you've written into your runbooks in the past years and add it to your test suite, so that a failure never occurs twice.

This talk would have been at least three times as long, so I had to narrow it down. I will also release a blog post series around this topic, with a longer video version of this talk, so if you want to dig deeper, just go to the anynines blog and you will find more there.

To reiterate: I strongly believe that if you change the mission statement, for example by replacing the idea of automating a lot of data services with just one, that has a lot of impact on the principles as well. So setting your mission in conjunction with your platform strategy is very important. It leads to principles that should be known to all your team members, and it enables local autonomous decisions if you split your automation team into smaller groups. The automation itself should rely on a framework that has solutions to the common issues across data services worked into it. And as you've seen, there are technological and strategic shifts that make agnosticism important, where you try to abstract: by using technology such as BOSH you can stay flexible and deploy to different infrastructures, and with the Open Service Broker API you can support multiple platforms, and so on. You try to keep your investments safe by committing to the right things.

I hope you gained something for your particular data service mission. If you have any questions, feel free to reach out. We have a booth, so come by and ask me any question, or just tweet me or send me a mail. If there are any questions left, feel free to ask; I think we have a few minutes left. All right, then thank you very much.