Hello, and thank you for coming. This talk will be a bit about Kubernetes, but more about the work needed to transform a startup from a few-person company into a real company with real revenue and a few more people.

First, let me introduce myself. I'm Philip, and I do DevOps: I help companies develop faster, grow up, and still keep developing fast enough, both by helping with the development process and with the tooling that supports developers. Together with my wife Nastya I also teach courses on programming and data analysis. In the past I co-founded a machine learning startup and studied cheminformatics, but I don't do that anymore; now I spend most of my time with Twisto.

Twisto is a financial company doing online payments. One of their features is Twisto Pay, which lets you buy at e-shops and pay two weeks later, once you have checked the goods; if you are not satisfied, you just return them and pay nothing. There is also a Mastercard which behaves like a credit card: you shop for a month and pay afterwards, and if you buy lunch for your colleagues, you can easily split the bill in the application. Twisto also sells its scoring engine, called Nikita, as a service, for example to Alza for their own scoring.

We focus a lot on technology. Twisto lends money, and because it's a startup, it doesn't have all the data sources and the bureaucracy of a big bank. Instead, it can look at the online data it can find about you and decide whether to give you, specifically, a loan, in one or two seconds, in many cases without documents, ID cards, or other papers, while still maintaining low fraud and default rates. For this it relies heavily on being able to keep developing the scoring and the technology. So it's a really tech-focused startup, and it's growing quite quickly. Last year it expanded to a second country.
It's Poland. There are now about 200 people working for Twisto, and it's growing faster and faster; this year Twisto plans to enter a third market, which will be Romania.

But three years ago, when I first came to Twisto, there were about 20 people in total, seven of them developers. From the technology point of view, there was just one server in colocation, running a simple stack: nginx, Django, Postgres, a bit of Elasticsearch, plus one server for database backups. So if the first server caught fire, we wouldn't lose everything; we could restore somehow from the backups. If you wanted to deploy a new version of the application, you did a git pull on the server, restarted a few services, and that was it. There was one country, there was no payment card, it was all much simpler. And these were the foundations for today's success.

This simple stack meant that the changes that needed to happen were very easy. With a small team you have no middlemen: the CEO or the product person talks directly to the developers, and if they need to change something in the product to test some hypothesis, it gets done very quickly. The developer builds it and pushes it to production; if something goes wrong, he fixes it, and everything is okay. Most of the developers were able to fit the whole system in their heads. This kind of stack is really good for a starting company, because you need to make sure you are developing the right product, and there were a lot of dead ends and a lot of things that really had to change.

But there are obvious downsides. The tools were simple, and they introduced risk with every deployment, and there were several deployments a day. There was a real chance of downtime during the day, because nobody was writing database schema changes in a backward-compatible way.
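To illustrate what a backward-compatible schema change means here: during a deploy, the old application code keeps running against the new schema for a while, so some operations are safe and others are not. A toy sketch of such a check (this is only an illustration of the idea, not Twisto's actual tool; the operation names are invented):

```python
# Toy illustration: flag schema operations that can break application code
# still running the previous release while a new schema is being applied.
SAFE_OPS = {"add_nullable_column", "create_table", "create_index"}
UNSAFE_OPS = {"drop_column", "rename_column", "add_not_null_column", "drop_table"}

def incompatible_ops(migration_ops):
    """Return the operations that old application code could trip over."""
    return [op for op in migration_ops if op in UNSAFE_OPS]

# Adding a nullable column is safe (old queries simply ignore it);
# renaming one is not, because old code still selects the old name.
ops = ["add_nullable_column", "rename_column"]
print(incompatible_ops(ops))  # -> ['rename_column']
```

A real checker would inspect the actual migration files, but the principle is the same: split schema changes into ones the running code tolerates and ones it does not.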
So yes, a few transactions were lost now and then, but the scale was small, so it didn't really matter that much. If the server, as it was on the slide a moment ago, had burnt up, there would have been a day without Twisto. Not great, but it somehow works. Yes, there are some risks, but the benefits still outweighed the risks tremendously: the ability to make changes is really crucial for an early-stage company.

Okay, I'll try to refresh it. Yeah, that's it. So: everything's great, the sun is shining, why change anything? Do you have any idea why you would want to change this state? The problem is that the company was not making any money. It was in the red. There was a slight margin on each transaction, but it couldn't pay the rent and the developers, even for this small team. So in order to change this, the company needed to scale up, and with scaling there are new problems you have to solve.

The first one, and the first part of this talk, is resilience. Every downtime now costs much more: if you operate at a bigger scale with more transactions, it makes sense to invest more into making the tools robust. The second is scaling the team. You are no longer focusing on improving the efficiency of three developers; you are focusing on increasing the throughput of the whole team. That means you want to be able to hire more people, plug them into the development process, and make more features. The third part is agility. You want to do all of this without much effect on the original agility: deploying several times a day and being able to debug problems in production quickly were really good things, so we would like to keep them.

So, how to do this? Let's pick resilience first. This was my initial reaction three years ago: it's one server in colocation.
Let's move to the cloud! They have virtual machines, everything is auto-scaling, so let's do that. There was a bit of skepticism at the company, and one point was about the performance difference, so we started to test it.

How do you test performance for an established application? You can simulate the load based on real usage data. You just take the logs from the web servers and see what the traffic pattern looks like: there was a spike in the morning, then around noon people were at lunch and not ordering much, and in the afternoon there was a smaller spike. We saw what things people were actually doing and simulated the traffic with the help of the Locust tool. With Locust you write Python code to simulate the behavior of the users, and it helps you simulate multi-step interactions, for example going through the registration process. The computation-intensive step is at the end, so we had to simulate people submitting a few things during the process and then finally registering. Once you can simulate your traffic, you can just multiply it ten times, fifty times, three hundred times, and see where the application breaks.

So where did it break? It broke on latency when we moved to a general-purpose instance; I think DigitalOcean was the first thing we tried. The latency of the home page rendering went from 60 milliseconds to 300 milliseconds. This was because there was a lot of Django templating going on on the home page, and it turned out we had a really powerful CPU in the dedicated server, running at a much higher frequency than the virtual machines in DigitalOcean. We could have worked on those milliseconds and optimized it somehow, but that didn't make real sense. A similar story with the database: Elastic Block Storage, which is available in Amazon, is a disk on some disk array.
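The multi-step scenario described above can be sketched in plain Python. The flow below is illustrative (the endpoint paths and form fields are invented, not Twisto's real ones); the client is injected, so the same flow can be driven by a load-testing tool's HTTP client or by a recording stub:

```python
# One simulated user walking through a multi-step registration flow.
def registration_flow(client):
    client.get("/signup/")                                  # load the form
    client.post("/signup/", {"email": "user@example.com"})  # submit details
    client.post("/signup/verify/", {"code": "1234"})        # confirm
    client.get("/dashboard/")                               # the expensive final step

class RecordingClient:
    """Stand-in client that records requests instead of sending them."""
    def __init__(self):
        self.requests = []
    def get(self, path):
        self.requests.append(("GET", path))
    def post(self, path, data):
        self.requests.append(("POST", path))

client = RecordingClient()
registration_flow(client)
print(len(client.requests))  # -> 4
```

Under Locust you would put these steps inside an `HttpUser` subclass with an `@task` method using `self.client`, and then scale the spawned user count up tenfold or three-hundredfold to find the breaking point.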
It's great, you can add capacity, but the performance is not as good as local disks. For the database, which you can only scale vertically, by buying more powerful hardware, this also gave us pause. And the third issue was memory cost: the application was quite memory-hungry, because there was a lot of cheap memory in the server, so nobody had really optimized it.

With these test results we found that Amazon was doable but costly, which would affect the scaling: if we wanted to increase the load many times, it would cost us, and the cost difference was such that it could pay for, say, two engineers. So either we would invest in app optimization, or we would pay that to the service provider.

Let's see how moving to the cloud matches the goals we had. Resilience: the app could be made more resilient, we could have auto-scaling; we would have to invest in setting up the tooling, but it would work somehow. Scaling the app: yes, it would work, and we would pay for it. Agility: does the cloud help us with agility? We couldn't see how it maps directly to keeping the agility, so we took a step back and thought about what it is that makes development easy and fast.

The first thing is quick deployments. If I am able to put a bug into production and fix it five minutes later, I am much more likely to actually do several deployments a day and try new features quickly. I still do testing, but I focus mostly on whether the feature is right, rather than on making sure absolutely nothing can break; testing to that level has a cost. So having a quick deployment path to fix bugs has value. The second thing: if I have a monolithic application, I can run everything locally and test all the features together, which is nice. Third, it's nice to have cheap resources. It's a really different thing if I have to optimize the app now versus optimizing it half a year later.
It's technical debt, but if I have a low interest rate on it, it makes development much more fun and much easier, because some things will turn out not to need optimizing at all: the feature will be end-of-life in half a year, or the bottleneck will not be as bad as we thought. So cheap resources really help with development. And the last thing is visibility into production. With the old server we had all the logs in one place, on the server, so if you were debugging something, you had one place to look when something went wrong. This is something we also wanted to maintain.

So with these requirements, we finally arrived at this architecture. We have a Kubernetes cluster with five workers in the data center, on bare-metal servers, and a beefy database server with a backup. We just said, okay, let's make the database server as redundant as possible inside the hardware, so we will likely not hit the problem of it breaking down. And then we use some external services: we use Amazon for file storage and for the backups, and we rent HAProxy as a service from the data center. Here, the rented services and Kubernetes give us the resilience, while the dedicated servers in the data center give us the cheap resources. The benefit of the cloud is that there's no salesperson you have to go through if you just want another server.
That is a big benefit, but the cloud is, let's say, four times more costly, and you pay extra for services. The big pain with bare-metal servers is ordering the disks, the hard drives. You have to plan the capacity, the servers cannot house an infinite number of disks, and this is actually the biggest pain point: the physical hard drives. Dedicated servers also help when we need custom hardware. For Poland there is an electronic signature token which we simply have to plug into a server somewhere, so we need to have dedicated servers anyway (and it runs on Windows).

Here, just for your information, are the cloud providers and dedicated providers I have some experience with. On the left are the brand-name clouds: they have the most services and are usually the most expensive. On the right are usually smaller companies that offer dedicated servers without as many managed services. Sometimes you can negotiate something special, like our load balancer, but they usually don't advertise it much, and you often deal with salespeople there. In the middle are the cheaper clouds, with maybe a managed database and managed storage, but not much on top of that.

So, this is how we managed the resilience. The second thing is scaling the team: how to move from a few people with simple tools to bigger teams of specialists who cannot work in isolation, but together build a more sophisticated product. First: you may have heard that an API is something a machine uses to talk to another machine. Who has heard this kind of definition? It's wrong. An API is something people use to talk to other people, especially for the developers of one service to talk to the developers of another service. An API is a contract between people, saying that I will maintain the stability of this interface so you can depend on it.
That means the team maintaining the API has more work, but in the end, for the whole company, it means more people can take part efficiently in developing the product. For the front end, Twisto has three applications: iOS, Android, and a web application. In the beginning the web application was based on Django templates rendered to HTML; it's now moving to a React single-page application, and all three applications now use the same GraphQL API to talk to the back end. That means the people who make changes in the back-end code don't need to deal with templating, HTML, that sort of stuff.

Another thing is pulling other teams in the company into development. First, as I mentioned, there's the risk engine, which needs to model whether you, specifically, are going to pay back the loan Twisto gives you; there are statistical models behind it. When the company was really small, the data scientist who developed the model just created a pull request, the model got hard-coded into the application, and everything was fine. Later that wasn't enough, because it slowed him down, and nobody else really understood what was in the pull request anyway. So, to iterate faster, the models became trainable inside the application on real data: there was a little machine-learning studio in the application where you could train models. Again, that was not enough. So now the data scientists develop the models in their own environment, make a package out of each model, upload it to the application, and the application just executes the model at runtime.

It's a similar story for parameters, of which there are many. Individual e-shops have different risk profiles (electronics is more risky than, I don't know, diapers), and individual people have their monthly limits adjusted based on their history. In the beginning there were a lot of knobs and inputs in the admin console, but later we moved to a batch API, so the risk
team can call the API and change things in bulk. Now they actually maintain an active service which takes part in the registration process and responds to real requests from customers. They have a dedicated namespace in Kubernetes. It was a bit tough for the data scientists to be taught how to deploy to Kubernetes, what Docker is, and all that, but they managed, because they see the benefit: they can really maintain their own service without waiting for another team. A lot of the work we as DevOps do is trying to remove waiting on other people.

With Kubernetes it's easy to create microservices, and we have a few of them, but for me the decision whether to create another one usually comes down to: is there a team willing to take care of this new microservice and maintain it, and does it bring them any benefit? Again, it's about the isolation between teams, like with APIs. Does it make sense to break up one monolith into two or more microservices if there will still be one team managing the whole cluster of microservices, and if you need to make changes across multiple microservices every time you want to develop a feature?
That doesn't make much sense. But if there is a separate team, like a marketing team that wants to create its own marketing pages, or the risk team that wants to shape the registration process, that's the ideal case for microservices.

The third part of the talk is about keeping agility. In our case, part of this is managing the process: we have teams that are empowered to deliver a whole feature inside one team. But here I will focus more on the tooling perspective and show what tools we have to help developers deliver features. The first thing is automating the deployment. Deployment is done several times a day, usually, and features are delivered as they are ready, sometimes behind a feature flag, sometimes just deployed straight away. There is a custom script which checks CI to see whether this version is okay, updates the front-end server so it serves the newer static files, and then applies changes to the database schema. We have a tool which checks at commit time, during the unit tests, whether the schema changes you wrote are compatible with the previous code still running on the servers, so the deployment should not disrupt the running application much. Then we update the back end, and then we notify Slack that there is a new deployment and which features were deployed. This is also useful for notifying the other teams in the company, for example support: something has landed, and people will now call you about it.

One thing that got worse compared to the old times with the single server is the time to production. You find out there is a bug in production, you commit the fix, someone reviews the fix for you, which may take a few minutes or an hour, who knows. But then, on the old server, you did a git pull and a restart and everything was up and running, and if you found out there was a mistake in your fix, you just repeated the process, and in a few
minutes you were done. Now you are building the Docker images, uploading them to Docker Hub, downloading them from Docker Hub to the production servers, restarting all the containers, everything; it can take 20 or 30 minutes. This is not ideal. And I forgot: sometimes Docker Hub can be really slow, dial-up-connection slow. To address this, we are now working on a hotfix process, which means that hotfix deployments will just download a patch set from GitHub and restart the services, without building all the new container images. You cannot do everything by just patching the code, but if you have a bug in your logic somewhere, this can dramatically shorten the time to production, from 30 minutes to under a minute. Again, the tooling tells you whether the change you are trying to apply as a hotfix makes sense: whether it is safe to apply as a code fix, whether there are database changes you need to apply manually outside the process, and what else you need to do extra. That is all the tools can tell you.

One thing I forgot to mention is that in this kind of organization, the developers themselves deploy the code. It is not the ops organization, not the DevOps people, who deploy the code; it is the developers themselves. So they need to be told
So they need to be told If something goes goes wrong and they did they need to have the visibility into into production To be able to diagnose and debug the problems and they they usually do fine in the in the beginning with the Kubernetes and docker Like half of the developers didn't have any experience with the docker before But they they pretty much caught caught up to speed The developers are able to test Their features on a real scale of the database we have a half terabyte database in impulse grass and They are able to get snapshots of these database either anonymized for testing or in some cases Not anonymized for debugging some severe production issues. They can they can get this snapshot in five seconds This is because we maintain a replica of the database on a development server and with Linux logical volume manager we can take this copy on right snapshot and to start Start a fresh fresh database for for the testing for the database it looks like if somebody pulled the power plug and it just crashed and Now it's it's booting up after a crash and to progress is quite quick with recovery after after this This power outage. So each developer has their own own personal sandbox and They like almost almost all of them all of them uses it instead of smaller database with With some data fixtures, which they which they were using before on their local machines Another widely used feature is Multiple staging environments. We don't have any single staging environment But the developer or anybody in the company can Enter a branch name or commits commit ID into Into an input field press supplement and in ten minutes they get a public publicly accessible copy of the production production environment with with all the services running and They can test their features. 
They can show it to external people, they can perform load testing, and they can test integrations with external services that need to call webhooks on our servers. There are now about, I don't know, seven or eight staging environments running at this very moment, and when a person is done with one, they just move it to the trash.

For these tools we rely a lot on the cloud, either for the higher-value-added services like S3 or Google Firebase, or for virtual machines in AWS for the lean applications, because it's much less hassle. And then on the dedicated servers we have the internal tooling: the Wi-Fi portal, VPN stuff, a service which unlocks the hard-drive encryption for the physical servers. We have Jenkins workers on spot instances in Amazon, and we use AWS also for temporary resources like the staging environments I described a moment ago. So these are the tools which give us agility even as the team grows. In the beginning, for example, the database was small: if somebody had to debug a problem with production data that manifested for some customers, they could download all 20 gigabytes of the database and run experiments. Now that's not feasible, but we have better tools for it.

So, to sum up what I wanted to say with this talk: first you need to know what the problems are. When I first came to Twisto, the definition of the problems was a bit hazy. They knew they wanted to scale; they didn't really know what that meant in terms of traffic or developers. So the first step is to define what the problem is. Then pick one problem, prototype, measure, and invest a lot in internal tooling. This is what really paid off for us.
There are now five people out of the forty developers who work mostly on the internal tooling, and it really brings value: it increases the productivity of the rest of the developers by quite a high margin.

The most important point is to take risks. The risk has to be calculated. There has to be an upside: if I take this risk, what do I gain, does it actually bring me anything in terms of speed, money, whatever? And there has to be a limit on the downside: if the bet goes bad, what happens? If I lose the server, because I have just one, what goes wrong, and can I recover from it? So yes, if the risk is calculated, if you know what the upside is and why you are taking the risk, and if you know the downside, the limit of the losses you can suffer, then you can decide to take that risk. Without risks, you cannot do agile development at all. Thank you for your attention, and now I think we have room for questions and answers.

Okay, so the first question was about the tooling we use. To manage the servers we use Ansible. We use Prometheus for monitoring, and Elasticsearch, Logstash, and Kibana for the logs; we generate about 40 gigabytes of logs per day. We use PagerDuty and Sentry for notifications. We use a lot of Slack, and we have some Slack bots. If you have any specific area in mind, I can share other tools.

Okay, so with Prometheus we mostly monitor the metrics concerning the health of the physical servers: how much memory do they have, is the CPU overloaded or not, how much disk space do we have. And then also application metrics: did we get any new orders in the past 30 minutes?
If not, something is probably wrong. There are a few more application-specific metrics, like the lengths of the background queues and so on.

The second question was about a status page. We don't have a status page, because the people who use Twisto usually don't come to our pages; they try to pay somewhere. So we communicate via error messages and error pages, and we also try to make sure that payment by the physical payment card, or online payment by Mastercard, always works. There are several layers which can approve a payment, so we haven't had any big issues with that critical part. And yes, we do have internal SLAs. We are now more or less starting to formalize them, especially for the services we as DevOps provide to the other parts of the company. We also have to report SLAs back to the Czech National Bank every month, how we met them, but those are quite formal ones and don't really have to match the actual customer experience. Any other questions?

Yes, we considered it, and we'll probably do it at some point: in the Kubernetes cluster we will probably have a repository for the production images. There are two issues. The first is, as I was showing in the pipeline, the CI build and tests: it would still take about 15 minutes to get to production if we wanted a full Docker build and tests, and we decided that the priority for now is to skip as much as possible to make the hotfix work. But we'll probably do this. Docker Hub, for now, is very cheap, and I think we will move to the AWS registry for the supporting images and to our own registry for the production ones, but so far there are other priorities.

Service mesh? I don't have experience with that. Is it something more like Heroku, or is it... oh, sorry, okay.
Yes, I get it. So far we haven't needed to use a service mesh. Kubernetes has layer 3 networking at the TCP/IP level, and it has been okay for us so far. Now we can see the benefits of Istio, or services like that, for communication between services managed by different teams, for example to add encryption between the services. This is something we are looking into, but we are not using it yet. The slides, if you are interested, will probably be posted near the talk annotation, but if you want to check them earlier, they are at the bit.ly link. Any other questions? Thank you.