afternoon. We have a very interesting talk right at the very beginning, with our speaker from Wavestone. He'll talk about going from monolith to microservices, a journey I guess many of us have walked and also failed at many times. So I'm looking forward to your talk. Thanks a lot.

Thanks. So hi everyone. Quickly, I won't take a lot of time. I'm from Wavestone, a consulting company, and we support our customers around digital transformation. I've been working there for 10 years now, working on this topic of digital transformation in Europe, the US and Mexico, and now in Singapore for the past six months, for a large-scale company. I'll start this talk by giving you a bit of feedback on the 2023 DevOps market. In a nutshell, what I have seen so far is that my customers describe a chaotic environment with a huge workload, where the teams are always late and over budget, and the releases are lacking quality and cybersecurity. In fact, when we look in depth, several topics emerge. The delivery pace is an issue, meaning that the business wants a lot of releases on digital products, releasing new features, and the IT teams have trouble following that pace. The second part is global consistency: corners are sometimes cut, operations and quality of service are not at the right level, and the same goes for the quality of the run; there are regression issues between releases, and so on. Regarding the target operating model and the delivery operating model, we can see sourcing inconsistencies between the delivery teams. I mean, there are sizing errors in strategic workforce planning, saying, okay, I want to have 10 front-end developers, I want to have two back-end developers, and sometimes the capacity planning across teams isn't right. There is poor separation and inefficiency due to legacy IT organizations created in the 90s, and finally there are budget management principles that are incoherent between the development teams and the operations ones. Regarding the sourcing strategy, quickly, I've seen inconsistencies in the contracts: there are issues when you ask your outsourced developers, for instance, to deliver faster while they are on waterfall contracts, and that's the most common issue I've seen, along with the service-level agreement part. Finally, there are the human resources and psychosocial hazards you can face in your organization, where you have overload, turnover and, most commonly now, especially on the Singaporean market, the struggle to keep and retain your talent in the company. At Wavestone we consider that there are five maturity levels that pave your way on the road to agile and DevOps. I will go quickly on this one, because I think the DevOps part on the next slide is what we should focus on. The first level is the tailored one, where everything is manual; it's the one you faced between the 90s and the 2000s. The second one is the agile project: some of your applications you start to develop in an agile way. Then there is the agile application, where you start to push the dev and the ops to work together. The agile IS is a bundle of applications all working together in agile mode, and finally, innovation first is to let the business work right next to the dev and the ops.
But when we present that, the most common mistake I see is that, when we are talking about DevOps, most of my customers think only about this first part. DevOps, for most of our customers at Wavestone, is only agile organization and collaboration culture. But in fact DevOps is much more than that: CI/CD pipelines, infrastructure as code and application architecture are the cornerstones of the DevOps methodology. Everyone knows the agile organization and the collaboration culture, because they've been brainwashed and most advisory and consulting teams are doing that. But the three other pillars are very interesting, because to be able to have fast delivery, you must consider infrastructure as code as part of it. Meaning that you need to become an infrastructure-as-code broker for your applications, in order to reduce as much as possible the time to market, the time to deliver any underlying platform and services from an IT standpoint. I'll give you an example: how long do you wait to have a new virtual machine or a new VLAN created if you are working with legacy services versus infrastructure-as-code services? With the legacy, you need to raise a ticket in your ITSM, wait for it to be assigned to a team, and then you receive the service. With infrastructure as code, you expose it as a service: one click and you get the service right away. Regarding the CI/CD pipeline, it's the way to automate your code delivery from the code repository towards production in the fastest, most automated, industrialized way. If we talk about DevSecOps, it also includes the security part in your CI/CD pipeline. Finally, the last pillar is cloud-native and modular applications. It's good to have the tooling, it's good to have the infrastructure as code, but if you have only one big JAR file, one monolithic application, it's like asking an elephant to break your eggs in the morning: very efficient but not accurate. So my point here with cloud-native and modular applications is to work with the architects and with the business to redesign a modular, microservices architecture, in order to be scalable, to be able to push small features into your CI/CD pipeline, triggering infrastructure as code as much as you can, and finally to integrate cybersecurity on small pieces. At the end of the day, microservices and modular applications mean the same application, but broken into pieces, which gives the IT organization the opportunity to push it faster, easier, and in an industrialized, automated way. You can see on the bottom part of the slide that the market trend is the cloud offers: the main cloud vendors on the market follow the DevOps principles. Take AWS, for instance: they propose DevOps with CI/CD pipelines and infrastructure as code, so all the cloud providers are now aligned with the DevOps principles. Meaning that they are fully API-driven to ease automation, they have services enabling modern architectures for applications, they propose transversal services for the run of your platform, from a monitoring standpoint, security, cost monitoring, etc., which opens the door to FinOps for instance, and finally there is integration with best-of-breed DevOps tooling, from infrastructure as code to SCM, build engines, etc. Now, why is a showcase journey relevant for any IT organization?
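To make the "everything becomes an API call" point concrete, here is a minimal, illustrative Python sketch of requesting infrastructure programmatically with AWS's boto3 SDK. In practice you would more likely use a declarative IaC tool such as Terraform; the AMI ID and CIDR blocks below are placeholders, not real values.

```python
# Illustrative only: provisioning a network and a VM as API calls
# instead of an ITSM ticket. AMI ID and CIDRs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")

# "New VLAN": create a VPC and a subnet, one call each.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
subnet = ec2.create_subnet(
    VpcId=vpc["Vpc"]["VpcId"],
    CidrBlock="10.0.1.0/24",
)

# "New virtual machine": launch an instance into that subnet.
ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId=subnet["Subnet"]["SubnetId"],
)
```

The point is the turnaround time: minutes of API calls instead of days or weeks of tickets.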
So I will go into detail on how to scope it just after, but first, why is it relevant? You will create a seamless team involving business, technical and cybersecurity profiles. You will focus the team on a long-term product, but with short-term milestones to achieve, and thus you will limit the related effort to be done. You will take credit for the showcase, and you will promote the showcase results and the people involved in it, in order to create your new DevOps heroes or champions. And finally, and this one is very important, you will more easily find the related budget to develop it. Now, in order to scope your showcase, how do you select the right project to launch? There are four steps, and the first one is: which projects? Across your application landscape, you need to consider the project's complexity and its need for instability, that is, how often it needs to change. I'll give you an example: if you are working in the banking sector, everyone knows the mainframe. The mainframe does not need to change every day, so it does not need to be unstable; you need it to be as stable as possible, so it's not a good candidate for DevOps. Regarding project complexity, the question is: is it something difficult to develop, is it highly linked to other applications, or am I working with a very standard technology and a standard application? Here the goal is to take the most unstable applications, like a mobile application for instance, and the least complex ones, in order to get the best ratio there. Then, which applications should be considered? There is technical compatibility to assess, in order to pick the applications most compatible with the DevOps path. Then, how to prioritize: here it's about having the best ratio between the benefits and the investment. And finally, there is a priority model to be discussed in order to decide what you want to do on development and production, between internal and external. There are several topics to be launched in priority: showcasing several applications, the ones relevant to the business; creating a CoE for CI/CD and infrastructure as code, which could be a good opportunity for you, because it will stay relevant in the coming future, since the more often you use your CI/CD pipeline, the more often you will use infrastructure as code; and finally, transforming the technically incompatible applications into compatible ones to grow your ecosystem. Quickly on this one, because I've discussed several topics: we can distinguish two main kinds of applications, the ones that are DevOps-izable and the ones that are not. The ones that are DevOps-izable are the evolutive ones; the ones that have a front end, so a lot of manual interfaces with customers, either internal or end users; the ones that have few flows, meaning standalone; the ones that are technically compatible; and the ones that are visible to the management and to the business. And, in contrast, the ones that are not DevOps-izable are exactly the opposite: the legacy ones, a lot of back end, a lot of flows, a lot of technical incompatibilities, and not visible to the business and the top management. Now, how do you create your DevOps team for this showcase?
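As a back-of-the-envelope illustration of that prioritization step (my own sketch, not a tool from the talk), you could score candidate applications on instability, complexity and benefit-to-investment and rank them:

```python
# Toy scoring of showcase candidates: prefer unstable (changes often),
# low-complexity, technically compatible apps with a good benefit/cost ratio.
candidates = [
    # name, change_freq (1-5), complexity (1-5), compatible, benefit, cost
    {"name": "mobile app", "change_freq": 5, "complexity": 2, "compatible": True,  "benefit": 8, "cost": 3},
    {"name": "CRM",        "change_freq": 4, "complexity": 3, "compatible": True,  "benefit": 6, "cost": 4},
    {"name": "mainframe",  "change_freq": 1, "complexity": 5, "compatible": False, "benefit": 9, "cost": 9},
]

def score(app):
    if not app["compatible"]:
        return 0.0                       # incompatible apps go to the backlog first
    roi = app["benefit"] / app["cost"]   # benefits versus investment
    return app["change_freq"] * roi / app["complexity"]

for app in sorted(candidates, key=score, reverse=True):
    print(f'{app["name"]:12s} score={score(app):.2f}')
```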
So the goal here is not to keep the way people are doing DevOps right now; the goal is to bring the tech guys to work with the dev and the ops. And when I say tech guys, I mean the people working on the platform and the underlying services, working on infrastructure as code and on the CI/CD pipeline, in order to initiate the change with the team. And, very important, this showcase team must be independent, to avoid being influenced by the management or any legacy organization. Then you can complete this team with experts from outside the company, either through outsourcing or recruitment. Then you onboard your team on the showcase that you would like to launch and promote, on one or several facets of DevOps. Meaning that you can take an application and say, okay, this one is monolithic, I want to transform the monolithic application towards microservices first, and I want to use only part of the CI/CD pipeline in order not to have everything complex at the beginning, nor infrastructure as code for instance, and after three months you increase the complexity. Then you coordinate the other DevOps projects, if you have launched several at the same time, with the other teams, and you might need to rotate the team at some point, from an HR career point of view or from a skills point of view. Finally, the last part is to increase the auditing of cybersecurity, IT standards and the viability of the environments, and then you repeat again and again. So, in a nutshell, your showcase could be, say, if you're working in banking, a B2C application for customers: you'd like to promote this mobile application for your business, with a bunch of features that require several interfaces with the end user; then you want to develop it, or replatform it for instance with microservices, using a CI/CD pipeline and infrastructure as code, in order to reach, I don't know, one release a day to push new features or new fixes every day. Here you can say to your management: I'd like to have a three-to-four-month showcase during which this application will be totally replatformed from monolithic to microservices, with the following KPIs that you can define with them; and after the three or four months you share the results, with most of the tasks completed, and if you need extra time you can ask for it to get other features promoted. So, after the showcase, what's next? As you know, because it's a long-running transformation, DevOps is an iterative approach, making DevOps a subject of adoption and skill development; this one everyone knows. The second one is a bit tricky, because you must dedicate teams, architecture and environments to the showcase. Few to no companies are doing this. I think that's a mistake, because by doing it you are able to work like a startup, with a dedicated environment for it. Then you must facilitate the team's adoption of the processes and tools they have tried to implement and test, and consider the agile IT transformation to be launched from the beginning. Meaning that most of the companies here on the Singapore market launched this DevOps transformation and their cloud transformation years ago. However, when I'm talking with my customers here, I can see that they are lacking some industrialization within the CI/CD pipeline.
They are lacking standardization of the CI/CD tooling, for instance, and finally there is some lead time to create a CI/CD pipeline to support a new application team. So you can see it's because they worked only on the dev and on the ops part, and not on the tooling, the agile IT part. My point here is that the IT organization, the geek people in the organization, must work as a team with the DevOps teams. And finally, applying the DevOps principles to eligible applications: this one is important, because a lot of the application landscape of the Singaporean customers here is on the cloud, certainly, but as IaaS, so a very legacy way of working, as if the application were still in a data center. My point here is not to jeopardize everything; it's to take advantage of what you have, but to start with one application, two applications, three applications, and if you have already done it and don't think it was done the right way, you can start again, for instance with a compatible, standard application that will fit a showcase, and on some environments, meaning your dev environment, your SIT environment, and so on. Then you will be able to spread the DevOps culture within your company, leveraging this showcase by improving it, giving the showcase more time, and industrializing it in a sustainable approach. Again, the sustainable approach is what is lacking in the companies I've seen here, because there is no tooling from a CI/CD pipeline or infrastructure-as-code point of view. Then, and I think it's the last part, you can extend this showcase to all target environments, defining the deployment patterns and principles. Here I can tell you what I have seen and done in the US and Europe: we created a dedicated environment in order to have an end-to-end DevOps environment, from business to development to operations to IT, so the underlying platform and services, and we moved the different applications iteratively, from monolithic to replatformed ones, onto this new environment, with a dedicated operating model that brought value to the business. I will conclude here: do DevOps in an agile way, to quickly deliver value and share results with the business, but also to mitigate the risk. DevOps is agile, so implement DevOps at scale. I'm now here with you guys to answer any questions you may have, and I also have a bunch of appendices if you need to go further into detail. Thank you.

So, when you are talking about dependencies, are you talking about dependencies within the application or with other applications? Within the application, there are several approaches. There is the code retro, or reverse engineering, to identify the main modules, which are UI/UX, front end, back end. Then there are common features on the front end and the back end that you can rely on, several patterns of standard monolithic applications, in order to match as closely as possible what the microservices version would be. My point here is that you can take 21 days to reverse engineer the application and then map, as much as possible, the current features in it to what the target application could be. Okay, now it's design, it's just design. You need to have this document in order to start creating a low-level design of the target, because if not, it's like recreating your application from scratch. Any other question?
So, if I understood the question, you are asking how to get the leaders to work on that kind of development activity? Leaders in the company? Okay, the middle management. Okay, so it's an incentive question. Regarding incentives, there are three ways I have handled that topic. The first one is from the HR standpoint: you must talk to the leaders of the company in order to review the goals and objectives of the middle management. You can incentivize the middle management to work on the showcase, or to support the showcase, by putting it into their goals, because at the end of the day, if they are not supportive, no bonus, no money. The second topic is to promote the HR career path for the middle management around DevOps, to make sure that they are also working to develop their skills on that. And finally, the last topic is to create an incentive around what I call promoting the success: you put some people forward, together with the middle management, so they take credit for what has been done. So those are three ways of answering it, all related to HR.

When I see a front end, I always want to do an A/B test, right? Yeah. A/B testing is something that takes time and is not trivially DevOps-izable. And so, for me, things are much easier to put into a proper CI/CD DevOps process on the back-end side rather than on the front-end side. What is your view? So, I was talking about the showcase, so I was very showcase-oriented. You're right. I was very showcase-oriented, meaning that it's easier for a front-end application, like a CRM for instance, compared to saying, okay, let's go for an ERP. But no, at the end of the day, you will have some back-end applications that bring a lot of money and that will be very relevant to move to microservices and onto your CI/CD pipeline. Thanks a lot. Thank you everyone.

My talk is about Varnish. We are Varnish Software, the company behind the open source project Varnish Cache. Has anyone heard of Varnish, or Varnish Cache, or used it? Okay, part of you. For the rest of the people who haven't heard of Varnish, just a quick introduction. Varnish is an HTTP reverse proxy; it's a kind of web caching daemon. If you compare it with some of the solutions in the market, like NGINX or HAProxy, they are very similar products but with different objectives. Varnish is only focused on caching. NGINX is more like an application server: you can put in a lot of modules and plugins. HAProxy just does routing, you know, load balancing. We can do both, but we are really focused on the caching side. So here we have the event website, and we will also be exhibiting at the booth; if you want, you can join us there as well. So why did the guy at the very beginning develop Varnish? Because slow websites suck, right? Go back 10 or 20 years: you're still running Windows, the browser doesn't load, the page doesn't come up. So this guy accelerated the website. That's what we do: we build software that delivers web acceleration and content delivery solutions.
These days we focus a lot on CDN development, OTT streaming, web APIs, e-commerce, online shopping, this kind of service. Our commercial customer base mostly used the open source version at the very beginning, and later on moved to the enterprise version because they needed more support and more features, but most of them actually started with the open source project first. If you see scaling problems in your projects, problems scaling the database or the application itself, actually look at the caching first and see whether you can scale at the caching level, because the caching level is easier to scale: for us it's just replicating instances in front of all the services. We have customers in different sectors: in streaming, for example, Netflix are using our solutions; in manufacturing, Tesla use our solution internally for web APIs to accelerate their whole workflow, because they have a lot of factories across the world, and for fetching the designs and all these things they use Varnish to scale up and speed up the system.

Okay, so what is Varnish actually? Varnish just sits between the client and the back end. The client can be your browser, can be your cell phone; actually, the client can be another system. In today's world we use Kubernetes, Docker environments, all these cloud systems, and you have microservices, so the client can be another microservice talking to another back end, another microservice. In between, Varnish can cache all these HTTP requests and then respond; we can apply some logic in the cache to speed up or aggregate the responses, and that's why we have web API acceleration or edge computing. Some figures, just to share: two months ago we reached about 1.4 terabits per second from a single server. So if you are doing OTT streaming or video broadcasting, we are very good in that area, and we actually save a lot of power in terms of per-gigabit-per-second delivery: what we do is more than one gigabit per second per watt. If you are talking about ecosystem savings, carbon footprint, that is very good for the company's future.

Okay, some very generic technical stuff about Varnish and how HTTP works. If you don't have any caching in between, the client just comes in, gets the content from the origin and goes back. In between you can have some HTTP daemon, or do some in-memory caching in the application itself; you can use that kind of component in your application to speed it up a little bit. But still, as more users come in from the internet or from other areas, the system will fail, because it's really difficult to handle or scale up at that level. So what can we do? Varnish Cache. You put Varnish Cache in between the client and your origin, your applications. Then Varnish Cache actually does two things. One thing is queuing up your requests: if we don't find the content in the cache, in memory or on disk, then we only send one request back to your back end to fetch the content. So from this point of view, let's say you have five requests coming in: at the end the origin only receives one, and then we respond back to the clients. The latency here is very minimal; we just pause the requests until we get the first byte of the response from the origin, and then we go back to the client.
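That collapsing of concurrent misses into one origin fetch is often called request coalescing. To illustrate the idea conceptually (this is not VCL and not Varnish code, only a toy Python sketch):

```python
import threading
import time

class CoalescingCache:
    """Toy model of a caching proxy: serve hits from memory and let
    concurrent misses for the same key share a single origin fetch."""

    def __init__(self, fetch_from_origin, ttl=60):
        self.fetch = fetch_from_origin    # callable: key -> response body
        self.ttl = ttl
        self.store = {}                   # key -> (expires_at, body)
        self.inflight = {}                # key -> Event the followers wait on
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            entry = self.store.get(key)
            if entry and entry[0] > time.time():
                return entry[1]           # hit: the origin is never touched
            waiter = self.inflight.get(key)
            if waiter is None:
                waiter = threading.Event()
                self.inflight[key] = waiter
                leader = True             # first miss does the real fetch
            else:
                leader = False            # later misses just wait
        if leader:
            body = self.fetch(key)        # exactly one request to the origin
            with self.lock:
                self.store[key] = (time.time() + self.ttl, body)
                del self.inflight[key]
            waiter.set()
            return body
        waiter.wait()
        return self.store[key][1]

# cache = CoalescingCache(lambda key: f"origin response for {key}")
# Five concurrent cache.get("/video/manifest") calls -> one origin fetch.
```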
Whether it's a web API or OTT streaming, latency is very important, especially for live streaming; in between, we add maybe 10 milliseconds at most before we get the response back to the client. And you can scale up by just adding more instances: you scale up your application without touching the origin; the origin stays more or less the same. And once we have cached the content, we don't even go back to the origin; the origin is completely isolated from the client. We effectively add an additional tier of security in front of your application, so you are protected from the client side: the client never reaches the back end directly.

The core components, this is more on the technical side of how Varnish works. Varnish actually divides the whole system into several components: the core engine, the cache API, and the thread pools; we have a multiple-thread-pool architecture. On top, we use VCL, the Varnish Configuration Language, to configure how the system behaves, and we also have the vmods, the modules, the plugins, to add other features. All these features you can create on your own, open source, and there are also some enterprise vmods that you just plug in to get more features for manipulating requests. The bottom two parts are about the storage engine: first we use jemalloc for the memory, for how to allocate memory and link it to cache the content, and in the enterprise version we have MSE, the Massive Storage Engine, where we can combine the memory and the disk to deliver the content more efficiently and use your capacity more efficiently. And then on the right-hand side is the VSM, the shared memory module, which is a powerful way to get the logging and the system status out of the core without affecting the incoming requests.

One of the things worth talking about is the logic behind the core, which is a finite state machine. The finite state machine allows Varnish to manipulate the request at different states: once you get the request, you can manipulate how we respond, and also manipulate the response from the back end. At each of these states we can make very specific modifications or decisions, and that allows us to make these manipulations more efficiently and in more detail. Here is the high-level state machine, how it looks: when a request comes in, we start, then we do a hashing to look into the cache. If we find the content in the cache, we have a hit and go to deliver. If we don't find the cached content, we go to miss and then fetch it from the back end. In each state, each square block, you can actually put logic and decide how you want to manipulate your request. Most of the time we manipulate things in the backend fetch, so if you have multiple back ends, redundancy, all these things, or load balancing, that logic is done at that level; and if you want to manipulate the response, usually what we do is modify it in vcl_deliver.

So that is VCL. A bit more: as I said, VCL allows you to manipulate the incoming requests, and how and where to cache the response. Of course you can modify the cache object's lifetime, how to respond, or delete objects, and internally we have different objects for manipulating the request: the request, the response, the backend request, the backend response, and also the generic objects.
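To make that state flow concrete, here is a toy Python walk-through of those states (recv, hash, hit/miss, backend fetch, deliver). It is only a conceptual sketch, not VCL and not how Varnish is implemented:

```python
import time

def handle_request(req, cache, fetch_from_backend, default_ttl=120):
    """Toy version of the Varnish-style request flow:
    recv -> hash -> hit/miss -> (backend fetch) -> deliver.
    `cache` is a plain dict of key -> (expires_at, response)."""
    # recv: decide whether the request is cacheable at all
    if req["method"] not in ("GET", "HEAD"):
        return fetch_from_backend(req)              # pass straight through

    # hash: build the cache key
    key = (req["host"], req["path"])

    # hit / miss
    entry = cache.get(key)
    hit = entry is not None and entry[0] > time.time()
    if hit:
        response = entry[1]
    else:
        # backend fetch + backend response: fetch and decide how long to keep it
        response = fetch_from_backend(req)
        ttl = response.get("ttl", default_ttl)
        cache[key] = (time.time() + ttl, response)

    # deliver: last chance to tweak what goes back to the client
    response = dict(response)
    response.setdefault("headers", {})["X-Cache"] = "HIT" if hit else "MISS"
    return response

# cache = {}
# backend = lambda req: {"status": 200, "body": "hello", "ttl": 60, "headers": {}}
# handle_request({"method": "GET", "host": "example.org", "path": "/"}, cache, backend)
```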
You can access all of these objects through the APIs; you can extract the status, so you can use Prometheus or other modules to display all these statuses graphically.

A bit more about the backend fetch: we use it to do round-robin for backend redundancy, request routing, or sharding with consistent hashing across back ends. All these back ends can have health checks, sending probe requests to make sure your back end is healthy and your services are up and online all the time. For the backend response: if the back end responds with a result, say a 200 code, or other codes like a 503 or a 404, this kind of error code, you can specify how to handle it. When you have a response, we can set the TTL, that is, how long the object is kept in your cache. Varnish uses three different values: the TTL, the object's original lifetime; the grace period; and keep. When we use the grace and keep periods, there is a bit of magic that lets the cache hold the object a little longer than the TTL, so if you have a problem with your back end, you have the option to serve the content from your cache even though it has expired. There is a lot of other logic and magic you can play around with in the VCL to retrieve the cache.

The VSM, the shared memory, exports all the status, the memory facts and all these things, so we can use it for analytics and logging, and we also provide a very powerful query syntax, the VSL query, to query the logs from memory. By default we don't keep the logs: unlike Apache or NGINX, we don't send anything to syslog or whatever; the logging is consumed by external tools, and you have to integrate with other systems to keep the logs, because the logs in, say, a CDN or OTT streaming setup are massive. Usually you export them to Elasticsearch, Prometheus, this kind of database, to keep them for a longer time. Because the system doesn't keep the logs, the I/O on that part is very minimal, and that's why we are so efficient in handling requests: we focus on that part and don't waste I/O keeping logs. Here is the log, if you are familiar with Varnish, the varnishlog output, and how you extract the details: we extract all the information from the VSM. This is only one request, and you can see how it comes in and out and why, so if you're doing troubleshooting and you know the HTTP spec, you have all the details. And here are the logs and statistics: with VSC we do our own internal counters as well, so not only the logs but also the counters.

Then the caching API and the storage. I'll just skip the storage details, but I want to focus a little bit on the Massive Storage Engine, which is the way we store the cached content. Normally, when you write files to a cache, you have file tables keeping the file headers, the file names and the content. How does Varnish keep the cache? We actually split it into two parts: one part is the metadata and the other part is the actual cache body. Doing so, we have the flexibility to delete a cache object just from the metadata, not from the storage itself, which means we reduce the I/O a lot; only once a whole block of cache has been deleted do we use the metadata database to clean up the space. So we save a lot of I/O, use the storage more efficiently, and get less fragmentation.
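Going back to the ttl, grace and keep values mentioned a moment ago, here is a rough illustration of how they interact. It is a conceptual Python sketch of the behaviour described above, not Varnish internals and not VCL:

```python
import time

def cache_decision(entry, now=None):
    """Toy ttl / grace / keep decision for a cached object.
    entry = {"created": t, "ttl": s, "grace": s, "keep": s}.
    - within ttl: fresh hit, serve from cache
    - within ttl+grace: serve the stale copy (useful when the backend is
      down or slow) while a refresh can happen in the background
    - within ttl+grace+keep: too old to serve, but kept so a conditional
      request (ETag / If-Modified-Since) can revalidate it cheaply
    - after that: plain miss, full fetch from the origin"""
    now = time.time() if now is None else now
    age = now - entry["created"]
    if age < entry["ttl"]:
        return "fresh hit"
    if age < entry["ttl"] + entry["grace"]:
        return "grace hit: serve stale, refresh in background"
    if age < entry["ttl"] + entry["grace"] + entry["keep"]:
        return "revalidate with the backend (conditional fetch)"
    return "miss: fetch from the origin"

# obj = {"created": time.time() - 90, "ttl": 60, "grace": 120, "keep": 600}
# cache_decision(obj)  # -> "grace hit: serve stale, refresh in background"
```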
By using this approach, some systems can actually run for more than three or four years without a reboot; they're still running, the disk doesn't crash, and fragmentation stays very low. Invalidations I have to skip, so let's go straight to the web API stuff. For web APIs it's mostly RESTful, HTTP-based traffic: HTTP requests, POSTs, query strings and so on. The engine itself doesn't care what kind of query or POST parameters you have; we can cache everything as an index. So when you query a resource using GET or POST, these kinds of methods, it can be cached by Varnish Cache and shared across multiple requests; the details of how you control it, all that magic, are done in the VCL. Of course you can follow the Cache-Control headers or other headers, but most of the time, for API calls, we will override them and be very specific about how you want to handle it.

For these web APIs, some users use ESI, Edge Side Includes. That's like a template: the server just sends a template with a tag, like esi:include, and then Varnish will include another file. The system will cache it, but this dynamic content, with some programming logic, can dynamically, on the fly, trigger another request to generate a file. It's a bit like having JavaScript in your browser, except it's not on the client side, it's on the server side. You can also create other HTTP requests at the same time, not only the original request. For that we rely on another vmod, the http vmod: during each call we can create additional HTTP requests, just like this one: when a request comes in, I create another request, copy the headers, and then check the authentication.

Another one is Edgestash. It's a kind of template, just like a Go template, but it runs on the edge: you put in some logic, combined with a JSON object structure, and we generate the content on the fly. It's a bit like putting the web server on the edge, but not exactly the same; we aggregate content from different back ends. This one, for example, fetches the JSON objects from another application server, massages them, and then generates the layout. Yes, some of the payloads, say a thousand or two thousand entries here, it doesn't matter, we don't limit you; say a for loop or a nested loop, we don't limit how many loops you have. Of course there is some memory control, because it uses memory in your cache, so it's mainly about how much memory you give this workspace at startup: if you have more loops inside these block objects, you probably need more memory in the workspace; otherwise we don't care, there's logically no limit, the limit is the memory. Like the Go language templates, there's little computation; it's just a template, and then we aggregate everything together. We also support gzip, so if you fetch some content like a JSON object, we can compress and decompress it on the fly.

More integrations, for example Redis; this one is completely open source. If you want to keep track of all these sessions across the whole cluster, say the cluster has 10 to 20 PoP instances, you can use Redis to share the information across all the nodes. Just like this one: the Redis client is built as an additional vmod, what we call an additional module.
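Since the proxy treats an API call as just "method plus URL plus, maybe, body", the cache key is essentially an index over those parts. Here is a small illustrative Python sketch of such a key; it is my own example, not Varnish's actual hashing:

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl, urlencode

def api_cache_key(method, url, body=b""):
    """Toy cache key for web-API responses. Sorting the query string
    makes ?a=1&b=2 and ?b=2&a=1 share one cached object; hashing the
    body lets carefully chosen POST lookups be cached as well."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    body_digest = hashlib.sha256(body).hexdigest() if body else ""
    return (method.upper(), parts.netloc, parts.path, query, body_digest)

# api_cache_key("GET", "https://api.example.org/products?page=2&size=10")
# api_cache_key("POST", "https://api.example.org/graphql", b'{"query":"{items{id}}"}')
```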
It talks to the underlying Redis server, and then, okay, I want to get some session data and update the shopping cart here on the fly. It's more like programming on the edge, and it's actually more efficient than doing it in, say, NGINX or PHP code. Then there is some configuration around networking, controlling the speed, and rate limiting. This one is about rate limiting: you can do rate limiting, filtering, checking what kind of methods you want to accept, the kind of security things you can do at the edge in Varnish. Checking rules on the incoming request, or on the method, is similar, but this one is more about the HTTP standard, the specification: if your request comes in with English, Chinese or Korean, you can block it, accept it, or create different versions of the cache; this one affects how you cache the objects. There are more vmods that we can plug into your system, for example a vmod to manipulate the response: the response might be an XML or an HTML file, and you can replace it or update it. And mmdb, like a geo-blocking database or a device database: based on the device ID or the location you can make additional decisions. The http vmod, which we showed, sends additional requests, additional API calls. JSON objects: you can parse the JSON on the fly without doing regular expressions or anything like that. JWT, JSON Web Token, is a built-in module so you can verify the authentication, the token itself, to allow or disallow the request. The directors are how you manipulate or control the backend traffic across multiple back ends, and a synthetic back end is more like generating a new page or a new response other than from the back end. I think that's it from my end, so thank you very much. Any questions before we wrap up? Thank you guys. These are only examples; there are a lot of other modules you can integrate with Varnish Cache, and mostly the next ones are open source, because there are lots of things happening all the time. Do you have any existing project you are trying to connect? Then we can talk about it.

Sure, yes. GraphQL, actually, works by default; it just depends on what kind of queries you want to cache. There is no magic, because GraphQL, to us, is just like an HTTP request; there is nothing specific we need to handle. The one thing where we found some difficulty, and it is not really a difficulty, is deciding what you want to cache and how long you want to cache it; that is the only thing, otherwise it's straightforward. We had some issues with some integrations, but nothing too difficult to solve.

Yes, exactly, that's all you need to do; we can support it. That's why there is something you need to change, or you can adapt it: you change the POST to a GET, or we can allow you to do the POST. If you do the POST, that means you have a body on the request, and the request body is usually ignored by default, so you need to open up the parameters to allow how much memory you want to spare for the POST object; that is something you need to consider.
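Coming back for a second to the rate limiting mentioned earlier: conceptually it is the classic token-bucket idea. A minimal Python sketch of that idea, not of the vmod's actual API:

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter of the kind used at the edge:
    allow `rate` requests per second with bursts of up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.stamp = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # refill proportionally to the elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# per_client = TokenBucket(rate=100, burst=20)
# if not per_client.allow():
#     ...  # respond with 429 Too Many Requests
```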
There are two versions. The open source version still relies on Hitch; if you know the Hitch project, it's actually under Varnish Software as well. It's a separate daemon that only does HTTPS termination, TLS termination. The enterprise version has TLS termination built in; we don't need to do anything, we just bring it up and it works, no problem. I think it's already there; I haven't tested it on the new version, but our new open source release, version 7, already supports that one. Yes, version 7; on the open source version they claim they have already tested it, but I haven't tested it myself, and the enterprise version is still based on the previous version, 6 or something, so it's not the latest one. On the ESI syntax: we don't implement the whole ESI specification, we implement most of it, because the rest of those functions we can do with other vmods, which are more efficient than using the ESI functions. Okay, cool, any other questions? Yeah, okay. Yes, we rely on our partners to provide the platform, but we work with Intel very closely; actually, the coming extra slides are about the Intel stuff. We do a lot of NUMA integration to speed things up and use the CPU more efficiently, especially in the cloud environment, because in the cloud they are not really NUMA-aware. When you have the I/O and the NIC all together, you know, 10 gig, and we are talking about 1.4 terabits per second on one server, it's not just the memory: it's the memory and the I/O and the system together. Exactly, exactly, yes. Okay, thank you very much guys, thank you.

So, Werner Vogels, the CTO of AWS, famously said: everything fails, all the time. So for Werner Vogels the question was not how to avoid failure; the question was how to handle failure. I am Dominik Tornow, I'm a principal engineer at Temporal, and I focus on systems modeling, specifically formal modeling and conceptual modeling. I work at Temporal, and Temporal is an open platform for durable execution, and durable execution is to a distributed system what a transaction is to a database: an abstraction that enables you to build an application as if failure doesn't even exist. Now, to provide this guarantee, Temporal built up a lot of expertise in failure handling, and today I'm super happy to be here at the FOSSASIA Summit to talk about handling failures from first principles. A first-principles approach breaks a domain down into its basic principles and then builds an understanding from those basic principles, instead of relying on unexamined assumptions or on conventional wisdom. So in this presentation we will think about failure and failure tolerance holistically. A failure is an event in a system: failure refers to an unwanted but nevertheless possible event. Failure tolerance is a guarantee of a system: failure tolerance refers to the guarantee that the system behaves in a well-defined manner even in the presence of failure. In other words, if a system is failure tolerant, then the system trivially guarantees total correctness in the absence of failure, but it also guarantees at least partial correctness in the presence of failure. And if the system is actually able to guarantee total correctness even in the presence of failure, we speak of failure transparency. Now, failure transparency is obviously the most desirable property, but it's not always possible. Think for example of the CAP conjecture; for this database-heavy crowd, think of the CAP conjecture. The CAP conjecture states that for a replicated data store you have to choose between consistency and availability in the event of a network partition. So in the absence of a network partition, and the network partition is the failure, the unwanted but nevertheless possible event, the system is able to guarantee total correctness: it is able to guarantee both consistency and availability.
But in the presence of a network partition, the system is only able to guarantee partial correctness: we have to choose between consistency or availability. The good news is, at least you get to choose. As the designer of your system, you get to choose what failure tolerance means to you; you get to choose the guarantee you want and that is important for you; you get to choose whether you prefer consistency or availability. So failure tolerance is a design decision. Now, in order to talk about failure holistically, about what kinds of failures we expect, what kinds of failures we need to tolerate, and what guarantees we need to make in the presence of a failure, we first need to look at the underlying system model where the failure actually plays out. A system model is a set of assumptions about a system. Algorithms and protocols that are correct under one system model may not be correct under another system model; any deviation may render an algorithm or protocol incorrect. You can think of a system model a bit like a board game: the game sets the stage and the game sets the rules, and as a player you have to devise a strategy to achieve the game objective within the constraints the game sets for you, and even a slight change to the rules may render a player's strategy completely ineffective. That happens a lot with expansion packs. Now, for this presentation we will think in terms of a very popular system model in a cloud environment, in a microservices environment, and that is service orchestration. A system is a collection of processes, one process is a sequence of steps, and one step is a network call to an upstream service. Here, a single service call has transaction-like semantics: it's atomic, it either happens completely or it doesn't happen at all. However, the sequential composition of service calls does not have transaction-like semantics; the sequential composition is not atomic out of the box. But we also want the sequential composition to be atomic: we want total application, not partial application. So in the event of a failure we need to ensure that the process executes in one of two ways: either observably equivalent to exactly-once, total application, or observably equivalent to not at all, no application. The classic example is certainly travel booking, and many of us traveled to be here, so our credit cards were charged when we booked hotels and flights. Each step is in itself atomic; however, we also require the composition of the steps to be atomic: we expect exactly one charge, one room reservation and one ticket. To keep things simple for the rest of the presentation, whenever we need a concrete example, let's talk about charging the credit card, that charge-credit-card service call. So what failures do we need to tolerate, what could go wrong? Well, it's a microservices environment, it's a network call, so the request may be lost in the network; the service may crash before the computation takes effect, before the credit card is charged; the service may crash after the computation takes effect, so after we charged the credit card; or the response may be lost in the network. And in the absence of a response, we actually don't know whether the intended effect happened or not: we cannot distinguish whether the failure occurred before the computation took effect or after it took effect, so we may end up in an inconsistent state. And additionally, the computation may simply return a failure, raise an exception, like an insufficient-funds exception: there is a response, and the response itself indicates a failure.
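One common mitigation for that "response lost, outcome unknown" case, which the talk does not go into here but is worth seeing once, is an idempotency key: retries reuse the same key, so at most one charge can take effect. A hedged Python sketch, where `payment_api.charge` is a hypothetical client call, not a real SDK:

```python
import uuid

def charge_with_idempotency(payment_api, card, amount_cents, retries=5):
    """Retry a charge safely: the idempotency key is generated once and
    reused on every retry, so the payment service can deduplicate and
    replay the original result instead of charging twice."""
    key = str(uuid.uuid4())
    last_error = None
    for _ in range(retries):
        try:
            return payment_api.charge(
                card=card,
                amount=amount_cents,
                idempotency_key=key,   # same key on every attempt
            )
        except ConnectionError as exc:
            last_error = exc           # outcome unknown: retry with the same key
    raise RuntimeError("charge outcome still unknown") from last_error
```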
Okay, now what are we going to do, how are we going to handle that failure? Failure handling always consists of two components: failure detection and failure mitigation. The first component of failure handling is failure detection; it refers to the mechanism that detects whether a failure has occurred. Now, generally I struggle a bit with the notion of failure detectors in distributed systems: most authors focus on detecting component failures or omission failures, crashes, but I like to cast a bit of a wider net. So when I think about failure detection and failure detectors, I generally think about witnesses: a predicate that confirms the presence or the occurrence of a failure. A very common example of a witness is an exception: the system itself tells me something went wrong. But also timeouts: we're waiting for something and it doesn't happen; that is also a pretty good indication that a failure has occurred. It is not certain, but it's a good indication. Now, the second component of failure handling is failure mitigation; it refers to the mechanism that actually addresses or resolves the suspected failure. Broadly speaking, especially for our scenario, there are two failure mitigation techniques: one is forward recovery and one is backward recovery. Remember, the process is a sequence of steps and any partial execution is undesirable; therefore, in the event of a failure, we need to ensure that the process executes in one of two ways: observably equivalent to successful, total application, which is what forward recovery is responsible for, or observably equivalent to no application, which is what backward recovery is for. Let's look at forward recovery first: in the case of a failure, we just move the process forward. More formally, we transition the system from an intermediary state to the final state, and as a rule of thumb, we need to repair the underlying failure, because we try to push past it; we need to resolve that failure. Forward recovery is a very common platform-level failure mitigation strategy: we simply retry. Something goes wrong, let's do this again; something goes wrong, let's do this again. Next, let's look at backward recovery: in case of a failure, we roll the process backward, or, more formally, we transition the system from the intermediary state back to its initial state, and as a rule of thumb, we don't have to repair the underlying failure, because we're not trying to push past it. Backward recovery is a very common application-level failure mitigation strategy: we compensate, we undo what we already did, we reverse the charge on the credit card.
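Backward recovery is often implemented as a saga: each step carries a compensating action, and on failure the completed steps are undone in reverse order. A minimal, illustrative Python sketch (the booking functions are placeholders, not real service calls):

```python
def run_with_compensation(steps):
    """Execute (do, undo) pairs in order; if any step fails, run the
    compensations for the already completed steps in reverse, so the
    overall outcome is 'not at all' instead of a partial booking."""
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            undo()          # e.g. refund the charge, cancel the room
        raise

# run_with_compensation([
#     (charge_card, refund_card),     # placeholders for real service calls
#     (book_hotel,  cancel_hotel),
#     (book_flight, cancel_flight),
# ])
```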
In order to choose the ideal failure handling strategy, what we are going to do when a failure occurs, we also need to take the class of the failure into account: failure classification. Now, obviously there are hundreds of different ways of classifying failures, but here I want to focus on two orthogonal dimensions: the spatial dimension and the temporal dimension. On the spatial dimension, we can classify a failure as an application-level failure or a platform-level failure. To do so, we need to think about a system in layers: components at a higher layer make calls, down calls, to components at a lower layer, and generally they are expecting a response. The end-to-end argument states that in a layered system, failure handling should be implemented in the lowest layer, looking from the top down, that is able to correctly and completely handle failure detection and failure mitigation. So a failure can be classified as either an application-level failure or a platform-level failure depending on the lowest layer that is able to detect and mitigate it. For instance, an insufficient-funds exception indicates an application-level failure: the application level is the lowest layer capable of correctly and completely resolving that failure; that failure is completely meaningless on the platform level. But a could-not-connect exception indicates a platform-level failure: although the application itself could potentially mitigate that failure, the lowest layer capable of correctly and completely mitigating it is the platform layer, which can simply retry, retry on the network.

On the second dimension, the temporal dimension, we can classify a failure as transient, intermittent or permanent. A failure is transient when we can assume that the probability of a second failure after the first failure is not elevated. Formally, a transient failure is defined by two characteristics: first, the probability of a failure f2 occurring after a failure f1 has already occurred is the same as the probability of f2 occurring on its own; and second, transient failures are auto-repairing, they repair themselves, otherwise they are by definition not transient, so we do not need any intervention. In our example, if the cause of the could-not-connect exception is, say, a router restart, then that is a transient failure: the failure repairs quickly, and once the router restarts, the connection can be made. The second class is an intermittent failure, where we can reasonably assume that the probability of a second failure is elevated. Formally, an intermittent failure is also defined by two characteristics: first, the probability of a failure f2 occurring after another failure f1 has already occurred is higher than the probability that f2 occurs on its own; and second, intermittent failures are also, by definition, auto-repairing and resolve themselves without any intervention. In our example, if the cause of the failure is an outdated routing table, then the could-not-connect exception may be an intermittent failure: this type of failure auto-repairs, but with some delay; as soon as the router updates its routing table, the connection can actually be made. And if a failure is permanent, we can reasonably assume that a second failure is certain. Formally, a permanent failure is defined by two characteristics: the probability of a failure f2 occurring after failure f1 occurred is 100%, and, by definition, permanent failures require manual intervention, they require manual repair. In our example, if the cause of the failure is an expired certificate, then the could-not-connect exception is a permanent failure: that failure doesn't auto-repair, somebody has to come and install a new certificate, otherwise it doesn't go away.

Now, with all of this together, with this particular system model and this particular failure model, what could an ideal failure handling strategy look like? And also, what is ideal? What is ideal for me may not be ideal for you; it's a design decision. But let's look at one that I think is a reasonable failure handling strategy. In the event of a failure, first let's assume it's a transient platform-level failure, so we retry immediately; let's not wait, immediately retry. If the retry succeeds, we have successfully mitigated the failure. Done.
If the immediate retry does not succeed, we just need to upgrade our understanding of the failure. Now let's assume that the failure is intermittent, yet still platform-level, still auto-repairing, so we retry a bounded number of times, and typically we do that with exponential backoff. Again, if one of the retries succeeds, done, we successfully mitigated the failure. If none of the retries succeeds, we once again upgrade our understanding of the failure: we can still assume that the failure is a platform-level failure, but we now need to assume that it is permanent. So the process basically suspends and awaits the repair of the underlying failure; we need manual intervention. If that does the trick, once again we mitigated the failure. If nobody repairs the failure in time and manual intervention doesn't happen, we need to upgrade our understanding of the failure once again: we now treat it as an application-level failure, and we are ready to compensate, rolling back whatever already happened. If the compensation is successful, we successfully mitigated the failure. If the compensation is not successful, we're in the worst place to be, and then we basically have to escalate to a human operator. If we charged the credit card but we cannot roll back the charge, we cannot undo the charge, somebody has to sit down, write a check and mail the check, or however we're going to resolve it, but we have to resolve it outside of the system, at the level of a human operator.

Now, as a conclusion: failure, failure handling, guaranteeing failure tolerance and working towards failure transparency can be a super intimidating topic, but it helps me a lot to take a principled approach, to think about failure holistically, and then implement the failure handling strategy with confidence. If you want to get a head start, check out temporal.io: Temporal takes a principled approach to failure handling and implements the concepts we have explored today on a platform level, guiding you towards an ideal failure handling strategy for your distributed systems. And with that, thank you very much for joining; I'm super happy to answer any questions you may have in person, or feel free to reach out to me online, for example on Twitter at @DominikTornow. Thanks again, thanks again for being here.
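Putting that escalation ladder in one place, here is a rough Python sketch of the strategy as described: immediate retry, bounded retries with exponential backoff, waiting for repair, compensation, and finally escalation. It is only an in-memory illustration; `call` and `compensate` stand in for real service operations, and a platform like Temporal would persist this progress rather than loop inside a single process:

```python
import time

class EscalateToOperator(Exception):
    """Raised when every automatic mitigation has been exhausted."""

def handle_step(call, compensate, retries=5, base_delay=0.2, repair_wait=600):
    # 1) assume transient: retry immediately, then
    # 2) assume intermittent: bounded retries with exponential backoff
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt:                      # the first retry is immediate
                time.sleep(base_delay * 2 ** attempt)
    # 3) assume a permanent platform-level failure: suspend and await repair
    deadline = time.time() + repair_wait
    while time.time() < deadline:
        try:
            return call()
        except Exception:
            time.sleep(30)
    # 4) assume an application-level failure: backward recovery
    try:
        compensate()
        return None
    except Exception as exc:
        # 5) last resort: resolve it outside the system, by a human
        raise EscalateToOperator("compensation failed") from exc
```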
Yeah, so Temporal is an open-source project, and it is a platform for durable execution. I like to contrast durable executions with volatile executions. Volatile executions are just simple function executions; think of any simple function: it only provides weak execution guarantees, the function may crash or the function may time out, and that may lead to a partial application. Temporal gives you durable executions, and durable executions are function executions with strong execution guarantees: the function execution cannot fail and the function execution cannot time out. And yeah, it's an open-source project, check it out at temporal.io, please.

Yeah, it cannot, yes, correct. So you are correct, it cannot know; it is an impossibility result in distributed systems that a failure detector is either complete or perfect. In this case, the approach we take is that we suspect a failure even if the upstream component is just slow, and therefore any stragglers, any delayed responses, will be, I'm sorry, discarded. But you are entirely correct: you have to deal with the fact that the computation of that request may actually have happened. So your system must be able to either roll forward or roll backward, whether the computation took effect or didn't take effect, which is actually not easy to do, right? It's basically impossible to do on a platform level unless you know the semantics of the operations, like a database does: a database knows what a write does, so it can undo a write generically. But for any service call, we don't know what the undo operation of a credit card charge is; we don't know that on a platform level. So it requires the cooperation of the application programmer, and that's actually quite a feat, quite a challenge.

Yeah, so in the system model that Temporal takes into account, there are certain failures that we can handle on a platform level, and these are transient failures and intermittent failures, failures that can be resolved by a retry, by forward recovery. We can handle those completely on a platform level, and they do not require the cooperation of the developer, although, since we're retrying, we do require idempotence on the upstream service calls. But as soon as we talk about application-level failures, like the insufficient-funds exception, there is no way we can push through that on a platform level, so at that moment we escalate to the application level, and these are the exceptions that you have to take into account in your code. So the compensation can still be found in the code; compensations are often called sagas. But on a platform level we can take care of transient and intermittent failures, and we can also help you deal with permanent failures by not abandoning the function execution. Usually, when a function execution encounters an exception, it just goes away, and if in doubt, you don't even know that it happened, it just went away. Temporal durable executions do not go away: they suspend at the failure point; it just sits there and waits. So I can go in, if I'm still within the timeout of the overall durable execution. Some of our durable executions run hours, days, weeks, months; we have actual users that run them over the course of years, servicing 30-year-long loans with one durable execution. It suspends at the failure point, then you come in, you fix it, and then it just resumes as if the failure never happened; it's transparent to the code. It's actually pretty slick. Thank you, yeah, please. Thank you again, thank you very much for being here, and please come find me, we can talk downstairs in the hangout area, and I also have Temporal stickers if anybody wants a sticker. Thank you.

Let me try to do this, okay. Oh, we are already in. Hello everyone, good afternoon. Thank you for joining FOSSASIA all the way; and if anybody wants a tip to get a talk into any summit, just try and put the summit name in the title, that typically works for me. So my name is Yogi. A few things you will realize from this slide: I'm fond of emojis, I like to travel, I've been to all those countries and I'm always exploring ways to get to new countries, and just below that you'll find all my hobbies. I'm Indian, so cricket comes naturally to me. I've been living in Singapore for almost two decades now, and currently I'm a solutions engineer at YugabyteDB.
engineer at YugabyteDB. Just in case you didn't realize, it's not YDB; YugabyteDB and YDB can be a little confusing sometimes, but we are YugabyteDB. I work with our YugabyteDB partners and customers across Asia, so India, Singapore, the Philippines, Indonesia and so on, and I have been involved in a lot of cloud native communities, and in app development and cloud native platforms in the past. The QR code is my contact if you want it, and those are all the places I have worked over the past two decades.

Why are you here? Just in case you're not sure, this is the topic: we're talking about Kubernetes, Kafka and everything in between. I'm going to have some slides, probably about 150-ish... no, I promise it's going to be only 40, but I'm going to do a live demo; hopefully it works. I did not give my sacrifice to the demo gods this morning, so fingers crossed.

A lot has happened in the last decade or so; we went through a massive shift in paradigm. People are moving towards cloud native applications and cloud native platforms; we're talking about Kubernetes, containerization, moving to the public cloud, moving back from the public cloud, moving to microservices, moving back from microservices. It's all happening, and it seems very cyclical. If you were here for the earlier two talks, this topic is quite well placed, because it fits into the same story around DevOps and consistency. When we talk about anything in technology today, the easiest way to start is with an app that solves some user problem. An app may mean different things to different people: if you ask somebody born in the last decade and a half, an app is something on an iPhone or an Android device, maybe an iPad; I'm a kid of the 80s and 90s, so for me it's a piece of software, a program. In essence, an app is a piece of executable code. To run it you need compute, and that compute can come from a virtual machine or a pod. Your application doesn't live alone either. By the way, one more thing you might notice about me: I like Minions; it's a competition between me and my daughters, you saw the pictures earlier. Applications typically have their own supporting services they require: it could be some networking capability, or some organizational compliance-enforcement practices. And in enterprises an application is never run as a single instance; you normally run multiple instances of it. So far so good. Now, in modern applications, in the cloud native world, we have a container orchestrator, an infrastructure API provider like Kubernetes, which takes care of taking your containerized code, your application, and running it on the actual hardware; it could be in the cloud, it could be on premises. And apart from all the sidecars and compliance capabilities, you typically have certain supporting services: you may have messaging
systems, you may have database systems and everything in between, authentication, identity providers, all sorts. As we move towards a more and more cloud native architecture, we try to move as much as possible onto the cloud native platform, Kubernetes; when I say cloud native platform, for me it's the same as Kubernetes. Now this gets interesting, because you typically have multiple sites, multiple regions, multiple data centers, or cloud regions and things like that. So everything we've spoken about so far, if in your mind you're going "okay, I can run this command to deploy this, and keep a copy of it with that command," all of that is now doubled, or tripled in some cases.

Something I always get asked is: what are you talking about, running a database on Kubernetes, are you insane? Yes, I am talking about running a database on Kubernetes. Why should you actually run a database on Kubernetes? Kubernetes has facilitated better resource utilization: it allows you to use your compute, your infrastructure, in a more efficient manner. Rather than dedicating compute to particular applications or pieces of code, you can treat it more like cattle. Can anybody tell me the difference between cattle and a pet? A pet has a name, it's Goofy or something, but with cattle, when I was a kid living in India we had five cows and none of them were my pets: I'd just go milk one of the cows, I didn't care which one. That's the difference: you don't have a direct attachment, a name, or a strict dependency in the cattle mindset. That's what Kubernetes provides: the capability to run a variety of applications in a very standardized manner, and not just to run the application but also to connect it to the underlying infrastructure, whether that's network, storage, identity providers or many other things. This lets you use the same compute for running multiple databases. I'm sure all of you have worked in some enterprise at some point in your career: if you request a new database, what's the fastest you've gotten it, if it's not on Kubernetes or in the cloud? The cloud, that's cheating; let's talk about something more traditional. I have a giveaway, by the way; every time somebody answers, you get a giveaway. Three months? That's fast; I have seen even longer than that. With Kubernetes you can spin up databases very quickly, and if I have enough time I'll try to spin one up in front of you. That's one. The second is dynamic resizing: you can resize your workload based on demand; if demand is higher you scale up, if it's lower you scale down. You also get extremely good portability between clouds and on-prem if you run things on Kubernetes. I'm going to go a little faster because there was an issue with the timing. And more importantly, it provides very good orchestration
for your entire infrastructure: not just running your application, but connecting it to a variety of storage and so on, and of course it allows you to automate a lot of day-two work. Backup, recovery, restore, upgrades, all of that is day-two work. But anything that has benefits has downsides, and these are some of the downsides of running things on Kubernetes. You have to be aware of the stateful nature of your workload, because your application and your data may not end up on the same node; that's a bit of a problem, and you can address it with a distributed database. Persistent storage, especially for databases, is always a problem, but that too can be addressed with a distributed database. Another complexity is reliable access: you need load balancing to your database even inside Kubernetes; if you run in the cloud you have cloud-based options, if you run on premises you have to solve it yourself. There is also significant networking complexity, especially with multiple clusters: if you have two Kubernetes clusters running in two different data centers, networking between them is going to be quite challenging, so be aware of that. Within a single cluster, if your application and your database run on the same cluster, it's pretty easy.

So essentially what I'm saying is: don't just run a database on Kubernetes, run a distributed SQL database on it, and that is exactly what Yugabyte is. YugabyteDB is a distributed, open source, transaction-oriented database which is one hundred percent open source, and it's pronounced Yugabyte. A lot of the gotchas and limitations I mentioned earlier are addressed by Yugabyte: it eliminates some of the need for external load balancers, it automates a lot of the day-two work you might otherwise run into, and it gives you scalability; if more traffic is coming in you can increase the number of nodes, and if traffic has tapered off you can decrease them. You can read and write on all the nodes. Let me repeat that: you can read and write on all the nodes in YugabyteDB, which is a stark contrast to a traditional RDBMS, without compromising ACID transactions; you can run the transactional applications you are used to, with RDBMS capabilities, tables and transactions. A small footprint of Yugabyte looks something like this: you typically have a small management console which provides the management capabilities, and you have database nodes, typically a minimum of three because it's distributed; three, five, seven are all good numbers, but three is the minimum. What we have created is a distributed transaction and storage layer, which provides automatic sharding and load balancing, and on top of that a pluggable query layer that provides RDBMS capabilities akin to Postgres and NoSQL capabilities akin to Cassandra, so in a single cluster you can serve both kinds of data models with YugabyteDB.
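Because the query layer is wire-compatible with Postgres, talking to YugabyteDB from application code can look exactly like talking to Postgres. A minimal sketch, assuming a node reachable on localhost:5433 (YSQL's default port) with the default yugabyte database and user; adjust the connection details for your own cluster.

```python
# Minimal sketch: using YugabyteDB's Postgres-compatible YSQL layer with an
# ordinary Postgres driver. Connection details are assumptions for a local node.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5433,
    dbname="yugabyte", user="yugabyte", password="yugabyte",
)
with conn, conn.cursor() as cur:
    # Plain SQL works because the query layer speaks the Postgres wire protocol.
    cur.execute("CREATE TABLE IF NOT EXISTS orders (id serial PRIMARY KEY, item text)")
    cur.execute("INSERT INTO orders (item) VALUES (%s)", ("coffee",))
    cur.execute("SELECT count(*) FROM orders")
    print(cur.fetchone()[0])
conn.close()
```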
Now, what does a YugabyteDB deployment look like on Kubernetes? We use StatefulSets. If you are not familiar with Kubernetes, a StatefulSet is a Kubernetes construct for running stateful applications, so the ordering of pods and nodes is taken care of automatically by Kubernetes. We have two sets of pods: one is the master, which is really a metadata server. Master is a very bad name; the moment you say master, people assume everything gets copied through it. No, it's a metadata server. The other is the T-server, the tablet server. The name comes from our sharding: every table we create is split into shards, each shard is called a tablet, hence tablet server. You can horizontally or vertically scale each of these components. As I mentioned, any time we create a table it is split into shards called tablets, and these tablets are distributed across multiple nodes; all the data you store is replicated synchronously, not asynchronously but synchronously, to the other nodes.

Let me try to show you a very quick demo; I'll probably have to sit down for this. Any questions so far? How much time do I have, ten minutes, okay. On the question about YDB: I'm not sure about YDB; one thing I remember from my understanding of it is that it's based on a MySQL-style API, while we are based on the PostgreSQL API. In terms of replication, we do synchronous replication of data within a cluster, and asynchronous replication across sites, across clusters. Collect your giveaway later, it's here.

All right, let's look at a database I have running here. Can everybody see the screen, is it legible? I mentioned that we run pods as StatefulSets, so this is my database running on Kubernetes: you can see I have master and T-server as my StatefulSets and then a bunch of pods running here, matching the architecture diagram I showed you earlier. Now I can run a sample application that keeps inserting data into the database; it's a workload-generator application, so we'll do that. This is pretty much bread and butter, this is fine. Next I'm going to take down one of the nodes of my database, and the application should continue to work without any issues. It's actually tough to cause a failure in Kubernetes, so my trick is to reduce the number of replicas: right now I have three replicas, I'll reduce it to two, and let's see what happens. Okay, I reduced it to two, and something interesting happened: the one in-flight transaction that was connected to that particular node got disconnected, but the rest of the system just continued to function. There is a delay of about three seconds for the cluster to rebalance itself after the node failure; if you look carefully, it took three seconds to come back to a stable state where the nodes can see each other. We also have a small UI to go along with it, and by the way this is all open source; I'm not using the enterprise version, just the open source version. This shows the three master nodes and the three tablet servers; the one I scaled down is starting to miss heartbeats now because it's down. By default it will wait about 15 minutes for that server to come back.
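The "reduce the replicas to simulate a failure" trick from the demo can be scripted as well. Here is a small sketch using the kubernetes Python client, assuming a local kubeconfig and a tablet-server StatefulSet named yb-tserver in the default namespace; both names are assumptions that depend on how the chart was installed.

```python
# Sketch of the failure-simulation trick from the demo: shrink the tablet-server
# StatefulSet by one replica, then scale it back. Names are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale_statefulset(name: str, namespace: str, replicas: int) -> None:
    # Patch only the replica count; everything else stays untouched.
    apps.patch_namespaced_stateful_set_scale(
        name, namespace, {"spec": {"replicas": replicas}}
    )

scale_statefulset("yb-tserver", "default", 2)  # take one node away
# ... observe that the workload keeps running, then restore it:
scale_statefulset("yb-tserver", "default", 3)
```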
If it doesn't come back within about 15 minutes, it will be declared dead, and any new server that comes up will be bootstrapped with the data. But in the meantime my application just continues to run, no problem, without any downtime. I'm going to scale this back up, and what I should see here is interesting: some of the tablets had no leaders on this particular node while it was away, because it wasn't capable of serving any data. Automatically, within a few seconds, the cluster rebalances itself and the leadership role is restored on that node for some of the tablets, some of the shards. That's the thing with live demos, they can take a while, but there you see it: earlier this number was 5/0, now it's 5/1, and it may get rebalanced to 5/2 as well. So this is how YugabyteDB, being a distributed SQL database, is able to run in a cloud native environment like Kubernetes, sustain failures, and still give you transaction guarantees, ACID transactions.

Let me see, yes, we saw this slide. Just in case it wasn't clear: say you have four pods running; a table is automatically split into shards, and the shards are scattered across those pods. The number of copies of each shard depends on the replication factor you define when you create the cluster, and one copy becomes the leader for that shard group. All reads and writes are served by the blue tablets, the leader tablets, and leader election and everything is automated. If you want to try it out, we have a completely 100% open source version, we have a Docker container, and we can even run on ARM machines; in fact, on AWS we are the only certified distributed database for Graviton, so you can try that as well. We also have our enterprise offering, YugabyteDB Anywhere, which provides a nice portal for day-two operations, multi-cluster management and so on, and a fully managed cloud offering you can use for your projects. You can sign up for a free cluster today, and you actually get a free cluster for life, just to try it out; you can also arrange a full-blown demo, not a small three-minute demo like this. Just scan the QR code, fill in the form, and you get a ten-dollar Grab voucher. Any questions? Okay, I'll leave it at that and I'm happy to take questions now.

Yes, you can use it on bare metal. We also make migration very easy: we have a tool called Voyager, completely free and open source, that you can use to migrate over from MySQL, Postgres and Oracle. On the third question: yes, you will need to refactor your application and there will be an ETL that has to happen. Good questions. You first: yes, if you have an application that currently uses Postgres it should just work; in fact you can use the same Postgres driver. We also have a smart driver, because we are a cluster, so like any other clustered database there is a cluster-aware concept around it, but you can still use the plain Postgres driver. Yes, correct, we have not solved that problem: latency is physics, the speed of light is the speed of light. But in the case of asymmetric latency, say, for example,
three nodes where one is five milliseconds away and the other is ten milliseconds away, you will effectively be waiting about five milliseconds, because we wait for a majority of the cluster to acknowledge the data. So there is that optimization, but no, we are not immune to underlying network latency; it becomes part of your transaction. Good question. Our Helm chart is the most mature path, I would say, but we are working on an operator as well; we had an operator, but it's not as advanced. Yes, I can give you numbers: there is a published study on the Yugabyte website with Temenos, a core banking vendor. They managed about 450,000 queries per second across 39 nodes in three availability zones on AWS; that's their benchmark, and they said it was about 40 percent better than what they had before. Yes, your per-transaction latency will be higher because of the synchronous replication. What we have observed is that ideally we can bring it down to about 3x of a single Postgres database. The database itself doesn't add much; the writes themselves are not the issue, we suffer from the latency of the underlying network. If the underlying network has sub-millisecond latency, like a local network, you should see at most about 3x per transaction, and of course we compensate with throughput. I'm happy to discuss the use case, because we have seen various cases where we could actually provide better throughput: one issue you will find with Postgres is that, because of the single master node, the number of simultaneous connections you can make to that node is limited, so you are limited by that number, whereas we go horizontal and at least increase the number of connections.

Yes, there's no exclusive lock; we are a lock-free database, first of all. Second, if your commit has finished, the data has been resiliently stored across the cluster at that very instant. Under the hood we use SSTables and memtables, and we have a write-ahead log (WAL): if a node crashes and comes back, it reprocesses from the WAL anything that had not gone through. The point is that if the commit has happened and control has returned to the client, the data has been resiliently stored, and some other shard replica that takes up the leadership role will have your latest commit. In fact, I didn't mention it in the presentation: we practically have a zero RPO, absolutely no data loss in the face of a failure; if your transaction has committed, the data will not be lost, and the RTO is about three seconds. Now, that configuration was just for the demo portal; it was a minimal configuration, I would say, and depending on the number of transactions you are looking at and the amount of workload, the configuration would be bigger. Any more questions? I still have three more things to give away. Yes, exactly, the application doesn't have to worry about which server to connect to, and as an app developer that, for me, is the big thing.
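The majority-acknowledgement point from the latency answer above is easy to make concrete: with replication factor 3 a write needs acknowledgements from two of the three replicas, so the effective wait is set by the second-fastest replica rather than the slowest one. A tiny illustration using the example numbers from the answer (not measurements):

```python
# Effective write latency under majority (quorum) acknowledgement.
# With RF=3 the leader needs acks from 2 of 3 replicas, so the wait is
# bounded by the 2nd-fastest replica, not the slowest one.
def quorum_latency(replica_latencies_ms, replication_factor=3):
    quorum = replication_factor // 2 + 1          # e.g. 2 for RF=3
    return sorted(replica_latencies_ms)[quorum - 1]

# Leader's local replica (~0 ms), one replica 5 ms away, one 10 ms away:
print(quorum_latency([0, 5, 10]))  # -> 5; the slowest replica is not waited on
```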
Not having to bother about which server to connect to, which cluster to write to, none of that; just connect to any one of the nodes, write the data there, and it will be there. Okay, are we out of time? Thanks a lot.

It's a nice coincidence for me as the host: I'm a Datadog user, and a lot of us who work with cloud or logging use Datadog, so I'm observing those data pipelines myself. Great topic, thanks, and please welcome the speaker.

Thank you. Just a show of hands: how many of you are SREs? Developers? Who has built stuff in Rust? That's a curveball, and wow, you caught that one, good. Data pipelines: anyone built one, seen one recently, anyone heard of them at all? Of course, thank you. So that's how we'll start: we'll talk about observability of data pipelines. My name is Hong, I'm a sales engineer at Datadog, and this is how we can keep in touch on social media and email.

We'll start with current practice: how do you monitor data systems today? One key thing to note is that we're looking out for the SRE golden signals: latency, meaning how much lag your system is encountering as a result of the volume of requests sent to it; errors, and whether they are healthy, tolerable, within a range you can manage; traffic, the request rate; and saturation, how saturated the system is, utilization-wise. All of these indicate the health of your application. One key thing about the data systems you observe is that the data is huge, that's why we call it big data, and you want to get more out of your observability data; that's where the concept of a pipeline can help. The evolution of the pipeline has reached a stage where you get empowerment across the departments of an organization: multiple personas can make changes to the data pipeline. Your sets of observability data can have different owners; different personas interact with the data, perhaps transform it and enrich it. Nothing new here: you may have heard of Kafka, you've heard of Databricks, you've heard of data lakes, for instance. It takes something to put together an observability platform capable of observing the key signals we talked about earlier while also empowering the different departments that interact with those data sets. All in all, the idea is to improve quality control on the data, and you can see the example shown here: Kubernetes clusters being observed, data streamed across different departments, with different personas in those departments having a say in how the data is manipulated and used; we'll see a demo of that shortly. Here's an example of why that philosophy helps in observability for DevOps and SRE projects today: it's called democratization, democratizing the data platform itself, where you have a means for, say, the security team to interact with the SREs of other backend teams, as this example portrays, making changes and manipulating the data. It could be simply tagging certain data sets and saying this came from a very insecure, vulnerable app; that's what the security team does. And then the governance, risk and compliance team can categorize it, put it into the data warehouse, and make it easy for other teams to reference and visualize in their own dashboards.
So let me introduce an open source technology, courtesy of Datadog: it's called Vector, and we provide the support; I'll share the link to the source code on GitHub in a short while. The whole idea is that, regardless of your sources and sinks, your endpoints, Vector is an easily configurable, installable, deployable pipeline, and very importantly it is also highly scalable; the kind of huge loads that the likes of Comcast, T-Mobile and Zendesk have experienced, Vector was able to deliver. So Vector is a data pipeline that a lot of personas can interact with. One key thing: because it's from Datadog, we eat our own dog food. Datadog, being a leader in observability, the go-to observability platform for SREs today, likes its log management a lot, and so do our customers and the SRE community, so it's highly encouraged to use log management together with an open source observability pipeline built on Vector. Vector comes with the concept of transformation and enrichment of data sets, and very importantly it also comes with the concept of an aggregator. In the next few slides we'll see how the aggregator works: it allows multiple clients and agents, some from Datadog and some from other observability tools, to stream all that rich observability data into the transformation pipelines, so to speak, at stage two, and then keep a record in different kinds of sinks. It could be Splunk, it could be a data lake, it could be any data destination Vector understands; there is a long list of sources and a long list of sinks the Vector aggregator can connect. Here's an example of more possibilities, the topologies: you could be storing into AWS S3; I mentioned some of the community favourites, Kafka, Loki, Elasticsearch, all of which can be sinks after transformation of data that originated from the usual push sources like Logstash, Prometheus, StatsD or Syslog, converted, enriched, transformed, and possibly even condensed prior to storage in the sinks on the right. All in all: highly load balanced, highly available and scalable as well. The concept here is that the pipeline is for everybody, for every kind of use case.

Let's take a look at the demo. Here are examples of some pipelines that have been built. Whenever we visualize a pipeline, the different stages of data transformation and enrichment are really important to the SRE and to the other personas, remember the security team and the GRC team, so it makes sense to look at error rates: because a huge amount of data is being streamed at every single stage, are we having a transformation issue, is some of the data not being enriched appropriately? We can also see not just errors but throughput; let me rewind that again, there we go. You can see throughput and the number of events, which can also indicate the health of the different stages of transformation; we can look at the utilization of a particular stage of the pipeline; and very importantly, diagnostics in real time: whenever there is an issue with transforming the data sets, those diagnostic logs are pretty helpful. You also get a feel, using a gauge, of how much event flow is actually occurring.
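Vector itself is configured declaratively with sources, transforms and sinks, but the transform-and-enrich idea is easy to picture in ordinary code. The sketch below is not Vector's API; it is just an illustration of what an "enrich, then route to a sink" stage does to a stream of log events, and the field names and tags are made up.

```python
# Conceptual illustration of a transform/enrich stage in an observability
# pipeline (NOT Vector's API): parse, tag, then route each event to a sink.
import json

def enrich(event: dict) -> dict:
    # e.g. the security team tags events that came from a vulnerable app
    event.setdefault("tags", []).append("team:security-reviewed")
    event["service"] = event.get("service", "unknown")
    return event

def route(event: dict) -> str:
    # cheap routing decision: errors to the log-management sink, rest to archive
    return "log_management" if event.get("level") == "error" else "archive"

raw_lines = ['{"level": "error", "service": "checkout", "msg": "timeout"}',
             '{"level": "info", "msg": "ok"}']
for line in raw_lines:
    evt = enrich(json.loads(line))
    print(route(evt), evt)
```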
And very importantly, as you can see up at the top, there is the capability of transforming the data based on the different sources. I can see at the get-go that some of the sources are one of our favourite things, the Datadog agent logs, and we're transforming them based on individual key-value pairs; that's transformation, another use case. Eventually it all flows down into log management, and you can see that from the logs perspective. Like I mentioned, we really like log management, just like our customers: we can pass through any amount of enriched and transformed logs from the observability pipeline, Vector, take them, analyze them, and even drill into where a log actually came from, which components were involved, and which level of detail you actually want. One key thing is context: from a contextual standpoint you can locate exactly where in a log stream a particular log entry picked up from the observability pipeline sits, and dive in deeper using Datadog log management. With that, we're back to the presentation slides.

Now, that doesn't live alone; it's part of the observability platform that is the leader in the industry, a platform that gives you insights and observability driven by AI as well. There's the freshness, the accuracy, the durability and the coverage that this platform delivers, which makes it a whole lot easier for SREs to do their work. And because it's a unified platform, there are economies of scale: 16 technology pillars all on one single platform, ranging from mobile user analytics to browser-based app analytics, security analysis and security posture management, log management, and application tracing. All in all, smart tooling on one platform. Here's an example of what you can do with those 16 pillars in one place, with the power of Vector helping to observe huge data sets: root cause analysis at your fingertips. With a few mouse clicks we can dive in really quickly, using Datadog with Vector, to analyze data for each of the endpoint applications. Here we track the various signals again, latency, errors, saturation, picking up on issues every time one of those signals, like latency, gets too high. Here's another example: huge amounts of data streamed over the observability pipeline, with some of that data correlated to application traces, so you have the data in log format correlated with application traces and visualized across the span of your modern application, which could be a meshed network of microservices, all integrated. That makes it a whole lot easier to use one platform to investigate an outage: the moment you see in the application flame graph that a method call has failed, you dive into the logs that Vector has streamed to Datadog and analyze a whole lot more of what this modern application is actually facing. So Datadog is the unified observability platform, and did you know it's also a very, very big data pipeline itself? Going back to the theme of this presentation, let's look at what's underneath: from the agent to the ingestion buffer to the individual processing capabilities of Datadog, it's a pipeline, a huge pipeline, with a time series database storing all the metrics, traces
and logs, what we call the three pillars of observability. From that standpoint, it's easy to build a dashboard on top of that time series database. We support 22,000 customers using millions of hosts, collecting 10-second snapshots; sampling is supported, and 10 seconds seems to be a popular sample window, amounting to trillions of events per day. This is what the pipeline potentially looks like for a cloud administrator, for SREs who use the three-pillars approach of metrics, traces and logs to analyze outages of their applications, investigate system issues, or optimize performance. The CISO's department also has a role to play with such a platform: analyzing the security posture of their cloud native environment, monitoring the health of workloads and applications from a security standpoint, and visualizing it as a huge map of all your assets, your microservices, databases and so on, where you can pivot from one service to another really quickly, thanks to the concept of a unified observability platform. From the AI-driven perspective, that big data pipeline called Datadog also helps analyze critical failures and perform root cause analysis across huge amounts of data and high rates of data change. And it's DevOps-centric: we can profile the health of your application and dive into, say, the CPU time per minute at each level. If you have been doing DevOps all along, and DevOps is about continuous integration and innovation, rolling out incremental changes to the code, the challenge is that every time you roll out new features, can you accurately determine how well that particular module is performing? With the really big data pipeline that Datadog is built on, it's possible.

So in short: download Vector today, kick the tires, try out the different deployment topologies. This is what you get when you visit the official GitHub repository where the Vector source code is available: the components are listed, along with guides on how to deploy it for use case A, use case B, use case C and so on. Some advanced use cases worth mentioning quickly before we end: Kinesis Firehose log forwarding, where you're ingesting CloudWatch logs into Kinesis Firehose; can you make sense of those logs quickly? Vector can help. Can you merge multi-line logs, for example from nginx, or use Lua, one of the favourite languages in web infrastructure? The answer is yes. So this is where you'd want to start getting connected with Vector and its repository: github.com/vectordotdev/vector. There's also a blog, a quick start and documentation, all here on the slide. If you have any other questions, feel free to ask me now or drop me an email.

Thanks for that. I didn't really mention it, but great point: Vector originally came from a company called Timber.io, and Timber is now part of Datadog. In the spirit of this conference, Datadog continues to support the open source community; Datadog was an open source technology company at the very beginning, so we
have made multiple open source projects, and Vector continues to be one of them. I can't really speak for the business direction of the whole company, but we continue to serve our customers, it's something we do very well, I think our customers appreciate it, and Vector plays a huge part in our observability platform strategy. Thank you, thank you everyone.

Hey everyone, today I'm going to talk about DevTestOps, basically about how we do quality at GitLab, and a few things that could be useful for you as well. These are the things I'll be talking about: shifting left, which most of us will have heard of, and continuous testing. What exactly do we mean by continuous testing, how continuous can it be, and where could you start? A little bit about myself: I'm with the quality engineering team at GitLab, and I'm passionate about designing and developing test frameworks and tools; "simple and powerful" is something I totally believe in. I'm also with the Women Who Code Chennai chapter, Chennai being a city in India, where I'm one of the directors, and I'm really excited to see communities from other parts of the world as well.

All right, GitLab. I'm sure most of you have heard of GitLab: it's the DevOps platform, it's cloud agnostic, you can use it with your own cloud providers, and it can be self-managed or you can go with the SaaS solution. Security and compliance are built into the system, and it's open and always improving: GitLab works on an open core model, wherein the Community Edition is open source and the Enterprise Edition is open core, meaning all the code is available for you to read. These are the various features in GitLab; I won't go through all of them, I just wanted to give a snapshot of the capabilities GitLab offers at every stage of the SDLC. It wouldn't be complete if I didn't mention the contributors as well: we have around 3,660 open source contributors as of today, and around 2,000-plus team members across 65 countries. I've also mentioned a few open source partners here; you can read more about what it means to be an open source partner and the benefits you get. I thought it was relevant given this event is all about open source. There's also the Open Source Program, which has a few benefits: if your project is a qualifying project, you get to use the GitLab Ultimate features, among other things; you can Google it or contact the people mentioned here.

All right, getting into the topic for today, DevTestOps. As I mentioned, when we talk about DevOps we hear the word "continuous" at every stage: continuous development, continuous testing, continuous integration, continuous deployment, continuous monitoring. My focus here is continuous testing. As you see here,
there are multiple tools that can be used across the different stages to attain these things. So when we talk about continuous testing, how continuous can it be? For example, in this DevOps cycle there is a stage called "test": does testing come only after a particular stage, or does testing happen across different stages? That's something I really want to emphasize. Before we get into it, a little bit about shift left. Have you heard the term "shift left"? Basically it's about shifting testing to the left: failing early and failing fast, so that feedback is faster and you can improve earlier. So what does that mean, and where can I start testing? Can I start right from the plan stage, where requirements are being discussed? Can testing be done at the design phase itself, where the UI and UX designs are discussed? Where is the appropriate place to start? These are questions we'll answer as we go through the slides. When we shift left there are benefits attached to it as well: as I mentioned, failing fast, early detection of errors, and it's cost effective.

DevTestOps is all about ensuring quality early on and at as many points as possible. We test at different points in different ways; it's not exactly the same kind of testing at every stage, and it might not even relate to test cases per se. What I mean by testing is ensuring quality. The step zero, the bare minimum that needs to be agreed upon by the team, is that quality is everyone's responsibility; it no longer lies solely in the hands of the quality engineering team or the testing team or whatever it's called. This is a change in mindset, and without it I don't think we can achieve what we're trying to, whatever tool is being used. As mentioned here, quality is a continuous process, no longer a phase in the SDLC.

This slide talks a little about what we do at GitLab. The "three amigos" style of process is something most of us may be aware of: involving the various stakeholders during the initial phases, during requirements analysis and so on, where the product team, the quality team and the development team come together and discuss. And don't be afraid of what you see here; this is the tanuki, on which the GitLab logo is based. As you see, there are four different stakeholders involved in these initial discussions, and we call that the quad planning process. Those are a few things we do at GitLab, and as for tools, we use GitLab to build GitLab itself: GitLab issues, milestones, epics, and the features specific to planning are all used. Can testing be done at the design phase? I think yes, and that's done here as well: we have a feature called design management, wherein
the designers, the UI and UX designers, can add their designs to GitLab issues, and the other stakeholders can collaborate on the issues themselves; this is a way of getting the quality engineering team involved in the discussion itself. When we move on to the code, we build the code, and during code review there are certain features that help. Basically, we can ensure quality not just by means of testing and test cases alone, but through various other features that are part of your CI pipeline itself; that's what I'm trying to emphasize. Code coverage is a feature available in GitLab, and you can incorporate it into any other CI pipeline as well; there are gems and npm modules that take care of code coverage and can be used in CI pipelines to ensure quality. Static code analysis is another way of ensuring the quality of the code itself. The examples I mention throughout these slides are what's used at GitLab, the various features that enable quality at different stages, but if you're using another CI system, the same things can be achieved with the modules available there and incorporated into your pipelines. As you see here, this is a snapshot of the MR widget. MR is merge request; in GitHub it's called a pull request, so it's the equivalent feature. The widget can be configured based on what you need to see in the CI pipeline, and it can show the quality of the code and any degradation in that quality. Accessibility testing can also be configured in the CI pipeline and added to the MR widget. What I mean by this is that even before your code is merged into the next stage, maybe into the mainline branch, you are assuring the quality of the code itself, and I think that's the biggest advantage: you get to see all of these defects or failures early on. Security testing is also possible; you can add that to your CI pipeline as well, and again this is an example from the MR widget. All of these are configurable in your CI pipeline: if you're using GitLab, the .gitlab-ci.yml can be configured so that all of them are enabled and run on a merge request.

Apart from all of these features that help ensure quality, of course we do have end-to-end tests and other tests as well, and the testing pyramid is something we follow and try to adhere to. Unit tests are non-negotiable: no feature is complete, and it cannot be merged, without a unit test. End-to-end tests are more journey tests: they don't test every detail of a feature; they're more of a blanket that tests the user's journey from end to end.
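As a small illustration of the "unit tests are non-negotiable" point, the kind of check a CI pipeline runs on every merge request can be as simple as the sketch below. The function, the tests and the coverage flag are hypothetical examples (pytest, with the pytest-cov plugin for the coverage gate) rather than anything GitLab-specific.

```python
# discount.py -- hypothetical production code under test
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

# test_discount.py -- the non-negotiable unit tests that gate the merge request
import pytest

def test_apply_discount_happy_path():
    assert apply_discount(200.0, 25) == 150.0

def test_apply_discount_rejects_bad_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 120)

# In CI this might run as, for example:
#   pytest --cov=discount --cov-fail-under=90
```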
Apart from this, at the release stage, how can you ensure quality? We have tests that run in different test environments as well, and these are reported through Slack channels, a very useful thing you can add to the way your teams collaborate, because it helps with quicker access and better monitoring. Sanity tests are run on production environments, and we also run tests on a subset of users alone; the feature itself can be controlled, since we have feature flags that let us deploy to a percentage of users and increase that percentage over time, which also helps with quality. In terms of exploratory testing, there is a feature called review apps, which I didn't mention on the slide, but it helps you spin up a dynamic instance even before your code is merged: while the code is still at the merge request stage, during code review, you can spin up an instance running the current state of your code, so you can do some exploratory testing and see how your feature looks even before the code is merged to the main branches. That also helps the quality of the code.

To summarize, all I'm trying to say is that quality is everyone's responsibility; that's the mindset we need to have and work towards. Then what is the responsibility of quality engineering? Quality engineering is responsible for facilitating testing and quality at every stage. Testability is something all of us, right from the design and development phases, should be mindful of: build the product in such a way that it is easily testable. And quality should be baked into every stage of the pipeline itself. That's pretty much it. Any questions?

Yes, the issues are used throughout the life cycle: they're used for the initial discussion as well, and since GitLab is fully remote and most discussions happen async, having it all in the issues facilitates that. It doesn't end there; we have the other features mentioned, the milestones and epics. We have a monthly release, so milestones help us plan the upcoming releases, and there are features specific to releases too, where you can tag a release and pick the changes you want to ship; all of that happens based on these issues. The issues are used not just for discussion but until the end, until the code is actually deployed. When you look at an issue you see its complete state, since we use labels as well, so you can see the different stages it has gone through: is it in planning, in development, or has it been deployed, and so on. I have my colleague Albert here as well. That's something we constantly keep working on: in labels there is a specific type called scoped labels, which has helped us a lot. For example, you want to call something a bug, but there are different states of a bug: maybe it's a valid bug, or it's a bug we won't fix, so you can have scopes, a kind of label within a label. Those scoped labels have helped us, and label hygiene is something we constantly keep working on. Thank you.
Thank you. Hello everyone, today I will talk about managing workloads on multiple Kubernetes clusters using Flux CD and Cluster API. A little bit about me: my name is Chien, I work at Viettel Group, a giant telco in Vietnam. I have over five years of experience in the cloud area, and I'm currently working on developing cloud solutions for Viettel's public cloud services. This is the agenda for today.

In Viettel Group, our public cloud services are heavily based on open source projects such as OpenStack, Kubernetes and Prometheus; the Kubernetes-related products are built on top of Cluster API, with our infrastructure service acting as the infrastructure provider. In part one I will highlight some key points about Cluster API. By default, clusters created by Cluster API are only minimally functional: for instance, they do not have a container networking interface (CNI), which is required for pod-to-pod networking, or any storage classes, which are required for dynamic persistent volume provisioning. So you have to manually add these components to every cluster that is created. We can guide our users to install the components manually with some scripts or manifests, but that manual process can easily introduce issues, so we need a mechanism to apply, and keep reconciled, those sets of default resources after a cluster is created, so that clusters created by Cluster API are functional and ready for workloads from the beginning. In part two I will present how we automate the management of workloads on those clusters.

So, what is Cluster API? The Cluster API project brings declarative management of Kubernetes clusters: it provides custom resources and controllers that let you create, modify and delete clusters. For those who don't know about custom resources: custom resources are the way we can extend the Kubernetes API. If you are familiar with Kubernetes you may know Deployments and DaemonSets; those are built-in resources, and if you want to extend the Kubernetes API, you can develop your own custom resources for your own purposes. Why use Cluster API? Managing Kubernetes clusters can be a complex and time-consuming task; with Cluster API you can automate the process of building and managing clusters, making it faster and more reliable, and you can manage clusters consistently across different environments such as development, test, staging and production. So how does Cluster API work?
Cluster API works by defining custom resources that represent different aspects of a cluster, such as the control plane, worker nodes, networking and storage. These resources declare the desired state of a cluster, and a set of controllers reconciles the actual state of the cluster to the desired state expressed in the resources. Let's look at some key components of Cluster API. First, the core Cluster API controllers: the core set of controllers and custom resources that define Cluster API, such as Cluster, Machine and MachineDeployment. The bootstrap provider is responsible for producing the bootstrap resources used to bootstrap the control plane or the worker nodes. The infrastructure provider is responsible for creating the actual infrastructure the workload cluster will run on, such as virtual machines, load balancers and so on. And the control plane provider is responsible for bootstrapping the control plane.

So what are the benefits of using Cluster API? Here are a few. Automation: Cluster API allows you to automate the process of creating and managing Kubernetes clusters, reducing time and effort. Consistency: with Cluster API you can manage clusters consistently across different environments, reducing the risk of error and ensuring your clusters are set up correctly. The third is flexibility: with Cluster API you can use different bootstrap providers, control plane providers and infrastructure providers, which gives you more flexibility in how you set up clusters. You can run your clusters on any provider, like Google, Amazon or VMware, and you can even develop your own infrastructure provider to make your clusters run on your own infrastructure.
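To make the declarative idea concrete, here is a minimal sketch of requesting a workload cluster by creating a Cluster API Cluster object with the kubernetes Python client. The names, the referenced control-plane and infrastructure objects, and the exact API versions are placeholders that vary by provider; the referenced objects must also exist before the controllers can act on the Cluster.

```python
# Sketch: request a new workload cluster by creating a Cluster API "Cluster"
# custom resource. Names, API versions and the referenced control-plane and
# infrastructure objects are placeholders for illustration only.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "demo-cluster", "namespace": "default"},
    "spec": {
        "clusterNetwork": {"pods": {"cidrBlocks": ["192.168.0.0/16"]}},
        "controlPlaneRef": {      # reconciled by the control plane provider
            "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
            "kind": "KubeadmControlPlane",
            "name": "demo-control-plane",
        },
        "infrastructureRef": {    # reconciled by the infrastructure provider
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
            "kind": "OpenStackCluster",
            "name": "demo-infra",
        },
    },
}

api.create_namespaced_custom_object(
    group="cluster.x-k8s.io", version="v1beta1",
    namespace="default", plural="clusters", body=cluster,
)
```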
So how did we automatically install applications on remote clusters? Cluster API introduces the ClusterResourceSet, a resource responsible for applying sets of resources defined by the user to matching clusters. But we found that using ClusterResourceSet is not a very good solution: it supports applying plain YAML files, but it does not let you control the whole lifecycle of an application, such as releases, upgrades, rollbacks and so on. So we found that Flux CD was a better match to combine with Cluster API. Flux CD is a Kubernetes-native continuous delivery tool that automates the deployment of applications and infrastructure to Kubernetes clusters; it provides ways to manage the lifecycle of resources across multiple clusters. Within Flux CD there is the helm controller, a component that manages Helm releases on clusters; it provides a way to deploy and manage applications using Helm charts.

Let's look at some of the key Helm-related components in Flux CD. First, the HelmRepository: a HelmRepository is a resource that points at an actual Helm repository containing the Helm charts. When a HelmRepository resource is defined, it specifies a URL pointing to where the charts are hosted, and Flux CD uses this information to periodically sync the Helm repository and make the related Helm charts available for installation. Then we have the HelmRelease, another custom resource definition, in which you define the configuration for a Helm release; it lets you manage the deployment of a Helm chart to a cluster via Helm actions such as install, upgrade, uninstall, rollback or test. Putting Cluster API and Flux CD together, you get a system that can manage multiple clusters and the workloads running on them.

Let's look at the full workflow for deploying Helm charts with Flux CD and Cluster API. We have a central repository: developers create the Helm charts for their applications and push them to this repository. The operators, or a GitOps pipeline, create HelmRepository objects that define the URLs pointing to this Helm repository, and the source controller periodically syncs the chart information from the Helm repository into local storage in the management cluster. The operator then creates a Cluster object; the Cluster API controllers reconcile it, provision the infrastructure and bring up a cluster running on it. To install the needed applications, the operator creates a HelmRelease corresponding to each application; it contains a reference to the kubeconfig secret for the remote cluster, the secret generated by Cluster API, as well as the chart information, the folder of the Helm chart, and the Helm action, which may be install, uninstall, upgrade and so on. The helm controller reconciles this object, prepares the Helm chart and installs the application into the remote cluster.

Let's look at what's in the Cluster resource. I highlight some important fields: the cluster network, which defines the network configuration of the cluster; the control plane reference, which points to the object the control plane provider will reconcile; and the infrastructure reference, which points to the object the infrastructure provider will reconcile. With these objects, Cluster API brings up the cluster on the underlying infrastructure. And here is the HelmRelease: as you can see, it has the kubeConfig secret reference, which refers to the kubeconfig of the remote cluster where the Helm chart will be installed; you can see the chart name, the version, and the sourceRef pointing to the HelmRepository; you can see the target namespace and the storage namespace, which is the namespace the Helm chart will be installed into in the remote cluster; and in the HelmRepository you can see the URL that points to the actual Helm repository.
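Here is a matching sketch of the HelmRelease side, again through the kubernetes Python client. The chart, namespaces and secret name are placeholders and the exact fields depend on your Flux version, but the key point is that targeting the remote cluster is just a kubeConfig secret reference, the secret Cluster API generated for that cluster.

```python
# Sketch: a Flux HelmRelease that installs a chart into a *remote* cluster by
# referencing the kubeconfig secret that Cluster API generated for it.
# Chart, namespaces and secret names are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

helm_release = {
    "apiVersion": "helm.toolkit.fluxcd.io/v2beta1",
    "kind": "HelmRelease",
    "metadata": {"name": "demo-cni", "namespace": "default"},
    "spec": {
        "interval": "5m",
        "targetNamespace": "kube-system",        # where the chart lands remotely
        "kubeConfig": {                          # points at the workload cluster
            "secretRef": {"name": "demo-cluster-kubeconfig"}
        },
        "chart": {
            "spec": {
                "chart": "cilium",               # placeholder chart name
                "version": "1.x",
                "sourceRef": {"kind": "HelmRepository", "name": "demo-repo"},
            }
        },
    },
}

api.create_namespaced_custom_object(
    group="helm.toolkit.fluxcd.io", version="v2beta1",
    namespace="default", plural="helmreleases", body=helm_release,
)
```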
In conclusion, Cluster API is a powerful tool for managing clusters with a declarative approach. With Cluster API you can provision and manage clusters and their lifecycle through a small set of declarative resources, and by using Flux CD together with Cluster API you can effectively automate the process of managing applications across Kubernetes clusters. So thank you for your attention; if you have any further questions you can talk to me later.

People switch to other tools because they are, well, at least a bit less verbose, so I would be interested to see, I mean, especially with multi-cluster deployments I would be really scared: how do you deal with that? You mean how I prepare the Helm charts so they are easy to install? What do you mean exactly? Incomplete deployments, because the YAML file that was used for the deployment was just not what it should have been, right? Yeah, that can always happen. That can happen also with the files you showed us, right? So, now that we have seen Flux CD, is there any safeguard for this, any way to improve how deployments are created and validated? I will look into it and answer that question by mail. Okay, that is what I meant, and probably this would actually save a lot of wrong configuration from being pushed. So what I would suggest is to take a look at the Language Server Protocol that Microsoft and Red Hat together have released, so that you can provide all the support for your configuration files; people on different code editors can use those language servers and validate the configuration, so they can be sure the configuration is at least correct before having to push it into the cluster. Yeah, I agree with your comment on language servers, but language servers are only one part, because they don't guarantee completeness, right? Yeah, of course, that's just a part of the solution. Yes, I agree, it's one part. All right, good comment, thanks a lot. Thanks a lot. Any further questions or remarks? If not, the next speaker, on QE automation, please welcome. Thank you, thank you.

So thank you everyone for staying so late; it's the last talk and I promise I won't be taking a lot of time here. Thank you Ramya for covering most of the testing strategies, so I will just skip that part, thank you Ramya. My name is Rohit Vaz and I'm a quality engineering manager at Red Hat, mostly involved with the automation part of the CI for our test suites. This talk is about revamping the QE automation pipeline, that is, QE automation testing in Tekton pipelines, and it is mostly about my journey of how we integrated our integration testing from an end-to-end point of view. If you look at the QE CI process: the overall CI/CD process is about building, testing and deploying your software, but what I'm talking about here is just the CI process on the testing side. This CI process on the testing side starts with the trigger: identify at what stage your test suite gets triggered. Then, from what repository the code is checked out: it could be GitHub, it could be GitLab. You also need to lint your code; if you're using JavaScript you need JSLint, if Python then pylint, plus yamllint, depending on the language you are using for your integration testing. Then you have the provisioning of the environment where your SUT will run: what is your provisioning environment, whether it is a cloud or bare metal, and what kind of provisioner tool you're going to use for the integration testing; some might use Ansible as a provisioner, some might use Terraform, it totally depends on the set of tests your test engineers have built.
The SUT setup can be in AWS or GCP, it could be standalone bare metal machines using Vagrant or libvirt or something like that, or maybe it is running in Podman containers; that's the SUT part. The test execution consists of your Ansible tests, your Python tests, your JMeter tests for performance; your API tests could run in headless mode using Postman, or in Go using Ginkgo for the Go framework. The next part, and the most important one, is test logging, because that's the part your stakeholders keep an eye on; the analysis part is also very interesting, because what everyone cares about is what percentage of test cases is passing and what the product readiness is for your entire workflow. Last but not least there are the notifications: your notifications should be very productive, so that at any point in time where your test execution fails, or there is any deviation from your workflow, you know about it; you can integrate your test notifications with Google Chat or Slack, or even get PDF reports by Gmail. So this is the small area of the CI/CD workflow where the QE CI scenarios fit in.

There are so many CI tools available that it's pretty hard to identify which one suits your requirements. There is Jenkins, where you write your Jenkinsfiles, group the work into multiple stages and build up a pipeline of those steps. You have GitHub Actions, which is good to integrate with your Git repos, where you run action files for your integration tests. You have Argo Workflows, where you can run your execution pipeline on multiple clusters by integrating with them. And then you have Tekton, which is what I am targeting today, the Tekton pipeline. Tekton, just to give a brief introduction, is a set of Kubernetes resources which help you build your pipeline out of multiple tasks and stages. This is one of the use cases I was working on for my last project: an operator certification use case, where we need to certify different vendors' operators and also different OpenStack plugins, and we have tons of different test scenarios to cover. If you look at it, there are proper triggers; it could be triggering our test suite twice a day, or on every Git check-in. Then we have to check out the workspace, which is definitely part of the CI; provision and build test data on multiple environments like OpenStack; then configure and set up an OpenShift cluster, which could be using minikube, CodeReady Containers on a bare metal machine, or just a kind cluster; deploy and configure the operator side of it; trigger the GitHub Actions pipeline on the upstream side; and poll the event notifications from the GitHub comments to start and initiate our integration test suite. The integration test suites come in many different formats: UI tests in Cypress, backend tests in Python, API tests in Go, then Go catalog tests, tests which certify different aspects of the container images, and backend DB tests; all of this has to be integrated together, which requires a lot of configuration. So we have bundled all the tests into a container image, and we keep those container images in a registry.
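As an illustration of that pattern, and not the team's actual task, here is a minimal sketch of how such a containerized test suite could be wrapped as a Tekton Task; the image reference and the test command are placeholders:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: run-api-tests
spec:
  params:
    - name: test-image
      description: Container image that bundles the test suite
      default: quay.io/example/qe-tests:latest   # placeholder image reference
  steps:
    - name: run-tests
      # the step runs inside the bundled test image, so rebuilding and pushing
      # the image is enough to pick up new or changed tests
      image: $(params.test-image)
      script: |
        #!/usr/bin/env bash
        set -e
        pytest -v tests/api   # placeholder entrypoint for the bundled suite
```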
Keeping the images in a registry like that ensures we are always up to date with the latest images for our test suite, and the images are vulnerability scanned; we scan them with Clair, so it keeps giving us updates on the health of our images. The last part is the logging: we push the entire test results into our centralized logging tool, which is Report Portal. We use Report Portal for pushing all of our test artifacts, test reports, the UI test logs, everything; and after that we have the notification section, where we push all the notifications to our Google Chat, Slack and Gmail.

Just to give a light introduction to Tekton CI: Tekton is a set of Kubernetes resources which helps us build up our pipeline, and the pipeline has multiple parts. The smallest one is the step, where you execute your commands and your test scenarios by passing the arguments. Then we have a Task, which bundles the steps; you can imagine a Task as one stage of a pipeline. A Pipeline is also a Kubernetes custom resource, used to bundle and execute your Tasks, whether you run them sequentially or in parallel. A Pipeline runs via a PipelineRun custom resource, which can be triggered by a TriggerTemplate; the TriggerTemplate reacts to a trigger event, and you listen for those trigger events with the EventListener service from Tekton, which lets you listen for different triggers such as webhooks, or you can just use cron jobs directly. Why did we shift to Tekton? Because most of the workload we were running was container-based; since the tests are bundled into containers, it was easy to use the same format and just execute our containerized test scenarios, and the applications can be deployed on the same cluster, so that was easy for us. Then Tekton provides a good mechanism for executing via CLI and GUI: Tekton has a CLI client called tkn which you can use to run things from the command line, it's easier to integrate, and it also provides a GUI dashboard from which you can manually run the tests on demand. Another huge benefit of Tekton is its catalog, where you have predefined tasks which you can directly integrate into your pipeline; it has a huge community, and the best part is that its tasks are reusable, so you write your task once and reuse it in multiple pipeline scenarios. Also, Tekton Tasks and Pipelines are just YAML: you write a YAML defining what steps you need to execute and the images on which your task will run; each Task runs as a pod, and each pod executes a single instance of the process. So you define your task in a simple YAML format; it's easy for QE engineers to understand and easy to maintain, because any change on the test side is reflected in the image, so you just update your image. Then you combine those tasks in the Pipeline, which defines whether your tasks execute in parallel or sequentially and at what point, and you have a PipelineRun, which defines how your pipeline actually gets triggered; you can pass all the arguments as input and output resources.
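To make that structure concrete, and again as a sketch with placeholder task names rather than the real certification pipeline, a Pipeline that runs a provisioning Task first and then fans out into parallel test Tasks could look like this:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: qe-ci-pipeline
spec:
  params:
    - name: git-revision
      default: main                      # example parameter; a real pipeline would pass it to its tasks
  tasks:
    - name: provision-cluster            # runs first
      taskRef:
        name: provision-kind-cluster     # placeholder provisioning Task
    - name: ui-tests                     # the three test tasks run in parallel,
      runAfter: ["provision-cluster"]    # each starting once provisioning is done
      taskRef:
        name: run-ui-tests
    - name: api-tests
      runAfter: ["provision-cluster"]
      taskRef:
        name: run-api-tests              # the Task sketched earlier
    - name: db-tests
      runAfter: ["provision-cluster"]
      taskRef:
        name: run-db-tests
    - name: report
      runAfter: ["ui-tests", "api-tests", "db-tests"]
      taskRef:
        name: push-to-report-portal      # placeholder reporting Task
```

Ordering comes purely from runAfter: tasks that share the same dependency run in parallel, and the reporting task waits for all of them.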
You also have a TriggerTemplate and an EventListener: for example, if you want to react to Git events, you can create an event listener for a Git push hook or something like that, and you can also use it as a plain webhook; I think in GitHub Actions there is a way to dispatch a workflow with an HTTP request, and in the same way you can trigger your EventListener directly from any HTTP request. All right, this is a small video of the Tekton catalog, where you can directly search for your task. Let's say you want a task for pytest, for example: you go to the pytest task, download it, go to the CLI and install the task, and give a reference to that task in your pipeline; installing it makes the task available in your Kubernetes cluster, and once it's available you can use it as a reference. So I just downloaded the pytest task, it is in my Kubernetes cluster, and I can reference it and use it in my pipeline.

Okay, so this is one of the screenshots of my operator pipeline, where we have the environment setup, then a provisioning area where we provision multiple bare metal machines and deploy a Kubernetes cluster on them using CRC and kind; the GitHub Actions events which we trigger on the GitHub side, where we wait for the polling agent to get the information from GitHub; and then we execute multiple tests in parallel, the UI, backend, DB and API tests. Finally we submit the test run, and after that we get the notification of completion. Well, this is the demo of how you can execute a pipeline from the CLI as well as from the UI. It is a simple example of running your CI pipeline using the Tekton CLI: you can override your default parameters, or, if you want to keep the existing defaults, use the option that falls back on the default parameters. Once you run it from there it gets executed, and you can follow the logs in the terminal as well as on the OpenShift side under the task logs. So that was an example of how we trigger the tests from the CLI; you can also trigger tests directly from the UI side, depending on the demand and frequency of your test executions; you go there, override the variables and execute the test.

All right, for test logging and notification we are using Report Portal, where we push all of our test results into launches; it supports the launches format where you have all the test results from multiple sources, and you can also configure your customized dashboards, based on which you can understand the failure analysis for your entire test suite and for that program. The notification part consists of custom messages that we send on the Google Chat side, which carry information from the Tekton logs as well as from the test results; we customize our notification format. Well, this is an example of what the Report Portal analysis looks like: you have your test results over here and you can see the published test logs; it also has a capability of predicting the failure, which saves us a lot of time in identifying what the reason for a failure was, say a system-related issue, a product-related issue or an automation bug; we can identify that from this test log analysis, and it automatically tags the results with these kinds of defect types.
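Circling back to the CLI demo for a moment: what the tkn client starts on your behalf is a PipelineRun object, so the same run can also be expressed declaratively. A minimal sketch with placeholder names, assuming the pipeline sketched earlier:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: qe-ci-pipeline-run-      # a fresh run object per execution
spec:
  pipelineRef:
    name: qe-ci-pipeline
  params:
    - name: git-revision
      value: feature-branch              # override a parameter here, or omit the
                                         # entry to fall back to the Pipeline default
```

A TriggerTemplate wraps exactly this kind of PipelineRun spec, which is how an EventListener turns an incoming webhook into a new run.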
So rather than investing time on that manually... okay, all right, I'll just hurry up. You can see the dashboard over here, which has the predictions for the test run, and you can also see the Google Chat notifications which we have customized: they carry information from the Tekton logs, and from there you can navigate directly to the Tekton dashboard as well as to the Report Portal log instance. All right, I won't be taking much more of your time; these are the resources and references you can refer to for Tekton and OpenShift Pipelines. We have our operator-pipelines project over here on GitHub, from which you can take a reference, and there are also different tutorials on Jenkins and other CI tools like GitHub Actions that you can refer to. All right, thank you everyone for being so patient; I thought I covered the talk on time. Let's thank the speaker. So, you are completely on time, I'm not sure what happened here, so don't worry about that. Okay, well, then thanks everyone for joining; let's give a hand. Thank you everyone, our program is finished.