Welcome to my session. It's great to have you join me for this talk about defining metrics for CI/CD optimization. My name is Xu Chunhuan, and I'm the technical director of 99Cloud. In this session, I will share with you how to define metrics for your CI/CD optimization. Nowadays, everyone is moving to agile to stay competitive, and agile development has become a very trendy buzzword, as you know. The Agile Manifesto gives us four key values to streamline the software development process. The first is individuals and interactions over processes and tools. The second is working software over comprehensive documentation. The third is customer collaboration over contract negotiation. And the fourth is responding to change over following a plan. Based on these four values, developers came up with twelve principles of agile software development, one of which says that working software is the primary measure of progress. Because agile development is very different from traditional waterfall development, and emphasizes interaction and communication much more, it is not easy to implement agile models, and even harder to define metrics to optimize them. But we need to define those metrics anyway, because, as Peter Drucker said, if you cannot measure it, you cannot improve it. Maybe you have had this experience if you have ever tried to lose weight. I wanted to keep fit and lose weight, and it was hard for me, because if I couldn't keep an eye on my weight, I couldn't stay motivated and keep exercising. So I had an idea: I bought an app that tracks my weight and my exercise every day, and it gives me metrics about the exercise so I can see how it is improving my body. Finally, I can get in good shape.
I think improving CI/CD by defining metrics is similar. Traditionally, when people adopt agile development, they do not measure it, and there are several reasons. The first is that you may use a measurement by mistake, because you don't really understand your metrics. The second is that some metrics are simply wrong, and not useful, so you cannot follow them to optimize your process. And the third is that implementing some metrics costs a lot, so you cannot spend the resources to implement or collect those measures to help improve your process. So let's see why we need metrics. In general, there are three reasons. The first is decisions: metrics can help us make decisions. For example, we can make business decisions to deliver a better product to the customer, and get feedback from the customer, by using metrics. The second is improvement: we can improve the quality of the product and the velocity of the team. And the third is prediction: if we have a lot of metrics about the CI/CD process, we can predict a lot. For example, we can predict the velocity of the team in the next iteration cycle, or predict customer behavior based on the metrics. At the same time, we can categorize agile metrics by the phases of a development cycle. First of all, metrics can help us in iteration planning: they can help us clarify the priority of features, or estimate the size of features based on historical data. Second, they can help us with iteration tracking, as I just mentioned. And third, they can help us motivate and improve the team and its members.
For example, we can visualize the build status on a dashboard, which helps the team understand the current status of the build system and the build time, and then find ways to improve it. In the following slides, I will use an example from my company to show how, starting from one basic metric, build time, we derived several other metrics, and finally improved the build time and the velocity of the development cycle. And last but not least, metrics can help us identify process problems and assess quality both pre-release and post-release: we can get pre-release data to understand the development quality of the product, and post-release data to understand customer feedback and close the feedback loop. We have been talking a lot about metrics, so what exactly is a metric? I searched around the web, and I think I found a proper definition that tells a lot: a metric is any collectable, quantifiable measure that enables one to track the performance of an aspect of a system over time. Let me highlight several keywords in this sentence. First, a collectable and quantifiable measure. Some things you want to measure may not be easy to measure. For example, the mood of your team members is very hard to measure, unless you use a survey or some other statistical method to collect data from them. So you have to find a way, and sometimes spend a lot of resources, to collect the metrics you want. The second keyword is tracking performance. We can collect a lot of data from the whole development cycle, but we don't want all of it, only the data we care about. For example, if we just want to measure some aspect of performance, we have to collect that kind of data, not all the data.
Third, one metric may cover just one aspect of the whole process, which means you need different metrics to reflect different aspects; no single piece of data or single metric can tell you everything. And the last keyword is over time. You have to collect enough data over time so that you can see the trend, find the problems, and finally reach a solution. A lot of people ask me if there is one metric that can describe the whole process, and I think the answer is no. Actually, there has been a lot of research on agile development, helping us understand its different aspects, and there are some example metrics here that can give you intuition about your problem. For example, team happiness belongs to the people-and-team category, the human element; as we know, humans are always a very important factor in agile development. The second category is process health metrics, like the cycle time and lead time of a story or a task. The third is release metrics, like the time cost per release. The fourth is product development metrics, like the burndown chart. Today, we will focus on build time, which is a technical metric. I will share a story from my company about how we optimized the build time of our system. Before that, I will introduce two approaches for developing metrics of your own. Actually, when we want to resolve a problem, there are two approaches: the top-down approach and the bottom-up approach.
I personally recommend the top-down approach when developing your own metrics, because it places goal setting as the first step and forces you to find the metrics related to the problem, which is critical when developing metrics for a specific problem. If you don't have a specific goal, then you can try the bottom-up approach: list all the metrics you can get, look at the data, and try to find some trend or pattern. You may get some intuition about the data, and then some ideas about how to optimize your whole process. Next, I will introduce the CI/CD infrastructure of 99Cloud. At 99Cloud, we have developed our own OpenStack distribution, and we leverage Zuul as the CI system inside our CI infrastructure. Besides Zuul, we also use Redmine to track projects and issues, GitLab to host the Git code, Gerrit for review, and a Jenkins server as the integration server, along with local mirror servers. And finally, we have our own metrics server. The metrics server talks to Redmine for the issues and projects, to Git for the Git metrics, to Zuul for the pipeline metrics, and to Jenkins for the build information. Right now, we use Zuul v2; we have not upgraded to Zuul v3 yet. As we all know, in Zuul we can define different pipelines: the check pipeline, the gate pipeline when a change is going to be merged into the repo, the post pipeline which runs after the change is merged, and the pre-release and release pipelines. Now I will show you the data from the build system. We collect the build time from the system, and here is the build time trend. It's a very nice curve, right? Actually, we launched a new project in August, and the build time increased dramatically, which troubled us a lot.
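As a rough illustration of how a build-time trend chart like this could be produced, here is a minimal Python sketch that aggregates build durations into a per-day average. The record format and sample numbers are assumptions for illustration, not the actual 99Cloud metrics server code.

```python
from collections import defaultdict
from datetime import date

def daily_average_build_time(builds):
    """Aggregate (day, duration_seconds) build records into a
    per-day average, suitable for plotting a build-time trend."""
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum, count]
    for day, duration in builds:
        totals[day][0] += duration
        totals[day][1] += 1
    return {day: s / n for day, (s, n) in sorted(totals.items())}

# Hypothetical sample: two builds on Aug 1, one on Aug 2.
builds = [
    (date(2018, 8, 1), 3600),
    (date(2018, 8, 1), 4200),
    (date(2018, 8, 2), 5400),
]
trend = daily_average_build_time(builds)
```

The same aggregation works no matter where the raw durations come from, whether Jenkins build records or any other CI log source.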
In the following slides, I will show how we found ways to define metrics to understand the key causes of this increase, and to keep monitoring the trend. We finally resolved the issue and got back to a nice curve in late September, even though there was a small increase again in November. So, looking at this increase, the first question that came up was: are all the jobs distributed equally across the different build servers? In our infrastructure, we actually have two build servers, and jobs are dispatched to and run on both of them. To answer this question, we defined a new metric called the job balance ratio, to find out whether unbalanced job scheduling was making some jobs run very slowly and causing the build time to increase. We collected this data from the build servers and calculated the ratio, as you can see in this diagram. Actually, the jobs were distributed equally across the two servers, so the job balance ratio told us the problem was not caused by unbalanced scheduling. So what was the next hypothesis? We wanted to see whether it was the performance of the build servers, and especially a network issue, because when we build, we have to build a lot of Docker images, and the images have to pull from repos outside, which we cache in a local cache server before building. During our investigation, we found a lot of retries in the logs. We are using Kolla to build the Docker images, and in the Kolla logs you can see it fails a lot, with many retries.
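The exact formula behind the job balance ratio isn't spelled out in the talk, but a plausible formulation for two build servers is each server's share of the total scheduled jobs. Here is a hedged Python sketch under that assumption; the server names are hypothetical.

```python
from collections import Counter

def job_balance_ratio(job_assignments):
    """Given a list of server names (one entry per scheduled job),
    return each server's share of the total jobs. A perfectly
    balanced two-server setup yields 0.5 for each server."""
    counts = Counter(job_assignments)
    total = sum(counts.values())
    return {server: n / total for server, n in counts.items()}

# Hypothetical day of scheduling across two build servers.
ratios = job_balance_ratio(["build1", "build2", "build1", "build2"])
```

A ratio far from 0.5 on either server would point to a scheduling imbalance; here, as in the talk, the distribution is even, ruling that hypothesis out.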
Based on our historical data, the deploy job normally retries about 8 times per day, and the image build job about 75 times, but in the data from August we saw far more retries than that. So the actions we had to take were to optimize the network connection and to resolve the I/O issues. Optimization number one was very simple. First, we upgraded the build servers with more CPU and RAM. Second, we moved to SSD storage for better I/O performance: we moved the two build servers from the SATA pool to the SSD pool. And third, we switched to another VPN for a better network connection, because in China the image build has to pull resources from outside, and sometimes those are blocked, so we had to change the VPN. After this optimization, you can see the trend dropped: we cut about 25% of the build time. But we were still thinking about how to get back to the normal numbers we had before, so we dug into the details of the building process. Before, the build script was not modularized at all, and every step was mixed together. We held engineering discussions, cleaned up the steps of the building process, and finally understood that the Docker image build step has a curve very similar to the overall build time: this step is the critical path of the whole building process. So we looked for ways to optimize this script. How can we optimize the Docker image build? Normally, we upload a patch, the patch gets merged, and when the daily build runs, it rebuilds all the images. As I remember, there are over 200 images, which means every daily build had to rebuild over 200 images. That takes a lot of time.
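To track retry counts like the ones above as a metric, one could count retry lines per job in the build logs. The log format below is hypothetical, made up for illustration; the real Kolla output looks different, so the regular expression would need adjusting.

```python
import re

def count_retries(log_lines):
    """Count per-job retry occurrences in a stream of log lines.
    Assumes a hypothetical 'job-name: message' log format where
    retry messages contain the word 'retry'."""
    pattern = re.compile(r"^(?P<job>[\w-]+): .*retry", re.IGNORECASE)
    counts = {}
    for line in log_lines:
        m = pattern.match(line)
        if m:
            job = m.group("job")
            counts[job] = counts.get(job, 0) + 1
    return counts

# Hypothetical log excerpt from one day of builds.
log = [
    "image-build: fetch layer failed, retrying (1/5)",
    "image-build: fetch layer failed, retrying (2/5)",
    "deploy: connection reset, retrying (1/3)",
    "image-build: build succeeded",
]
retries = count_retries(log)
```

Comparing these daily counts against the historical baselines (8 for deploy, 75 for image build) is what would flag an anomaly like the August spike.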
So we decided to define a new metric to help us understand whether we could skip some images and avoid rebuilding them when they had not changed. We defined the image reuse rate for this process. What is the image reuse rate? It means that during a daily build, we do not rebuild certain images; we keep reusing them. When we are developing some features, some images are not affected, so we do not need to rebuild those images, and we can keep using the last version, or last tag, of the images. From this metric, you can see that in early September the reuse rate was zero, because the old script simply rebuilt all the images. From September 17, we refactored the script, and the reuse rate increased to 80%. But going into the next month, the reuse rate was not so stable, because it depends on feature development: when a feature touches many projects, we have to rebuild many images. After we refactored the script, we got a much better result; the optimization effect is around 70%. But even this 70% is not the end, because you can see that in the following month the build time increased again. That means we should find yet another way to optimize the script, or look at another aspect, and define more metrics to help us understand the key issues of the build system. We are still working on that this month. So, in summary, we think a metric should be used for a purpose. As I showed in my story, we use the reuse rate to measure how many Docker images we are reusing in a daily build. But no single metric resolves a whole problem, so we have to define different metrics from different aspects to describe the problem, and finally we can reach a resolution.
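A minimal sketch of the reuse decision and the resulting reuse rate might look like the following. The change-detection rule (compare each image's source fingerprint against the previous build) is an assumption about what the refactored script could do, not its actual logic, and the image names and fingerprint values are hypothetical.

```python
def plan_daily_build(current_fingerprints, previous_fingerprints):
    """Decide which images to rebuild versus reuse, and compute the
    image reuse rate = reused images / total images. A fingerprint
    could be, say, a hash of an image's Dockerfile and packages."""
    rebuild, reuse = [], []
    for image, fp in current_fingerprints.items():
        if previous_fingerprints.get(image) == fp:
            reuse.append(image)    # unchanged: keep the last tag
        else:
            rebuild.append(image)  # changed or new: rebuild
    rate = len(reuse) / len(current_fingerprints)
    return rebuild, reuse, rate

# Hypothetical fingerprints: only 'nova' changed since yesterday.
prev = {"nova": "a1", "neutron": "b2", "keystone": "c3", "glance": "d4"}
curr = {"nova": "a9", "neutron": "b2", "keystone": "c3", "glance": "d4"}
rebuild, reuse, rate = plan_daily_build(curr, prev)
```

This also shows why the reuse rate fluctuates with feature development: the more images a feature's changes touch, the more fingerprints differ, and the lower the rate for that day's build.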
The second point is that we have to understand the relationships between different metrics. Metrics may depend on one another: for example, the build time depends on the image reuse rate and the job balance ratio. We need to find a group of metrics to describe a problem, and also a group of metrics to describe your solution and help you evaluate it. The third point is that we need to drill down and define new metrics. To drill down, you have to understand your problem very clearly. In the example in these slides, you have to understand the build time, how the daily build runs, and what its steps are; then you can drill the build time down into different metrics and evaluate each of them. The fourth point is to make assumptions and implement them: make assumptions, try the adjustment, see the results, and then close your feedback loop. Last of all, I would encourage you to look at your own CI/CD process, starting today. Maybe you do not yet have a goal for optimizing your process; then look at the data in your hands, and maybe you will get some intuition. If you have a goal, that is even better: try to understand the problem, drill down into the data, and you will find a solution. OK, that's all. Thank you. Any questions?

Q: After you broke down the build time, starting from the Docker image build, which step takes the most time in the build?
A: After I broke down the steps, the most time-consuming task is the image building process.
Q: Can you be more specific?
A: Actually, we have to download a lot of images from outside.
Q: So the download process?
A: The download of the image layers.
Q: So you changed the VPN service to speed up that process?
A: Yes, that is one thing. The other is that we try to reuse the images we can reuse.
Q: With a local cache?
A: Yes, but it's not only the local cache. On some days an image has not changed, so we can use the older image.
Q: So do you have local repos?
A: Yes, we have local repos, but when we merge new features, the local repos have to be updated, right? So some images will be updated and some will not. But the default process updated everything and rebuilt all the images from scratch, and that cost a lot.
Q: Thank you.
OK, thank you.