So good evening, Barcelona. It's only day one, and Mark hit a lot of good notes in the keynote; I hope it sets the tone for this last session of the day. We'll talk about the evaluation of OpenStack from the mission-critical viewpoint. That is the topic of the talk, and after that you can enjoy the party.

To start with, let me introduce myself. I'm an open source software leader at NEC; my name is Deepak Kumar Gupta. My Twitter handle is Digmar Gupta, and you can locate me there. I'm from India, and in India my location is Noida. A brief introduction about me: I have about 16-plus years of experience in Linux and open source software. I started my career as a Linux kernel developer in 2000; at that time Linux was at about version 2.2 and 2.4. I worked largely on the Linux kernel crash dump, memory management, and file systems. Later on I worked with Xen, KVM, and other virtualization technologies, and on storage products, specializing in file, block, and object storage. Now I work across open source software; presently I'm heading the NEC open source software technology center in India. And my first affection for OpenStack started way back with the Folsom release in 2012. So that's my introduction.

So let's define a mission-critical system. What is a mission-critical system? A mission-critical system is a system that always works when it's supposed to, and a system that never fails. That's the definition of a mission-critical system. And then there's the reality, right? That's where we come in, and they say failure is not an option; it comes bundled with the software. And that's the pitch of my talk: when we talk about mission-critical systems, how important it is to analyze the system from all viewpoints.

To set the context, let me again define mission-critical systems through their attributes: availability, scalability, performance, manageability, interoperability, and security. Availability says a system should be available all the time. Scalability says it should be able to scale up without any limit. Performance is another key attribute: the performance of the system should not go down as it scales up. Manageability is a pretty critical attribute, and those who are managing OpenStack can appreciate it. Managing a large, mission-critical system is the day-in, day-out job of many of us, so it has to be manageable. Interoperability: a system does not stand alone, it has to interact with a lot of other systems, and that's where interoperability comes into the picture; a system should interoperate seamlessly with other systems. And of course security: the system must be secure enough to be used in any mission-critical application.

Let's set the background with the famous chart, the Innovation Adoption Life Cycle. The dot I've placed here represents OpenStack; it's an OpenStack maturity mapping. In my opinion it's ready for the early majority, and for those who are very dedicated, let me clarify that this is my own opinion. That's how we look at OpenStack at this stage, and as we know, adoption is increasing; we saw the numbers this morning. My presentation talks about a methodology to evaluate OpenStack from the mission-critical viewpoint.
And the process we talk about, this methodology, is generally useful for evaluating any open source software before adopting it into a mission-critical system or scenario. This is the evaluation process, and it's pretty simple, as you can see.

So, the evaluation process overview. The first step is analyzing the requirements: what is the mission-critical system supposed to do? Then we do modeling, modeling of the entire system, and then we do profiling, by which we mean runtime profiling.

In the requirement analysis section I am focusing on OpenStack, so we identify the OpenStack components. As we know, there are a lot of components coming up in OpenStack, and not every system uses every component, so selecting the appropriate components for your use case is the very first step. Once you identify the components, for example Cinder, Nova, Swift, whatever you want to have, then you need to identify the use cases that are critical for your system.

In the modeling section, we talk about the composition of the product. It's pretty important to understand what kinds of files and folders the system has: whenever you install OpenStack on a box, how many files does it produce? And then understanding the process structure: when you run any component of OpenStack, how many processes and threads are created in the system? This information is pretty critical, because the moment we go for mission-critical systems, understanding the system behavior before and after the software is installed is very important.

Then we go for profiling. That's the last step, where we integrate a profiler with the system and try to capture the profile data. This profile data is useful for identifying the bottlenecks, the mission-critical points we talked about. In profiling we have two kinds: single profiling and multi-profiling. Single profiling means running a single use case; for example, creating a volume is a simple use case. We then analyze the captured data to identify the points that are critical, or that become a bottleneck, from the mission-critical viewpoint. Multi-profiling means running multiple scenarios together; for example, while creating a volume you can kill a process. That qualifies as multi-profiling. Again, we analyze that profile data and identify the bottlenecks.

So let's focus on the requirement analysis phase. The requirement analysis phase consists of component identification. In this talk we'll be discussing one reference system that we created out of OpenStack and analyzed. We chose the following components as primary components, because they are very critical: Neutron, Nova, Glance, Keystone, Heat, Cinder, Swift, Ceilometer, and Ironic. The secondary components, which are optional, were Trove, Horizon, Sahara, Manila, Zaqar, Designate, and Barbican. Those are the secondary components we had in our system.

After that, we created the use cases. Typical examples of use cases are create volume, create volume from image, or list volumes; these are some of the use cases applicable to Cinder. Similarly, for Ironic you can create a chassis, create a node, or create a port; those are the use cases defined for Ironic.
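To make the modeling step a bit more concrete: one simple way to compare the runtime behavior of a node before and after an OpenStack service is started is to snapshot the processes and their thread counts and diff the two snapshots. The following is only a minimal sketch of that idea using the psutil library; it is an illustration, not the exact tooling used in the evaluation, and the service name in the prompt is just an example.

```python
# Sketch: snapshot processes/threads on a node before and after starting an
# OpenStack service, and report what changed. Illustrative only.
import psutil


def snapshot():
    """Return {pid: (name, cmdline, thread_count)} for all visible processes."""
    procs = {}
    for p in psutil.process_iter(attrs=["pid", "name", "cmdline", "num_threads"]):
        info = p.info
        procs[info["pid"]] = (info["name"],
                              " ".join(info["cmdline"] or []),
                              info["num_threads"])
    return procs


def diff(before, after):
    """List processes that appeared, disappeared, or changed thread count."""
    added = {pid: after[pid] for pid in after if pid not in before}
    removed = {pid: before[pid] for pid in before if pid not in after}
    changed = {pid: (before[pid][2], after[pid][2])
               for pid in after
               if pid in before and before[pid][2] != after[pid][2]}
    return added, removed, changed


if __name__ == "__main__":
    before = snapshot()
    input("Start the service now (e.g. systemctl start openstack-nova-api), then press Enter...")
    after = snapshot()
    added, removed, changed = diff(before, after)
    print("New processes:")
    for pid, (name, cmd, threads) in added.items():
        print(f"  pid={pid} name={name} threads={threads} cmd={cmd}")
    print("Processes with changed thread counts:", changed)
    print("Processes that disappeared:", list(removed))
```

Running something like this once per component and per release gives the before/after picture that the modeling report is built from.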
Designing the use cases is one of the critical steps, and we designed them in such a way that they cover most of the operations of the targeted system. Next are the abnormal use cases. Abnormal use cases are, again, defined well in advance; these cases are not normal behavior of the system. For example, killing a process while creating a volume, or stopping a service while some operation is going on. These are the abnormal use cases. We run those use cases and try to analyze the logs and data.

In the evaluation process we record components and use case counts. We are talking about three releases here, for a given system; we were evaluating the same system across three consecutive releases. The use case count is the number of applicable use cases. For example, in our system, in the Kilo release we had nine use cases for Heat, and when it moved to Liberty we had twelve applicable use cases. Similarly, in the Mitaka release the use case count increased from 10 to 17. Again, as I mentioned, this use case count is based on the target system requirements.

Next is the modeling phase. For modeling, it is important to understand the architecture we had for the reference system. As you can see, we have a few nodes, running on virtual machines, and we primarily used RHEL 7.2 as the base OS on which all the analysis was done. We had two compute nodes, one for Nova and one for Ironic, and then network, Cinder, and Swift nodes. The OpenStack components are shown in the pink boxes, and there are some additional components that were running on the system, for example Apache, MySQL, and RabbitMQ; those are the secondary components we used in the analysis.

In modeling, once we install the components on the nodes, we list the installed folder structure and identify each file and folder and its purpose: for example, when you install Swift or Nova, how many files does it create, and at which locations? Then we analyze the runtime behavior of the system: when we start the service of any component, we check how many processes and threads are created, and then we identify the differences from the previous release for the same component. For example, Swift in the Kilo release may have certain behavior or certain processes that change when you run the same Swift in the Mitaka release, so we analyze the differences, such as how many new processes or threads it creates. The purpose of modeling is to verify each and every system-level change due to the presence of an OpenStack component on that node.

The next, very important step is profiling. Profiling means integrating a system profiler onto the node where the OpenStack component is running. In our example we had a typical five-node system, and we installed a profiler like LTTng; any similar profiling tool can be used to capture the data. The tool should be able to capture the profiling data in a non-intrusive way: the profiler should run independently of the OpenStack components and still be able to capture the data. That's the basic need. And we need to keep the sampling frequency at the maximum, so that minute-level details can be captured, down to the system-call level. That is really important.
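As an illustration of what that integration can look like, here is a minimal sketch of wrapping a single use case in an LTTng kernel tracing session. It assumes the lttng-tools package and a kernel tracer are installed, that the script runs with sufficient privileges, and that the openstack CLI is configured; the session name, output directory, and example use case are all just placeholders, not the exact configuration used in the evaluation.

```python
# Sketch: drive an LTTng kernel tracing session around a single OpenStack use case.
import subprocess


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def capture_use_case(session="openstack-eval",
                     trace_dir="/var/tmp/lttng-openstack",
                     use_case=("openstack", "volume", "list")):
    # 1. Create a tracing session that writes to a known location.
    run(["lttng", "create", session, "--output", trace_dir])
    try:
        # 2. Enable all kernel system-call events so minute-level details are captured.
        run(["lttng", "enable-event", "--kernel", "--syscall", "--all"])
        # 3. Start tracing, run the use case under test, then stop tracing.
        run(["lttng", "start"])
        run(list(use_case))          # e.g. `openstack volume list`
        run(["lttng", "stop"])
    finally:
        # 4. Tear the session down; the trace data stays in trace_dir for analysis.
        run(["lttng", "destroy", session])


if __name__ == "__main__":
    capture_use_case()
```

Enabling all system calls keeps the capture at the maximum level of detail; the cost is that the resulting trace gets very large, which is why the analysis step ends up being the heavy part.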
So how do we capture the profiling data? The process is pretty simple. We start the profiler service on the node, for example with lttng start, and then we run the use case, for example openstack volume delete with the volume name. Then we stop the profiler service on the node, dump the profiling data at an appropriate location, and analyze it using an appropriate toolkit. Please remember that since we do a lot of sampling, the size of the profiling data can be huge.

Now the analysis starts. We already had a system that was running, we already had the modeling report, and we already knew the runtime behavior of the system. Now we create the process flow, or sequence diagram, for the use case. When we say process flow or sequence diagram, we mean a sequence diagram with detailed millisecond- or microsecond-level accuracy for each and every operation or function call. That is the in-depth analysis we do. Then we check for mission-critical observations. A typical mission-critical observation is, for example: an API is accessing the database, a single call takes, let's say, a few seconds, and if I increase the number of calls the database access time keeps increasing, which means there is a bottleneck. There are certain thumb rules we set up while analyzing this kind of system behavior.

Then we verify our observations. For example, as in the single-access versus repeated-access case where we observed the database performance degrading, we go for multi-profiling, or for more detailed use case scenario execution. Once we identify the problem, we look for a solution. Typically we look for the solution by checking the modeling reports, or by checking whether some configuration parameters can help us construct a solution. Once you have the solution, you need to verify it by re-running those use cases and analyzing those bottlenecks at the source code level. If, of course, no solution is available, then we list the observation as unsolved, and maybe raise a bug with the community for it.

This is a sample output; don't try to read it, it's quite complicated and I could not capture a better screenshot. What this diagram shows is that when you go for profiling you have a huge volume of data to be analyzed, and analyzing this data, with certain scripts or maybe manually, takes a lot of time.

And this is the kind of evaluation report we generate out of the system. For example, if you look at the component name Nova, we have a report of performance, scalability, availability, and interoperability observations; the count here is the number of mission-critical observations. This is the report we produced for the Kilo release, so in Nova we got two performance bottlenecks while we used the Nova component in our mission-critical scenario. Similarly, if you look at Glance, it was almost clear; it only had three or four interoperability-related observations. And similarly for the other components in that release. Now, looking at this chart it seems that a lot of changes are required, but it's not necessary to fix all of them. This report is useful for me to take a conscious call on whether I can go ahead with the system with these kinds of limitations in hand, because some of the bottlenecks we identified here can be solved just by changing configuration parameters.
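The thumb rule about access time growing with the number of calls can be checked with a very small script: run the same operation at increasing repetition counts and see whether the time per call stays roughly flat. Below is a minimal sketch of that check, assuming the openstack CLI is available and using openstack user list purely as an example operation; the 1.5x threshold is an arbitrary placeholder, not a rule from the talk.

```python
# Sketch: flag operations whose per-call latency grows with repetition count,
# one of the thumb rules used to spot mission-critical bottlenecks.
import subprocess
import time


def time_operation(cmd, repetitions):
    """Run `cmd` `repetitions` times and return the mean wall-clock time per call."""
    start = time.perf_counter()
    for _ in range(repetitions):
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return (time.perf_counter() - start) / repetitions


if __name__ == "__main__":
    cmd = ["openstack", "user", "list"]   # example operation; substitute the API under test
    baseline = None
    for n in (1, 5, 10, 20):
        per_call = time_operation(cmd, n)
        baseline = baseline or per_call
        ratio = per_call / baseline
        flag = "  <-- possible bottleneck" if ratio > 1.5 else ""
        print(f"{n:3d} calls: {per_call:.3f}s per call (x{ratio:.2f} vs baseline){flag}")
```

The point is only to make the growth visible; anything flagged here would then be investigated further with multi-profiling and the trace data.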
As I said, sometimes a small change in the configuration and a restart of the service, and things are solved, right? So there's no need to fix everything. And the motive is to not carry patches on top of the community version of the stack in a system; let's push the changes to the community so that the source code can be maintained at the community level. The primary motive of this kind of evaluation is to solve most of the problems without changing the source code, so we try different configuration options first. Only if a bottleneck is quite critical and cannot be solved that way do we go to the next release and look for the fix there.

So let me focus on an example of problems, or bugs, we identified using this methodology. This report basically shows the evaluation of Keystone, and we observed a certain problem in Keystone using our methodology. The problem, in the Liberty release, was that the database schema is not scalable, and it may lead to performance issues as the user count increases. How did this observation come about? What we did: we created 10 users, fetched the information about the users using the Keystone API, and measured the time. We found that it took somewhere around 0.649 seconds to fetch the records. Then we increased the number of users and observed the time taken, and the time increased from 0.6 to 1.7 seconds, which indicated there is a problem, a scalability problem: as you grow the number of users, you will find more and more issues. The next observation was the CPU consumption data, which is again available on the profiler side. We observed that in the first operation the CPU consumed by MySQL was 3.2%, and in the second case the CPU consumption increased to 34%. This made it clear that there is definitely an issue, and it must be resolved. We also observed that it's not possible to solve this issue just by changing configuration, because this is a fundamental core issue in Keystone; it had to be reported. So we analyzed the source code and concluded that the Keystone database schema must be changed, or some fix must be contributed. We have our own team that works in the OpenStack community, so we discussed this behavior with the developers and filed a bug. That bug is being fixed in the next release, and we are adopting that upcoming release for our work. This is the typical outcome of using the methodology we applied to our system.

The prime motive of this talk is to advocate the methodology we use for evaluating any OpenStack software before we deploy it in a mission-critical system. At NEC we try to ensure that whatever we deliver to the end customer, be it open source or not, has to pass the highest quality benchmark. So whatever system we deliver goes through this kind of rigor, and we ensure that each and every use case that is delivered works with that accuracy.

Okay, so that's all. I think I finished quite early; I wanted to have a discussion with people. So I'll stop the presentation now, and it's time for some questions and interaction.

Yeah, so what we do is report these kinds of problems to the community.
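Going back to the Keystone observation above: the same kind of measurement can be reproduced with the openstack CLI plus psutil watching the MySQL server process. This is only a rough sketch of the idea, not the team's actual scripts; the user counts, the "demo" project, and the "mysqld" process name are assumptions made for the example.

```python
# Sketch: grow the number of Keystone users and watch both the time taken to list
# them and the CPU consumed by the MySQL server process. Illustrative only.
import subprocess
import time

import psutil


def mysqld_process():
    """Find the MySQL/MariaDB server process (assumed to be named 'mysqld')."""
    for p in psutil.process_iter(attrs=["name"]):
        if p.info["name"] == "mysqld":
            return p
    raise RuntimeError("mysqld process not found on this node")


def create_users(count, start_index):
    for i in range(start_index, start_index + count):
        subprocess.run(["openstack", "user", "create", f"eval-user-{i}",
                        "--project", "demo"], check=True, stdout=subprocess.DEVNULL)


def timed_user_list(mysqld):
    mysqld.cpu_percent(interval=None)            # reset the per-process CPU counter
    start = time.perf_counter()
    subprocess.run(["openstack", "user", "list"], check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return elapsed, mysqld.cpu_percent(interval=None)


if __name__ == "__main__":
    mysqld = mysqld_process()
    total = 0
    for batch in (10, 90, 900):                  # 10, then 100, then 1000 users in total
        create_users(batch, start_index=total)
        total += batch
        elapsed, cpu = timed_user_list(mysqld)
        print(f"{total:5d} users: user list took {elapsed:.3f}s, mysqld CPU ~{cpu:.1f}%")
```

The exact figures will differ from the 0.649 s and 1.7 s quoted above, but the shape of the result, time and MySQL CPU growing with the user count, is what signals the schema problem.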
We have a team that works on the different components, some core developers and some contributors; they file these bugs with the community, and the community fixes them. As far as the report itself is concerned, it's use-case specific, so there's no point sharing those kinds of reports; they're of no use to anybody else. What we do is extract the information and give it back to the community in the form of bugs.

So what we do, for example: let's say we file a bug and a developer needs some data, some observations. We provide that information to the developer: this is the behavior, this is the test case, and this is how we observed the behavior using that test case, so the developer can verify the bug. Typically the developer doesn't need the profiler data dump; they need a scenario to reproduce the bug. So we create the scenario, reproduce the bug, and give the test case to the developer so that they can reproduce it.

Yeah, so it's pretty easy. For example, I mentioned the basic five-node system; take any OpenStack deployment you have, and there's an open source tool called LTTng, the Linux trace toolkit, with a pretty standard manual available. Anybody can install that software on whichever node they wish to target, and reproducing those same traces is pretty easy; it won't take more than a day, or maybe half a day, to set up the entire thing. The challenging part is analyzing the logs and the trace data. That has to be done using custom scripts, or otherwise manually, and that's the painful job. That's where we try to reduce the pain for the developers and derive scenarios that can be run directly on a standard deployment of OpenStack so the bug can be observed. If somebody's interested in setting up the system, please bug me; I'll help you out, and maybe we can put together some small documentation on how to set it up as well.

Okay, so you're asking about this table, the first table, about the number of use cases changing? This one? Okay, so for example, for the Nova case, the number of use cases in Liberty increased and the number of new use cases in Mitaka decreased. The target system requirements changed during the Liberty release, and that's when more use cases for Nova became applicable. From Liberty to Mitaka the changes in Nova were very few, so most of the use cases and analysis we had done in the Liberty release were still applicable in the Mitaka release, and the count of new use cases was pretty small. Since there was no change in the component and our use cases remained constant, everything remained as it was.

Due to lack of time I didn't go through all these use cases, because I wanted to finish in the given time, but you can anyway see the list of use cases. Are they online? Not really online, but they're pretty simple use cases, derived from the APIs only, so it's possible to share them; that's not a big job. Okay, and you're asking about the mission-critical points, yeah.
Most of these issues, as I already said, were reported; we have a lot of community contributors, so they have reported those bugs. These issues are of two kinds. One is a bug, like the one I showed; the second is an issue with the configuration, where we change the configuration and solve it. And that is also given back to the community: whatever we apply in our system, we contribute back to the community documentation as well. For example, if someone wants to use Nova optimally in such a scenario and the documentation forgot to mention a particular config change and what its impact is, we contribute that documentation part back to the community.

Yeah, it's possible. If you're interested, you can get in touch with me and we can always discuss those use cases, and for that matter, for any component, we can discuss the troubles we faced. You'll find those troubles anyway in the community documentation or community bugs; most of them are there in community reports. What you will not find is the consolidated report, because the consolidated report is use-case specific; I don't know whether people in the community are interested in a specialized report that is only useful for my case. Community people are interested in knowing the problems in a component in a way that is generic enough, not use-case specific, and that's where the whole effort goes. So maybe we can have an offline chat and I can explain the process and how we do it; maybe you'll find a better way to do it. But it's all about capturing the profiling data, dumping it, and analyzing it; that's the challenging job we have.

On the tools: I mentioned LTTng, but LTTng is just one tool. We use a lot of tools; for example, for network monitoring we use netperf, and all the standard Linux tools are used. As I said, we have RHEL 7.2 as the base OS on which we run the profiler, so we use all the tools available in Linux, especially for tracing and profiling; it's not only LTTng, there are many tools involved.

Yes, so when we upgrade from one release to another release, these components are impacted, and that's where the problems are reported to the community. When you upgrade from one release to another, there are certain issues that are observed, yeah.

Okay, so thank you. Thanks.