Welcome, and thank you for coming to listen to our presentation. I hope you're having a great time here at the Open Source Summit in Japan this year. My name is Joakim Larsen, and I am representing the Platform SI Service and EA Departments at NEC. With me is the lead developer of Exastro, Ekiba Takio, who will be taking questions at the end of the presentation. Today I will be talking about how we can use open source software to solve problems often found in system operations.

A little bit about myself first. I come from Norway, which is best known for salmon, trolls, and maybe Vikings. I work for a company called Digital Information Technology, or DIT for short, and for the last three years I've been working on the NEC Exastro project as a developer, technical writer, and the lead translator and interpreter.

Have you ever thought about what it would be like to live in a perfect world? In terms of system operations, that is; I'm not going to talk about fixing world hunger today. In a perfect world, the whole system would be governed by perfectly tuned monitoring policies, and no system task would ever be a problem. That would be the perfect world. In real life, that's not the case. Most system monitoring policies are poorly tuned, and numerous unnecessary alerts come at the operators left and right. So staff have to begin each day sorting through known and unknown events. In this session I will demonstrate how we can use open source software to assist us in real-life situations, so we can get one step closer to that perfect world.

Our team here at Exastro is made up of engineers with experience from Japanese systems, and we collected their complaints about current as well as past projects. Among the most common complaints we heard: in most cases they wouldn't know which parameters they could change during system operations; in many cases they had to read through a large number of manuals to find out what to do; and system operations were very dependent on experts, meaning that if the experts left, nobody would know what to do.

Looking at these problems, we can see that many of them originate from the same root. In fact, most of them narrow down to three problems: human errors caused by manual work; situations getting worse because solutions are delayed; and, as I said, solutions being heavily dependent on expertise.

Now, what can we do about that? To reduce human errors, we can link the system to automation software so machines execute the solutions for us. If situations keep getting worse because solutions arrive late, then we should be able to start recovering the moment a problem is found. And for the last one, what we came up with is user-written rules that the system reads to find out what to do when a problem arrives. In the next slides I will show you how we can use different open source software to reach that solution.

Just like ours, all systems need to be maintained in order to run smoothly. But the people who have to do the maintaining are left waiting for incidents that could happen who knows when.
So instead of being able to use their time productively on more important tasks, they are often waiting or doing mundane work. The solution is to solve the three problems I showed earlier, and we believe the most effective way to do that is to automate the process and the pipeline.

Easy, right? We just automate the alert and event handling. Doing so for a single alert and a single problem wouldn't be that hard. But in real-life problems, the engineers have to sort through the data. They have to gather the alert events from multiple monitoring applications, and then evaluate the problem to find out what the necessary solution would be. When it comes to this, we need some more sophisticated logic. For example, we would have to: one, collect the alert events from all the different monitoring tools; two, unify the data so we can compare events to each other; three, evaluate the messages to find out what the proper solution should be; four, take those results and re-evaluate them to analyze them further; and five, use that re-evaluation to decide on the final solution. So what seemed simple as a concept turns out to be more complicated when put into practice.
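To make those five steps concrete, here is a minimal sketch in Python. Everything in it, from the fetcher callables to the field names and rule dictionaries, is an illustrative assumption for this talk, not Exastro's actual API.

```python
# A minimal sketch of the five-step pipeline; all names here are
# illustrative assumptions, not Exastro's actual API.

def normalize(event):
    """Step 2 helper: map tool-specific status wording onto one shared field."""
    status = str(event.get("status", "")).lower()
    return {**event, "service_down": status in ("stop", "stopped", "down")}

def run_pipeline(fetchers, rules):
    # 1. Collect alert events from all the monitoring tools.
    events = [event for fetch in fetchers for event in fetch()]
    # 2. Unify the data so the events can be compared to each other.
    events = [normalize(event) for event in events]
    # 3. Evaluate the messages against the user-written rules.
    actions = [rule["action"] for event in events
               for rule in rules if rule["matches"](event)]
    # 4. Re-evaluate those results to analyze them further.
    follow_ups = [rule["action"] for action in actions
                  for rule in rules if rule["matches"]({"result": action})]
    # 5. Decide on the final solution and hand it to the automation layer.
    return actions + follow_ups

# One fake monitor and one rule that reacts to a stopped service:
fetchers = [lambda: [{"host": "web01", "status": "stopped"}]]
rules = [{"matches": lambda e: e.get("service_down", False),
          "action": "restart web01"}]
print(run_pipeline(fetchers, rules))  # -> ['restart web01']
```

In real life each step is of course far more involved, but the shape of the flow is the same.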
Jumping right into it, this is our solution to solving the three problems. It might seem like a lot at first, but it's actually quite simple once we take a closer look. Don't worry, I will go through it bit by bit. Starting from the left, you can see a number of open source software listed: Grafana, Prometheus, and Zabbix; in the middle, MongoDB and MariaDB; automation tools such as Ansible and OpenTofu; and Mattermost. In the midst of all this you can see Exastro, which is a suite, a collection of different software, led by our team and proudly published as open source. In the middle sits Exastro OASE, a tool that can connect with different monitoring tools and automatically decide recovery actions based on user-written rules. Beside it we have Exastro IT Automation, which can connect with automation tools such as Ansible and OpenTofu and have them carry out and execute the solutions for us.

Now that you know the programs, let me break the process down bit by bit. Starting with the easiest part, we have the system here on the right, with all these different components: virtual machines on different devices. A few years ago these components were physical network devices, databases, and storage devices, often coupled together with web and application servers in the very common three-layer structure. Comparing that to today, improvements in the industry have not only made the monitoring itself much more efficient, but have also made the systems themselves more complicated, with virtual layers where the components are virtual machines and containers. That in itself wouldn't be a problem, but it makes the whole system harder for us to monitor. With parts constantly changing in and out, it's as if we were teaching a class of 30 students, but instead of the class consisting of the same 30 students, the students themselves changed in and out every day, making it very hard for us to keep up.

So what do we do to keep up? That's why we have the monitoring tools on the left. Now that we have the tools, how do we go about this without everything becoming a cluttered mess? The majority of monitoring, even with newer monitoring techniques, involves some sort of manual labor. That means the information we gather from the different applications has to be sorted manually, and with the amount of information coming out of modern systems growing more and more, that becomes an option that's simply not feasible.

So what can we do? Our proposed solution is to automate the process: let the machines sort as much as they can, and let them handle as many cases as possible. That's not to say we should let machines do everything; I don't even think that's possible in this case, since the system needs user-written rules in order to know what to do, and if it encounters an unknown event that no rule covers, it won't know what to do. I'm sure we could use some AI, but at this point in time I don't think any super AI or ChatGPT-8 would help us. This is a step that requires humans.

And that is why we created Exastro. Exastro has the ability to use MongoDB, MariaDB, Ansible, OpenTofu, and other OSS to help us out with this problem. Now, how do we do that? As stated earlier, one of Exastro OASE's functions allows it to collect alert events from multiple monitoring applications, and it has a backing service for aggregating them. OASE specifically uses MongoDB, a NoSQL database, for that backing service, which allows us to gather alerts and events from the different monitoring applications in one place.

This is important because all of these wonderful monitoring apps are different and unique, with different use cases. But that also means their messaging and alert formats differ from each other, which is not exactly what we want in this case: we want the messages to be comparable. While one monitoring app might say "system stopped", another might say "system = stop", and yet another "power = down". That automatically makes things harder for us, because we don't know how to compare events with different data formats, and because they all differ, the system is going to receive multiple alerts about the same problem. So we have to unify the data before we can compare it.

This is where Exastro OASE comes in, as it has the ability to label the events. Where we had statuses that said "power = down" or "status = stopped", we can label them all as, say, "service_down = true". Now that they are labeled, we can actually work with them: the system knows that all of these messages come from one problem, so instead of sending hundreds of different messages to the user, it can send one with the same information. That way the engineers won't have to spend a whole day reading through and sorting the messages.
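To illustrate the labeling step with the exact examples above, here is a small sketch. The event shape and the label key are assumptions made up for this example, not OASE's real data model.

```python
# Three monitors reporting the same outage in three different ways,
# as in the examples above; the field names here are made up.
raw_events = [
    {"source": "zabbix",     "message": "system stopped"},
    {"source": "prometheus", "message": "system = stop"},
    {"source": "grafana",    "message": "power = down"},
]

DOWN_MESSAGES = {"system stopped", "system = stop", "power = down"}

def label(event):
    """Attach a shared label so differently worded alerts become comparable."""
    return {**event, "service_down": event["message"] in DOWN_MESSAGES}

labeled = [label(e) for e in raw_events]

# All three alerts now carry service_down=True, so they can be collapsed
# into one notification instead of hundreds of duplicates.
down = [e for e in labeled if e["service_down"]]
print(f"service down, reported by {len(down)} monitors: "
      + ", ".join(e["source"] for e in down))
```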
Now we can also utilize OASE to make system operations much easier on the user. If Exastro OASE receives an alert that a user has already defined a rule for, it can send it to the automation software and tell it to execute the corresponding action. By chaining a couple of if statements, it can then automatically make the final decision. Say, for example, we write a rule that says: if the message says the website is down, then restart the server. And not only that, it is also able to re-evaluate any executed actions. Whenever an event matches a rule in OASE, the result is registered to MongoDB as a new event, where it can then be checked against the other rules. In other words, we can manage longer and more complex rules by writing single if statements.

For the next step, when OASE decides on an action, it sends a message to Exastro IT Automation, which then takes care of the recovery tasks. We can link ITA together with automation tools like Ansible and OpenTofu, making the whole process automated, and the system can even send the user a message containing any information regarding the event.

So that is our idea of how we can combine different open source software to solve mundane system operations. Say, for example, that the system is low on memory and low on storage. The monitoring apps will see that and raise an alert. The OASE agent picks that up and puts it in the database, where the events are also unified so they are all labeled the same. Then, if the user has written a rule that says "if low on storage, delete cache data", Exastro OASE will see that this message came in and that we have a rule matching it, so it can send the action to Exastro IT Automation, which passes it to Ansible or OpenTofu to execute the action and delete the cache data on the system.

To summarize, we can see here that the problems can be solved with the help of this software. To reduce human errors, we link with automation software (in this case Exastro ITA, Ansible, and OpenTofu) and have it automatically execute the solutions. For the second one, delayed solutions: if we automate the pipeline and connect it to Exastro OASE and the monitoring tools, recovery can start the moment problems are detected. And for the last one, if we use predefined, user-written rules, the system can work on its own, and even if the experts go away, the know-how will stay. Of course, we know this won't solve everything, but we do believe it can help engineers all over the world by freeing them from the same boring system tasks they have to do today.
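As a rough illustration of the low-storage example and the rule chaining described above, here is a sketch where every executed action is fed back as a new event so that further rules can match it. The rule format and the dispatch stand-in are assumptions for this sketch, not OASE's actual rule syntax.

```python
# Each rule is a single if-statement; executed actions come back as new
# events, so further rules can match them. All names are illustrative.

RULES = [
    {"when": {"low_storage": True}, "then": "delete_cache_data"},
    # A follow-up rule that matches the *result* of the first action:
    {"when": {"action": "delete_cache_data", "succeeded": False},
     "then": "notify_operator"},
]

def matches(condition, event):
    return all(event.get(key) == value for key, value in condition.items())

def dispatch(action):
    """Stand-in for handing the action to Exastro ITA / Ansible / OpenTofu."""
    print(f"executing: {action}")
    return {"action": action, "succeeded": True}

def handle(event):
    for rule in RULES:
        if matches(rule["when"], event):
            result = dispatch(rule["then"])
            handle(result)  # re-evaluate the result as a new event

handle({"host": "web01", "low_storage": True})
# -> executing: delete_cache_data  (and, had it failed, notify_operator)
```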
Now, while I do have the stage, I would also like to briefly talk about the Exastro suite. Exastro is a collection of open source software created with the purpose of digitizing and automating the system life cycle so we can save labor, and it contains different software with different purposes. Today I mainly talked about Exastro OASE, but I also briefly touched upon Exastro IT Automation, which was created specifically to automate system operations. We've gotten a lot of positive feedback for Exastro ITA, especially for its excellent connectivity with Ansible and OpenTofu, and it is as of right now the most used software in the Exastro suite. We also have Exastro Epoch, the green one on the screen, a brand-new DevOps tool created to accelerate cloud-native system development. Just mind that Epoch is not readily available overseas yet, but it will be very soon. And if you have any questions regarding any of the software on the screen, feel free to shoot us an email or ask us after this presentation, and we will answer any questions you might have.

You can also find more information by searching for Exastro or following the QR code on the screen. Both Exastro ITA and OASE, the software I introduced today, are available in both English and Japanese, and alongside all the documentation we have guides and an FAQ page.

Lastly, I would just like to introduce Exastro IT Automation as a cloud service; it is now available through NEC. If you would like to skip the process of constructing the right environment and installing the software, NEC provides ITA and Ansible as a ready-to-use package. It is available as software as a service, it can be used within two days of application, and all updates and maintenance are done by NEC. If this sounds interesting, you can send a mail to the address in the corner, info@ebis.jp.nec.com. That concludes our presentation. Thank you for your time, and we hope you enjoy the rest of the forum. Are there any questions? Yes.

Thanks for the talk. I like the idea of automatic remediation of alerts, but I also think it could be quite problematic. A naive example: there's a disk-full alert, and the naive thing is just to extend the disk, but the alert could be caused by an application that's misbehaving and filling up the database, and just adding disk could be quite costly. Do you have a set of best-practice remediations, things you have found to be good patterns or responses to alerts?

I will pass the question on to the development lead. Excuse me, could you just repeat the last part of the question?

Sure. Do you have a set of best practices that you've collected? Like, "this type of alert should be solved in this way", or, as you said, "disk full, delete the cache". Do you have a collected set of practices like that?

The best use of OASE in this case wouldn't necessarily be the rule sets themselves, but rather the fact that it can cut out the unnecessary messages coming into the system. At least in Japanese systems, one of the problems most professionals see is that a lot of messages come into the system at one time, and only a small fraction of them have anything important to say.

In an actual operations environment, you receive a single alert message and start by making a judgment: does this alert message match such-and-such a condition, true or false? And when it is true, you go one step further, for example analyzing the logs, and from those logs determine that this time it is this particular pattern. Lining up many conditions like that and deciding on the final action is what makes this tool great.

So, the next point is that although we have automation tools that can automatically apply solutions or actions to problems, the strength of OASE is its ability to use written rules to let the system decide: for example, it can read through the logs to see what the problem is, and if it finds a certain problem in the logs, it can decide what the action should be.

A common case where we were glad to have such a rule goes like this: we receive a single alert message, we collect the logs for that message, we analyze the logs for certain specific strings, we judge from that how to respond, and we execute the appropriate action. And once that is done, Exastro checks the result again and confirms whether the service has recovered, so the entire sequence can be carried out all the way through to the final recovery.

So one of the scenarios we do best at is not only automatically selecting what the action should be, but also, after the action has been taken, checking whether the system has recovered, and then continuing from there.
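To picture that one-alert-to-recovery flow, here is a small sketch. The helper functions are hypothetical and passed in as parameters, and the log pattern is made up for the example.

```python
import time

def recover(alert, collect_logs, run_action, is_healthy):
    """Sketch of the flow from the answer: alert -> logs -> action -> verify.
    collect_logs, run_action, and is_healthy are hypothetical helpers."""
    logs = collect_logs(alert["host"])            # gather context first
    if "OutOfMemoryError" in logs:                # made-up pattern to analyze
        run_action("restart_service", alert["host"])
    else:
        run_action("notify_operator", alert["host"])

    for _ in range(3):                            # then verify the recovery
        if is_healthy(alert["host"]):
            return True                           # recovered: close incident
        time.sleep(10)
    return False                                  # still down: escalate
```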
I hope that answered your question. Thank you. Yes?

Thanks for the talk. Is it possible to monitor OASE itself?

Monitoring OASE itself, I see. The question is about whether there is a way to monitor the health of OASE. Yes, including that, and including other applications and conditions; including other rules as well. So, is it possible to monitor the health of OASE itself? It's a simple question. Yes, it is.

I have another question. It seems difficult to cover complex rules, so is there any support available? During the presentation you showed re-evaluation and other such complex rules; do you have any specific examples of complex rules?

For example, take a situation where there is an alert in the service monitoring of a web application. Why the web application service went down depends on the situation: there may be a problem with the web application itself; there is the pattern where the network connection has been cut somewhere along the way; or the back-end application, the logic part, has gone down; or the database is overloaded. There are various situations like that. For example, when the web application service goes down and, within ten minutes, no alert other than the web application's is raised, we treat it as a problem on the network. Or the alert is raised by URL monitoring, but it then becomes clear from other alerts or logs that the CPU usage of the database is over 90 percent; in that situation the response is to reduce the amount of traffic, to lighten the load on the database. So even when the same alert comes in, if other events are happening, the response changes again. That kind of thing can be done.

So, to put that answer in English: say that a web application goes down and we only get that information. That might be for several reasons: it might be something wrong with the database, there might be instability in the network, or it could be low memory or something else. The system can first check whether, say, the network is unstable, but if it later comes to the conclusion that the CPU usage of the database is over 90 percent, it can respond to that instead. So it is able to check all of these different causes and find out which one is the actual problem. I hope that answers it, thank you. Yes?

Any more questions? No, I think we're good. Again, if you would like to contact us or learn more, you can reach us at the email here. Thank you.
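To make that last answer concrete, here is a sketch of a context-dependent decision: the same web alert leads to different actions depending on what else has happened in the last ten minutes. The event fields, thresholds, and action names are invented for illustration, not taken from OASE.

```python
from datetime import datetime, timedelta

def decide(web_alert, recent_events):
    """Same alert, different response, depending on the surrounding events."""
    window = timedelta(minutes=10)
    related = [e for e in recent_events
               if web_alert["time"] - e["time"] <= window]

    # Other alerts show the database CPU above 90%: lighten its load.
    if any(e["type"] == "db_cpu" and e["value"] > 90 for e in related):
        return "throttle_traffic"
    # The web alert stands alone within the window: suspect the network.
    if not related:
        return "check_network"
    # Otherwise treat it as a fault in the web application itself.
    return "restart_web_application"

now = datetime.now()
print(decide({"time": now},
             [{"time": now, "type": "db_cpu", "value": 95}]))  # throttle_traffic
```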