OK, hello everyone. My name is Zoltan and I work for IdomSoft. In the next roughly 20 minutes I will show you how we approach observability across our ecosystem, but first a few words about us. IdomSoft is an IT company in the Hungarian public sector. We do the integration, installation, development and operation of nationwide IT systems [unintelligible], and we also provide services to local governments. In addition to that we manage the national datasets, which consist of seven registers supporting administrative records and related procedures, and every year we also support elections at the national level. We operate central government services, such as the KKSZB service bus for data interoperability [unintelligible]. We are also responsible for public application development and the related investments, and for that we have the public application catalog with information about the developments.

But why do we need efficient operations? As you may realize, we have a lot of services, and taking into account the age and type of the technology, we have very old systems as well as cloud-native applications. Our goal is to make operations efficient, make error handling faster and more effective, and increase the availability and reliability of the services. We also want service operations to benefit from a reduced number of incidents, and in the end we want to support the work of application development as well.

But how? We use open source wherever we can. The important points are that there are no license limitations and no additional cost, the development cycle is faster, and there is a wide user community. But there are trade-offs and disadvantages as well: there is no customized support, a public code base can in some cases be a problem, and the lack of documentation, as well as its quality, depends on the particular open source solution. And how can we improve observability? We use anomaly detection and other methods such as root cause analysis, analysis of APM data, forecasting and service maps.

We are in the observability session, but I would like to define what observability is. The notion of observability was introduced by the Hungarian-American engineer Rudolf Kálmán in 1960, in his paper on the general theory of control systems, for linear dynamical systems. In a software environment, that is in a distributed system, observability is the ability to collect data about program execution, the internal state of modules and the communication between components. If you check Wikipedia there are a lot of definitions; I would say the short definition is simply how well we understand what is happening in our system. We introduced observability because monitoring alone is not enough in a complex, heterogeneous computing environment, and in addition we started to use AIOps machine learning techniques as well.

But how do we process the data? The challenge is that there are a lot of products and a lot of microservices, so we need a unified monitoring data structure in order to analyze the data, and we also have different data sources. We are collecting data from legacy systems and from cloud-based, Kubernetes-based platforms, and we will also collect data from mobile and web applications.
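As a rough illustration of what such a unified monitoring data structure could look like, the sketch below maps a plain legacy syslog-style line and a Kubernetes JSON log into one common record before indexing. This is only a minimal sketch; the field names, formats and source labels are assumptions for illustration, not our production schema.

```python
import json

def to_common_record(source: str, raw: str) -> dict:
    """Reduce every assumed source to the same handful of fields."""
    if source == "legacy-syslog":
        # Assumed legacy format: "<timestamp> <host> <level> <message>"
        ts, host, level, message = raw.split(" ", 3)
        return {"timestamp": ts, "service": host, "level": level, "message": message}
    if source == "kubernetes":
        # Assumed Kubernetes container log already serialized as JSON.
        rec = json.loads(raw)
        return {
            "timestamp": rec.get("time"),
            "service": rec.get("kubernetes", {}).get("container_name", "unknown"),
            "level": rec.get("level", "INFO"),
            "message": rec.get("log", "").strip(),
        }
    raise ValueError(f"unknown source: {source}")

print(to_common_record("legacy-syslog",
                       "2024-05-01T10:15:00Z app42 ERROR connection refused"))
```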
After that, we process the data, store it in a central place, and then analyze and visualize it. In the picture on the right you can see how the production system is built. We collect the traces and metrics with OpenTelemetry, use the OpenTelemetry Collector and Data Prepper to process the data, and OpenSearch to store it. For the logs, we use Fluentd and Logstash for transforming and parsing the logs and store them in OpenSearch as well. Grafana and OpenSearch Dashboards are used for visualization, and Jupyter for analysis. This is only application monitoring; we use Prometheus, Nagios and Icinga for other types of monitoring, such as infrastructure monitoring.

For data integration and processing we use OpenTelemetry for metrics and traces. As you may already know, it is a collection of tools, APIs and SDKs for analyzing software performance and behavior. Why did we decide to use OpenTelemetry? Because it is vendor-agnostic: the same agent can collect all the telemetry data, which means our data pipelines are much simpler. There are various receivers, processors and exporters, and it is also possible to use processors for cost optimization, to reduce the amount of data kept in storage. For instrumentation we use auto-instrumentation where we can, but sometimes it does not work, for example in the Node.js case. Currently we deploy the instrumentation agent using Helm, but we are moving to the OpenTelemetry Kubernetes Operator.

The second part is collecting logs. We use Fluentd, a cross-platform log collector, to collect application logs as well as the system logs of the nodes where Kubernetes is running. For Kubernetes we will move to Fluent Bit, and we keep Fluentd in the legacy systems for collecting logs. For data processing, we use the Collector to process the metrics and traces, and Data Prepper to process the OpenTelemetry data and insert it into OpenSearch.

For data storage and analysis we use OpenSearch, a real-time distributed search engine that performs very well on this kind of data. One of the good things is that you can store logs, traces and metrics in a single backend, and it has a lot of functionality: trace analytics for visualizing OpenTelemetry data, event analytics with SQL, PPL and DSL query languages, and it supports Prometheus as well. There is also a security analytics plug-in for SIEM functions, and a built-in machine learning plug-in. In one figure you can see the event analytics panel analyzing a specific error message; in the other figure you can see a trace with its spans and the trace analytics dashboard. As I mentioned before, we use Grafana and OpenSearch Dashboards for data visualization. For application alerting we use the built-in OpenSearch alerting plug-in; we group the same type of error messages together and send them to a dedicated email address.

Now I will talk about how we use machine learning algorithms to analyze the data. We have developed OpenPandas, a DataFrame-based analysis and manipulation framework for analyzing data stored in OpenSearch. It encapsulates the domain-specific knowledge of OpenSearch, provides data exploration, data transformation and visualization, and can be integrated with any Python-based analytics platform at scale.
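To make the instrumentation part more concrete, here is a minimal sketch of how a Python service could be instrumented manually with the OpenTelemetry SDK and export traces to a collector over OTLP. The service name and collector endpoint are placeholders rather than our actual configuration, and in production we mostly rely on auto-instrumentation instead of hand-written spans.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder service name and collector endpoint.
resource = Resource.create({"service.name": "demo-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each request handler creates a span; the collector batches the spans and,
# in a pipeline like ours, forwards them via Data Prepper into OpenSearch.
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.method", "GET")
    span.set_attribute("http.status_code", 200)
```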
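The screenshot mentioned next shows OpenPandas itself; as a rough stand-in, the sketch below pulls the same kind of error events with the standard opensearch-py client so they can be loaded into a pandas DataFrame. The host, credentials, index pattern and field names are assumptions for illustration only.

```python
import pandas as pd
from opensearchpy import OpenSearch

# Assumed host, credentials and index/field names.
client = OpenSearch(
    hosts=[{"host": "opensearch.example.internal", "port": 9200}],
    http_auth=("analyst", "secret"),
    use_ssl=True,
)

# Fetch the last hour of ERROR-level log records.
query = {
    "size": 1000,
    "query": {
        "bool": {
            "must": [
                {"match": {"severityText": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
}

resp = client.search(index="otel-logs-*", body=query)
df = pd.DataFrame(hit["_source"] for hit in resp["hits"]["hits"])
print(df[["@timestamp", "severityText", "body"]].head())
```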
In the screenshot you can see a very simple example of how you can use OpenPandas, and we then use OpenPandas for analyzing the data.

Now let's talk about anomaly detection. But first, what is an anomaly? An abnormal situation, an unexpected event. As you may know, it is not easy to decide what is normal and what is abnormal; you need a lot of information. If you look at the picture, you can see a plane landed on a highway, which is certainly not a normal situation.

Anomalies come in different types. There is the point anomaly: let's say we have ten cars and one plane, so something is wrong and we can see an anomaly. A point anomaly is simple because a single data point can be compared against the others. For a contextual anomaly you need the context: as in the figure, you have to see the cars and the environment to understand what is going on. And then there is the collective anomaly: if all flights are cancelled in the same region but you only have data from one or two planes, you cannot tell that it is an anomaly; you need all the information to decide.

There are a lot of algorithms: what I call the classical, statistics-based machine learning algorithms, like k-NN and PCA, and neural-network-based algorithms. We compared the classical and the neural network performance. As you can see in the figure on the right, the neural network's training time is much, much higher than that of the classical algorithms, but its accuracy is much better.

We have developed a convolutional autoencoder, which is a neural network; you can see the architecture of the autoencoder at the bottom of the slide. In our case we have an input time series, the encoder transforms it into a meaningful latent space Z, and the decoder reconstructs the time series from this compressed representation through minimization of the cost function. The convolutional autoencoder encodes the input data by splitting it into subsections and converting these subsections into simple signals that are summed together to create a new representation of the data.

Here is a very simple example of anomaly detection. We use this convolutional autoencoder and train it in an unsupervised way, because there are a lot of data points and it is not easy to annotate them all. So we decided to train the neural network on different patterns and then, when it is needed, move to a supervised approach. In the figure you can see a sudden increase of requests; this pattern had not been shown to the network, so it was not able to reconstruct that part of the time series. Where the sudden increase happens, the reconstruction loss of the time series is higher than a certain threshold, which in our case is 1. But if some organization wants to get some data from our system once a year, that is not an anomaly, so we have to tell the model somehow that if the same thing happens again next year, it should not be flagged as an anomaly again. For that we use Grafana and we label the data: zero means we do not know anything about that time region, whether it is an anomaly or not; minus one marks regions that are not anomalies; and one marks regions that are anomalies. After continued training, you can see in the last plot that the region which was previously above the red line is now at almost the same level as the other data, which means it is no longer reported as an anomaly.
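To make the training and scoring steps more tangible, here is a minimal sketch of a 1-D convolutional autoencoder for fixed-length request-rate windows, written with Keras. The window length, layer sizes and the threshold of 1 mirror the idea described above, but they are illustrative assumptions, not the exact production model.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

window_len = 128  # assumed number of samples per time-series window

model = models.Sequential([
    layers.Input(shape=(window_len, 1)),
    # Encoder: compress the window into a latent representation Z.
    layers.Conv1D(32, 7, strides=2, padding="same", activation="relu"),
    layers.Conv1D(16, 7, strides=2, padding="same", activation="relu"),
    # Decoder: reconstruct the original window from the latent code.
    layers.Conv1DTranspose(16, 7, strides=2, padding="same", activation="relu"),
    layers.Conv1DTranspose(32, 7, strides=2, padding="same", activation="relu"),
    layers.Conv1DTranspose(1, 7, padding="same"),
])
model.compile(optimizer="adam", loss="mse")

# Unsupervised training: the network learns to reproduce "normal" traffic.
# X_train has shape (n_windows, window_len, 1), already scaled.
# Windows later labeled as "not an anomaly" (-1) in Grafana can be appended
# to X_train for continued training, as in the last plot of the talk.
# model.fit(X_train, X_train, epochs=20, batch_size=64)

def anomaly_regions(model, X, threshold=1.0):
    """Flag windows whose reconstruction error exceeds the threshold."""
    recon = model.predict(X)
    errors = np.mean((X - recon) ** 2, axis=(1, 2))
    return errors > threshold
```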
In summary: monitoring is not enough, so observability in a heterogeneous computing environment is a must. OpenTelemetry is very good for unifying the data pipeline and processing the data. Observability can be improved using AIOps, especially anomaly detection, in complex environments. In the end I would like to thank Pázmány Péter Catholic University for the collaboration and the Hungarian National Laboratory for its support, and thank you for your attention.

Thank you again, Zoltan. Any questions? Just raise your hand if you have any questions. Oh, yeah, okay.

So every time you have a new kind of anomaly, you have to do a new supervised training and label it. And then you also have to hope that whatever you label as an anomaly is also strong enough for the next time it comes up.

Yes, yes, every time. It is required because there are a lot of data points; we cannot ask someone to label what is normal from the start. So we start unsupervised and then we move to the supervised step.

Any other questions? Oh, yeah, go for it.

So do you have this pipeline running continuously all day long, or do you just sometimes execute it?

Sorry, could you repeat that? I cannot hear very well.

Do you have this pipeline running continuously all day long, or do you just sometimes run it to see whether there were any anomalies in the last week?

No, it is offline for now. It is not in the pipeline yet, but we are planning to put it into the pipeline so that it is not manual.

Cool, okay. Cool. Anyone else? In that case, thank you, Zoltan. We will meet again in six minutes for the next talk. Thank you.