Can you start filling in? Thank you. So welcome to the ML/AI track at DevCon US. Our next speaker is Anton, who is going to be talking about Data Hub for CI. I'll let him take over.

Thanks. So I'll talk a bit about the OpenShift roots of Data Hub — where it comes from and how it evolved from regular OpenShift logging into what we now have as Data Hub. I'll also talk about how we use Data Hub specifically to get data from various CI systems and what we do with that data.

It all started with logging on OpenShift; this is the logging architecture on OpenShift. For those who don't know, OpenShift is Red Hat's distribution of Kubernetes, and it includes logging capabilities, meaning it can collect logs from all the containers running inside it and from the hosts that OpenShift runs on. This logging capability ships with OpenShift itself and lives in the logging namespace — a namespace is basically a named collection of pods and services within OpenShift. The logging namespace consists of three major components: Fluentd is the collector, Elasticsearch is used as storage for the logs, and Kibana is the visualization and reporting layer.

There are some other, smaller services in the picture. We have the service — a service is a Kubernetes concept that lets you reach the appropriate pod. We have Curator, a component that manages indices: it discards old indices and optimizes older ones. We have Prometheus and Cerebro. Prometheus is responsible for monitoring, for getting metrics from the infrastructure — specifically here we can monitor Elasticsearch. Cerebro is another, optional component that gives you specific Elasticsearch-related metrics.

So what does OpenShift logging collect? It collects logs from the various pods and containers running in the cluster, it collects audit logs, and it collects system logs.

If we look deeper inside the server part of the logging infrastructure, we can see that Kibana consists of two containers and Elasticsearch consists of two containers, and each is fronted by an authentication proxy. If we follow a user request coming through: the user goes to the browser, the query goes to Kibana, which is exposed as a route in OpenShift, and it reaches the authentication proxy. The authentication proxy redirects the user to the OpenShift API so that the user can authenticate with their OpenShift credentials. If the user authenticates successfully, they are redirected to the Kibana container. Kibana then uses the user ID and headers from the OpenShift authentication to query Elasticsearch. That way we achieve multi-tenancy, because the user behind the request is known, so we can return query results based on the user ID and headers.

So Kibana sends the request to Elasticsearch. In Elasticsearch there is a special plugin, developed partially by the OpenShift folks and partially by Search Guard, that allows for multi-tenancy; we use it heavily in Data Hub. The plugin protects indices from unauthorized access: a user can only access the indices they have explicit permission to access.
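As a rough illustration of that multi-tenant query path, here is a minimal sketch of querying Elasticsearch through the fronting authentication proxy with an OpenShift bearer token. The route hostname, token location, CA path, index pattern, and field name are all assumptions for illustration, not the actual Data Hub configuration.

```python
# Minimal sketch: query Elasticsearch through the auth proxy with an OpenShift
# token. Hostname, token path, CA path, index pattern, and field names are
# placeholders; the proxy only returns results from indices the user may read.
import requests

ES_ROUTE = "https://elasticsearch-datahub.example.com"            # assumed route
TOKEN = open("/var/run/secrets/openshift/token").read().strip()   # assumed token file

query = {
    "query": {"match": {"ci_job.name": "my-pipeline"}},           # illustrative field
    "size": 10,
}

resp = requests.post(
    f"{ES_ROUTE}/app-*/_search",
    json=query,
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify="/etc/pki/ca-trust/source/datahub-ca.crt",             # assumed cluster CA
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_index"], hit["_source"].get("@timestamp"))
```

The point is only that the caller identifies itself with its OpenShift identity, and the proxy plus the index-protection plugin decide which indices that identity is allowed to see.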
On the ingest side, we have a service exposed — we can expose it via an external IP or a NodePort. The request comes in, goes directly into Elasticsearch, and the document gets indexed. Fluentd, or any other ingester, must have the appropriate certificates to get a document ingested into Elasticsearch. At the moment Fluentd can ingest into any index, but this can also be tuned.

For monitoring we use Prometheus, which queries the authentication proxy and gets a token from OpenShift; that way Prometheus has authorized access to the statistics of a specific Elasticsearch node. So this is how secure authentication works in OpenShift logging, and this is what we use in Data Hub to ensure secure multi-tenant access as well.

That picture was pretty much a stock deployment of OpenShift with logging capabilities. What we then added were two major components. One is Kafka — a message bus that provides strongly ordered messages, is very resilient, and has very high throughput. The other is Ceph, which provides an S3 interface. Those got added to the stock logging included in OpenShift to get to what we call Data Hub, and Elasticsearch, Kafka, and Ceph are the key components in the picture. The other components we have are various data-ingestion components, various normalizers, and other applications that allow for different kinds of data manipulation. Everything here runs on top of OpenShift except for the S3 part at the moment — S3 runs on bare-metal Ceph — but other than that, all components run on top of OpenShift.

So, data ingestion. One of the key reasons we have Data Hub is to be able to get logs and other data from sources outside the OpenShift cluster. Basically, we took the stock logging on OpenShift and exposed it to the outside world. The way we did it was to introduce, as I said earlier, a message bus, which is Kafka, and normalizers, which are special components. We have three different normalizers right now: Fluentd, rsyslog, and Logstash. These three pretty much cover everything that is needed to get your data into the cluster.

So any host on the Red Hat network can be set up to send logs into Data Hub. As the collector you can use any of these — Fluentd or Logstash; alternatively you can use Elastic Beats, or you can use rsyslog, or you can simply point your syslog at a remote syslog server. In that case the rsyslog normalizer on our side acts as the remote syslog server, and your data gets into Data Hub.

Why do we need this message bus? It gives us a very robust layer: if there are any spikes, Kafka evens them out, and we don't have any problems ingesting the data into Elasticsearch. So one of the reasons is to even out the data flow.
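To make that ingestion path concrete, here is a minimal producer sketch that pushes one log record into a Data Hub Kafka topic using the kafka-python client. The broker address, topic name, certificate paths, and record fields are assumptions; in the real setup Fluentd, rsyslog, or Logstash plays this role rather than a hand-written producer.

```python
# Minimal sketch of sending a normalized log record into a Kafka topic over TLS.
# Broker, topic, certificate paths, and the record fields are placeholders.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-datahub.example.com:9093",    # assumed external endpoint
    security_protocol="SSL",
    ssl_cafile="/etc/pki/datahub/ca.crt",                   # assumed client certificates
    ssl_certfile="/etc/pki/datahub/client.crt",
    ssl_keyfile="/etc/pki/datahub/client.key",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "hostname": "builder-42.example.com",
    "message": "build finished: rc=0",
}

producer.send("logs.raw", record)   # topic name is illustrative
producer.flush()
```

Whatever pushes the record, the effect is the same: the message sits in Kafka until the downstream consumers index it, which is what absorbs the spikes.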
The other reason is communication with other buses. We have multiple other buses within Red Hat; one that is often used is the Unified Message Bus (UMB). We have many different build systems within Red Hat, and we also talk to some Fedora build systems, and the Fedora systems mostly use fedmsg as their bus of preference.

fedmsg is a message bus based on ZeroMQ, and it is publicly accessible over the internet. We basically subscribe to some topics on fedmsg and get a lot of data channeled into our internal Data Hub message bus. On the Red Hat build-system side, we subscribe to the Unified Message Bus and channel most of those messages into our Kafka bus. The idea here is how the build pipeline works upstream and downstream — upstream meaning Fedora, downstream meaning within Red Hat. Something gets built in Fedora, and you get the corresponding message on fedmsg that something was built. Then the Red Hat build system kicks in, gets the pieces from upstream, and builds the respective components downstream. One of our tasks is to reconcile what is happening upstream with what is happening downstream, and we do this by tracking a message on fedmsg and the respective message on UMB, the Unified Message Bus. A lot of systems within Red Hat publish data to the Unified Message Bus: there are Jenkins plugins that publish to it, and a lot of Factory 2.0 services do so as well.

We also have an external Data Hub in a different cloud, and we have the ability to replicate the data. What we did was deploy Data Hub on the external cloud — that is also a full OpenShift deployment, with a Kafka message bus deployed there. We exposed Kafka as an OpenShift route to the internet over the secure Kafka protocol, and internally we deployed MirrorMaker. MirrorMaker is a Kafka component that allows data replication between geographically distributed Kafka buses. What MirrorMaker does is simply act as a consumer and a producer. Kafka has the concepts of consumer and producer: a consumer subscribes to a topic and reads data from it, and a producer writes data into another topic. So what we have here is a very simple structure — it was not that simple to implement, but the picture is simple: MirrorMaker mirrors topics from the Data Hub located in the other data center into the internal Data Hub. There are no open ports in the internal Data Hub; it sits completely within the Red Hat firewall. The external Data Hub only has one port open, for the secure connection to the external Kafka bus.

The other component we have is ShipShift, an internal tool we use to gather logs from various Jenkins servers. It waits for a notification on UMB or on fedmsg (over ZeroMQ), and based on that notification it goes and grabs the logs and build artifacts from the appropriate Jenkins job. That way we get not only the messages from UMB and fedmsg, but also the full logs from Jenkins, which includes the Jenkins console logs and any logs produced within the job. The logs are then sent to the message bus and also stored in S3, which in our case is backed by Ceph.

So — data processing. On the processing side, everything we have is Kafka-centric, and we want to keep everything within Kafka. We run multiple applications: several rule-based normalizers that get data from one topic and send data to another, and several Kafka Streams apps that do the same thing. We don't have Spark yet, but the idea is that Spark will eventually do the same kind of processing.
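As a rough sketch of what such a rule-based, topic-to-topic normalizer boils down to, here is a consume-transform-produce loop using kafka-python. The broker address, topic names, and the field-renaming rules are illustrative assumptions, not the actual Data Hub rules.

```python
# Sketch of a topic-to-topic normalizer: read raw records, rename fields toward
# a common data model, write to a normalized topic. Names and rules are examples.
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "kafka.datahub.svc:9092"    # assumed in-cluster service address

consumer = KafkaConsumer(
    "logs.raw",                      # source topic (placeholder)
    bootstrap_servers=BROKER,
    group_id="normalizer-example",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

FIELD_MAP = {"time": "@timestamp", "job": "ci_job.name"}   # hypothetical rename rules

for msg in consumer:
    normalized = {FIELD_MAP.get(k, k): v for k, v in msg.value.items()}
    producer.send("logs.normalized", normalized)           # destination topic (placeholder)
```

MirrorMaker is conceptually the same loop with an empty transform step, consuming from one cluster and producing to another.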
So all the processing, all the data normalization, happens on top of Kafka, which ensures consistency and ensures delivery of all the messages.

As for the presentation layer, it is set up very simply. The message bus contains certain topics that are mirrored into Elasticsearch: a topic in the message bus corresponds exactly to an index pattern within Elasticsearch — another topic, another index pattern. These indices are presented to the end user via Kibana, and we have a separate repository of reports that we upload to Kibana, so the user can view the various saved reports in Kibana.

Common data model. One of the key things we have is a common data model — a rigorous definition of all the fields within Kafka and Elasticsearch. It allows us to make sure we don't have any inconsistencies and that we get very good correlation between different and similar entities. If we have, for example, a field called timestamp, it will be the timestamp of the log entry in every kind of document. If we have a field called job name, it will be the name of the build job in any kind of document. The other thing is that we use hierarchical namespaces, which means that if you have a product — for example, OpenStack — it gets a completely separate namespace within the JSON document, so anything produced within OpenStack can simply go into that JSON namespace, and nothing outside of OpenStack can conflict with it. Hierarchical namespaces allow us to have no conflicts across Red Hat products. The common data model also allows us to generate templates for Elasticsearch easily.

Specifically for CI, we built several namespaces, such as CIJob. The CIJob namespace holds the definition of a specific Jenkins job, or of some other kind of job, and we can use it within Elasticsearch to correlate across any documents. TestCase is another namespace, which allows us to correlate any test case: any data about test cases is nested under TestCase, and any data about the CI job is nested under CIJob. The same goes for Fedora CI, which is specifically for Fedora infrastructure messages: any Fedora infrastructure message is nested under Fedora CI, so nothing else can conflict with it, and we end up with very good correlations.

The basic artifact workflow is very simple. This picture does not include the ingestion of messages via the message bus; it only covers the ingestion of Jenkins artifacts. There are four major stages: collection, normalization, storage, and visualization. Collection is done via ShipShift. Then we do normalization, which is mostly done via Logstash — right now we also have other Kafka applications that can do normalization. Storage, at the moment, is Elasticsearch, and visualization is done via Kibana.
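To make the common data model and hierarchical namespaces a bit more tangible, here is a toy sketch of a nested document and of generating an Elasticsearch mapping from it. The field names (ci_job, test_case) and the tiny template generator are illustrative assumptions, not the actual Data Hub schema or tooling.

```python
# Illustrative document with hierarchical namespaces: CI job data under one key,
# test case data under another, so different producers cannot clash.
doc = {
    "@timestamp": "2019-01-25T10:15:00Z",
    "ci_job": {"name": "openstack-gate", "build_id": 1234, "status": "SUCCESS"},
    "test_case": {"name": "test_volume_attach", "result": "passed"},
}

# Toy generator: derive an Elasticsearch mapping from an example document.
# The real model definition and template generation are more rigorous than this.
def to_mapping(namespace: dict) -> dict:
    props = {}
    for key, value in namespace.items():
        if isinstance(value, dict):
            props[key] = to_mapping(value)
        elif key == "@timestamp":
            props[key] = {"type": "date"}
        elif isinstance(value, int):
            props[key] = {"type": "long"}
        else:
            props[key] = {"type": "keyword"}
    return {"properties": props}

template = {"index_patterns": ["ci-*"], "mappings": to_mapping(doc)}
print(template)
```

Because every producer nests its data under its own namespace, a field like ci_job.name means the same thing in every document, which is what makes the cross-document correlation work.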
This is the layout of Data Hub from the OpenShift point of view. OpenShift consists of various namespaces, and broadly speaking we treat a namespace as an application. We have, for example, a namespace that is responsible for all the ingestion, a namespace that holds the message bus, a namespace for normalization, a namespace for Elasticsearch, et cetera. Basically, this namespacing in OpenShift gives us a very modular approach to Data Hub. If we want to add or remove a namespace, it is very easy to do, and it doesn't take a lot of effort to remove or recreate one. We can also control which namespaces are allowed to talk to each other, so we disallow communication from non-Data-Hub namespaces — which we do — and only allow communication between Data Hub namespaces.

This is just a snapshot of the various tools and components we have in Data Hub; I've already talked about most of them. Elasticsearch, Ceph, Spark, and Kafka are the key components, and we have a lot of other tools here as well. I think that's it. Any questions?

Thank you. I'm curious how you deal with errors during normalization. Say someone changes the format of a message-bus message so that your normalization can no longer convert it into your standardized format — do you error out at that point, or what do you do?

At the moment we don't have a great way to deal with that; we do error out. It really depends on the tool involved — some tools handle this better than others. Depending on the change, there might not even be an error, but if there is one, we will error out. We don't currently have specific validation that a message must be in a given format. We want to get to that point, but we're not there yet. Thank you.

Yeah, you've got a question? Yes — is CentOS also on your agenda?

So CentOS also uses fedmsg, right? So yes, in that sense. And Fedora uses CentOS CI. I'm not 100% sure what the relationship between Fedora and CentOS is, but we use the same code base, and one of the external instances of Data Hub is in CentOS CI.

Another question. Very nice — I appreciated the look under the covers of the design. But do you have anything to show, like a dashboard, of how you would use the information?

I would, but I'm afraid I don't have a VPN connection from here.

Just one final question. This looks very close to a product my company has decided to use, called Splunk. Have you heard of it? Do you know what the relationship is between them?

Right. Splunk is proprietary software. We use Elasticsearch, not Splunk. Thank you. Any more questions?

So we use something similar to Elasticsearch — Graylog, if you're aware of it, the log aggregation platform. Sorry, which one? Graylog. Oh, OK, Graylog. Yeah. I was curious whether you have any monitoring, alerts, or notifications built on top of the Kibana data.

Malik can answer this question. Right now — that was my summer project — we monitor each of the Data Hub components and send out alerts if there is unusual log activity or things like that. We are using Prometheus for monitoring and Grafana for visualizations.

Any other questions? Thank you, Anton. Thank you.