My name is Kalyan, and I am from Intude. I'm going to talk about how we do observability at scale: we deal with petabytes of data a day, billions of transactions, and a lot of diverse, distributed data, and I want to share our experience. We use open source heavily and we contribute back, so I'll also talk about some of what we have been doing in that space. I have over 30 years of experience working on platforms, and I have been contributing to open source and open standards throughout my career. That's briefly about me.

The agenda for today: the theme is federated search over distributed data. As observability scales up, we are dealing with huge amounts of diverse data. We'll look at that data, the challenges it creates, and some of the solutions. The idea is that instead of reinventing a solution for these problems, we should leverage what has already been done in similar areas. There is a lot of content and only five minutes, so the slides shared on the site carry much more detail, along with references; this is just an overview.

The challenge is that we deal with a lot of raw data, plus derived and analytical data. There are logs, traces, and metrics coming in, and there is also profiling and performance data. For various reasons we need to keep some data at the edge in the cluster, and some in intermediate storage such as S3, which could be primary or secondary storage. The metrics, logs, and so on also need to be correlated and stitched together. So essentially you have data in a number of stores, and all of it needs to be stitched together to get any meaningful analytics. How do I do that? That is the challenge we are trying to solve.

Some of the specific challenges: first, correlation. I need to make sure the data is correlated across different layers of the stack and across different types of data, whether it's metrics, traces, or logs. How do I make sure all of that can come together? Second, we are seeing exponential growth in observability data, and I cannot simply move everything to one central place: it is not efficient, there is latency, and there is a huge network and egress cost to pay. We also need data quality and governance, and data is just noise if there are no insights and AIOps accompanying it. How do I build all of that over data that is distributed?

Some of the solutions we have implemented, which I have also done in previous jobs and seen done in the industry: first, we need to be cost effective, and one key concept is data gravity. As data explodes exponentially, I need to do my computation and processing where the data exists. I cannot expect to bring petabytes of data into one place every day; it simply does not scale. So I keep the data where it is, which is cost effective, and instead of sending data centrally, I ask: can I send the insights centrally, essentially derived data? Can I do something with the metadata?
So: keep metadata central and data local. Then use federated search, and approaches like data fabric, where you have a virtualized view of this whole distributed, diverse, heterogeneous data, so that you get a unified, centralized, holistic view, but you process the data where it exists. You do that with approaches that are already popular in the industry, in the big data world and many other areas: concepts like DataOps, data fabric, Presto, and so on. Those are things we have used; we have seen some gaps, and I'll briefly cover what worked, what didn't, and some of the enhancements we had to make.

Data fabric has been very popular. It is used for a slightly different DataOps problem, where you have data that is distributed and you need to provide a holistic view of it regardless of location and format. It is a complex area, but some of the key concepts are: a virtualized view of the data, a data catalog, and metadata. The fabric is the glue that brings together all the different diverse data, and that glue is the metadata, because I need to know what the data is, how it relates to other data, how to locate it, how to bring it in, and how to stage it together. So the metadata is critical.

If you are talking about open source, which is what CNCF is about, one approach is to do it with Apache Arrow, an extremely popular ecosystem used across a number of open-source projects and commercial vendors. In the interest of time I've just shared the details in the slides, but basically Arrow is an in-memory columnar format that is very good for vectorized parallel processing; it is zero-copy, so there is no serialization cost. If I need to send data over the network, I use Arrow Flight, and there is also Flight SQL if I want a SQL interface on top of it. For query processing there is DataFusion. Together these are extremely powerful, and there are a number of implementations. One very popular one is the FDAP stack, used by InfluxDB and many others; Velox builds on Arrow as well, and OpenObserve is another example. The idea is to use Arrow for in-memory data, Flight for networking, DataFusion for query processing, and Parquet for storage, and these are all very well integrated. That works very well.

So what are the gaps? One we saw is that, while most of what we require in a data fabric or in federated search works with Arrow, a key missing aspect is the metadata; that is still not supported. There are open-source metadata platforms such as DataHub and Apache Atlas that you could use, and that is something we are looking at. Similarly, distributed query is important: when I send a query, I need a query plan, optimizers, an execution plan, and so on, and then I federate it over a number of nodes and aggregate all the data back together. That is still a missing piece. Within the Arrow ecosystem there are projects like Ballista and GlareDB that do it, or you could build something fairly easily, so that's one option.
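To make the FDAP idea a little more concrete, here is a minimal sketch in Python with pyarrow of the "data local, pull on demand" pattern: an edge node keeps its data as Parquet and serves it over Arrow Flight, and the central side fetches Arrow record batches only when a query needs them. The host names, file paths, and table names here are illustrative assumptions, not anything from the talk.

```python
# Minimal FDAP-style sketch (illustrative; names and paths are assumptions):
# the edge keeps its data local as Parquet and exposes it over Arrow Flight,
# and the central side pulls Arrow record batches only when it needs them.
import pyarrow.flight as flight
import pyarrow.parquet as pq


class EdgeFlightServer(flight.FlightServerBase):
    """Runs next to the data on the edge cluster."""

    def do_get(self, context, ticket):
        # The ticket names a local Parquet dataset; read it as an Arrow table
        # and stream it back as record batches, with no row-by-row serialization.
        table = pq.read_table(ticket.ticket.decode())
        return flight.RecordBatchStream(table)


# On the edge:  EdgeFlightServer("grpc://0.0.0.0:8815").serve()

# On the central side: fetch only the dataset a query actually needs.
client = flight.connect("grpc://edge-cluster-1:8815")
reader = client.do_get(flight.Ticket(b"/data/observability/logs/2024-06-01.parquet"))
edge_logs = reader.read_all()  # a pyarrow.Table

# The fetched table can then be queried locally, e.g. with DataFusion:
#   from datafusion import SessionContext
#   ctx = SessionContext()
#   ctx.register_record_batches("edge_logs", [edge_logs.to_batches()])
#   ctx.sql("SELECT level, count(*) FROM edge_logs GROUP BY level").show()
```

The point of the sketch is data gravity: only the batches the central query actually asks for cross the network, while the Parquet files themselves never leave the edge.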
Another gap: user-defined types are supported in Arrow, but UDFs are still not supported, and that is something being worked on.

The other option, quickly, is Presto. Presto is a distributed SQL query engine, and distributed queries are something it does very well. It has the concept of coordinators and workers, and there is also a resource manager. I can deal with diverse, distributed, heterogeneous data, and there is a huge number of connectors that Presto supports, so I can connect to all kinds of data, federate my query by sending different sub-queries to different connectors and workers, and bring all the data back, aggregating it into a unified result. That works very well, and it uses the Hive metastore.

What are the limitations? The major one we saw is that the metastore is very rigid. In today's large systems you need transparent partitioning, because partitions keep growing and you need to change them, and you need to deal with schema evolution. These are things that the Hive metastore and Presto cannot handle today. One approach is to use something like Apache Iceberg. Iceberg is seen as an open-source way to implement a lakehouse, which is a mix of a data lake and a data warehouse, but for what we are doing here, the key point is that Iceberg is very good at transparent partitioning, it supports schema evolution, it supports ACID transactions, and it has very flexible metadata. Presto has a connector for Iceberg, so one possibility is to delegate the metadata handling within Presto to Iceberg, which handles it much more flexibly, and then all the benefits of federated query and distributed processing for observability come to you (a short sketch of such a federated query over Presto's Iceberg connector follows at the end).

I could go on and on, but quickly: ClickHouse is another key component that can be very useful. Velox is a pluggable execution engine; it is based on C++, it supports Arrow, and it performs very well, particularly if you are currently doing a lot of processing in Java. And Iceberg, which I just talked about, works very well for the metadata handling in this use case.

Putting it all together, this is the Intude case study we talked about, where there are all these different domains, from analytics and data to federated queries, and we covered some of the possibilities. We have gone more with the Presto option, but as I said, you can also use the data fabric approach with Arrow; both will work for you. I don't have time for more, but I've shared some interesting references, and the deck is already uploaded to the site; I'll upload the updated one right after this session. I hope you found it useful, and you can learn more from what is uploaded to the site if you are interested. Thanks a lot.
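As referenced above, here is a minimal sketch of what the Presto-plus-Iceberg option can look like from a client's point of view, in Python using the presto-python-client package. The coordinator host, catalog, schema, and table names are all illustrative assumptions; the point is only that a single SQL statement can join an Iceberg-backed table with data behind another connector, and the Presto coordinator plans it, fans the work out to workers, and aggregates the result.

```python
# Illustrative sketch of a federated Presto query; host, catalog, schema, and
# table names are assumptions, not from the talk.
# pip install presto-python-client
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal",  # the Presto coordinator
    port=8080,
    user="observability",
    catalog="iceberg",        # Iceberg connector: flexible metadata, hidden
    schema="observability",   # partitioning, schema evolution, ACID tables
)
cur = conn.cursor()

# One federated query: traces live in an Iceberg table, raw logs sit behind
# the Hive connector on S3. Presto plans the query, sends sub-plans to the
# workers next to each source, and aggregates the results centrally.
cur.execute("""
    SELECT t.service_name,
           count(*)                                AS error_traces,
           approx_percentile(t.duration_ms, 0.99)  AS p99_ms
    FROM   iceberg.observability.traces t
    JOIN   hive.raw.logs l ON t.trace_id = l.trace_id
    WHERE  l.level = 'ERROR'
      AND  t.day   = DATE '2024-06-01'
    GROUP BY t.service_name
    ORDER BY error_traces DESC
""")
for service, errors, p99 in cur.fetchall():
    print(service, errors, p99)
```

On the Presto side, the Iceberg catalog is typically configured in a catalog properties file with connector.name=iceberg and a pointer to the metastore or catalog service; the exact properties depend on the deployment, so treat that as a sketch as well.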