All right, let's go ahead and start the session: how to avoid overcomplicating your data architecture. First, a little bit about me. I'm Christina Lin, hi everyone. I just want to quickly walk you through some of my previous experience so you know where I'm coming from. I started playing video games a very long time ago, but that's too long a story, so let's say I started working in the industry a couple of decades ago. When I started, there was a lot of adoption of AS/400s, and I was doing a lot of data work on Sybase. That was at an insurance company, and we were working with Tata, or TCS as I think it's called today, on building this huge insurance system where everything was written in COBOL on mainframe computers, and we were extracting everything out of COBOL and putting it into our Oracle and Sybase databases. That was the first time I saw a stored procedure that was 200 lines long, maybe longer, I'm not sure, but it was long. It was the first time I saw how people can deal with data in such a precise way of moving it around, with a lot of temp tables going in and out. We were using a messaging queue, IBM MQ, if you still remember that thing, to get data out of the AS/400s, putting it into Oracle or DB2 at the time, and providing information through that.
Then I slowly moved on. After starting my career there, I moved more into the integration space. I was kind of the head of integration, and we were designing things around SOA, service-oriented architecture, if you still remember that. We were doing a lot of SOA work and adopting a lot of the old IBM SOA framework, and I found it really difficult, not difficult so much as very overcomplicated, again. That's when I started looking into other integration technologies like Camel. I was really into Camel, and I did a lot of projects integrating data with it. Because Camel talks to a lot of data sources, I was working with a lot of traditional databases, MySQL, Postgres, and some of the newer stores as they came along. At the time there was a lot of enterprise integration work to do, so I was also doing a lot of asynchronous communication through IBM MQ, ActiveMQ, or RabbitMQ, all messaging queues. Then everything moved into the data warehouse era, so I did a lot of work on Hadoop and slowly moved on to Spark. Then I moved on in my career to Red Hat, where I did a lot of Kubernetes work. Now I'm with Redpanda, because I believe streaming is going to be a very big part of what data is going to be, especially within data architecture.
For today, what I want to do is share what I've seen from working in the industry for such a long time: some of my findings, my observations, and my recommendations. I'm not saying I'm 100% right.
I really want to hear your feedback on my thoughts about how data architecture should look, at least for the next couple of years. So I would love to hear any comments at all, just let me know. But I want to talk about what I see and what I've been working with from the beginning, mostly with financial institutions, because I do a lot of consulting and work with them. This is the reality I see in those companies: they have a really old legacy system doing a lot of the core work. They're trying to move off that core system, but it's really difficult. Sometimes it's the "if it isn't broken, don't fix it" mentality, and a lot of people still have that. So they'll have a mainframe running somewhere in their environment.
Then, after a few years, they slowly move on to monolithic application development, where large application servers run a huge package of software that provides services, and they gradually move things off the mainframe or add more services on top of what it provides. Those monolithic applications access a lot of traditional relational databases (RDBMS), like Informix and all of those. The main way of communicating would be SOAP or REST, depending on the era they were implemented in.
After the SOAP services were adopted, there was a huge jump into the big data era, because people found they had a lot of data they wanted to reuse. They wanted to mine the gold, because of course there is gold, get it out of the system, find meaning in it, and make better, more precise business decisions on top of it. So people were using Hadoop, but because of the sheer amount of data going into that data store, they needed a way to make it easy to run calculations over it. That's when they introduced things like MapReduce to make the computation faster. And because that was difficult to set up, different things were built on top of it, like Hive, and later on Spark. Once that was running, it would mostly be isolated in one department doing business analysis. That information probably wasn't sent anywhere else; they were mainly generating reports and providing them to support business decisions, to improve workflows and that kind of thing.
Then they moved to the next era, which is event-driven architecture. Event-driven architecture has been around for a long time, and I did a little of it myself, but when I was at Red Hat I saw a lot of people adopting microservices because of the new Kubernetes platform. People were starting to say, hey, I have all these different types of applications; I want to handle them independently, I don't want them coupled, I want them to be easier to deploy, all that kind of stuff. So people started using a lot of microservices, and because of that, there's a lot more freedom in choosing data stores, right?
The data stores can be a bit more free-form, and it's easier, which is what they wanted. Because of the nature of the platform itself, it seems natural to adopt more horizontally scalable data stores. That's when NoSQL databases and key-value stores became really popular for these kinds of services, and that's when we see them coming around. Also, with the adoption of edge devices, maybe five or six years ago with the IoT craze, everybody started collecting all this data coming into the company. The data warehouse was no longer where they wanted to store it, because a warehouse needs structured data, and some of the data they wanted to keep from those devices wasn't that structured; some of it was just string data and things like that. That wasn't what they wanted in their warehouse, and they didn't have the capacity to process it all and load it in. So they looked into things like data lakes to quickly store things on the cloud. Another move I saw was going to the cloud more generally, just because of how difficult and complicated it is to maintain a data warehouse cluster internally, with all the layers of technology on top. That's why people were using a lot of cloud services.
Then, a couple of years later, LinkedIn started the Kafka thing, and everybody loved how it could handle much more traffic, especially for ingestion, enabling data to flow through your company a lot faster. And then there's the crazy AI thing everybody talks about today; if I don't talk about AI today, I feel like I'm left out, you know? That's happening in a lot of companies out there as well: they want to reuse the data they're collecting, put it into machine learning, and eventually move into the AI space.
So this progression still exists in a lot of companies today. They're still there, slowly building things up, and the differences between these eras are the communication protocols they use and the capacity they can handle. If you want to move data in and out of the systems built across those eras, you have to figure out different ways to move it. A lot of them rely on batch services under the hood. For real-time access you have HTTP, which is a good protocol, but the problem with HTTP is its request-and-response nature, which makes it slow, so it's probably not ideal for one-way or fast-feeding data communication. Kafka seems to be carrying a lot more of that load today. That's the reality I see, with all of these adoptions from different eras still there. And just a quick note on the top of the slide: I've seen this a lot, especially in the insurance industry where I used to work. You can see they have the core, they have different services like loans and claims running in their legacy systems, and they're adopting APIs and new app-related functionality as the days go by.
And that's just to give you a quick reference for what these are, because I know it's really hard to see on the slide.
OK, so what are the challenges when we have these kinds of systems running in the company? The first thing is data silos. Because of when the data was created and where it resides across different departments, it ends up stuck inside a particular space and it's really hard to get it out. A lot of the time we have to duplicate the way we collect and store data in different places just because we need it somewhere, right? So there are a lot of data silos, especially with the legacy systems, which are very hard to phase out. They're still there, and you still need to connect to them.
Another challenge is regulations and compliance. I think that's mostly an afterthought: because of the booming data infrastructure and data systems we have today, regulations are catching up. A lot of the clients and people I work with are struggling to fit compliance into their systems. Sometimes they have to do it after they've already implemented a particular data architecture, as the last piece, and then they have to find a way to work around it. That causes a lot of issues and is a big challenge for them.
Another one that has been talked about a lot over the last two or three years is open data initiatives, open APIs and all that. People want a way to quickly collect data but also to share data out. What is the best, fastest, safest way, given the regulations, to provide the right data to external vendors? That's another thing people struggle with.
Another one is data overload. We now have a lot more data coming in and out of our systems, and we have trouble figuring out where to put it all. Should we delete it? Should we store it, and where? Should we process it? There's an overwhelming amount of data, especially because we're capturing much more time-bound data: transaction histories, customer interactions, even behavioral tracking data. How do we make it useful, and how do we store it? That's a big question.
Last but not least, data pipeline chaos. Because we have all these needs and challenges, we're creating a lot of data pipelines to make sure data flows from one point to another. A lot of these pipelines were built on somebody's laptop, and if that person is gone, the pipeline is gone, right? I've seen that a lot, especially with marketing teams that maintain a single Google Sheet as a data store, and I see it happen a lot internally as well.
So I just want to take a step back, because I do a lot of integration and data work, and look at what these pipelines are. I think you've already seen this, but I want to re-emphasize it.
So what types of data pipelines do we have? We either have batch pipelines, which typically deal with larger sizes; by nature this type of batch is used with more legacy protocols and systems, like file transfers, or running against a relational database to pull a lot of things out. They are more vertically scaling; some do scale horizontally, but most are vertically bound, so to increase capacity you need to increase the capacity of your hardware. They tend to have longer latency, and you tend to see them around data warehouses and databases; that's where they reside. What runs on them is either batch services, where you just write a batch job and move everything, or something like MapReduce on top of your Hadoop data stores.
Then we also have micro-batch. I think that's a middle ground, for when we can't process quickly enough and need something good enough for BI, business intelligence reports; that's where I see them used most. They're smaller in size, because we want a much more recent report, right? It can't be two or three days late; that's not going to be useful. So we have defined chunks of data, and because of scalability we want to isolate a particular chunk and run it frequently, with some kind of temp storage to keep the historical pieces. These tend to scale better horizontally, and you can generate reports and data sets to get your data out there as well.
But what I've been seeing more recently is the use of real-time data. People want real-time decisions and real-time feedback, and that's why they're using streaming platforms to provide quick, direct feedback on those pipelines.
So that's good; we have all these pipelines. But when I ask people what their biggest problem is when working with pipelines, the one I hear most is the difficulty of accessing data. The data we're getting today is not as structured as we would like: it could be an image, a video, JSON, a CSV file, whatever, all different data structures. The lack of metadata makes it really hard to locate the data we want in the repository. And a lot of analytics tools and platforms are pretty strict about data format, so it's hard for them to reuse data that isn't provided in the required format. Therefore you have to create more complex data pipelines to transform the data into something they like. It's not an easy single step of, okay, I need to get that.
Or sometimes it's more like, okay, I know what the data is and I could get it out, but the problem is the security measures, regulations, and compliance, which introduce a lot of middle layers around where and how people can access the data. So that's one thing people find difficult when trying to access data.
Another one is noisy data. I think nobody can get away from that, especially with the data lake, or should I call it the data swamp? People find a lot of duplication, because they just keep throwing data into the data lake: it's a great store, it's cheap, you can put everything in there, but it's not very well organized. It becomes this huge blob of things, and when you're trying to pick something out it's really hard to find, hard to use, and it takes time. There's also the relevance of data: some data is outdated, so why is it still there? Why is it still in my transactional database? Should I clean it out, or should I archive it? Another really common source of noisy data is data mismatch, just garbage in, garbage out. Garbage goes in that isn't relevant, and it's hard to map who is who; sometimes a person works at a company with the same name, but the company changed names. There are a lot of things causing data to be noisy, so you have to find a way to keep it clean as a regular practice.
And also, because we're getting a lot more data nowadays compared to before, the sheer volume makes processing much more difficult, so there are performance problems. Do I have a big enough time window to run my batch? Do I have enough CPU power to process what I want? The method of retrieving data is also a problem: if I'm retrieving data through an API endpoint, that means multiple API calls, and depending on how the interface was designed, it can take a lot of time to get what you want. And where the data is stored, its location, also affects and can sacrifice your performance.
Another one is data visibility. People have no visibility into where the data is. They don't know the metadata; they don't even know whether it exists, because of the data silos. They don't know, hey, this data exists, maybe from ten years ago when the system was designed. There's no good documentation and no metadata shared across places, and that's why people find it so hard; it's the blind leading the blind trying to find things.
Another one people face is troubleshooting. It's really hard to figure out what's going on when you're dealing with noisy data, performance problems, and no visibility. It all adds up to things being really hard to fix, because you just don't have good data quality. So that's what I see in how things work today.
One of the most asked questions for me recently, because of the AI craze, is: how do we prepare ourselves for this AI world? Well, if you look at AI, it's not magic, right? AI is something where you need your machine to learn, and for your machine to learn the logic it's supposed to apply, you need to actually create data sets. That's the machine learning part of the work, and I think that's mostly placed with the data architecture. So how do we get to the point where we're providing data sets to start generating models, to do model training and model prediction, which then populates the model registry? Then we use the same data, similar data, or different data as the reference data. When actual events come into the system, we apply the model we just generated from the data sets we provided, which is the inference part of AI, where you combine the reference data and the data model to make real-time, artificially intelligent decisions for users. I'll show a little sketch of this flow in a moment. So if you look at it, what we need to do from the data architecture side is mostly to get the data sets right.
That's why I've been thinking about the different methodologies people have been adopting. I really like the way people talk about data mesh and things like that; I think it's a really good way of doing it. The problem for me is that there are a lot of methodologies, concepts, and concerns you need to take in to get it adopted. So I'd say it's good to aim there, to become that mesh, but how do we get to that mesh? I want to focus on the more practical side, the infrastructure side: how do we actually do that for your data? And I think this is a way to get you there.
The first step is to build the infrastructure highway for your data. And as every good book on redesigning your data architecture will tell you, make sure you work with a smaller set of domains. It will never go well, and it will take forever, if you try to get everybody in your company coordinated on that project. So it's always good to have an isolated domain you can start working with, then slowly spread the work across and expand to the entire company, but always start with a single domain.
Then you start your data modeling work. I won't dive into data modeling today because, to be honest, I have yet to find the best way to model my data. I've tried Kimball, I've tried Inmon, but with those two methods I can never satisfy everybody; there are always people complaining, and I always end up changing models. So this is a call to you: if you know a better method of data modeling, I'd love to chat and compare notes, and see what the best way is, so I can do another talk next time about data modeling. But for me, good data modeling will be a good foundation.
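To make that data-set-to-inference flow a bit more concrete, here is a minimal sketch, assuming scikit-learn and joblib as stand-ins for whatever training and model-registry tooling you actually use; the file names, feature columns, and the fraud-scoring example are made up for illustration.

```python
# Minimal sketch of the data set -> training -> registry -> inference flow.
# File names, columns, and the fraud example are invented for illustration.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 1. The data architecture's job: deliver a clean, labeled data set.
df = pd.read_csv("features.csv")                      # hypothetical training set
X, y = df[["amount", "account_age_days"]], df["is_fraud"]

# 2. Model training, then "register" the artifact somewhere shared.
model = LogisticRegression().fit(X, y)
joblib.dump(model, "model_registry/fraud-v1.joblib")  # hypothetical registry path

# 3. Inference: events arriving on the streaming highway are scored
#    against the registered model plus the reference data.
model = joblib.load("model_registry/fraud-v1.joblib")
event = {"amount": 420.0, "account_age_days": 3}
score = model.predict_proba(pd.DataFrame([event]))[0][1]
print(f"fraud probability: {score:.2f}")
```

The point is simply that both halves, training and inference, only work if the data architecture can keep delivering clean data sets and reference data.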
But once you get that first initial piece, and I'm not saying you shouldn't do data modeling, you should still try, you're not going to make everybody happy, but you should still do it, provide a data access endpoint for people to start retrieving your data. They need to know what they're getting, right? So: data model, access points, and also federated governance. Governance should be part of your plan, but I'm not going to talk about it here because it involves too much communication and politics. We'll focus on the self-service data and streaming infrastructure; that's where I think we should put our effort.
So this is my way of thinking. In order to get the data flowing, we basically need to create a streaming highway. Instead of asking for data, everything should ideally be push-based. If something happens, it gets pushed through your data network and processed into its ideal state. For instance, if you have three different systems that require three different formats, you push the data in, quickly convert it into each system's ideal format, and push it on to the system that will ingest it; I'll show a small sketch of that in a moment. There will always be requests coming in to pull data too, but that's another story; that's the access endpoint you're providing, here on the right of the diagram, which people can use to access your data. But mostly, I think data inside your system should be like blood in the human body: it just traverses your entire system, flowing through, and as things come in, things get triggered and processed. Data flows in and out through these streaming pipelines, or streaming processes, set up between different connection points, services, and ingestion points; they receive it, process it, and put it somewhere or trigger the next thing, which makes the data pipelines a lot better. I'll show you a little later why I say that.
Of course, I'm not going to deny that there will always be a need for batch pipelines. I'm not telling you not to build them even when you have streaming data, because there will always be types of data, like video content, that I don't think you can handle very well through streaming, so I would still do those as batch pipelines. Some of the longer BI business reports over historical data still call for micro-batching. Different pipelines deserve different places, but most of the data inside your company should be pre-prepared. Even for the BI cases, ideally we want the data to be in its most ready state, so when you're trying to query, pull, or aggregate it, it's already in a much better form to get it out.
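Here is the push-based fan-out sketch mentioned above: one raw event is converted into the formats two hypothetical downstream systems expect and pushed on, rather than waiting for them to ask. It assumes a Kafka-compatible broker (Redpanda works here) on localhost:9092 and the confluent-kafka Python client; the topic names and fields are invented for illustration.

```python
# Hedged sketch: push-based fan-out from one ingest topic into the
# formats two hypothetical downstream systems expect.
import csv
import io
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed Kafka-compatible broker
    "group.id": "claims-fanout",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["claims.raw"])            # hypothetical topic name
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    claim = json.loads(msg.value())

    # System A wants trimmed-down JSON.
    producer.produce("claims.for-billing",
                     json.dumps({"id": claim["id"], "amount": claim["amount"]}))

    # System B wants a CSV line.
    buf = io.StringIO()
    csv.writer(buf).writerow([claim["id"], claim["customer"], claim["amount"]])
    producer.produce("claims.for-reporting", buf.getvalue().strip())

    producer.flush()
```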
So what we end up with is: we'll still have a data warehouse, in other words somewhere to store the more structured data, and a data lake, which is where everything gets temporarily stored, plus a real-time engine for pulling data, and transactional databases for quick transactional applications that don't really like to integrate with other things. But any transaction that happens in the transactional database should then be signaled onto your data highway, letting people know, hey, your data has changed, here's the signal, and whoever wants it can pick it up and put it where they need it. That's my view of how you should set up the data infrastructure: basically having an interconnection point for everything.
A quick look at what's inside that data highway, or data mesh, or whatever you want to call it: you can have a streaming platform with multiple clusters forming a mesh, the network for the data to go in and out of. Each logical representation is called a topic, and topics can be named after different domains. So if you ever want transaction data, you can get it from particular topics; if you ever want to hear about the latest transactions, interest rates, whatever it is, you can receive it from there. I'll show a tiny example of provisioning domain-named topics in a moment.
And a quick shameless plug on the streaming platform: Redpanda is a great streaming platform for doing this. We're a direct replacement for Kafka, so you can use us to stream data in and out of your systems to build that highway. Our biggest difference from Kafka is that we actually make better use of your hardware: instead of relying on the JVM, the page cache, and then disks, we avoid all that management altogether. We also use a thread-per-core architecture so we can write streaming data in and out of the broker as fast as possible. And we have add-ons like automatic partition balancing, so we can rebalance your brokers automatically, because if your brokers are unbalanced your streaming highway can get jammed up and hard to troubleshoot. We also have the rpk tooling, which is my personal favorite because it makes it easy to build applications. So do think about us when you want to build your highway.
But what are those little dots on top of the highway? I got this idea from a lot of the integration work I've done: those join points are little pipelines, data pipelines you create to process the streaming data coming in and out. They do a lot of things, and I hear people talk about lots of patterns; there are a million talks about different patterns, but we've been doing this for a very long time in the Camel community. It's called enterprise integration patterns, and if you haven't heard of it, I strongly recommend reading that book, because we've been doing those kinds of things for a long time.
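And here is the tiny example of domain-named topics I mentioned: a hedged sketch that provisions a few topics through the Kafka admin API (which also works against Redpanda). The names, partition counts, and broker address are illustrative assumptions, and in practice you might do the same thing with the rpk CLI or infrastructure-as-code instead.

```python
# Hedged sketch: provisioning domain-named topics for the data highway.
# Topic names, partition counts, and the broker address are illustrative.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topics = [
    NewTopic("transactions.posted", num_partitions=12, replication_factor=3),
    NewTopic("rates.interest.updated", num_partitions=3, replication_factor=3),
    NewTopic("claims.raw", num_partitions=6, replication_factor=3),
]

# create_topics is asynchronous; each future resolves when the broker confirms.
for topic, future in admin.create_topics(topics).items():
    try:
        future.result()
        print(f"created {topic}")
    except Exception as exc:
        print(f"{topic}: {exc}")
```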
The ideas don't change, though the way you implement them can be a little different, but basically what you're doing at those join points is either transformation, enrichment, or some kind of rerouting of your data from one point to another. That's what the dots are.
So when I talk about the chaotic pipeline situation, I think what we can do is organize the pipelines a little better. Along with categorizing them as batch, micro-batch, and real-time, we can also split them up by what each of them actually does. Start with the most used kind: stateless streaming pipelines. These are the pipelines that handle most of the traffic when data first comes in, doing reformatting, normalizing, transformations, filtering, and validation; validation is one of the things you definitely want in order to keep your data clean. These pipelines don't need to remember anything that happened historically. Whatever comes in, they look at it, process it quickly, and let it go, store it, or send it somewhere else. They shouldn't take long to create, and they're easy to scale out precisely because they don't remember anything, so they horizontally scale very easily.
Then you have stateful pipelines, and this gets a little trickier. These are what we used to call complex event processing, or time-window-based processing, where the pipeline needs to remember some state, and that causes complexity: if something goes wrong, I still need to remember the state, where it last left off, and pick it up again. You see these a little less often, but you still see them and you still have to take care of them, so they're the second tier of pipelines you'll encounter. I'll show a small sketch of the stateless-versus-stateful difference in a moment.
On top of that, you start to see large-volume data processing, like micro-batching, where you want to do analytics over a big set of data. For these there are smart ways of doing it, like storing the data in something like a time-series database, to make it easier to process depending on what it is. And then you have the batch pipelines, the most traditional kind, where sometimes things are still in files and you need to get them into data sets and so on; you still need those. So that's the hierarchy of pipelines you see across the whole system.
And a note about the drawing and why it's squiggly: I'm just trying to show where each kind of pipeline starts. When data is ingested, it goes onto your highway, and that's when your streaming pipelines kick in.
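Here is the small stateless-versus-stateful sketch I promised. It deliberately leaves the broker out and just shows the shape of the two kinds of logic; the field names and the 60-second tumbling window are made-up assumptions.

```python
# Hedged sketch: a stateless step next to a stateful step.
# Field names and the 60-second tumbling window are illustrative.
from collections import defaultdict

def validate_and_normalize(event: dict) -> dict | None:
    """Stateless: each record stands alone, so this scales out trivially."""
    if "amount" not in event or event["amount"] < 0:
        return None                          # drop invalid records
    return {**event, "currency": event.get("currency", "USD").upper()}

class TumblingWindowCount:
    """Stateful: must remember per-window counts, so failures and
    rebalances need care (checkpoints, state stores, and so on)."""
    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)       # window start -> count

    def add(self, event: dict) -> None:
        window = event["ts"] - event["ts"] % self.window_seconds
        self.counts[window] += 1

# Run a few events through both steps.
events = [{"ts": 0, "amount": 10}, {"ts": 61, "amount": -5}, {"ts": 65, "amount": 7}]
windows = TumblingWindowCount()
for e in events:
    clean = validate_and_normalize(e)
    if clean:
        windows.add(clean)
print(dict(windows.counts))                  # {0: 1, 60: 1}
```

The stateless function can run on as many workers as you like; the windowed counter is the part that forces you to think about what happens when a worker dies mid-window.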
First, the stateless pipelines come in: they start dispatching, transforming, doing the quick stuff, and because there are a lot of them and they're easy to scale, they quickly process the data into the right place. Then the stateful pipelines kick in: once their requirements are satisfied, they start processing, depending on their time window or on whatever condition triggers them, and they put the results back onto the highway. Once everything is on the highway, there is some micro-batch processing you probably want to do, which takes the continuous feed of new data plus the historical data you already have in your data stores, wherever they are. And then there are the batch pipelines that run, say, overnight, the things that just need to be done over large file sets.
I'm going through this because it also shapes where Redpanda is going in the future, and how I see all these data pipelines evolving so that data engineers can get their work done more easily. What we're seeing is that there will be a lot of stateless streaming pipelines running, and the way we currently run them is: receive the data from the streaming platform, consume it, load it into a separate compute system, into CPUs and memory, process it, and then put it back into the streaming platform. There's a lot of data ping-pong going back and forth, so you pay for increased egress traffic, memory loading, and CPU processing, and then send it all back. There's a lot of wasted energy.
So what we think will happen in the data engineering world is that people will start asking: is there a more efficient way of running these stateless pipelines? If it can all happen at the streaming broker level, at the platform level, can we just inject a Wasm module? We chose Wasm because we want it to be language agnostic, so you can use whatever language you want, but the point is that everything can be processed, transformed, filtered, and validated inside the platform. That's what we're trying to do in the Redpanda space. So in the future you'll be able to apply your stateless transform logic, your pipeline logic, directly in the broker. The broker will do that work internally when the data hits its CPUs and memory; it transforms the data and then does a quick move, a replication of the original plus the transformed version, within the same broker, where it can be accessed quickly. There's no extra hop or extra flow needed for these kinds of transformations. I'll show a small sketch of the kind of per-record logic I mean in a moment.
Another thing we see, at least judging by all the new investment going into the industry lately, is that everybody is converging on a standardized way of accessing data, and that is mostly SQL-based. You can see a lot of things happening around that; Flink was all over the place, right?
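Here is the per-record sketch I mentioned. It only illustrates the shape of the logic you would push down to the broker instead of ping-ponging data through an external service; Redpanda's actual data transforms are written against its WebAssembly transform SDKs rather than in Python, so treat this as a language-neutral stand-in with made-up field names.

```python
# Language-neutral sketch of a per-record, stateless transform: the kind
# of logic that could run broker-side instead of round-tripping through an
# external consumer/producer. Field names are invented; Redpanda's real
# data transforms are built with its Wasm SDKs, not this Python function.
import json

def on_record(value: bytes) -> bytes | None:
    """Validate, normalize, and redact one record; return None to drop it."""
    event = json.loads(value)
    if "account_id" not in event:
        return None                          # invalid -> filtered out
    event["amount"] = round(float(event.get("amount", 0)), 2)
    event.pop("ssn", None)                   # strip a sensitive field
    return json.dumps(event).encode()

# The broker (or an external worker) would call this for every record
# written to the input topic and write the result to an output topic.
print(on_record(b'{"account_id": "a-1", "amount": "12.339", "ssn": "x"}'))
```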
If we started a Flink course today, we'd get a lot of people, because people want to learn Flink and use SQL to retrieve data. I don't think SQL is going away; it's going to become even more standardized, and there's a lot of technology out there that lets you use that one tool to access data, where the data can come from different places and only needs to be enabled and exposed through that interface.
Another thing we see is more efficient use of cloud storage. Now that we've built this infrastructure highway, what most people will do with it is offload the used data into a data store, say S3 object storage, on the cloud. But the problem is that every time you offload or process this data, you have to do a lot of transformation, transfer, and normalization to make it accessible. What if there were a way to reuse the streamed data, so that as data comes in and out of your streaming highway, everything gets stored, and you can reuse that cheap object store, accessed through your SQL interface, with no need to export it and put it somewhere else? If we can reuse that as the center of your data store, wouldn't that be better?
So another thing we're doing at Redpanda is working with Apache Iceberg to provide that experience. Once you have the data highway created, everything that hits your disk and needs to move to an outside store, we currently handle with Tiered Storage, which moves the data over to S3 buckets or any other object store, and then you can use Iceberg on top of it for the metadata and for access to your historical data. That's where we're going, and I think it solves a lot of problems. I'll put a tiny sketch of what querying that offloaded data could look like at the end.
First of all, I forgot to mention: when you're creating the highway, make sure you also set up your schemas. A schema registry is one of those things we should all have, hopefully you don't forget it, so people know what the data looks like, which addresses the data transparency problem. Then, because all the data is flowing in and out of your system and is easier to access through the highway, we're breaking down the silos, hopefully. Because we're injecting data validation into those quick, stateless data transformation pipelines, either internally in the broker or externally, we're hoping the data comes out a lot cleaner. And because of the way streaming traffic scales horizontally, it's easier to scale out and absorb a lot more traffic, which addresses the performance problem.
So that's why I've been preaching this idea of building a data highway for your data pipelines, and differentiating the different pipeline natures, to make data easier to access, easier to scale, and easier to manage. That's my talk for today. Thank you very much.
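As promised, here is a tiny sketch of what querying that offloaded data could look like. It uses DuckDB over Parquet files in an S3 bucket as a rough stand-in for the Iceberg-backed setup described above; the bucket, path layout, and columns are assumptions, and a real Apache Iceberg deployment would typically be queried through an Iceberg-aware engine or catalog rather than raw Parquet globs.

```python
# Hedged sketch: SQL over data the streaming highway has offloaded to
# object storage. Bucket, path layout, and columns are assumptions.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # lets DuckDB read from S3
con.execute("SET s3_region='us-east-1';")

result = con.execute("""
    SELECT date_trunc('day', ts) AS day,
           count(*)              AS transactions,
           sum(amount)           AS total_amount
    FROM read_parquet('s3://my-bucket/topics/transactions.posted/*.parquet')
    GROUP BY 1
    ORDER BY 1
""").fetchall()

for day, n, total in result:
    print(day, n, total)
```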
And here, if you want to learn more about Redpanda and what we're thinking, there's a lot of documentation and learning resources, and I'm always in the Slack community if you want to hear more from me or give me feedback, especially on the data modeling part. Let me know, ping me on Slack, I'm always there.
I think we can go ahead and start the Q&A. Right, I don't think there's any Q&A yet, so is there anything else we need to do? I guess we can probably end the session early. If that's the case, thank you very much, and I'd love to hear more of your feedback. Oh, okay, there's a question coming in, from Derek: "Thanks, Christina, that was very informative. I'm still in the beginning of our project, so leaning towards..." Yeah, this is a bit long. Derek, maybe we can talk about this offline; it's a bit long and I still need to read it, and the interface is not that easy to read. Can I meet you over in the Slack community, if that's okay with you? Or you can email me at christina@redpanda.com and we can take it from there. So thank you very much, I think that's it. Thank you all.
Hi, Christina, it looks like we got one more question, I don't know if you want to take a look at that. Oh, okay, sorry, this interface is not... Yeah, it's a little tough to see sometimes. It's from Thomas; I can read it out, if you like. It says: can you share your thoughts about data affinity? Moving data to processing is expensive compared to moving processing to data. I think that's a good question, actually. That's why I preach for, sorry, this thing is moving, that's why we preach for the Wasm approach of doing things. Because, like you said, moving data to processing is expensive. When you have unstructured data, instead of moving it over to processing, we should process the data right away, inside the broker or as early as we can, and then have it ready and broadcast it out. That's what we're trying to do at Redpanda as well. I don't know if that fully answers your question, but that's my way of thinking about it; maybe I need to give this more thought, and we can dig into it a little later. But yes, I think you're right: you should always pre-process your data before it gets to its destination, as much as you can. Hope that answers your question, Thomas.
All right, awesome, that was the last one. So thank you, Christina, for your time today, and thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today. We hope you join us for future webinars. Have a wonderful day.