Right, thanks. Thanks, Candice, for the introduction. Hi, everyone, thanks for joining, and welcome to the session. When we talk about storage architectures in modern applications, it's evident that the cloud has become the most reliable and most cost-effective place to store data. So in today's webinar, we'll explore how streaming applications can leverage cloud storage to build effectively infinite data storage solutions, and we'll also explore a couple of business use cases we can build around this cloud-first storage model. Let's get started. Candice gave a little introduction to me, so let me add a few things as well. I'm Dunith Dhanushka, a senior developer advocate at Redpanda. I'm a solutions architect and developer advocate with a background in stream processing, real-time analytics, and big data, and with my experience designing and building real-time distributed applications, I help the Redpanda community and Redpanda developers learn to build these architectures at scale. All right, so that's about me. Let's look at today's agenda. First, I'll give you a bit of a refresher on storage fundamentals, and we'll talk about how storage architectures for streaming data platforms have evolved over time. Then we'll talk about tiered storage and the problems it can solve. After that, I'll introduce you to the cloud-first storage model, and we'll talk about the different business use cases we can build on top of it. Finally, we'll talk about the potential business benefits you can gain from having a cloud-first storage model in your organization, and then we'll open it up for a couple of questions as well. All right, let's get started.
Right. Before we take a deep dive into the cloud-first architecture and its internals, it's important to understand the basics of a streaming data platform's storage model. When I say streaming data application, I'm specifically focusing on streaming data platforms, because in any event-driven, real-time streaming application, the streaming data platform acts as the centerpiece of the architecture. Going forward, I'll take Redpanda as the reference implementation here. For those who are not familiar with it, Redpanda is a streaming data platform that is API-compatible with Apache Kafka. That means if you already have a Kafka producer or a Kafka consumer, you can seamlessly integrate that application with Redpanda, because Redpanda offers compatible read and write interfaces. Redpanda also brings additional advantages over Kafka in terms of high performance, simplicity of operations, and durability in handling data. As we progress through the presentation, I'll use a couple of examples related to Redpanda and its cloud-first storage model so we have some context for each slide. Right. Now, in the early days of streaming data architecture, there was no streaming platform at all; the space was governed and dominated by enterprise messaging solutions. In the beginning, whenever a business needed a messaging solution, the messaging vendors came along with their offerings. Back in the day, the compute, storage, hardware, and software all came as a single monolithic solution: the vendors came on site, installed the whole thing, configured it, and got it up and running.
As a business, when you wanted to scale up or scale down, the vendor had to come back and reconfigure everything. But this model was disrupted by RabbitMQ in 2007, which first introduced an open-source model for messaging. That allowed many organizations to deploy the open-source RabbitMQ solution on their own hardware and scale up and down as needed. That was the first disruption. Then we had Kafka at LinkedIn. Kafka was the first streaming data platform of its kind. By that time, storage was becoming cheaper, and it was the beginning of the big data era, so Kafka was designed to leverage cheap disks to build a fast, fault-tolerant streaming ingestion system. In the meantime, Apache Pulsar also came into the picture with a key differentiator in mind: tiered storage. Pulsar decoupled storage from compute, which allows customers to scale their storage independently of their compute options. And when we travel back to today, hardware in the modern world has made a lot of improvements. For example, we have virtual machines in very tall configurations, like 96-core instances, and especially when we talk about data storage, SSDs are much faster and much cheaper compared to the early days. Now, when we compare streaming data platforms like Kafka, Pulsar, and Redpanda against the database world, specifically relational and NoSQL databases, there's a fundamental difference: streaming data platforms have been designed with immutable, append-only storage in mind. What does that mean? It means that once you write a record into the streaming data platform, you can't go back and modify it in place.
The database world, by contrast, operates on a page-oriented storage model, where you can specify exactly where a specific record should go; once you write it, you can always look it up by a key or some other identifier, and you can update it in place. That's the fundamental difference. With the append-only storage model, we usually talk about a log-structured storage engine for streaming data platforms. Kafka, for example, is powered by sequential writes, and sequential writes are always fast. The primitive unit of this storage engine is a distributed log file, which we call a partition, and it is replicated to different machines so that we can recover from failures. That was the local-only storage model. But there was a fundamental limitation in this local-only, log-structured storage engine: everything is stored on the local disk. Initially, that was the design intent, but as time passes and more data comes in, these logs grow in size, and that leads us to two fundamental problems. Let's look at them. The first problem is that your streaming data platform's total write capacity is bounded by the aggregate capacity of your local disks; it's difficult to scale beyond the total local disk size. Let me put that into perspective with an example. Say we have a Kafka cluster with a total local capacity of 100 gigabytes, and we're ingesting a stream at a throughput of 1 gigabyte per second with a replication factor of 3. At that ingestion rate, the cluster will run out of local disk space in well under a minute. There are certain ways to mitigate this. For example, you can enforce retention controls on local data, instructing the streaming data platform to purge data after a specific point in time.
For example, you can instruct Kafka to keep only the last hour's data and delete or compact the older data via retention and topic compaction settings. Those are a few of the levers we have in this local-only model. The other mitigation is to keep provisioning new hardware into the existing cluster so that we get more disk space to accommodate more data. And that leads to the second problem: even after you provision more hardware and more disks, the nodes keep getting bigger and bigger as new data comes in, and that results in slow migrations and slow recovery of those nodes. Say, for example, you want to migrate an existing cluster to a different cloud region: the migration will take a long time because there's a lot of data to move, which incurs cross-region replication costs, and crash recovery will also take longer, because a lot of data must be moved to reinstate the crashed node's state. Because of these limitations of local-only storage, people came up with another option: tiered storage. Conceptually, tiered storage is about having different storage tiers in the application to support different data access needs. In a streaming data platform, for example, you can place frequently accessed, most recent data on expensive but fast local disks like SSDs, and offload the older log segments to a cheaper but reliable storage medium like an S3 bucket. That way you decouple your cluster's storage from its compute. This is not new in the streaming space; in fact, Apache Pulsar has had this concept for a while, and Kafka quickly caught up. So that was tiered storage, and Redpanda supports tiered storage today.
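To make the capacity example from a moment ago concrete, here's a back-of-the-envelope sketch. The 100 GB capacity, 1 GB/s throughput, and replication factor of 3 come straight from the example; the function name is just for illustration:

```python
# Back-of-the-envelope: how long until a local-only cluster fills its disks?
# With replication, every produced byte is written replication_factor times
# across the cluster, so the effective write rate is multiplied accordingly.

def seconds_until_full(capacity_gb: float, ingest_gbps: float, replication_factor: int) -> float:
    effective_write_rate_gbps = ingest_gbps * replication_factor
    return capacity_gb / effective_write_rate_gbps

t = seconds_until_full(capacity_gb=100, ingest_gbps=1, replication_factor=3)
print(f"Cluster fills in about {t:.0f} seconds")
```

At 1 GB/s with replication factor 3, the cluster is really writing 3 GB/s, so 100 GB of aggregate disk lasts only about half a minute, which is why retention controls or more hardware become unavoidable in the local-only model.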
I'll talk about tiered storage in more detail on a later slide, but for now, you can think of tiered storage as being enabled by two APIs: remote write and remote read. When you configure tiered storage, Redpanda can asynchronously move data into a cloud storage bucket, something like Amazon S3 or Google Cloud Storage, and the same data can then be served through the read interface. I'll dive deeper into the Redpanda specifics shortly. But even with tiered storage enabled, there are some limitations; it's not a silver bullet for our storage problem. Certain parts of the data lifecycle still have to be administered manually by an administrator or operator. For example, once you move the data into a storage bucket, most of the time it is no longer under the streaming data platform's jurisdiction. That means you cannot instruct the streaming data platform to purge the data in the storage bucket after a certain period of time. So there were some shortcomings in this model. Considering the limitations of both local-only storage and tiered storage, what if we made the cloud the default storage engine for streaming data platforms? It's a very interesting idea. For a streaming data platform, a true cloud-first storage engine means that whenever the platform accepts a new message from a producer, that message is first stored in a remote cloud storage bucket, something like S3 or Google Cloud Storage. And when a consumer sends a fetch request, the same message is served from the storage bucket. There are certain ways to optimize this operation, and we'll talk about them in detail as we go through the presentation. This kind of cloud-first storage model opens up different opportunities for streaming data platforms.
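As a hypothetical sketch of what the remote write and remote read pair looks like in practice, tiered storage in Redpanda can be switched on per topic via topic properties (the property names follow Redpanda's documented topic configuration; the topic name here is made up):

```shell
# Hypothetical config fragment: create a topic with tiered storage enabled.
# redpanda.remote.write uploads closed log segments to the bucket;
# redpanda.remote.read lets consumers fetch those segments back.
rpk topic create clickstream \
  -c redpanda.remote.write=true \
  -c redpanda.remote.read=true
```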
The first is that it enables effectively infinite data storage in the cloud and makes your data portable across multiple locations. As we saw earlier, with local storage your cluster's write capacity is bounded by the total size of your local disks, but when you configure your cluster to work with cloud storage buckets, there's virtually no limit; you can scale as much as you want. You can also back up your cluster's data into a storage bucket and quickly spin up a different cluster in another location, or even move it to a different region. The second benefit is the unparalleled reliability offered by the cloud vendors. Object stores like AWS S3 and Google Cloud Storage typically offer eleven nines of durability, which is far better than running your own data center, and you don't need to worry about maintaining it over time. Another advantage is, obviously, cost: in storage buckets, the cost to store one terabyte of data is relatively small compared to local disks, which are quite expensive in most cases. As far as Redpanda is concerned, Redpanda has supported the cloud as its default storage tier since its 22.3 release. Currently, we support AWS S3 and Google Cloud Storage as storage destinations, and support for Microsoft Azure Blob Storage is on the way. There are a couple of interesting features we developed to power this cloud-first storage model. I'll talk about these features at a very high level, but they are not tied to Redpanda; you can take these design principles as guidelines for your next streaming data project and apply them in a different context as well. Underneath it all, we have the cloud-first storage engine. That's the foundation.
On top of that, we have the primitive feature: shadow indexing. Shadow indexing is the mechanism that moves the log segments, the log files, into the remote cloud storage bucket. The beauty here is that it happens asynchronously and transparently to the user. If you're an operator or administrator, you don't have to worry about it, because Redpanda takes care of it. And once the data has moved to the cloud, unified retention controls allow you to enforce retention policies on cloud data just as you do on local data. For example, you can instruct Redpanda to purge the data stored in the S3 bucket after one week or one month; it's totally up to you. The good thing is that this is entirely under the streaming data platform's governance: there's no need to manually log in to the AWS console and purge it yourself. That, too, is taken care of by the underlying streaming data platform. And what does this mean for you as a developer writing a producer or consumer application? The beautiful thing is that none of this is directly exposed to you. All you see are the producer and consumer APIs, which are Kafka-compatible. As a consumer, you read data stored in the cloud by sending a fetch request, and by controlling the offset, you can switch between different locations: based on the offset and the configuration, Redpanda decides whether to fetch from the cloud or from local storage. To make these reads fast, we built a cache on the read side. So that's the foundation that paves the way for the cloud-first storage model, and on top of it we've built several other capabilities. The first is tiered storage, which I discussed earlier: basically, tiered storage allows you to break your storage into multiple tiers.
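As a hedged sketch of what those unified retention controls look like, with tiered storage the ordinary Kafka-style `retention.ms` setting governs the overall retention, including the data sitting in the bucket, so the platform purges cloud data itself (the topic name is made up; one week expressed in milliseconds):

```shell
# Hypothetical config fragment: let the platform purge cloud-resident data
# after one week (7 * 24 * 60 * 60 * 1000 ms), with no manual bucket cleanup.
rpk topic alter-config clickstream --set retention.ms=604800000
```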
That means you can combine local storage with cloud storage, serving the most recent data from local disks while keeping historical data in the cloud. Secondly, we have read replicas, another interesting feature powered by the fast rehydration capability built into shadow indexing. You can think of a read replica as mirroring a Kafka topic into a different cluster. For example, you have a Redpanda cluster in one region, and you create a read replica in a different location, possibly a different geographical region. You then select a topic from the original cluster to mirror at the destination. All you need to do is point that topic at the relevant cloud storage bucket, and it can rehydrate quickly and catch up with the original topic's content, including its offsets and configuration. All right, so those are the basics of the cloud-first storage model; I hope you've got a good understanding of it. Now we get to the meat of the session: the use cases we can build around this cloud storage model. The first is instant disaster recovery. When you run a streaming data architecture, disaster recovery is strategic, because it allows you to reinstate your application's business state after a failure. With Kafka or any other streaming data platform, we usually enable this by deploying clusters in multiple geographical locations so that we can isolate fault domains, and then enabling cross-region data replication through tools like MirrorMaker or cluster linking. There are many techniques, but most of them require you to keep an active standby DR cluster in some geographical region, and you also have to absorb the cost of cross-region replication.
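To sketch the read replica setup described above: on the destination cluster, a read-replica topic is created by pointing it at the origin cluster's bucket (the property name follows Redpanda's documented read replica configuration; topic and bucket names are made up):

```shell
# Hypothetical config fragment: on the second cluster, mirror a topic by
# pointing it at the bucket the origin cluster uploads its segments to.
rpk topic create clickstream \
  -c redpanda.remote.readreplica=origin-cluster-bucket
```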
But with cloud-first storage, your data is always portable in the cloud: there's a central storage bucket that holds your entire cluster's dataset. In Redpanda's case, that allows you to quickly spin up a new Redpanda cluster and let it hydrate from that storage bucket. That usually takes only seconds, thanks to fast rehydration via shadow indexing and the other features we've built in, and it can also reinstate the last known offsets and configuration, so all existing consumers can fail over smoothly to the newly built DR cluster. That's one good use case for the cloud-first storage model. The second use case is fast scaling of clusters, up and down. Especially in retail and e-commerce workloads, there are certain periods where we can expect seasonal traffic, like the Cyber Monday and Black Friday seasons. How do we usually tackle that kind of situation? We provision new clusters to handle the traffic, and when there's no way to predict it, we over-provision the hardware. Most of the time that hardware sits idle and underutilized, reaching full utilization only at certain peaks. With the cloud-first storage model, you can follow the same approach as with DR clusters: when you see a need for more compute and storage, you can quickly spin up a cluster and have it rehydrate from the storage bucket.
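As a hypothetical sketch of the DR and scale-out flow just described: on a freshly provisioned cluster, a topic can be recreated and told to rehydrate its data from the bucket (the property name follows Redpanda's documented topic recovery configuration; the topic name is made up):

```shell
# Hypothetical config fragment: on a new DR or scale-out cluster, recreate
# the topic and let it rehydrate its segments, offsets, and configuration
# from the cloud storage bucket.
rpk topic create clickstream -c redpanda.remote.recovery=true
```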
The other direction is also possible. Say you no longer need the additional cluster because the season is over: you can quickly decommission it by simply shutting it down, and that won't take much time either, because the nodes are small and the major portion of the data resides in the cloud. So this enables ad hoc scaling, both up and down. Another use case is offloading non-priority workloads away from operational, user-facing workloads. Say, for example, you have a Redpanda cluster running fraud detection, which is critical to the business and has strict SLAs. At the same time, this same cluster is shared with the analytics team for non-priority analytics tasks, like training a machine learning model or producing compliance reports. Because the workloads share the same cluster, they will often affect the SLAs of the operational workload, the fraud detection. You can avoid this situation by creating a read replica in a different cluster. With read replicas, you can mirror whichever topics you want into a new, standalone cluster, and once it's provisioned, you can hand it to the analytics team so they can run their analytics workloads at leisure without impacting the performance of the operational cluster. That's a good pattern to follow: in this use case, you offload analytics from operational clusters. Another use case is building stretch clusters that span multiple geographical regions but act as a single unit, controlled from a single control plane. This use case is powered by Redpanda's rack awareness capability.
In that case, cross-region replication is inevitable, but Redpanda can intelligently decide when to replicate and when not to, and that way you save cost. Finally, another use case is providing a single interface for accessing both real-time and historical data. This is made possible by tiered storage and the fast rehydration capability. For example, you can combine local storage and cloud storage to serve two different audiences: you keep the most recent data on local disks to serve real-time operational use cases, like fraud detection, real-time analytics, and stream processors, while you offload historical, infrequently accessed data to the cloud and serve it to other teams, like machine learning teams and ad hoc analytics teams. The beauty is that both of these teams see a single interface: your consumer API. You use the same fetch API for all of this data, and you specify how much data you need and how far back in time you want to go by specifying offsets. Based on the offset value, Redpanda decides whether to fetch data from the cloud or from local storage. That way you don't have to maintain two different systems; you don't need a separate data lake at all, because at some point you can treat your cloud storage as the data lake. This is ideal for application backfilling use cases and for quickly hydrating new applications so they catch up with the latest application state. So those are the potential business use cases we can build on this cloud-first storage model. Of course, there are other scenarios as well, but these serve as a few starting points. Now let's look at the potential benefits you can gain from this cloud-first architecture.
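To sketch what that single interface looks like from the consumer side: the same consume command serves both audiences, and only the starting offset differs (a hypothetical fragment; the topic name is made up, and the `--offset` values follow the documented `rpk topic consume` flag):

```shell
# Hypothetical usage: one read interface for both audiences.
# Historical reads start at the oldest offset (segments likely in the bucket):
rpk topic consume clickstream --offset start

# Real-time reads start at the newest offset (data still on local disk):
rpk topic consume clickstream --offset end
```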
Obviously, it significantly reduces your total cost of ownership when running a streaming data platform in production. This happens in several ways, especially by not having always-on DR infrastructure: you can spin it up on demand and keep your RPO close to zero. You also don't have to spend on over-provisioned hardware. Whenever you see a demand spike in your traffic, you can quickly provision a cluster, and you can decommission it just as quickly, so you don't pay extra for over-provisioned resources. And with intelligent rack awareness and cross-region data transfer optimization, you save even more cloud spend. We've seen some of our customers lower their TCO by as much as 1.7 million dollars. So that's the first benefit you get as a business in production. Then there's the ability to quickly scale up and down to meet demand, with zero downtime and while still meeting your SLAs. That's ideal for handling spiky workloads without keeping always-on infrastructure in place. Another benefit is separation of concerns: you can segregate workloads onto different clusters. For example, you can dedicate clusters with more resources to priority-one workloads and user-facing applications, while creating a read replica to satisfy non-priority workloads like analytics, fulfillment, or streaming data pipelines. All right, so those are the things I wanted to cover. We started with storage fundamentals and talked about the evolution of the storage model from the beginning to modern days. Then we talked about the different storage architectures: the append-only model of streaming platforms versus the random-access model of databases. And we talked about the different ways of storing data, starting with local storage.
Then we discussed tiered storage and its shortcomings, and we proposed cloud-first storage as the modern way to future-proof your data architecture. As far as Redpanda is concerned, cloud-first storage is Redpanda's default storage model, and there are many use cases you can build on top of this cloud storage architecture. The key thing to keep in mind is that the cloud is your single source of truth when it comes to data. This model makes your data portable, and you can quickly spin up new clusters and let them catch up with past state within seconds. So those are the benefits this model brings to your business applications. Thanks for joining, and we can take a couple of questions now. We have a question asking whether the data stored in real time in the cloud will be encrypted or not. Yes, it will be encrypted. There are mechanisms for enforcing encryption keys, and you can configure mutual TLS and a few other options. So the answer is yes. There's another question: how many servers are needed to create a cluster? With Redpanda, you can quickly get started with a single-node cluster, but if you're deploying to production, we recommend at least three nodes. There's another question asking whether this is a DLT. I suppose this is about dead letter topics, but this is different. With a dead letter topic, once a message lands there, you have to reprocess it manually; that's mostly not the streaming data platform's responsibility and should be handled by the application developer. Here, by contrast, we use cloud storage as the default storage: once you send a message to the streaming data platform, it is stored in the cloud storage bucket first, if so configured.
And the lifecycle of that record is governed by the streaming data platform as well, so that's the major difference. There's another question: can it be used for edge compute? Yes, there's a possibility of deploying this in resource-constrained environments like the edge; that's one good use case for it as well. You can configure that edge cluster to write to some sort of cloud storage if there's a need, and later the data in the cloud can be consumed or aggregated by a central aggregation cluster. We can think of that as an additional use case. Another question: can we apply any filter on the data while reading from cloud storage? That's a good question. There's no specific filtering mechanism: as a consumer, all you see is the Kafka consumer API, so you control what you read by setting offsets, and if you need filtering, you build it into your consumer application. For example, if you're building a stream processing application, you can build filters there. How does the latency of cloud storage compare to faster local disks? Again, a valid question. Of course, local disks are always faster, so there is a latency concern when reading from cloud storage, but we have a built-in read-side cache to speed up that process. That cache works together with the shadow indexing mechanism I mentioned in the architecture diagram: whenever shadow indexing moves a segment to the cloud store, it keeps an entry in the index, so when you request that log segment from Redpanda again, we can quickly look up where it resides, load it, and cache it locally, so we don't have to fetch the same cloud storage objects over and over again. That's how we optimize the loading time. Another question: three nodes, meaning three servers?
Yes. If you deploy on EC2, that could be three instances, but Redpanda comes in different form factors, so it could also be three Docker containers, or run in Kubernetes. Another question: can you explain more about shadow indexing? I think I've covered shadow indexing, but I'll quickly recap. Shadow indexing is the mechanism that automatically moves log segments from local storage to cloud storage. That happens asynchronously and transparently to the user: as an operator or a user, you don't have to worry about it, because the shadow indexing mechanism takes care of it. Whenever this mechanism moves a log segment to the cloud, it records in an index which log segment has been moved to which location. That index is instrumental in reading the segment back, and on the read path we also have the read-side cache. There's a lot more to learn about shadow indexing on the Redpanda blog; we have a couple of in-depth articles explaining it, and I can share those after this webinar. That's all the questions we've got so far, so with that, I'd like to hand it back over to the Linux Foundation to take it from here. Thanks, everyone, for joining the session; I hope it was useful for you. Thank you. Thank you so much, Dunith, for your time today. And thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today. We hope you join us for future webinars. Have a wonderful day.