Good morning, everyone, and thank you so much for coming to our talk. This is our first KubeCon and we're really excited to be here. My name is Pratima Banerjee, and I'm here with my colleague, Kyoungho An. We're both research engineers at a small business called Real-Time Innovations, based in Sunnyvale, California. We're here today to speak about bringing real-time performance to the edge cloud; more specifically, to introduce the Data Distribution Service (DDS) and Real-Time Publish-Subscribe (RTPS) standards to the KubeCon audience. I'm going to start with an overview of DDS and RTPS and the benefits they bring to the edge cloud environment, and then Kyoungho will conclude by showing some DDS performance benchmarks that really highlight the power these open standards can bring.

Before we get started, a little bit of background about us. Real-Time Innovations, or RTI, was founded back in 1991 by a group of researchers from the robotics lab at Stanford University. They were interested in building software that could provide the kind of fast, reliable communications needed by robots that operate in ad hoc teams: communicating things like sensor data, control messages, events, and status. Their work became the foundation for what later became the DDS and RTPS standards.

Today, RTI provides communications software for domains such as autonomous systems, including commercial automotive; healthcare, including telemedicine and patient monitoring; industrial automation, including smart manufacturing and industrial robotics; energy, with smart grids and renewables; and defense and aerospace. A few of these are highlighted on the slide. In addition to these product solutions, we've also been involved in a lot of the standards associated with these domains: AUTOSAR and AVCC in the automotive space, and standards like ICE in the medical space and OpenFMB in the energy space. And in robotics, for those of you who might be familiar with ROS 2, the Robot Operating System, DDS is actually the underlying communications protocol that ROS 2 uses.

The interesting thing is that all of these different domains have a certain set of challenges in common. They all have requirements for communications that span edge systems, where you need very strict timing to enable device-to-device communication, often on embedded hardware; edge cloud systems, which have slightly less strict timing requirements but are still very sensitive to latency; and cloud environments that can support somewhat more latency-tolerant use cases.

Our approach to addressing these challenges is rooted in the Data Distribution Service (DDS) and Real-Time Publish-Subscribe (RTPS) family of standards. These are open standards for secure, interoperable, real-time, data-centric communications. They enable platform-independent, transport-agnostic, loosely coupled peer-to-peer interactions, and they provide the controls and patterns needed to achieve ultra-low latency at the edge and in the edge cloud.
I'm going to get into each of these points in a little more detail in the upcoming slides, but from an overview perspective, these are the key benefits that DDS and RTPS bring to the edge and the edge cloud.

A quick note here: DDS and RTPS aren't really standalone standards. You can think of them as a family of standards, governed by the Object Management Group (OMG), that together provide a comprehensive framework for real-time data communication and interoperability. On the slide you can see the core DDS and RTPS standards for the API and the wire protocol. Around that core, there are language bindings for specific programming languages; security standards that get layered on top of the DDS communication protocols; type-definition format standards in IDL, XML, and JSON; and finally, specifications for lighter-weight versions of DDS for resource-constrained environments, which is the DDS-XRCE standard that recently came out. One of the things we wanted to explore with this talk is where there might be synergies between these standards developed at the OMG and the open standards developed as part of the KubeCon and cloud-native community.

So what is DDS all about? The core of the DDS standard is the data-centric publish-subscribe model, which you can think of as a shared global data space. Applications interact with one another by directly accessing that shared state, and the communication protocol revolves around interacting with the data itself rather than focusing on endpoints or communication channels. In a way, applications behave as though all the data they need to operate is available to them locally, even though the data producers and consumers are really part of a very large distributed network. Another way of thinking about it is that the interface to the state distribution system is the data structures themselves. One thing to note: data persistence and data storage are orthogonal to this whole state distribution mechanism. There are services you can attach to the DDS data bus to store data off to persistent storage, as represented on the slide.

Another key point about DDS is that it uses only peer-to-peer communications. This allows direct communication between data producers and consumers without relying on a centralized server or broker, which reduces bottlenecks and enhances system reliability: your system cannot be affected by the loss of one or more brokers in charge of managing the data. Finally, DDS is loosely coupled, which means components can interact independently without prior knowledge of each other. That kind of architecture promotes flexibility and adaptability, making DDS especially useful for complex distributed systems.
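To make the idea that "the interface is the data structures themselves" concrete, here is a minimal sketch of what a topic type might look like in OMG IDL, the type-definition format mentioned above. The type name and fields are illustrative assumptions, not something from the talk:

```idl
// Illustrative topic type: any application that shares this definition can
// read or write SensorReading samples in the global data space.
struct SensorReading {
    @key long sensor_id;       // key field: identifies the data instance
    double value;              // latest reading for that instance
    long long timestamp_ns;    // source timestamp, in nanoseconds
};
```

A code generator turns a definition like this into strongly typed publish and subscribe APIs for each language binding, so peers exchange SensorReading updates without knowing anything about each other.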
I also want to note that the RTPS standard defines the wire protocol for DDS, including the Common Data Representation (CDR) serialization format, which specifies how data is encoded as it travels between producers and consumers. The end result is a key benefit: interoperability between producers and consumers located on vastly different platforms and transports. You can have participants running on a plethora of different architectures and still exchange data over this common wire protocol.

The final key feature to highlight is that DDS offers very granular quality-of-service configurability, and that configuration can be done per individual data flow: for specific producers and consumers, and for the specific data types that make up the shared data space. This lets users tailor communication parameters like reliability, latency, and resource usage for individual data streams, so DDS can optimize resource allocation and network efficiency. For example, you can send critical data with a high reliability setting, optimized for very low latency, while configuring less critical data more loosely, where you don't need the same stringent reliability or latency guarantees. This kind of flexibility ensures DDS can adapt to the requirements of real-time systems, which vary widely depending on the domain you happen to be deploying to.

One way of thinking about the kinds of QoS you can configure in DDS is in terms of the CAP theorem: you can optimize for message delivery, for consistency, or for availability, tuning your configuration parameters based on the types of guarantees you need in each area. It's important to note that this is the real world, so there are going to be trade-offs; you will always have to trade between the different types of guarantees you want to enforce. The nice thing about DDS is that it offers so many fine-grained tunables that you can ensure you're getting exactly the type of performance guarantees you need.
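To illustrate what per-data-flow QoS tuning looks like in code, here is a minimal sketch against the standard DDS C++ API (DDS-PSM-Cxx). It assumes the illustrative SensorReading type from the earlier IDL sketch has been run through a code generator; the topic name and policy values are assumptions made for the example, not recommendations:

```cpp
#include <dds/dds.hpp>          // DDS-PSM-Cxx umbrella header
#include "SensorReading.hpp"    // hypothetical generated type support

int main() {
    dds::domain::DomainParticipant participant(0);  // join DDS domain 0
    dds::topic::Topic<SensorReading> topic(participant, "VehicleStatus");
    dds::pub::Publisher publisher(participant);

    // Per-writer QoS for a critical flow: reliable delivery, keep the
    // last 10 samples per instance, and promise an update every 10 ms.
    dds::pub::qos::DataWriterQos qos = publisher.default_datawriter_qos();
    qos << dds::core::policy::Reliability::Reliable()
        << dds::core::policy::History::KeepLast(10)
        << dds::core::policy::Deadline(dds::core::Duration::from_millisecs(10));

    dds::pub::DataWriter<SensorReading> writer(publisher, topic, qos);

    SensorReading sample;        // hypothetical generated accessors
    sample.sensor_id(1);
    sample.value(42.0);
    sample.timestamp_ns(0);
    writer.write(sample);
    return 0;
}
```

A less critical flow could instead use Reliability::BestEffort() on its own writer; because QoS attaches per endpoint, both flows coexist on the same data bus.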
All right, so to summarize, DDS offers a powerful framework for the edge cloud, with a range of benefits. The first is customized reliability: DDS can accommodate situation-specific trade-offs, allowing very fine-grained tuning of reliability, resource utilization, and determinism based on specific requirements. It provides mechanisms for efficient data transfer, such as sender-side filtering, to limit the amount of data that actually gets put on the wire. DDS also enables eventual consistency, but in this environment, eventual consistency takes place within the timing requirements needed for real-time systems: when we talk about achieving eventual consistency across all subscribers for a given set of data, we're talking about consistency achieved within microseconds. That's really the timescale on which these eventual-consistency guarantees operate.

And then finally, data model evolution. The DDS type system supports the definition and evolution of data models over time while maintaining interoperability. We saw earlier that there is a type extensibility standard; as part of that standard, the DDS type system can be evolved within a deployed environment, so that new applications can move to new types without disrupting any legacy applications that need to be maintained.
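To make the type-evolution point concrete, here is a hedged sketch of the earlier illustrative type evolving under the DDS-XTYPES extensibility rules. The annotations are standard XTYPES annotations, the type and fields remain hypothetical, and the example assumes the original version of the type was also declared (or defaulted to) appendable:

```idl
// Version 2 of the illustrative type. @appendable allows new members at
// the end of the type while remaining interoperable with applications
// still built against the original definition.
@appendable
struct SensorReading {
    @key long sensor_id;
    double value;
    long long timestamp_ns;
    @optional string<64> unit;   // new in v2; legacy readers never see it
};
```

Deployed v1 applications keep exchanging the first three fields unchanged, while v2 applications can additionally populate and read the new optional member.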
It was our expectation that not too many people in this audience would be familiar with DDS, though we were pleasantly surprised a little while ago to find at least a few who are. To help drive home some of these points, we wanted to offer a comparison between DDS and Apache Kafka, which some of you might be a bit more familiar with. At a high level, side by side: DDS is designed for real-time applications, providing determinism along with the fine-grained quality-of-service configurability we've discussed. In contrast, Kafka is optimized for high-throughput streaming and log-based data streams, and it's commonly used for analytics applications and data integration. The Kafka architecture is cluster-based and requires producers and consumers to interact with a central broker; DDS is broker-less, so all interactions are peer-to-peer, as we were just discussing. DDS supports both the publish-subscribe and the request-reply patterns, while Kafka is primarily publish-subscribe. Kafka runs over TCP; DDS is transport-independent and can run over UDP, and most recently there is a specification in place for Time-Sensitive Networking (TSN), a real-time networking protocol, so latency-critical applications can run over those transports as well. Finally, Kafka is agnostic to the type system, whereas the definition of DDS includes the type system as part of the protocol, and that type system is what defines the global data space we were talking about earlier.

To give a little context and show a couple of real-world use cases, we're going to go through two usage scenarios for DDS. The first is an autonomous-driving application. In this scenario, a layered DDS data bus is used for the vehicle control system, and the structure has a few distinct layers. The first layer, in the green box, is the sensor and sensor-fusion layer. At that lowest level, the various sensors (lidar, radar, cameras, and so on) collect real-time data about the vehicle's surroundings, handling things like object detection, road conditions, and traffic information. That's the sensing layer of the data bus. Then there's the controls layer in the middle, that first orange arrow, which acts as the brain of the vehicle: aggregation at that layer enables decision-making components like situational awareness and vehicle control. You can think of that as the edge-cloud layer in this scenario. Finally, there's the connectivity layer, the cloud layer in the upper right, which implements connectivity and reach-back to off-vehicle data sources for functions like traffic management and maps. A key point to note, though, is the green arrow along the left-hand side: there's a single shared data model, going back to that shared global data space, spanning all of the different layers, so you don't need any data or protocol translations as you move up through them. From the edge layer to the edge-cloud layer to the cloud layer at the top, you have a consistent set of data structures, which makes the architecture blend seamlessly together.

The second use case follows a very similar pattern, from the medical Industrial Internet of Things (IIoT) domain. Here again you see the same layered data bus architecture. At the very lowest level you have the devices within a given hospital room: the sensors and monitoring equipment. That gets aggregated up to a higher-level hospital system, perhaps a hospital ward or a nurses' station; that's your edge-cloud layer. Finally, at the top layer you have your cloud or data center layer, where a higher level of data aggregation happens and where you might attach your data lakes or long-term data storage. Once again, a single unified data model spans all those layers, so you don't waste time and processing on data translations, data mappings, or protocol translations.

At this point, I'm going to hand off to Kyoungho to talk more about DDS performance.

Thank you, Pratima. From this slide, I'd like to introduce some of the DDS performance numbers we measured in edge and cloud environments. I believe these numbers provide some insight into the performance you can expect when deploying DDS applications in such environments. We offer a performance testing tool called Perftest that lets users assess DDS performance within their own environments. This tool evaluates two critical performance metrics, throughput and latency, and we used it in our experiments to measure the latency numbers. One challenge we encountered when measuring performance in a distributed environment is clock variance across the different machines. This variability can significantly impact latency numbers, especially in the microsecond range. To address this issue, the tool measures round-trip latency and divides it in half to derive the one-way latency, which provides a more accurate representation of the latency experienced on the network. We conducted these performance measurements with Kubernetes to gain insight into how containerization and virtual networks impact real-time applications; our approach was to deploy the Perftest application in a Kubernetes cluster.
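To illustrate the round-trip technique just described (a sketch of the idea, not the tool's actual source), here echo_round_trip() is a hypothetical stand-in for publishing a sample and blocking until the subscriber echoes it back:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in: publish one sample and block until the
// subscriber echoes it back over DDS.
void echo_round_trip();

// Derive one-way latency from a round trip. Both timestamps come from
// the same publisher-side clock, which sidesteps clock variance across
// machines entirely.
int64_t one_way_latency_us() {
    const auto t0 = std::chrono::steady_clock::now();
    echo_round_trip();
    const auto t1 = std::chrono::steady_clock::now();
    const auto rtt =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
    return rtt.count() / 2;  // assumes roughly symmetric forward/return paths
}
```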
Before jumping into the performance numbers, I want to give you some idea of how DDS operates and how we measure DDS performance in this environment. DDS by default operates on a publish-subscribe messaging pattern, so to measure performance we deploy a publisher pod and a subscriber pod and time messages from the publisher to the subscriber to get latency numbers. As Pratima mentioned, it's important to note that DDS is a fully decentralized, peer-to-peer architecture: there are no brokers between publishers and subscribers, so we don't need to deploy any additional component for the communication itself. They communicate with each other directly.

DDS uses multicast for its built-in discovery, but most Kubernetes CNIs do not support multicast. To resolve this, we deploy a component called Cloud Discovery Service, which lets the publisher and subscriber applications discover each other; after the discovery phase is done, they communicate directly with each other, peer-to-peer. To reach Cloud Discovery Service, we needed to create a ClusterIP service that the Perftest applications use to access it. The last detail is that we used node labels for the experiments: we don't want to deploy the publisher and subscriber applications on the same node, because we want to measure performance over the network, so we assigned different node labels to place them on different machines.

From here, I'd like to talk about the performance measurements in three different environments: first, performance at the edge on bare-metal servers; second, at the edge with Kubernetes; and last, in the cloud with Kubernetes.

Let's begin with the numbers from the edge on bare-metal servers, which are located in the RTI office in California. For all experiments, we measured median latency while varying two parameters: sample size, ranging from a small 64 bytes up to a large 63 KB sample, and quality-of-service settings, reliable versus best effort. DDS by default operates over UDP, which does not provide reliable message transmission, but DDS implements a reliable protocol on top, so users can decide whether to send a message reliably or best-effort. As the data shows, DDS delivers incredibly low latency for small message sizes: we're seeing latency as low as about 20 microseconds. Even for large message sizes it's around 180 microseconds, for both best effort and reliable.

Continuing our performance evaluation journey, we shifted our focus to the same edge servers, but using Kubernetes with Calico as the CNI, in its default configuration. The results provide insight into the impact of container orchestration at the edge. Incorporating Kubernetes with the Calico CNI, we observed a slight increase in median latency compared to the bare-metal servers on the previous slide: around 30 microseconds for small messages, extending to around 300 microseconds for large message sizes.
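To make the deployment setup just described concrete, here is a hedged sketch of what the manifests might look like. The image names, port, labels, and Perftest flags are illustrative assumptions, not the exact manifests from the talk:

```yaml
# Cloud Discovery Service: replaces multicast-based discovery where the
# CNI lacks multicast support; it brokers no user data.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloud-discovery-service
spec:
  replicas: 1
  selector:
    matchLabels: { app: cds }
  template:
    metadata:
      labels: { app: cds }
    spec:
      containers:
        - name: cds
          image: example.com/rti-cloud-discovery-service:latest  # hypothetical image
          ports:
            - containerPort: 7400
              protocol: UDP
---
# ClusterIP service so the Perftest pods can reach the discovery service.
apiVersion: v1
kind: Service
metadata:
  name: cds
spec:
  type: ClusterIP
  selector: { app: cds }
  ports:
    - port: 7400
      protocol: UDP
---
# Publisher pinned to a labeled node; a matching subscriber Deployment
# (not shown) would use a different node label so the two pods land on
# different machines and traffic crosses the real network.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: perftest-pub
spec:
  replicas: 1
  selector:
    matchLabels: { app: perftest-pub }
  template:
    metadata:
      labels: { app: perftest-pub }
    spec:
      nodeSelector:
        perftest-role: pub        # node label assigned ahead of time
      containers:
        - name: perftest
          image: example.com/rti-perftest:latest   # hypothetical image
          args: ["-pub", "-latencyTest"]           # flag names as assumptions
```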
Let's now take a look at a plot comparing the performance of the bare-metal servers and Kubernetes. This comparison reveals how Kubernetes, particularly its virtual networking, introduces different degrees of overhead depending on message size. For small messages the overhead is modest: the difference in latency is around 10 microseconds compared to bare metal, which indicates Kubernetes is well suited for small messages, like command-and-control traffic, which typically falls in that size range. However, it's important to note that as message size grows, the absolute latency difference becomes more noticeable. For larger messages there is indeed measurable overhead with Kubernetes; even so, it remains below 300 microseconds. This means that even for large messages, Kubernetes remains a viable option, providing acceptable performance.

Lastly, let's look at cloud environments, specifically AWS Elastic Kubernetes Service (EKS). In this scenario we anticipated certain changes, as cloud environments often introduce more overhead due to shared resources and additional virtualization layers. For small messages, we observed a latency of approximately 230 microseconds, while for larger messages latency increased to around 700 to 800 microseconds.

Now, let's compare performance between the edge and cloud environments, both using Kubernetes in this case. As the graph shows, the performance difference is quite significant, but the variation is in line with what we expected, given the inherent differences between the edge and the cloud. It's important to note that we conducted this test on t2.small instances, which means there is definitely room for improvement with larger, dedicated instances. The key point is that we provide a performance tool, as I mentioned, that helps you gauge the expected performance in your own environment; this tool serves as a very useful resource for understanding how DDS will perform in your specific setting.

Taking a closer look at this graph, you'll notice that, interestingly, best-effort latency tends to be higher than reliable latency. As I mentioned, in a reliable setting you do additional work to make sure your messages are delivered reliably, by sending heartbeat and ACKNACK messages, so you would expect reliable to be slower. But for large messages, the 16 KB and 63 KB samples, best-effort latency is higher because we used the DDS fragmentation feature, and we selectively applied it only to the reliable setting, for messages larger than the MTU, which is around 9,000 bytes in AWS. The reason we applied it is that we saw high latencies whenever the message size exceeded the MTU in AWS, so we configured DDS to fragment the messages itself instead of relying on IP fragmentation under the hood. As you can see, incorporating DDS-level fragmentation produced a substantial performance improvement.
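For readers who want to see what that tuning might look like: RTI Connext exposes transport properties in its XML QoS profile format. The sketch below is written from memory, so treat the property name and values as assumptions to verify against the Connext documentation:

```xml
<!-- Sketch: cap the transport's maximum message size so that samples
     larger than this are fragmented by DDS itself rather than by the
     IP layer. Property name and values are illustrative assumptions. -->
<qos_profile name="LargeDataOverEks">
  <participant_qos>
    <property>
      <value>
        <element>
          <name>dds.transport.UDPv4.builtin.parent.message_size_max</name>
          <value>9000</value>  <!-- at or below the ~9001-byte AWS jumbo-frame MTU -->
        </element>
      </value>
    </property>
  </participant_qos>
</qos_profile>
```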
So that's all we have today, and I want to wrap up this presentation with two questions that will guide us to our next steps. As we presented today, the DDS standards support data-centric, peer-to-peer communications, offering incredibly low latency for real-time use cases, and DDS is a well-established connectivity technology for the edge. While it has found its place at the edge, we would like to explore the potential needs and benefits of extending its capabilities into cloud environments. As an example, we can look into cloud-native extensions for DDS; these extensions could enable DDS to integrate seamlessly into cloud-native architectures for applications with low-latency, real-time communication requirements. So we're really looking forward to hearing from you about any cloud use cases that require low-latency, real-time communications. Thank you for joining us today, and we look forward to the insightful discussions and collaborations ahead. If you have any questions or feedback, please use this QR code to leave them. Thank you. I think we can take questions now.

Thanks for the talk. I'm Tom Wain, with Sony; by the way, I'm a member of the TSC for ROS, the Robot Operating System. I have a question about what you mentioned regarding CNI implementations: most don't support multicast, because it generates various problems and unnecessary packets. Are you working with any CNI implementation to enable multicast?

You're asking whether we're aware of any CNIs that support multicast?

No, are you working on a specific CNI implementation to enable multicast? As far as I know there used to be an open-source CNI project that worked with multicast, but it's EOL now.

We don't work on our own CNI implementation, but we do recommend some of the CNIs that support multicast, like Kube-OVN or Weave Net, for users who want multicast for discovery and also for user data. But no, we don't have plans to develop our own CNI to support multicast.

Okay, thank you.

Hi. If I've got some objects out in the real world and I want to send that data back to a cloud provider to make some very quick real-time decisions on it: in your diagrams, where one data bus plugs into another through a relay, what sort of latency are we looking at between those two points?

That's situation-specific, but in some of the benchmarks we've done, between our offices in California and Amazon environments, it was still below a millisecond, on the order of a few hundred microseconds. So still within the microsecond range, just not in the tens of microseconds.

Awesome, thanks.

Hello, Miguel Martinez, Jonder. This is my first time hearing about the specification, and I'm interested to know about adopting CAN bus as part of the protocols that DDS supports. Maybe it's a question for the specification itself, but I'm not sure if you know about CAN bus.

Some of our users in the automotive space are using CAN bus and adopting DDS as well. So rather than adopting CAN bus as part of DDS, you can look into a gateway that converts CAN bus data into the DDS space and vice versa; that might be the way to go. If you already have systems running over CAN bus, but in addition you want to do more complicated things like autonomous-driving logic, you can develop those on DDS and have that gateway move data to and from the CAN bus. Yeah. Thank you.
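To illustrate the gateway idea from that answer, here is a minimal hedged sketch of a one-direction CAN-to-DDS bridge using Linux SocketCAN. The CanFrame DDS type and its accessors are hypothetical, and error handling is omitted for brevity:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/can.h>
#include <linux/can/raw.h>
#include <dds/dds.hpp>
#include "CanFrame.hpp"  // hypothetical type generated from a CanFrame IDL struct

int main() {
    // Open a raw SocketCAN socket and bind it to the can0 interface.
    int sock = socket(PF_CAN, SOCK_RAW, CAN_RAW);
    ifreq ifr{};
    std::strncpy(ifr.ifr_name, "can0", IFNAMSIZ - 1);
    ioctl(sock, SIOCGIFINDEX, &ifr);
    sockaddr_can addr{};
    addr.can_family = AF_CAN;
    addr.can_ifindex = ifr.ifr_ifindex;
    bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    // DDS side: one writer republishing frames into the global data space.
    dds::domain::DomainParticipant participant(0);
    dds::topic::Topic<CanFrame> topic(participant, "CanBusFrames");
    dds::pub::Publisher publisher(participant);
    dds::pub::DataWriter<CanFrame> writer(publisher, topic);

    can_frame frame{};
    while (read(sock, &frame, sizeof(frame)) == sizeof(frame)) {
        CanFrame sample;                 // hypothetical generated accessors
        sample.can_id(frame.can_id);
        sample.payload(std::vector<uint8_t>(frame.data,
                                            frame.data + frame.can_dlc));
        writer.write(sample);  // frame is now visible to DDS subscribers
    }
    return 0;
}
```

The reverse direction would subscribe to a DDS topic and write frames back to the socket; in practice you would map specific CAN IDs to meaningful DDS types rather than bridging raw frames.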
Well, maybe I didn't get the comment. My understanding is that we need a layer between CAN bus and the DDS specification, right, so we can adapt it?

Yes. And we're also part of the AUTOSAR community, so I can point you to the specific things we're doing at RTI regarding automotive and CAN bus. Thank you.

Hi, I'm from HPE. We have some customers in the automotive domain where the subject of latency does come up. Now, looking at your performance charts for the larger packets, the latency on Kubernetes was about double the bare-metal scenario. Have you considered introducing a way to artificially add latency, so people can tell at what point they need to start getting worried about it? It could be that 300 microseconds of latency versus 150 is critical for an application, or it could be no issue at all. How do people know what order of magnitude of latency starts being a problem for them?

I think that's situation-dependent, but there are metrics that can be recorded: statistics that are actually part of the protocol. You can gather latency statistics, along with statistics around things like lost messages and ACKNACK counts. All of that is provided by the protocol and can be recorded, and people can put thresholds on it to indicate whether there's a critical condition or a warning level that needs to be addressed. Those are all part of the statistics API included with DDS.

Sounds good.

I think we can stop here. Thank you, everyone. Thank you.