Hello, I'm Shannon, the Chief Digital Manager of DataVersity, and we would like to thank you for joining the latest installment of the monthly DataVersity Webinar Series, Advanced Analytics with William McKnight. Today, William will be discussing trends in streaming analytics and message-oriented middleware. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag ADV-Analytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom middle of your screen for that feature. And if you'd like to continue the conversation after the webinar, you can follow William and each other at community.dataversity.net. And as always, we will send a follow-up email within two business days, containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce to you our speaker for this series, William McKnight. William is the president of McKnight Consulting Group. He takes corporate information and turns it into a bottom-line-producing asset. He's worked with major companies worldwide, 15 of the Global 2000, and many others. McKnight Consulting Group focuses on delivering business value and solving business problems, utilizing proven, streamlined approaches in information management. His teams have won several best practice competitions for their implementations. He has been helping companies adopt big data solutions. And with that, I will give the floor to William to get today's webinar started. Hello and welcome. Well, hello, it might help if the speaker gets off mute here. I see some of you were enjoying the music.
Shannon and I like to go back and forth about the music, but she always wins because, well, she has the controls. But anyway, welcome. And hopefully there's more than the music here today. We're going to talk about streaming analytics and message-oriented middleware. A lot of my clients, I think a lot of the audience for something like this, is data integration people, data professionals generally. I do notice that there's always a few titles that are clearly on the quote-unquote business side of things, if I can still make that distinction in organizations anymore. And I really welcome that. And actually, you know, before we get started, I just wanted to say, I think that it would be great to have both counterparts, business and technical, on webinars like this. I know I get a little technical, but it's not so much that I think it's going to fly way over the head of your business counterparts. And actually, that may be a great way to start some conversations internally about the data architecture of the future. Because I know that a lot of data professionals out there would prefer to work in a more mature data environment. Wouldn't we all? Well, that doesn't happen overnight. And you may have to be kind of the instigator of the very beginning of this through things like getting them out on webinars and just starting up the conversations, starting that trail that'll lead up to hopefully a new architecture down the road that is great for everybody. But it doesn't happen overnight. Let's keep working on it. And one of the major things that I've been working on with my clients is streaming. Streaming data. It seems like the workloads I've worked on in the past five years have been exponentially more difficult and different than the ones in the prior years of my consulting career. And how they're different is a lot of it's about the data. It's about the volume of the data, the velocity of the data that's coming in to organizations, that they're ready now to adopt.
They're ready to apply their science to that data. And we've got to get it under control. So a lot of that data is going to fall into the category of what we're talking about today: streaming data and analytics. But first, since we're clearly in the area of data integration, let's acknowledge the ETL legacy that we're all living with. And I say all. Everybody's doing ETL to some degree. If you're doing any streaming, you're just starting to fold it in to the organization. Unless you're a total startup, then you started down the streaming path. But ETL is an ad hoc manner of connecting sources and targets that surfaced in the 1990s. Some of us were around doing it then. Of course, it's still the largest part of any project that involves data movement. And much more so than data modeling, than data access, than data quality, than data governance, than data architecture, and all the other things that we do. Data movement. Still big. We're still not leaving data in one place in the organization and saying, do everything with it right there. That's not working. It may come to that at some point down the road. Obviously, a lot of us have great data warehouses that approach this concept. And nobody has just a little bit of ETL, right? It's always a lot. ETL was built for data warehousing. I think about Informatica. They're now, I think, 26 years old. So that gives you some timeline on this, because they were really built for ETL for data warehouses. Of course, they've expanded since then, but that's what they were built for. It is the bottleneck in data warehouse population. It's very time and resource intensive. If you stand back and think about it, it's very time and resource intensive. And it is batch. And I know we can throw smaller batches at it and approach that real-time idea, but it's still batch. And it's chaotic and unmanageable in a lot of places.
Not because it's inherently that way, but just because we've been doing it for so long and to so many different standards with so many different sets of people in organizations in so many different ways. And some organizations have many, many tools for this. And that's what creates that spaghetti-looking architecture that some of us have. Okay, then along came EAI. And it had a short-lived heyday as an elegant way of moving information. And I say that, and some of you are thinking, that's what we do. Of course, it continues to exist in the long tail in organizations. But I'm gonna hopefully move you along here, but let's acknowledge EAI, the facilitation of the exchange of business transactions. Still point to point, though. Still point-to-point messages, usually using enterprise service buses. Works for small-scale data. So if you still have that, you might be considering EAI, but it's not really designed to handle the data that we're trying to get our arms around today, like sensor data and so on. So for your existing applications that were generally built without integration in mind, where now your task is to connect these together and somehow produce a functioning whole out of the independent pieces, to me that's where EAI shined. And again, often tackled with solutions that were based on web services. The modern realities of data integration today: if you're a data integration professional today, if you're a data professional today, you have to move along with the skills here into the modern realities, which are a desire for consolidated methods of data integration, that is, across the enterprise. Like I mentioned, we have a lot of different ways of moving data. We probably still will, but making reasonable choices when it comes to data integration is imperative. There are new types of data sources out there, logs, sensors, et cetera.
So many different types of data that we as data professionals are now required to get under management, and that includes video, audio, chat, et cetera, et cetera. So we have more than OLTP and OLAP. I know when I talk about data architecture, I still talk about the distinction. It's still there, but things are getting rather muddled, really muddled today, because as I think we've all known for a while, analytics are required in OLTP. And so there's a lot of online analytics that's actually migrating back into the OLTP world. So you have these hybrid approaches now, and sometimes I'm left to scratch my head. I don't know where to put that on the architecture diagram. Is it on the right side of the line with analytics, or on the left side with OLTP? Well, data integration works across the board, across the spectrum of requirements, of course. And I think even though it was built for data warehousing, lots and lots of Informatica and other data integration has been about OLTP, truth be told. But now we have more than this. We have distributed data platforms, hybrid data platforms, confusing data platforms that still make sense inside an organization. The desire for real-time data, I mentioned. The ability to actually step up to more and more data, more and more granular data, is imperative for organizations today to take advantage of that level of data. And it's really the basis for AI, which I'm gonna get into in a little bit here. But high-velocity data increasingly needs integration, and traditional approaches without stream processing turn into ETL plus, plus, plus: plus custom scripts, plus middleware, plus MQ, perhaps. So the requirements have exceeded ETL capabilities in most organizations. So what do we do about that? Well, now I've set us up here that we have streaming up in the upper right-hand corner of this quadrant where we have scalability, right? ETL was very scalable, but it was batch.
EAI, not so scalable, like I mentioned, great for small-scale projects, but it was real-time. And so therein lies the conundrum that is filled by streaming platforms of today. So streaming is very forward-thinking, where real-time and scale are becoming the rule, not the exception. So you might think, well, streaming, that might be overkill for some things. And I agree, I definitely agree. I'm not here saying let's do all ETL, all integration, with streaming, but if you're looking for a platform of the future, you got to consider streaming if you have an application that you're platforming right now and you're choosing tools for now that involves this level of data, or it's something that may get there. Please, always think about the long-term ramifications of the platform that you're implementing and where it might go and how soon it would be a problem if you were to have to exchange products on that application. We don't want that to happen, right? So, point-to-point was the old way. And this is a picture of a typical architecture. Hang on just a minute. Sorry about that, I have an older dog that gets excited sometimes. So, the old way was point-to-point. So if you add another database to this picture, you have to repeat the process, and you have to repeat it again and again. And even though we try to take care of this by having consolidated platforms, which I'm still hugely advocating for, like data warehouses, like operational hubs, like MDM, et cetera. I'm still advocating that, of course. We have lots of themes in that. And I had a whole other webinar dedicated to the data warehouse and where it sits today and why you might have multiple and blah, blah, blah. So I won't get into all that right now, but it's true that we're gonna have multiple platforms on the analytical side of things, and obviously on the operational side of things.
But in this architecture, and let me just play out the slide here one more time, all right, if you have two services, then there are two direct connections. But think about it, if you have three services, there are six now, assuming they all need to connect to each other. With four services, there are 12, and so on. In a way, such connections can be viewed as coupling between the objects in an object-oriented program. Your objects need to interact with other objects, but the less coupling between their classes, the more manageable your program is. So that is really the problem that streaming tries to address. Setting it up a little bit more here. So with this combination, I was starting to get into where you absolutely must have streaming today. And there are definitely, definitely things that we platform for our clients where it's just gotta be streaming, not ETL solutions. So this is the criteria. Data platforms operating at an enterprise-wide scale, things that are going to have long legs in this organization. A high variety of data sources. Yes, the more variety, obviously, the more possibility you're going to get into some sources that are going to exceed the ETL capabilities and be more like a profile of the future, where you actually do have real-time and streaming data too. And of course, if you have that today, as you would in anything that involves sensors, IoT, if you're trying to track big data on a real-time basis, like video, audio, all of these kinds of data are streaming data. So to be sure, there's streaming data and there are streaming integration solutions. I'm here mostly talking about the integration solutions, but they are for streaming data and really all data, but they're definitely for streaming data. So if you have streaming data, ETL would force a choice on you: real-time loading without being scalable, or scalability with batch loading. It would give you scalability, but you'd have the batch loading.
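To make that point-to-point growth concrete, here is a quick sketch (in Python, purely for illustration; the function name is mine, not from the slides) of the connection count when every service keeps its own direct, directional link to every other service:

```python
def direct_connections(n):
    # Each of n services connects to every one of the other n - 1
    # services, counting each direction separately, which is how the
    # 2 -> 2, 3 -> 6, 4 -> 12 progression in the talk arises.
    return n * (n - 1)
```

The growth is quadratic, which is exactly why adding "another database to this picture" hurts more each time.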
Data produced from numerous sources is a torrent of flowing information needing to be time-stamped, dispatched, and even duplicated to protect against data loss. So we need a postman in this, and now I'm starting to introduce the pieces of streaming, one being that postman that sits in the middle, that grand central station, as I like to say, sitting there distributing the mail and collecting the data that's coming in, putting the people on the trains to get out to where they need to go. So this is needed to distribute data from message senders to receivers at the right place at the right time. So building that conceptual picture now of streaming, which is actually conceptually not that difficult to get your hands around. But a little bit more on that real-time data, which I've mentioned is a catalyst for needing streaming solutions, right? I'm not gonna put ETL against this type of data. Messaging data, live feeds, real-time, event-driven: these are some words that, if I hear them or if I see them, my ears are perking up and I'm thinking much more about streaming. Data that comes in continuously and often quickly. So we also call it streaming data, okay? Real-time streaming, sometimes the same thing. Needs special attention and can be of immense value, but only if we're alerted in time. So think about your business changing on a dime when there's a sudden price change from a competitor, or a critical threshold is met, or a sensor reading that changes rapidly, or a blip in a log file. These are the types of things that you want to act on, and that means competitive advantage if you can act on those things, and they're probably happening hundreds to thousands of times a day in an enterprise, where you actually could be acting on them if you only had your arms around that information, that activity. Well, streaming gives you that possibility, and this kind of data is the foundation for artificial intelligence excellence.
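As an illustration of acting on a stream in time, here is a minimal sketch (hypothetical names, not from the webinar) of flagging the "sensor reading that changes rapidly" case as records flow by:

```python
def rapid_changes(readings, max_delta):
    # readings is a stream of (timestamp, value) pairs; yield an alert
    # whenever two consecutive readings jump by more than max_delta.
    prev = None
    for ts, value in readings:
        if prev is not None and abs(value - prev) > max_delta:
            yield (ts, prev, value)
        prev = value
```

Because it is a generator, it processes records one at a time as they arrive, which is the streaming posture rather than the batch one.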
Yes, of course you can do artificial intelligence with any data, with small relational data, of course, but it's going to get better and better and more granular the more data you bring into the mix. Streaming data forms the core of the data for artificial intelligence, for those robust solutions out there, those leading-edge solutions, and I've given the last two of these webinars on artificial intelligence and machine learning. So I talked extensively there about many examples of artificial intelligence in the enterprise, and I may not have played it up a lot in those webinars, but a lot of the data underlying that is streaming data. So if you go back and listen to those, you can tell that that's what I was talking about. Message brokers are a way of decoupling the sending and receiving services through the concept of publish and subscribe. That concept's been around a while, I don't think I have to explain it extensively. Another thing message brokers do is to queue or retain the message until the consumer picks it up. So again, we have decoupling. So the sending service, also known as the producer, posts the message or the request on the message queue, and the receiving service, the consumer, which is listening for messages, will receive it. Message brokering is one of the key use cases for streaming. Streaming allows us to have both pub-sub as well as queuing features. Historically, either one or the other was supported by such brokers. So that queuing is very important as well. So that's another thing that message brokers do. It's important to have one in a streaming solution. So if the consumer service is down or busy when the sender sends the message, you can always pick it up later. The sender doesn't have to worry about that. It's gonna go at the right time. The upshot of this is that the producer service doesn't have to worry about checking if the message has gone through, and to retry it on failure, et cetera.
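The decoupling plus queue-until-pickup behavior described here can be sketched as a toy in-memory broker. This is purely illustrative (no vendor's API; all names are mine): producers and consumers never reference each other, and a message waits in each subscriber's queue until that subscriber polls for it:

```python
from collections import defaultdict, deque

class ToyBroker:
    """Minimal publish/subscribe broker with per-consumer queues."""

    def __init__(self):
        self._queues = defaultdict(dict)   # topic -> {consumer: deque}

    def subscribe(self, topic, consumer):
        self._queues[topic].setdefault(consumer, deque())

    def publish(self, topic, message):
        # Fan the message out to every subscriber's queue; the producer
        # doesn't know or care who (or how many) the subscribers are.
        for queue in self._queues[topic].values():
            queue.append(message)

    def poll(self, topic, consumer):
        # Messages are retained until the consumer picks them up, so a
        # consumer that was down or busy at publish time loses nothing.
        queue = self._queues[topic][consumer]
        return queue.popleft() if queue else None
```

A real broker adds durability, ordering guarantees, and network transport, but the producer-side upshot is the same: publish and move on.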
So streaming is great because it allows us to have both pub-sub as well as queuing features. It also guarantees that the order of messages is maintained and not subject to network latency or other factors. Streaming allows us to broadcast messages to multiple consumers if needed, and it usually is needed. So a little bit of a streaming architecture here. This is the way I like to represent it. You have all these applications in the enterprise, and by the way, the number of applications inside an enterprise, if you haven't noticed, has been growing by leaps and bounds lately. And now it includes so many apps that are third party as well. It's really a conduit for change in an organization. It's about the time that people like me start getting the calls about, is it time to change our approach to data integration? Because we have so many apps, so many apps, so much granular data out there. And it's all, in a streaming platform, handled through request-response to the message broker. So the steps are to extract the data, make it as structured a product as you can via the transforms that might happen in the streaming platform, and I suggest there could be several, and then it gets loaded out to the requester. So all data can be represented as streams. I don't know of any that could not be, even those old clunky relational databases that we're using and have been using for, how long would the oldest have been? Probably about forty years. And there are some of them out there. The future of all your data is represented as streams with a central streaming platform. It serves as a storage layer for stream data, and it involves moving streams between external systems and the central streaming platform, and it does transformations. So it's a real-time bus and it's a messaging hub. And one of the things, I don't know that I have an icon representing it here, but one of the things I like to do because of what I do, I like to put MDM data on a streaming platform.
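The claim that all data, even a relational table, can be represented as a stream is often shown with a changelog: the table is just the result of replaying a stream of change events. A minimal sketch, with event shapes I have invented for illustration:

```python
def apply_changelog(events):
    # events is a stream of (op, key, value) change records; replaying
    # them in order rebuilds the table, so the table is a view over
    # the stream and the stream is the system of record.
    table = {}
    for op, key, value in events:
        if op == "upsert":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table
```

This is the same duality that lets a streaming platform serve as a storage layer: keep the stream, and any downstream table can be rebuilt from it.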
So that really opens the doors up for publishing that data. In MDM, it's great if you build a wonderful hub, but you've only done half the job there. You have to get the data out of the hub into the subscribers, where it actually gets to work. And I've just found that I can go one by one through the organization and sign up subscribers, but it's really hard if it's hard for them to get the data, and it's a lot easier for them to get the data out of a streaming platform. Of course, though, there's that upfront work of establishing the streaming platform. So not a free lunch, nobody's saying that, but definitely scalable for handling the information asset, which a lot of you are competing on today. So streaming data is that unbounded, continuous flow of real-time records, and stream APIs transform and enrich the data. So I'm gonna talk about APIs here in a second. We're talking millisecond latency, and I'm gonna give you some numbers to shoot for in your streaming implementations here. It can be stateless or stateful. So the key difference here between stateful and stateless applications: they're both handled the same way, but it's important to understand the difference. Stateless applications don't store data, whereas stateful applications require storage, require backing storage. So that grand central station that I was talking about, is data stored there? Well, sometimes yes and sometimes no. Sometimes it just bounces through, but quite often it's stored there, and that can be very advantageous. And I'm not just talking about stored until all the subscribers pick it up. I'm talking about stored for some period of time and dealt with there. I'll get into that as we go along here. So enter message-oriented middleware, also known as streaming and message queuing technology. Messages can be any kind of data, wrapped in a neat package with a very simple header as a bow on top.
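The stateless versus stateful distinction is easiest to see side by side. In this sketch (my own toy examples, not from the slides), the first transform needs no memory between records, while the second must carry state across records, which is what would need backing storage in a real platform:

```python
def to_celsius(stream):
    # Stateless: each record is transformed entirely on its own;
    # nothing is remembered between records.
    for fahrenheit in stream:
        yield (fahrenheit - 32) * 5 / 9

def running_average(stream):
    # Stateful: the count and running total must be kept (and, in a
    # real streaming platform, durably stored) across records.
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count
```

If a stateless operator crashes, you restart it and keep going; if a stateful one crashes, its state has to be recovered, which is why stateful streaming requires storage.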
And I like it when it's like that, of course, I can get to work, but quite often you have to work on the data and there's no bow. Messages are sent by the producers, the systems, sensors, or devices that generate the messages, toward a broker. Now let me pause here and say that there's also some maturity with this, and there are some hurdles to get over before you get to that nice path where people are signing up left and right and getting data off the platform. That's what is sold. That's what everybody is told you can get to. And yes, of course, that's it, but there's a bit of work getting there. And so in the interim you are going to have to triage. You are going to have to do special work with each of the subscribers and the publishers. And each time, you hopefully tune the whole apparatus a little bit, right? So that you can get to that point of almost hands off. I don't have any clients that are completely there, but you're almost hands off with your streaming platform, and people can sign up to get what they want. I tell the organizations I deal with on this that I would like two things from you. I would like you to give your data to the streaming platform, and I would like you to take your data from the streaming platform. And if I can achieve that, if we can achieve that together, we will have achieved a very significant step forward for that organization. So think about that as your goal. Isn't that a worthy goal? You should have some worthy goals out there in this area. And I think it's time, as a lot of organizations are doing digital transformation, it's time to step back and look at the bigger picture of what's going on. And certainly a lot of it's about data maturity, and certainly streaming is going to be a big part of that. Back to my slide. So then consumers retrieve the messages from the queue to which they subscribe, and different solutions call this different things, topics and whatnot.
So we'll get into some of the different solutions here in just a moment, starting with Kafka. Although sometimes messages are pushed to consumers rather than pulled. The consumers open the messages and perform some kind of action on them. Yeah, hopefully they're doing something smart with that data. See, somebody in the organization has to worry about the plumbing. They have to worry about moving data, and that data can look like widgets to them. But somebody then has to worry about, okay, what does that data look like? How are we going to use it? And it's okay if they're different people, but somebody needs to be over the top in order to make the thing all work harmoniously. Otherwise it doesn't work harmoniously. There's too much disconnect in those organizations that don't have somebody that can make that bridge. I digress. Streaming solutions are intelligent data platforms for fast data that connect, process, and store data in real time in a unified, flexible solution, able to meet demanding SLAs even at scale without operational burdens and complexity. So that's another way of saying some things I think I may have already said, but if you're looking into the product set in this space, here are some of the things you have to look at. Here are some of the main things. Then we'll get into some other things that you have to look at. Throughput: a high, sustainable rate of message processing. Yeah, that's probably the most important thing. Storage: the ability to retain varying volumes of messages for varying lengths of time as necessary. The whole operation, the adding publishers, adding subscribers, adding topics, organizing the whole thing, that's important. Minimize that burden for scaling, tuning, and monitoring. Latency: you don't want latency, okay? Fast, consistent responsiveness for publishing and consumption. For example, I want publish latency of five milliseconds at the 99th percentile.
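A target like "publish latency of five milliseconds at the 99th percentile" is straightforward to check against benchmark samples. Here is a small nearest-rank sketch (function names and the SLO parameters are mine, just to show the arithmetic):

```python
def percentile(samples, pct):
    # Nearest-rank percentile: the smallest sample that is greater
    # than or equal to pct percent of all samples.
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)   # ceiling division
    return ordered[rank - 1]

def meets_slo(latencies_ms, target_ms=5, pct=99):
    # True if pct percent of the benchmarked latencies (milliseconds)
    # came in at or under the target.
    return percentile(latencies_ms, pct) <= target_ms
```

Tail percentiles, rather than averages, are the right lens here because a handful of slow publishes is exactly what an average hides.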
That's something that I think is a smart goal for an organization stepping up to this. By the way, you will test, you will probably benchmark your solution before you buy it, and these are some things you can look at in that whole process. Throughput: you want millions of messages moving in a single partition. What else do you want in a solution? Well, comprehensive capabilities that include things like durability, scalability, and delivery guarantees, okay? I say some of these things right now because I want your minds to kind of wrap around the fact that these are important parts of the whole streaming solution. Georeplication, out-of-the-box support for geographically distributed applications, multi-tenancy, stream-native functions, and a unified messaging model. So those of you that are building your RFPs for this now, here's some ideas. Also, you need fault tolerance and parallelism, so we can deploy lots and lots of these internal processes to handle large workloads, and we need it to support multiple delivery semantics, all right? Operations and monitoring, yeah, we need that. To be able to know what you're viewing and monitoring, all the details of your processes, it's all there centrally. And then schema management, how schemas can evolve as you continue to populate and grow in this. Now, some of you think of Kafka as synonymous with this approach. Yes, we are in that space. So if you've been hearing Kafka, that's what they're talking about; this is talking about streaming. And by the way, I say Kafka, and that's what most people say, but I've heard other pronunciations and I don't blink. If you want to call it that, fine, plenty of people call it that. But if we get too far off on the pronunciation, that's probably something you want to fix. But I digress. Okay, now that we're pronouncing it: Kafka. Open source, let me see if I can talk, an open-source streaming platform developed at LinkedIn of all places.
So it's a distributed publish-subscribe messaging system that maintains feeds of messages called topics. There you go. There you go. Topics. So if you want to subscribe to a topic, you may do so. And then you'll get all the data for that topic as a subscriber. So in this way, it's kind of like RSS feeds. All right, those of you that have those, or I'm subscribed to some topics on Slack, for example, for various of our projects. It's like that. So publishers write data to topics and subscribers read data from topics. Okay, Kafka topics are partitioned and replicated across multiple nodes in your Hadoop cluster, should it be on a Hadoop cluster, and enable what they call a source-to-sink data pipeline. Kafka messages are simple byte arrays that can store objects in virtually any format. I've dealt mostly with JSON. And there's a lot of JSON out there, of course. The E and L in the ETL, okay, you get through the Kafka Connect API. The T, you get through the Kafka Streams API. It is fault tolerant, but it is a bit of work on your part, a do-it-yourself kind of approach. Confluent is the vendor that does the value add to open source Kafka. Now I brought up APIs, and it's true for all of these. And here you see there's Connect APIs to Kafka on both ends of the pipe. So this is my graphical representation of that. Pretty simple, I'm putting the terminology into the picture, source and sink. APIs, very important in this approach. APIs: ubiquitous method and de facto standard of communication among modern information technologies. APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints. Okay, not necessarily the APIs of old, due to the popularity and proliferation of APIs and the microservices approach, by the way. That's also something you step up to at this level. Microservices, APIs, streaming data, they all kind of work together.
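Since Kafka topics are partitioned, each record is routed to a partition by its key. The real Kafka client's default partitioner uses a murmur2 hash; the sketch below uses crc32 purely as a stand-in to show the principle, which is that the same key always lands on the same partition, and that is what preserves per-key message ordering:

```python
import zlib

def partition_for(key, num_partitions):
    # Hash the record key and take it modulo the partition count.
    # (Kafka's actual client hashes with murmur2; crc32 here is just
    # an illustrative substitute with the same routing property.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Note that changing `num_partitions` reshuffles keys across partitions, which is why growing a topic's partition count breaks per-key ordering guarantees for in-flight data.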
Now you don't have to have a microservices architecture, but why wouldn't you? Why wouldn't you start moving in that direction? It's something that you'd want anyway, but it works very well with APIs, which work well with streaming. So the need has arisen to manage the multitude of services that a company relies on, both internal and external. Okay, lots of external data, lots of external partners now sharing data, and a lot of that's in JSON. So streaming is very important for all of this. The APIs themselves vary greatly in protocols, methods, authorization and authentication schemes, and usage patterns. Additionally, IT needs greater control over its hosted APIs, with things like rate limiting, quotas, policy enforcement, user identification, a lot about security, to ensure high availability and prevent abuse and security breaches, right? Okay, if you expose APIs, you open the door to many partners who can co-create and expand the core platform without even knowing anything about the underlying technology. They just have their endpoints. Many companies experience workloads of more than 1,000 transactions per second on their API endpoints. So imagine a financial institution with 1,000 transactions happening per second; that is over 86 million API calls in a single day. For those organizations, their need for performance is as great as their need for management, because they rely on these API transactions. Consequently, many companies are looking for a solution to load balance across redundant API endpoints and enable high transaction volumes. Information systems within these large companies can be very complex, and many have turned to APIs as the glue to hold it all together. This gives these organizations the ability to knit all these systems and applications together. APIs and microservices also give companies an opportunity to create standards and govern the interoperability of applications both old and new.
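Rate limiting and quotas on hosted APIs are commonly modeled as a token bucket: requests spend tokens, tokens refill at a fixed rate, and the bucket's capacity caps the burst. This is a generic sketch of the technique, not any particular API gateway's implementation:

```python
import time

class TokenBucket:
    """Token-bucket limiter: `rate` tokens per second refill, up to
    `capacity` tokens of burst. Illustrative names throughout."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last check,
        # then spend one token if any are available.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A gateway would keep one bucket per API key or per client, which is how per-user quotas and abuse prevention fall out of the same mechanism.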
So, yeah, enough about APIs, but they're an important part of these architectures. Well, not quite enough about them, a little bit more. Public: those of you that have dealt with APIs, you probably know this site here, programmableweb.com, where there's over 20,000 public APIs, sort of the marketplace for that. Go out and check it out. Check out all that data that's possible. I will bring in weather data. I'll bring in stock market data. I'll bring in news data, things like this, into just ordinary client applications. It's right there. So anyway, public; private with an external partner like a supply chain partner or customer; or private, internal only. All of these are part of the ecosystem, and there's a need for managing all of this. So in a microservices architecture, microservices and containers are great for developers because they remove the fragility, scale, and deployment issues associated with tightly coupled application architectures through decomposition, right? So it allows you to have continuous delivery, and it allows developers to crank out the code much faster, not having to wait for lengthy system rebuilds, redeployments, and integration tests, or sweat whether that one-line code change might bring the whole thing down. So in this environment, and I know many of you are on the path here, the new tech zookeepers become that thing we're now calling DevOps, okay? So, not to go off on that too much, but that becomes part of this as well. The platform architecture around APIs: you want 100% of requests answered with an HTTP 200 OK response and no spikes or outliers in latency. And you wanna test this. When you do your test, you wanna let it run for a good hour plus. It's not just, let's shoot a few transactions through here; put some volume on it, make sure you know what you're getting into, make sure you can live with what you're getting into, make sure what you're getting into is actually gonna work, not just for today but for the inevitable growth of this.
And hopefully you do feel like it's inevitable; otherwise you wouldn't be doing it in the first place. We architect for success. You want scaling down to a nano instance size with only one CPU core and half a gigabyte of RAM, so these APIs do not have to occupy a big part of the environment. Okay, requirements for APIs, and then I'll finish on APIs. They're good for high-performance workloads, over a thousand transactions per second; there you're almost inevitably needing to go the API route. Reliability: all workloads completed with one hundred percent message completion, or 99-point-some-number-of-nines, okay, and multiple plugins enabled in terms of the complexity. Now let's look at some non-Kafka solutions, like RabbitMQ. Maybe some of you have heard of this or are working with it. It's managed by Pivotal Software and came out in 2007, just to show you how old some of these things are. It's finally hitting its stride, though. A lot of this is going to sound familiar, but it's just their terminology as opposed to Kafka's. Okay, RabbitMQ uses an exchange to receive messages from producers, and the broker pushes them to the registered consumers in round-robin fashion. What else do I want to point out here? Messages, queues, and exchanges do not persist unless otherwise instructed. So if a broker is restarted or fails, messages can be lost. It has settings to make it more durable. Durable is a big word in all of this. By the way, my engineers and I have modeled all of these solutions here. We've benchmarked all of them. So we have some pros and cons and some gotchas and so on, and I'm trying to share some of that with you here. Amazon Kinesis is similar to Kafka. Not a lot more needs to be said about it. It's an alternative if you like the Amazon stack, in an enterprise-ready package; Amazon users will pay by the shard hour and the payload. It still does the pulling of the data, though.
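That durability point, messages lost on restart unless you ask for persistence, can be shown with a toy model. This is an illustration of the concept only, not RabbitMQ's actual implementation or API:

```python
class ToyBroker:
    """Toy illustration of durable vs. transient queues: on a broker
    restart, only messages in durable queues survive."""

    def __init__(self):
        self.queues = {}   # name -> {"durable": bool, "messages": [...]}

    def declare_queue(self, name, durable=False):
        self.queues[name] = {"durable": durable, "messages": []}

    def publish(self, queue, message):
        self.queues[queue]["messages"].append(message)

    def restart(self):
        # Simulate a crash/restart: transient queues lose their contents.
        for q in self.queues.values():
            if not q["durable"]:
                q["messages"].clear()

broker = ToyBroker()
broker.declare_queue("orders", durable=True)
broker.declare_queue("metrics", durable=False)
broker.publish("orders", "order-1")
broker.publish("metrics", "cpu=90")
broker.restart()
print(broker.queues["orders"]["messages"])   # ['order-1'] survives
print(broker.queues["metrics"]["messages"])  # [] lost
```

In real RabbitMQ you'd get this behavior by declaring the queue durable and marking messages persistent; the point here is just that neither is the default, so you have to ask for it.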
And then there's Apache Pulsar, originally developed at Yahoo. It began incubation at Apache in late 2016, but it's been in production at Yahoo since 2013, utilized across their product set. It follows the pub-sub model and uses built-in multi-data-center replication. So the thing about Pulsar is it's great in the middle, in that grand central station role. It's also great in the integration pieces, with high performance and high durability. So that's something to keep in mind. Streamlio is the enterprise-ready deployment of Pulsar, should you want to go this route. What else do I want to say about Pulsar? Yeah, the key distinction is really the durability of it all. Pulsar ensures that message data is never lost under any circumstance. It does this with Apache BookKeeper, which provides low-latency persistent storage. When a Pulsar broker receives a message, it sends the message data to several BookKeeper nodes, which push the data into a write-ahead log and into memory before an acknowledgement is sent. So in the event of a hardware failure, power outage, et cetera, Pulsar messages are kept safe in permanent storage. And again, they use this concept of a property, which could represent all the messaging for a particular team, application, product vertical, et cetera. It's kind of like the topic that we looked at before, the grouping of the messages that you can subscribe to; it's a property here. Pulsar distinguishes subscriptions by different models. Okay, let me go through the models quickly. An exclusive subscription is only one consumer at a time in a single subscription digesting a topic partition. That's pretty limited, but it can get you started, and it shows off the features around streaming. There are failover subscriptions that have multiple consumers, with one elected the master. So that's another way to leverage streaming. And then there's a shared subscription, where you can add as many consumers as you like without adding additional partitions.
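The three subscription models can be sketched as dispatch policies. This is a toy model of the semantics, not Pulsar's API; the function and mode names are mine:

```python
def dispatch(messages, consumers, mode):
    """Toy dispatch of a topic's messages under Pulsar-style
    subscription modes. Returns {consumer: [messages]}."""
    out = {c: [] for c in consumers}
    if mode == "exclusive":
        # Only one consumer may attach to the subscription at a time.
        if len(consumers) > 1:
            raise ValueError("exclusive subscription allows one consumer")
        out[consumers[0]] = list(messages)
    elif mode == "failover":
        # Several consumers attach, but one elected master gets the
        # messages; the rest are idle standbys ready to take over.
        master = consumers[0]
        out[master] = list(messages)
    elif mode == "shared":
        # Messages are spread across all consumers, queue-style,
        # without needing more partitions.
        for i, m in enumerate(messages):
            out[consumers[i % len(consumers)]].append(m)
    return out

print(dispatch([1, 2, 3, 4], ["a", "b"], "shared"))
# {'a': [1, 3], 'b': [2, 4]}
```

Shared mode is the one that trades away per-key ordering in exchange for horizontal consumer scaling, which is why it "shows off the queuing features" rather than the streaming ones.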
So that will show off the queuing features of Pulsar and Streamlio. In persistent topics, all messages are durably persisted on disk, multiple disks if there's a broker cluster, and Pulsar manages the cursor to ensure messages are not removed until all subscribers have acknowledged receipt. So I'm stressing, when you get into these solutions, make sure your solution is durable. Workloads: now, as you start to size up your workloads, and frankly, size up your tests before you get into production, these are some of the things that you'll think about. The number of topics you're going to have. The size of the messages being produced and consumed; you might want to vary that size between your smallest and your largest. The number of subscriptions per topic, so subscribers, how many? You might have one for a lot of them, but test your max; I don't know what that might be in your organization, 10, 20, 100. The number of producers per topic; you definitely want to show off the multiple-producers-per-topic features of whatever you're looking at. The rate at which producers produce messages, as in how many per second; I gave you the goal of a thousand, and that's somewhat arbitrary. Maybe you don't need that, or maybe you need more. But figure it out and test it out. And the size of the consumers' backlog in gigabytes. And you want to test this, once again, over some period of time, not just at a point in time. So create a streaming application, if you're ready for that: configure the application; serialize your data, probably in JSON, but in whatever form you're serializing it; set up the tables and the broker for the change logs. If you're migrating ETL to stream processing: now, I haven't seen a lot of this, frankly, because usually stream processing is required for some newer application. But I'll just say that some of them have tried the ETL route and switched over, okay.
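Those sizing dimensions multiply together quickly, which is worth seeing in numbers. A back-of-the-envelope sketch, where every figure is a placeholder for your own workload:

```python
def daily_volume_gb(msgs_per_sec, avg_msg_bytes, producers_per_topic, topics):
    """Rough daily ingest volume for a streaming workload:
    rate x producers x topics x seconds-per-day x message size."""
    msgs_per_day = msgs_per_sec * producers_per_topic * topics * 86_400
    return msgs_per_day * avg_msg_bytes / 1e9

# e.g. 1,000 msgs/sec per producer, 1 KB messages, 5 producers, 10 topics
print(round(daily_volume_gb(1000, 1000, 5, 10)))  # 4320 (GB/day)
```

Even this crude estimate tells you whether the consumer backlog you plan to test in gigabytes is realistic, and how much disk a few days of retention will actually take.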
But it's not like many organizations are going back and changing out ETL for stream processing all over the place. That's not what is happening. But if you're doing that, for whatever reason, you need to sessionize your event data. You'll need a message bus. You'll need data storage, for example HDFS or S3. And you'll need operations support for this. And of course, the bus and the storage and the integration parts come packaged, if you will, with a lot of the solutions that I just talked about. And by the way, I just shared the open source solutions and the companies that are enriching those open source solutions. There are plenty of other solutions. But I've given you enough here today to make sure that you know some of the criteria that you want to look at when you're evaluating a streaming solution, and how to size them up. Okay, the biggest challenges I've seen with streaming: getting data live at scale; augmenting that data with metadata; some of the transformation that you might do to data in that hub; misordered events, events coming in in the wrong order just because of the timing of the publishing; recovering jobs that fail; and high operational workload. Yeah, it is a high operational workload. It's much higher than many operations teams are used to dealing with. But if that's what the organization is doing operationally and you need to get a handle on this data, here you go. All right, so I have my little persistence picture in the upper right here, but let's break that down. Let's break down one of these connections. We've seen this before in the Kafka picture, but we don't have to use their language when we're looking at it generically. We have sources and destinations, or targets if you will. We have APIs that connect sources to the streaming platform and the streaming platform to the destinations. And we have an API to the streams themselves, and transformations that will occur there. This is the future of a lot of data integration here.
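Of those challenges, misordered events have the most standard remedy: buffer briefly and re-emit in event-time order, releasing only events older than a watermark. A minimal sketch of that idea, with made-up event tuples:

```python
import heapq

def reorder(events, max_delay):
    """Buffer arriving events and release them in event-time order.
    `events`: iterable of (event_time, payload) in arrival order.
    `max_delay`: how far event time may lag arrival (the watermark lag).
    Yields events in event-time order, assuming lateness <= max_delay."""
    heap = []
    for event_time, payload in events:
        heapq.heappush(heap, (event_time, payload))
        # Anything older than the watermark can no longer be overtaken
        # by a late arrival, so it's safe to emit.
        watermark = event_time - max_delay
        while heap and heap[0][0] <= watermark:
            yield heapq.heappop(heap)
    while heap:            # flush the remainder at end of stream
        yield heapq.heappop(heap)

arrived = [(1, "a"), (3, "c"), (2, "b"), (4, "d")]   # b published late
print(list(reorder(arrived, max_delay=2)))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```

The trade-off is the same one the stream processors make for you: a larger `max_delay` tolerates later events but adds latency and buffer memory.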
Point-to-point might soon be a thing of the past. So hopefully I've opened your eyes to some of the possibilities, or helped you round out what you know about streaming, and helped you be more successful. Streaming and message queuing have lasting value to organizations. They will be as prevalent as ETL was in the world of data warehousing and integration; that's my prediction, in our lifetime, by the way. APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints. Streaming and messaging will be able to meet the data volume, variety, and timing requirements of the coming years. And that means you: if you're going to be a sustainable organization, a data-driven organization, you will benefit from these technologies, because they will allow you to ingest data and operate at the scale that you need to operate at, which has not always been possible. All right. Shannon, that concludes the presentation portion. If there's any Q&A, we can do that. But I'll take this opportunity to remind everyone that Advanced Analytics happens the second Thursday of every month at this time, same bat channel. And I'll be here next month presenting something that escapes my mind right now, but I'm sure it will be good and along the theme of advanced analytics. Shannon, any questions? I love it. If you do have questions, feel free to submit them in the Q&A section in the bottom right-hand corner of your screen. And just to answer the most commonly asked questions: just a reminder, for this webinar I will send a follow-up email by end of day Monday with links to the slides and links to the recording of this session. And I don't have any questions, and yeah, I think everyone's still on vacation from the Fourth. You know, it's been a little quiet this week in all of our webinars; everyone's catching up. But August is embedded.
The topic for August is embedded data science, trends, and databases at the edge, which is going to be fantastic. I love that. Looking forward to that, yes. Everyone's so quiet. Do you want to know what the weather's like? What's going on? No, it's just hot here in Texas; you don't need to hear about that. Oh, we've got a question here. So, what about Spark? Ah, Spark. Yeah, I don't know if I see it as a streaming solution so much as I see it as a programming framework, but certainly it can handle a lot of what I talked about. And I know some people are managing their streaming data in that way. I've not done so myself, so I can't speak with firsthand knowledge about Spark in this way. We use it more or less as a programming tool for getting data out of Hadoop clusters really fast. But yeah, you can throw that in the mix too as a streaming solution, although it's not a streaming platform per se; it's more multi-purpose than that. Right. All right. Well, we will give some time back to everyone. Thank you so much to all of our attendees. And William, thank you as always for another great presentation. Indra says it's just been great. Yeah, I reiterate Indra's comment there. So again, I hope you can join us in August, and we will see you on the flip side then. Hope everyone has a great day. Thanks, everybody. Thanks, William. Thank you. Bye-bye.