Hello, everyone. Welcome to my talk. The title is Optimizing Speed and Scale of Real-Time Analytics Using Apache Pulsar and Apache Pinot. I'm representing both DataStax, the company I work for, and StarTree, which was supposed to be represented by my co-speaker Karin Wallach, but she can't be here today, so I'll be covering her part as well. A quick note: this talk is only 30 minutes, so it will be a bit compressed, and I welcome any questions afterwards, either here on site or offline. Thank you very much for coming. First of all, if you're interested in this slide deck while I'm talking, the QR code is here. Very quickly about DataStax: its flagship product is managed Apache Cassandra in the cloud, called Astra. The company has been in business for 10 years and is based in California. About a year and a half ago it adopted Apache Pulsar, which now also powers a managed cloud platform called Astra Streaming, with Pulsar under the hood. StarTree is a startup company; if you're familiar with Confluent, Tim Berglund, who was the DevRel director at Confluent, is now at StarTree. They do open analytics, and the technology they build on, Apache Pinot, is what powers much of LinkedIn's online real-time analytics. Okay, with that, let me start. Very quickly about myself; I won't introduce myself at length, since presumably you were here this morning for my keynote. For those who weren't: I'm a streaming developer advocate at DataStax. Before that I was an engineer for over 20 years, mostly in Java, and before that some C and Unix. I became an advocate five years ago at IBM, and now I'm at DataStax. I'm also a Java Champion, and I'm based in Chicago.
I also run the Chicago Java Users Group. Since Karin isn't here, let me quickly introduce her, because the StarTree slides are hers. Karin is head of developer community at StarTree, and she's a very energetic person; it's too bad she can't be here today. Okay, so today's agenda. I want to start by understanding the problems we're trying to solve with Apache Pinot, paired with Apache Pulsar. I don't have a demo use case to show you today, though future talks will include one; for now, the idea is that you can use Apache Pulsar as a data source to ingest events, which then feed into Apache Pinot for real-time analytics work. So we'll try to understand the problems first, look at some examples and a quick background on the evolution of real-time analytics, and then look at an open-source solution: Apache Pinot for the real-time analytics and Apache Pulsar for the real-time data ingestion. That's pretty much the plan for the next 30 minutes. Let's start by talking about the types of analytics use cases. What we're used to, going back maybe 20 years in the IT industry, is statistics displayed on dashboards, maybe on a company website, for marketing and for business decision makers. They were using BI (business intelligence) tools, and a lot of the time they were doing their analysis from a data warehouse.
So that was very much not real time: batching up all the information, processing it, and making sense of it. That's the older way of doing analytics we've come from. Times have changed, and now we talk about machine learning and artificial intelligence. Machine learning can ingest real-time data and essentially refine your data model as the data comes in, which requires processing huge volumes of data in real time. MLOps, for example, is like DevOps but adds the machine learning data part: doing real-time analytics, immediately retraining the model, and immediately doing rapid deployment. That's the theory, anyway. Okay, so that's machine learning. But today's particular example is more about user-facing analytics. A very good example is LinkedIn: on your profile you see statistics like who has viewed your profile, how many people have viewed it, and all sorts of other information about your account. That is powered by real-time analytics software; LinkedIn uses Apache Pinot, and in fact Apache Pinot came out of LinkedIn engineering. The company called StarTree is now developing it further. Okay. This diagram goes back over the same ground: dashboards and BI tools deal with data that's already in your warehouse; you pull it out, do analytics, and draw pie charts and graphs and all sorts of things. That's BI tools.
And this one is machine learning. How many of you are actually working with machine learning? I'm just curious; maybe you're familiar with having to ingest huge amounts of data. My understanding is that a lot of machine learning in production isn't actually using real-time ingestion, because of the sheer volume of data involved. I've talked to some folks in machine learning engineering who said they simply cannot ingest terabytes of data as a stream; today's streaming systems may still fall a little short there. But theoretically, once the technology reaches maturity, you could do real-time processing over truly huge data sets. Okay, so this one is the user-facing analytics. I already talked about your page, your items, and the real-time messaging: the system immediately analyzes your activity and surfaces who viewed your profile and the impressions of your posts, in near real time, though not strictly real time by some definitions. Okay, this slide is basically about why you should care. In a business, monitoring used to mean an ops team watching these things. As businesses get more sophisticated, you move into analyzing business insights, and you get into the analyst side of things.
Essentially, the goal of StarTree, from what I know from Karin's explanation, is to shift the power: to put the data in the hands of the end user. Previously, all these BI tools and analyses were really for big corporate teams; they were the only ones with access to the data. Now you want to open up the insights, and also the monetization, for business entities trying to make more money. You want to shift the power to the end user, essentially. That's what gives this area of computing its importance: you can now analyze your own data and monetize it as well. So this whole chart is basically saying that business has evolved over the years, and we want to shift toward doing everything more in real time. Okay. This one is the LinkedIn example I already mentioned: who viewed your profile. It's very much near real time; it does the analysis on your profile as activity happens, and you don't have to wait for the information to come out. Karin told me that the total number of users shown here is 700 million, but LinkedIn has already passed that, since she took this screenshot a month or two ago. The queries per second keep increasing too. The latency target is less than 100 milliseconds, and the freshness of the data is measured in seconds. Okay. And this is another example of user-facing real-time analytics.
As you can see here, it's probably the same kind of information: who viewed your profile, views of your posts, the messaging, all done essentially in real time. Okay. Another place user-facing real-time analytics would be useful is recruiting firms. They're out there searching for candidates at any time, so they need real-time analysis. They monitor certain engineers or candidates in their database, and these talent scouts are constantly watching for your status to change. Once you mark yourself open to work, they'll reach out to whoever matches the skill set they're looking for. Another example would be Uber Eats. You want real-time service, especially for food delivery. Say everything is ordered online through Uber, and someone receives their delivery and finds dishes missing, and they want to report it to Uber right away. That's when you want the information to arrive in real time, because nobody wants to wait: the food is fresh, they delivered something and forgot something, and the kitchen wants to react immediately: okay, I forgot the deep-dish pizza from Chicago; let me bake another one and send it out. That's a very natural use case for this type of system.
Okay, there's another example from Stripe, the financial company. They use real-time analytics in financial statements and reporting, in bug and problem analysis, in detecting financial risk, and in managing their business operations, such as their liquidity: is the financial line doing well? You want that kind of financial information at hand, really up to date, so you can act on it immediately if something goes wrong. The same goes for auditing: for past actions, you want tangible analysis and results so you can react, rather than waiting until it's too late; you could be sued if you missed something in your procedures, for example. Okay, so what problem are we solving? Teams executing money movement are prime candidates for this kind of real-time analytics software, as are consumers of financial data. There are also unique challenges and opportunities here. Financial data has to have high precision and accuracy: all the aggregations must be exact, and eventual consistency really shouldn't apply when you're dealing with money. There are also many small unit transactions, and the smallest currency unit can vary, so the calculations have to be very accurate and very precise.
Granularity has to go down to the transaction level, reports must be reproducible, and frequent manual action is required to meet compliance and security requirements. So as you can see, real-time analytics can be applied heavily to financial applications, with especially strict requirements. Okay. Now let's look further into user-facing analytics. On the left side is internal analytics: company-internal analysis, where you can tolerate latency of up to seconds, freshness of seconds to minutes, and concurrency of maybe hundreds of users. Whereas with external analytics, the customer-facing side, or any kind of external analysts, you want very fast latency in milliseconds, freshness in seconds, and for concurrency you don't know how many people will come to your site to look up information, so it has to be really robust, especially on the external analytics side. Okay. Now, this one is about the evolution of real-time analytics. This landscape is changing rapidly. It used to be the older OLAP (online analytical processing) systems doing data warehousing, dealing with data at rest: you pull the data out and work with it. These days we want things done much faster: when the data comes in, you want to analyze it right away and turn it into, say, a shopping cart recommendation, things of that nature. So internal-facing analytics is moving toward user-facing analytics, and structured data is moving toward semi-structured data.
In the past we had approximate data and query consistency; now we want strong data and query consistency. The slice-and-dice queries of the data warehouse era are giving way to full SQL semantics. And where we used to handle gigabytes to terabytes of data, we now handle terabytes to petabytes; we're handling much more data in today's market. So now, say we want to build a user-facing real-time analytics system. On the ingestion side, the whole pipeline needs to be real time: very fast ingestion, high dimensionality, and very high ingestion velocity. It also has to be highly available and very scalable, operating in a cloud environment, and you want to keep costs down. On the serving side, user-facing real-time analytics requires seconds of freshness, thousands of queries per second, and milliseconds of latency. So the question mark is: who can help us with real-time ingestion? This is where we'd like to suggest Pulsar. Of course, some of you may already be using other streaming ingestion, for example Apache Kafka, and Kafka is fine; it does its job.
But I want to introduce Apache Pulsar. As a quick introduction, it uses a traditional publish/subscribe (pub/sub) architecture: a broker distributes messages at high volume and high ingestion speed. It's a producer/consumer model: a producer produces messages and sends them to the broker, labeled with a topic. Think of the broker as a postmaster in a post office: messages come in labeled, and the broker delivers them to whoever is interested in receiving them, namely the consumers, which must subscribe to the topics they care about. By default you use one broker, but you can define multiple brokers, and topics can be partitioned across different brokers as well. In a cloud environment it's very efficient, because Apache Pulsar separates compute from storage. Pulsar doesn't deal with the distributed message log itself at such high volume; it lets the bookies, which are Apache BookKeeper nodes, do that job. There's also ZooKeeper in the picture, which monitors and manages your cluster and all of your configuration data: the keeper of the zoo, so to speak. So we'd like to introduce you to Apache Pulsar, which is a unified messaging and data streaming platform.
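To make the producer/topic/consumer flow concrete, here is a minimal in-process sketch of the pub/sub pattern in Python. This is purely illustrative: the `TinyBroker` class and its dictionary-based routing are my own invention, not Pulsar's API; real code would use a Pulsar client library against a running broker.

```python
from collections import defaultdict

class TinyBroker:
    """Toy stand-in for a Pulsar broker: routes messages by topic."""
    def __init__(self):
        # topic name -> list of consumer callbacks subscribed to it
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, consumer):
        self.subscribers[topic].append(consumer)

    def publish(self, topic, message):
        # Deliver only to consumers subscribed to this topic.
        for consumer in self.subscribers[topic]:
            consumer(message)

broker = TinyBroker()
received = []
broker.subscribe("profile-views", received.append)    # consumer subscribes to one topic
broker.publish("profile-views", {"viewer": "alice"})  # producer sends a labeled message
broker.publish("orders", {"item": "pizza"})           # no subscriber: not delivered here
print(received)  # only the profile-views message arrives
```

The key point the sketch shows is the decoupling: the producer never knows who the consumers are; it only labels messages with a topic, and the broker does the delivery.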
This being Reactive Summit, I'd say Pulsar actually helps you do things in a reactive fashion. So let's look at what is driving the change. First, we need real-time data to enhance customer experiences. You can batch up your processing, but then your data may not be as fresh; with real-time data, the moment data comes in, you ingest it and put it to work. Second, as I mentioned for the machine learning cases, you can build data pipelines that ingest huge amounts of data, push it into machine learning analysis right away, and then immediately turn around, update your machine learning model, and deploy it. The idea is a constant stream of data, in huge volumes, processed really quickly. Third, we need systems that can scale to the demands of large volumes of data generated by applications operating at the edge, such as IoT devices in the field. Say you're collecting weather data from devices installed across thousands of acres of land: a pub/sub approach like this scales very well for that case.
Okay, to summarize why event streaming: we want to watch for events in the system or application, and we want to subscribe only to the topics that concern us; you don't need to subscribe to all the messages, just select the ones you want. You get the data in real time, not after the event, and you can ingest a high frequency of messages with very low latency. Okay, so let's get into event streaming by looking at the time before streaming. The diagram below shows how we used to do things: maybe we thought we were doing streaming, but you would still ingest data from somewhere, perhaps using the older ETL (extract, transform, load) process, and you had to persist the data first, writing it to disk. Any time you involve disk I/O, it adds time; even if it's only milliseconds, reading and writing take time, so the I/O decreases your performance. So what we used to do was: ingest the data, persist it, read it back, do some processing on it, write it back to a database or some other data store, and then query it from there. It works; I'm not saying it doesn't, but it's slower because of that extra layer of persisting to the database, which sometimes may not even be necessary.
With streaming, by contrast, you ingest the data and process it in memory, and only after you've transformed it do you output it to a sink, some destination you write it to, where all the querying happens. Everything gets done faster with this approach. Okay, so now let me introduce Apache Pulsar. What is it? It's an open-source project created by Yahoo engineering back in the early 2010s. At the time they were running Yahoo Finance and all kinds of other Yahoo services, and they realized they couldn't handle situations like adding another cluster in the cloud; there was no existing library to handle the scalability aspects of their infrastructure. So they decided to develop Pulsar, and over time it gained quite a number of features. In 2016 they contributed it to the Apache Software Foundation, and by 2018 it had quickly become a top-level project. It has a very cloud-native design; I say it was born with cloud-native DNA, because it assumes from the start that it will run as an event streaming platform in a cloud-native environment. It has to be very efficient, it takes care of the infrastructure, everything is cluster-based, and the architecture is multi-tenant; I'll show you that multi-tenancy approach in a few slides. And because of its pub/sub nature, client APIs are relatively simple to develop in different languages.
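The ingest-process-sink flow described above can be sketched with Python generators. Everything below (the event list, the filtering transform, the sink) is my own made-up illustration of processing in memory before a single write to a destination; it isn't any particular framework's API.

```python
def ingest(events):
    """Ingest: yield events one at a time instead of batch-loading them to disk."""
    for event in events:
        yield event

def transform(stream):
    """Process in memory: filter and enrich each event as it flows through."""
    for event in stream:
        if event["amount"] > 0:  # drop invalid events in flight
            yield {**event, "cents": round(event["amount"] * 100)}

def sink(stream):
    """Sink: the single write at the very end of the pipeline."""
    return list(stream)

events = [{"amount": 1.25}, {"amount": -3.0}, {"amount": 0.99}]
result = sink(transform(ingest(events)))
print(result)  # only the two valid events survive, already enriched
```

The contrast with the pre-streaming diagram is that no intermediate persist-read-back cycle happens: each event passes through the transform in memory exactly once before the final write.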
You're not confined to Java; the broker is written in Java, but there are C#, Python, and Go clients, plus community contributions like Scala. Message delivery is guaranteed too: once your message reaches the broker, the broker guarantees it gets delivered to its intended target, even after the network has gone down, for example. There's also a serverless functions framework called Pulsar Functions, which lets you do message transformation as the data travels through your pipeline, so you don't need to write another function layer yourself or rely on an external library. And there's the concept of tiered storage offload: when messages on your system become stale, as the data ages out, Pulsar is smart enough to move it off to cold storage so it doesn't take storage space away from your hot storage. Okay. So again, in summary: Apache Pulsar is an open-source distributed messaging and streaming platform, very cloud-native in nature. These graphs show its increasing popularity on GitHub, reflected in the number of contributors and GitHub stars, and these are just a sample of the companies using Pulsar; it is still growing in importance. This next slide I've actually already covered. Oh, and this one: Pulsar takes on a very traditional multi-node architecture, with its common challenges, but when you need to scale your system it's actually very nice, because it separates compute from storage, and storage is handled by Apache BookKeeper.
When it needs to scale, Pulsar takes care of spreading out all of your topics accordingly, so you don't need to worry about it. If you've used Kafka, you know how hard scaling can be: a lot of manual work to shift things around. Pulsar has this built in; it knows how to rebalance your topics, and that part is really important. Everything is very loosely coupled in this structure. I know I don't have much time, but I want to quickly highlight that the separation between compute and storage is one big thing, and that Pulsar also supports geo-replication. Geo-replication, quickly, means you can have data centers scattered throughout the world and replicate between them, including active-active, and you can do selective message replication: for some regions of the world, you may not want data replicated everywhere else; for example, under GDPR in Europe, you don't want to replicate data to regions that aren't allowed to have it. That's all built in. Another differentiator is multi-tenancy as a way of organizing your data: within your cluster, it's built in that you can organize data under different tenants, with different departments managing their own namespaces. It helps you organize all of your data better. Another thing is the message processing model on the subscription side: you can use an exclusive subscription mode, which essentially turns Pulsar into a message queue, because one producer sends data to only one consumer.
Or you can have failover, with one primary consumer and other consumers backing it up in case the primary goes away. There's also the shared subscription, in which you can have more than one consumer on the subscription, but shared mode doesn't guarantee message order; if you want ordering, use the key-shared mode instead. Okay, real quick: this one is data pipelines, which I think we already understand, so I'll go through it quickly. Pulsar Functions takes on the spirit of AWS Lambda: very lightweight, letting you do message transformation as data travels through the pipeline. There's also Pulsar schema: if you want to define your data without worrying about serialization and deserialization, you can use Pulsar schema, and it can also keep track of changes to your data's shape. And there's Pulsar IO, an SDK for writing your own source or sink connector if one isn't already preconfigured; for example, Elasticsearch is already a preconfigured sink, but if what you need isn't there, you can write your own Pulsar IO connector. Okay. Real quickly on DataStax: we have Astra Streaming, which is managed Pulsar in the cloud, and Luna Streaming, which you deploy yourself, either with our enterprise-supported Pulsar or completely open source. Okay. So this brings us back: with Pulsar as the ingestion layer, we'd like to suggest combining it with Apache Pinot. As a quick summary, Apache Pinot can ingest from many sources, and I want to highlight that Pulsar has been added alongside Kafka. Pinot takes care of everything from there.
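To give a feel for the Pulsar Functions idea mentioned above, here is a minimal sketch in Python. Pulsar's Python Functions API does center on a class with a `process(input, context)` method, but the `CurrencyCents` function itself, and calling `process` directly with a `None` context, are my own illustration for this talk, not code from Pulsar's docs; a real function would be packaged and submitted to a running Pulsar cluster, which then invokes it on every message flowing through a topic.

```python
class CurrencyCents:
    """Sketch of a Pulsar-Function-style transform.

    In a real deployment this class would implement the Pulsar Functions
    interface and run inside the cluster; here we only mimic the
    process(input, context) shape so the transform logic is visible.
    """
    def process(self, input, context):
        # Transform the message in flight: "1.25" -> 125 whole cents,
        # so downstream consumers never handle floating-point money.
        return round(float(input) * 100)

fn = CurrencyCents()
print(fn.process("1.25", None))  # -> 125
```

This is the appeal the talk describes: the transformation lives in the pipeline itself, so you don't have to stand up a separate service just to massage messages between a producer and a consumer.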
The thing I want to point out is that Apache Pinot has the star-tree index. That's the algorithm that makes analyzing all of the data really, really fast, and it serves use cases like business intelligence, data products, and anomaly detection. All right. I probably won't have time to go through this one, sorry, but I wanted to say that the star-tree index is what does the index optimization in Pinot; if you want to read up more about it, you're welcome to. I basically have to skip ahead now. Here is a link to some resources if you'd like to look at Pinot, and they also have a Slack channel you can join. And here are the resources for Apache Pulsar; these are all the links, but I want to highlight the middle one, astra.datastax.com. You can get a free account with a $25 credit to try out our Astra managed cloud platform, and the credit refreshes every month for more than a year, so you can build personal projects and try out our managed Cassandra and managed Pulsar. If you're interested, there's additional credit you can get by visiting that link and using that code. We also have Five Minutes About Pulsar on YouTube: bite-sized nuggets of information; visit the DataStax Developers playlist to find them. I also do a Twitch stream every Wednesday, usually, if I'm not traveling; if you want to follow me or join my stream, I invite people on to share their projects. And then, join us in the neighborhood.
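As a rough intuition for what the star-tree index buys you, here is my own simplified illustration of pre-aggregation; this is not Pinot's actual data structure, and the dimensions and numbers are invented. The idea is that aggregates for combinations of dimensions are materialized at ingestion time, so an analytics query becomes a lookup instead of a scan over raw rows.

```python
from collections import defaultdict

# Hypothetical raw events: (country, browser, views)
rows = [("US", "chrome", 3), ("US", "safari", 2), ("DE", "chrome", 5)]

# Precompute sums for every subset of the dimensions ("*" = aggregated out),
# loosely mimicking what a star-tree-style index materializes at ingest time.
cube = defaultdict(int)
for country, browser, views in rows:
    for key in [(country, browser), (country, "*"), ("*", browser), ("*", "*")]:
        cube[key] += views

# Query time is now a single dictionary lookup instead of a row scan.
print(cube[("US", "*")])      # total US views -> 5
print(cube[("*", "chrome")])  # total Chrome views -> 8
```

The real star-tree index is considerably smarter (it bounds which combinations it materializes and controls the space/latency trade-off), but the shape of the win is the same: pay at ingestion so user-facing queries stay in milliseconds.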
This is the Apache Pulsar Neighborhood: a wiki page and a meetup.com group; the meetup groups are a little quiet right now. And with that, I want to thank you for spending the time on my talk. This is my contact information: my LinkedIn profile, my Twitter account, and my Discord; I have a Discord server too. And this is a picture of Karin and me in Zurich, when we went to the Voxxed Days conference, along with her contact information, if you want more about Apache Pinot and StarTree. With that, I want to thank you, and I hope you enjoy the rest of the conference. Thank you. We're supposed to have time for questions, so if you have a question, please feel free. Also, I forgot to mention: I have some Apache Pulsar stickers up here if you want some; you can take more than one if you like. So please feel free.