Hey everyone, thanks for joining us at the Open Source Summit Latin America. I'm Karin Wolok, head of developer community at StarTree, and this is my colleague. Hi, my name is Mark, I'm a developer advocate at StarTree. I'm just going to share my screen — hopefully that's working. Yes, the slideshow button. So we're going to be talking about real-time analytics: going beyond stream processing with Apache Pinot. I'm going to let Karin get us underway.

To kick things off, I'm going to start with some basic fundamentals about real-time analytics, the trends, and how it's changing the dynamics of businesses. Generally speaking, there are three different types of analytics use cases. You can skip to the next slide.

The first is probably the most commonly known when people think of analytics: dashboards and BI tools. These are often used internally at organizations, typically accessed by BI analysts and other internal folks. You can go to the next slide.

The next one is user-facing analytics. User-facing analytics is essentially when organizations provide their end users or their customers with their own real-time analytics. We'll dig into this a little further — it's going to be a big chunk of what we're speaking about today.

The third one is machine learning. Here the analytics are often consumed by some kind of system on the other end — you bring the real-time data in and a system processes it. This could be things like anomaly detection or automated fraud detection, any kind of machine-learning-powered analytics.

Going further into user-facing analytics and how it has changed the dynamics of businesses over time: it used to be that organizations were primarily using analytics for internal purposes, like those BI analysts — and they still are, and it's still very powerful. But as time progressed, a lot of organizations started providing their end users with their own real-time analytics, essentially empowering their users to read into these analytics and make decisions. So much so that this has actually been transformed into services — they've productized it and created products that provide their end users with real-time analytics. There are a lot of organizations with premium services where end users get access to their own real-time analytics, which is really powerful.

The other thing is actionable insights. I think this is really important to take away, because this is how you empower your end users or your customers. Actionable insights means providing your end users with real-time analytics and then allowing them to take immediate action upon gaining those insights. We can dig into this a little further with some examples.
So essentially, when a user gets some kind of information or insight, they have the ability to take action right away. That increases your user engagement, but it also empowers your users, and this is where premium services that provide end users with real-time analytics become really powerful for those users.

So we'll go through a couple of use cases. The first one is LinkedIn — they do a lot of end-user real-time analytics. The most popular one LinkedIn has (I think LinkedIn is at over 800 million total users now, which is insane) is Who Viewed My Profile. Chances are most of you have experienced it. It lets you see in real time who viewed your profile, but it also gives you the ability to slice and dice that data by country, company, and things like that. So the real-time analytics experience for the end user is really dynamic, which increases engagement — people love the Who Viewed My Profile feature.

The other thing to point out on this one is that it has a little premium icon in the corner: they've actually made a product, a moneymaker, out of data they had anyway, layered on top of the free version that everybody sees. They're able to package it back to you and say, hey, here's what's happening right now, and you go, oh, that's cool. It's a major moneymaker. I was reading somewhere — I don't know if this is accurate — that something like 40% of accounts are premium LinkedIn accounts, which is crazy.

Another example of real-time analytics for end users, with a bit of a twist but the same underlying requirements, is the news feed. When a user goes to their homepage, they have to see relevant information — things they haven't seen before — and ideally it's already been processed by some algorithm to make sure it's actually relevant to them. That system has to be built with the same kind of pipelines as providing end users with real-time analytics: you have to be able to take all this data in, and we'll explain how to systematize that. You can click again.

And then the big productized real-time analytics product LinkedIn has is Talent Insights, their recruiter-facing product. It gives recruiters a real-time look at what's going on in the market — the trends and all of those insights. These insights are very powerful for their end users: they can see something and take action on it immediately. So that's a few different ways LinkedIn has productized real-time analytics, as Mark was mentioning.

Another example is Uber Eats. Uber Eats has over 500,000 restaurant owners, and as part of their premium features they've given those restaurant owners access to their own real-time analytics. On their analytics they can see missed and inaccurate orders, top-selling items, things like that.
And going back to actionable insights, where these end users have the ability to take action: if a restaurant owner sees that something is going wrong — there's a missed order, or an item is getting a lot of thumbs-down feedback — they need to be able to act immediately. They can't wait until all the batch data comes in and becomes queryable; they need to know now. And that experience is much more positive when you can provide them with real-time analytics that are almost instantaneous. You can click to the next one.

Another example is Stripe. I really like this use case because it's not strictly user-facing analytics, but in terms of the components it needs, it's very similar. Stripe's situation is this: they have dozens of engineering teams that work on different sides of the product, and a lot of the data is real-time — transactions constantly feeding in. They also have a lot of sensitivities around this data: everything has to be accurate, compliant, secure, and so on. So essentially what they do is take all the data these different engineering teams are working on, pull it in, and provide it internally to a bunch of other teams. It combines user-facing analytics and BI analytics: there are fraud detection teams, accounting teams, marketing teams, and so on — the backend teams working on the functionality of the application and the people who need access to this real-time data to make decisions. Sometimes it's even automated: fraud detection can be triggered by anomaly detection inside your analytics. They have to be able to do all of this in a streamlined way, and it all has to be in real time. So I really like this use case; I think it's a good example.

So let's talk about the properties of a real-time analytics system. There are three things you essentially need to build a solid one. The first is speed of ingestion: you have to pull your data in very quickly — as soon as something happens, you need to ingest it and record it. The second is speed of queries: even as you're pulling this data in, you also have to make it accessible to your end users in real time, so it's pulling it in and then pulling it back out, because you have to actually show these analytics. The third is that you have to do this at scale: in some cases hundreds of thousands of users are querying this real-time data at any given second. So: pull it in, pull it out, and show it at scale, all in real time — which is a fairly complex problem, and we're going to talk about how to build it.

Yeah, so the heart of the system we're going to use to demonstrate all of this is Apache Pinot and Apache Kafka.
So those sit in the middle of the rectangle on this slide, and around the outside you can see the properties we just talked about. On the left-hand side, we can do real-time ingestion of very high-dimensionality data — very wide tables with lots of columns — and get that data in very quickly. We can store it cost-effectively and in a scalable way, with high availability on the data itself. And then once we've got it in, we need to get it out again quickly: it's no good ingesting it really fast if you then have to wait 10 seconds for a query to come back. Finally, we want to handle thousands of queries per second, because we're either going to have lots of users — we're building a product for them — or there's going to be some sort of metrics dashboard firing lots of queries, pulling in lots of different charts at the same time.

Just to give a quick intro, since we're not going to assume you know what Kafka is — and it doesn't have to be Kafka in these systems; it can be pretty much any streaming engine. Generally you have a producer: someone generates messages and puts them onto Kafka through a broker, onto a topic. A topic is almost like a named stream of messages — you could have one topic for events, another for people, another for something else. Topics can be split into partitions, which lets you scale the production and consumption of messages. Consumption is the other side, taking those messages off, and that can happen in parallel as well. Kafka also has the concept of consumer groups, but Pinot doesn't actually use that concept — it has its own way of tracking exactly where it has read up to in each Kafka partition. And Karin's going to give a quick explanation of Pinot.

Oh, you're muted. Yes, I muted myself while you were talking about Kafka, sorry. So, what is Apache Pinot? Apache Pinot is a distributed OLAP data store. Essentially, it gives you the ability to pull in data from a variety of different sources — streaming and batch, so things like S3, and on the streaming side Kafka, Pub/Sub, or Kinesis, as Mark mentioned. You pull the data into Pinot, and Pinot has a bunch of very powerful indexing and aggregation capabilities. It lets you merge all this data into one consolidated view, apply those indexes, and make the data quickly accessible — in real time, to end users, at scale. It also handles a lot of dimensionality, which, as Mark mentioned with very wide data, is important for complex analytical queries.

Okay, cool. So let's do a quick run through the architecture and how this works, starting with the ingestion side. Data is coming in on the left — lots of data coming in — and there's a Pinot controller sitting at the top, managing the whole cluster and everything that's going on, handling the metadata and working out where everything is going to go.
Ingestion then goes into the Pinot servers, which serve and host the data itself. The cluster uses ZooKeeper as the storage layer for its metadata, and it keeps track of segments — segments are Pinot's version of partitions, if you like. Each segment is assigned to a server, so in this diagram segment one goes on server one, segment two on server two, and so on. And if we've set a replication factor, each of those segments might be on multiple servers, so if one of them went down we could still serve it from another.

That's the data ingestion side. On the query side we have the concept of a broker: we send our query to the broker, the broker scatter-gathers the request out to the servers that have the appropriate segments, gets the results back, aggregates them together, and serves the result back to the client.

And now we're going to have a look at a demo. This is the architecture: we're going to build ourselves a real-time analytics dashboard. I guess this would be an example of a BI dashboard used internally, but the way it's designed you could easily swap one of the components and make it user-facing — or maybe even use this one as it is. The idea is: we're going to take a streaming API, process it with Python, put the messages onto Kafka, have Pinot consume the messages from Kafka, and finally show how to visualize and query that data via Streamlit, which we'll explain when we get there.

Oh, actually, I forgot — there's one more slide explaining what the dataset is. The dataset we're going to use is the Wikimedia recent changes feed. It's a continuous stream of everything being changed on Wikimedia — all the pages being updated, the changes being made, plus their metadata; everything is captured. Those messages are published — Wikimedia actually stores them internally in Kafka — and then exposed over an HTTP endpoint that uses the server-sent events protocol, which lets you stream data. This is an example of what it looks like: you get an event, an ID, and the data. The part we're particularly interested in is the data: it has a schema and a metadata section, and we've got a length — what was the length before and after the revision — and some other data as well.

So let's now go over to Visual Studio Code. What we're going to do is process this stream. First, let's copy the URL in here and have a look at what it actually looks like. There we go — same as what we saw before, loads and loads of messages coming in, and each one has an event, an ID, and data. I'll close that down so my Chrome doesn't hate me. In our script we're using the requests library: we create a streaming request, and once we've got the response, we wrap it inside the SSE client — a Python library — which gives us an effectively infinite stream of events, roughly like the sketch below.
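For reference, here is a minimal sketch of that stream-consumption step. It assumes the sseclient-py library and the public Wikimedia recentchange endpoint; the fields printed at the end are just illustrative.

```python
import json

import requests
import sseclient  # pip install sseclient-py (assumption: this is the SSE client being described)

# Wikimedia's recent-changes feed, exposed over server-sent events
URL = "https://stream.wikimedia.org/v2/stream/recentchange"


def recent_changes():
    # Open a streaming HTTP request and wrap it in the SSE client,
    # which yields an effectively infinite stream of events
    response = requests.get(URL, stream=True)
    for event in sseclient.SSEClient(response).events():
        if event.event == "message" and event.data:
            yield json.loads(event.data)


if __name__ == "__main__":
    for change in recent_changes():
        # Each event carries the page title, the user, the wiki domain, lengths, timestamps, etc.
        print(change.get("title"), change.get("user"), change.get("meta", {}).get("domain"))
```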
So we can then just run it — there we go, all the messages streaming in. It's the same as what we had before, nothing different from what you could see in the browser, but now we've got the messages in Python.

The next step is to get them into Kafka, and we're going to use this wiki-to-Kafka file to do that. Most of this code is the same, but we're introducing a Kafka producer here, which points at where Kafka is running — locally on port 9092. Then we loop over all the events and publish them into Kafka, and each message lands on one of the topic's partitions. Every hundred events we flush, so anything still sitting in the producer's queue gets pushed out and makes its way to the broker. (There's a rough sketch of this producer step a little further down.)

Before we run this script, we're going to create the Kafka topic ourselves. If we didn't, Kafka would actually create one for us, but with a single partition, and we want to see how this works with multiple partitions. There we go — we've got our topic set up, and now we can run the script. This puts the data from Wikimedia into Kafka. There we go, we've got 100 events. We can check how many messages are in there via the partition offsets: 70, 72, 74, 78, 96 — you can see it going up. And we can also have a look at the messages themselves if we want to, so we can copy this query here — there are the messages coming in, and they're coming in really fast. So we've gone from Wikimedia, through the Python script, into Kafka: the whole middle of the application is working, and the demo is going well so far.

The next bit is: can we get that into Pinot? The first thing we need before putting data into Pinot is a schema. The schema just defines what the table looks like — the table that's going to take in this high-dimensionality data. We specify each column and its data type: we've got id, wiki, user, title, and then a datetime column, which is a timestamp. You can also specify a field type — you can see we've got dimension fields and datetime fields — and that's metadata that helps the query optimizer when it's running.

The schema then goes with a table. You can see our table here: it's called wikievents, and it's declared as a REALTIME table, which tells Pinot, hey, I'm going to be streaming data into this. Pinot will then expect some sort of streaming config — if you don't provide it, you'll get an error. If we scroll down, you can see this is the Kafka config: we're saying, I want to connect to Kafka, Kafka is running on kafka-wiki (that's the Docker container name) on port 9093 — 9093 is where we've got it inside Docker and 9092 outside — and the topic is wiki_events. The only other interesting thing in this bit is down here, where we say: flush and create a new segment every thousand rows. Obviously, in a production app you'd set that much higher.
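Before going on with the rest of the table config, here is a rough sketch of that wiki-to-Kafka producer step. It assumes the kafka-python client, the broker exposed locally on port 9092, a pre-created wiki_events topic, and the recent_changes() generator from the earlier sketch (assumed saved as wiki_stream.py — that module name is hypothetical).

```python
import json

from kafka import KafkaProducer  # pip install kafka-python (assumption: this is the producer library used)

from wiki_stream import recent_changes  # generator from the previous sketch (hypothetical module name)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # Kafka as exposed outside Docker
    value_serializer=lambda message: json.dumps(message).encode("utf-8"),
)

for count, change in enumerate(recent_changes(), start=1):
    # Each message lands on one of the wiki_events topic's partitions
    producer.send("wiki_events", value=change)
    if count % 100 == 0:
        # Flush every hundred events so buffered messages are pushed out to the broker
        producer.flush()
        print(f"Produced {count} events")
```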
Back in the table config: we're only flushing every thousand rows so that you can see what happens. The other thing is that we reference the schema, and we need to specify a time column name. That's important for the naming of the segments and for how Pinot keeps track of what exactly is in each segment, so we specify the time column name and Pinot uses that.

The last thing is some transformation functions. If all the columns in your data source — i.e. on our Kafka topic — matched the schema names exactly, we wouldn't have to do anything. But in this case we've got some nested content under meta: meta.id, meta.stream, meta.domain. So that's what we're doing here: pull out meta.domain and map it to the domain column, and the same for the other fields. And finally, the timestamp we get from Wikimedia is in epoch seconds, but Pinot assumes in all of its calculations that timestamps are in epoch milliseconds, so we multiply that field by 1000.

Now that we've gone through all that, we can create the table — we paste the command in down here and run it. Then we can go over to the browser, and if we click on the Query Console button you can see we've got a wikievents table. At the moment it has 45 documents, so it's catching up — loading in all the messages that arrived while we were running that command — and you can see more and more documents getting loaded. We can even group that: say, count the changes by domain, grouped by domain and ordered by count descending. We can see that most of the things being changed are on Wikidata, which is their metadata site; after that it's Wikimedia Commons, which I think is the images; and then we get to the English Wikipedia.

Okay, so that's cool. We can use this UI for exploratory queries and it works pretty well, but if we want to build an application, we obviously don't want to do it through here. Instead, we're going to look at a tool called Streamlit. Streamlit is a Python web framework, I suppose is the best way of describing it: it lets you write everything in Python and then effectively generates an interactive web page for you.

We've got a few different versions of this dashboard, so we'll start with the first one. If we run that streamlit run command, it opens up the Streamlit dashboard — there we go, I'll leave that on screen for a moment. At the top we've got what Streamlit calls metrics: one showing how many changes were made in the last minute, with the number below it showing how that compares to the previous minute — so the last minute versus the minute before that. Then it shows the number of users who made changes and the number of domains as well. If I just minimize that window — this is all using the Python client. You can see we do from pinotdb import connect, create a connection down here, and then run a query. And this is quite a cool thing you can do: you can actually do a count and a filter at the same time.
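That count-plus-filter pattern looks roughly like the sketch below. It assumes the pinotdb Python client, a Pinot broker on the default query port 8099, the wikievents table from earlier, and a millisecond time column called ts (the column name and exact aliases are assumptions, not the dashboard's actual code).

```python
import pandas as pd
import streamlit as st
from pinotdb import connect  # pip install pinotdb (assumption: this is the client used in the dashboard)

# Connect to the Pinot broker, which fans the query out to the servers holding the segments
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# One query, several aggregations, each with its own FILTER clause:
# the last minute versus the minute before that
curs.execute("""
    SELECT count(*)            FILTER(WHERE ts > ago('PT1M'))  AS changesLastMin,
           count(*)            FILTER(WHERE ts <= ago('PT1M')) AS changesPrevMin,
           distinctcount(user) FILTER(WHERE ts > ago('PT1M'))  AS usersLastMin
    FROM wikievents
    WHERE ts > ago('PT2M')
""")
df = pd.DataFrame(curs, columns=[c[0] for c in curs.description])
row = df.iloc[0]

# Streamlit "metric" widgets: a big number plus a delta against the previous minute
st.metric("Changes (last min)", int(row.changesLastMin),
          delta=int(row.changesLastMin - row.changesPrevMin))
st.metric("Users making changes (last min)", int(row.usersLastMin))
```

Run under streamlit run, each st.metric call renders one of the number tiles at the top of the dashboard.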
In the actual dashboard script, we've got our big filter down on line 21 — that's "get all the records in the last two minutes" — and then further filters inside the aggregations that say "show me the stuff in the last minute" and "show me the stuff in the minute before that". We do the same for the total number of events, the number of users who made changes, and the number of domains. It then just goes down the page, and we create each metric using this code. The chart is just doing a slightly different query: this one groups the data by minute, finding the number of changes per minute.

At the moment, if we want to see an update, we have to go and refresh the page manually. Streamlit will automatically pick up a code change, but what we'd really like is for the page to refresh itself automatically — and that's what version two does. So we kill this and launch version two instead; if I just come over here and change that to say version two, it opens in the browser as well, and I'll show you the difference.

Version two has a couple of differences. We've got this code here that checks whether we've set a sleep time — how long it should wait before refreshing — and whether auto-refresh is on, so we've got state that carries between page refreshes. Then right down at the bottom we check: are we auto-refreshing? If so, we call Streamlit's experimental rerun function, which runs the whole page again from the top, waits, then does it again — it just keeps going. The feature you get is what you can see up here: the last-update time, 20:29:11, then 20:29:14 — it's changing, so the page is updating. And you'll notice the right-hand side — the current minute — is always going to be slightly lower, because we're not all the way through it yet; we're probably only about 20 seconds into this minute. So that's what we can do with Streamlit, and you could substitute a different dashboarding tool if you wanted. For now, we're going to go back to the slides and head towards the conclusion of the talk.

Just quickly running through how this works — how is this data being stored? As we saw a few slides back, the data is stored in segments. We can have lots of different segments on a server, with copies of those across servers, and for us we said it would rotate to a new segment every thousand rows. So what does a segment look like? Pinot is a column store, so all the data is stored by column: all the data in one column sits next to each other — all the country values together, all the browser values together, and so on. That means aggregation queries, or scans over a particular column, can be done very quickly. The idea is that we're not returning every column — the benefit of storing data by column is that you only really load the data for the columns you're interested in. We can then apply indexes on top of that to make things even faster.
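Concretely, those indexes are declared per column in the table config. Here is an illustrative sketch, written as the Python dict you would serialize into the config JSON — the column choices and star-tree settings are assumptions for the wikievents table, not tuned values, and the individual index types are described next.

```python
# Sketch of the index-related section of a Pinot table config (serialized to JSON when applied).
# Column names follow the wikievents demo table; values are illustrative only.
table_config_fragment = {
    "tableIndexConfig": {
        "invertedIndexColumns": ["domain", "user"],  # fast equality filters, e.g. domain = 'en.wikipedia.org'
        "rangeIndexColumns": ["ts"],                 # fast range filters on the time column
        "starTreeIndexConfigs": [
            {
                "dimensionsSplitOrder": ["domain", "user"],  # dimensions the tree splits on
                "functionColumnPairs": ["COUNT__*"],         # pre-aggregated measure: count of rows
                "maxLeafRecords": 10000,                     # cap on rows scanned under any tree node
            }
        ],
    }
}
```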
So we can do a range index — a lot of these are very common database indexes — we can do full-text search, we can do a JSON index. And then there's quite a cool one called the star-tree index, which is what the company Karin and I work for is named after. It's a kind of selective pre-aggregation, and we'll run through it in a second.

Some of the more basic ones first. The inverted index is for when we want to do a filter query like "find where the country is US". An inverted index means we keep a lookup structure — you can think of it as a table — that maps each country to the document IDs, the row IDs, where it appears. So if we filter on a country, instead of having to scan the whole column, we just look it up: hey, country US — it's in all of these rows — go and get the values.

We've also got the sorted index. This one is actually applied automatically by Pinot: when a segment gets flushed — when it reaches its threshold and gets written out — Pinot looks at which columns are sorted and creates a sorted index on them. What it means is that identical values end up next to each other: in the country column, all the Canada values are next to each other, all the Japan values, all the US values. And if a column has fairly low cardinality, that saves quite a bit of space, because you can just record that Canada starts at row 0 and goes to 80, Japan starts at 81 and goes to 100 — we save ourselves from storing Japan or US over and over again.

Then the star-tree index — this is quite a neat one. It does almost partial pre-aggregation: you choose how much pre-aggregation you want, which combination of fields to pre-aggregate, and a cap on how many rows should be scanned. That gives you a kind of consistency of query performance: you say, I want the maximum number of rows scanned in a query to be 10,000, and it builds a tree, going down and splitting the data on those dimensions. Then when a query runs, you should get roughly the same performance regardless of what you're asking.

And across the whole stack, Pinot is doing optimizations to try and make things faster: at the data level, as we just talked about, when you're actually doing the filtering; clever optimizations at the storage level; and of course in the query planner, which is again trying to work out which index and which storage layout it should use. So hopefully we've shown you that this is a pretty good set of tools for building real-time analytics applications — and with that I'll hand over to Karin to conclude.

Yeah, that was awesome, Mark, thank you so much. As the slide says, the combination of pretty much any kind of streaming system plus Apache Pinot will help you build out almost any kind of user-facing real-time analytics use case. We can go to the next slide, I think. Just a little bit of background about Apache Pinot: it was originally built at LinkedIn, has since been widely adopted, and it is a very mature product.
It's used at over 100 companies, and our community is constantly growing — I think it's probably around 2,800 members or so now. And lots of GitHub stars, which we love, so if you're interested in giving us a GitHub star, please go and do that. In terms of performance, the largest Pinot instance currently being run is at LinkedIn, I believe, and it ingests over a million events per second while serving — actually, correcting myself here — 250,000-plus queries per second, all while maintaining millisecond latency. So it's highly performant for both ingestion and querying at high throughput. And then we can go to the next slide.

So that's the summary — if you take away three things from the talk. First, I hope you've seen that real-time analytics lets us build really fast applications where the user can quickly see what's going on. Take the Wikipedia example: imagine the Wikipedia editors asking, why is there suddenly a big increase in activity, or a sudden decrease, or why is everyone focusing on this one topic? This type of toolchain would let them quickly identify that and go and do something about it now, rather than finding out 10 or 20 minutes later when it's a bit too late. That's the idea — it allows some sort of action straight away.

And, as Karin described a few slides back, whatever set of tools you use, you need three things. The data has to be fresh: as it comes into the streaming service we want to get it straight away and be able to query it. The queries need to be fast — mobile-app fast, web-page-refresh fast — we don't want to be waiting 10 or 15 seconds for results to come back. And for lots of people to use it, it obviously has to scale. Kafka and Pinot are a great combination for achieving those things.

So, thanks for coming to watch. I know I forgot to put you on this slide, Karin — that's okay. We've got the Linux Foundation there. If you're interested in learning more, you can have a look at dev.startree.ai — we've got a bunch of recipes and guides there. Thanks from me and Karin, and come and join us on the community Slack. Yes — and if you have any additional questions about Apache Pinot, or want to talk to some of the PMC members and committers directly, they're there too. So yeah, that wraps it up. Thanks for coming to our talk. That's the end — thanks, bye!