Okay, so I'm going to kick off with a quote from Peter Norvig, one of the big AI people at Google: "We don't have better algorithms than anyone else. We just have more data." Now, this is an overgeneralization (I'm sure they do have some very good algorithms), but there's a lot of truth to what he's saying about having a lot of data. If you have a lot of data, the final result of your calculations will be a lot more forgiving of mistakes or bad choices in your algorithms than if you had less.

This isn't always the case, but it tends to be the case when rare events have a disproportionate impact on the final result. If you have more data, you're more likely to capture these rare events. A really good example of this is web indexing: if you're crawling the web and you come across a link for the very first time, that link is going to have a bigger impact on your final index than if you had come across that link for the millionth time. And real-world data is full of examples of this kind of Zipfian distribution.

So let's look at what a modern system with massive data would look like. I'm only going to talk about streaming, because that's the direction everything seems to be going these days. Usually you have a user or an entity out in the world, and from that entity you get a stream of events. These events get fed into a stream processing engine, which applies some sort of algorithm. You end up getting a model from the stream processing engine, which then gets put into your serving system, which the user interacts with to produce more events, and you get a cycle.

A more concrete example of this would be a music recommendation system. The user is listening to music, and we're getting a stream of events for the songs the user has listened to. This stream of events is fed into a recommendation algorithm in the stream processing engine. This produces a listening model for that user, which is then served to the user in the form of recommended songs, which they then listen to, and the cycle continues.

It's important that the user events contain both live and historical data. If a user has recently discovered a new band, we want that to be reflected in the songs we recommend. But we also need to take into account that the user has been listening to music for many years and has fairly hard-to-change preferences, so we want the historical listening preferences of the user to be taken into account too.

And this data can't just be processed once. We have a whole team of data scientists who are improving on the recommendation algorithms or finding new ways to use this data, so we need to keep this data around for a long time so that they can validate that their new algorithms work. And if something is good enough to go into production, we need to rerun the whole stream of user events against it, to give the user the most accurate and enjoyable experience we can.

This kind of architecture, where everything is a stream, is usually referred to as a Kappa architecture.
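To make that cycle concrete, here is a minimal sketch in Java of the loop just described. All the type names here (ListenEvent, RecommendationModel, and so on) are my illustrative assumptions, not any particular framework's API; the point is that one code path folds both the historical replay and the live events into the model.

```java
import java.util.List;

// Illustrative types only; these names are assumptions for the sketch.
record ListenEvent(String userId, String artist, String title, String album) {}

interface RecommendationModel {
    // Fold one more event into the model: live events keep it fresh,
    // replaying the historical stream rebuilds it from scratch.
    RecommendationModel update(ListenEvent event);
    List<String> recommend(String userId, int howMany);
}

class RecommendationLoop {
    // The Kappa-style cycle: one stream, one algorithm, for both
    // historical and live data.
    static RecommendationModel process(Iterable<ListenEvent> stream,
                                       RecommendationModel model) {
        for (ListenEvent event : stream) {
            model = model.update(event); // same code for old and new events
        }
        return model; // handed to the serving system, which generates new events
    }
}
```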
But in the real world, what we often end up with is something called a Lambda architecture. In this case, live events are separated from historical events. This often happens because your live events usually come through some sort of messaging system, and a lot of messaging systems just aren't able to grow forever; they have a hard limit on how large a topic can grow. So after a certain point, data needs to be moved over to a different data store, some sort of MySQL database or something.

And then you end up having two processing pipelines. The historical pipeline runs once a day, or a couple of times a day; that's a batch pipeline, and it produces a partial model. You also have the live events, which are happening as the user is interacting with the system, but this side only keeps around today's worth of data; it goes into a stream processing engine to produce another partial model. Then you end up combining those models and serving the user with the result.

This is not ideal for a number of reasons, the main one being that you have two pipelines, which means you have two systems, or two sets of systems, to debug and maintain. You need to hire in the expertise to work on these systems, you can't easily move load between them, and it's just a lot more work than you want to do. You also need to implement your algorithms in both the streaming and batch paradigms, which is double the work.

So ideally what you want is a true Kappa architecture, and that's what this talk is about. With Apache Pulsar you can have an infinite amount of data in a single topic, limited only by your accounting department. It also has a feature where older data can be moved to cheaper storage while still keeping the same streaming interface. Finally, I'm going to talk about Pulsar SQL, which is an SQL layer that you can use to query your entire backlog without having to write a single line of code.

So, quickly, to introduce Apache Pulsar: it's a messaging system. It has both pub-sub and queuing semantics, which means it can be used in use cases where you would otherwise use something like RabbitMQ, but also in use cases where you would use something like Apache Kafka. Originally it came out of Yahoo, where it was the write-ahead logging and replication layer for their massive distributed database, called Sherpa internally, but also called PNUTS in other places, so I'm not quite sure on the naming. It's been out in the wild for maybe four years now, and it's just recently become an Apache top-level project, maybe a month ago, I think.

It's built on another piece of technology that also came out of Yahoo and is also an Apache project: Apache BookKeeper. BookKeeper has been around for many years now; I started working on it maybe seven years ago, and it had already been going for three years at that point.

Apache Pulsar gives you very strong persistence guarantees. If you write a message to a topic in Pulsar, you are guaranteed that any consumer reading that topic will see that message, in the exact same order with regard to the other messages in that stream. I'll go into that more later. The software has a lot of features, but I'm only going to concentrate on three today: unlimited topic backlog size, tiered storage, and the SQL interface.
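Before diving into those, here is a minimal sketch of what using Pulsar looks like from the client side, using the Java client's string schema. It assumes a standalone broker running on localhost; a sketch, not production code.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class QuickStart {
    public static void main(String[] args) throws Exception {
        // Connect to a local standalone broker (an assumption for this sketch).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Subscribe first, so the subscription sees messages published after it.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("my-topic")
                .subscriptionName("my-subscription")
                .subscribe();

        // Pub-sub: the producer appends to the topic backlog; the broker
        // only acknowledges once the message has been persisted.
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("my-topic")
                .create();
        producer.send("hello pulsar");

        Message<String> msg = consumer.receive();
        System.out.println(msg.getValue());
        consumer.acknowledge(msg);

        client.close();
    }
}
```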
Before we can talk about how we do infinite topic backlogs, I'm going to break down what a topic backlog actually is. When a user wants to publish to a topic, the client writes a message to the broker, and the broker then writes that message to the topic backlog. This is basically a log of all the messages that have been written to that topic. Each topic has one backlog, or in the case of a partitioned topic, each partition in that topic has its own backlog.

The broker will not reply to the client acknowledging the message until the message has hit physical hardware; we fsync on each message write. This means that if you have a catastrophic data center failure, say the power goes out and everything dies, you're still guaranteed that the message will be there when everything comes back up. Once the message is guaranteed to be persisted in the topic backlog, the broker acknowledges the publication to the client.

The topic backlog is broken into segments, and a segment can be in either an open state or a closed state. There's only one open segment in each topic backlog, and that's the current segment being written to. All previous segments in the backlog are what we call closed, which means they are immutable: nothing can be added to them, nothing can be removed from them, and the position of each message in the segment is fixed.

This gives us a very nice guarantee. If a client writes A, B, and C at position X, every consumer that ever comes along and reads forward from position X will see A, B, and C, in that order, without duplicates and without message loss. This guarantee is called total order atomic broadcast, and it's very useful in a lot of places.

Oh yeah, and since you cannot delete anything from a closed segment, if you want to remove part of your backlog because of retention policies or whatnot, you have to delete a whole segment. That's what Pulsar does if you have a retention policy that deletes data.

Each segment in the backlog is independent of the others. They don't actually share anything; they don't even know their order with regard to the other segments in the backlog, but Pulsar keeps track of that in its metadata. Part of this independence is that each segment has its own replication specification. So in this case, the segment with A, B, and C is replicated to storage nodes one, two, and three, while the next segment may be replicated to storage nodes four, five, and six. (I'll show a small sketch of this below.) This is how we manage to provide infinite backlogs: segments can go anywhere on the storage layer without being tied to a single machine, so as the backlog grows, we just keep adding storage.

So this is what the system usually looks like: we have a storage layer and a serving layer. Each topic is assigned to a broker, and as that broker adds segments to the log, it decides which storage nodes to put each segment on. This separation of serving and storage gives you very nice scaling properties. If you have a workload with a lot of read and write requests, your bottlenecks are going to be CPU and network bandwidth. In that case, what you want to do is start adding broker nodes. These are generally cheaper machines; you just need a CPU and a NIC, and you add them until you have enough bandwidth to deal with the large number of requests. However, if you have a large amount of data with not so many requests for it, say one writer and one consumer per topic, then you need to start adding storage, and these storage nodes tend to be quite expensive compared to the broker nodes, because they have big disks attached to them.
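Here is the sketch I promised of the per-segment replication specification. Pulsar's segments are BookKeeper ledgers underneath, and the replication specification corresponds to the ensemble and quorum settings chosen when a ledger is created. This uses the BookKeeper client API directly as a hedged illustration; the ZooKeeper address and password are assumptions.

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class SegmentSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the BookKeeper cluster via its ZooKeeper ensemble
        // (address assumed for illustration).
        BookKeeper bk = new BookKeeper("localhost:2181");

        // Each segment (ledger) carries its own replication specification:
        // ensemble size 3 (entries spread over 3 storage nodes),
        // write quorum 2 (each entry written to 2 of them),
        // ack quorum 2 (wait for 2 acknowledgements before confirming).
        LedgerHandle segment = bk.createLedger(
                3, 2, 2,
                BookKeeper.DigestType.CRC32,
                "password".getBytes());

        // Entries are persisted by the storage nodes before the add returns.
        segment.addEntry("message A".getBytes());
        segment.addEntry("message B".getBytes());

        // Closing the ledger seals it: it becomes an immutable, closed segment.
        segment.close();
        bk.close();
    }
}
```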
So if you want to add more storage, you just add nodes, and Pulsar will handle it. Let me give an example of how Pulsar handles this. We have a topic backlog, and each segment gets replicated to two nodes. We start adding messages to the backlog, start adding segments to the backlog, and eventually you're going to get to a point where your storage nodes are starting to fill up, and you're looking at it thinking: if I keep going like this, my system is going to fall over in a couple of hours. So you add another storage node. The serving layer will see that another storage node has been added, and it will gradually start to write new data to that node.

Note that there is no rebalancing taking place here. The old data stays on the old nodes; new data goes to the new node, and also to one of the old nodes. This allows us to avoid herding, which would take down the new node immediately: if every broker saw, oh, there's a new node, let's just put all our write traffic there, that new node is going to die very quickly. This way you can grow your backlog forever. You just keep adding nodes, and as new segments are added to the backlog, they'll be picked up by the new nodes, and you can grow forever that way.

However, this can get expensive. The replication we're using here is 2x replication, and we do this to avoid one storage node going down taking part of the backlog with it. If one storage node goes down, we are able to continue reading the segments that were on that node by reading the other node they were replicated to. But this kind of replication, which would be called mirroring, is suboptimal. In terms of space efficiency, it has 1/n space efficiency, where n is the number of replicas, so 50 percent space efficiency with two replicas. That means we can store half the amount of data we would otherwise be able to store if we weren't watching out for fault tolerance; we're basically paying twice as much for disks as we strictly need to. Well, not more than we need, because we do need fault tolerance, but we're paying a lot for it.

In this case we tolerate n minus one failures, so one failure. If we had five replicas, we would be able to tolerate four failures. But really, in Pulsar you only need to be able to tolerate one failure, because there is a background process which, when it spots that a storage node has gone down, will go and see which segments were replicated on that node and start copying from the live copy to another storage node, to ensure that the replication for those segments gets back to 2x. So we don't really want more than two replicas, and besides, with more than two replicas the space efficiency just goes through the floor: five replicas would give you 1/5, so 20 percent space efficiency. Not very good.

There are other storage schemes which have much better space efficiency, and the one people will probably be most familiar with is something like RAID 5. In RAID 5 you have a number of different blocks; in this case we have five blocks stored on five different nodes, and these five blocks are used to calculate a parity block, using an algorithm like Reed-Solomon or something like that. In the case of a failure, let's say the red node failed, that block can be regenerated by reading all the other blocks.
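To put numbers on the two schemes, here is the arithmetic written out, where n is the number of replicas in a mirror set or nodes in a stripe:

```latex
\text{Mirroring ($n$ full copies):}\quad
E_{\text{mirror}} = \frac{1}{n}
\qquad n{=}2 \Rightarrow 50\%,\quad n{=}5 \Rightarrow 20\%

\text{Striping with parity ($n-1$ data blocks $+$ 1 parity block):}\quad
E_{\text{parity}} = 1 - \frac{1}{n}
\qquad n{=}6 \Rightarrow \approx 83\%,\quad n{=}10 \Rightarrow 90\%
```

Mirroring tolerates n minus one failures; parity striping tolerates one, which, as noted above, is all Pulsar needs thanks to the recovery process.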
This is called striping with parity, or at least that's what I'm calling it, and it has really nice space efficiency: 1 minus 1/n. In this case, with six nodes, we get about 83 percent space efficiency, and this goes up as you add nodes; with ten nodes you would have 90 percent space efficiency, and you still tolerate one failure, which, as I said, is enough for Pulsar. So if you could store your topic backlog in a system like this, then as your system grew to massive topic backlogs, you would save a lot of money.

However, it's not practical to store your whole topic backlog in a system like this. To actually generate the parity block, you need to have all the other blocks complete and available. In a messaging system, you're always adding little bits of data to the topic backlog, and Pulsar doesn't actually respond to the client until the data has been persisted. If you had to wait for, in this case, five blocks to fill up before you responded to the client, your write latency would just go through the roof. So for the open segment, as I mentioned earlier, this is not practical. But closed segments are immutable and complete, and they can easily be moved to a system like this, and that's what tiered storage is about. We don't actually implement this kind of striping with parity ourselves; we use other systems where it's already been implemented, and they already do it well.

Here's how this works in Pulsar. A client starts writing to the broker, and the broker builds the backlog, putting segments on the Pulsar storage nodes. Eventually you're going to hit a threshold. We support two kinds of threshold for tiered storage. The first is a size-based threshold: when the backlog exceeds a certain size, let's say 100 gigabytes, we start moving the old segments to long-term storage. The other kind is a time-based threshold: if a segment is, say, more than two weeks old, we move it to long-term storage. So segments get moved to long-term storage and eventually deleted from the local storage nodes, which means you need fewer local storage nodes than you would otherwise.

From the client's point of view, this is transparent. When a client comes along and needs to read these older segments, say it wants to read the whole backlog from the start, it talks to the broker, and the broker knows to go to long-term storage. The client really doesn't know where the data is being stored. Writes always go to the storage nodes, because writes always go to the open segment, so long-term storage isn't even a consideration there.
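As a hedged sketch of configuring the size-based threshold, assuming the Java admin client and a namespace-level setting (the exact method names may differ between Pulsar versions, so treat this as illustrative):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class OffloadConfig {
    public static void main(String[] args) throws Exception {
        // Connect to the broker's admin endpoint (address assumed).
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Size-based threshold: once a topic's backlog in this namespace
        // exceeds ~100 GB, older closed segments become candidates for
        // offload to long-term storage.
        admin.namespaces().setOffloadThreshold(
                "public/default", 100L * 1024 * 1024 * 1024);

        admin.close();
    }
}
```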
The offload process itself looks like this. When we decide we want to offload a segment, we first update the metadata for the topic, saying that this segment is going to be offloaded to this location. We do this before the actual offload to avoid the case where the offload fails halfway through and you end up with zombie data using up space in long-term storage and costing you money. So we update the metadata with the new location, and then we start moving the messages from that segment over to a data object in long-term storage. The data object is composed of blocks, and we keep track of the first message in each block and use this to build an index, which becomes another object in long-term storage.

So for each segment, you end up with two objects in long-term storage: a data object and an index object. The index object isn't strictly necessary; we could just have the data object and, each time you read the segment, read from the start to find the message you wanted. But the index avoids having to read the whole segment when you just want one or two messages. And, as is often the case when you're reading from a topic backlog, you don't start at the exact start of any segment; you usually have a cursor somewhere in the middle of the segment, and you don't want to have to pull the whole segment to get just the little bit that you need.

Once the index object has been written to long-term storage, we update the Pulsar topic metadata to say that the offload process has completed, so the local storage can be cleared whenever. We don't actually delete the local segment straight away; there's a grace period, I think the default is four hours, during which we keep the segment on local storage, just in case something went wrong with long-term storage and you need to stop the whole offloading process and dig into it deeper. But eventually the data will be removed from the Pulsar cluster's storage, and that space can be used for new segments.

This has been available in Pulsar since version 2.1, and the first implementation we had used S3 as the object store. Version 2.2 went out, I think, three weeks ago, and that added support for Google Cloud Storage. Azure and Hadoop are planned for version 2.3; the Azure patch is already out, I think, and the Hadoop, well, HDFS, implementation is already complete, but I haven't actually seen the patch pushed yet.

You can also implement your own offloader. (The indentation got messed up on the slide there somehow.) The interface is what you would expect: you have methods to offload messages, to read the offloaded messages, and to delete the offloaded messages, because even if you are offloading to long-term storage, you may still have a retention policy on your topic, so you might want to delete those segments after a year, for example.

You implement this interface and then bundle it up into what's called a NAR file. A NAR file is a NiFi archive; it's like a JAR file with classloader isolation. Something like, well, actually S3 is not too bad, but something like Hadoop has a lot of dependencies it'll pull in, which will interfere with the Pulsar dependencies in the process, so we need to isolate the Java classloader, and NAR files allow you to do that. So you bundle up a NAR file with your implementation of the interface and all the dependencies it needs, you drop it in your offloaders directory, and then it's available for you to use for offloading your data.
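As a simplified sketch of that interface shape, with the three operations just described. The names and signatures here are my illustrative assumptions, not the exact Pulsar interface; the real one works in terms of BookKeeper ledgers and asynchronous futures.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of an offloader contract, loosely modeled on
// Pulsar's offloader interface; names and types here are assumptions.
interface SegmentOffloader {

    // Copy a closed, immutable segment into long-term storage. The UUID
    // identifies this attempt, so a half-finished failed offload can be
    // cleaned up without touching a later retry (avoiding zombie data).
    CompletableFuture<Void> offload(long segmentId,
                                    UUID offloadAttempt,
                                    Map<String, String> driverMetadata);

    // Open an offloaded segment for reading; the broker uses this so
    // that reads of old data stay transparent to the client.
    CompletableFuture<SegmentReader> readOffloaded(long segmentId,
                                                   UUID offloadAttempt);

    // Remove the data and index objects, e.g. when a retention policy
    // expires the segment even in long-term storage.
    CompletableFuture<Void> deleteOffloaded(long segmentId,
                                            UUID offloadAttempt);
}

// Minimal reader abstraction for the sketch.
interface SegmentReader {
    byte[] readEntry(long entryId) throws java.io.IOException;
}
```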
Okay, finally: Pulsar SQL. This has also been available since Pulsar 2.2, and it's an SQL interface that allows you to query all the data in a topic's backlog. It's based on Presto DB. Presto is an SQL engine with a pluggable data backend: you tell it where to get the data, you tell it the format of that data, and you map its internal table schema onto it, to tell it how to present that data to the user. We use the topic schema feature in Pulsar to create the table schema; this will be clearer when I show you an example.

It's important to note that this is for data at rest. There are some streaming SQL implementations where you put in your query and the system gives you results as new data comes in. That's not what this is. This queries the data that is already in the backlog when you make the query, and the result you get is a final result that you get one time.

So, to give an example of how we make a query against Presto, let's go back to the music example. We have a listen event, and it has everything you'd expect: the user who's listening, the artist, the title, and the album. It denotes that a user has listened to a song. We generate a schema object from this class. We support Avro and JSON schemas right now; we also support Protobuf schemas, but that doesn't work with the SQL stuff yet. So you generate a schema, you tell it you're going to write to the user-profile topic, you create a producer, and then you just push a couple of listen events.

Now, let's say someone in the rights management department wants to see every user who has listened to a particular artist. They can just go to their Presto console and write a normal SQL-type statement: select all from pulsar, which tells Presto to use the Pulsar driver; then "public/default", which is just the namespace in Pulsar that the topic exists in (in this case I didn't actually specify a namespace, so it went into the default namespace in the public tenant); then you specify user-profile, which is the topic name; and then you put in a normal SQL-type where clause. And it gives you back a result like you'd get from a MySQL table. You can have many different things in the where clause; I'm not going to go all the way through Presto's documentation here, but this gives you an idea of what you can do with the topic backlog data.

And it doesn't matter whether the topic data is stored on the Pulsar storage nodes or in long-term storage. This is another benefit of separating the serving layer from the storage layer: we have an abstraction for the storage layer, and Presto doesn't even go near the Pulsar serving layer. Presto has its own serving layer, the Presto workers, which is what the Presto client connects to, and these workers use the storage abstraction to pull in the data from either Pulsar storage or long-term storage, be it S3, Google Cloud Storage, or HDFS when we have it.
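Putting the two halves together, here is a hedged end-to-end sketch: producing schema'd listen events with the Java client, then querying them back through Presto's JDBC driver. The class, topic, and artist names are illustrative, and the Presto connection URL assumes a Pulsar SQL worker on localhost:8081.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class ListenEvents {
    // POJO the schema is generated from; its field names become SQL columns.
    public static class ListenEvent {
        public String user;
        public String artist;
        public String title;
        public String album;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // JSON schema generated from the class (Avro would work the same way).
        Producer<ListenEvent> producer = client
                .newProducer(Schema.JSON(ListenEvent.class))
                .topic("user-profile") // lands in public/default by default
                .create();

        ListenEvent e = new ListenEvent();
        e.user = "alice"; e.artist = "Some Artist";
        e.title = "Some Song"; e.album = "Some Album";
        producer.send(e);
        client.close();

        // Query it back through Presto's JDBC driver (worker address assumed).
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:presto://localhost:8081/pulsar", "user", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT \"user\" FROM pulsar.\"public/default\".\"user-profile\" "
                     + "WHERE artist = 'Some Artist'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```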
Okay, and that's it. I'm just going to summarize. Why would you use Pulsar? You can have unlimited backlog sizes. You can also have a lot of topics; that's something I didn't touch on, but Pulsar can scale to, I think a million topics was the number we were putting out there, where the only real limit is what ZooKeeper is doing in the background. You can have unlimited topic backlog size, all the data in those topics can be offloaded to much cheaper storage, and you can query all that data using an SQL-type interface.

I motivated this with a machine learning, recommender-system type use case, but there are lots of other use cases for having massive backlogs. Once you have a massive backlog, you can build a CQRS, event streaming, or event sourcing system; you can build data marts, audit logs, query logs. What's important is that since Pulsar provides the ability to have these massive, or infinite, backlogs, all these use cases become possible.

Okay, so if you have any questions, I'll take them. Or, you know, I'm on Twitter, so you can ask me there, or just grab me when I'm walking around. Thank you.

[Audience question about the minimum number of nodes.] Sorry, how many? Two. So you need a minimum of two on your storage layer, because obviously one can crash. No, you know what, this is the part I didn't show: you need three ZooKeeper nodes, three BookKeeper nodes, and two Pulsar broker nodes. But yeah, basically... no, actually, I think you only need two BookKeeper nodes. The reason you need three ZooKeeper nodes is that you need to be able to form a majority to get this total order atomic broadcast thing, but you only need that when you close the ledgers, when you close the segments, and that operation involves ZooKeeper. So for ZooKeeper you need three; for all the rest, you just need the minimum you'd need for fault tolerance, which is two.

[Audience question.] Can you run Pulsar on Kubernetes? Yeah, yeah, that's what a lot of people are doing now. Most of our work right now is getting it working nicely on Kubernetes, with Helm charts and all this stuff. But yeah, for sure. Thank you.

[Audience question.] Thank you for the talk, very interesting and thought-provoking; I haven't seen a system like this before. I was wondering about the difference with Apache Kafka, because what we're seeing here is the storage of events, and the Presto-based SQL engine queries the raw data, but maybe you have to join across data or elaborate on it a bit. So, the Presto stuff is the part I know least about, because I worked mostly on the tiered storage and BookKeeper layers, but I think joins are supported in Presto; you'd need to look at the Presto documentation to check. Presto pulls the data locally and then it will do things like joins, so it's quite a big system. The big difference, going back to your first question about Kafka: the main difference, with regard to what I've talked about, is the separation of the serving and storage layers. This separation allows us to attach different things to the storage layer to provide different use cases.

[Audience question.] And the last question: are there any APIs for external applications, apart from Presto? Yeah, Presto has an API itself, so you would have a Presto client library in your own application, and that connects to the Presto workers. So if you have an external application, you connect to Presto, which queries the storage layer.
Okay, so, you know, you're not explicitly using Pulsar when you're using the Presto stuff. It's Pulsar in the background providing the data, but you're actually using Presto. Yeah, that's it; Pulsar is just the backend for the storage. Okay.

[Audience question.] Hi. For a lot of data-lake-style storage, you're going to partition your data, usually by date, so you can run queries for a specific date and have them return in a reasonable time. If you're having infinite topic backlogs stored in long-term storage and you're querying them with SQL, is there any partitioning? I understand there's indexing within a segment to get to the offset, but anything else? Well, the segment is effectively a time-based partition, because segments are chronologically ordered, so you get that kind of partitioning naturally.

[Follow-up.] So would Presto be able to understand the times in the segments, to only go to the segments it needs to go to? Again, I'm not a hundred percent sure, because I didn't implement it, but if it doesn't, that's very easy to do; each segment is also dated with when its first message was written, so you can pull that information very easily. It would be very easy to add if we don't do it; I'm not a hundred percent sure whether we do. Gotcha. But I would be surprised if we didn't. No, thank you.