So my name is Deepak Goyal. I'm one of the software developers at Walmart Labs. I work in a team called Customer Backbone (CBB), where we are responsible for building platforms that can really handle the scale of events our website generates: stateful processing of multi-million events in real time. When I say stateful, every event generated on the website has to be processed, inferences run on top of it, and we generate recommendations based on that event processing. The website generates millions of events: transactions, profile changes, item views, page clicks and so on.

So let's look at the CBB data pipeline. Say the customer clicks "add to cart"; that action generates an event which is collected by our front-end Kafka and then, after event filtering, flows to our internal CBB Kafka. We filter out approximately 99% of the events and are only interested in the remaining 1%. A Kafka Streams cluster then processes those events from Kafka. This event processing includes customer state updates, feature extraction, model inferencing and so on. Any information we have inferred can also be consumed by other interested consumers inside Walmart, since it is written back to Kafka as part of the changelog topic. The scale we are talking about: after event filtering we have 1 million events per second, and we have 400 Kafka Streams application instances running on 400 different VMs. The highlighted box essentially forms the CBB infrastructure.

There are a lot of things I like about Kafka Streams. First and foremost, it is very easy to develop code with it. It is a library, not a framework, so your code really holds control of the entire thing. Out of the box it provides very high processing throughput. And if you use persistent storage with Kafka Streams (it gives you the option of both persistent and non-persistent stores), you get distributed persistent storage. One of the great things about Kafka Streams is that it gives you exactly-once semantics end to end; very few other architectures give you that. Consider a bank transaction that must be processed exactly once, or an order placed on the website that must be processed exactly once.

So let's take an overview of the architecture. A Kafka Streams application is essentially a collection of tasks. Each task has a consumer which consumes events from Kafka. Each event is forwarded to a processor, which processes it using the state for that key stored in RocksDB, and whatever changes you make in RocksDB are also written back to Kafka using the producer. On top of that you have an HTTP server which lets you look up the keys present in RocksDB. A Kafka Streams instance is a collection of such tasks, so you can have multiple tasks in a single instance, and the application is spread across the 400 instances I was talking about. (A minimal sketch of what one such task's topology looks like follows below.)
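This is a minimal, hypothetical sketch of such a topology using the classic Processor API; the topic names, store name, application id, and the trivial "append the event to the customer's state" logic are illustrative only, not our production code:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public final class CustomerStateApp {
    public static void main(String[] args) {
        Topology topology = new Topology()
            // one task is created per partition of this source topic
            .addSource("events", "customer-events")
            .addProcessor("update-state", CustomerStateProcessor::new, "events")
            // persistent (RocksDB-backed) store; its changelog topic is what standbys replay
            .addStateStore(
                Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("customer-state"),
                    Serdes.String(), Serdes.String()),
                "update-state")
            .addSink("out", "customer-state-updates", "update-state");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cbb-customer-state");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        new KafkaStreams(topology, props).start();
    }

    static final class CustomerStateProcessor implements Processor<String, String> {
        private ProcessorContext context;
        private KeyValueStore<String, String> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.store = (KeyValueStore<String, String>) context.getStateStore("customer-state");
        }

        @Override
        public void process(String customerId, String event) {
            String previous = store.get(customerId);           // read current state for the key
            String updated = previous == null ? event : previous + "|" + event;
            store.put(customerId, updated);                    // update RocksDB (and its changelog)
            context.forward(customerId, updated);              // emit the update downstream
        }

        @Override
        public void close() {}
    }
}
```

The persistent store is RocksDB-backed, and its changelog topic is exactly what the standby tasks described next replay.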
So let's look at the event flow in Kafka Streams. At the top is an input topic, which is replicated on the Kafka side, and task 0 consumes events from partition 0 of the input topic: there is a one-to-one mapping between a task and a partition of the input topic. And if the app consumes from multiple topics, partition 0 of each of those topics likewise maps to that same task 0.

There is a very fundamental problem with this picture. Kafka, as you know, has a replication factor, so there is redundancy on the Kafka side, but on the app side there is no redundancy so far. So Kafka Streams introduces the notion of a changelog topic as well as standby tasks. Whatever changes you make to RocksDB go to the changelog topic, which is then consumed by the standby task purely for the sake of redundancy. At any point, if the standby task goes down, a new standby will be fired up in the existing cluster, and it will replay the entire changelog to build its state again. The upper part forms the Kafka cluster and the lower part the Kafka Streams cluster, and you can have any number of standbys depending on the criticality of your data.

As with any architecture, there were a number of challenges with Kafka Streams. We categorize them into five areas, and we will go over each of them: why it is an issue and how we at Walmart Labs solved it. They are fault recovery, horizontal scalability, cloud readiness, issues with the RocksDB implementation in Kafka Streams, and the problems you get with large clusters of Kafka Streams.

The first one is fault recovery: the default behavior of how bootstraps are done versus the cold bootstraps we introduced in our implementation of Kafka Streams. We forked our code from Kafka Streams 1.1 and implemented various features on top of it, the first being the cold bootstrap. As I said earlier, the changelog topic acts as the source of truth, so whenever any task or instance goes down it has to replay the entire changelog. At Walmart scale that can literally take 15 to 30 minutes, because the data is so huge. Also note that the changelog topics are log compacted, which means every key-value pair present in RocksDB is present in the changelog topic at least once. That leads to inefficient disk usage: with a replication factor of three on the Kafka side plus one active and one standby, you end up with five copies of the data, of which only one, the active store, is actually usable.

So what is the default bootstrap behavior? The standby task goes down and no one is consuming from the changelog topic. As soon as Kafka recognizes this, a cluster-wide rebalance is triggered and a new task is fired up with an empty RocksDB. With the changelog topic as the source of truth, it replays the entire topic, builds up its state, and from that point on it can act as a standby.

Now let's look at the cold bootstrap. Instead of the changelog as the source of truth, we made the active the source of truth. RocksDB is an embedded database, so you can pause processing on it and copy the entire data from one machine to another using just the raw bandwidth of the network: with 100 GB of data and a 100 Gbit connection you can literally transfer everything in about 8 seconds. This gave us efficient disk usage, and we could also do cross-data-center or cross-cloud bootstraps.
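The copy step itself is conceptually simple, because RocksDB state is just files on disk. Here is a small, hypothetical sketch of copying a task's state directory to a destination path (in practice the destination would be reached over the network, for example via a remote mount or an rsync/scp style transfer); names and paths are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public final class ColdBootstrapCopy {
    // Rough transfer math from the talk: 100 GB of state over a 100 Gbit/s link is
    // about (100 * 8) Gbit / 100 Gbit/s = 8 seconds of raw transfer time.
    public static void copyStateDir(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(dest);   // parents come before children in walk order
                } else {
                    Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

Processing on the source task is paused for the duration of the copy, which is why, as mentioned later, we would eventually like to bootstrap from another standby instead of the active.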
So even if you lose all copies of the data in one data center, you can effectively bootstrap across data centers. How does the whole flow work? Instead of the changelog topic, the active is the source of truth. The standby goes down, a cluster-wide rebalance is triggered, and a new standby task is introduced which does not have the complete state. We cannot simply replay the changelog topic, because our changelog topics now use time-based retention rather than keeping the full state via log compaction (which is where the efficient disk usage comes from). Instead we copy the entire RocksDB from one machine to the other, and the new standby then starts consuming events. Our average time for a cold bootstrap is around 30 seconds for 100 GB of data across machines.

The second challenge we wanted to address was horizontal scalability, in two forms: scaling a running, static cluster to more partitions, and scaling lookups while serving data. Kafka has a feature where you can repartition any topic into a larger number of partitions, but Kafka Streams does not. And even when you repartition on the Kafka side, the events already stored in the topic are not redistributed to the new partitions according to the new hashing scheme. So we introduced dynamic repartitioning in Kafka Streams.

What is the repartitioning logic we use? Say you have partition 0 and partition 1, with keys distributed across them. I create exact copies of partitions 0 and 1 and name them partitions 2 and 3. If I then delete, from each copy, the keys that no longer belong to it, I have effectively repartitioned the system into twice the number of partitions. This is a commonly used technique across databases for splitting data into more shards.

How does this happen in Kafka Streams? Say we have two partitions and want to scale to four, and consider task 0 (task 1 goes through the same process, but for this example let's only look at task 0). We fire up and cold-bootstrap a new standby which is an exact copy of task 0's active (or standby), and we make it the active for the new task 2; then a new standby for task 2 is fired up, which again is bootstrapped the same way. All of this happens in the restoration phase of Kafka Streams when we trigger repartitioning. The trigger is that whenever any of the source topics gets repartitioned to twice the number of partitions, the app automatically detects it, triggers a cluster-wide rebalance, and initiates this repartitioning process. As soon as that finishes for a partition, we start deleting from each store the keys that do not belong to it (partition 0 keeps only partition 0's keys, partition 2 only partition 2's) and resume event processing. (A sketch of this key-retention rule follows below.)
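Here is a hedged sketch of that key-retention rule, assuming keys are placed by Kafka's default murmur2-based partitioner (if you use a custom partitioner, the same idea applies with your own hash):

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public final class RepartitionFilter {

    /** Partition a key hashes to, mirroring Kafka's default (murmur2-based) partitioner. */
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    /**
     * After doubling from oldCount to 2 * oldCount partitions, a key that used to live in
     * partition p can only land in p or p + oldCount. Each copy therefore keeps a key only
     * if it hashes to the partition that copy now serves, and deletes it otherwise.
     */
    static boolean keep(String key, int servedPartition, int newPartitionCount) {
        return partitionFor(key, newPartitionCount) == servedPartition;
    }
}
```

For example, with two old partitions, the copy serving new partition 2 would delete every key for which keep(key, 2, 4) is false.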
The second feature we wanted under horizontal scalability was scaling lookups. In the vanilla version of Kafka Streams the standbys were not queryable; the only task available for lookups was the active. With slight modifications to the code we made the standbys queryable as well, essentially doubling the serving throughput. This is an opt-in, per-client feature, because a standby can be a few milliseconds to a few seconds behind the active, so each client decides whether its reads may be served from a standby.

We also replaced the in-built Jetty server with an Akka HTTP server, which gave us non-blocking IO. This also helped because Akka is inherently an actor-based system and comes with dispatchers, so we can control exactly how many threads are used for serving and, under higher load, how many additional threads per VM serve requests.

The third thing we added was partition-specific lookups. As one of the earlier slides showed, each instance has multiple tasks, and each of those tasks has its own RocksDB. Previously, when you looked up a particular key on an instance, it would iterate over all the RocksDBs and serve the key from whichever one actually held it. Since we are in a hash-partitioned environment, we instead compute which partition the key belongs to and look it up only in the RocksDB for that partition. This really helped in decreasing the serving time for a key.

The third challenge we looked at was cloud readiness. The first thing we added here was rack/AZ-aware task assignment; the second was partial partition assignment revocation. Consider your distribution of active and standby tasks: three AZs and, say, two partitions, each with one active and two standbys. If you were to lose an entire AZ, you would end up with a non-uniform distribution of data across your AZs. To increase fault tolerance and achieve uniformity across AZs, we introduced a new config called the rack ID. Whenever instances join the cluster, we do the partition assignment using the rack ID and make sure that no active and standby of the same task land in the same AZ. We are also moving to public clouds like GCP and Azure, and Azure has the notion of a fault domain as well as an update domain; while doing this assignment we take both into account, so that no two copies (active or standby) of the same partition are on the same fault domain or the same update domain. This increases our fault tolerance, and we never lose all the copies of the data in a single data center.

The second thing we added was partial partition assignment revocation. Say you lose both task 1's active and its standby. The default version of Kafka Streams would assign a new active and a new standby on the remaining VMs in the cluster. But since we introduced the cold bootstrap mechanism, a new copy cannot rebuild the full data from the changelog, because the changelog topics no longer hold the full state. So we revoke those partitions from being assigned whenever all copies of a task are missing from the cluster; only when one of those copies comes back up do we do the partition assignment. A second standby can then be brought up, which first cold-bootstraps from the active and then replays the changelog topic.
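A minimal sketch of that revocation rule, with hypothetical names (this illustrates the idea, not our actual assignor code): a task is withheld from assignment while none of its previous owners is alive, since a fresh replacement would have nothing to cold-bootstrap from.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class PartialRevocation {

    /** Tasks whose every previously known owner (active or standby) is currently offline. */
    static Set<Integer> tasksToWithhold(Map<Integer, Set<String>> previousOwnersByTask,
                                        Set<String> liveInstances) {
        Set<Integer> withheld = new HashSet<>();
        for (Map.Entry<Integer, Set<String>> entry : previousOwnersByTask.entrySet()) {
            boolean anyCopyAlive = entry.getValue().stream().anyMatch(liveInstances::contains);
            if (!anyCopyAlive) {
                // do not assign this task yet; wait for an owner to return
                // (or fall back to a cross-data-center bootstrap, as described next)
                withheld.add(entry.getKey());
            }
        }
        return withheld;
    }
}
```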
In case we completely lose both copies of the data, we do a cross-data-center bootstrap, in which case we fetch the data from another data center.

The fourth challenge we addressed was the limitations of the RocksDB store as implemented by Kafka Streams. The first thing we added was column family support. RocksJava inherently has column family capabilities; for those who don't know, a column family gives you a logical separation of data within the same RocksDB instance. But even if we enabled column family support on the Kafka Streams side, the changelog topics had no notion of column families. So while writing to the changelog topic we encode the column family information, and when the standbys replay it they decode that information and write each record to the exact column family it was written to on the active. That is how we were able to add column family support.

We also eliminated synchronized gets. In the implementation of the RocksDB store, in RocksDBStore.java, everything was synchronized. This was done to prevent gets and puts while the RocksDB is undergoing state-changing operations like opening or closing the DB. What we did is remove those synchronized blocks and introduce a read-write lock mechanism, where state-changing operations like open and close have to acquire a write lock, and all the gets and puts acquire only a read lock. Even though the underlying RocksDB is capable of handling concurrent requests, and the layer above it (the customer) was issuing multiple concurrent requests, those synchronized gets had kept us from getting the throughput we wanted. (A sketch of this locking scheme follows below.)

The third feature we added: when you have large clusters of Kafka Streams applications, even a single instance joining or leaving the cluster puts the entire cluster through a rebalance, and the time an instance stays in the suspended or restoration state is proportional to the size of your cluster. Even while event processing is not happening, we wanted to serve keys from RocksDB, so we made the stores queryable in the suspended and restoration states. This helped us achieve five nines of availability in terms of serving data.
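Here is a simplified sketch of that locking scheme as a stand-alone class (the real change lives in RocksDBStore.java in our fork): record-level gets and puts take the shared read lock, while state-changing operations like opening or closing the database take the exclusive write lock.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class LockedRocksStore {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private RocksDB db;  // opened and closed only while holding the write lock

    void open(Options options, String path) throws RocksDBException {
        lock.writeLock().lock();     // exclusive: no gets or puts while the DB handle changes
        try {
            db = RocksDB.open(options, path);
        } finally {
            lock.writeLock().unlock();
        }
    }

    byte[] get(byte[] key) throws RocksDBException {
        lock.readLock().lock();      // many concurrent readers of records are allowed
        try {
            return db.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    void put(byte[] key, byte[] value) throws RocksDBException {
        lock.readLock().lock();      // RocksDB itself handles concurrent puts internally
        try {
            db.put(key, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    void close() {
        lock.writeLock().lock();
        try {
            if (db != null) {
                db.close();
                db = null;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```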
The fifth challenge, as I mentioned, is the set of problems you get with large clusters of Kafka Streams. The first is the rebalance time, and I will also go over some broker defaults and Streams defaults which we override to get better data reliability and lower latencies.

How did we reduce the rebalance time? When instances join a cluster, each one says "I previously held these active tasks and these standby tasks." All of this information goes to a single broker, the group coordinator, which passes it to one of the instances in the application cluster, the consumer group leader. That instance gathers all the information and decides which active and standby tasks go to which machines. So this information has to flow from all the instances to one broker and then back from that broker to all the instances. That information, the partition assignment metadata, was very verbose, so we added better encoding for the assignment info, plus compression whenever it is exchanged between a consumer and the broker. We could effectively achieve a 24-times smaller partition assignment and roughly 24 times less rebalance time; for our particular case this reduced the rebalance time from 8 minutes to around 30 seconds. This is available as one of the PRs; it has not yet gone into Kafka 2.3, I believe, but we are working to get it into 2.4.

Now some important broker configs you might want to override when you have large clusters. The first three settings govern the maximum size of a message that can be written to a broker: message.max.bytes, replica.fetch.max.bytes and socket.request.max.bytes. As I was saying, even the partition assignment info goes to the broker as a message and is stored on the brokers. If your partition assignment is, say, 100 MB and these settings cap messages at 10 MB, your partition assignment will be rejected by the broker and never written to the leader. That is when your entire cluster goes into a corrupted state, and you won't know which instances are supposed to hold which active and standby tasks.

The fourth setting is offsets.load.buffer.size, and I hope none of you ever has to touch it. As I said, one of the brokers holds the partition assignment info, and it has replicas, the followers of that leader. Whenever that leader goes down, a follower becomes the leader, and the new leader loads the partition assignment info from its offsets topic, but it will only load as much data as this setting allows. Again, if this is set to 10 MB and your partition assignment is 100 MB, it won't be able to load everything into memory and hence cannot hand that information back to the consumers. It is important to set this to at least double the size of your partition assignment info.

The fifth setting is min.insync.replicas. Whenever you write a message to the broker, this says how many replicas of the leader have to have the data before an acknowledgement is sent back to the producer. If you set it to two with a replication factor of three, you can be sure the message got written to at least two of the three brokers.

Now some Streams defaults. The first is acks, which is the corollary of min.insync.replicas: with acks set to all, a write is only acknowledged once min.insync.replicas replicas have it, even though the terminology is slightly odd. So acks=all together with min.insync.replicas=2 ensures that any message you send to Kafka gets written to at least two of the three brokers; by default the setting is acks=1 in Streams. The second setting is linger.ms, which is helpful under low-load scenarios: if you set it to, say, 20 milliseconds and your batch is not full within 20 milliseconds, then irrespective of the batch size, whatever is sitting in the producer buffer gets sent to the Kafka broker.
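To pull these overrides together, here is an illustrative snippet; the broker-side values are placeholders you would size against your own partition assignment metadata, and the last two Streams settings are the ones discussed next:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public final class OverrideExamples {

    // Broker side (server.properties), example values only; size these well above
    // the partition assignment metadata of your largest application:
    //   message.max.bytes=209715200
    //   replica.fetch.max.bytes=209715200
    //   socket.request.max.bytes=209715200
    //   offsets.load.buffer.size=209715200   (at least ~2x the assignment size)
    //   min.insync.replicas=2

    static Properties streamsOverrides() {
        Properties props = new Properties();
        // make the embedded producer wait until min.insync.replicas brokers have the write
        props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
        // under low load, flush partially filled batches after 20 ms
        props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), "20");
        // if the committed offsets have already been deleted by retention, start from the oldest
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "earliest");
        // keep local state around longer than your worst-case rebalance (here: 1 hour)
        props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, 60 * 60 * 1000L);
        return props;
    }
}
```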
The third Streams setting is auto.offset.reset. Say your application is down for six hours and you have a retention of six hours on your Kafka topic. When the app comes back up, it asks for the offsets it has not yet consumed, but the broker no longer has those offsets, because time-based retention has already deleted that data. By default this setting is latest, and you want to set it to earliest so that you at least process the messages that are still on the broker; otherwise it would only process the latest messages written to the broker.

The fourth setting is state.cleanup.delay.ms. Say your cluster is under rebalance for 30 minutes or so and this setting is at 10 minutes. The application will think "it has been 10 minutes and I have received no information about this partition, it has probably migrated to another instance" and delete the local store, even though the reassignment and the rebalance have not actually finished. So, depending on your cluster size and rebalance time, you have to set this higher than the rebalance time; otherwise you simply lose the local data.

This is one of the benchmarks we did, around eight to nine months back, on a staging cluster. Our Kafka cluster had 17 eight-core instances, our Streams cluster had 102 core instances, and we achieved a processing rate of 2.3 million events per second, where each event was a read-modify-write in RocksDB with a 10 KB payload. With all these features and this benchmark, we have effectively achieved an architecture with very high processing throughput as well as distributed persistent storage, and we never lose an event while doing event processing.

So that was it. Up next, we are working on various features. A few things we have pretty much nailed: we are on the verge of moving our feature extraction and model inferencing pipelines to this new architecture, and preliminary benchmarks say it is dramatically better than the architecture we currently have at Walmart Labs. We would love to have cold bootstraps from the other standbys when you have more than one standby, because when we cold-bootstrap from the active, event processing on the active is paused for the duration of the bootstrap. We are also looking into cold bootstraps and repartitioning for the DSL.

The fourth item is TTL support for state stores. Kafka Streams has TTL support disabled in the vanilla version, probably because the changelog topics have no notion of TTL: whatever you write to the changelog topic just stays there because of its log-compacted nature, so even if you enabled TTL on the app side you would not be able to delete the data on the Kafka side. Since we introduced the cold bootstrap mechanism, our changelog topics have time retention, so as of now we are doing some slightly tricky things to achieve TTL, but essentially we can enable TTL on our RocksDBs.

The fifth thing we are looking into is the merge operator for RocksDB. I don't think anyone in the RocksDB community has been able to solve this for Kafka Streams, because it is very tricky: it requires callbacks between the native side (C++) and Java.
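For context, here is what the merge operator looks like in plain RocksJava, using the built-in, natively implemented StringAppendOperator; the hard part referred to above is wiring custom, Java-defined merge logic into the native side, not using a built-in operator like this. A toy, hypothetical example:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.StringAppendOperator;

public final class MergeOperatorDemo {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                 .setCreateIfMissing(true)
                 .setMergeOperator(new StringAppendOperator());
             RocksDB db = RocksDB.open(options, "/tmp/merge-demo")) {
            // no get-modify-put: both updates are pure writes
            db.merge("customer-1".getBytes(), "viewed:item-42".getBytes());
            db.merge("customer-1".getBytes(), "added-to-cart:item-42".getBytes());
            // a read (or a background compaction) performs the combine;
            // this prints both events joined by the operator's delimiter
            System.out.println(new String(db.get("customer-1".getBytes())));
        }
    }
}
```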
It is also effectively a compile-time feature: it converts all your read-modify-writes into a write-only workload, and when the background compactions happen, they merge the data.

The sixth item is multi-tenancy, which has been boggling our minds for a long time now. What we effectively want is a single Kafka Streams cluster where multiple clients run their code, doing their event processing and inference, on that one cluster. We are doing some work on multi-tenancy, but there are multiple problems, like how we share resources among clients and how we keep one client from affecting another client's data.

So, my name is Deepak; you can find me on LinkedIn at Deepak-tripleit. All of this has been possible because of my teammates who were there from the beginning, Navinder Giri and Ashish, so a special thanks to them. Keep streaming. Thank you. Any questions?

Audience: Hello. Let's say I have an app which consumes from multiple topics. For one of the topics I don't care, when my app crashes or restarts, what data was lost; that's no problem for me. But there is another topic for which I really want everything that has not yet been processed to be processed. Will this auto.offset.reset help me there?

Deepak: No, it applies across all the topics. But I think you could modify the code to have a per-topic configuration, so it is not impossible; without tweaking the code, though, it is not possible. You would have either an earliest offset reset policy for all the topics or latest for all the topics.

Audience: Thank you.

Deepak: Thank you.