It's five p.m. — I know we're maybe the last talk between now and going to the bar — but thanks for coming, and thanks to the entire team for building this. We're here representing a large team at US Bank who built this, and a couple of them are in the room. And thanks to the Cassandra team and the DataStax team for supporting us. What we'll be going through is one of those little mysteries: when you swipe your credit card at, you know, a Starbucks or somewhere like that, how does that transaction actually show up in the mobile app or the web app? Maybe we'll answer that question, at least partially. I myself am Shesha, a director of software engineering at US Bank. Venkat?

All right, so thank you for coming and thank you for showing interest in the session today. I'm Venkat, and I'm an engineering leader at US Bank with over a decade of experience. Most of my experience lies in the data space.
Data engineering — moving data between systems. My journey evolved from using leading ETL tools from the market for batch data movement to developing a homegrown software platform at US Bank, a software product enabling real-time data transfer. I have delivered impactful projects for the bank, and migrating workloads off the mainframe to sunset it — one of those projects — is our topic of discussion today.

Just a few words about US Bank and the scope and scale of the problem we saw. The problem we saw impacts consumers, and consumers are part of consumer and business banking. For consumers we offer products like checking, savings, deposits, and loans; on the payments side we offer credit cards; and we offer different kinds of wealth management services based on your net worth. If you total up the revenue percentages, about 80 percent of the bank's traffic is driven by the consumer space, and the solution we built is for that large scale of customers.

That's the scale side of the equation. The other side is that digital interactions are increasing at a high rate: from four years ago to now, digital transactions went up by 16 percent and remote deposits went up by 20 percent. And it's not that somebody is encouraging customers to move to digital — this trend of migrating toward digital is happening across industries and across many different companies. I'll go over some of the drivers of why these digital interactions are increasing. One of the main reasons is that as we all move away
from the analog version of something to the digital version, the number of interactions goes up. For example, if you were renting a DVD in the old days at a Blockbuster, you would browse for 20 minutes, check out one DVD, and return it after three days. You would probably generate three rows in the database: one for checking it out, one for returning it, and one for some late fee — so maybe three transactions if you add them up. The equivalent on Netflix: you browse for 30 minutes and probably create 50 or even 100 interactions across all the movies, the reviews, who the actors are — many more interactions. It's the same with Kodak: if any of you remember old film rolls, you would take 24 or 36 pictures per roll, but in the iPhone-camera world people take hundreds of pictures. The same scaling of interactions applies in the bank, too. You don't visit a branch as many times; once you have a digital app, we see something like 10 times more transactions. That trend is what strains the existing infrastructure, which is mainframe, so we had to move to a different infrastructure that would support a different scale of interaction.
So that's the first problem. The second problem is that as customers move from analog to digital, they want continuous engagement. What I mean by continuous engagement: again, go back to the old days of calling a cab versus ordering an Uber. When you called a cab, you probably had one interaction. When you order an Uber, people follow it turn by turn — it's one minute away, five minutes away, ten minutes away — and they keep track of it. Similarly, when we wire money or do credit card processing, people are constantly following those transactions. That means the number of interactions they're going to have is significantly higher than in the traditional days.

The last driver is velocity. As we develop more products and more features and sell them as bundles, there's a need for different groups to develop in parallel, and in the mainframe world it was much more difficult for multiple teams to operate. We wanted to unleash the creativity of all the product managers and engineers and have many different teams working on this, so this was another driver to migrate the workloads away. And I'm not discounting the fact that cost is a driver, but there are other reasons to move away from the mainframes.

Now I'll go into a simple architectural view of this. It's a very simplified view.
From a UI — it could be a mobile app, a web app, or even a customer-service agent; a generalized version of a UI — somebody makes an address change. It goes through a thin domain layer, and the domain layer is nothing but security; there are no business rules in it. It goes to the mainframe, and the mainframe has all the business rules, and of course it returns the data to the UI. In the back end itself there are other updates that happen, batch or even real time, like credit card processing or ACH transfers. All of them put together are sent back to the UI — those are your read transactions. This is a very simple representation of the architecture.

If you want to move away from the mainframes, what are the typical strategies in this world? One: we started moving new domains away — anything new doesn't go into the mainframe anymore. That takes care of some percentage through avoidance, in the sense that you're not building there anymore. The second, which is probably the key part of the problem, is the read part. In financial services especially, you don't wake up every day and make transactions; you mostly wake up and look at your transactions rather than modify them. So our write-to-read ratio is around 20 to 80: 20 percent writes and 80 percent reads.
So that's the focus of this conversation: how we migrated the reads away from the mainframe. Then of course you can move the non-core functionality — all of these efforts are going on in parallel, but the one we're talking about is the read migration. And finally you could migrate the core, which is probably the most complex, and also expensive. As I said, for this particular conversation we're talking about the read part.

So what did we do to achieve the reduction in that 80 percent of the traffic? We described the scale of the problem before: 80 percent of the business is consumer focused, and 80 percent of that is read focused, so this forms a particularly significant chunk of the bank's problem.

Again starting with the left-hand side — this is the same architecture I showed you before — what did we do to it? We modified the mainframes to put in a publisher. (We can answer "why not CDC?" later.) We modified the mainframes to publish the data as a full-fledged event, and the events were published to a streaming platform. The streaming platform consisted of an API abstraction on top of Kafka, then Kafka itself, then an event processor, Spark — and that whole combination was built as a platform so that we could stream all kinds of data. We had around 40, 45, maybe 50 event types just for this one particular project. All of that data we moved into Cassandra, and we built some of the business rules in the new domain layer. And not only that: when we moved the data to Cassandra, we had an opportunity to enhance the data. Today when you go to your credit card transactions, all of them are enhanced compared to what they were years ago.
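Conceptually, that publisher → Kafka → Spark → Cassandra flow with enrichment can be sketched as a toy pipeline. This is a minimal plain-Python illustration, not the actual platform: the function names, event fields, and merchant-category lookup are all made up for the example.

```python
# Toy model of the read path: the mainframe publishes a full-fledged
# event, the streaming layer enriches it, and the result lands in a
# query-oriented store standing in for Cassandra. All names and fields
# here are hypothetical illustrations, not the real schema.

def publish(event: dict, topic: list) -> None:
    """Mainframe-side publisher: append a full event to the stream."""
    topic.append(event)

def enrich(event: dict) -> dict:
    """Event processor: add derived fields before the write,
    e.g. mapping a raw merchant string to a friendly category."""
    categories = {"STARBUCKS #123": "Coffee Shop"}
    return {**event, "category": categories.get(event["merchant"], "Other")}

def consume(topic: list, store: dict) -> None:
    """Streaming consumer: enrich each event and store it by customer
    key, the way the mobile/web app will read it back."""
    for event in topic:
        row = enrich(event)
        store.setdefault(row["customer_id"], []).append(row)

topic, store = [], {}
publish({"customer_id": "c1", "merchant": "STARBUCKS #123", "amount": 4.50}, topic)
consume(topic, store)
print(store["c1"][0]["category"])  # prints: Coffee Shop
```

The point of the sketch is the shape of the flow: the write side never changes, and the store is keyed the way the reads will come in.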
So some of those enhancements and enrichments happened, and we built different kinds of services on top — you can search, you can do wildcard search, on the data that's in Cassandra — so you can build a lot more features. That's what we did.

Now we're going to go into this particular data pipe, what we did at the mainframe, and some of the problems and challenges we had. There were numerous problems we encountered; here we're only listing some of them. Of course the new system — whatever we built, the pipe, Cassandra, all of it — has to be reliable, and it all has to scale. We'll go into three of the problems, in the best interest of time.

The first one is data-transfer latency. For this whole thing to be meaningful, the time from when the event actually happens in the mainframe to when it reaches Cassandra has to be less than a second. The reason that's important: if you change an address in the mainframe and you go back to look at the address, you want the address reflected there. Same thing with transactions: today when you swipe the card, it goes to the mainframe and customers immediately look at it. If the data now lives in Cassandra and there's even a 10-second delay, you're degrading the customer experience — what it is today versus what we proposed. In that sense, we didn't want much degradation of what customers actually have today.

The next one is missing events.
We are a bank. If even one transaction is gone, customers call you. Even if it's one cent off, people are going to call us. So we wanted to be very, very careful and build systems and processes to detect missing events. And the next one: there are potential problems with events not even being published — the mainframe itself could have problems, or maybe some business rules failed and the events never went out. So we had to have some reconciliation between the mainframe and Cassandra, run at a regular frequency. We'll get into the details of these problems and how we solved them. Venkat, do you want to go ahead and talk about this?

Thank you, Shesha. Okay, so the three problems Shesha was talking about. Latency: without bringing the latency down to one second, this project doesn't exist. We have done a number of performance tests — we fired more than 500. So the first thing is: where is the latency, and how do we reduce it? We started with the mainframe. Shesha mentioned CDC; there are reasons why we did not use CDC, and this is one of them. On the mainframe, they tuned the number of threads — how many threads are listening to the events — and added more listeners and more partitions on the Kafka side, and that took the latency under 500 milliseconds within the mainframe. That's one part of it; there were a number of tests we did, but that's one of them. Then the mainframe publishes to the streaming engine. This is based on Kafka and Spark, and there is a Spring Boot REST API gateway for the mainframe to publish to. It's a complete abstraction for the mainframe: they just publish to an endpoint, and everything is taken care of behind the scenes.
So the streaming engine is probably processing in less than 100 milliseconds today. We started at more than a second — when we started this journey three years ago with traditional Spark DStreams, which is micro-batch streaming, it was taking more than 10 seconds. Then Spark Structured Streaming came along, and continuous processing, which is one part of Structured Streaming, helped us: you're processing one message at a time, and that also helped reduce the latency. And of course Kafka: you have to have the right number of partitions to speed things up, and using the right number of partitions also brought the latency down under 100 milliseconds. That means from the moment an event hits the streaming platform — the streaming box right here — we're taking less than 100 milliseconds to write it to Cassandra.

The other important piece was checkpointing. Our destination here is Cassandra. Spark comes with fault tolerance natively by checkpointing to some Hadoop system or NFS, but that's an external system. One of Cassandra's features is batch atomicity, so you can batch your writes: you have the data row, you have the checkpoint, and there's another table for journaling — I'll talk about that in a bit. You can batch them together and write them to Cassandra, and it gives you very high write throughput when you do that. With that, we were able to bring the end-to-end latency under one second. One more thing with the checkpointing: we modified some of the native Spark code, which gave us full control of our checkpointing, because we want to be able to go back and replay those events.
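The checkpoint-in-the-destination idea can be sketched in a few lines. This is a simplified plain-Python model, not the production code: the `Store` class, its two "tables", and the offset bookkeeping are illustrative stand-ins for Cassandra tables written in one logged batch.

```python
# Sketch of checkpointing into the destination store itself. Instead
# of Spark's HDFS/NFS checkpoint, the data row and the per-partition
# offset are written together, mimicking the atomicity of a Cassandra
# logged batch. Names and structures are illustrative.

class Store:
    def __init__(self) -> None:
        self.rows = {}         # event_id -> payload ("data" table)
        self.checkpoints = {}  # (topic, partition) -> last offset written

    def write_batch(self, event_id, payload, tp, offset) -> None:
        """Write the row and the checkpoint as one unit, the way a
        logged batch would make them succeed or fail together."""
        self.rows[event_id] = payload
        self.checkpoints[tp] = offset

def next_offset(store: Store, tp) -> int:
    """Resume (or replay) from the offset recorded in the store,
    independent of Kafka's own consumer-group offsets."""
    return store.checkpoints.get(tp, -1) + 1

store = Store()
store.write_batch("e1", {"address": "12 Main St"}, ("events", 0), 41)
print(next_offset(store, ("events", 0)))  # prints: 42
# Replaying older events is just rewinding the stored offset:
store.checkpoints[("events", 0)] = 9
print(next_offset(store, ("events", 0)))  # prints: 10
```

Because the offsets live next to the data, a restart (or a deliberate rewind) reads its starting point from the same store it writes to.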
We are not dependent on Kafka — it's completely decoupled, so you can always go back to Cassandra, update your offsets, and it will replay the events. So that's an important problem we had to solve.

Moving on, the second problem was data loss. To give an example: you go to UPS and ship a package, and you worry about whether the package reached its destination. That's why UPS gives you a tracking number — you have confidence because you can go track it, and when the package is delivered you see the tracking number updated. Same concept here. If you see that box labeled "journal": when the mainframe publishes any event, we track it. It costs almost nothing — it's just a tracking number; we're not storing the entire payload, just a journal table with the tracking number. The tracking number is then passed along through the streaming pipe, so when we write to Cassandra, we also store that tracking number. It's short lived — we kept it for three days, because you'll come to know within the next 30 minutes if events are missing. What we built, again using Spark — not streaming this time; it's a time-interval batch pipeline built on Spark — just checks every 30 minutes. Depending on the criticality of your data, you can change that to even five minutes. What it's checking is this: if the mainframe published a million events, did a million events reach Cassandra? It doesn't do any data validation; it just identifies that there is data loss. Humans make errors — we have all the alerting enabled today, but when something fails, someone may have missed addressing some of those events. And missing a transaction in Cassandra is a big deal: if a customer doesn't see a transaction, it's a big deal for that customer.
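The 30-minute check itself is just set arithmetic over tracking numbers. A minimal sketch, with made-up tracking IDs and names, assuming the journal and the destination each expose their tracking numbers as a set:

```python
# Sketch of the periodic data-loss check: compare the tracking numbers
# journaled at publish time against the tracking numbers stored with
# the Cassandra writes. It validates no payloads; it only flags that
# events went missing somewhere in the pipe. Names are illustrative.

def find_missing(journaled: set, written: set) -> set:
    """Tracking numbers the mainframe published but that never
    arrived at the destination."""
    return journaled - written

journaled = {"t1", "t2", "t3", "t4"}  # journal table (kept ~3 days)
written = {"t1", "t2", "t4"}          # tracking numbers stored with rows
print(sorted(find_missing(journaled, written)))  # prints: ['t3']
```

Anything in the result raises an alert; what the lost event contained is investigated separately, which is why the journal only needs the tracking number and not the payload.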
So that's data loss. Now, if you look carefully at the boxes, there is an arrow between the mainframe and the publish step. When you change an address, suppose the mainframe itself never published the data to the streaming engine. The data-loss check doesn't tell you that, because as far as it knows the event never happened. That's where the next problem comes in: data reconciliation. The writes are happening in the mainframe and the reads are happening against Cassandra — how do you know the data is in sync? Yes, we have done all the quality testing — thousands of test cases, all validated in the lower environments before moving to production — but there are always edge cases. As long as we're using computer systems, there can always be problems, so you need to be ready for that. Data reconciliation is for those edge cases. It's a capability, a product — one of the transformations in the software platform we built; think of it as a transformation you can plug into your pipe to verify the data between two systems. On the left side, the mainframe is publishing the data, and keep in mind we are not putting a burden on the mainframe — the whole model is to reduce the cost and the MIPS usage on the mainframe, so we did not add a burden. Like any company, we have a data lake, so it's the same concept: the mainframe is already sending the data to the data lake — it's day-minus-one data, and that's fine. We use the data lake, with every element and column that is published in real time, and built a daily reconciliation pipeline. What it does is identify a few kinds of issues. If you change an address but the mainframe never published that record to the streaming engine, it identifies it as a missing record.
Hey, there's an address-change record, but Cassandra did not get it. Another kind is mismatches, where the rows and columns are compared — it's a very fast utility that we built, so it can process millions of records. A mismatch is, say, you updated your street number but it was not published through to Cassandra. And the problems could be anywhere — in the streaming engine, in Spark, some memory loss that happens, events that get lost. Data reconciliation is another control process that can identify all those issues proactively, without impacting the customer experience.

With that said, the key takeaways. Three years ago when we started real time at US Bank, everybody thought this was impossible — that you could do it in one second without impacting the customer experience. But it is possible using the right technology: Cassandra and Spark, of course, and there are other tools — I'm not saying Spark is the only one; Flink is another one, and potentially we could even use Flink for the real-time piece going forward, but we have Spark built with both batch and real-time components. So it is solvable. Latency is one of the problems, and we solved it at less than one second.
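Going back to the reconciliation step for a moment, the comparison it performs — missing records plus column-level mismatches between the day-minus-one lake export and Cassandra — can be sketched like this. The keys and column names are invented for illustration; the real utility operates at much larger scale.

```python
# Sketch of the daily reconciliation transformation: compare the
# day-minus-one data-lake export (fed from the mainframe) against the
# rows in Cassandra, reporting records that never arrived and
# column-level mismatches. Keys and columns are illustrative.

def reconcile(lake: dict, cassandra: dict):
    missing, mismatches = [], []
    for key, lake_row in lake.items():
        cass_row = cassandra.get(key)
        if cass_row is None:
            missing.append(key)           # record never reached Cassandra
            continue
        for col, val in lake_row.items():
            if cass_row.get(col) != val:  # e.g. street number out of sync
                mismatches.append((key, col))
    return missing, mismatches

lake = {"c1": {"street": "12 Main St"}, "c2": {"street": "9 Oak Ave"}}
cass = {"c1": {"street": "14 Main St"}}
missing, mismatches = reconcile(lake, cass)
print(missing)     # prints: ['c2']
print(mismatches)  # prints: [('c1', 'street')]
```

The output quantifies the blast radius — which customers, which columns — which is what lets the team react before customers notice.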
And on the latency, it's even better than that: I would say 95 percent of messages are under, you know, five hundred milliseconds between the mainframe and Cassandra. So I would say thank you to everyone for being here, thanks to DataStax and Cassandra for giving us this opportunity, and of course thanks to the team at US Bank — the entire team who helped us and encouraged us, and all the leadership. Thanks to everyone for making it happen. I'll open the floor for any questions.

So — we migrated the data, and the structures in the mainframe could be different from how we store it in Cassandra, because the mainframe structures were probably best suited to its storage system — how it indexes the data, or whatever mechanism it uses — whereas we stored the data in Cassandra the way the reads from the clients would come, from the mobile app or the web app. So the structures could be different, but the data is the same. The model is not just one-to-one; database-to-table is not one-to-one. Yes. Yeah.

Okay, any other questions? Yeah. So — we do use Solr. The simple answer is yes; the fuller answer is that most of the use cases are key based. We have a customer key and account keys, so we can retrieve directly: when the user logs in — in a bank, at least, we clearly know who the user is and what accounts they're related to — we can pull the exact data. But there are some edge cases, a small percentage of the use cases, where search is needed, and search was provided. For those, the latency is slightly higher, because Solr has to re-index. So we have to carefully orchestrate the searches in such a way that the customer doesn't insert into the mainframe and then search within the next second — the record may or may not exist yet. It's for some of the non-critical functionality, like voice, right?
The customer isn't expecting that low latency there, so Solr is used, wisely, in those kinds of use cases — but not when you log into the mobile app and want to see your transactions; that's all key based, and that provides the low latency. Yes, it comes with it: if you insert into Cassandra, of course the data is there and persistent, but re-indexing in Solr takes a few extra seconds, so that's not counted in the budget of one second. That's what I meant — it may take two seconds. But yes, it's in billions. Billions, yeah.

Right — I think it depends on the peak traffic; we see around 300 TPS during business hours. For the one second: I'm not claiming an exact 95th percentile, but what we measured is that if you have a thousand messages in a second, 95 percent of those messages are written in under one second — we proved that. There's always spillover, and the spillover is not minutes, but it can be an extra second or more. We have alerts enabled: anything below 95, tell me why. Sometimes the mainframe might be publishing a little late, and sometimes it's in the streaming pipe — it's a little delayed, there's Kafka lag, all of that. We have fine-tuned this over some months, maybe even a year. And then, to Venkat's point.
We have made sure that 99 percent of the events — not 95 — are within the one second. As he said, we fine-tuned everything from the number of threads on the mainframe, to the partitions, to the technology of the event processor, to the checkpointing — a lot of effort has gone in there. Without that, as we said, there's no reason to build it; we built it to make sure we meet that target. Sometimes it even matters where Spark runs: on VMs, which disks you're using — SSDs or not — the garbage collection; we went down to that level. Spark is better if you use SSDs, not virtual disks.

No, it's not a security concern. Sometimes what happens is, when the data gets written — let me give maybe one data point — when you write to some of the mainframes, the event doesn't get published. Maybe it crashed, or maybe not; we don't know the exact reasons, but let's assume it didn't write it. The data-loss check won't catch that, because nothing was actually published. But the batch export from the mainframe would have that record, and you won't have a corresponding record in Cassandra — then we know. That's the data recon. What happens is, when we went to production, say on day one, the data recon would produce, I don't know, a thousand errors. But you fix all of those edge cases, and now if you look at it, maybe you'll have 10, and we constantly keep fixing them. Hopefully there will be a day when data recon is not needed, but of course we keep it as insurance. And it's not that all of your data is wrong — if all of it were wrong, you'd come to know immediately.
So it's a subset of customers that is impacted. Maybe it's just one of the products we offer — I'm making this up — say money market. If something is wrong with the money market data, we know exactly how many customers have money market accounts, and this can identify it. It's like when you cook rice: you don't check every grain, you just sample a little. Data reconciliation is sampling in the same way. The moment we identify an issue, we can quantify all of the impacted customers and quickly react to it, and it can be fixed as well.

That's not — at least, this project is not driving that. But as we told you, new domains are being built outside, and we are migrating the reads. The reason the mainframes are still there — at least in our thinking — is that, say we acquired another bank: the writes themselves are going to go up. Yes, we drained out all the reads, but the writes actually went up, because we are serving more customers now. So the goal is not to replace the mainframe but to augment it, in such a way that we are doing different kinds of services and serving more volume of interactions to the customer — interactions that were probably less cost-effective on the mainframe. Think of it that way. And maybe at some point in the future the core will be migrated, but that's not in our line of sight.

Yeah — all the transactions: your card swipes, ACH transactions, money-movement transactions, all of that. Yeah.

Got a question? Okay. Okay, all right — time's up, apparently. Thank you, thank you guys.