So thank you all for coming here. It's a pleasant day outside, and I'm sure there are many different places everybody would rather be, so thanks for choosing to be in a place that talks about mainframe and open source together. A paradoxical statement, maybe, but we'll talk about it. Thanks to the Linux Foundation for organizing and for selecting me as the speaker. And the third thanks is to the entire team: a lot of people worked on this, and I'm just the representative speaking on their behalf.

I'm sure all of you have a bank account or a credit card, and I'm sure you've always wondered how, when somebody swipes a card, you see the transaction in your mobile app or web app. I'm going to at least partially unravel that so you have a better understanding of how that happens.

I myself am Seishi Gurudanti. I've worked at the bank for around six years, previously on the data platform, and this presentation actually represents work from my time in data platforms. I live in the Bay Area, and that's my LinkedIn profile. I write, read, travel, and do photography. So that's a little bit about myself. A colleague who used to work for me and was also part of the team couldn't make it here due to different circumstances, but that's what this is.

Okay. A little bit about US Bank before we get into the actual technical presentation, so there's some context. At US Bank we serve consumers, we serve small businesses, and we serve large enterprise customers. This talk is mostly about the consumer segment. These percentages represent the relative revenue the bank gets from the different business lines. Consumers are primarily in consumer and business banking, where we have checking and savings deposits; they're part of payment services, where we offer credit cards to consumers; and they're part of wealth management, because we have wealth management solutions and products for consumers. The last business unit, the corporate and commercial bank, has no consumers. So that's the landscape of where consumers fit in.

If you look at consumer transactions, especially digital transactions, they've been going through the roof relative to four years ago, and the growth is across the board. These are just a couple of representative examples: digital transactions and mobile deposits. Let me explain a little bit: this is not just new customers signing up and doing digital transactions. There's a fundamental change in how the products are evolving and how customer behavior is changing, and that's the trend towards digital transactions. This talk is about the mainframes and how we migrated, and this increase in digital transactions is probably the main driver.

So now I'm going to dig deeper into why digital transactions are actually going up. I'll skip that. If you look at digital transactions, why are they going up? One reason is that the physical analog of a product typically generates fewer transactions.
If you ever went to Blockbuster and rented something, you probably had one transaction for renting the DVD and one for returning it: two rows in some database somewhere. But if you're on Netflix, you log in, you browse for maybe 20 minutes, and then watch for 10 minutes. The number of transactions hitting the server is phenomenally higher. It's the same trend in the bank: for anything with a physical analog, the digital equivalent generates more transactions hitting your backend. That's the core philosophy. And sometimes, if you don't keep up, you die. That's the Kodak example: the number of photographs being taken is actually quite high, but Kodak doesn't exist now.

The other driver is that customers have a different kind of engagement. It's not just a point-in-time transaction, where you went to a bank branch and had one interaction; customers now expect 24/7 engagement. You see that driven by ride-share apps: you order an Uber and you can see the driver at every turn they take. Imagine the number of client-server round trips that involves. It's the same with delivery apps and with airlines, where you can track your baggage and track the flight. That kind of continuous engagement is what customers are asking for, and there are equivalent analogs in the banking space: you want to track transactions, you want to track application status, you want to track your mortgage application status. There's a lot of tracking to be done, and that has an impact on how we evolve the architecture.

The last piece is that the complexity of the products is changing. We don't just have four business lines; within each business line we have a lot of products, and these products are being bundled, re-bundled, and sold. That means a monolithic architecture cannot function very well, because many different teams cannot work on it. All of these are probably standard things you've all heard, so I'm not going to go much deeper, but they are some of the main drivers for why we went on this journey.

So if you look at how the architecture was, this is a very simplified version of it. There's a UI, most likely a mobile app, a web app, or a banker app. We used to have a thin domain layer with no business logic in it; it was just an API translator with some level of security, like a gateway. That would hit the mainframe, and the mainframe had the business logic, the rules and validation, and also the database. That's how it was. A simple example: an address change flows all the way through to the mainframe, which validates the address and saves it, and then you can retrieve the same information. And behind the scenes, this is what I was talking about: when you swipe a credit card, those transactions don't come through your mobile app or web app; they update the mainframe from the back end.
So whether it's your ACH transactions or your credit card transactions, they're all coming in. Think of them as CRUD operations: new records are being created, and records are being read, updated, or deleted. That's pretty much the entire gamut. And of course you can read the transactions; the typical use case is that you want to see your transaction list, you ask the mainframe, the mainframe serves the data, and the data is shown in the UI. Nothing fancy, quite simple, and this is how it worked.

There are a few problems, a few concerns, with this. One is that, as I said, the mainframe is a single monolith, so many teams can't work on it. Another is capacity: you're stuck with the capacity you have. I'm not going to go into all of these details, but you know that every company probably wants to get away from the mainframe, and the reasons why are very simple.

So how are we doing it, and what's the scope of this particular conversation? First, of course, all the new things we're building, we're not building on the mainframe. That's the simple one, and I'm not going to go over it. The second is migrating some of the reads off the mainframe to open source and other software; that's the main focus of this talk. The third is migrating some of the non-core functionality. By non-core I mean things like customer management, which has features like KYC checks and which the mainframes also handle today. Those aren't actually processing money movement like ACH or credit card, so they're a little easier to migrate, and they're what we're targeting third. And the last one is migrating the actual core, which is complex and expensive. Those are the ways we're doing it, and that's roughly how the cost and complexity stack up. For this particular talk, I'm going to focus only on how we migrated the reads away from the mainframe to open source.

The starting point is the previous architecture, and nothing changes in the write pattern: all the writes still happen on the mainframe, whether it's an address change the customer makes from the mobile app or the backend transactions coming from ACH and credit card processing. Exactly the same. What we did first is put a module on the mainframe: we modified the mainframe code to publish an event for every CRUD operation, mostly for the updates and creates. That module publishes it to an event bus. The event bus is a combination of an API that abstracts Kafka, Kafka itself, and Spark, which reads from the streaming infrastructure and puts the data into Cassandra. On top of Cassandra we have the domain layer, which is Spring Boot based and where the business logic now resides. And when a read request comes in, it now goes to the domain layer on the right-hand side, which hits Cassandra.
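To make that read path concrete, here is a minimal sketch of what a Spring Boot domain-layer endpoint serving transactions out of Cassandra could look like. This is an illustration, not our actual code: the endpoint path, keyspace, table, and column names are assumptions, and it presumes the Spring Boot Cassandra starter is on the classpath so that a CqlSession bean is auto-configured.

```java
// Hypothetical read-side endpoint: the UI asks this service instead of the mainframe.
import java.util.List;
import java.util.stream.Collectors;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Shape of a transaction as served to the UI; columns are illustrative.
record TransactionView(String transactionId, long amountCents,
                       String merchantName, String merchantCategory) {}

@RestController
class TransactionReadController {

    private final CqlSession session;           // auto-configured by Spring Boot's Cassandra starter
    private final PreparedStatement byAccount;

    TransactionReadController(CqlSession session) {
        this.session = session;
        this.byAccount = session.prepare(
            "SELECT transaction_id, amount_cents, merchant_name, merchant_category "
          + "FROM consumer.transactions_by_account WHERE account_id = ?");
    }

    // A read request from the mobile or web app lands here and is served from Cassandra.
    @GetMapping("/accounts/{accountId}/transactions")
    List<TransactionView> list(@PathVariable String accountId) {
        return session.execute(byAccount.bind(accountId)).all().stream()
                .map(TransactionReadController::toView)
                .collect(Collectors.toList());
    }

    private static TransactionView toView(Row row) {
        return new TransactionView(
                row.getString("transaction_id"),
                row.getLong("amount_cents"),
                row.getString("merchant_name"),
                row.getString("merchant_category"));
    }
}
```

In the real domain layer the business logic, enrichment, and search capabilities sit on top of this; the sketch only shows the bare read path from Cassandra to the UI.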
From an architectural pattern standpoint, this is a standard CQRS pattern, where the writes happen on the left-hand side and the reads happen on the right-hand side. Okay, so, on the domain layer here: one other benefit we get from moving the data off the mainframe is that the domain layer can be a lot more sophisticated. We enhance the data. For example, traditionally the transactions you saw were very simple, but these days most transactions are enriched: you see the address of where you shopped, you see the category of the merchant. All of that can be added in the domain layer. We also built a voice capability so you can actually search your transactions; a lot of voice capability, enrichment, and search capability was built. All of those capabilities, which didn't exist in the previous system, we were able to add. That's the benefit beyond reducing cost. Cost was a driver, but not the most important one; building the new capabilities, as I said, was the driver. I'll take questions later so we can keep moving.

So what were the main learnings, the main things we had to take care of? There were many of them. I'm not going to go over all of them, but I'll dig deeper into a few. The first is that mainframes are known for their reliability, data consistency, and security, and reliability was probably the highest bar. Mainframe reliability is typically in the range of almost three nines, getting close to four nines. There are some maintenance windows, but it's a pretty high bar to meet. So when we went to open source, most of the technologies we picked are distributed and redundant. Kafka is a great example: it's redundant and distributed. Cassandra is distributed and redundant. Even Spark you can run in Kubernetes, and that's how we ran it, so you get the distribution and redundancy. Same with the Spring Boot domain layer: we ran it in Kubernetes so it's distributed, scaled, and highly available. All of them were built to be highly available across data centers. That way we could hit the first bar, which is that reliability should be higher than the mainframe's.

The next one is scale. Mainframes are good at scale, but they can't scale up quickly when the volumes go up, and as I explained, the main reason we're migrating is that the curve of digital transactions is going up significantly. If you look at events in the financial industry, like the stimulus checks during COVID, traffic went up something like nine times. So we need capacity that we can ramp up very quickly, which was harder, and that's one of the problems we had to solve for.
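To give a feel for what the pipe from the mainframe into Cassandra looks like on the Spark side, here is a minimal sketch of that leg, reading the published events from Kafka and appending them to Cassandra. Everything named here (broker, topic, keyspace, table) is an assumption for illustration, it presumes the Spark Cassandra connector is on the classpath, and it shows the simple micro-batch form, which, as I'll describe later, we had to tune away from to hit our latency target.

```java
// Hypothetical Kafka -> Cassandra leg of the pipe (simple micro-batch form).
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class MainframeEventSink {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("mainframe-event-sink")
                .getOrCreate();

        // Events the mainframe module publishes (via the Kafka-abstracting API).
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")   // assumed broker
                .option("subscribe", "mainframe.crud.events")      // assumed topic
                .load()
                .selectExpr("CAST(key AS STRING) AS event_id",
                            "CAST(value AS STRING) AS payload",
                            "timestamp AS published_at");

        // Each micro-batch is appended to Cassandra through the Spark Cassandra connector.
        VoidFunction2<Dataset<Row>, Long> writeToCassandra = (batch, batchId) ->
                batch.write()
                     .format("org.apache.spark.sql.cassandra")
                     .option("keyspace", "consumer")                // assumed keyspace
                     .option("table", "transaction_events")         // assumed table
                     .mode("append")
                     .save();

        StreamingQuery query = events.writeStream()
                .foreachBatch(writeToCassandra)
                .option("checkpointLocation", "/checkpoints/mainframe-event-sink")
                .trigger(Trigger.ProcessingTime("1 second"))
                .start();

        query.awaitTermination();
    }
}
```

In the actual pipe there is also the API abstraction over Kafka, the journaling described later, and checkpointing into the NoSQL store itself; this only shows the basic shape.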
And about that pipe between the mainframe and Cassandra: there are many different business lines using it, as I said: consumer, retail, wealth. So we didn't want it built as a custom pipeline for every business line or every particular event. We wanted to build it once, and building it once actually increases reliability significantly, because every team doesn't have to test that pipe. I can go over some of the features we built into the pipe later in the presentation. There's a lot of core capability built into it, and the data pipelines of every business line are basically configuration driven; there's no code these teams have to develop.

The next one, which is probably the most significant and which I'll dig deeper into, is the data transfer latency. If somebody does a transaction on the mainframe, the time it takes to go to the next screen is typically around one second, and we had to match that. So the data transfer from the mainframe to Cassandra has to happen in less than a second. That data transfer was probably the biggest problem we had.

Then, beyond scalability, there are issues with peak traffic, because the Kafka topics can receive sudden bursts. The batch processing happens on the mainframe in the middle of the night, and suddenly there's a big spike of transactions for the settled payments, while for most of the day there's little traffic. So it's very bursty, and handling those bursts was another thing we solved for.

Most of the remaining items are the data-related ones I'm going to dig deeper into. The first is detection of missing events. In a file-based transfer it's quite easy to figure out the end of the file: you have 1,000 records, you load 1,000 records, clean slate, everybody's happy. There's auditability, there's tracking. But in a stream you don't know when you began or when you ended, so you can't easily do that accounting. We had to solve that problem. The items in italics are the ones I'm going to dig deeper into.

And then the errors. There are a few kinds of errors. One is simply bugs: whatever software testing you do, you cannot test for the millions of records that are out there. There's no way it's possible. So we had to figure out a way that, even though we deployed the code and tested it to the best of our abilities, we could keep checking the data in real time as more data comes in. We also built a failsafe switch, meaning we run both the Cassandra version and the mainframe version, just in case something goes wrong, so we can actually switch back. That was key, maybe not for the entire life of the system, but at least for the first three to six months, while we do the deployments, learn, and fix any bugs. In the worst case, you have a failsafe switch. That was probably a good part of the reason we got the approvals, because there was always a failsafe switch.

The next is that we had to react to real-time events, like the SLA in the pipe going significantly above one second, and make decisions that impact customers, and a lot of things like that. So we had to build robust alerting and monitoring across the entire pipe and even Cassandra to make sure we could react quickly. These are some of the problems; there were a lot more, but these are the ones that made it to the slide.
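As a rough illustration of that failsafe switch, here is a sketch of a read path that can be flipped between the new Cassandra-backed service and the legacy mainframe path through configuration rather than a redeploy. The interfaces and the flag supplier are assumptions for illustration; in our rollout the actual switch for customer-facing traffic sat in the UI layer.

```java
// Sketch of a failsafe switch: both read paths stay alive, and configuration
// decides which one serves the customer. Names here are hypothetical.
import java.util.function.BooleanSupplier;

interface TransactionSource {
    String transactionsJson(String accountId);
}

class FailsafeTransactionReader implements TransactionSource {

    private final TransactionSource cassandraPath;   // new CQRS read path
    private final TransactionSource mainframePath;   // legacy path kept warm as the fallback
    private final BooleanSupplier useCassandra;      // e.g. backed by a feature-flag or config service

    FailsafeTransactionReader(TransactionSource cassandraPath,
                              TransactionSource mainframePath,
                              BooleanSupplier useCassandra) {
        this.cassandraPath = cassandraPath;
        this.mainframePath = mainframePath;
        this.useCassandra = useCassandra;
    }

    @Override
    public String transactionsJson(String accountId) {
        if (!useCassandra.getAsBoolean()) {
            // Global switch engaged: serve reads from the mainframe exactly as before.
            return mainframePath.transactionsJson(accountId);
        }
        try {
            return cassandraPath.transactionsJson(accountId);
        } catch (RuntimeException e) {
            // Per-request fallback if the new path misbehaves (an extra safeguard,
            // beyond the global switch described in the talk).
            return mainframePath.transactionsJson(accountId);
        }
    }
}
```

The per-request fallback in the catch block goes beyond what was described; the essential point is simply that the mainframe path stays warm until confidence is earned.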
So these are the three things that, in the interest of time, we're going to go deeper into. Just to summarize again: the mainframe publishes to MQ Series, a listener on MQ Series picks the events up and publishes them to the Spring Boot, Kafka, and Spark infrastructure, and that puts them into the NoSQL store with an atomic write. By atomic write I mean that if there are multiple tables, we write them atomically so there's no inconsistency in the data. That's the first piece. But when we started out, the latency was almost around five seconds, and as I said, the whole project would have been irrelevant if we didn't get to one second. So we had to work really, really hard to get there. We did a lot of performance tuning through the whole pipe: partitions, write throughput, memory settings, et cetera. Some of the other key things we had to learn and move away from: one was moving from micro-batching in the streaming layer to continuous streaming. The next was checkpointing, which was probably a real learning for us: Spark would checkpoint to a file server, and that was very expensive. So we checkpointed to the NoSQL store itself; when we write the data, we do the checkpointing to that same Cassandra. Our latency is now less than a second, mostly around 500 milliseconds, but this probably took the most effort.

The next one is the detection of missing events. Again: mainframe, MQ Series, Kafka, and then Cassandra on the other side. What we did is, whenever the data gets written to MQ Series, there's an entry in a journal. Think of the journal as just a table with a unique message ID. The inspiration for this came from package tracking: if you look at FedEx or UPS tracking, they track every package. We wanted to track every event; the analogy is that simple. Every event that got published got a unique ID, and we stored those unique IDs in the journal table. Then, when the Spark job inserted those rows of data into Cassandra, it again wrote journal entries recording which event IDs, or tracking IDs, it had written. Every hour we go back and compare, to make sure all these packets actually arrived. As I said, with multiple business lines there's a lot of data there, but it's still not too bad; we can actually do the comparison.

The next point is important because we're a financial services company: the order of the data is very, very important. If we missed an event, we wouldn't go back to the queue, because the latest data could have changed since then. To preserve the order, we go back to the mainframe, get the current data, and then insert it. Okay, I'll finish up; there's only one more slide, and then we'll get back to questions.
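To make that journaling concrete, here is a simplified sketch of the hourly comparison: event IDs journaled at publish time are checked against the IDs journaled when the rows landed in Cassandra, and anything missing is re-fetched from the mainframe rather than replayed from the queue, so the latest state wins and ordering is preserved. The interfaces here are assumptions standing in for the real journal tables and the mainframe lookup.

```java
// Simplified hourly journal reconciliation; interfaces and names are hypothetical.
import java.time.Instant;
import java.util.HashSet;
import java.util.Set;

interface JournalStore {
    Set<String> publishedEventIds(Instant from, Instant to);   // recorded when the event hit MQ/Kafka
    Set<String> persistedEventIds(Instant from, Instant to);   // recorded when Spark wrote to Cassandra
}

interface MainframeReplay {
    void refetchAndUpsert(String eventId);   // pull the current record from the mainframe and upsert it
}

class MissingEventDetector {

    private final JournalStore journal;
    private final MainframeReplay replay;

    MissingEventDetector(JournalStore journal, MainframeReplay replay) {
        this.journal = journal;
        this.replay = replay;
    }

    /** Runs once an hour over the previous hour's window. */
    void reconcile(Instant windowStart, Instant windowEnd) {
        Set<String> published = journal.publishedEventIds(windowStart, windowEnd);
        Set<String> persisted = journal.persistedEventIds(windowStart, windowEnd);

        Set<String> missing = new HashSet<>(published);
        missing.removeAll(persisted);   // events that never made it to Cassandra

        for (String eventId : missing) {
            // Do not replay the stale queue message: ask the mainframe for the
            // current state so an out-of-order update cannot clobber newer data.
            replay.refetchAndUpsert(eventId);
        }
    }
}
```

Going back to the mainframe instead of the queue trades a little extra load on the system of record for correctness of ordering, which, for a bank, is the right trade.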
The next one is a little bit of an extension of that. Think of the FedEx packages being delivered: now, did I deliver all the contents? That was also a problem, because if there's a bug in the code, then instead of updating one column we might have updated another. And there are so many edge cases that you can never test them all, whether in a production environment or in an integration or UAT testing environment.

So what we did is export the data out of the mainframe. We already have existing processes that take the delta data and send it across to different systems, so we took advantage of that daily snapshot. We take that snapshot and run a data reconciliation program that hits the Cassandra APIs, gets the data out, and compares it row by row across all the data. Of course, this has its own problems. The big one was the number of false positives you get, because time has moved on: the nightly batches and new data would have hit Cassandra after midnight, after the snapshot was cut. So we had to write logic to suppress all the false positives and surface the true positives, and that took us some time to build. But we were able to fix a lot of records. Some examples: there were bugs on the mainframe side where certain events would not get published, so you would have missing records, orphan records, and mismatches. This check took care of making sure the contents of the package were actually, logically, delivered to the Cassandra side.

Part of the reason for all these data checks is that we're a bank. If a post goes missing on social media or anything like that, no big deal, nobody is worried. But if you miss a transaction, even if it's a dollar, it's going to get noticed. And one thing I didn't talk about is that we had a rollout strategy: we rolled it out to a select group of employees, then to the larger employee base, and then to a small percentage of customers. That whole approach was to make sure the data and the reliability were all there.

So that's pretty much the end of the talk. Three highlights from it: the reason we rearchitected this was not just cost savings, though that was one of them, but to do more services for the customers and to be ready when the scale-up happens. The next is that open source was genuinely useful and critical. And the last is that we had to build the additional data checks and parity. So, yeah, open for questions. Let me see how many minutes we've got left. Okay, looks like around 10 minutes for questions. Sure, and I'll come back to your question. Yeah, go ahead.

They say, oh, how are we going to get support if something breaks? So, yes, support was one consideration, and then we also had the vulnerability fixes, making sure we're running the latest versions and fixing the vulnerabilities. There was a larger drive for more open source within the bank, and we actually have an open source office now, too. That said, if I go back, some pieces of this were open source supported by vendors and some weren't. Yeah, like, for example, Spring Boot.

Thank you. Just a quick question: how long was the migration effort? So the migration effort, if you look at all the business lines, they're still moving workloads; some workloads have moved and are in production. If we go back to the core engineering effort that needed to be built, the streaming pipe we built was one of the core engineering efforts.
Cassandra was already in production for a different workload, because as I said, all the new domains were already moving to a different platform, so they were already on Cassandra. This pipe probably took us six or seven months to build first, so that every pipeline built on it didn't have to redo all the data checks; the journaling and all of that came free for them. Then each pipeline, if you add that plus the release mechanism, which is about releasing to the employees and going through that cycle, took around 18 months.

So that means that when you swipe the card, that transaction still all happens on the mainframe. I mean, that was not the use case here, but potentially we could actually do that, yeah.

Yeah, I don't want to speak for audit, but I don't think there's any restriction on the system. It's just that the checks and balances that went into the mainframe, somebody would need to replicate all of that for that function on the open source, Cassandra side. I mean, all the reads are happening there and customers are looking at the data; in a way, customers check the data every day. From an auditability standpoint, those are two different functions: the customer checks it in some sense, but for the auditor, the mainframe is still probably the source of truth.

Given everything you had to do to make a new service layer work, to provide functionality where the user doesn't notice any difference, why would you build a new domain to serve this, as opposed to enhancing your existing domain with a materialized view of the transactions?

Let me ask, what do you mean by materialized?

On the left-hand side, with the existing transaction flow, I make a change and the change is reflected back to the user in the presentation layer. Why would I build a new domain and all of this to store a secondary view, rather than saying: hey, domain that deals with, I don't want to say customer identity information, but the customer as an entity, you're the domain that owns the customer; you build a materialized view, and eventual consistency means that at the end of the day, when the mainframe runs its batch, which the bank considers the actual final state of truth, everything settles. Why wouldn't I just say: you maintain transient state and serve it back to the user? So if I make an address change and it's not committed to the mainframe yet, I'll just serve it back from the domain directly. Why build all the rest of this?

So the domain is actually adding value, for example with enhanced transactions, or enriched transactions as we called them. There is additional data that we didn't want to store in the mainframe to begin with. The credit card transactions, the ACH transactions, all of them that are in Cassandra carry more than what's in the mainframe; we appended more data to them. And there are search indexes, because in the mainframe you cannot ask, give me all the transactions at McDonald's; that's there in the new domain.

So the CDC process that moves it into Cassandra is not the sum of the data set; there's stuff fed directly into Cassandra that the mainframe doesn't know about.

Yes.

Okay, that makes more sense. Good.
Is it fair then to say that the Cassandra layer is really there to optimize the consumer's experience, the reads? So it's almost like a BFF pattern, in a way, that just serves the front-end visibility of the data.

Yes, in this use case, definitely. But if you go back to the use case where new domains are built outside of the mainframe, for those Cassandra is read-write. And the second one is what this whole talk is about.

You know, it's funny, I do most of my work in healthcare, and it's always interesting to me how many overlaps there are between some of our different industries. I've been working on a project to separate out our on-prem SQL database and have our cloud NoSQL database, going through change data capture, being able to enrich, and having a completely separate database just for reads. So you're preaching to the choir on this one. I'm curious, though: in the CDC process, using the address change as an example, knowing that an address changed is good, but what you might need to publish to Kafka would include additional information, maybe the customer's name, or their favorite color, or their favorite TV show, for instance. How did you handle that at the very atomic level where the change is happening? I'm guessing on an address table, but you might actually need more information to be able to hydrate the customer record.

So, I was talking to Matt before this. We did a phase one of this before, and we were partially successful. We took one or two use cases and we actually used CDC. Then, when we scaled it up to around 40 different events, we had a lot of problems, because CDC, you're exactly right, is at a granular level, and all the logic used to sit in the pipe to put it together. We had a lot of cases where you just can't put Humpty Dumpty back together. So we abandoned that. In this model, that's why we had to actually modify the mainframe to publish a complete event, so there is no translation; we pretty much avoided all translation and assembly of the package in the pipe. The pipe is just a dumb pipe. Yes, it has some smarts, checking and validation and all that, but we don't open the package, look at the contents, and put it together.

Good, that's what I had to do too. And if you could go forward a few slides, it's not that one, it's the one where you had the pretty pictures and Kafka. One of the things some of our teams do before they go live is hook their pre-production environment up to the production Kafka. Before they go live, they essentially run the live transactions through their pre-production environment and compare what's in the NoSQL database for those same records in prod against what they pre-processed in pre-prod. If they match, they know there's no regression. It only really works if you have some sort of near-real-time asynchronous write from your mainframe so you can do a real apples-to-apples comparison. Just curious if that's a strategy you all have taken, or if I've given everyone in the room a really good idea that hopefully you can use, because your mom's data and healthcare data matter a lot. Just like my mom: my mom is actually a U.S. Bank customer, so thank you very much for making sure that her dollar transactions got through.

Thanks for being a customer.
I'll let her know. So we didn't try that, but our deployment strategy was to roll out to the employees first and then get to the customers. Good job.

No, actually, good question. The data sync was happening for all of them, so the journaling checks, the data reconciliation, and the alerting and monitoring could all be tested. But the only impacted customers were the employees, because we made that switch in the UI layer. Yeah, the UI layer: that's where we would just select a random list of employees and switch them between the two.

Sure, I'll get to you in just a second.

I'm curious: with the storyline you told at the beginning, a lot of the work you did was to build confidence in this migration. Now that you've got that under your belt, what's the next big tentpole on the layer cake you're showing? What's the next big problem, to make all of this relevant?

We're probably in the third stage now. We're thinking about moving all the non-core functionality, and we've actually started thinking about migrating the core right now.

With migrating that non-core functionality, do you find you're having to convince more of the business folks who rely on it? Like, now you have a whole different confidence-building job.

If you look at the non-core functions, they're exactly like the new domains; it's just that they were grandfathered into the mainframe. The first thing that actually built confidence in open source was the new domains. From a reliability perspective, we had those running for almost three years even before we attempted this. And as I said, we attempted smaller-scale syncs for one or two events, and that was done with CDC. So there was some confidence that yes, we could do it, but at scale that failed, and we built this.

So MIPS is a driver; that's something we actually talk about. The reason it's not tangible on the bottom line is this: while we're doing this, we're getting new customers, and the new customers bring their writes. Then we acquire new banks, and of course we get the new writes from that bank's traffic. So you will not see a tangible baselining of the MIPS cost per se, because new customers are being added all the time. If you track it as MIPS cost per transaction per customer, yes, you will see a decline. It depends on how you do the math.

Yeah, that's a good question too. If you look at all four layers of the cake, the mainframe cost would have gone up with the new domains; the cost would have gone up, basically. So you may see a flat line, but you're actually declining, considering that all the things that would have gone onto the mainframe are not there. That's a bit of a hard sell for finance, but it's true: we prevented an unrealized cost. Yeah, that is true. We say that, but it's a little bit harder. But yeah, good.

Did you implement this in the cloud, a private cloud? So yeah, we have a version of a private cloud. All the Kubernetes pieces, like the Spring Boot here, I was trying to get to that: the Spring Boot ran on Kubernetes, the Kafka is on a VM, the Spark is on Kubernetes, the Cassandra is on VMs.
And the domain layer is again in Kubernetes, in the internal cloud. Yeah. I mean, we probably wouldn't use NoSQL for that; we would design something different, because the core migration is a totally different track with a totally different set of problems. Okay, I think it's time for the next talk.