Good afternoon. I'm Noah Mandelbaum. This is my colleague Steve Housack. We're here to talk about Capital One and a large contact center application that we built using open source technologies and inner source techniques. During this talk, we're going to cover how we went from a monolithic application to a micro-everything architecture. We'll also cover some implementation details, some of the choices we made, and the lessons we learned along the way. Like I said, my name is Noah. I joined Capital One in 2012. I'm big into architecture, technical teamwork, and Node.js. And I'm Steve Housack. I'm also a distinguished engineer at Capital One. I've been there for eight years now and work on greenfield architectures along with sound engineering practices.

So, a little disclaimer. We wanted to keep it light, so we have a lot of dogs in our presentation. We are inclusive of all pets; we've given this presentation before with cats. All the dogs are part of our Capital One family, and we have our own dogs in here. So if you get bored with the technology, there's always the dogs to look at.

So: a monolith emerged about 15 years ago. What we have is a large contact center. For those of you who are not familiar with contact centers: when a customer cannot solve a problem on the website or otherwise, they call in, and an agent might help them with payments, rewards, any sort of question they have. So the application has to represent the full line of business, in this case the card business. Like many projects that came before it, we had a contractual obligation that we needed to get out of, and so we built this monolith pretty quickly. It was ASP.NET, C#, Web Forms, and it was running on Windows Server. Of course, at the time, we were not in the cloud; everything was server-side. And we got this done very quickly, in about six to eight months. Our contact center associates were fine with it initially, and our customers were fine with it initially. But pretty soon afterwards, we began to run into some of the exhausting problems that come along with running a monolith in an on-premise configuration.

We had about 100 servers serving our agents as they helped our customers, and one of the challenges was configuration drift. Today Capital One is all in on AWS, but at this point we were not. So what would happen is, out of those 100 servers, server 98 somehow just wouldn't work the same as the rest, and we'd have to go ahead and tear down that server and rebuild it. That could take weeks. Our build-test cycle was very slow. It would take us about a day to build the system, even optimized, and several days to test it. And if something went wrong with that, you'd have to start the whole thing over again. This led to very large batch delivery: one to two releases a month, with hundreds of changes put in by hundreds of software engineers all contributing to that single monolithic code base. It made it very difficult to back out mistakes. Naturally, when you have this sort of application, the blast radius can be very large. And it was stateful as well. So what might happen is you'd encounter an error. An agent would be pinned to a server, or actually a couple hundred agents would be pinned to a server. Now they're talking to a customer; they have the phone data going back and forth and they're talking, but they can't see anything anymore. So they have to go back in, apologize, and log back into another server.
We had many direct connections to data. Even beyond this, from a Conway's law point of view, nobody was terribly happy with this setup. Capital One, like many of the large financial institutions that are here today, operates internally as a lot of small silos, right? With different imperatives, different lines of business. And each one of these lines of business might have their own release cadences, their own training cadences, their own freeze periods. So what would happen is we would have to have all of these lines of business negotiate around releases, and one or two releases a month could sometimes stretch into even longer periods. Even worse, we're in an industry that's highly regulated, and those regulations can change quickly. At the same time, our agents have problems they need to solve; they might raise bugs with us. So if you're releasing one or two times a month, you really can't address your agents' needs, and sometimes it's very difficult to deal with the quick regulatory requirements that are placed upon you. So none of us were very happy with this.

Now, talking about some of the ideas that began to float around about a decade and a half ago: we had some forward-thinking engineers. It's not just me and Steve up here; we had hundreds of people who were thinking about the same thing in our environment. The first thing that caught our eye was in 2010, when the Continuous Delivery book was published by Jez Humble and Dave Farley. We began to understand that, hey, releases did not have to be heroic. They could be something we could script and automate, and so could our build pipeline. We used to have a team of people who would just work on the build, and we realized we didn't have to do that anymore. Around 2014, Capital One began to embrace REST APIs. Previously we had a lot of direct connections, queuing mechanisms, and SOAP services, but we realized that a REST ecosystem would help us quite a bit. In 2014 and 2015, and this is where our story is maybe a little different from some other teams' at Capital One, we began to look really carefully at that monolith and said, hey, there's this new thing called single-page applications, right? Angular. That's pretty cool. And we really began to look at Node.js. And we thought to ourselves, well, if you're using Node.js, you can use JavaScript on the back end and JavaScript on the front end, and maybe you can avoid some of the complexity of context switching for your developers. You could use standard development practices, you could do isomorphic JavaScript, use it anywhere, really, that you want to do your work. And then in 2016, lo and behold, we moved beyond the idea of on-prem infrastructure and decided we were going to the cloud. But the thing that really impacted our way of thinking was that we noticed, in the ecosystem, this idea that you could begin to decompose front-end applications; you no longer had to ship the whole front end as one single-page application. Cam Jackson published an article on martinfowler.com about micro front-end architecture, and this really seemed to align with the experience we were having at Capital One. We realized that if we could achieve this, we could model based on our business domains and begin to hide implementation details.
One of the things about the monolith is that, whether we wanted it to or not, it had broken down those contract boundaries and interface boundaries, and code was slowly merging together. So we wanted to be loosely coupled, with contract-based communication. Isolate failure, right? We didn't like the idea of the whole thing going down when one piece of the application went down. Decentralize as much as possible, and release independently. Again, we have lots of teams that want to release all at once or on their own cadence, and we wanted them to have the freedom to do so. We basically wanted it all, like this little fellow here: limited blast radius, simpler, smaller code bases. The monolith I was talking about ran into multiple millions of lines of code, spread across a lot of different submodules within that ASP.NET application. We also wanted our engineers to have room to iterate and innovate, because especially if you move into this open source world, there are always new ideas emerging that can add incredible value to your company if you can just capitalize on them quickly enough, and we wanted to be able to capitalize on those things.

So this is all a great story at this point, but in 2017, when we were looking at this, it's safe to say we did not know exactly what we were doing up front. It wasn't like we decided to do it and it was immediately successful. We had to try this about five times, pivoting as models proved to be inefficient, and then fixing them. We had to think about regulations and governance, and partner with our risk partners, our auditors, and our architects. At the same time, we had to keep our large legacy contact center application running, and of course nothing comes for free in our environment, so our business partners, who were very supportive of us, also wanted business process innovations at the same time. So: innovating with new technology, innovating with business processes, and keeping the old system running.

We learned some lessons, and Steve will go into more of these as well, but what are some of the takeaways from our experience? Since we are a large UI application, we figured out that what you need is a visual design library, a unifying design system, to give everybody a consistent look and feel across the ecosystem. Even today, when we talk about micro front ends with people in our organization, they say, well, isn't it kind of chaos? Well, if you have the right governance, including a single unifying design system, you can get rid of that chaos. Foundational here as well is a CI/CD pipeline. If you don't have one of these, I highly recommend you get one. In our case, what we do with the micro front ends, as well as the back-end services, is we have a standard pipeline with linting and builds and all the security checks baked into it. Everybody deploys their own piece, but they use that pipeline, so we all have that standardization to help free up innovation. Governance is incredibly important here, just like with a microservices ecosystem. You need to make sure you have rules; for example, we have a process by which intent comes in, no matter who owns that intent, and by which that intent is groomed and placed into our information architecture.
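To make that standard pipeline idea a bit more concrete, here is a minimal sketch of the kind of shared verify step such a pipeline might run for every micro front end and Node service. The file name, the exact commands, and the ordering are illustrative assumptions, not Capital One's actual pipeline.

```js
// verify.js - a hypothetical shared "verify" step for a standardized pipeline:
// linting, tests, a dependency security audit, and a production build, all in
// one script every repo can run the same way.
const { execSync } = require('node:child_process');

const steps = [
  { name: 'lint',  cmd: 'npm run lint' },                   // shared linting rules
  { name: 'test',  cmd: 'npm test' },                       // unit tests (Jest, Mocha, ...)
  { name: 'audit', cmd: 'npm audit --audit-level=high' },   // dependency security check
  { name: 'build', cmd: 'npm run build' },                  // production build (webpack, etc.)
];

for (const step of steps) {
  console.log(`\n--- ${step.name} ---`);
  try {
    execSync(step.cmd, { stdio: 'inherit' });
  } catch {
    console.error(`Step "${step.name}" failed; stopping the pipeline.`);
    process.exit(1);
  }
}
console.log('\nAll checks passed.');
```

The value is less in the script itself than in every team running the same checks before they deploy their own piece.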
We share information among our different groups, so in addition to Slack channels we have what we call congresses, where people from across the federation get together every week and talk about things they think we can do to improve the federation. It's not just the core teams; it's that active contribution from teams that have a great new idea but don't sit on the core teams that really powers what we do. And then we constantly measure our developer experience and end-user experience. We survey our agents, our developers, and our product managers every quarter, and we take those numbers and ask, where are we falling down, and use them to improve ourselves, or, where do we do things well, and can we get better at it?

So where do we sit today? Dogs sit. Anyway, right now we have reduced time to market with no outages. Remember that we were talking about one or two releases a month. This micro front-end platform now has several dozen teams contributing at the same time. Hundreds of releases a month; we have a dozen-plus releases a day on this platform. In the past year we've done several thousand releases, and we've had no outages associated with any of those releases. Sometimes we have to roll back because the code is wrong, but in general we haven't brought the system down. Sometimes you have a little piece that doesn't work the way you'd expect, but again, we've been able to isolate that failure. It's highly decomposed: we have more than 50 micro front ends, with a similar number of Node services on the back end. We find that we can resolve bugs in hours without heroics. We're 100% cloud native, all AWS; Capital One is a big AWS shop. And our developer experience surveys have shown that our developers really like working in this ecosystem, and they also appreciate the community we've been able to build, compared to other systems in our organization. I will now turn it over to Steve to talk a little bit about the technical details.

Okay, so this is the part where you'll probably concentrate on the dogs more. In general we're getting deeper into what we built here. Basically it's an app shell with a special router that allows us to swap applications in and out of the DOM at any particular time as the user is navigating. You can see in the diagram on the right that it's basically a three-tier type of architecture. We have the browsers, the main front end that our agents use for the UI, and from there we go through a reverse proxy and hit our middleware layer. Those are Node.js services, and from there we still utilize the same REST endpoints, GraphQL servers, whatever we need to get at the enterprise data. We use a lot of the most popular open-source libraries out there today, including Fastify, Pino, NestJS, React, Restify, and Vue.js. Some of our development dependencies include Cypress and Jest for testing, plus some of the older testing frameworks such as Mocha, Sinon, and Chai. We are looking at Playwright coming up, so that we can start doing more multi-browser testing, but that's one of our things to improve upon in the future. The routing through our system is done by convention. This is one of the only rules we basically have, and it allows us to be pretty deterministic with that router.
So you can see we have a segmented URL here, with the first segment being the tenant, which just tells our system what configuration to use. From there we go to a domain; a domain is an opaque organizational unit. Then we have what we call a container, which is actually our micro front end, and with micro front ends, as you would with microservices, you size them appropriately, so a container may contain a couple of apps or it may contain just one. Then we have the app, and the resources behind the app. That configuration, keyed by tenant, drives the shell, and we just use simple HTML5 constructs to swap the divs as needed. Here you can see we have some configuration on the left, and it's pointing to the HTML on the right. From that you can see that the servicing mode, one tenant, is utilizing what looks like all four components of the HTML, whereas the bottom one, our quality mode, is only using the header and the main outlet; the main outlet is where your application actually resides.

And this comes together as one cohesive application in the browser. What this is showing here is essentially five different applications on the screen at the same time, all running independently. You have the header application, a sidebar application, the application page in the middle, a footer application, and then the shell, which brings it all together; that gives you the five. And when you're navigating within each component, so if I'm inside the application page, I'm actually navigating with that framework's native router. If it's a React application I'll be using React Router; if it's Vue, I'd be using the Vue router. And likewise, the sidebar could be Vue and the application could be React; we're agnostic on that.

Then we get to how we make this a little more federated. That reverse proxy I mentioned brings it all together under one domain and allows for very flexible hosting solutions. Capital One's all in on the cloud, so I'm going to focus on AWS here, but you're not limited to AWS, because you could point to a whole different cloud provider if you wanted to. In general, this is showing that I can host the shell infrastructure on ECS, the header and footer on EKS, an application could be a Lambda, or the sidebar could be hosted on Fargate. In practice, teams will usually pick one, but we do have the capability to host however it makes sense for our federated teams.
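Here is a minimal sketch of that convention-based routing and tenant-driven outlet swapping. It is illustrative only: the outlet ids, the tenantConfig shape, and loadMicroFrontend are hypothetical names, not the actual shell code.

```js
// Convention assumed here: /:tenant/:domain/:container/:app/*rest
function parseRoute(pathname) {
  const [tenant, domain, container, app, ...rest] = pathname
    .split('/')
    .filter(Boolean);
  return { tenant, domain, container, app, rest };
}

// Per-tenant configuration decides which outlets (divs) the shell shows.
const tenantConfig = {
  servicing: { outlets: ['header', 'sidebar', 'main', 'footer'] },
  quality:   { outlets: ['header', 'main'] },
};

// Hypothetical loader: in a real shell this would fetch and mount the
// container's bundle (React, Vue, ...) into the outlet it owns.
async function loadMicroFrontend(container, app, mountEl) {
  mountEl.innerHTML = `<p>${container}/${app} would mount here</p>`;
}

function render(pathname) {
  const route = parseRoute(pathname);
  const config = tenantConfig[route.tenant];

  // Simple HTML5 constructs: show or hide the outlet divs for this tenant.
  for (const outlet of ['header', 'sidebar', 'main', 'footer']) {
    const el = document.getElementById(`${outlet}-outlet`);
    if (el) el.hidden = !config.outlets.includes(outlet);
  }

  // Hand the main outlet to the micro front end that owns this route;
  // navigation inside it uses that framework's own router.
  loadMicroFrontend(route.container, route.app, document.getElementById('main-outlet'));
}

// Example: render('/servicing/card/payments/make-payment');
```

The point of the convention is that the shell never needs to know what each container does; the URL alone is enough to decide which tenant configuration to load and which micro front end to mount.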
So that was as technical as we're going to get. Now we'll pivot and talk about some of the lessons we've learned along the way in building this, because, like Noah said, we pivoted. How many times did you say? Five? Five times. I lost count after three. So we've learned a lot, and I'm going to go into some of what we've learned. A good developer experience, and I would cross out "good" now and say a great developer experience, is what you actually need to be successful here. You need to give your developers the flexibility to run things locally, run only what they want, and run natively. The monolith required our developers, since we were ASP.NET, to have IIS on their machines, and they were constantly doing IIS resets to get their changes up. With Node, we use webpack dev servers, node processes, and watchers to automatically reload on change, which lets us immediately see changes with hot reload and those kinds of developer experience functions. We have a developer proxy, which is similar to the reverse proxy in that it brings everything together, but it also allows us to host parts of the application off the local machine. So if I'm working on my payments functionality, but I want to see it alongside the main application and, say, rewards, I can point rewards to the cloud-hosted one and payments to my local dev one and work that way (there's a rough sketch of this idea below). And from the beginning we were very adamant about having a full set of maintained documentation, tutorials, how-tos, and reference pages. Along with that, the congress that Noah mentioned is another place where a lot of the knowledge transfer happens in a system like this.

One of the lessons learned, as Noah mentioned, was having basically one language through our stack. Being full-stack JavaScript simplifies our developer experience greatly and avoids the context switching that decreases developer productivity. Other applications in our ecosystem will have an Angular front end and then a Java back end or Java middleware. We said we're just going to be Node all the way through, and that's really helped our developers deliver at a much faster cadence than some of our partner teams developing different applications. Our code tools and testing patterns are shared between the front end and the back end as much as possible, so linting rules, while a little different, can be shared, and the processing and the builds and everything like that are shared, because it's essentially the same: you're using a package.json and you can just say what build does. What's interesting is that while we do support people using Java or Go as the middleware, no team has opted for that yet. I think Noah did once as an experiment, but everybody uses JavaScript and Node.

Probably the most controversial thing, and we get these questions all the time, was deciding on a mono repo versus a poly repo structure. I think we had 60,000 meetings on why we would choose one over the other, knock-down drag-out meetings, and we finally settled on poly repo. It comes down partly to Conway's law, in that people wanted to retain ownership of the code their teams produce, and it actually fit better into the structure of the application in general, in that it allowed us to have that independent deployability: you knew what you were deploying, and you weren't worried about picking something out of a big mono repo, or having to write tooling to help you pick out the right part of the mono repo. And this was before the tooling was quite there; I think we were looking at Rush at the time, and Lerna was just starting. Today, with Nx and things like that and Lerna's resurgence, I think you have a better choice, but we liked the modularity, the encapsulation, and the clear ownership, so it's just one of those things to consider.
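As a rough sketch of that developer proxy idea: route the container you're working on to your local webpack dev server and everything else to a shared environment. The http-proxy package, the port numbers, and the URLs below are assumptions for illustration, not our actual tooling.

```js
// dev-proxy.js - local development proxy: "payments" comes from your machine,
// everything else comes from a cloud-hosted environment.
const http = require('node:http');
const httpProxy = require('http-proxy');

const proxy = httpProxy.createProxyServer({ changeOrigin: true });

// Keyed by the "container" segment of the convention-based URL.
const LOCAL_TARGETS = {
  payments: 'http://localhost:3001', // local webpack dev server with hot reload
};
const CLOUD_TARGET = 'https://dev.example.internal'; // hypothetical shared environment

http.createServer((req, res) => {
  const [, , , container] = req.url.split('/'); // /:tenant/:domain/:container/...
  const target = LOCAL_TARGETS[container] || CLOUD_TARGET;
  proxy.web(req, res, { target }, (err) => {
    res.writeHead(502);
    res.end(`Upstream error: ${err.message}`);
  });
}).listen(8080, () => console.log('Dev proxy listening on http://localhost:8080'));
```

With something like this running, the browser talks to a single local origin while only the piece under active development is actually built on the developer's machine.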
I'm no longer in the group that produced this, and my new group does use a mono repo, and they're successful with it.

We also learned that we had to define a clear support model. Internal customers wanted to know that their core components would remain secure and bug-free so they could focus on their business intent, so we created a support model in which trusted contributors dedicate time to new library features, with comprehensive documentation (which is required) and security patches. We also have an n-1 versioning strategy, where consuming teams are asked to stay current with their dependencies, but I would say in this fifth iteration of the system we are completely backwards compatible all the time. We learned that early, when we had people doing mass dependency updates. We use a lot of automated tooling to handle our dependencies now, and that's reduced a lot of our developer toil.

The next thing is that we're never done. We started on Restify as the middleware for our Node.js middle tier. Fastify, I forget when it actually got started, but we started going, wow, Matteo has something going here, we should keep our eye on it, and eventually we've moved over to Fastify everywhere, and it's given us automatic performance improvements without us having to change much code at all. So I think the important thing to take from this is that you've always got to keep looking forward; don't settle on what was yesterday's hot technology. Obviously, as a financial institution, we have to have a certain amount of restraint, so we do have some rules on how we bring that stuff in and how we deal with it in general.

So this is my dog. Our journey is not unique, but the path that we took is. We've achieved what we needed for our platform currently, but there's always more work to be done.
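To give a feel for the Fastify side, here is a minimal sketch of a middle-tier endpoint of the kind described above. It is illustrative only: the route, the payments domain, and the enterprise URL are hypothetical, the built-in Pino logging comes from logger: true, and global fetch assumes Node 18 or later.

```js
// Minimal Fastify middle-tier endpoint fronting an enterprise REST API.
const fastify = require('fastify')({ logger: true }); // Pino logging built in

fastify.get('/card/payments/:accountId', async (request, reply) => {
  const { accountId } = request.params;

  // Call the enterprise REST endpoint behind the proxy (URL is illustrative).
  const response = await fetch(`https://api.example.internal/payments/${accountId}`);
  if (!response.ok) {
    reply.code(response.status);
    return { error: 'Upstream payment service failed' };
  }
  return response.json(); // Fastify serializes the returned object as JSON
});

fastify.listen({ port: 3000 }, (err) => {
  if (err) {
    fastify.log.error(err);
    process.exit(1);
  }
});
```

The Restify handlers looked broadly similar, which is part of why the migration cost little code change while picking up Fastify's performance.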
A couple of things that I wanted to add to the slides but didn't get a chance to: we do have pretty strong governance here, or I wouldn't even call it governance, but we are up to date on Node.js as much as possible, all the time. So when Bethany releases a new release, we are on it within, depending on the type of release, days for a security release, and within two weeks now for a minor release. LTS is a whole different thing; we deal with those when they come out, but we are moving to 18 as I speak, and 18 went LTS a couple of weeks ago. To prod? That can depend, obviously. So just to repeat, the question was how long it takes to get these Node updates into our ecosystem, and again, the answer is that we get them into prod within days.

And to tie in more with the open source theme of the conference: we were inner source from the beginning for this platform, so we don't have restrictions on who can contribute where, and a lot of our interdependencies work that way. If I'm working on payments, I can expect my partners who work in our collections area to contribute to my payments infrastructure for different types of payments. That's one example of how we inner source. We've learned lessons along that way too, but that would require a whole different talk; in general it's been very successful for us on this platform. And I did want to add one thing: some of our inner sourcing models and practices are based on what the Node.js community does, so if you want to see a pretty good model in terms of code of conduct, how you handle PRs, and how you handle contributions, I would definitely refer you over to the Node.js pages.

Yes, there's a question. So the question is, how do we enforce the n-1 policy? There are a couple of different ways we do it. One, of course, is to shame, but that usually doesn't work as well as business intent. The first is we make it easy for people to discover whether their dependencies are out of date: we built a special console that checks what's available in terms of the packages and what is in each one of the projects, and then we have a dashboard that shows people where they're out of date. The second thing is we're proactive about publishing guidance on how they might manage this. The third thing, which is kind of interesting, and again this is stolen pretty much from Node.js, is we post a release schedule. We say that every quarter we're going to have a major release, and before that major release occurs we're going to have a beta. If it's a minor release and backwards compatible, we'll go ahead and put it out, and the same for a security release, and we say there's a current release and an n-1, and for everybody else you're on your own. Now, there was a discussion earlier about forking; people could of course fork, or not pay attention to any of this, but what they get the benefit of, and this is key in these sorts of communities, is that we have a core team of individuals who keep our base libraries up to date, including those security patches. So if they choose not to consume those patches, they're going to end up on the security naughty list, or they're going to have problems. So there's both shaming and providing information, but then there's also a natural market force
that's in play. And there's also automation: we utilize a lot of industry-standard automation systems that automatically give us a pull request when a dependency changes, whether that dependency is internal or external.

What sort of for-fee packages did we use? Did we use any for-fee packages? No. I would say in my new space we have one for-fee package we have to pay for; it had major security vulnerabilities that they weren't going to fix, so we weren't going to keep paying, and we refactored it out. Unfortunately, we had no choice.

Of course, if we have time, we'll take questions afterwards, but before we answer this next question: one of the things is that we try to pay it forward by giving back to some of these community groups and open source. We recently joined the OpenJS Foundation, and we're a member of all these other ones too.

You had a question? So the question was, we talked about a singular design system, but then we talked about two different application frameworks. Steve, do you want to take that one? Our component library uses web components, and they're just standard; both React and Vue can deal with those. You'll notice we haven't been saying the A word much, and the A word is Angular. Okay, good, because I don't want to get into that; happy to get into it in the hallway.

I saw another question over here. Yes, the question is whether we see any security issues related to webpack. I don't think there's anything specific we have seen related to webpack. The main thing you have to worry about with this sort of application is what you're actually placing in the browser; it's a pretty insecure environment. What we do is we have local data stores, but by policy and by code enforcement you're not allowed to put anything into these global data stores that would be considered PCI or sensitive, so strictly identifiers.

Another question: what part of this is open source? This particular piece is an inner source Capital One internal project, but in terms of the technologies we used and the ideas, they're all drawn from the open source community, and they're all open source technologies. We have an OSPO, an open source program office, that helps us with all that, but these are just common technologies that your average developer doesn't need Capital One money to get at.

The question is, how do we measure the developer experience? Do you want to take that, or shall I? So we do it a couple of different ways, but one thing we do is survey our developers every quarter, and it's an anonymous survey that tells us the good, the bad, and the ugly, and they're usually pretty honest about it. Running a platform of this size, we actually have a program manager who helps us with that, so it's a formal survey, and we track those results over time. We have a standard set of questions, and there's also a free-form section. We then collect all that information and publicly share our findings, including our remediation plans. Sometimes people are unhappy with the way, for example, we're communicating about things, or they're unhappy about the tools. Not only do we collect that information and respond to it, we also invite committees through our congress setup. So what we'll do is we'll go to
congress and say, it seems like developers have this sort of concern; who among us would like to form a committee and contribute back to making things work better? And this has allowed us to see improvements in our contribution model, in our functional testing, and in our development practices. We mentioned NestJS previously; that was something brought in by a federated team that's not part of the core. So as much as anything, it's not about technology, it's about the culture that you build. And to follow up on that, we do get an NPS score, so it's the same scoring we would use for customers even outside of Capital One.

Okay, so the question is how we handle releases with our product owners and stakeholders. One of the things we do is, when we go ahead and do a deployment, we're heavy users of feature flags, so we tend to put the code out in dark mode and then let the product owners make the choice, based on the training the agents have received, of when to toggle it on. So they're able to go from completely dark, to a very small group, usually just the product managers, to a group of contact center agents, to the whole team. And our CI/CD pipeline is gated at one point where they have to say yes, take this forward, and they add a manifest that lists the actual changes that are going to go; we have a stop sign. We'll take more questions, but thank you very much. Yeah, thank you.