I'm a DevOps engineer at Lending Club, and I'm here to talk to you today about how we use a graph model to manage our infrastructure. Lending Club is America's largest online credit marketplace. We offer personal loans, small business loans, patient financing, and as of earlier this year, auto refinancing. I'm sure you guys are all more interested in DevOps at Lending Club, though. I'm on the infrastructure and tools team, formerly known as, and oftentimes still referred to as, the DevOps team. We build software and infrastructure automation to enable Lending Club to efficiently and seamlessly deliver our apps into production while ensuring stability, reliability, scalability, all the -ilities. More specifically, we write software that handles our infrastructure monitoring, alerting, deployment automation, and cloud orchestration, all the way up to the common app frameworks that all of our platform apps use. And then a little bit about our architecture: up until recently, Lending Club was data center only. We have a primary and a secondary data center. A couple of years back, we started migrating our services into the cloud, into AWS. Before I start talking about our graph model, I'd like to frame my talk with a saying that I think embodies everything my team does at Lending Club: be pragmatic, not dogmatic. Over the years, we've worked toward a consistent, standard, unified build, packaging, and deployment model. Whether an app is written in Java or Node or Go, whether we're deploying that app into our production AWS environment or our non-prod data center environment, we want everything to look, feel, deploy, and run exactly the same. We've also tried to avoid what we call tool trends. Today, the big thing is Docker and microservices; tomorrow, it could be something else, and we want our infrastructure and tooling to be able to handle that.
So oftentimes, we'll wrap third-party tools within our own internal interfaces and automation tools. And over the years, as we've grown from five microservices to over 400 and moved from the data center to the cloud, one of our goals has been to automate all the things. Well, we soon learned that we have a lot of things, and it's really hard to automate them all when you don't know what those things are. Looking at this slide, you probably recognize most, if not all, of these technologies; you probably use a lot of them at your companies too. And so we all face this problem of figuring out the relationships and integrations between and among these tools. Oftentimes, when we're onboarding new tools or considering multiple options for a solution, we look at the built-in third-party integrations. So what if New Relic doesn't play nice with GitHub? Does that mean we can't use one of those tools? How do we manage those integrations? Are we making dozens or hundreds of REST calls to dozens of different endpoints? Enter Mercator. At Lending Club, we've written an internal Java application called Mercator, and its job is to communicate with all of our infrastructure and build a graph model of it. I also wanna mention a talk yesterday on analyzing system failures. The speaker mentioned that we need models to help us visualize our systems, and by extension to help us diagnose and track down problems when they arise within those systems. And this is what Mercator does for us. Mercator periodically scans all of our infrastructure components and third-party tools, makes sense of the responses, and builds a graph model of all of those interconnected infrastructure components. This then provides us with metadata around which we build monitoring, alerting, and automation. As I mentioned, three years ago we were struggling with manual deployments. We were keeping track of our services via Excel spreadsheet.
We needed greater visibility into our infrastructure, and we really wanted a way to get real-time, or near-real-time, state of our infrastructure. So, as I mentioned, we created Mercator. Okay, so this right here is a visualization of our graph database. As I mentioned, Mercator is a Java application, and it stores information in a graph database; we use Neo4j. Just to go over what you're looking at: each of these circles is what we call a node, and they're different colors to represent different node types. Every node has a label; you can think of each individual circle as an instance of a certain object type or label. So if you look at the yellow circle called LCUI, you'll see up at the top left that the label of that node is virtual service. This is our concept of an app. The lines between the nodes are relationships, and both nodes and relationships can store properties. So for example, our LCUI virtual service node contains two pools, pool A and pool B, and then each pool contains a number of virtual servers. What we're actually looking at is our blue-green model, which we use in our data center; many of you probably also use blue-green. The question now is: how did we get to this visualization? This is Mercator feeding into Neo4j, our graph database. What we did in this data center example is we had all of our app instances phone home to Mercator every minute with information on the apps that were running on them: things like the app ID (in this example, LCUI), what environment it's running in, the revision and version that it's running, the IP, and the hostname. And just by adding this, we actually gained service discovery, which we didn't have before. Similarly, we started scanning our load balancer with Mercator, and we got back information on our load balancer servers. Not so much app info, but information like the load balancer server's state.
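The per-minute app-instance phone-home just described can be pictured as an upsert into the graph. This is only a sketch: the `AppInstance` label and all of the property names here are my own stand-ins, not necessarily Mercator's actual schema.

```cypher
// Hypothetical upsert for one phone-home report (illustrative names).
// MERGE finds the node by hostname or creates it; SET refreshes its state.
MERGE (a:AppInstance {hostname: 'lcui-prod-101'})
SET a.appId    = 'LCUI',
    a.env      = 'prod',
    a.version  = '4.2.0',
    a.revision = '1234',
    a.ip       = '10.1.2.3',
    a.updateTs = timestamp()
```

Because every instance reports this way every minute, simply matching on `:AppInstance` nodes gives you service discovery for free.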
So the state: whether it's active or inactive, and how much traffic it's taking. Now we have these two objects: we have app info, and we have load balancer server info. And if you map those two things together by their hostname, you can combine them, and this is what gives us the virtual server, which you saw a few slides back. The virtual server has all these properties. In this example, it's LCUI; we'll see that it's in production, the version and revision it's running, and that it's active, which means it's in the live pool. So then if we go back to this slide, hopefully it makes a little more sense now: the virtual server we just saw is the pink nodes. One of those pink nodes is represented by the virtual server you just saw, and they all have these properties like IP, hostname, live or dark. And you can start to see that if we group these servers together by app ID, state, and also some hostname naming conventions that we have, we can split them into these two pools. We can group them into pools, and then if we group the pools by app ID and environment, we get this concept of a virtual service. This then allowed us to automate our data center deployments, and we were able to set up a lot of monitoring and alerting just off of this. For example, we never want to have multiple revisions of an app within the same pool, especially if that pool is live. This model allowed us to visualize that, and now we have alerts that go off. Similarly, if one of our servers is down, we'll get an alert on whether the pool is degraded or fully down. Also, once we hooked vCenter into the scanning, we were able to map our app instances and virtual servers to the vCenter instances and vCenter arrays, which gave us visibility into single points of failure. So if all of a pool's instances are on one array, that's a single point of failure; we'll get an alert on that, and we'll move or redistribute them.
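Both of the alerts just mentioned fall out of simple queries against this model. These are sketches only: the labels, relationship types, and property names are assumptions standing in for the real schema.

```cypher
// Alert 1: a pool running more than one revision of an app.
MATCH (p:Pool)-[:CONTAINS]->(v:VirtualServer)
WITH p, collect(DISTINCT v.revision) AS revisions
WHERE size(revisions) > 1
RETURN p.appId, p.name, revisions;

// Alert 2: a single point of failure, i.e. a pool whose virtual
// servers all live on one vCenter array.
MATCH (p:Pool)-[:CONTAINS]->(v:VirtualServer)-[:RUNS_ON]->(arr:VcenterArray)
WITH p, collect(DISTINCT arr.name) AS arrays, count(v) AS servers
WHERE size(arrays) = 1 AND servers > 1
RETURN p.appId, p.name, arrays[0] AS array, servers;
```

Either query returning rows is something to page on.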
So that was a data center example, but it works exactly the same way in the cloud. We periodically scan a bunch of AWS services; you'll see there's EC2, RDS, SNS, SQS, a whole bunch of stuff. And again, this is a visualization of some of our AWS components, and there's no need to understand exactly everything that's going on in that picture, but I think the key is just to show how quickly all these components get pretty complicated. And this is not even everything that's in AWS, and there's a lot more that we have in our graph database besides AWS. There's no way that our minds would ever be able to conceptualize or visualize this on their own, but the graph model allows us to track our infrastructure, not just in the cloud or the data center but even beyond that, and keep track of all our interdependent infrastructure. This particular cloud model actually allowed developers to start spinning up their own instances in AWS. They were able to self-service more, instead of having to create tickets for us or wait for sysops to spin up servers for them. In addition, this is what we built our cloud orchestration on, and we mirrored our blue-green deployment model from the data center into AWS. I know that AWS actually recently released a blue-green CodeDeploy feature, but I just wanna say that we made ours first. Okay, so we've seen a data center example and a cloud example, and I think we have a basic understanding of how the graph model works, but there's a lot more to it. If we just walk through the app lifecycle: an app will start out with documentation in Confluence. PMs and engineers work together to write stories and tickets in JIRA. Engineers then commit to Git. They build with Jenkins. Artifacts get stored in Artifactory or S3. We deploy to AWS. It goes into our load balancer. I mentioned VMware. And then our monitoring tools: Splunk, Wavefront, and New Relic. We diagnose and talk about these apps.
If there are any problems, that happens on HipChat, and we get paged; we also use OpsGenie, and I know they're a sponsor too. Storage: Cisco UCS. So all of these things throughout the app lifecycle are getting fed into Mercator. And as I mentioned, we have maybe over 400 microservices now, and they've all followed this lifecycle and been managed, from conception to deployment to monitoring and beyond, all through Mercator. And so we went from having low service visibility to having a unified graph model that we can query at any time to get real-time information on the state of our infrastructure. And we can answer questions now, like I mentioned: Do we have a single point of failure? Are revisions synced across environments? How much is our AWS infrastructure costing us? Another point that I wanna make is that the growth of our graph model happened very naturally and organically. When we first built our graph model, we did not set out to include all these things. We really just wanted to answer the question of what we have deployed out there, right? The service discovery bit. That's what we started out with, and it's evolved over the past three years, and we've ended up with this. I think it does highlight the fact that the graph model is very flexible, and it grows and evolves with your infrastructure. Now, I'm not saying that everyone should go and use a graph database to map out their infrastructure, but for this particular use case, it's worked really well for us, and that's why we've continued to use it. So I'm gonna do a demo. Last time I presented, when I did a demo, it didn't work. But I don't care, I'm gonna do it again, and hopefully I'll have better luck this time. This is the Neo4j console; it comes with your Neo4j download. And this is my local database, but it's actually a copy of our production database, and I scrubbed some stuff. Actually, let me enlarge this. Oops.
So first, I just wanna show you some of the stuff we have in our graph database. This is a really simple query just to get us warmed up. This is Cypher, which is kinda like SQL, but for graph databases. It's a very visual query language: anything in parentheses is a node. So you'll see there are three AWS account nodes, and if I click into one, you'll see the properties within the node. This one's kind of boring; the only two properties are the AWS account and the update timestamp. We have non-prod, prod, and infrastructure. Again, kind of boring. And I can filter according to certain properties. So if I do this, it's gonna filter to the nodes that have AWS account equals prod, and it's just gonna return this one node. That's not super interesting, but then we start to dive a little deeper. With this query, I'm asking for all nodes that have a relationship to the prod AWS account node, and because I did not specify a label here, it's gonna return all label types that have a relationship. So we'll see now that prod owns some AWS S3 buckets, some SNS topics, and four VPCs. And again, we're gonna dive a little deeper. Now I'm asking it to return everything that has a relationship to the four VPCs contained in the prod account. And I'm gonna enlarge this. One thing to note: as your queries get a little more complicated, it does take a little time to render here live; when you hit the back end directly, it's much faster. You can see now, in purple, we have the four VPCs we saw earlier, and within those VPCs we now have subnets, VPC endpoints, security groups, and regions. And then, not to belabor the point, I'm gonna do one more: we're gonna get into the AWS subnets. Again, the query language itself is pretty visual: the parentheses mirror the nodes, and the two dashes mean a relationship.
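For reference, the sequence of warm-up queries just described looks roughly like this in Cypher. The exact labels and property names are my reconstruction from the slides, not necessarily what's in the real database:

```cypher
// All AWS account nodes.
MATCH (a:AwsAccount) RETURN a;

// Filter to the node whose account property is 'prod'.
MATCH (a:AwsAccount {account: 'prod'}) RETURN a;

// Everything with a relationship to the prod account, any label.
MATCH (a:AwsAccount {account: 'prod'})--(n) RETURN a, n;

// One level deeper: everything related to prod's VPCs.
MATCH (a:AwsAccount {account: 'prod'})--(v:AwsVpc)--(n) RETURN v, n;
```

The visual idea holds throughout: parentheses are nodes, and the dashes between them are relationships.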
I'm gonna limit this so it doesn't take forever to render. Okay. So now you see we're starting to pull in our EC2 instances, our auto-scaling groups, our elastic load balancers. And again, this isn't even everything; there are AMIs, CodeDeploy deployments, stuff like that. Actually, if you think back to the slide where I showed you that kind of mess of nodes, we're looking at either the same thing or something very similar. And we could dive deeper and deeper, and start to see how the EC2 instances are related to CodeDeploy deployments, or how the EC2 instances are parts of ELBs and auto-scaling groups, and the auto-scaling groups and ELBs are also attached to each other. It's a very complicated network of stuff. By ourselves, our brains could never handle that type of load, but our graph model allows us to visualize and make sense of it. All right. So that stuff is kind of interesting, I guess, and it's what we use to build all our cloud orchestration, but on its own maybe not that useful. This next one is the type of thing that my boss, or my boss's boss, might ask. So I have a query here. What it's doing: those EC2 instance nodes we just saw in the previous query, we're matching those with another label called AWS EC2 instance type. The properties within the type are just the model, for example c4.large or t2.medium, and the hourly cost of that model. So we're mapping those two together. And the two dashes in between carry the specific label for that relationship; the EC2 instance relationship label here is "has type". We're mapping those, and then we're grouping the instance types by account, region, and instance type. And you see from this query that we get, in descending order, the monthly cost of our instances. It's kind of funny: I was actually in a meeting with some of our billing folks and my boss and my boss's boss.
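A sketch of that cost query: the `HAS_TYPE` relationship matches the name given in the talk, while the other labels, properties, and the 730-hour month are my own assumptions.

```cypher
// Join each EC2 instance to its type's hourly cost, then roll up an
// estimated monthly cost per account / region / instance type.
MATCH (i:AwsEc2Instance)-[:HAS_TYPE]->(t:AwsEc2InstanceType)
RETURN i.account AS account, i.region AS region, t.model AS instanceType,
       count(i) AS instances,
       round(sum(t.hourlyCost) * 730) AS monthlyCost
ORDER BY monthlyCost DESC
```

The non-aggregated columns become the grouping keys, so the same pattern extends to grouping by app once instances are linked to a service catalog.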
And we were trying to get better visibility into our billing. Because this information was in Neo4j, it was a matter of minutes: I spun up this Cypher query, and we actually have reports generated off it now. Every week we send it to Wavefront; it's a thing now. And then, just to show you another way to look at this information: this is by instance type, and we could group it differently and just show the total cost, or the cost by account, or by region, or just by instance type. We could also, as in this example, show the cost by app ID. Up here is the app definition; this is another label that we have. We have a service catalog where we keep track of all of our services, and by mapping the EC2 instance tags to the app definition, we can get a picture of how much each app costs us per month: a description of the app, what type it is, how many instances are running, and again the monthly cost. So this one is kind of expensive; this one is expensive, but we have a lot of instances. It's interesting, slash useful, stuff to know. And finally: what is this thing? This is another question that my manager, or my manager's manager, might come running down with, or someone from InfoSec, right? You have an IP, you don't know what it is, they wanna know what's on it, where it's running, what it's doing. And again, Neo4j can help. Those app instances we saw towards the beginning of the presentation, one of the properties they store is the IP. So we match according to the IP, and then we match the virtual server and the pool and the virtual service and the app definition that it's all connected to. And then we go from not knowing what this IP is to knowing exactly what it is: we have the hostname, we know what app is deployed on it, as well as the revision and the version. We know it's running in prod, so okay, maybe this is a real problem. But then, according to naming convention, the 200s are pool B.
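That lookup traverses from the instance up through the blue-green hierarchy. Again, a sketch only: the relationship names and the example IP here are invented for illustration.

```cypher
// "What is this IP?" Walk instance -> virtual server -> pool ->
// virtual service -> app definition and return everything we know.
MATCH (a:AppInstance {ip: '10.20.30.200'})-[:MAPS_TO]->(v:VirtualServer)
      <-[:CONTAINS]-(p:Pool)<-[:CONTAINS]-(vs:VirtualService)
      -[:DEFINED_BY]->(d:AppDefinition)
RETURN a.hostname, d.appId, a.version, a.revision, vs.env,
       p.name AS pool, p.state AS poolState
```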
So then we see, okay, the dark pool is pool B, so it's not actually taking traffic. Maybe it's not as big of a problem. If it were live, though, and we needed to restart it, we would see, oh no, there are this many connections: maybe we should wait, maybe we should drain it first. Then we can take action. And it's easy to see how, if you wanted to go even deeper, you could map the app definition to a Git repo or a Jenkins job, and then you can really start to dig down and figure out all the different components behind this IP. So that is the story of how we've used a graph model at Lending Club to manage our infrastructure and run the company. I'd like to recap a few of the things our graph model has allowed us to do. I mentioned automating all the things earlier on: you can't automate them all if you don't know what they are, and with our graph model we have a pretty clear visualization of our infrastructure. I mentioned some of the monitoring and alerting capabilities, and some of the deployment automation and cloud orchestration that we've been able to build using our graph model. In addition, some other examples I can think of are automated patching: we keep track of the image attached to each of our EC2 instances, and if that image is over, say, 30 days old, we have software that will automatically roll those instances forward to use the most recent image. Or, for example, we have automated nightly replication from our primary data center to our secondary data center, and from our primary region in AWS to our secondary region. It's very hands-off. We get paged less in the middle of the night. And because we spend less time patching or replicating from our primary to secondary site, we have more time to build cool tools. The graph model has also allowed us to push for DevOps as a culture, not as our team name or a title. So, I mentioned we have more free time to build cool tools.
And some of those tools are built around Mercator: by exposing the information in Mercator not just to our team but also to release engineering, QA, and even the risk teams, they can use this data and leverage it to build their own automation. They can self-service more, instead of having to ask us for things, or assign us a ticket and then wait on us. We're no longer the blocker, and it's allowed engineering to take more ownership. And finally, circling back to my opening statement, our graph model has allowed us to be pragmatic, not dogmatic. With this model, we treat almost all of our third-party tools exactly the same. We have a standard, unified pipeline, and we're not locked into any particular technology or tool. If tomorrow our CTO said we're leaving AWS and going to Microsoft Azure or Google Cloud or Oracle Cloud, for whatever reason, then in terms of this model, very little would change. And this gives us a lot of flexibility when it comes to infrastructure and when it comes to the next big thing at Lending Club. So that's it. This is my Twitter handle. If you have any questions, or if you're interested in maybe trying out Mercator, we have an open source version there. If it doesn't work, then you also have my handle, so you can yell at me on Twitter. And that's it. Thank you, guys. Hi, so I guess, well, you kind of answered it at the end. Clearly you've open-sourced some version of this. My question was: when you first determined you had a need for this type of tool, how much exploration did you do into what was available in the open source world, and how did you decide that none of it was going to cut it, or that it was just easier or simpler to build your own? What was that process like? Right, so I actually joined right after we chose to use Neo4j, but from what I've heard from my manager, he did try out some other graph databases.
I think TinkerPop was one, and there was one more, and then he settled on Neo4j. It did kind of start as an experiment, like, oh, this is cool. Then once I joined, I started adding to it, and once we saw how useful it was and how easy it was to visualize our infrastructure, we just never looked back. In terms of open source, there are some things; I think there is a tool, I forget what it's called, that also maps out your AWS infrastructure in a graph-like model, but you do have to pay for that. And also, it was simpler and easier for us to build it ourselves. This way we have this rhythm now: when we have a new tool or some third-party technology, we'll just build a new integration with Mercator, and it gets loaded into the graph model, which is huge now. I have a question. Have you looked at GraphQL? It's a thing that Facebook does. Yeah, yeah, and I've been looking at options as well. So you guys just have a single REST endpoint? You said you built tools off of this tool, so how do you handle those requests, is it just querying that? It's actually querying our graph database directly. And is that through a web API, or direct? Yeah, through an API. We've also written a wrapper around some of the Neo4j stuff, and we use that wrapper as well; we like to wrap our third-party tools within our own internal stuff. I think there are more questions, but for the sake of moving on and getting to lunch at some point, you can find Ashley in the hallway and ask her all the questions, or start an open space on the subject. Thank you so much, Ashley. That was amazing. Thank you. Thank you.