There we go. Thanks for coming out. I think this is one of the last presentations of the night, so thanks for joining us. My name is Chris, and I'm here with Nick and Ashwin to talk about our hyper-converged OpenStack architecture and its operations. We're with FICO, on the cloud engineering team. We'll start with some basic background and get into details later in the presentation.

Who is FICO? Most people I talk to hear that I'm with FICO and say, "Oh, the FICO score." That's usually the piece they know best: the score that comes from financial analysis of personal credit. But FICO also does a lot of business-to-business products in financial analytics, such as debt management, decision management, and fraud analytics. 98% of credit-related decisions are made using FICO products, and 2.5 billion credit cards are protected by FICO fraud systems. We've been doing data and analytics for more than 50 years.

So why the move to the cloud? FICO is traditionally an on-premises product company, and there was a need to expand from its top-tier focus into smaller and mid-market segments, moving away from traditional on-premises delivery toward a software-as-a-service model. We've been using open-source cloud management software to let us participate in driving change, increase productivity, shorten time to market, and grow our global presence. The lower cost enables entry into markets and regions we couldn't have penetrated with our traditional on-premises model. One of the first products we put into the cloud is the FICO Analytic Cloud, our platform-as-a-service product. It lets customers bring their data to our analytics tools and build their own applications and services, and it's one of the applications running on our OpenStack cloud.

So why did we go with OpenStack? A few reasons: predictable costs, freedom of choice, no vendor lock-in, and simplified deployment. We wanted to take advantage of existing deployment processes we had in place, and OpenStack allowed us to reuse some of those pieces. We also wanted a solid platform that was understood by operations. Many of the decisions we made in moving to OpenStack started from the operational aspect: we tried to align with what was already in place so we weren't making huge changes in the environment, and we wanted to take advantage of the operations team's knowledge. So, for example, we're a big Linux shop running Red Hat Linux for the most part, so we chose Red Hat's OpenStack product; OpenStack Juno and Kilo run on RHEL 7, which the operations team is already familiar with. And, as Ashwin and Nick will get into later, we're using Cisco UCS hardware for the bare metal as well as F5 load balancers. Again, that's to simplify the operational piece, because our teams already know those technologies, and to minimize operational impact. So Ashwin's going to jump up here and go into detail on the design.

Thanks, Chris. Hi, my name is Ashwin. OK, let's get going. So what does FICO's OpenStack design look like? We have a high-availability implementation of OpenStack running Juno in most of our data centers.
We have F5s to load balance across the APIs, and, as Chris mentioned, roughly 90% of our fleet is Cisco UCS servers. Currently most deployments run on M3 servers: Cisco UCS C240s and C220s. The C240s are the rack servers dedicated to compute and storage, and the interesting thing is that we leverage a hyper-converged architecture, where both compute and storage run on the same node. That reduces the number of servers needed for small deployments and lets us scale out as necessary. Since we have a presence across the globe, we deploy based on the needs of each geolocation; the applications and products that will be hosted in a data center drive the requirements for the physical hypervisors. We use UCS C220 rack servers for the control layer. We're also doing a lot of testing with cgroups to limit resource contention, because in a hyper-converged architecture, if either compute or storage gets out of control, we need something in place to contain it. We're looking at containerizing certain applications too, but cgroups are the higher priority. Tiered storage I'll talk more about when we get to Cinder; we have both Ceph and SolidFire for our storage.

This is the actual logical deployment. You have the F5 load balancer balancing across all the HA pieces of OpenStack. The blue nodes are Cisco UCS C220s running the controllers, all the controller services, plus the Ceph monitor nodes. Smaller deployments have three monitor nodes and three controllers; bigger deployments have five of each. All the APIs are load balanced by the F5, and since we already have F5s in many of our data centers, we wanted to use the existing hardware to make it more practical. For MySQL we use Galera to replicate, active-active across all the controllers.

The orange nodes are the Cisco C240s. Typically a server has 24 drives, of which six are SSDs and 18 are 10K SAS drives. The way we split it, two SSDs are for the operating system and the remaining four act as journals for the Ceph OSDs, so out of all the disks, two are left as spares. The benchmark we arrived at for the journal-to-OSD ratio is 1:4. If you go 1:5 or 1:6 it doesn't make sense anymore; you might as well write directly to the 10K SAS drives and get better performance. That's the trade-off, and we settled on 1:4 for the journal-to-OSD ratio.

This is a higher-level view of the data center deployment, which also includes the Cisco networking gear in a spine-and-leaf design; the previous slide showed the OpenStack piece at the bottom. So a typical data center deployment looks like this: Cisco UCS servers running compute, storage, and networking, with Red Hat Linux on top. All the greenfield deployments use RHEL 7 and Red Hat Storage, and here we're talking about Ceph, not Gluster. We run OSP 6, and on top of that the applications currently run on OpenShift by Red Hat. And as I mentioned, we have tiered storage; I'll come back to that with Ceph.

Ceph was selected as our storage because it's scalable, it's open source, and it's software-defined storage. It provides block storage, a file system, and object storage.
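To make the journal arithmetic above concrete: with six SSDs per node (two for the OS, four as journal devices) and a 1:4 journal-to-OSD ratio, sixteen of the eighteen 10K SAS drives become OSDs and two are left over as spares. A minimal sketch of that layout calculation, using the drive counts quoted in the talk (the function itself is illustrative, not FICO's tooling):

```python
# Sketch: how a hyper-converged C240 node's drives get carved up,
# using the counts described in the talk (illustrative only).

def plan_node_layout(ssd_count=6, sas_count=18, os_ssds=2, osds_per_journal=4):
    """Return a simple plan for OS disks, journal SSDs, OSDs, and spares."""
    journal_ssds = ssd_count - os_ssds          # SSDs left over for Ceph journals
    max_osds = journal_ssds * osds_per_journal  # one journal SSD backs four OSDs
    osds = min(max_osds, sas_count)             # can't have more OSDs than SAS drives
    spares = sas_count - osds                   # leftover SAS drives stay as spares
    return {
        "os_ssds": os_ssds,
        "journal_ssds": journal_ssds,
        "osds": osds,
        "sas_spares": spares,
    }

if __name__ == "__main__":
    print(plan_node_layout())
    # -> {'os_ssds': 2, 'journal_ssds': 4, 'osds': 16, 'sas_spares': 2}
```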
As of today, in production we're only using the block storage, backing Cinder. The file system we're just playing with in the lab; it's not in production at FICO. For object storage we have a couple of options: either leverage the existing hardware and just spin up a RADOS Gateway in front of the existing cluster, or stand up a separate cluster for Swift. The reasoning behind both options is that our developers have experience coding against Amazon S3 buckets, and both Swift and the RADOS Gateway expose S3-compatible APIs, so they wouldn't have to change much on the code side. The decision we're making is whether the performance difference between Swift and the RADOS Gateway is large enough that we should build a separate Swift cluster, which we could also use for backing up Cinder volumes, rather than leveraging the existing hardware. We're still in the testing phase. Ceph can be optimized for small and large OpenStack deployments, with hyper-converged nodes or dedicated nodes. As I mentioned, currently we're using S3.

OK, so as I said before, Ceph can't meet the demands of every application. Take a MySQL database as an example: you can't expect high performance out of a Ceph cluster built on 10K SAS drives. Initially we just had Ceph, but learning from experience, and for the workloads that need high performance and low latency, we added SolidFire flash arrays. So Cinder has two backends: one is SolidFire and the other is Ceph. In our deployment, all the VMs boot directly from Ceph volumes, so each VM sits directly on an RBD volume in the Ceph cluster. If a requirement has SLAs to meet, we just use SolidFire for that particular volume. On the compute side there are also certain licensing factors, so we use host aggregates to pin certain VMs so they launch only on specific hypervisors rather than on whatever hypervisor the Nova scheduler picks; with Nova flavor extra specs you decide which hosts should accommodate those VMs.

OK, operations. Hey everybody, my name's Nicholas Drosmatos, Director of Engineering, Cloud Services, and recently Operations as well. A big question we get all the time is: how are you dealing with the operational aspect of next-generation, cloud-based technology while moving away from a more legacy mindset? Think big blocky SAN, VMware, things along those lines. My answer has always been that we're powered by magic. But the fact of the matter is, when we first deployed OpenStack we did an entirely manual deployment, and what we found was that a lot of the documentation was a little bit off; back in that era OpenStack wasn't really fully baked. Some of our deployments didn't go as smoothly as planned, and we ran into a lot of issues. So we decided we needed to automate. We had to find a way to automate every single deployment, regardless of the scale, the size, how many nodes there were going to be, or which products were involved. So we decided to use UCS Central, and Foreman and Puppet for the configurations.
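Before going further into the provisioning pipeline, here is a rough illustration of the scheduling approach Ashwin described: a host aggregate groups the licensed hypervisors, and a flavor extra spec pins instances to that aggregate (this relies on the AggregateInstanceExtraSpecsFilter being enabled in the Nova scheduler). A sketch with python-novaclient; the credentials, host, aggregate, and flavor names are hypothetical placeholders:

```python
# Sketch: pin licensed workloads to a specific set of hypervisors using a
# host aggregate plus flavor extra specs. All names/credentials are placeholders.
from keystoneauth1 import loading, session
from novaclient import client

loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="https://openstack.example.com:5000/v3",
    username="admin", password="secret", project_name="admin",
    user_domain_name="Default", project_domain_name="Default",
)
nova = client.Client("2", session=session.Session(auth=auth))

# Group the hypervisors that carry the license.
agg = nova.aggregates.create("licensed-hosts", None)
nova.aggregates.add_host(agg, "compute-07.example.com")
nova.aggregates.set_metadata(agg, {"licensed": "true"})

# Any flavor carrying the matching extra spec will only land on those hosts
# (requires AggregateInstanceExtraSpecsFilter in the scheduler filter list).
flavor = nova.flavors.create("m1.licensed", ram=8192, vcpus=4, disk=80)
flavor.set_keys({"aggregate_instance_extra_specs:licensed": "true"})
```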
So when we bring up a Cisco UCS domain, we attach it to UCS Central. UCS Central holds all of the policies and pushes them down to the server configurations themselves. Then we bootstrap one of the C220s and provision everything with Foreman and Puppet. That's pretty much how everything gets done; Foreman and Puppet do all the dirty work. Right now a typical deployment takes anywhere from about 10 to 15 minutes, depending on latency and how many nodes we're deploying to. Our provisioning is mostly hands-off, and output is logged on failure. We wanted it to be intelligent: if there's a failure, we want it to try multiple ways of rectifying the situation rather than requiring manual intervention, because the whole point of automation is getting away from manual configuration. The main benefit of automating everything is that we reduced our time to market and our staff hours, so people didn't have to work 60-hour weeks anymore. And when we deploy a new region, or another availability zone within a region, it's a lot easier to provision: we bring up another UCS domain and walk through the same process again. It ensures consistency and reliability of deployments, and it gives us source-controlled configurations.

So CloudForms: how does it fit into the picture, and how are we using it currently? We use CloudForms to manage both our next-generation environments and our legacy environments: the VMware-centric environments being legacy, OpenStack being next generation. We also use AWS. There are regions or data centers where we'll leverage AWS because they already have a strong presence there, and it's easier for us to deploy Red Hat Linux, put OpenShift on top of it, and containerize. But a lot of countries don't allow you to take data outside the country, so if there isn't already an AWS region or availability zone there, there's no way for us to leverage it; we have to build our own cradle, or pod. We use CloudForms specifically as a single pane of glass. Whether it's from a developer perspective or an engineering perspective, we want people to be able to look at utilization and make sure they're right-sizing their applications, rather than constantly coming back to us saying, "We think it's right-sized," or "We think we need to go this direction or that direction." So we give them dashboards, reporting, and visibility. We also use CloudForms to give them a self-service catalog, so they can go in, select whatever series of applications or instances they want to deploy, and use Vagrant or whatever else to actually do the deployment. CloudForms also acts as a translator to the Amazon APIs for us.

So, monitoring. Zabbix is currently used for validating services and simple alerts. We've been using Zabbix for a very long time, and what we've realized is that, while it's not a bad tool, it requires a lot of custom integration to meet all of our needs and requirements. So we're trying to simplify things.
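As a sense of what "validating services and simple alerts" looks like in practice, here is a minimal sketch of the kind of endpoint probe that might get wired into Zabbix as an external check. The URLs are placeholders, not FICO's actual endpoints, and this is an illustration rather than their monitoring code:

```python
# Sketch: trivial OpenStack API liveness checks, the sort of "simple alert"
# probe a Zabbix external check could run. Endpoints below are placeholders.
import requests

ENDPOINTS = {
    "keystone": "https://openstack.example.com:5000/v3",
    "nova": "https://openstack.example.com:8774/",
    "glance": "https://openstack.example.com:9292/",
    "cinder": "https://openstack.example.com:8776/",
}

def check_endpoints(timeout=5):
    """Return a dict of service -> 'ok' or an error string."""
    results = {}
    for name, url in ENDPOINTS.items():
        try:
            resp = requests.get(url, timeout=timeout)
            # API roots answer with 2xx/3xx; 5xx or a connection error is a failure.
            results[name] = "ok" if resp.status_code < 500 else f"HTTP {resp.status_code}"
        except requests.RequestException as exc:
            results[name] = f"unreachable: {exc}"
    return results

if __name__ == "__main__":
    for service, status in check_endpoints().items():
        print(f"{service}: {status}")
```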
Right now we have way too many tools monitoring all the different components, from the hardware layer and hardware abstraction layer to the hypervisor, the operating system, the application itself, and sometimes even the containerization. Correlating the whole stack and understanding, when you actually have a problem, what it's related to is very difficult with the tools we're using today. For example: we lose a Ceph node, a rebuild kicks off, and a VM gets impacted and its transactions per second slow down. We'd like to be able to see that correlation across the entire stack and troubleshoot it. So one of the main things is: if you can't buy it, build it. There are a hundred million things out there, you could use Graphite, you could use Zenoss, there's a never-ending list of tools, but if you need a Swiss Army knife to monitor both your existing and your next-generation architecture, you're either going to do a lot of integration or, at that point, you might as well start building things on your own. We do use ELK currently for centralized log capture, and that ties into how we're trying to be automated and intelligent about pulling information from logs. We actually have an internal product in development right now that aggregates logs, is predictive when it comes to failures, validates the actual failures, and then attempts remediation. If it sees that a service has gone down, it goes in, validates that the service is down, and tries to restart it; if that fails, it opens a trouble ticket in ServiceNow. So that's all we have. I'll open it up for questions, feel free, or not.

We did. The challenges we experienced initially were more hardware related. We went white box first, and then we decided to go with Cisco because we already had a lot of Cisco deployed, so there was a lot of existing understanding there as far as SNMP, having the MIBs, and using Python to integrate. One mistake we made was when we were testing SSDs, we hit a compatibility issue: we weren't getting enough throughput on the journals, so the journals were actually slowing Ceph down, and we didn't realize it at first. We didn't run into it in production; we caught it while we were still in the proof-of-concept phase. One of the major things is beating on the environment until you find the breaking point, understanding where that breaking point is, fixing it, and then finding the next one. So we do a lot of stress testing, whether it comes from the application layer or the infrastructure layer.

There might be specific areas or regions where we only have one customer and one product, so instead of deploying 16 Ceph nodes and 16 compute nodes for high availability, or N+2 or N+3 or whatever you want to call it, we can converge it and shrink it down. And the good thing is we can scale out with a methodology: if we need more storage, we can add dedicated Ceph nodes, or we can add dedicated compute nodes if we want to.
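The self-healing flow Nick described a little earlier (validate that the service is really down, try a restart, and open a ServiceNow ticket only if that fails) could be sketched roughly as follows. This is an illustration only, not the internal tool itself; the service names and the ServiceNow instance URL are hypothetical placeholders:

```python
# Sketch of the validate -> restart -> escalate flow described above.
# The ServiceNow URL/credentials and service unit names are placeholders.
import subprocess
import requests

SNOW_API = "https://example.service-now.com/api/now/table/incident"
SNOW_AUTH = ("svc_account", "secret")

def service_is_down(name: str) -> bool:
    """Validate the alert: ask systemd whether the unit is actually inactive."""
    return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode != 0

def try_restart(name: str) -> bool:
    """Attempt remediation; return True if the service comes back."""
    subprocess.run(["systemctl", "restart", name], check=False)
    return not service_is_down(name)

def open_incident(name: str) -> None:
    """Escalate to ServiceNow when self-healing fails."""
    requests.post(
        SNOW_API,
        auth=SNOW_AUTH,
        json={"short_description": f"{name} down and restart failed",
              "urgency": "2", "category": "cloud"},
        timeout=10,
    )

def remediate(name: str) -> None:
    if not service_is_down(name):
        return                      # false alarm, nothing to do
    if not try_restart(name):
        open_incident(name)

if __name__ == "__main__":
    for svc in ("openstack-nova-api", "neutron-server"):
        remediate(svc)
```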
Usually, though, our architecture grows like a series of steps: across, then up, then across and up again. So we have the ability to use that converged design for a small pod, and then if we onboard more customers later, or more products get deployed, we can just keep increasing it and keep deploying. It's also simplification, though I know that sounds funny, because most people look at it and ask: how are you simplifying with a hyper-converged design? How do you understand what's breaking what, or whether something has an effect on something else? We've been doing a lot of detailed analysis there, and that's where a lot of the cgroups work comes in.

Some of them, yes. We've been working heavily on the upgrade path, trying to automate the entire configuration change. It's been difficult. Moving from Juno to Kilo was very tough for us; we actually stood up a greenfield environment alongside and essentially migrated everything over. We don't want to keep doing that; we want to find ways to operate through upgrades. We think that as OpenStack continues to grow, and it's still a teenager in a sense, upgrades are going to get smoother and easier. There are already other companies doing rolling upgrades: I think BlueBox is doing rolling upgrades, and Cisco Metapod is talking about doing them. I think everybody's eventually going to move in that direction. Same with the hyper-converged design: when we first started deploying hyper-converged, pretty much everybody, including Inktank, thought we were crazy, and now you're seeing the market move toward hyper-converged designs. There are companies like Nutanix slowly adopting OpenStack with their general architecture, and Cisco, with the acquisition of Metacloud and now Metapod, is doing hyper-converged as well. So that's the way we see it, and we're glad to be going in that direction; it's worked out well for us.

Sure. Like Nick mentioned, since we have the HA piece in the controllers, we don't upgrade all the controllers at a single point in time. For example, when we migrated from Icehouse to Juno: in one of the regions we only had three controllers, so we took one controller out of the load-balancing pool, upgraded everything on it, and set the appropriate flags on the compute nodes. The practical issue you hit is that your console doesn't work: you have nova-compute running on Icehouse and your controller piece running on Juno, and things like the Nova console break, so you have to set some flags and make sure you do a proper rolling upgrade. Then you make sure all the services are working fine, spin up 200 or 500 VMs, confirm there are zero failures, put that controller back into the pool, and take the second one out. Since we're hyper-converged, when we do rolling upgrades we do the storage piece first: make sure Ceph is good, going from 1.2 to 1.3 like we did, do the mons, do the OSDs, and then do the OpenStack piece. But as Nick mentioned, going from Juno to Kilo was pretty challenging. With Keystone we had LDAP integration, and we had a lot of issues with that. And it's not only that; your database breaks.
When you try to upgrade, the database is where it bites you: you end up going into SQLAlchemy and the MySQL migrations, walking the schema from one revision to the next, those kinds of things. MySQL is the main thing; if you get your MySQL upgrade done right, 90% of your work is done. Neutron and Glance are pretty easy, in our experience, though it can go to the other extreme too.

And to answer your question about the workload: we did a lot of stress testing on the hyper-converged setup. We have a limit; if we hit more than about 65%, we add a new node, which might be both compute and storage. I'm talking about the failure case: say I've hit 60 to 65% of my RAM and storage. At that point, if I lose a compute node, that's some percentage of my OSDs going down, which means I may impact my VMs, and my VMs might see some slowness. That's the testing we've done, and when we hit the threshold, we just add more nodes.

We do the aggregation of it, yes. One of the things Chris talked about earlier is that everybody thinks of FICO as a financial services company, but we're actually a software company that deals with big data analytics and predictive analytics. So for us, tying into something like Elasticsearch and pulling out viable information is actually not that difficult. We're trying to determine right now whether some of the stuff we're building should be open-sourced so everybody can take advantage of it, or whether we're going to keep it for ourselves. But having everything collected in Elasticsearch is how it's going right now: we may pull with different tools, but everything always ends up in Elasticsearch in the end. Any other questions? Everybody excited to go home?

So we actually looked at a lot of different distributions. I think initially we started looking at Piston, before the acquisition. We did a POC with Mirantis. We looked at Canonical. In the end, we were all running OpenShift; I kind of joke that we were running containerization before it was cool. When OpenShift was still in its pre-1.0 days, we were already running it in production. We had a good relationship with Red Hat, and we wanted to expand on that relationship. It's easier for us to troubleshoot and have one company to call, or as a lot of people say, a single throat to choke, rather than calling Canonical, having Canonical say, "Well, it's related to this," then calling someone else, and the next thing you know you have 15 people on the phone trying to triage a problem and no one wants to help. It's inherently the way technology has always been: you have a problem in VMware, you call VMware, VMware says it's the storage vendor, the storage vendor says it's not and blames something else, a driver version. Same with OpenStack, except you're talking about a lot more components that interface with and depend on one another. So when something breaks, it's a lot easier to only go to a single distribution.
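As a side note on the 65% threshold mentioned above, the reasoning can be made concrete with some quick arithmetic: after losing one hyper-converged node, the surviving nodes have to absorb its data without crossing a safe fill level. A small illustrative sketch; the node counts and limits are assumptions, with 0.85 standing in for Ceph's default near-full warning ratio:

```python
# Sketch: does the cluster stay under a safe fill level after losing one
# hyper-converged node? Node counts and thresholds are illustrative.

def utilization_after_node_loss(nodes: int, used_fraction: float) -> float:
    """Used fraction after one node's capacity (and its OSDs) disappears
    and its data is re-replicated across the survivors."""
    return used_fraction * nodes / (nodes - 1)

def needs_expansion(nodes: int, used_fraction: float, safe_limit: float = 0.85) -> bool:
    """True if a single node failure would push usage past the safe limit
    (0.85 mirrors Ceph's default near-full warning)."""
    return utilization_after_node_loss(nodes, used_fraction) > safe_limit

if __name__ == "__main__":
    # With 5 nodes at the 65% trigger point: 0.65 * 5 / 4 = ~81%, still under 85%.
    print(round(utilization_after_node_loss(5, 0.65), 3), needs_expansion(5, 0.65))
    # At 70% used, the same failure would exceed the limit, so add a node first.
    print(round(utilization_after_node_loss(5, 0.70), 3), needs_expansion(5, 0.70))
```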
And it's always the network admin or the developer. Any other questions?

Doing the CloudForms integration actually wasn't too difficult, because a lot of it is already there. And I think the next iteration, 3.1, which is going to be released soon, is supposed to have even more integration and components, but overall it wasn't very difficult for us. The reason we adopted CloudForms to begin with is that we were running a VMware-centric legacy environment with right around 4,000 instances in it. vCloud Director reached end of life, and we were moving away from VMware, for lack of a better term. We still needed the self-service catalog, the provisioning, the ability to give developers a way to integrate and deploy applications, things along those lines. We were really left with a few options: build something on our own, or take something off the shelf and do some of the integration ourselves. We use ServiceNow heavily, so we tied CloudForms to ServiceNow, and that off-the-shelf-plus-integration route is what we chose.

I can give you general numbers, but I can't give you exact ones. We currently have over 10,000 instances and containers running in the environment, spread geographically across the world. A lot of it, from an NDA perspective, we can't really give out. From a storage perspective, when you say storage, do you mean total storage, including SolidFire and Ceph, or just Ceph? Our smallest deployment of Ceph has, what is it, around 80 to 100 terabytes? About 200 terabytes for our smallest deployments. That's what we consider to be a pod, or cradle, or availability zone, and we usually stand them up in pairs, so around 400 terabytes. And that's for the smallest, obviously, not the largest. Yeah. Go ahead. I'm sorry.

If they're in the same region and split between availability zones, they'll communicate. If they're in a different region, we don't let them communicate. Part of the reason is that we use DHCP for handing things out, and our network team wanted to use a single VLAN for management. Even though we break up the subnet into different groups, we didn't want overlap. At one point, when we were doing testing and trying to automate everything, we were booting nodes and they were getting an IP from a completely different DHCP server, and it was trying to provision from San Diego to Turkey. That didn't work. So yeah, we started segmenting everything. Right now we're using Neutron only for L2, and we're also looking at implementing VXLAN; that's our next phase. Okay. All right, well, thank you.