for FICO's private cloud project. So here's a quick agenda, what we plan on going over this afternoon. First we'll talk about who FICO is. Not sure how many of you have an understanding of the company or know who we are, but we'll go into that. Then why we chose to go with OpenStack for our private cloud. We'll dive into the architecture a little bit and some iterations of what we've done over the last 12 months, touch on some automation that we've gone through and changes that we've made there, and then talk about what our plans are for the next 12 months and more.

So, FICO. I don't know how many of you have heard of the FICO score. That's usually the most popular keyword for FICO. That's a product FICO has that helps lenders make accurate, reliable, and fast credit risk decisions. What that basically means is it ranks consumers by how likely they are to pay credit obligations as expected. One key point: 90% of all lending decisions in the US rely on the FICO score. In addition to that, there's analytics software and other tools for businesses. We touch on debt management products, fraud and security analytics, consumer engagement, and big data analytics. A lot of financial analytics, if you couldn't tell. Two and a half billion credit cards are protected by FICO fraud systems, and FICO's been around for a long time: a 50-plus-year history of data and analytics experience, founded in 1956.

So why the move to the cloud? With our footprint across so many different financial institutions, we saw a need to expand beyond those top-tier financial services companies and get into the middle market. FICO's products have traditionally been on-premise, and the move to the as-a-service model makes it easier to get into those other tiers of the market. The open source nature allows us to participate in driving change within the community.
Not only are we consumers of a product, we are able to use what we have learned over the course of deploying and using the project and contribute back. We have contributions not only to OpenStack but to other open source projects that tie in closely, as well as contributions to the public community. Also, lastly, cost, which is a factor for a lot of people moving in this direction. We have 4,000 vCloud instances and we're growing at about 6%, so we were looking to attack that large growth and cost.

These are the words of our FICO CIO, Tony McGiver, describing the FICO Analytic Cloud product that we have. This quote shows that the FICO Analytic Cloud product is in line with our push for a cloud infrastructure, and this new direction is a big driving factor for our decision to move to OpenStack for the cloud.

So now let's touch on what our infrastructure looks like, or what it has looked like over the past year. In the first iteration, we started with a number of different technologies. We're running our infrastructure on Cisco UCS hardware, the C240 models for the most part, and C220s. Our first attempt was a combination of a virtualized OpenStack control layer running on Red Hat Enterprise Virtualization (RHEV). For the RHEV environment we were using Red Hat Storage with Gluster, and that was for the virtualized OpenStack control layer. The OpenStack compute layer was a combination of compute and Ceph on other C240s. And this, we found, was a pretty complex design for what we were trying to do. Too many technologies: having to run RHEV plus Gluster plus Ceph plus OpenStack was pretty hard to troubleshoot.
The existing legacy infrastructure at FICO is a lot of VMware, so the engineering team had to think about how the operations team was going to handle this, and all of the technology we used in our first iteration was pretty much new to the operations team. So after that came up, we decided that we needed to keep it a little more simple. The original design also had a lot of the hardware going toward infrastructure that wouldn't ultimately be used for the end compute environment. So we decided to redo it with a more keep-it-simple philosophy: stick with OpenStack and Ceph, and optimize the hardware for compute.

That's where we ended up with the second iteration. Again utilizing the UCS hardware, we moved load balancers to C220s, and the C240s became OpenStack nodes and Ceph nodes for the compute and controller layers. This gave us better use of the hardware and less complexity; very simple. We had RHEL 7, we had OpenStack, we had Ceph. Pretty easy to get your head around and easier to troubleshoot. And that's what we got: a simpler design, much easier to get operations up to speed. It was, for the most part, things that they already had at least some experience with. It took away a lot of the hardware dedicated to infrastructure and brought it back into compute, which was our initial goal.

Then, using Ceph as the only storage technology, we started asking questions: is that the right decision? Does it work for all our workloads? And does the hardware we chose fit the job we gave it? The first question was whether using the C220s as HAProxy load balancers was efficient and effective. The current iteration is another change from where we started. C220s are now dedicated to the controller nodes, the control layer for OpenStack, as well as some Ceph infrastructure.
And the C240s become the compute and Ceph nodes, and we removed the service load-balancing layer from this physical hardware and slid it over to existing F5 devices. Also, to answer that question about high-performance storage workloads, we have introduced SolidFire 4805s into the environment. So now we're starting to see tiered storage: we use Ceph as the general-purpose storage, and SolidFire becomes our choice for more high-performance applications. So, like I said before, the F5s took over the load-balancing role where HAProxy was before, C240s become dedicated to Ceph OSDs and OpenStack compute, and C220s are for the controller layer. SolidFire was introduced, and Ceph is still a major component, but as our more general-purpose or default storage solution; we'll choose SolidFire for the stuff that needs a bit more juice. So now I'm gonna turn it over to Oscar and have him talk about some automation.

Hi, hello, my name's Oscar. I do the automation for FICO. Currently we're using Foreman, which ties together our UCS hardware and Puppet. When I first came to FICO, there were a lot of problems: a lot of misconfigurations across our lab, different configurations on each node, and it was taking a long time to deploy just one simple stack. So, with the help of a lot of Puppet modules from the Puppet Forge, I decided to create something for FICO, because nothing out there met exactly what we needed. We use Foreman already in our infrastructure, and Foreman allows us to spin up our UCS C240s and C220s in any geolocation through Foreman proxies. It has a really nice UI that lets you deploy very quickly. At FICO there were a lot of custom things we had to do, and the Puppet Forge modules weren't exactly what we needed.
So I used quite a few of them, like the RabbitMQ and MariaDB modules and a lot of existing ones, and just wrote a wrapper around them to get exactly what we need, which allowed us to deploy our lab and our production environments very quickly. Some of that came down to custom facts: determining what each node needs and which OVS NIC needs to be on which box. Custom facts helped us bridge all the gaps. And that's it, Nick.

So essentially that brings us to what's next for architecture design: where are we headed, where do we see the industry headed. We started integrating a lot of our hyperconverged environment with Cisco UCS Central. Part of the reason was that the deployability and use of REST APIs gives us repeatable results, so consistency, across availability zones and different regions. We've actually expanded a lot globally and internationally, so having a single framework to manage each component of our infrastructure, and not having to constantly duplicate the same things over and over, builds consistency and also helps with rapid deployment. We've also been starting to look at Kilo automation, so we're actually running a combination of OSP5 and OSP6 right now. For the most part we've been trying to take a look at what the benefits are of each release of OpenStack, where they fit specifically with our needs and requirements, and ultimately our users and customers.

We also started looking at general high-density designs: switching from Cisco UCS rack-mount servers to blades and using only SolidFire. We determined in the end that it really made sense to use SolidFire as the guaranteed-SLA or high-performance storage, and to use Ceph for pretty much everything that doesn't need a guaranteed SLA. We also found some weak points with Ceph, like deployability of things such as a shared SCSI bus, things along those lines.
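To make the custom-facts idea Oscar described a bit more concrete: real Facter facts are written in Ruby, but the selection logic they encode could be sketched roughly like this. All the model strings, role names, and bridge-to-NIC mappings below are invented for illustration, not FICO's actual values.

```python
# Hypothetical sketch of the logic custom facts can encode: derive a
# node's role and its OVS NIC-to-bridge mapping from hardware facts.
# Model strings, role names, and bridge layouts are illustrative only.

def node_role(facts):
    """Map a UCS model string to an OpenStack role."""
    model = facts.get("hardware_model", "")
    if "C220" in model:
        return "controller"
    if "C240" in model:
        return "compute-ceph"  # hyperconverged compute + Ceph OSD node
    return "unknown"

def ovs_bridge_map(facts):
    """Decide which NIC backs which OVS bridge on this node."""
    layouts = {
        "controller":   {"br-mgmt": "eth0", "br-api": "eth1"},
        "compute-ceph": {"br-mgmt": "eth0", "br-tenant": "eth1",
                         "br-storage": "eth2"},
    }
    return layouts.get(node_role(facts), {})

facts = {"hardware_model": "UCSC-C240-M4"}
print(node_role(facts))                     # compute-ceph
print(ovs_bridge_map(facts)["br-storage"])  # eth2
```

Centralizing this kind of mapping is what lets the same Puppet code run on every box: the fact tells the manifest what the node is, instead of hand-editing per-node configuration.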
So we've been evaluating how much SolidFire can actually replace, and then trying to adjust the nodes accordingly based on what we're seeing. A lot of our file systems right now and a lot of the legacy applications are using traditional NFS-style file systems; we have some Gluster in use as well, and we're trying to move away from that. What we found was that the developers building these applications were using a tremendous number of small files, which was a perfect fit for Swift and object-based storage. So we're actually in the process of remodeling a lot of those applications, taking them from a legacy environment and moving them over to more distributed, smaller applications. We're leveraging OpenShift for a lot of the container and platform-as-a-service components, both on-prem and off-prem, including a public cloud.

We're also in the process of moving toward Satellite. Previously we'd always been using Spacewalk, and we found a lot of challenges with Spacewalk from a manageability perspective: guaranteeing that we have the same RPMs across multiple environments, and so on. We're also doing integration with CloudForms. We're using CloudForms as kind of our orchestration layer, since CloudForms has the ability to convert APIs between different providers and different clouds, whether that's OpenStack specifically or Amazon, and so on. We also deployed CloudForms to move away from a lot of our legacy architecture. Traditionally, on our private cloud, we used a combination of VMware, the typical design that you would see: VMware, Cisco, EMC; Vblock and FlexPod styles of designs. We found that those were inflexible, they were too large, and they didn't scale accordingly, or every time you did scale it was a huge investment.
And so we were actually running a lot of vCloud Director on top of that for our developers to be able to tie into Vagrant and the deployability of development machines and development applications. Due to cost, we really didn't see a good ROI on that design, so we decided that CloudForms would be the next-generation replacement. Some of the general issues we hit with CloudForms, tying it into a traditional VMware environment without rebuilding everything, were around network address assignments and things along those lines. By moving toward a completely OpenStack cloud on the development side, software-defined networking simplified a lot of those things. So I think that's pretty much everything you wanted to cover. Yeah, there we go.

So anybody who's deployed OpenStack obviously knows that the hardest thing so far, at least in my experience, is deploying OpenStack itself, not actually using it. Once you have it deployed, it usually works pretty well. So the question is, once we have it deployed, how do we know that everything's working properly? A lot of people in the environments I've seen go through: okay, let me spin up an instance. Can I ping it? Can I SSH into it? If I start an app, are the ports configured properly? Is there firewalling in the way? So one of the things we're talking about is how we test this once it's deployed. I'm sure a lot of you are familiar with the OpenStack project Rally, which is for testing full site deployments. Going through Rally, I found it kind of heavyweight for what we were doing. So we started working on a simpler, more scaled-down version called Rally Sprint, a sprint being a shorter form of a rally race, that does exactly what I described.
It'll create a virtual machine, it'll test the virtual machine capabilities, and it'll test all the REST APIs to make sure everything's functioning properly, which we also monitor independently of this framework. Another thing I wanted to talk about is Oscar's Puppet code: how do you deploy multiple sites? If you have 15 or 20 availability zones, how do you deploy that many AZs without having duplicate Puppet code for each availability zone? One of the things he's working on moving forward is having a data model, basically, so you can use the same Puppet code in all of these environments.

And for storage, I wanted to bring up that, like you said, we have a lot of file system stuff already in place. Gluster's great, but we asked, what alternatives are out there for Gluster? So we started going down that avenue and seeing what else might be on the market, what else is coming out. There's a lot of cool stuff, a lot of other distributed file systems coming out. We actually did a trial with Quobyte; they offer a distributed file system with a Swift gateway for object storage too, and we had relatively good results with that. So basically I was expanding on what Nick was talking about: we're keeping it bleeding edge. We're trying all these new technologies. We're gonna see some LXC stuff come into play. We're gonna see all the technologies that we talked about today come into play. That's all I have. I guess we can open up for questions, if anyone has any questions about architecture or design.

So, incredibly aggressive for the most part. Our CIO is also very technical, and he wanted us to rapidly adopt OpenStack. I would say we moved through every phase, from development to production, within probably about a nine-month timeframe, end to end.
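The Rally Sprint idea described above — boot an instance, then walk through connectivity and API checks and report pass/fail — could be sketched as a harness along these lines. The check bodies here are stubs standing in for real calls (boot a VM via Nova, ping its floating IP, open SSH); this is an illustrative sketch, not FICO's actual code.

```python
# Sketch of a Rally-Sprint-style smoke test: run a list of named checks
# against a freshly deployed cloud and summarize the results. Each stub
# below stands in for a real call (boot a VM, ping it, SSH in, ...).

def check_boot_instance():
    return True  # stub: would ask Nova to boot a test VM

def check_ping():
    return True  # stub: would ping the VM's floating IP

def check_ssh():
    return True  # stub: would open SSH and run a trivial command

def run_suite(checks):
    """Run each (name, fn) pair; a raised exception counts as a failure."""
    results = {}
    for name, fn in checks:
        try:
            results[name] = bool(fn())
        except Exception:
            results[name] = False
    return results

suite = [("boot", check_boot_instance), ("ping", check_ping), ("ssh", check_ssh)]
results = run_suite(suite)
print("PASS" if all(results.values()) else "FAIL", results)
```

The point of the wrapper is that a check that blows up is recorded as a failure rather than aborting the whole run, so one broken API doesn't hide the state of everything else.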
And that includes also starting to build a lot of the automation and tools and configuration management, including general monitoring, and then obviously testing. A lot goes into building OpenStack clouds, and then people forget that they still require maintenance and general design. How do you scale? How do you grow? When do you know that you need to buy or acquire more hardware? And how do you upgrade? That's a huge one, and that's a problem everybody's facing. If you were in here for Mark Shuttleworth's presentation, he just announced CI as a service, where basically they're taking HEAD from GitHub and pulling it right down into a deployed production cloud. So it's pretty cool; I think they're doing a demo Wednesday. That's something I think every single one of us is interested in. Any other questions? There's a mic over there in the middle, by the way.

Currently, so that's actually a difficult question to answer. Including development and production? In regular production, I would say probably in excess of 100 individual nodes right now, split between different availability zones and regions. It really depends on the geolocation and availability zone. We actually size our OpenStack clouds based on the number of products that we're deploying in each of those regions. So if there are more products going in, or if there are specific requirements like IO dependency, or, say, some of the containers we're deploying are larger than typical, that cloud will be a little bit larger than other individual clouds. There are some geolocations where we only have a single product, so it only makes sense to deploy the bare minimum, and then as you add more product groups or you grow, we grow the cloud along with it. That's part of the reason why we wanted a distributed and scalable system, moving away from a lot of the legacy architecture. Yes, go ahead.
So I think initially we ran into some general issues with Ceph from a performance perspective. We also tried a lot of different things with Ceph. We were actually using a combination of SSDs and 10K SAS, and what we saw for the most part was that the solid state drives didn't really add any more performance to Ceph than typical. Maybe if you were using slower drives, but we were using 10K SAS across the board. (On Firefly, though, and Hammer, there have been a lot of code changes that are supposed to address that.) Yeah, and to expand on that, we were running the previous generation. And then there was a need for general performance that, as a whole, Ceph was unable to really provide for us. Certain applications needed higher availability because of SLAs that we agreed to with customers. So I think storage was probably the number one thing that we were dealing with on a pretty consistent basis. The compute in the hyperconverged design has probably been the second most difficult component, because everybody always wants to separate the storage from the compute. We prefer the hyperconverged design: we get better density out of it, and it actually shrinks our footprint for each individual availability zone or region that we deploy into.

So the question is about SolidFire versus the performance levels of something like Ceph, which it sounds like SolidFire is going after, and, since we mentioned ROI and care about cost, how we guide the use cases so that people are using the right storage platform. That really is dependent on SLAs, honestly. What we commit to with our customers we have to guarantee in the end, and we have to guarantee those levels of performance. By customer, do you mean your internal tenants? It could be internal, it could be external, depending on the environment.
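The SLA-driven storage choice just described — SolidFire when performance is contractually guaranteed, Ceph otherwise — amounts to a simple routing rule. The threshold, field names, and backend labels below are invented for illustration:

```python
# Illustrative sketch of the tiering policy: volumes carrying a
# guaranteed SLA (or a high IOPS floor) go to SolidFire; everything
# else lands on the general-purpose Ceph pool. Numbers are made up.

def pick_backend(workload, iops_threshold=5000):
    """Return a storage backend name for a workload description dict."""
    if workload.get("guaranteed_sla"):
        return "solidfire"
    if workload.get("min_iops", 0) >= iops_threshold:
        return "solidfire"
    return "ceph"

print(pick_backend({"name": "oracle-db", "guaranteed_sla": True}))  # solidfire
print(pick_backend({"name": "web-frontend"}))                       # ceph
print(pick_backend({"name": "batch-analytics", "min_iops": 8000}))  # solidfire
```

In practice this kind of policy would typically surface as Cinder volume types mapped to different backends, so tenants pick a type rather than a device.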
So a lot of our OpenStack architecture design was based on the FICO Analytic Cloud, which is basically an X-as-a-service model where customers can actually bring their own data into our deployed cloud and run data analytics against it. What we found initially was that a lot of small and medium-sized businesses didn't want to deploy and manage their own hardware, and the legacy architecture was more on-prem. As we moved to a cloud-centric model, we acquired a lot more small and medium-sized financial institutions and different government institutions. So it made sense to move toward software deployments and providing a PaaS platform and, as I said, X as a service.

We use Ceph for general workloads that don't need SLAs to be guaranteed; we use SolidFire for things that actually do require SLAs. The other main thing is availability and replication: Ceph doesn't really work well across multiple regions and availability zones for replication. There are ways of getting around that, but we wanted something that was more native. Some of the other benefits we get from SolidFire are dedupe and compression, things along those lines. Yes? Actually, do you want to take that? I think Chris is probably better for that; I focus more on the testing. Yeah, so, could you say that question again? I'm sorry. Metrics collection, along the lines of, like, Ceilometer and things like that. Yeah, I mean, we use a combination of tools. We have some stuff that we monitor through Zabbix, we have Nagios, and we use things like AlienVault for sniffing individual traffic. It really depends on the geolocation, what tools we have deployed, and the size of those instances and the actual clouds that we build. Smaller ones, we actually have less monitoring. Oh, I see what you're saying: collecting actual hardware performance.
Oh, so a lot of that information we actually pull from UCS Central and Cisco UCS as a whole. A lot of the peak bandwidth and general performance metrics we'll actually pull through there.

Yeah, so the savings have been fairly substantial from what we've seen, especially in the dev environments. Not only have we increased productivity, but we can actually set up, deploy, and scale much, much faster than we were able to previously. The exact amount of money I think we were able to save was around 10 million dollars, but those are just rough estimates. It's been substantial: licensing has gone down, productivity has gone up. So it depends on how you look at the metrics in general. Yes? How big are you? Fairly large. I think one of our smallest dev environments is right around 4,045 individual instances running concurrently. That probably translates into right around 26 individual nodes right now. Yes?

Actually, you can choose that. We found it works for general workloads that are not incredibly IO-sensitive. Transactional stuff is really where we saw problems with Ceph. We had some applications that have lots and lots of little writes, and we found that those would queue up with Ceph; that was one of the main issues. Anything that's more CPU- or memory-intensive rather than IO? Honestly, it could vary. It could be web-centric, it could be middleware. Most of our databases are deployed on SolidFire, just from a performance perspective. So it really just depends, but I would say a lot of the web workloads and things that run in memory, we run on Ceph for the most part. Yes? When did you run it? Did you guys...? Yeah, it was actually fairly substantial. We did a small proof of concept, and I think we determined that it was right around $35,000 to $40,000 a month for us to run the same type of infrastructure in AWS. And that's just based on the initial model; that was a single product.
So AWS: while we think that public cloud is definitely the future, it's not in our immediate future for the most part, unless we leverage it for high availability, or in regions where it just makes sense, where you can actually take data out of the region and you need multiple availability zones. We have quite a few individuals from a customer perspective that we're working with that have limitations like that.

Three specific issues? So I would definitely say Ceph was one of those when we were initially starting out. Latency, latency, and latency. Yeah, latency, and latency, and latency. Also, moving the load balancing over to the F5s solved a lot of the problems that we were experiencing. Then, having a common framework for design and automation solved a lot of problems. We ran into a lot of issues at first where we were running different versions of code across the entire stack, and we had to minimize that. It was a very large risk for us; plus, from a supportability standpoint, no one wants to support an environment with different versions of code that shouldn't be running together, not meeting metrics or profiles. Yes?

We are, yeah. So we have quite a few applications that are moving toward in-memory. Obviously it's faster, you can build it more distributed, and it's easier to spin up pre-provisioned instances and get them ready. As I mentioned earlier, we do a lot of stuff with OpenShift and containers, and we use a combination of Docker as well. So moving toward an in-memory model is definitely substantial for us, especially being an analytics company and not being able to tolerate latency. Yes?

Yeah, so I think a lot of customers are kind of in that same situation, where our operations team is still struggling to get their heads wrapped around software-defined storage and software-defined networking and OpenStack in general. For them, it's kind of this mysterious cloud.
And we're constantly engaged with them, training with them on best practices and monitoring and things along those lines. I would say probably 40% of our workload is on OpenStack and 60% is on traditional legacy VMware-centric designs. Also, things like Oracle RAC and so on don't work on OpenStack, so we find instances where we have to run on a different type of hypervisor. Yes? So I can't actually elaborate too much on that, because we have a few different proof of concepts that we're going through, and I don't want to necessarily give credit to one company versus another. These are Neutron, though, right? Yeah, Neutron. Neutron ML2. Yeah. Looks like that's it. Okay. All right. Thank you, everyone.