Hello, everybody. Good afternoon. My name is Travis Newhouse. I'm the chief architect at AppFormix, and it's my privilege today to introduce Mike Kraft and Brian Eckblad from Viasat, who are users of AppFormix. They run a very lean operations team for their OpenStack environments, and we encouraged them to come give a talk and share their experiences with all of you today. So I'll hand it over to Mike.

Hopefully this mic works. Good. So I'm Michael Kraft, and this is Brian, both of Viasat. I should say Travis is from AppFormix, a partner of ours. Today I'm here to talk about — fancy title here — how service optimization keeps Viasat flying; in essence, how we deploy and use OpenStack at Viasat, to some degree. So without further ado, if I can get this keyboard to work. Maybe.

Today I want to cover who we are — a lot of people still don't know Viasat; we're a rapidly growing company. Then why OpenStack: you want some clouds, we did OpenStack. The current state of the cloud: some numbers, some data, to show that we're actually really using it. Some key areas of interest that we've found to be successful, which you could possibly use as guidance — good questions and answers you should think about. And then some specific challenges — I think we've got three of them today — and some solutions, including from the partner we just mentioned. So let's get into it.

So Viasat, for those that don't know: we're a global broadband services and technology company. We launch satellites, operate satellites, we do ground stations, we do IT, we do everything in between. We provide consumer services — a lot of people know us lately from airline services; we do a lot of in-flight Wi-Fi, so we're getting a lot of publicity on that. Commercial, so business and residential home broadband — internet services, if you will, and not your old stuff; it's actually true broadband to your home. And government customers — insert three-letter acronyms here — so a broad spectrum of customers. And again, this is our PR statement, but we think big, and I love it. We think big, we act intelligently, we're not done, we're just beginning.

So, motivation: why OpenStack? We got to a point where we just needed to scale, as with a lot of companies, right? That's an awesome problem, an awesome challenge to have. OpenStack provides for us on-demand infrastructure with self-service. We are truly using it as a complete self-service IaaS platform. It reduces infrastructure capital and operational costs — of course, you can find vendors that cost a lot of money if you want to buy their hypervisor. So we were looking at how to get out of that existing, expensive model, and how to get to 10x, 100x of that scale. And then, behind the covers, we want to be transparent with capacity scaling, and also be able to add additional services while still staying within that standardized umbrella of API endpoints.

So we have, again, a wide range of applications, which I think makes us kind of unique. We're customer-facing, internal and external very soon. We have airline verticals, government verticals, residential, commercial again. We are employee-facing also, so with that, we have a lot of pets. And then internal development and test environments, which are more cattle. But again, a wide breadth of things we address here: enterprise apps, service-delivery apps, ISP-based apps, if you will.
So, history — it gets a little more text-heavy, sorry. Our internal POC environment started in 2014. As you can see here — I don't want to just read it verbatim — we used it for select development teams, to kind of get our feet wet internally, and for our customers, to see if people actually liked it, could use it, could actually do self-service. A lot of people, right, they just get thrown too many toys in their basket and they kind of freak out. We started with Havana with Open vSwitch back then. I should mention we had success with the platform itself, but had very specific troubles and problems with OVS. As you read on, you can see we went away from that.

So our production design, minus OVS, started in September 2014 — not too long ago. Then we iterated and started to build out a production environment quickly in November 2014, with our first production cloud and customers coming on board in December of 2014. So it was a nice little Christmas present for our customers. And that was with Icehouse, and specifically we dropped OVS and went straight to Linux Bridge. Our current production clouds are running pretty wide — three versions of code now, right? Juno, Kilo, and Liberty, which presents its own challenges. I don't suggest it, but that's where we're at. We're trying to stay with an N-minus-1 model, if you will, or a maximum of two versions. So we intend to stay with Kilo and Liberty once we get there, then hopefully go Liberty and Mitaka, and so on and so forth.

For production releases, as you can tell, we've partnered with Rackspace — there's our hometown connection, so we've got to mention them. With that, we've been able to leverage their professional services and support teams, stay lean internally, get to a production-ready cloud, and offer that as a service to our customers really rapidly. And then the last bullet point: we're ever-evolving toward being in-house supported, hopefully 100% someday. Rackspace won't like that, but that would be the end goal with any application team really trying to deploy something — you really want to own it end to end, right?

So, OpenStack operations at Viasat. To reiterate some data: we have five private clouds. We're at over 200 physical hypervisors and 7,000-plus instances. Not all of this is Rackspace-managed, just to be clear. We have over two petabytes of storage between Cinder and Swift. Underneath the covers — a little ad hoc data point there — we're running LVM, Ceph, and some highly available storage arrays, from spinning disk to SSDs. We have completely standardized on Linux Bridge across everything at the moment; we are looking at possibly getting more flexible there in the near future. We have a six-member DevOps team — again, something that makes us unique, I think. We have no silos; we own everything within the infrastructure-as-a-service umbrella. If you will, we basically get a link from our data center backbone fabric — our Ethernet packets — and we build and own everything all the way down to the floor in the data center, within our small little team. And last one, we have 300-plus internal users and customers. We are really, really pushing the boundaries of multi-tenancy and the wide, wide breadth of requirements you have when you have that many customers.
So, for instance — I don't have it up here — but our current biggest production enterprise cloud right now has close to 70 different projects. I think that's pretty unique. Of course, it's challenging.

So, unique aspects. To further that conversation: we truly provide self-service IT to our users. We give them resources and capacity, and they do what they need with it. We give them all the IPs they need, we give them big quotas, and they're allowed to digest it and eat us alive, if you will — take it all, right? Supporting projects spanning private and public platforms — question mark? But yes, we do. This makes us unique, I think, because we have customers leveraging OpenStack privately and additional public clouds for production rollout. And we even have the flip side of that, too: we have customers using AWS for development and then using us for production. So they see us as an equal, which I think is awesome, and it allows us to invest more, get bigger, and offer more in the future.

Lean operations team: one operator to, well, too many instances — but a lot. People like those kinds of stats. Right now our workloads are dynamic, so it varies day to day, of course, but on average we're easily pushing 100-plus instances created and deleted daily by a fully CI/CD customer base. Oversubscribed compute ratios vary — this is just one specific area again — but we allow up to 4:1, and we're probably going to push this maybe substantially higher. We've been able to do this by having some really cool new tools from AppFormix, but it's something you have to be very, very careful about. Still, we do push that envelope. Then our densest cloud supports NFV — network function virtualization, if you don't know — so firewalls and other stuff I can't get into. That environment has 50-plus hosts: on only 50 hosts, 2,500 instances, and up to 80 instances per host. And this is one environment we're probably going to stack even denser, maybe 2x that. To us, that's really, really cool, because you don't see many people doing these kinds of numbers on a given host and being successful. And our network underlay is vendor-agnostic. We're pretty much on one major-brand vendor right now, but there's nothing we do with that vendor that we could not do with any other vendor, so we really look forward to exploring that in the near future. Again, no proprietary features — a very conscious choice.

Operating a lean team: we think we have some key ideas, if you will, that we've been applying that really enable us to be lean and still have happy customers. Availability: because we have such a wide swath of customers, from enterprise to very cloudy apps — cloudy apps are more resilient by nature, but because we have such a huge swath, we focus on availability. We don't care if you don't need your app to be available all the time; if you can lose a node and be okay, fine, but we have customers that can't. So we just cover the whole swath. We try to make sure things are highly available: enable things like boot from volume, be able to react to a hypervisor going offline, have an F5 in front so that if you need to load-balance, we can offer those kinds of things. Be visible: this is something we've personally been pushing really heavily on, too — be very visible to our customers, especially in a private cloud space.
If you go to Amazon, you have no idea what actual host you're living on or what their real capacity numbers are under the covers. Generally, you don't have to care. But in a private cloud, we are budget-constrained, sometimes by the same manager that you have as a customer, right? So we try to be very, very visible about who's using what. Is there really capacity left? If there isn't, is your neighbor using too much? Can you be friendly and just go talk to them? Can they tear down their test, their playground, because I need this for something that's actually making money? It's just the reality of business. We can only build so fast, we can only invest so fast. So we think it's been really key to be very visible and transparent to our customers across the board.

Cost management: again, visibility helps cost management. We try to tell people not to be wasteful, and we empower them with data — real data. Not just "here's your flavor size," and that's it, that's your chunk of data. We can actually tell them, across all their instances in use, how much of that given flavor's resources they're actually using. So, hey, if you need to add more capacity or use more stuff within your project and you don't have any more room, go look at this report, figure out what you actually need, and redeploy with a smaller flavor — or a bigger flavor sometimes — and be very cost-conscious there. Which leads me into capacity planning. It's been very, very critical and key, again, to be transparent and provide very clear, easy-to-understand capacity planning. I'll get to that later and show you some visuals on this. But we can literally show you all of our flavor offerings and tell you at any given time how many we can build, how many more we can support. So if you use a smaller flavor — a lot of people use the tiny flavor — we'll be able to tell you, hey, we can support 80 more of those right now. Or if we're out, it's zero. But we're transparent about that data again.

And self-service all the things: continue to focus on self-servicing everything. We're not perfect; we still require some hand-holding onboarding tenants, unfortunately, but we're really aiming to be self-service all the way in. If you want a project, we shouldn't have to be in that workflow at all. You should just be able to go to a form, sync your user, and boom, you have a project. It should be that simple.

So, a little further detail on this. Availability is key, and that means understanding your customers. Like I mentioned, we have a wide breadth of customers, from enterprise to very agile apps, if you will — real small, tiny little things, because they're very cloud-aware. But it's really, really key to understand your customer. Are they going to be okay with an LVM-target Cinder backend? If that falls offline, did that just tear apart their Galera cluster that's handling billing transactions? You need to know these little things. We really want to just provide infrastructure and not necessarily care, but in turn, you have to actually understand what your customer is using your platform for to deliver the right service. The operator must know if a resource is available — again, I'll hit on this a lot. You get the typical 500 "no valid host" error, right? Is that a real message, or is it something else?
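To make "knowing whether the resource is available" concrete, here's a minimal sketch — not our actual tooling, just an illustration with made-up numbers — of how an operator can compute how many more of each standardized flavor still fit, given per-host free resources and an oversubscription ratio such as Nova's cpu_allocation_ratio:

```python
# Rough sketch: how many more of each flavor can the cloud place right now?
# Host numbers would come from your inventory/metrics pipeline (Nova
# hypervisor stats, collectd, etc.); the values here are invented.

CPU_ALLOCATION_RATIO = 4.0  # e.g. a 4:1 vCPU oversubscription

hosts = [
    # physical cores, vCPUs already committed, free RAM (MB), free disk (GB)
    {"pcpus": 40, "vcpus_used": 96,  "ram_free_mb": 180_000, "disk_free_gb": 1_200},
    {"pcpus": 40, "vcpus_used": 150, "ram_free_mb": 64_000,  "disk_free_gb": 300},
]

flavors = {
    "small":  {"vcpus": 1, "ram_mb": 2_048, "disk_gb": 20},
    "medium": {"vcpus": 2, "ram_mb": 4_096, "disk_gb": 40},
    "large":  {"vcpus": 4, "ram_mb": 8_192, "disk_gb": 80},
}

def capacity(flavor):
    """Total number of this flavor that still fits, summed across hosts."""
    total = 0
    for h in hosts:
        vcpus_free = h["pcpus"] * CPU_ALLOCATION_RATIO - h["vcpus_used"]
        fits = min(
            vcpus_free // flavor["vcpus"],
            h["ram_free_mb"] // flavor["ram_mb"],
            h["disk_free_gb"] // flavor["disk_gb"],
        )
        total += max(0, int(fits))
    return total

for name, f in flavors.items():
    print(f"{name}: can still place {capacity(f)}")
```

That kind of arithmetic is exactly what lets you tell genuine capacity exhaustion apart from a scheduler or messaging hiccup.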
So if you know it, and you're clear and transparent about the amount of capacity you have, then when you get a 500 "no valid host found" problem — which is very, very common in our environment — you know immediately, okay, this is probably actually a RabbitMQ problem versus a capacity problem. Actual, real data helps get you down the right path quickly. And real-time data plus history for comparison is your baseline. Now that we have some real production customers, it's very important, again, to give them real data. Are their apps performing the same today as they were a week ago? Are they getting hammered harder, or less? And all the metrics — just simple data: disk, network. Are they consuming more of that host, too much of that host? And then be able to empower them to actually do something about it — that's a whole other step, too.

So, visibility into hypervisors and instances — you see, I hammer on this a lot. Where does the problem exist, hypervisor or instance? In Amazon, you don't know. In a private cloud, again, we like to show you exactly, to empower you to be self-service and be able to tell: is that host overloaded? There may be 30 instances on the hypervisor you landed on, right? Do you care? Generally not. Am I slow? You feel like you're slow? We can tell you: just go to the tool, and it can show exactly whether that host is overloaded or whether my instance flavor is overloaded. Do I just need to resize and go bigger? Do I need to resize and go smaller? Or do I need to actually migrate my workload to another hypervisor completely? Of course, we want to automate that kind of thing eventually, so customers don't have to worry, or provide some sort of equivalent to VMware DRS — people are familiar with that — and we'll get there. But for now, at least, we give them the data and the tools so they can do something about it. And again, with an instance generating IOPS load — I'll show a very cool visual on this later, too — you can literally tell exactly which instance is the noisy neighbor. There are pros and cons to that, right? It caused some internal politics, but it works.

So, manage infrastructure costs. Private cloud differs from the pay-per-use model of public cloud, of course — a really obvious statement, and I hit on this earlier — but you're generally very, very budget-constrained in a private cloud. So you have to be very cost-conscious and very transparent. Make sure they're not wasteful, and if they are wasteful, be empowered and have the data to make deterministic decisions, if you will, about, say, oversubscription. If you want to oversubscribe RAM, disk, or CPU, you'd better make sure as an operator that you can actually do that safely. The organization must collaborate to utilize resources efficiently — so again, visibility, visibility, visibility. Reclaim underutilized resources: this was a very, very big challenge with internal tooling and open-source tooling. We were trying to use collectd and Grafana to graph instances, but at a certain scale — we're over 1,000 instances on one of the smallest clouds — you can't graph every individual instance and tell from a chart who's actually using their stuff or not. I don't want to hammer on it too much, but the AppFormix reporting has been critical for us there.
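As a rough illustration of the kind of reclamation report we're talking about — this is not the AppFormix implementation, just a toy sketch over per-instance averages you might pull from collectd or any metrics store — the idea is to flag instances whose utilization over a window sits far below their flavor:

```python
# Toy reclamation report: flag instances that look oversized, based on
# samples averaged over some window (a week, 30 days, whatever you trend).
# Sample data is invented; real numbers would come from your metrics store.

samples = {
    # instance name: list of (cpu % of flavor, ram % of flavor) samples
    "billing-db-01":     [(72, 85), (80, 88), (75, 86)],
    "dev-playground-17": [(2, 11), (1, 10), (3, 12)],
    "ci-runner-04":      [(9, 22), (7, 25), (11, 20)],
}

CPU_FLOOR = 15   # below this average CPU %, the flavor looks oversized
RAM_FLOOR = 30

def avg(values):
    return sum(values) / len(values)

for name, points in samples.items():
    cpu = avg([c for c, _ in points])
    ram = avg([r for _, r in points])
    if cpu < CPU_FLOOR and ram < RAM_FLOOR:
        print(f"{name}: avg cpu {cpu:.0f}%, ram {ram:.0f}% -> candidate to resize down")
    else:
        print(f"{name}: avg cpu {cpu:.0f}%, ram {ram:.0f}% -> leave alone")
```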
We can do a report for any given timeframe and empower users to generate those on their own. As a project admin, if you will, you can come in and say, hey, I have 60 instances in my project — who the heck is making all these things? Are they actually using them? Generate that report, see whether there's actually any CPU, RAM, or disk usage on any of them, or all of them, and do something about it.

This last one: does a user require a specific flavor size for an instance? I'll touch on this more real soon, but if you have to make a specific flavor size — if anyone here has actually done flavor-size management or created flavors — please, from my experience, make the Legos fit. Don't just make what they want. Don't make a 768 MB RAM flavor with a 15 GB disk, right? Try to make sure that all the flavors fit within each other, because it's real, real critical, from my perspective and even from Brian's experience, that that's how you can generate these cool dashboards of visibility and reliable capacity metrics, and be able to scale and use your hardware.

So, capacity planning: what is the utilization of the infrastructure? Where is it coming from — memory, disk, CPU? Again, these are just key points, things that I think are good questions you should think about. What is the utilization trend over time? Capacity information at a point in time is great, but the trend over time is real, real critical. That week view is going to look way different than an hour view, right? And you need to know: are you running out of capacity just today? Is this a normal ebb and flow? Does the business come in Monday, spin up, use all your capacity, and then they're cool anyway and just trail off, rinse and repeat? Do you really need to invest in more hardware, or can you work within the footprint you have? When will infrastructure resources be exhausted? This has been really interesting to track over time, too. I really try to use a good 30-day view for this kind of stuff, and it gives you visibility into patterns, ups and downs — do you really need to add on?

So, self-service — some things we've done for self-service for users. Recently, we've partnered with AppFormix to provide monitoring as a service for projects and instances. It's a complete, touchless, out-of-box experience: every new project we create and every user we onboard is onboarded into this application, which enables them to create their own alarms and event triggers without us touching anything additional, plus real-time monitoring of that data — and I'll show a couple of screenshots later. We expose underlying infrastructure data points; we're very clear. We show raw collectd metrics from our hypervisors — your basic CPU, RAM, and disk. We've even made some policy adjustments to show users where they actually live, so they can tell from their instance view what hypervisor they live on. We enable users to answer questions about their own resource utilization via the platform above. And then users can understand issues around hypervisor and storage health. So we not only have the data there, we also educate them to help them understand what the data means. If the hypervisor is overloaded, that's different from your instance being overloaded. These are kind of basic things, but a lot of people don't understand that. They say, "I'm at 100% CPU." Well, your instance is at 100% CPU — there's plenty of room left on the hypervisor, and if you just resize to a larger flavor and get more cores, you'll be totally fine; the infrastructure can handle it.
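That distinction — "my instance is pegged" versus "the host is pegged" — drives two very different actions. A simplified sketch of the logic (an illustration, not a real scheduler or the AppFormix product):

```python
# Simplified decision logic for the "I'm at 100% CPU" conversation.
# instance_cpu_pct: how busy the guest's own vCPUs are.
# host_cpu_pct: how busy the hypervisor underneath it is.

def advice(instance_cpu_pct: float, host_cpu_pct: float) -> str:
    if instance_cpu_pct >= 90 and host_cpu_pct < 70:
        # Guest is out of cores but the host has headroom: resize up in place.
        return "resize to a larger flavor; the hypervisor can absorb it"
    if host_cpu_pct >= 90:
        # The host itself is the bottleneck: moving is the only real fix.
        return "live-migrate (or ask the operator to) to a less loaded host"
    if instance_cpu_pct < 15:
        return "consider resizing down; you're holding capacity you don't use"
    return "healthy; no action needed"

print(advice(instance_cpu_pct=98, host_cpu_pct=40))   # resize up
print(advice(instance_cpu_pct=95, host_cpu_pct=96))   # migrate
```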
So again, transparency is key to empowering users and to scale. And then we have implemented a standardized network design — we'll touch on that a little later, too; that's also helped a lot.

So, a couple of challenges we'll cover. First, right-sizing instances. We started with custom flavors, and it just proved to be very unscalable. Every new project, every new tenant request that came in, they all wanted their custom everything. They were coming from VMware, they were coming from AWS, they had expectations — they wanted to literally copy and paste what they had in the old environment into the new environment. And we took that as-is, because we were trying to make people happy. So what happened was, over time, we ran out of capacity really, really quickly, and we went, hmm, what happened? Basically, we had inefficient workload placement: overallocation of CPU generally, but mostly RAM, and even huge disk sizes. Some customer would come in and say, hey, I want a 500 GB local ephemeral disk. Well, you can only fit four of those on a hypervisor. So what do you run out of first? Your disk. Even though it's all thin-provisioned, you're technically out. And at the time, we had no data to say whether we had actually overprovisioned or not, because they were telling us they needed that space, so we had to trust them.

To get away from that, we standardized on flavor sizes. We make all the Legos fit and avoid resource fragmentation: from the largest flavor all the way down to the smallest, all the small ones fit into the largest. That's basically the model we have. For instance — I don't know the exact math here — I believe we can fit 32 of our smallest flavor into one of our largest. So it fits together really well, which enables us to capacity plan really well.
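A minimal sketch of what "make the Legos fit" means in practice: every flavor is a power-of-two multiple of the smallest one, so any mix of small flavors packs cleanly into the space a large one would have taken. The specific sizes and names here are illustrative, not our actual catalog:

```python
# Illustrative "Lego" flavor ladder: each step doubles vCPU, RAM and disk,
# so the largest flavor is an exact multiple of every smaller one and hosts
# never end up with unusable slivers of RAM or disk.

BASE = {"vcpus": 1, "ram_mb": 2_048, "disk_gb": 20}

def ladder(steps):
    flavors = {}
    for i in range(steps):
        flavors[f"std.{2**i}x"] = {k: v * 2**i for k, v in BASE.items()}
    return flavors

flavors = ladder(6)            # std.1x .. std.32x
largest = flavors["std.32x"]
smallest = flavors["std.1x"]

# Sanity-check the nesting property: 32 of the smallest == 1 of the largest.
assert all(largest[k] == smallest[k] * 32 for k in BASE)

for name, f in flavors.items():
    print(name, f)
```

With a ladder like that, the capacity math from earlier stays honest: whatever is free on a host can always be expressed as "N more of flavor X" without fragmentation.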
And then users need data to choose the right size. So, to that point: here's our current capacity view. This is something our customers can see — it's completely self-service. This is an example of one of our clouds, with the resources available and the capacity in use. This is one of our brand-new ones, and you can tell it's not very used yet. But you can see very clearly, in the upper right, how much capacity we could provide at the moment I took this screenshot. This has been very, very helpful for our customers, and even for us as operators. Again, if you get the good old Nova 500 "no host found" and it looks like there's plenty of capacity, it must mean there's an actual real problem for you to go look into.

Here's some deeper information. This is a reporting mechanism we let our customers self-service. You can see that within this given project there are quite a few instances — maybe the majority of them. Look at the host CPU column and the instance CPU column: do you see anything that's actually really, really used? Probably not. So what does that tell me? It tells me this customer could probably resize their flavors down dramatically and get a lot more out of their existing project. We're not forcing our customers to do this, but we're empowering them, at least, to get the data, be cost-conscious, and be aware of what they're doing and how they're actually using their environment.

Next challenge: hypervisor health. We've had issues in the past with virtual memory thrashing, due to many different reasons. But is instance memory oversubscription good or bad? Is it disk blocking? Are we just exacerbating some local IOPS problem? CPU contention, disk I/O, latency issues. And tenants misusing software RAID over Cinder LVM — we've had experiences with customers trying to carry forward their old pets and doing software RAID over LVM targets, which was actually making the latency issues they were trying to address worse. So this is another example of a self-service view they have. It shows that, as a tenant user — not an admin — they can come in and look at a host; there's a host name in the upper left. You can see this box has 30 instances, right? And then they can see: is this host I live on actually healthy? They can self-service all of that data, from CPU usage — and you can flip a lot of these views around — to host disk response time for the local ephemeral storage, memory usage, and vCPUs allocated. So they may look at this and go, well, I don't like how overallocated this box is, because in the future I know my apps are going to need a lot more of the CPU power in my instances; so they can use this data to figure out how to migrate somewhere else, or work with us to get less oversubscribed. Do you want to talk about this one? Okay.

So the last one I have for today is this challenge of right-sizing the network. We initially allowed each project to request basically anything they wanted from a network perspective, mainly around subnetting — we have a lot of under-the-covers connectivity things too, but this talk is mostly about subnetting, so size and design — and we just had too much variation; everyone was unique, everyone had a different idea of what the network should or shouldn't look like. So we took that and standardized on an L2/L3 project design: basically a /26 subnet with a certain number of instances and a certain amount of disk — a standardized quota across the board, actually. We standardized IP project allocations — some people didn't like it, most people did — and we architected the whole underlay to simplify the design and provide better tenant isolation. This results in a better end-user experience, mainly around project onboarding. A lot of this was to enable people to use the platform quickly and easily; our old experience took them forever to get onboarded. I mean, we were talking two or three weeks of initial back-and-forth before we even created the project. Now we can turn a project request around in a day really easily, because a lot of this stuff is already pre-allocated.
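For the standardized onboarding, here's a sketch of what pre-allocating the network side of a project can look like with openstacksdk. The project name, CIDR, cloud name, and external network name are placeholders, and this is an illustration of the pattern rather than our actual onboarding automation:

```python
# Sketch: stamp out a standardized L2/L3 design for a new project --
# one network, one /26 subnet, one router uplinked to a provider network.
# All names and CIDRs below are placeholders, not Viasat's real allocations.
import openstack

conn = openstack.connect(cloud="mycloud")      # assumes a clouds.yaml entry

PROJECT = "team-foo"
CIDR = "10.20.30.0/26"                         # standardized /26 per project
EXTERNAL_NET = "provider-ext"                  # shared upstream network

net = conn.network.create_network(name=f"{PROJECT}-net")
subnet = conn.network.create_subnet(
    network_id=net.id,
    name=f"{PROJECT}-subnet",
    ip_version=4,
    cidr=CIDR,
)
ext = conn.network.find_network(EXTERNAL_NET)
router = conn.network.create_router(
    name=f"{PROJECT}-router",
    external_gateway_info={"network_id": ext.id},
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
print(f"Project {PROJECT} networking ready on {CIDR}")
```

Because every project gets the same shape, the only per-project decision left is which pre-allocated /26 to hand out.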
So that's all the information I have for today. Any questions?

The business model for Viasat — why do you need OpenStack for everything? Is that the question, basically? So the company as a whole is basically porting over, transitioning to using OpenStack projects for all applications. That's what I think makes this unique: everything from NFV, like you mentioned — I can't get into too-specific stuff right now — all the way up to, I want to say, Exchange-based type applications: corporate-facing, database-facing, billing, revenue, all of the above. We're going to do that by going multi-site — we're all single-site right now — becoming completely geo-redundant, multi-site, a true private-cloud operator, and then there will be places where public cloud makes sense too.

Yes, a lot. Again, it varies wildly, yes, but a lot of it is because they are dev and test, and they may run prod somewhere else, if you will. So a lot of it is very bursty, and then they just sit there idle. As you could see, that report was very, very idle right across the board. That allows us to oversubscribe certain areas, like CPU, really heavily. I mean, 4:1 is actually pretty mild; we could probably triple that quite easily, to be quite frank, and we'll probably get there. A lot of the reason, too, has been that historically they were not empowered with any data: they were provided, say, a full bare-metal box, and they were using that as a reference coming into OpenStack, which creates a false economy. They had nothing else to go on, so they thought they needed 16 CPUs, for instance, right? And we go, well, that's not going to work, try 8 — and then they go to 8 and they're barely using a tenth of that power. But they had no idea until now, which is sad but true. So now that we provide the platform turnkey, with that monitoring as a service built in, they can get that visibility and start actually designing better apps, better resource utilization, and see whether they actually need to scale up or scale out, those kinds of things. So, hope that helps.

Hi, I have two questions. First question: you showed hypervisor monitoring, and then you mentioned the underlay. Do you do underlay monitoring as well?

Not with our current AppFormix tooling. It's on our roadmap — starting next week, actually, if you will.

I mean, have you had any congestion scenarios that required some kind of underlay monitoring?

Let me ask for clarity: are you talking about the OpenStack control plane underlay, or the networking underlay? The networking underlay? Yeah, we've been very lucky. We've had over a year of solid uptime with our networking underlay and have had no real hard problems. A lot of that is because we still do everything there pretty much by hand, unfortunately — it's all VLAN-based, because it's Linux Bridge-based — but it's been really, really stable. But we do need visibility, and we need some kind of AppFormix-level view of our real-time data streams, because as these applications actually start to ramp up and you get more prod applications, we need to keep track and make sure we're not saturating ports. Are we starting to get dropped packets somewhere? We lack that visibility, which is just not okay.

My second question is about your orchestration. What tool did you use? How did you...

So, for those that don't know, Rackspace leverages — and we leverage — the OpenStack-Ansible project. That's our orchestration engine and how we deploy and manage OpenStack today. Our environments are a minimum of three control nodes, and, for those that don't know, the control plane lives within LXC containers running all on top of Ubuntu.

So it's on your own network, right? It's not hosted at Rackspace? Yeah, we are completely private, isolated in our own data center. Okay, thank you.

Hi, so I've got a network latency question. You've got an application that, probably more — certainly more so than most OpenStack users — has a lot of latency in the network just by the laws of physics.
So my question is, I'm curious whether there are any unique architectural elements that you have to take into account as you're deploying OpenStack and running it — well, I guess that's more about deployment — to account for and allow for that latency so that the customer experience is preserved.

We don't face any current latency problems — we don't have any hard latency challenges yet. If you're referring to the satellite stuff, there's nothing we've run across yet in OpenStack that cares about that kind of data stream. A lot of that is taken care of before it even gets to us, mostly in the hardware. So we're kind of insulated from that, thankfully. But I envision in the future something will happen; I'm just not sure what yet. So, again, we are looking at potentially getting away from Linux Bridge and actually going to an SDN solution, to hopefully get in front of that problem, too. So, yep. Anything else? Cool. Well, thank you for your time, everyone.