So let's get started. Quick intros — thanks everyone for showing up. I wanted this to be more of a two-way conversation, so we're gonna go through the deck fairly quickly, 20 minutes, maybe 30 minutes tops, and then spend some time doing Q&A and having a conversation. So I'm Chino Sahu, I'm an IT architect at IBM based at RTP, and I'll let Bill introduce himself. Hi, I'm Bill Hankard. I'm based out of the Littleton, Mass. lab, so I'm a Boston native, and I work with Chino. So, what you can expect from the session: you'll get some more details on the context of what we do within IBM. It's basically our internal cloud — our internal way of enabling the development teams that are actually building and delivering products, services, and offerings, making sure they have what they need to do that in an effective, efficient, cost-effective manner. What we want to share today is what sort of challenges we had while operating this global hybrid multi-cloud service, and how we addressed them. Just a little bit of context: around 2015 or so we had a mission to enable our development teams, with two fundamental business objectives. One, ensure they're able to seamlessly move workloads and operate and build their software in both an on-premise cloud and in our public cloud, which is SoftLayer. And two, allow them to deliver continuously, in a DevOps-type way. On that second topic, we try to enable the developers so we kind of get IT out of the way, and that's where we found OpenStack very beneficial for us. They're self-enabled, they have the resources, they can go forward and be productive. Yep, it's a great point, Bill. Very much self-service, API-driven — that's definitely the key. So this is a busy slide, but it shows some of the technologies and tools that are involved.
We're gonna go into a little more detail, but you'll see OpenStack on the top left. We're leveraging Jenkins and Grafana, tied into Slack and GitHub and a couple of other corporate tools that our company uses as well. What we ended up trying to do is not just OpenStack — OpenStack is the base, but we think of it as OpenStack-plus as a service. Listed here are all the things we've built around and on top of OpenStack. We wanted to make sure there's a consistent user experience. We've built automated image replication; an engine for users to build their custom images and have them pushed out to each cloud; custom DNS registration; hybrid networking to and from our public cloud; automated security compliance — we've got a lot of corporate security tools we have to integrate with, plus a corporate directory; multiple platform support — we've got a significant investment in Power, and we're enabling containers as well; automated continuous business need (CBN) and other similar business processes. These are the things we could focus on, because OpenStack solved a lot of the other things we didn't have to invest in. So the focus was building above the stack and around it to really impact our users. Here's another busy diagram — our high-level architecture. On the right you'll see our on-premise footprint. Today we're Fuel-managed OpenStack — open-source Fuel, Mirantis's open-source distro — using Spectrum Scale, otherwise known as GPFS, as storage, Open vSwitch for networking, and then the StackLight monitoring stack for monitoring availability, operations, et cetera. Just one point: Spectrum Scale is a clustered back-end file system that IBM's had for years, and it integrates fairly well into OpenStack.
In the bottom right are some of the things we've got to integrate with: corporate tooling, corporate ticketing, corporate authentication, our chargeback mechanism. On the left you'll see the footprint we have in SoftLayer. We've got a Bluemix dedicated environment that we're integrated with from a connectivity perspective — an instance dedicated to IBM internal, not our public Bluemix, though we do have cases where we go out to public Bluemix as well. I won't get into any more details there. So let's jump right into some of the challenges we faced while building and operating the service. Obviously, as with any cloud provider, you're gonna have costs; cost is one of the key constraints. We kept adding more complexity, which resulted in more cost. Our support costs went up as we onboarded more users. We have a global footprint — four major sites today, two in the US, one in Canada, and one in the UK — and we're gonna expand that even further. As with any cloud, you're gonna have cases where you're not fully utilizing all your resources. We hit that as well with our OpenStack-based solution, where teams and users just aren't being responsible. So how do we address that, and what do we do to help them be more responsible and optimize their use — get the most out of what they're doing at the lowest cost possible? And then I wanna speak a little bit about how we optimized enabling these teams. As Bill mentioned, IT was seen as a gate, a manual step in a process, for our developers. We wanted to eliminate that completely and put the power in their hands, making sure they're able to do what they need to do, but in a safe, controlled, efficient manner — without them crashing the car, per se. Anything you wanna add?
Yeah — as Chino said, the self-enablement of the end user was paramount. They have their own networking, they have their own images, they're basically in their own project doing it, and we're stepping back, just providing the infrastructure for them. That was a key component. Yep. So, rising costs from adding complexity: one of the key things we knew we needed to do was standardize. It's pretty obvious — the more you standardize, the fewer variations you have, and when everyone's working the same way, on the same mission, you gain efficiencies. But we've got heterogeneous data centers across the world. We knew standardizing at the hardware layer wasn't gonna be possible, wasn't something we could tackle, and that's where OpenStack really helped us: it let us standardize at that infrastructure-as-a-service layer. So we've got a recipe where we're leveraging the same storage technology, the same networking technology, the same configuration, and we've automated all of that, so we can literally stand up a cloud in a day or less and roll it out into production very quickly. That leads me to the last point here: we built a way to incrementally roll out features. One of the shifts we had with our user base was that they'd traditionally been used to the VMware-based model, where they would get a bunch of VMs, hold on to them, craft them into snowflakes, and trust that those VMs would always be there. When we shifted to this OpenStack-based service, the message we conveyed was: we're not gonna guarantee that your workload is gonna be up all the time, or that it's gonna exist. We could have a cloud failure, we could have a hypervisor failure, so we wanted to drive them to build their code and their processes around expecting failure. Our guarantee was that we would keep the OpenStack services up.
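The "same recipe everywhere" idea can be sketched as a simple conformance check. This is a hypothetical illustration, not the team's actual tooling — all names and values are made up:

```python
# Hypothetical sketch: verify a new site's config against the standard recipe,
# so every cloud stays "cookie cutter" and can be debugged the same way.

STANDARD_RECIPE = {
    "storage": "spectrum-scale",
    "networking": "open-vswitch",
    "monitoring": "stacklight",
    "deploy_tool": "fuel",
}

def recipe_drift(site_config: dict) -> dict:
    """Return {key: (expected, actual)} for every deviation from the recipe."""
    return {
        key: (expected, site_config.get(key))
        for key, expected in STANDARD_RECIPE.items()
        if site_config.get(key) != expected
    }

rtp = {"storage": "spectrum-scale", "networking": "open-vswitch",
       "monitoring": "stacklight", "deploy_tool": "fuel"}
uk = dict(rtp, networking="linux-bridge")  # a hypothetical drifted site

assert recipe_drift(rtp) == {}
assert recipe_drift(uk) == {"networking": ("open-vswitch", "linux-bridge")}
```

A check like this, run as part of automated deployment, is one way the "stand up a cloud in a day" claim becomes repeatable rather than heroic.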
There would always be an OpenStack cloud — whether it's the one in RTP, the one in the UK, or the one in Canada, they're always gonna have an OpenStack service — but your exact workload was not guaranteed to be up all the time. That really forced our users to redefine and look at their processes and build that failover mindset in. We do that as well when we're rolling out features: we expect things to fail, so when they do fail, we're able to recover quickly and minimize any impact. So, rising support costs: OpenStack is complex, and there are things that happen that are difficult to debug. One of the models we chose is, instead of purchasing formal support from a vendor, we chose to invest in people. We skilled up people across our teams to become OpenStack experts and self-support our service. And what we realized is that we built really strong technical people, and I think we resolved issues faster. Historically we've had vendor support for other offerings, and usually there's a lot of back and forth, a lot of overhead with that. So far we've found that having our own people support the offering works better. Collaboratively, throughout the IT team, we have various skills everywhere, and we found that really helped in this whole global model of deploying and supporting OpenStack. Having those diverse skills — Python, scripting, hardware, software, storage, networking technologies — really helped us immensely. Yep. And it's all about collaborating. We're in multiple Slack channels, constantly keeping each other abreast of what's going on, and because we have a global team, we're able to do this follow-the-sun model, where other teams pick up when the day ends here in the US, and vice versa.
Well-built runbooks — I can't emphasize that enough. That includes automation and well-documented steps on what your admins need to do in order to debug and address issues. I mentioned earlier that we leverage our clouds as failover instances. When something does go wrong, we've built in a way to quickly shift users to another cloud, or another region or availability zone, et cetera, so they're not impacted while we debug — because OpenStack, as all of you are probably well aware, is very sensitive and can go down at any point. And the standardization part was also key to being able to debug the other environments across the globe when we've had issues. If we know it's cookie-cutter across IBM, then it's easier for the teams to debug and fix problems when they arise — you know where to go and what to expect. So, usage governance: helping users be more responsible and giving them the tools to self-police and self-monitor. We invested in a ton of automation. CBN — continuous business need — making sure they still need their resources for the allotment they asked for; pruning and self-cleaning projects, cleaning out volumes, et cetera. That's something we rely on heavily. And we built a user-facing dashboard that shows them what they're consuming across all their clouds, both on-prem and in SoftLayer, our public cloud, and what they're being charged for — this is what you're spending. Simply having that data, understanding where their costs were, and the ability to model out "well, if I shifted 10% of my workload from public cloud to on-prem, this is how my bill would be impacted" — just having that information really drove a lot of their behavior, got them thinking about being more responsible and making the right decisions to optimize their cost.
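That "what if I shifted 10% on-prem" modeling the dashboard enables amounts to simple arithmetic. A minimal sketch — the rates, units, and function names here are illustrative assumptions, not the actual dashboard's internals:

```python
def monthly_bill(public_units: float, onprem_units: float,
                 public_rate: float, onprem_rate: float) -> float:
    """Total monthly charge across both clouds (units could be GB-RAM-hours)."""
    return public_units * public_rate + onprem_units * onprem_rate

def shift_savings(public_units: float, onprem_units: float,
                  public_rate: float, onprem_rate: float,
                  fraction: float) -> float:
    """Savings from moving `fraction` of public-cloud usage on-prem."""
    moved = public_units * fraction
    before = monthly_bill(public_units, onprem_units, public_rate, onprem_rate)
    after = monthly_bill(public_units - moved, onprem_units + moved,
                         public_rate, onprem_rate)
    return before - after

# Illustrative rates only: public at $0.14/unit, on-prem at $0.10/unit.
# Moving 10% of 10,000 public units saves 1,000 * (0.14 - 0.10) = $40/month.
savings = shift_savings(10_000, 5_000, 0.14, 0.10, 0.10)
```

Surfacing even a toy model like this to users is what turns a bill from a surprise into a lever they can pull.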
And in other cases, where you do need public cloud for some reason, just understanding what the trade-offs are and what they're paying for really helps drive some of that behavior. Then, really focusing on enabling these teams to be self-service, to be empowered to take on more of the IT-type tasks: education was definitely key. We have a really well-defined, automated onboarding process. We spent a lot of time consulting with these teams, showing them how to use OpenStack, how to use automation, how to develop infrastructure as code — enabling these coaches to teach other people on their team, and providing them tools. We've got a central Jenkins instance, and a lot of Ansible playbooks and Heat templates that they're able to take off the shelf, customize, and leverage on their own. And really, the shift from static to dynamic — that's the mantra we wanted to push: get out of this old model of static VMs and static workloads, and do things on demand, in a dynamic way. Do you wanna add? Yeah, I would say we've found the dynamic deployment of instances has gone up dramatically once a user realizes they don't need to keep an instance around forever and maintain it. A lot of the workload automation that runs out there is constantly being redeployed — every day, thousands of instances created and torn down. So we see a high volume in that area. So, the results of addressing these challenges: we can roll out a new cloud in half the time we used to, we can roll out new features in half the time we used to, and we're able to provide near-24x7 coverage — it's not full 24x7, but it's close — without having to hire third-shift employees.
We've enabled users to optimize their consumption patterns, again in a more self-service way, and we've reduced our overall TCO by 30%. There are definitely savings to be had because of OpenStack and all the things we've done with it. So that's basically it — let's spend some time on Q&A. I think we've got about 20 minutes. Just a reminder: if you do have a question, please use the microphones in the aisles. Any questions? Comments? Yes, I have a question. For your dashboarding of resource consumption, were you making use of the OpenStack Ceilometer project, or did you do something yourselves? We used a combination of Ceilometer and — Grafana, Kibana. Well, yeah, we used Grafana to display it, but it was a combination of Ceilometer and some of the data that comes out of the StackLight monitoring suite. Okay. We customized some of the Heka pipeline to pull things off its queues and transform the events into messages that we would then display in Grafana. Okay, thank you very much. So I'm assuming you do chargebacks to your internal customers? Yes, yes we do. So how do you handle them saying you're more expensive than the public cloud providers? What's your response to that? Do you have support from the top down saying this is the way to go, or can they use public clouds — can you talk about that experience? Our company direction is actually to consume public cloud as much as possible, in this hybrid way. Ironically, our public cloud is more expensive than our on-prem chargeback, so there's actually an incentive for teams to use our on-prem offering more than public. But there are definitely a lot of teams that want that hybrid setup, where they're doing their test and dev internally on-prem and hosting their public SaaS offerings externally.
So internally, basically all the development and test we do within OpenStack is in our labs, which are not on the internet. However, we promote to SoftLayer, which is internet-facing, and there's obviously a cost when you're internet-facing. So the model varies: as Chino said, there's the internal cost, and then there's the premium cost to be internet-facing because of the security, the firewalls, et cetera that go with that. And my other question was: how do you bill your internal customers? Same as public cloud, per CPU? Do you give them a flat fee? Can you talk about that? Yeah, it actually varies a bit site to site — we've got a couple of different models depending on the geography. In the U.S., it's all usage-based: RAM hours. Whatever you consume is what you get charged for, and that's billed out quarterly. So you can get a view of what you're gonna end up being charged, but the actual payment only happens four times a year. Whereas our European team is burden-pool-based, though they're shifting away from that. It's not usage-based; basically, this is what it costs to run their cloud and their services, and that gets split out evenly across all their consumers. But now that we're in this global team dynamic, we wanna shift everything to usage-based: pay for whatever you use. Thank you. So, you made an interesting comment about the count of dynamic instances going up. I'm curious how your thinking has shifted on the kind of metrics you're tracking, in order to reflect this new pattern you're endorsing. Yeah — I think, as the developers are finding, to go back to it: they're enabled, they have a project, they have their instances. We're not, quote, inhibiting them anymore.
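The usage-based chargeback described here — RAM hours consumed, billed quarterly — reduces to straightforward metering arithmetic. A minimal sketch; the record format and the $0.01/GB-hour rate are assumptions for illustration, not the actual billing system:

```python
from datetime import datetime

def ram_hours(records) -> float:
    """Sum GB-RAM-hours from (ram_gb, start, end) usage records."""
    total = 0.0
    for ram_gb, start, end in records:
        total += ram_gb * (end - start).total_seconds() / 3600.0
    return total

def quarterly_charge(records, rate_per_gb_hour: float) -> float:
    """Cost-recovery charge for the quarter's usage."""
    return ram_hours(records) * rate_per_gb_hour

records = [
    (4, datetime(2017, 1, 1), datetime(2017, 1, 11)),  # 4 GB for 10 days
    (8, datetime(2017, 2, 1), datetime(2017, 2, 2)),   # 8 GB for 1 day
]
# 4*240 + 8*24 = 1152 GB-hours; at a hypothetical $0.01/GB-hour, $11.52.
charge = quarterly_charge(records, 0.01)
```

Billing per RAM-hour rather than per VM is what makes short-lived, frequently redeployed instances cheap — which is exactly the behavior the team wanted to encourage.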
As they experiment and do more work — for example, we have a team that's using Kubernetes to do deployments, and they find it's a lot faster for them to build these sandboxes that the development teams use, something they can just pick up and throw away. And we capture that, I think, through StackLight and some of the metrics we're producing. But with OpenStack, we find there's a new level of thinking and creativity that the development teams are getting the hang of. Just to add on: we do track number of deployments — historically we've always tracked that. I don't know how relevant that is, to be honest, in this new mindset. Another metric we try to focus on is the lifecycle: something was created, when was it deleted, tracking that as one unit. How many of these recycle events happened, and what's the time in between? If the average is two weeks, something's not right — that's not the behavior we want to drive. And then, because the whole goal is to get as dense and as utilized as possible, we're also tracking how well teams are meeting their quotas: are they hitting 80% of the quota they've defined for their project, and how does that track over time? We're still playing around with the right metrics. I don't think we've figured it out — or that the industry has really figured out — the single right way to determine how efficient and well-utilized you are, other than how well you're using your resources. Yeah, two-part question: how did you calculate your TCO, and where specifically did you see the 30% reduction? So TCO is calculated based on infrastructure capital — actual data center, racks, servers, storage, network equipment — and then the people cost. And we were able to shrink our teams and support more capacity, more clouds, with the same number of people.
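The two lifecycle metrics described above — average time between create and delete, and quota utilization — can be sketched in a few lines. This is a hypothetical illustration of the idea, not the team's StackLight pipeline:

```python
from statistics import mean

def avg_lifetime_hours(events) -> float:
    """events: (created_at_hour, deleted_at_hour) pairs, one per instance.
    A short average lifetime suggests the dynamic, throwaway usage pattern
    the team wants; a long one (e.g. weeks) suggests snowflake VMs."""
    return mean(deleted - created for created, deleted in events)

def quota_utilization(used: float, quota: float) -> float:
    """Fraction of a project's requested quota actually in use;
    low values over time flag over-allocated projects for CBN pruning."""
    return used / quota

events = [(0, 12), (5, 17), (100, 148)]   # short-lived, redeployed workloads
assert avg_lifetime_hours(events) == 24   # mean of 12, 12, and 48 hours
assert quota_utilization(80, 100) == 0.8  # hitting 80% of requested quota
```

Tracking the pair together matters: short lifetimes with high quota utilization is the dense, dynamic pattern the mantra is pushing toward.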
So investing in more automation and changing how we did things reduced our overall cost of people and infrastructure. Essentially, in some cases where you used to have the legacy SAN team, network team, Intel team, and application team, that's one person sometimes — that full-stack engineer, where we consolidate all those skills in one person to deploy those clouds. Just a clarifying question: so the 30% reduction in TCO was largely OPEX-based and not CAPEX? It was a combination. You wanna talk about the CAPEX side? On the CAPEX — we depreciate that over, what, five years? Yeah, we just tried to go for smaller, more reusable servers, and we drove the cost down from that. Some of the networking equipment — going to more appliance-based networking — reduced some of that CAPEX cost. But the majority was OPEX. Thank you very much. So I had a couple of questions related to support, and what level of support you provide to the end users. In our environment now, we have some developers using public cloud, we have an internal private cloud, and then we have an environment we'll call classic — a traditional VMware environment where people have long-running machines and the infrastructure is highly available. As we've transitioned more into a cloud model, a challenge we're facing relates to compliance for patching and vulnerabilities. In the older environment, there was a lot of hand-holding; it was a managed service — they'd just go to a dashboard, and there were teams doing that for them. But as we move toward more self-service, the onus is on them, and initially in the private space, that didn't work out too well. There's more policy in place for the public side, so they're mandated more, and machines will get shut down and destroyed. Do you have that challenge, or any recommendations related to it?
So, I'm doing a talk on that tomorrow, about image compliance and how we use different tools to produce compliant images and help with those challenges. Yeah — we can talk after, if you like. But yes, it is a challenge in the cloud world to keep these images compliant, since they spin up so often. As we see the usage of all these DevOps, continuous delivery, and continuous integration tools out there, the developers are enabled, and they're building up these environments, doing their testing, spinning them up, tearing them down. But at some point, for the long-lived ones out there, we have to have that level of patching. If it's Windows, we have to have antivirus; for business purposes, they have to have certain password policies, et cetera. That's the challenge we found. I'm not saying we've solved it, but we've made a good step toward it. So do you have agents deployed that watch this stuff, or — ? In one of the cases, we deploy IBM Endpoint Manager, which used to be called BigFix, which will basically check the image for compliance and report back to our compliance engine. And when an instance gets deployed within our environment, we have a program called IT SAS, which is our IT security compliance program. The image gets scanned, it gets registered in the compliance tool, and the end user gets notified of that. Then whatever patches or APARs are needed get flagged, and the user gets notified via email. The endpoint manager is sort of the overseer, saying, "hey, wait a minute, we see this vulnerability that just came out — you may want to address this sooner rather than later." Yeah, so any image that a user picks up is compliant out of the gate. We maintain the images, and we've got a process that automatically keeps them up to date on a monthly basis.
Say the user wants to bring their own image. We've given them a process to upload it through Jenkins, which will scan it and patch it, then spit out an image that's just as good as any of the images we provide. So instead of importing their own image directly into their project, they import it through Jenkins, and it gets pushed out into their project, compliant out of the box. And then, as Bill mentioned, for ongoing compliance we scan behind the scenes on some cadence, to make sure there aren't vulnerable systems. When there are, users get notified and they've got a window to patch or fix them — or they have the option to automate that, to have our endpoint manager component do it automatically. So they can have it done every weekend, any time, or only when they're notified. One thing we're trying to implement is getting people to stop thinking about patching, and to redeploy their infrastructure once something becomes non-compliant. I have a feeling we're gonna run into some challenges with that. We debate back and forth: do we deploy an agent? I don't want an agent out there. "Hey, use Ansible to do it." All right, then I need Tower — and how does it scale? Historically we've had agents. There's a policy now that says, after two months, all instances instantiated off an image should be destroyed and rebuilt, and somebody's developing a Cloud Custodian-type tool for OpenStack. So I don't know what we're gonna do long-term, but the agent-versus-no-agent, let-people-manage-their-own-stuff question is something that has challenged us so far. I'd love to hear more during that session tomorrow. Thank you. Any other questions, comments? I think we've still got a few more minutes. Hey — you talked about calculating your TCO and collecting the money quarterly. How do you define your prices? It's all cost recovery.
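The "notified, then a window to patch or rebuild" policy described here is easy to express as a small decision rule. A hedged sketch — the 14-day window and function names are assumptions, not the actual policy's numbers:

```python
from datetime import datetime, timedelta

def patch_deadline(flagged_at: datetime, window_days: int = 14) -> datetime:
    """When a flagged vulnerability must be fixed (or the instance rebuilt).
    window_days is hypothetical; the real policy window may differ."""
    return flagged_at + timedelta(days=window_days)

def overdue(flagged_at: datetime, now: datetime, window_days: int = 14) -> bool:
    """True once the grace window has passed, i.e. the instance should be
    patched automatically or destroyed and redeployed from a clean image."""
    return now > patch_deadline(flagged_at, window_days)

flagged = datetime(2017, 5, 1)
assert not overdue(flagged, datetime(2017, 5, 10))  # still inside the window
assert overdue(flagged, datetime(2017, 5, 20))      # past the window
```

Whether the consequence of `overdue` is an automated patch or a forced rebuild is exactly the agent-versus-redeploy debate the questioner raised.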
Whatever it costs us to provide the service, that's what we end up charging our users. Is there a price list for clients by the sizes of the VMs, or — ? Yeah — it's RAM, CPU. I think we've actually broken it down to per-RAM-hour, because we realized that in any cloud, RAM is your constrained resource — or your most constrained resources, RAM and IPs. I think it was roughly 10 cents an hour or something. We try to break it down so it aligns with what it costs us to manage and run the offering. Thank you. Since it's an internal deployment, how do you gauge future demand? How do you know when you have to add capacity? You're obviously not just gonna keep building — some of the accounts will presumably just continually add. How do you measure that? Do you get clients to give you their yearly expected deployment, or are you just using historical data? It's historical data. We're monitoring the clouds, obviously, to see where the consumption is, and we add resources as needed. Is there a scientific model to it? No, I'd say no. It's looking at the utilization across the board. I don't tend to get too consumed with instance counts; I'm more concerned about the underlying hypervisors and what they're doing, how they're churning away — that's the barometer, basically. You get that through StackLight, the monitoring of the bare metal. And even though you try to do a good job planning, there are always gonna be cases where you need unplanned resources, and this is where the hybrid part of the solution actually works out. We'll try to cloud-burst into our public cloud and provide resources that way, or we shift — not just the resources, but the user — to another geo where there may be more capacity. This is where we leverage the multi-cloud approach for those unplanned needs.
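The capacity approach described — watching historical hypervisor utilization rather than instance counts, with no scientific model — can be caricatured as a simple threshold heuristic. This sketch is an assumption-laden illustration of that kind of rule, not the team's actual process; the 80% threshold and three-sample window are made up:

```python
def needs_capacity(utilization_history, threshold=0.8, window=3) -> bool:
    """Flag a site for expansion (or cloud-bursting) when the last `window`
    hypervisor-utilization samples all exceed `threshold` -- a plain
    historical-data heuristic, not a forecasting model."""
    recent = utilization_history[-window:]
    return len(recent) == window and all(u > threshold for u in recent)

# Sustained high utilization over the recent samples: time to add hardware
# or burst to public cloud / shift users to another geo.
assert needs_capacity([0.5, 0.7, 0.82, 0.85, 0.9])

# A single dip resets the signal -- transient spikes don't trigger buildout.
assert not needs_capacity([0.85, 0.9, 0.6])
```

Requiring several consecutive hot samples, rather than one, is a cheap way to avoid buying hardware for a spike that the hybrid burst path could have absorbed.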
Thank you. If you don't mind, two questions. Using a cost-recovery model, how often do you go back and revise your pricing structure? And second: as you've gone through this, have you used metrics, for either utilization or cost recovery, that you later determined to be useless and threw out — and if so, why? We have revised costs in the past. The more efficient we become, the less it costs us to run the service, and the lower the cost to the end user — because we're not trying to make a profit internally. Our role is to enable teams and do it at the lowest cost possible. So yes, we have lowered costs in the past. As far as a metric we threw away — I can't recall one we've given up on totally, to be honest. We try to look at as much data as possible. We used to track CPU, but it hasn't really been about CPU utilization for us. Yeah, I think the underlying infrastructure is where the metrics we look at really are: memory, CPU, network, some of the key components within the hypervisor. As for throwing away or looking at other metrics — there could be some things in that area, but typically we stick to the basics. We've got a couple more minutes — anything you wanna bring up? Just that we're using IBM Spectrum Scale for the back-end file storage for both our public and private internal clouds. It's a fairly robust back end — SAN storage, clustered, redundant, resilient. The thinking in doing that: I come from a VMware background, and I wanted that clustered file system on the back end so I can easily migrate instances to and from hosts. GPFS provides that for us, and we have a significant amount of GPFS skill within the teams. And just a quick plug: again, Bill is going to be doing a session tomorrow at the Sheraton. Yes — it's one of the late sessions, like 5:30.
So maybe it'll be a good pre-happy-hour stop. We'll go into the details of our custom image engine — it's all driven by Jenkins, and it's really cool. So come by; we'll be there too. All right, if there aren't any more questions, thanks for attending. Thank you. Have a good week. Thank you.