So, good afternoon, everyone. Before the session I'd like to start with a question: how many of you are operators, or working on OpenStack operations? Just raise your hand. Oh, so many. And how many of you think you have survived for over one year? Okay, not that many. All right, so let's get started talking about how to survive in this wild, wild OpenStack world.

Good afternoon, ladies and gentlemen. I'm Joshua, from IBM Bluemix Private Cloud, which was previously IBM Blue Box, and I'm working on the operations side.

And hello, my name is Heo Fan. I'm a cloud architect in the IBM China Lab, currently focusing on Blue Box, which is Bluemix Private Cloud: the landing architecture, implementation, and also operations.

All right. So before this February, I had been working as a DevOps engineer for years, with zero operations experience. But this year we planned to land IBM Bluemix Private Cloud in China to support IBM's private cloud business there. So I started as an operator, and after eight months I am still alive, and doing well enough to stand in front of you and share this very interesting journey.

Let's start with a little bit of background. As you may know, Bluemix Private Cloud, which used to be IBM Blue Box, is IBM's private-cloud-as-a-service offering based on OpenStack, which means IBM manages the cloud for the customers. This year we landed this service in China in order to support IBM's cloud business in that country. Before the landing we had no operations team in China, so we spent several months building a new OpenStack operations team from scratch. That is why we are here to tell this story.

So what we are going to talk about today will not be that technical; we will be talking about various aspects of how to run an OpenStack operations team, as you can read from the agenda. It's a bit of a framework, consisting of methods and practices. We will also spend some time on topics like upgrades, HA, and live migration.

Cloud computing is a cool thing, you will admit that. You thought it would work like that, but the real world is kind of cruel. Operating OpenStack is like fighting gigantic monsters, and there will be disaster after disaster. Just admit that. So it's not quite that cool, but it is very interesting, and you surely need a strong team to do it.

All right, so let's talk about how to define an OpenStack operations team first, step by step. Firstly, you need to know how your cloud offering is operated: whether it is public, private, or hybrid, and what the SLA between you and the customers is. You also need to figure out whether there are any other stakeholders, like business partners, data centers, and the backend development team. After figuring out these key business entities, you need to figure out the processes that connect them together so the whole thing flows. Then you need tools, because without tools the processes are just documents; you need tools to implement the processes. And the last and most important thing is teaming: you need people.
You need to see how to run your team in order to make the customers happy. So let's look into these four things one by one.

OK, so first comes the operating model. I remember that back at the very beginning of this landing effort, when I was discussing how to build the new team for our China business with my colleague in Seattle, he gave me a very high-level diagram, just like the one I'm showing here, to sort out the key business entities in the offering. It turned out to be very, very useful. I think the three most important things are the customer, our operations team, and the SLA. The customer basically uses our offering, we operate it, and the offering should comply with the SLA. There will also be support entry points that customers can use to raise their requests or issues to us, and these entry points are routed to the operations team. You also need to figure out whether there are any business partners, third-party data centers, or a development team, because you will be working with them to resolve longer-running issues.

All right, so this is the first step, and let's come to the next one. Sorry, just an additional comment: in many cases you will also have security and compliance in this diagram, but we are not showing it here, just a footnote. Thank you.

So after knowing all of these essentials, it's time to figure out how to make the whole thing flow, which is about process. I'm not listing all of the processes here, just the ones I think are most important. The first one is the operations tiers, which is mainly about roles and responsibilities. We define the different tiers by skill level and by how much control they have over the production systems. Basically we use this model: in front there is a tier-one team called Support. The support team stands in the first line of defense, faces the customers directly, and tries to resolve issues; if they cannot, they escalate to tier two, which is the operations team. The operations team accepts escalations from tier one and, in addition, is responsible for cloud deployment, upgrades, and administration. And there are also tier-three teams, like OpenStack engineering or network engineering, who actually build the product. This is just a reference model.

Okay, the second thing is quite simple: the escalation flows define how your tickets, alerts, and incidents move between the different tiers or teams. It's quite similar to the escalation policies you can see in PagerDuty. The thing can get quite complex, so you need to define the flows so that when an issue happens you know where it goes.

The next thing is incident management. Incident management is what you do when an unplanned interruption or a reduction in the quality of service happens. You need to define the key properties of an incident, like the priority levels, how incidents are detected and matched to outages using your tools, how frequently you need to update the customers, and in which way, whether through the customer's ticket or through a third-party service like StatusPage.io. By the way, the response times and the other values listed here are only for reference; they will be determined by your own stack and your own SLA, so these are just generic values.
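To make the idea of priority levels and escalation flows a bit more concrete, here is a minimal sketch in Python. The priority names, response targets, and tier names are invented placeholders for illustration, not the actual values from this offering's SLA.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Illustrative only: the priority names, response targets, and tier names
# below are invented placeholders, not the SLA values used by this offering.
@dataclass
class PriorityPolicy:
    description: str
    first_response: timedelta      # how quickly someone must acknowledge
    update_interval: timedelta     # how often to update the customer
    escalation_path: list          # which tiers the incident moves through

POLICIES = {
    "P1": PriorityPolicy("critical outage", timedelta(minutes=15),
                         timedelta(minutes=30),
                         ["support", "operations", "engineering"]),
    "P2": PriorityPolicy("degraded service", timedelta(hours=1),
                         timedelta(hours=4), ["support", "operations"]),
    "P3": PriorityPolicy("general request", timedelta(hours=8),
                         timedelta(days=1), ["support"]),
}

def next_tier(priority: str, current_tier: str) -> Optional[str]:
    """Return the next team in the escalation flow, or None if already at the top."""
    path = POLICIES[priority].escalation_path
    idx = path.index(current_tier)
    return path[idx + 1] if idx + 1 < len(path) else None

print(next_tier("P1", "support"))   # -> "operations"
```

The point is simply that once the properties of an incident are written down like this, both the tooling and the people on shift can answer "how fast" and "who is next" without guessing.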
Yeah, so the next one is change management. Changes happen just about every day and you have to manage them. First, you need to define the different types of changes, including normal changes like a cloud upgrade or implementing CPU oversubscription for a customer, and also urgent changes, like when a customer hits a capacity limit and we need to expand the cloud for them very, very quickly. You also need to describe how a change will be rolled out; in your change management that is something like a MOP, a method of procedure. Another important thing is when the change will be rolled out: what the size of the maintenance window is, when the window happens, and also the lead time between notifying the customer and rolling the change out. You will surely need review and approval, to make sure the key people are notified and agree on the change. And the last thing is still communication: you need to let the customer know what is going to happen and when.

All right, the next thing is about shifts, so it's about people. You need to define how your human resources are utilized for the best coverage and a well-maintained, well-spread sharing of the pain, which is quite important. You also need to define how your shifts are handed off.

The last thing is security. There will be several security activities, like health checks and patch reporting; you need to scan your network ports, and you also need to verify the accesses into your production systems from time to time.

All right, that is all for process, so the next thing is tooling. There are hundreds or thousands of tools to support operations, but to keep this chart clean I have tried my best to group them into six categories. Six is a good number. The first is monitoring: I have grouped monitoring, alerting, log aggregation, and dashboards into one. There are some very famous tools like Sensu, Nagios, and the ELK stack, and dashboards like Uchiwa and Kibana, so you can choose from them or build your own. The next one is collaboration, and collaboration here means internal team collaboration. This includes chat, so you can just use Slack or build your own XMPP-based chat system, which is quite easy. It also includes file sharing, and that means not only sharing your documents online but also supporting online editing, so that people can work together more easily. The third thing is project management. Our operations team will sometimes do some development work, so you need a place to track those efforts; things like Trello will help you with that. And then there is shift management.

All right, the next category is cloud management. This is a topic that ranges very wide, but I think there are some key components, like the CMDB, which is used to store, read, and update your cloud configuration when you are building a new cloud or updating existing clouds. The next one is asset management, which is mainly about seeing how your customers are using the cloud; it's kind of like a capacity report. And you also need systems to cover change management and incident management.
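As a small illustration of the change-window and lead-time rules mentioned above, here is a hedged sketch in Python. The lead times, window sizes, and field names are made-up placeholders, not the team's real change policy.

```python
from datetime import datetime, timedelta

# Hypothetical policy values for illustration; these are not the real lead
# times or window sizes used by the team in the talk.
LEAD_TIME = {"normal": timedelta(days=5), "urgent": timedelta(hours=4)}
MAX_WINDOW = {"normal": timedelta(hours=4), "urgent": timedelta(hours=2)}

def validate_change(change_type, notified_at, window_start, window_end, approvers):
    """Return the list of problems that should block this change from rolling out."""
    problems = []
    if window_start - notified_at < LEAD_TIME[change_type]:
        problems.append("customer was not notified far enough in advance")
    if window_end - window_start > MAX_WINDOW[change_type]:
        problems.append("maintenance window is larger than allowed")
    if not approvers:
        problems.append("no reviewer or approver recorded for this change")
    return problems

print(validate_change(
    "normal",
    notified_at=datetime(2017, 5, 1, 9, 0),
    window_start=datetime(2017, 5, 8, 2, 0),
    window_end=datetime(2017, 5, 8, 5, 0),
    approvers=["ops lead"],
))   # an empty list means the change is clear to schedule
```

Encoding the lead time, window size, and approval requirement as checks like this is one way to make sure every change request is measured against the same rules before it reaches a customer.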
All right, the next category is the knowledge base, which has two parts. The first is the internal knowledge base, used by your operators to know how to operate the clouds, and the second is the external one, for the customers to get to know your product better.

Next is security. Security includes access management, so authentication and authorization. You also need something like a Nessus scanner to make sure your network is secure, and things like patching tools or server configuration checking tools to make sure your cloud is safe in terms of operating systems and software configuration. It's worth noting that in our Bluemix Private Cloud we have integrated this health checking and patch reporting into our deployment as a playbook, so if you are interested you can check it out on GitHub, in Ursula, to see what we are doing with this. It's quite neat.

The next category is the biggest one, and it's about customer support. You surely need a ticketing system to allow the customers to raise tickets and ask us to resolve them, and also customer chat, so the customer can get a quick response from you, maybe within 90 seconds, which is our SLA. You also need something like Nicereply to get feedback from the customers, so you know how your team is doing. And the last category is all about communication: you need tools to communicate things like cloud-level or site-level maintenance events to your customers, for example when your cloud is down or there are network interruptions in some data centers. All right, so that is all about tooling.

The next thing is teaming. I think teaming is about the size of the team and how the team runs its shifts, and it will really be determined by your SLA and your service availability. Basically I think 24x7 is mostly a hard requirement nowadays, but in our case it was not: at the start we began with a 16x5 shift, because there were not that many active customers at the beginning, but now we have boosted the availability up to 24x7 as well. You also need to spread the pain between the team members: you don't want one person on duty for seven or maybe ten hours and the other people just two or three hours. And you need to consider how to eliminate interruptions as much as possible, because interruptions kill people's efficiency while they are working.

Here is just a quick example. This is a common model where everyone at work is on call. We started with this, but we found this kind of model cannot coordinate the operations tasks very well: people need to negotiate with each other about who takes which task, and they often step on each other's feet. So we have now switched to a new model, something like a three-layer model. There is a person in charge at the front: during his in-charge hours, maybe one or two hours or maybe thirty minutes, he is responsible for acknowledging all the incoming alerts, tickets, and chats. But his job is not to resolve everything; his job is to distribute these tasks in the best way, so that the people at work can handle them, and the people at work can keep focusing on their long-running tasks until they are interrupted by the person in charge. So this is just one example, nothing more.
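Here is a minimal sketch of that three-layer idea, assuming a single person in charge dispatching to a pool of teammates. The names and the "least loaded" assignment rule are invented for illustration; they are not how the team's real rotation works.

```python
import heapq
from collections import defaultdict

# A minimal sketch of the three-layer idea: one person in charge acknowledges
# everything and hands work to the least-loaded teammate, so the others can
# stay focused. Names and the "least loaded" rule are invented for illustration.
class Dispatcher:
    def __init__(self, workers):
        # heap of (open_items, worker) so the least busy person is always on top
        self._load = [(0, w) for w in workers]
        heapq.heapify(self._load)
        self.assignments = defaultdict(list)

    def acknowledge(self, item):
        """Acknowledge an incoming alert/ticket/chat and assign it to someone."""
        load, worker = heapq.heappop(self._load)
        self.assignments[worker].append(item)
        heapq.heappush(self._load, (load + 1, worker))
        return worker

dispatcher = Dispatcher(["alice", "bob", "carol"])
for item in ["API latency alert", "customer ticket #42", "disk full on compute-03"]:
    print(item, "->", dispatcher.acknowledge(item))
```

The design point is that interruptions stop at the dispatcher: everyone else only sees the work assigned to them, which is exactly the "spread the pain, protect the focus" trade-off described above.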
The next part is about tool integrations. This is a very big topic, but I'm not going to go too deep or get too concrete here; I just want to briefly introduce what we have done, what we are doing, and our approach. I think many of you work with a lot of screens every day, which looks very, very cool, but it also means you are working with too many systems and jumping from one system to another very, very frequently, which again means a lot of interruptions. So it would be nice if you could use your tools, some automation, or some development effort to kill all these interruptions. It would be very cool if you could stay in one place and, with one click, get everything done.

You can just look at this diagram; it may be a little small, but it shows some typical systems an operator needs to work with for his daily job, like shift management tools, customer support, data center support, things like that. For some long-running tasks he will be jumping from one system to another, which costs a lot of time. So it is good to start by integrating all of these operations into one platform. And there is one, as you may know, a de facto platform called Slack, so you can simply develop some Slack bots or slash commands to do your tricks.

Here is an example. This is a lovely bot we are using in our team, and what it does is help us manage our in-charge schedule. With this bot you can easily see who is in charge now and who will be in charge next, and when you need to go away from the keyboard, for something urgent or maybe just for lunch, you can use one command to hand your slot over to another person, and when you come back you can take it back. This is only one quick sample, only one bot we use to support our daily jobs; we have other bots and Slack commands to support this. But the idea here is to allow an operator, without leaving his context, to deal with a lot of operations tasks. If you have further interest, there are actually already a lot of sessions and discussions around ChatOps; I remember there were sessions at previous summits, so you can check the YouTube channel and search for "chatops" to see the implementations. The daily life of operators is highly based on chat, so it is good if they can work in the same system they use when talking with their colleagues.

All right, the next page is the approach we are still working on. This is about killing the other interruptions: we want to automate the workflows across different platforms. This is mainly aimed at the long-running tasks, including cloud deployment, which may take four hours, and change tasks. I will just introduce the basic idea here. To do this, you need to break things down into several layers. At the bottom you need some generic common libraries, in which you encapsulate the tasks and operations against the different platforms; as you can see here, you can build your own libraries to interoperate with Slack, with OpenStack, or with PagerDuty. On top of these automated operations, you can then start building tasks at a higher level, things like jobs and pipelines, just like what we used to do with Jenkins: you build Jenkins jobs and then you use Jenkins build flows to organize them to support the long-running tasks.
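To illustrate that layering, here is a rough sketch under the assumption that each platform gets a small wrapper library and the wrappers are composed into jobs and pipelines. notify_slack() uses slack_sdk's real chat_postMessage call; deploy_cloud(), verify_cloud(), and update_cmdb() are hypothetical placeholders for your own tooling.

```python
from slack_sdk import WebClient

slack = WebClient(token="xoxb-...")          # placeholder token

# --- low-level, per-platform wrappers ---------------------------------------
def notify_slack(text):
    # chat_postMessage is a real slack_sdk call; the channel name is illustrative
    slack.chat_postMessage(channel="#ops", text=text)

def deploy_cloud(cloud_name):
    raise NotImplementedError("hook up your deployment tooling here")

def verify_cloud(cloud_name):
    raise NotImplementedError("hook up your post-deploy checks here")

def update_cmdb(cloud_name, state):
    raise NotImplementedError("hook up your CMDB API here")

# --- a "job": one unit of work composed from the wrappers -------------------
def deploy_job(cloud_name):
    notify_slack(f"Starting deployment of {cloud_name}")
    deploy_cloud(cloud_name)
    update_cmdb(cloud_name, "deployed")

def verify_job(cloud_name):
    verify_cloud(cloud_name)
    notify_slack(f"{cloud_name} deployed and verified")

# --- a "pipeline": jobs chained to cover a long-running task ----------------
def expansion_pipeline(cloud_name):
    for job in (deploy_job, verify_job):
        job(cloud_name)
```

The shape mirrors the Jenkins comparison above: small reusable operations at the bottom, jobs in the middle, and pipelines on top for the multi-hour tasks.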
All right, I think that's it for me, and now my teammate Fan will be talking about some of the pitfalls that will, boom, just happen, as you may know.

All right, thanks, Joshua. So Joshua has already talked a lot about the first of the two keywords we have for this session, which is a quick ramp-up of OpenStack operations; the second is survival. This second part is more technical, more about the survival part. It just picks up some typical scenarios that need the OpenStack operations team's attention; they might be traps, they might be easy to miss, so they are things that need to be dealt with carefully.

Okay, so the first one is change management. As Joshua mentioned, for the cloud we have the CMDB to store the cloud information: the current state, maybe also access information, and the metadata you use to deploy and upgrade. This is basically the same as the approach known as infrastructure as code. Different teams have different approaches: some do not actually use a database, they use GitHub, either private or self-hosted, to check in the cloud changes, and you can also integrate with Gerrit or other review tooling. Basically it's like the picture below: a change comes in, you open a pull request for it, you verify and review it, and after that is done it gets merged into master. But the one interesting thing here is that for code you can have as many branches as you want, while for a cloud the state can only be one; it's like the master branch in the Git repo.

So basically we have incoming change requests from customers, for example they want to do an expansion, or they want some particular change like adjusting overcommit, things like that; and we also have internal enhancements to be rolled out, either upgrades or compliance requests to fix security issues. The problem here is managing the priorities and dependencies, because if you have multiple changes ongoing and they all touch the CMDB, it can be a disaster: you get conflicts, or the operations are executed in an inappropriate order, and then you need much more effort to mitigate or to recover from the bad state of the cloud. If I remember correctly, once or twice we accidentally reverted a customer's availability zone settings because of a conflict we did not notice. So to have consistency in the cloud state management, you need careful planning.

The second thing is the OpenStack upgrade. Actually this is another big topic; there are hundreds of sessions about it. Generally, to have minimal impact, which means a less disruptive upgrade, there are certain prerequisites or requirements on the deployment automation. The two basic ones are: first, you have the cloud configuration under control, and second, you have idempotency, which means you can keep re-running the upgrade and it will reach a convergent state. The other part is the upgrade process design, which is basically the orchestration: how you define the upgrade process, how you execute it, and if something bad happens, how you roll back and recover. This diagram is just an example showing what you have on the controller nodes and what you have in the data plane. The reference here is actually a document written by the Blue Box guys, just kidding, and other people as well; you can just search for the keyword on Google.
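To illustrate the two prerequisites just mentioned, configuration control and idempotency, here is a small sketch of an ordered, re-runnable upgrade driver. The component order and the check/upgrade callables are placeholders for illustration; this is not the contents of the actual upgrade playbook.

```python
# A small sketch of the two prerequisites above: a controlled component order
# and idempotent, re-runnable steps that stop only when the component has
# converged. The order and the callables are illustrative placeholders, not
# the contents of the real upgrade playbook.
UPGRADE_ORDER = ["database", "message_queue", "keystone", "glance",
                 "nova_control", "neutron_control", "compute_nodes"]

def upgrade_component(name, is_converged, do_upgrade, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        if is_converged(name):
            print(f"{name}: already converged, nothing to do")
            return
        print(f"{name}: upgrading (attempt {attempt})")
        do_upgrade(name)
    if not is_converged(name):
        raise RuntimeError(f"{name} did not converge; stop here and consider rollback")

def run_upgrade(is_converged, do_upgrade):
    # A fixed order keeps the control plane consistent and makes it obvious
    # where to resume if a step fails partway through.
    for component in UPGRADE_ORDER:
        upgrade_component(component, is_converged, do_upgrade)
```

Because every step checks for convergence before doing anything, re-running the whole sequence after a failure is safe, which is the property that makes a less disruptive upgrade practical.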
That document is a very good reference on how to design the upgrade process and what other items to consider. You can also check out the Ursula code: there is a YAML file, I think called upgrade.yml, which is publicly visible, and in there you can see how we do the upgrade, the sequence in which we upgrade the different components.

Okay, high availability, which is almost my favorite topic. Today, no matter whether you set up and operate the cloud for yourself and your own organization or you provide it for your customers, we always have HA in the architecture of the cloud. We have HA everywhere: the infrastructure network, the OpenStack controllers, the routers at layer three, and all the services, the database, the message queue, they have HA everywhere. The goal is to eliminate single points of failure, of course, and HA also allows us to do things we cannot do without it, like non-disruptive upgrades, and we also benefit from it for load balancing. And the inherent availability is easy to calculate: the mean time to failure divided by the mean time to failure plus the recovery time, MTTF / (MTTF + MTTR).

But what I want to talk about is the dark side of HA, which is that when HA really fails, it is not easy to recover. Basically, you need to recover not just the service, but also the HA mechanism itself. As an example, I posted this screenshot here: you run a Neutron l3-agent-list-hosting-router, you see two agents, both alive, and the HA state of both is active. It's a typical split brain for the L3 agents. And now what should we do? Do we restart the agents, or do we delete the router and recreate it? I can only say good luck. We can deal with this L3 agent or this router, but the most important thing is to recover the HA mechanism underneath that supports this high availability. For example, you have different agents creating and maintaining the two copies of the router, maybe on different network nodes, and they communicate with each other using the VRRP protocol, maybe over multicast, through a daemon like keepalived. If that heartbeat is not recovered, then no matter what quick fix you do on these nodes, the same symptom can happen again, which is something we don't want to see. So this complexity is something that needs to be dealt with carefully, and it will also impact the recovery time. One of the suggestions here is that, in the best case, you have a monitoring mechanism for the HA itself, and also, correspondingly, recovery automation to recover that part, so you don't stay in a bad state with respect to availability.

And live migration, this part is also interesting. In previous years we saw a lot of debate about what the appropriate scope of live migration is. So, if you try this nova live-migration, with the block migration parameter, does it work? For this particular question, the answer is yes: live migration works and it is stable. But it's not something we want to abuse in an operations context. On the right you can see the diagram, and I don't want to scare you, but this is actually the Nova live migration workflow, the view from the internal components and the code flow that make live migration actually happen. It's very complex.
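Coming back for a moment to the suggestion above about monitoring the HA mechanism itself, here is a minimal sketch of a split-brain check. The listing function is a hypothetical placeholder for however you query Neutron (API client, CLI output, and so on); only the checking rule is the point.

```python
# Sketch of monitoring the HA mechanism itself: flag routers that do not have
# exactly one L3 agent in the "active" VRRP state. list_l3_agents_for_router()
# is a hypothetical placeholder for however you query Neutron; the alerting
# and recovery actions you would attach to this are not shown.
def list_l3_agents_for_router(router_id):
    """Placeholder: return e.g. [{'host': 'net1', 'alive': True, 'ha_state': 'active'}, ...]."""
    raise NotImplementedError("wire this up to your Neutron client")

def check_router_ha(router_id):
    agents = list_l3_agents_for_router(router_id)
    active = [a["host"] for a in agents if a.get("ha_state") == "active"]
    if len(active) > 1:
        return f"SPLIT BRAIN: router {router_id} is active on {', '.join(active)}"
    if not active:
        return f"NO MASTER: router {router_id} has no active L3 agent"
    return f"ok: router {router_id} is active on {active[0]}"
```

A check like this, run periodically and wired to your alerting, is what lets you catch the broken heartbeat before the symptom described above comes back.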
Back to live migration: sometimes if we rely on it for day-to-day operations, it can be a disaster. For example, live migration can actually be automated, which means integrating it with something like PFA, predictive failure analysis: if the hardware shows warning signs, we can hook that up to a live migration, and it will give you more time to fix the hardware issue without fighting the SLA. But once things are hooked up with automation, you cannot control what is actually happening, like the other operations running on the cloud at the same time, and if you don't have enough capacity there will be chaos, with VMs just moving in and out, from here to there and back. So live migration is something we need and it is useful, but it is better kept to limited scenarios, typically just four of them: one is dealing with hardware replacement; the second is evacuating a node so you can do maintenance on that particular node; and there are others like customer-requested migrations and handling capacity changes, things like that. If you want to use automation with it, better do so with caution. And also, integrated pre-migration and post-migration validation is necessary for this to be successful.

Yeah, and I think another point is that the operations team, or the whole team, is also responsible for educating the customers on how to use the cloud in an appropriate way. I'm not sure if your customers are like this, but our customers may not build HA into their applications, so when maintenance comes and we need to shut down a machine, they just jump up and complain. So I think it's also very important for us to talk more with the customers and educate them so they make the best use of OpenStack.

OK, I think we have reached the last slide: advertising time. We have the open cloud sessions running in Room 116, and it's the right time to catch the last few, so welcome to join, it's just a few steps away. Yeah, so that's it for today. If you have any questions, you can come to the microphone and ask; we only have one microphone. Hopefully everything is clear. Thank you. Thank you, guys. Thank you for your time.