Hi, good afternoon. We'd like to thank you for coming. This talk is about building an operations team from the ground up at Comcast. We'd like to start with speaker introductions.

My name is Sheila Sabi. I am an OpenStack engineer, and I work on the operations team at Comcast.

Hi guys, I'm Rich. I also work on the operations team at Comcast, and I've been there for a little over a year now.

And my name is Megan Rosetti. I am on the OpenStack operations team at Walmart.

So we have a very, very quick quiz for you, especially after our introductions. We just want to know: which one of us doesn't seem to quite belong in this talk? Feel free to yell it out. So I did recently make a change, and I wanted full transparency to bring it up. We had put in for the talk, and I'm very proud of the work that we have done, and I think that this is really good information to have throughout the community. Thankfully, both companies agreed and decided to move forward. So we're going to just jump into it.

We've obviously gone through the speaker introductions, so our agenda today is to talk about things that worked at Comcast. I'm going to focus primarily on the positives, but also talk about some things that we found as roadblocks, things that didn't quite work.

Okay, so just to give you an overall view: the operations team is comprised of both junior and senior level engineers who have experience throughout all aspects of operations. The structure of an organizational team certainly can change from company to company, and I think it's critical to be able to take a step back and reevaluate every so often. Is the structure meeting the needs of the company?
Is it able to pivot quickly? And when you're looking at primarily interrupt-driven work, can you focus when you need to? The overall goal of the team has been to build and develop a team that is cross-trained in all areas, through continuous training and continuous improvement.

Some of our team responsibilities: we make sure that the cloud is in good health. So we do maintenances, upgrades, deployments, and customer onboarding. Our customers are internal customers, not external customers. They're internal engineering teams that are moving their applications to the cloud, so we have onboarding for them when we bring them onto the cloud, and that comes with a lot of training, phone calls, and education. There's ongoing customer support as well, plus reporting, logging, and monitoring.

Some of the workload that comes our way is the routine requests, like day-to-day ticketing. We have a Slack channel where we do day-to-day support, and then reporting. Then we also have break-fix maintenance and incident management, kind of the emergency things that come up, and after-hours support. We're on call 24 by 7.
We have a rotating schedule. And then we also have maintaining the overall health of the environments, and that is with maintenances, upgrades, and deployments when we have to expand and grow. And then there's tech debt, the kind of stuff that we know we have to work on but that sits in the backlog, and then project work as well, and automation.

So looking at the overall workload of an operations team, it's critically important to be able to prioritize that workload to make certain that the team is able to focus on the priorities at hand. Highest priority is going to be any type of emergency request. Not that anything ever goes wrong, but in case you have any type of outages or incidents. And then being able to triage priorities, and this is usually quite an ongoing rotation. Again, things change daily, weekly, monthly: being able to prioritize different tasks, different projects, and of course different targets as well. Sometimes this can change.

Also rotating responsibilities. I'm trying very hard not to have a single point of failure in which it's always the same person on call, or it's always the same person handling customer inquiries. So being able to rotate the on-call schedule, rotate through some of the customer support and maintenances, not having the same person always handling maintenances, but being able to have people across the team handle those different tasks.

We have always found that communication has been extremely critical, along with transparency across the team, really trying to eliminate the one-on-one conversations. By that I mean using a chat channel: instead of taking a question to one individual, posting it out across the team so that you try to eliminate those side conversations. If somebody on the team knows and is able to answer, they try and write in. If not, then you might go through several other iterations. You might also find that there's information lacking, and that might move in another direction. And
you may also find that you have five or six people who know exactly what you're talking about, and they can jump right in. You want to encourage that transparency as much as possible.

Focusing on the agile method: especially with operations, you have to be able to pivot very, very quickly. Any type of emergency request or large project, or, although it never happens, deadlines that get changed or moved up or reprioritized, you need to be able to move through that.

And then meetings. Meetings are always somewhat of a necessary evil. It's really about sticking to it: it starts at X time, it ends at X time, what's the agenda? And then, quite frankly, evaluating: is this meeting necessary? Does the whole team need to be involved? Do we have the people that we need there? Is it a meeting to discuss a meeting about a future meeting? Probably not necessary then. And really, really keeping that time to a minimum so that the team can focus on the priorities at hand.

Okay, so one of the biggest parts of building an ops team is obviously finding the people for that team, and I really love how we've been handling that process at Comcast. We keep the team involved in the entire hiring process from beginning to end. That way there's never a time when management just shows up one day and says, this is Todd, please make Todd fit into our team. It's always a group process.

Starting with resume screening, we take the resumes we've got and we all talk through them together, looking for key points that we want in team members. Then we take a handful of those resumes and schedule phone interviews with them. We might not have the entire team on a single phone interview, since that might be a little overwhelming, but we do have two or three team members on those phone interviews. Then we discuss them as a whole team to see if they move to the next step, which would be the in-person interview, and for those we do try at least to have the entire team. Now when we do phone interviews and in-person
we try to do a conference bridge sort of thing, because our team may or may not all be in the office at once. We have a lot of remote work time, which is really nice. And so we keep the team involved even if some people are in the office that day and some people are working from home. Even for in-person interviews, I think we've done Skype. We've had a person sitting at the table as just a face on a laptop, but they're still involved in the process whether they're in the office or working remotely. Keeping the team involved is a really good way to find the right people from very different angles. Somebody might have all the right qualifications on paper, but if they can't hold up in a group working environment, they might not be the right fit for us. And the entire team makes the decision on whether to hire a person or pass on them.

Once somebody is hired, we still keep the team involved in growing the new team member into a fully functioning member of our team. We try to split up the training of a new team member between different people's experience and expertise. And we try to keep even the most recent hires involved: even if they were just hired one or two months ago, they're still training the newest person, so that we have a constant influx of training as people are brought on. That helps people get a very wide range of knowledge in our cloud, as well as getting them good experience teaching that knowledge to other people.

We also try to make sure the training schedule is published, so everybody knows who will be handling which part of the training. If at any point something needs to be moved or rescheduled, it's much easier when that's visible across the team as well.

And it's a checklist, so we get to actually mark things off, and then we know how much development that person has made.

All right, so ongoing operational training. How do we keep all the operations engineers up to speed with what is going on?
We do brown bags. If we have somebody who is an expert in Puppet or Ansible, or somebody who just recently deployed Ceilometer and has had that as their personal project, then as soon as it goes into production, or right before it goes into production, we have brown bag sessions, and we encourage everybody from the team to come to the brown bag. If they can't make it, we'll publish the presentation or we'll have tons of documentation on it, so that people can actually keep up to speed with what's going on. We also like to keep at least two members of the team trained on a given topic in order to prevent a single point of failure.

PGD days, sorry, that's the acronym for personal growth days and personal development days. We have time allocated, depending on workload and priority, to work on stuff that we want to work on. We have people on our team that are working on Python, and we have people that are working on expanding their SELinux skills. We also have bi-weekly Python training for people that are interested and want to join. Another example: I had a brown bag session on how to do a commit. Everybody from the ops team came in, we found bugs, and then we went through the entire process from A to Z, and everybody put in a patch. So that was pretty cool.

All right. So we wanted to run through, as an ops team, again really focusing on the positive, what worked and what we have seen that has really given us the best value in the team. The overall priority has been to build a really cohesive team. Part of doing that has been through constant communication and a lot of transparency, really pulling the team in to have that input and that communication across the board.

So what we'd like to do is open it up to the audience for questions. What are you finding within your companies that you might be stumbling over? What questions do you have for us? Let's really have a good Q&A session.
And we do have a mic here that we can pass around. Go ahead.

So that determination is made, honestly, even by the team, because as the team has grown and developed, some areas might need more. So it depends. If it's broad-based and it seems to work for the entire team, then that's looked at. If it's something that's more specialized, then people who might be working in that particular area might attend and then bring that information back.

I'm sorry. How large is your environment, and how many engineers do you have in your support team?

We have 20-plus data centers, so it's pretty large. We started off with a handful of engineers and we've pretty much doubled in size since then. I don't even know. We have three different teams, but we're all kind of meshed into one team: operations, development, and engineering. So there are definitely fuzzy areas between all three teams, because we cross-train with each other. So I'd say it's grown tremendously.

Do you think that we could do this on a smaller scale?
Absolutely, I think so. Finding talent is also hard with OpenStack. There's definitely a steep learning curve, and that goes hand-in-hand with the training that we do. What we've found works for us is to find Linux system admins for the ops side, and people that are passionate about open source. That has gone a really long way for us, because they understand the basics, and then you start getting into OpenStack from there, and as time goes by you start learning everything.

So the team started small. Over the years they've definitely grown and expanded, and that's where some of these concepts came from: the team started small, found what worked, and those things carried over moving forward. Especially with the interview process, when the full team is involved, then the team is really on board for that person coming in, and you already have this vested interest in their success. That's something that isn't always found. When you come on board to a team and people weren't quite expecting you to be there, or they haven't met you before, sometimes that can carry over to feeling uncomfortable. So the full team is involved, the training schedule is out, everybody is on the same page, and we move forward from that. But definitely, large or small scale.

Yes, please.

Yes, in other areas, whether it be 24/7 network operations, support, or monitoring, how does it divide between the more OpenStack-specific skill sets within your operations team and maybe other operations teams within the organization?

Sure. So we do have, exactly as you said, a 24/7 network operations center, but that's really the only other internal resource that we consume in that regard. For everything OpenStack-specific, we definitely do the operations.

Can you elaborate a little bit on what that dividing line is between those two organizations? Because as a smaller-scale shop,
you know, it's important for us to clearly define that. Maybe talk about what works and what doesn't across that line.

So I guess our network operations monitor our servers like they would any server. It's largely whether it's up or down, or if some critical service on that server has stopped working. They'll get the alerts from that and then they'll contact us. It doesn't matter to them that they're OpenStack servers. They don't know what's running on the server. It's much more physical device monitoring for them. What we pay much closer attention to is the actual OpenStack services. We monitor how they're running, whether or not they're running, and if they go up or down.

It's also setting your SLAs. Your SLAs for OpenStack, for a cloud environment, generally are different from straight bare metal. One node going down is not necessarily a pageable event at 3 a.m. But it's establishing those SLAs out to your customer base as well. A lot of that ends up being up to the company and the team and how you build that out.

I want to know whether the operations team needs to contribute back to OpenStack, like reporting bugs or making blueprints.

Yes, sure. The operations team does contribute back to the community. Everybody has a strong point. We have people that speak different languages, so if they don't want to do a commit, or they're not ready to do a commit, they translate the Horizon dashboard into a different language, and that helps right there in contributing back to the community. We have three people from our team on the operations guide training team. Or not training team, sorry. It's like a specialty team.
Yes, and it's under docs. So three people on the team are there. Every two weeks we're in the meetings, and we help people file bugs. There's a development team also, so they also contribute code. We've had people work in different sections, like Neutron and Horizon. Generally you don't see that with an ops team, people actually contributing back to the community, but it means a lot to us, and so every single person does something when it comes to that. It keeps us very invested in OpenStack when your operations team is also members of the community.

Question for you: level one and level two help desk. How far does your training extend into that, to maybe ease your team workload?

So I think level one and level two can vary greatly across companies, so I'm going to go off of a little bit of an assumption. Certainly let me know if I'm incorrect. Usually level one is eyes on glass. And here's where it gets a little murky: you can do a level 1.5. Then level two is usually your ops, maybe more junior for level two, and then level three is typically your more senior. As far as the current setup, as Richard was speaking about earlier, the NOC is more of a level one, and then the ops team is more of the level two, and it's rotated through so it's not the same person.

When you bring new team members onto your team, is part of your training also working with your level one?

Not currently. We provide documentation to them.
That's kind of where I was going, with the documentation, so that as you find new ways to solve it, it goes further down from you, I guess.

Hi, I'm hoping this is in scope for this talk, but can you give us an overview of some of the tools that the operations team uses to monitor the physical infrastructure, the OpenStack infrastructure, and maybe the workloads running on top?

Sure. I don't know exactly how much I can say, so I'll wait until that guy starts shooting daggers at me. So we do have Nagios-based hardware monitoring, and it also monitors the OpenStack services running on the servers as well. In terms of workloads, the applications that run on top are monitored by those teams.

Mm-hmm, right. Yeah, so we want to make sure that the infrastructure is up and running, but we don't go past that.

Yes, the OpenStack system. Yeah, I think someone right over here was next, and then we'll come back.

Have you had to solve how to deal with some of these team problems across large geographies or significant time zone splits?

Not all of the team is in the same location, so we do deal with remote team members. Something that we do: we communicate really well with each other. We have a constant group chat going on. We just transferred tools, but it's all the same group chat no matter what tool you're using, and that makes us feel like one big team whether or not we're all sitting in the same physical location. If everybody is constantly talking to each other on that, pitching in ideas, helping out, a lot of times it's really easy to forget that they're not all in the same physical space. And as I said, we do also have remote work, so there might be times when none of us are sitting in the same space, but we still get the same workload done because we communicate really well.

And we have daily stand-up.
So every morning at X time we meet online and we post what we're working on today and what we worked on yesterday, and if there are any blockers or issues. So we pretty much know what everybody's doing all the time.

So I took the same group chat approach with a worldwide team, and it was really excellent, because folks coming on can go onto the chat and review what happened in Europe and so on during that time. Highly endorse it. Good stuff.

So if you want to categorize your problems in buckets, which are the top three buckets that you have?

Resources?

Yeah, let's talk about technical problems.

Do you mean customer issues that come our way, or troubleshooting? Can you expand on that?

Let's say troubleshooting, as it relates to some OpenStack components, like DBaaS or Neutron or Keystone errors or RabbitMQ errors.

I think we're going to have a difficult time diving... well, it comes down to exactly how much we can get into some of that. I mean, we ran into RabbitMQ issues in the past, and sometimes some networking issues. Those seem to be the big ones. But overall it's pretty solid. It is a full enterprise production environment. I think sometimes it's not always the case that the entire environment is production.

This is the guy in the back that was going to throw daggers.

Where do you guys burn your time?
So we do have customer support. In the same way that we talk to each other using group chat, we also have a channel to talk to our customers. So I do a lot of one-on-one customer support through that. Bringing people onto the cloud also consumes a large chunk of our time, because then we're adding users, adding groups, and finding what region works best for them. Fitting everybody into our cloud can be a pretty daunting task when you have so many people wanting to get their projects into your cloud, especially people that don't understand cloud native exactly. We have a lot of people coming into our cloud who still run on the pets model, and we try to get them into the more cattle model. If something goes wrong with one instance, that shouldn't kill your entire application. So we try to get all of our new projects on the cloud to understand things like high availability and backups, of course, and just try to get them to change their mindset, so that our time isn't all being spent trying to deal with single instances that have gone down and trying to fix those very minute problems.

And a lot of time goes to deployments as well. For upcoming deployments in the near future, we know what's happening, and so we work on getting OpenStack installed from end to end and then getting customers onto those new environments.

One question I have: we actually just launched operations as a service for OpenStack, a solution called chai, but one of the things that I wanted to ask you guys is, how much time do you spend on a daily basis, or on average, on automation? Not everything can be solved by hiring people.

We spend a decent amount of time on automation. I personally try to do a lot of automation for our team, especially whenever we need to do maintenance or small upgrades on our cloud. We try to run those with an automation tool.

Yeah, we try to do automation with our deployments as well.
So that's another good chunk of our time.

Yeah, and a lot of that is also prioritized as to what is needed today and what's needed down the road, to try and help eliminate the overwhelming factor of, we have to do all of this today, right now, and to strategically plan out some of those projects: what has to be done today, what can be done within the next quarter, those kinds of tasks as well.

If you can share: besides Slack or whatever chat application you use, what other tools do you find are most useful to you or really help you do your jobs, whether it be ticketing systems or what you use for prioritization and managing that workflow and all that other fun stuff? If you can't share, I totally understand.

Sure. Okay, so yeah, we use Puppet, we use Ansible. And to Rich's point, if we could spend all day on automation, we probably would, but because we have customers and we have the ticketing and we have calls and needs, and trying to get people cloudy, we can't spend our entire day on that. So we use Slack for chat. We used to use IRC. And then, yeah, Puppet, Ansible. We use JIRA for ticketing. Every time there's a task, we basically put in a project ticket and then we put subtasks under it. We track our work all the time, and we put in how much time we've spent on it. We're very, very transparent.

What's really nice is putting details into the tickets. There are many times I've run into issues where I'll get paged on call at four o'clock in the morning, and there'll be X and X and X alert, and I have no idea what's going on. But I'll just do a simple search in JIRA, and boom, I found all the steps.
I did this, I did this, I did this, I restarted this service, and then boom, done. It's fixed. So we try to be transparent in our ticketing system.

And it is JIRA. And within that, because it's a customer-supporting team, we're also breaking that out into what is customer-facing and what's internal to the team, because you find very different metrics depending on whether they're combined or separate. Also, you have a customer-supporting base that's going to take priority, and you need to be able to handle that. You don't want 300 tickets in which ten of them are customers'. You don't want so much that you can't sort through things. So we try to make it very straightforward as to what's where.

Sorry, just to follow up: since you said JIRA, do you use Confluence for a knowledge base, like a wiki, or do you have something else?

Yes, we do have a wiki, and that's the source of truth for a lot of our docs. We have an internal wiki for our team. That's also accessible to our customers, but it's mainly for us: how to do this, how to do that, these are the maintenances that are upcoming, here's the standard method of procedure for it. And then we also have FAQs for our customers that we link to. We also have a forum, which is Confluence-based.

And I think that brings up a really good lesson learned for any team starting and ramping up. You have a tendency to get very focused on moving to production or being ready. Document, document, document, because you can turn around and use that documentation as educational information across teams and with your customers. If you build out that documentation as you're building out your platform, it will save you a lot of time.

So I have a question. How often do you guys do deployments, and do you have continuous delivery?
And do you take outages while you're doing deployment?

Boy, our deployments. It seems like we're getting exponentially quicker at putting out new regions to our cloud. But I want to say we might average, I would say, maybe two every two months. Is that right? Yeah, that's probably more true overall. And in terms of outages while we're deploying: whenever we deploy, it's a new region in our cloud, so we're not taking anything away. We're only adding. So there are no outages from that.

On software upgrades: what version of OpenStack are you on now, and how do you keep up with the new releases all the time?

We're running on Havana and Icehouse right now. It's tough to keep up with the releases. We're a little bit behind.

I think that's pretty normal, though. The team started the proof of concept at really the beginning of OpenStack, and so as you go through building out and upgrades and such, there's more that you end up carrying across.

Yeah, and we are actually testing our migration right now as we speak, probably back in the office, to see what the best way is to do it. But the good thing is that some of our bigger customers understand how the cloud works. This is more specifically when we've had to do a maintenance on our region that requires some downtime, however small that is. We communicate that to those customers, and if they have engineered their applications for the cloud, many times it's very easy for them to just flick a switch so their traffic stops going to that region and goes to their application in another region.

Oh, I think he has the mic.

The pets versus cattle thing is something every OpenStack ops team deals with. What methods have you guys used to get people to transition to the cattle mindset, and how successful have you been?
Honesty. Yeah, honestly, it has been a lot of that. If the same customer comes to us over and over and says, my instance is down, my instance is down, my instance is down, we pretty much sit them down and say, you have to stop that. That's just not going to work. It's more work for you, it's more work for us, and it's not making anybody happy. So the best thing you can do is engineer your application in a way that if one instance goes down, it's not the end of your application. So, communication and honesty.

We don't have anything that necessarily forces somebody to be cloud-ready before they get onto our cloud. We've talked about it, though. We've talked about putting together some sort of training or prerequisite prior to coming onto the cloud. What we do is give them a welcome letter, and in the welcome letter, if you read it, you'll see what you're supposed to do. We've also put together training videos and PowerPoint presentations, and we get on the phone beforehand. There's a little checkbox when you get onboarded onto the cloud, do you need more information, and if they check that, we'll reach out extra. But as Rich said, you'll still find people that have not changed their mindset yet, and it will just take a little bit more work.

You said you're cross-training inside the team. How do you actually do that, and how do you motivate people to do it?

Honestly, the motivation hasn't been an issue.
I think part of it is also focusing on areas of interest. You have a core operations team and core responsibilities that everybody on the team is going to need to partake in, but then it's being able to expand within the team, to sit in on a different project, to work through something with other team members, not just on operations but also on dev and engineering. And then also the personal development days, giving people an outlet to expand upon their interests.

I would say I haven't found a lack of motivation. A lot of times people on the operations team, once they have that area of interest, want to work with the person on the engineering team who's doing storage, networking, so that they can learn from them and get more knowledge out of that. They might not be going quite as deep technically into it as somebody doing the development and coding for that project, but it's very common for us to work with members of the other facets of our team as we're interested.

So that goes for both sides. The guys that train have a message to send out, and the others just want to learn. And it's in everybody's self-interest not to have a single point of failure, so people have really rallied around that to share that knowledge.

We are coming right up on time, so I actually think you're going to be our last question.

You're talking about it very much from a traditional view, infrastructure and then the app people. Is there not an overlap, kind of DevOps, with architects working with infrastructure, architects working with app people, to actually understand how to design more cloud-native apps? Do you have those people on your team? Do you have architects? I know you talk about doing educational videos, but
that's still app developers.

We have two people on our team that are actually developers. I mean, they were developers their entire career, and now they've moved over to the ops side. So it's pretty cool, because they have experience on both sides of the fence, and they even do a lot of training with the rest of the guys on the team. But it's definitely a gray area. We have people that buddy up with the engineering team. Sometimes, like when there's an incident, you want to know what happened or how they fixed it, so you just sit with them or get the notes from them at the end, and then next time you're doing it. So it's really, really fuzzy.

So we are actually right at time. I would like to thank everybody for attending, and thank you for the questions. Really, really great questions. Thank you very much.