Good afternoon, everybody, and thank you for joining us. My name is Kyle Forster; I'm the founder of Big Switch Networks. The three of us are on stage to talk about some of the lessons we learned in a very large NFV deployment. This was a heck of a project. I'm joined by Chris Emmons from Verizon and Radhesh Balakrishnan from Red Hat, and we're going to go through, one by one, the lessons we learned over the course of the nine-month sprint for the first phase of the project. Thanks for joining us. Chris, let me turn it over to you to kick us off.

Hi, everybody, and welcome to the third day of the conference. I'll jump right in. There are a lot of buzzwords on this slide, but the point of it is really what we set out to do: build the foundation of the next-generation network for Verizon. We wanted to build a network we could use across all of our locations and all of our business units, not just wireless, not just telecom, not just our enterprise business, but across everything, to leverage the economies of scale and the operational efficiencies we can gain by doing that. And we wanted it to evolve in a modern way, at the pace of software and everything we're here at this conference to talk about: how can we leverage the most modern technology to deliver a network of the same quality and the same reliability, but in a more agile, cheaper, more efficient way?

The second bullet here is about the multi-year plan we have in place: build out the network across Verizon and then move all of our workloads onto that infrastructure, so that in the not-too-distant future we'll have a fully programmable network that enables us to innovate, experiment, heal the network, and grow the network in a much faster, much more programmable way.

This slide is the foundation for what we did. Starting at the bottom, we wanted to embark on this project with commodity hardware and open source software. The benefits are obvious. With commodity hardware we can drive some of the cost out of the equipment we purchase and install. With open source software we also see an opportunity for cost, but more importantly we see an opportunity to enable competition and innovation in our network in a way we never have before. These were key concepts to everything we wanted to do.

The pillars on top of that foundation, starting on the right, are software-defined networking, virtualization, and automation. My partners on stage helped us with the first phase of this project, as Kyle explained: Big Switch helped us with the software-defined networking component, and Red Hat with virtualization. On the automation side, there's obviously a lot of work going on internally as we automate our processes and procedures and move toward the software-based model we're looking for.

On the commodity hardware piece, it was important that we use that hardware and pick our vendors in a way that maximizes it. Big Switch was one of the key vendors capable of running on the white-box hardware we selected for our switching fabric, and it was immediate plug-and-play: they integrated great with the hardware, and with the OpenStack software we selected from Red Hat. My partners on stage will go through more of the details, and hopefully we'll have some time for questions at the end.

Before we talk about the timeline of the project, I thought I'd jump to some lessons learned. On the left side of this slide we have a funny-looking airplane with the wings on backwards. That's meant to symbolize a different way of doing something. We're used to landing airplanes with wings that look normal, but when you start talking about moving our infrastructure to something cloud-based, software-programmable, and self-healing, all the things we want to accomplish, that looks very different from what we're used to doing in the network. Building that airplane is one challenge.
Now try landing that funny-looking airplane at an airport that isn't used to seeing it. We have a network that's tried and true, with processes and procedures built up over the years that have made it the most reliable network, and we can't sacrifice that. But we do want to try new things. So the point of this slide is: prepare the runway for things you're not used to seeing. Funny-looking airplanes, funny-looking technologies, different processes and procedures, different ways of thinking. You really have to adjust yourself to be prepared to take something like this on, and that's what the right side of the slide indicates. The people who make all this happen have to be prepared as well. They have to learn the new technologies, be willing to challenge the status quo, and be willing to look at things differently. We don't want to sacrifice who we are, how we got to where we are, or the services we provide to our customers, but maybe there are better ways to do that, and you have to be willing to step outside your comfort zone and take that on.

Moving on to the timeline. This slide is a little deceiving, because it indicates we started in March of 2015. March 2015 was when we kicked off this particular deployment in earnest; there was a lot of foundational and other work that went on well before that, lots of trials, lots of lab work, lots of POCs, just like many of you in the room have probably done. But by March of last year, we said: we think this technology is really getting to the point where we can deploy it and start to use it in production.
So that's the March date: all right guys, we've done enough playing around, let's go figure out what we're going to build and deploy it. We rolled through a couple of months of final technology evaluations, looking at different vendors and different options and doing some final POCs, just to make sure we were certain of our selections and how we wanted to move forward. That took us into the June-July time frame, where we locked down the design, built it in the lab, and did all of our testing: reliability testing, destructive testing, load testing, all those kinds of things. Kyle is going to talk a little later about some chaos-monkey-type stuff; we really went to great lengths to abuse this thing and make sure it was going to work the way we wanted once it was deployed in production.

Then from August to November we actually deployed the hardware to the five data centers, as discussed in the press release we put out this week, and we had those five data centers ready by the end of 2015. So we're now full-on into getting workloads migrated into those data centers and actually transforming our network. A parallel track to all this that you don't see is the work we were doing on the applications: virtualizing them, getting them to run in our environment, making sure the network requirements were satisfied. All together, a very aggressive timeline and a great team of people, our partners and the teams within Verizon, many of you sitting in this room. Thank you for your contributions. We moved at light speed here, and we're really moving the network along. With that, I'll turn it over to Radhesh.

Thanks, Chris.
Good afternoon, everybody. I want to touch on three things: a little bit about Red Hat's approach when it comes to NFV and what our offering is; then the nine months of frenetic pace that Chris Emmons and his world-class team set for us, and the learnings at a high level; and last but not least, from a forward-looking perspective, the areas we're working on to make sure we continue to delight Chris and his team and have an impact together.

Our approach is upstream by default, or upstream first. Now, if you pulled a random Red Hat engineer aside and asked why we do that, they might say it's the only way they know. But if you've ever done software, it's important to ask and understand why we do what we do. Our approach as a product company is to make sure we don't create any kind of lock-in, which some services organizations tend to create. Some customers have described a six-by-six-by-six model to me: six consultants for six months, then rinse and repeat the same thing six months later. Our approach, instead, is to make sure that every bit of code we write is available upstream.

The second thing to ask is: what's the value of doing that? The value is that there's no vendor lock-in, because at any point in time the source code is available; in fact, we do our best to document exactly where it's available as well. But most importantly, as a product company, it means that maintenance and support of implementations across the globe can be done easily. If every customer had a different forked implementation, we clearly couldn't scale to address that, because we're not in the services business.

The flip side of the question is: what's the risk of forking? Probably the biggest is technical debt, because every customer stuck on the island of a proprietary fork is going to carry technical debt. Especially in a place like OpenStack, where the pace of innovation is so high, thanks to every one of you in the community helping us move the needle on a daily basis, you don't want to get stuck on that island. That, in a nutshell, is our approach to any problem space from a product perspective, and particularly to NFV and OpenStack.

So what is our offering in this space? There's a lot of chatter about OpenStack being an integration engine. If you think of it as an engine, we believe it's more of an engine plus the steering wheel plus the wheels required to actually run the car, if you'll allow my real-time attempt at an analogy. The important thing to realize is that we are fundamentally co-engineering Red Hat Enterprise Linux, KVM, and OpenStack into a solution that a customer can bring into their environment and live with for the lifecycle of the product, currently three years, with peace of mind that somebody is there to watch out for upgrades and patches along the way. And from an NFV perspective, the innovation is not just at the OpenStack layer. The innovation is interdependent across all the layers, from Linux to KVM, to DPDK in partnership with Intel, to libvirt on the KVM virtualization side, and all of these are moving at different paces at any given point in time. Our job as a product company is to take the customer input, drive the innovation, and at the same time stabilize it into an experience that a customer such as Verizon can deploy and live with on an ongoing basis.

The second aspect I wanted to touch on is the lessons learned from the project itself. I'm a big fan of the number three, so: three areas where we got feedback. The first is a set of core capabilities we had to make sure were available from a deployment perspective. There were tens of features, and Kyle will talk about the growth in the features and expectations placed on us as we went through the journey, but let me characterize the top three areas where we had to actually do something to meet the project's needs.

IPv6 support exists from an OpenStack perspective, but we had to make sure that from a deployment and management perspective it surfaced in a way that's easy to deploy and manage. The second area was SSL endpoint support: making sure that director, our deployment technology, makes it easy to roll out an SSL implementation. That was another area we had to work on in a truly upstream-aligned manner and make sure it landed in the product as well.

The third is high availability. This is a critical area, because in the NFV space there's a lot of noise around quote-unquote carrier grade, without even a standard definition of what that actually means. The good news is that between Verizon, us, and Big Switch, we agreed that it's not about some carrier-grade label; it's about reliability, availability, and serviceability of the infrastructure. With that as the focus, availability clearly became a key area for us. Our implementation is fairly straightforward, with Pacemaker, Corosync, and HAProxy, and we make sure we can stand up a three-node cluster with high availability from a controller perspective. We also have the option to turn on instance HA from a deployment perspective.
We're not there yet on that last piece, but it highlights that the feedback we got during the project actually helped shape the product. The fact that we are also leading contributors at the Corosync, HAProxy, and Pacemaker level helps us react to customer demand and do it in an upstream-aligned manner. So we feel very good about having met the needs here.

Two more areas of feedback. One is partner integration itself. Big Switch, after a lot of trials and a successful proof of concept, was chosen as the SDN provider of choice, and the reality is that this opportunity is what made us work closely together, at two levels. Initially it was about certifying that the Big Switch solution coexists peacefully with our OpenStack Platform offering. The next step we were challenged with was serviceability of the implementation: getting to a level where director, our OpenStack deployment tool, takes care of the integration at a much tighter level, so that it's not two parallel implementations but one seamless implementation experience as you scale. So in addition to engaging with Verizon, we also worked closely with Big Switch, and Kyle and his world-class team were super energetic in engaging with our engineering team and product management, not just to make sure we had a solution for Verizon, but also to think through what it takes to sustain roadmap alignment on an ongoing basis. That was another exciting outcome of this process.

Last but not least, the third area is scale. As we speak, we're spending cycles at a Dell lab testing scale in the hundreds of nodes. Not every implementation is going to hit hundreds or thousands of nodes, but when we talk about the conceptual maximum of OpenStack scale, it doesn't behave very well when you push the limit of where the technology needs to be. So we want to focus not only on testing and proving that we can get to multiple hundreds of nodes, but also on making sure that the serviceability and the performance you get from a multi-hundred-node implementation continue to meet the SLA requirements as well as customer requirements. This will be an ongoing area of investment for us.

So I've touched on our learnings during the project. The third aspect I want to touch on is forward-looking: what areas are we focused on? I can synthesize this into maybe three points. The first is the classic put-the-genie-back-in-the-bottle requirement around hyperconverged infrastructure with OpenStack and Ceph. By that I mean that today, seven nodes is the minimum footprint required for a well-architected implementation of Ceph and OpenStack. How do we shrink that down? Because the reality from a Verizon perspective is that it's not just the data center; there are deployments of multiple scopes and sizes being considered. So bringing these elements together technically is an area we're invested in, especially from a Newton perspective, now that Mitaka is already out the door.

The second area is split stack, or composable roles. The driver there is isolation at the role level, so that maintenance and DR can be accomplished successfully. Think of it as being able to scale out the controllers by a given specific role, or a service if you will. And the third area has to do with the fact that there are different kinds of workloads being placed, which require different kinds of underlying infrastructure, from bare metal on one extreme to super-small, lightweight implementations with minimal, easier-to-implement high availability without the Pacemaker dependency, and so on.
In a nutshell, these are all areas we would not normally have thought about if we weren't engaged in this project. In line with our philosophy of upstream-aligned innovation, we've taken all this feedback and are working through the roadmap to make sure that come the Newton time frame, we knock these out of the park while continuing to innovate on our NFV solution. And as I mentioned, a critical element of the solution was the partnership with Big Switch. To tell you more about that, let me invite Kyle over.

Thanks, Radhesh. To me, one of the big things that worked here, and I think this was most visible last summer, is the context: you have a team with a big vision. Chris laid out a big challenge for his team and said it's time to rise up and meet it, with a whole series of new technologies and a new kind of cultural approach. In that context, being able to grasp onto a few familiar things makes a big difference. It's my perception that one of the things that really worked over the course of this engagement is a metaphor we call One Big Switch. If you're a networking professional, you're familiar with logging into a big chassis. In this model, everything you used to do on a supervisor you do on the SDN controllers, whether it's CLI in or BGP out; it's a very comfortable model. Everything you're accustomed to doing on the chassis backplane, i.e. nothing, is what happens on the spine switches: you don't configure a spine switch, you don't add a protocol on a spine switch, it just acts like a chassis backplane. And everything you're familiar with on a line card is how the leaf switches act. You don't have separate software images on leaf switches; you plug them in and they update from the supervisor, and you don't have to configure protocols so the line cards can talk to each other. Being able to draw on this very familiar, comfortable metaphor explains a lot of how we were able to accelerate so rapidly with the team and make it intuitive for everybody to understand, at a time when they were swallowing SDN software, OpenStack, NFV instances and planning coming down the pipe, all of it on bare-metal hardware.

One of the neat things is very specific to NFV: NFV stresses network designs like you wouldn't believe. It puts pressure on L2/L3 designs in ways that, it's our belief, only SDN can solve, and it makes this type of model very, very interesting. In this model we have an integrated v-switch, and when the v-switch plugs in, it acts like the line card, while the leaf switches it's attached to begin acting like the chassis backplane. We call this P-plus-V integration, and for NFV it simplifies so many of the really hard routing and switching problems that show up that, to me, it was one of the big reasons we were able to accelerate through the project the way we did.

Chris mentioned resiliency. One of the things we did in preparation was chaos testing, borrowed from the Netflix team's Chaos Monkey, and I would say several of the teams on Verizon's end put us through tests even more extreme than the chaos testing we could do in our labs. We load up a full pod with about 48,000 VMs running a mix of Hadoop workers and traffic generators, then shoot a controller every 30 seconds, a random switch every 8 seconds, and a random link every 4 seconds. Every software image that ever went into the Verizon lab had to pass this test for half an hour before we would send it over. That's about 640 component failures in about half an hour. There is no spanning tree network in the world that can withstand this kind of beating, and it's our belief that for NFV, this is the new bar.

When we talk about culture change, there's the old approach to HA in the communication provider world, and under this team's leadership the approach here was very different, much more Google-style: a component can fail at any time, and it doesn't matter that much; the only thing that matters is that the system as a whole stays up. It's a different cultural way of thinking, it's incredibly important, and going through this cultural transition with this team was, at least for me, one of the really exciting parts of the project.

As Radhesh mentioned, we started with about five really key criteria that were a really good fit, and literally three months in we had a long list of requirements. This is a snapshot from June of last year of the set of requirements we needed in order to really deliver NFV into full production; somebody might as well have gotten a photo of my calendar. I'm not going to go through all of it, but one item I would highlight is how you integrate the v-switch and the physical switch. This was a problem we had solved before the engagement, but it became a really interesting one. There's a huge amount of operational and, frankly, packet-performance optimization you can get by doing in hardware what hardware is really good at and doing in v-switch software what v-switch software is really good at.
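Going back to that chaos testing for a moment, the failure-injection cadence is simple to sketch. This is an illustrative loop only; the component names and the `kill` callback are hypothetical placeholders, not the actual lab harness:

```python
# Minimal sketch of the failure-injection cadence described above.
# The component names and the kill() hook are hypothetical; the real
# Verizon lab test rig is not public.

INTERVALS = {            # seconds between injected failures
    "controller": 30,    # shoot a controller every 30 s
    "switch": 8,         # shoot a random switch every 8 s
    "link": 4,           # shoot a random link every 4 s
}

def schedule(duration_s):
    """Build a time-ordered list of (second, component) failure events."""
    events = []
    for component, period in INTERVALS.items():
        events.extend((t, component) for t in range(period, duration_s + 1, period))
    events.sort()
    return events

def run(duration_s, kill):
    """Drive the chaos loop: invoke kill(t, component) for every scheduled event."""
    for t, component in schedule(duration_s):
        kill(t, component)
```

Over a 30-minute window that cadence injects a component failure every few seconds, sustained, which is the kind of beating a protocol-convergence design cannot keep up with.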
And what we found, over the course of really loading this thing up and running tests with NFV traffic mixes and NFV operational models, is that this integration goes from being an interesting thing to an absolute must-have. It's the only way you can get both the performance and the operational flexibility you need for NFV. That's something we had made some progress on before we started the engagement; it was one of the hypotheses we came in with, and we improved it over the course of the engagement.

But one new thing we really discovered, let me point out the second bullet, was a really interesting one. When you sit in the lab trying to design product on paper, you make all these assumptions about NFV workloads: that there's one connection model and it will always work, or that there's some best way to connect an NFV workload into an NFV infrastructure. That turns out to really not be true, and frankly it's not in anybody's control. There are a lot of different NFV vendors, and they're extremely heterogeneous: they have very heterogeneous views on their own performance requirements, and very heterogeneous legacy code bases they're building on, and the laws of engineering physics apply here. People on the NFV side are, for the most part, building from legacy code bases. So just using Open vSwitch is great for some and terrible for others. Just using our v-switch, Switch Light VX, is great for some workloads and not enough for others. Just using single-attach SR-IOV worked for a very small number of workloads and didn't work for a very large number. Dual-attach SR-IOV creates a very interesting and hairy set of problems on the physical network, but it's the only option for some of the VNFs out there. And last, there's a lot of DPDK work we're doing; that's a very interesting one, but it's certainly not going to cover the entire galaxy of VNFs in the short run.

So I'd point out two things on this slide. The technology for the P and the V is an absolute must-have, and it creates an operational silo-busting challenge, a management challenge, which is really hard. And one size fits all was, for us, a big lesson learned that was not obvious on paper, not obvious on day one, until we really started doing a lot of the heavier VNF work. And with that, I think we still have a tiny bit of time for questions.

Q: Could you comment on your day-two monitoring and logging systems? And what do you plan to do when you have to upgrade to Newton or whatnot? What's your plan?

You asked about day two. We've got an initial set of tools, fairly basic and rudimentary, to get us up and running, and then, in keeping with the open source theme here, we're looking at the best and brightest opportunities for monitoring going forward. Monitoring, the whole service assurance package, is a challenge: how do we get the information out of the infrastructure, process it, and react to events in the infrastructure? So from a day-two perspective we're still evaluating and picking those tools, and the same goes for updates and upgrades.

Q: Updates and upgrades to which part, the OpenStack services?

The OpenStack services, yes. We've been through many upgrades in our lab environments now. On the production side, we're still working through our first set of upgrades.
Thank you. If I could add to that from an OpenStack perspective: we just released OpenStack Platform version 8 last week, and that's the first version with in-place, version-to-version upgrade, so 7 to 8, and then 8 to 9 when 9 becomes available. The challenge is that we haven't yet gotten confidence that there will be API stability beyond one release, so this is an area we're looking at very actively, to make sure that at some point we can get to N-to-N-plus-2 as well. We're not there yet technology-wise.

And on the fabric side, we put in something very early, almost three years ago now, where you can upgrade the whole fabric in about the same amount of time as it takes to upgrade an iPhone. The downside of that really cool thing is that people actually use it a lot. I think this is going to be true across these infrastructures: in the vendor community, we have to plan for the speed of upgrade being much, much faster. I'm incredibly excited about that side of it; we're creating the capability for people to be much more agile in this area.

Q: Thank you. It's a really interesting case study; I could not imagine achieving so much in just six to seven months. One quick question: DPI and GTP-related scenarios are, I think, pretty complex to handle in a virtualized scenario. What was your learning on that?

I'm going to have to pass on that one. Honestly, I'm not the network guy; I'm the guy that runs the team with all the smart people on it, so I don't have the answer off the top of my head. Thank you, sir.
One thing I can add on that: across some of the early performance tests, you find that a lot of the default OpenStack stack and default L2/L3 networking is very tied to TCP. So we actually had to do GTP optimizations: GTP hashing at the spine, at the leaf, and even down at the v-switch-to-vNIC level.

Q: Thank you, it's a very good presentation. I'm wondering which specific NFV use cases are being deployed in production. When you're talking GTP, is it really the EPC use case, or are you looking at something else too?

I'm not going to comment specifically on what we're deploying right now; that's not part of what we're ready to announce at this point. But I will tell you that we are looking at our entire network, so that includes EPC, it includes IMS, it includes VoLTE, all of those things. We will be addressing all of those workloads in the coming weeks, months, and years.

Q: Thank you. There are several sites on the OpenStack side; how many controller sets control those sites?

Sure, I actually meant to comment on that in my section, so thank you for bringing it up. At each site right now we have one control infrastructure. We've deployed a pod approach: we deploy individual pods at each location and connect them into the data center network. Our intention is to deploy pods at the scale we can support today, which is about 14 racks in one pod. That will evolve to larger pods as we go, but then we'll just replicate that pod infrastructure, and each pod will have its own OpenStack control plane.

Q: So in other words, each pod is kind of independent?

Independent at the controller level, yes, but then we're using higher-level orchestration and coordination techniques to work across those pods.
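To make the earlier GTP-hashing answer concrete: standard ECMP hashes the outer IP/UDP 5-tuple, and since GTP-U tunnels between one pair of endpoints all share that tuple, they would pile onto a single link. Hashing on the GTP-U tunnel endpoint ID (TEID) spreads them out. A rough sketch, assuming a plain GTPv1-U header; this is illustrative only, not Big Switch's implementation:

```python
import struct

# Illustrative only: choosing an ECMP member link by hashing the GTP-U
# TEID rather than the outer UDP 5-tuple. Offsets assume a plain
# GTPv1-U header (TEID in octets 5-8).

GTPU_PORT = 2152  # well-known UDP destination port for GTP-U

def gtp_teid(udp_payload: bytes) -> int:
    """Read the 32-bit TEID from a GTPv1-U header."""
    return struct.unpack_from("!I", udp_payload, 4)[0]

def pick_link(udp_payload: bytes, n_links: int) -> int:
    """Spread tunnels across fabric links by TEID instead of 5-tuple."""
    return gtp_teid(udp_payload) % n_links
```

Different bearers carry different TEIDs even when the outer addresses and ports are identical, so keying the hash on the TEID restores load spreading inside the fabric.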
Q: Thank you very much. You mentioned that these data centers are set up for your whole company. Is Verizon planning on moving all of its telecom services to this network?

Telecom services, meaning the wireline component, or just in general? Again, we built one infrastructure here for Verizon as one company, so we're going to serve workloads for the entire company out of these locations. Those five locations are just the first five; there are multiple locations coming after that. Each application will be evaluated for where it needs to be based on latency characteristics and things like that, and we'll place those loads where they will function best in the infrastructure. But we will serve all of the business units within Verizon from this infrastructure.

Thanks, everyone.