Hello, performance testing geeks. Sincerely, thanks for hanging out with me; it's mid-afternoon on the last day, so thanks a lot. It's a little cold outside, so at least you can stay warm in here. My name is Pierre Lynch. I work for Ixia, which is now part of Keysight Technologies, and I'm also chair of the ETSI NFV TST working group, which is responsible for testing, open source collaboration, and experimentation. What I'm going to talk to you about today is actually a shining example of a collaboration project between ETSI NFV and OPNFV, the open source community, and I'm going to highlight that all the way through, on top of the technical content. I'll start by introducing those two communities a little bit, just in case you're not entirely familiar with them, and that will set the groundwork. ETSI itself is the European Telecommunications Standards Institute. They host a multitude of different communities for different technologies, including 3GPP themselves; 3GPP is responsible for the smartphones in your pockets right now, and ETSI hosts and organizes them. ETSI is also a membership organization. Among many other communities, you might have heard of ETSI MEC, Multi-access Edge Computing, which is hosted by ETSI as well. So they give us the tools, they help manage all the frameworks we need to do our work, and the communities get hosted by them. We are one of those communities, and it's called NFV, Network Function Virtualization. It was born about six years ago, in 2012, as a result of the NFV white paper published that year, which was written by a group of mobile operators. They gave us the challenge of trying to define what NFV is. The architecture that you see behind me, which is pretty familiar now to anybody looking into NFV, was built by ETSI NFV, and pretty much everybody refers to it at least as a starting point for NFV. The community itself is divided into six working groups, each with a different focus, and every working group has multiple projects, or what we call work items, each one of them with a lead. Among the main working groups is IFA, Interfaces and Architecture, the second one on the list; that's the one that came up with this architecture, all the interfaces between the components, and the information models for the APIs between them. Solutions, down at the bottom, is responsible for specifying the protocols for all those reference points and all those APIs, and those protocols are all REST based with JSON encoding. I represent the TST working group: testing, open source, and experimentation. We're responsible for fostering collaboration between open source communities and ETSI NFV. Moving on: OPNFV, the Open Platform for NFV, is a Linux Foundation hosted open source community. It was born not long after NFV itself, maybe a year later. It's an integration project that packages up the products of multiple different open source communities into a reference platform with many, many options. Some examples: all OPNFV scenarios, as they call them, are OpenStack based; there are about three different SDN controllers available, ODL, ONOS, and the old OpenContrail, which now has a new name; it brings in DPDK, of course, and Open vSwitch, of course, plus acceleration technologies like FD.io, and packages it all up into one reference platform. They do develop features, but they develop them upstream.
So for instance, there's an SFC project within OPNFV that was developed and pushed upstream to OpenDaylight, then reintegrated into the OPNFV platform. But being an integration project, right from the beginning testing was their main activity. They have a very mature CI/CD pipeline, and in addition to that, they have a project called XCI, cross-community CI, that draws from the master branches of ODL, OpenStack, FD.io, and ONAP itself, brings it all together in one massive CI pipeline, and feeds any results back to the original communities. It's been very helpful to the OpenStack community itself. On top of that, there's a ton of testing projects with different goals and different scopes: StorPerf looks at storage performance, CPerf at SDN controller performance. The two main ones that I'm going to talk about a little bit more, and that helped us out with this work, are VSPERF and NFVbench. VSPERF stands for virtual switch performance benchmarking; that's its whole goal in life. NFVbench is also a performance benchmark, but it looks at the platform as a black box, and it uses the TRex open source traffic generator as its main tool. Both of those helped us out quite a bit with this work. Now, getting into the testing. At ETSI NFV, the TST working group produced a document called TST 001 that looked at pre-deployment testing of NFV platforms. We took a step back and looked at the impact of NFV on testing methodologies, and we found that there's quite a bit: you have to look at testing in a pretty different way. The main thing is that a lot of features that exist today are implemented in a totally different way because of a shared, virtualized platform: obviously a virtualization layer, a vSwitch, acceleration technologies that are meant to be somewhat standard, SFC; scaling is done differently on a virtualized platform than in the old one-box way, and so are failure recovery techniques. All of these have to be re-examined and tested in brand new ways, because when we took a look at these functions, they have essentially been delegated to the platform, whereas they used to be self-implemented in every application. Isolating the function under test is also very, very different. That used to be really easy: if you've got a box doing one thing, isolating it for any type of performance test just means you simulate everything around it, bomb it with traffic, and measure. You knew that there wasn't anything else going on. Whereas here, it's a shared platform with a MANO stack, with a VIM, with the virtualization layer. You cannot stand a VNF up by itself without the entire platform present, so it's harder to isolate than it was in the past. It basically boils down to defining what is under test, painting it yellow in our case, while everything else becomes what we call the test environment; you have to constrain all the configurations and settings of that test environment to stay constant throughout all your tests, or else you're throwing your results completely out of whack. Performance testing has changed a little bit as well. It's still a matter of performance verification, or data sheet validation, and benchmarking, but there's a lot of dimensioning too, which is closely related to benchmarking: a lot of people would approach us and ask how big their cloud needs to be in order to support a given set of network performance metrics.
So it just adds another dimension to it. But if you isolate the NFVI, which is the purpose of this talk and this piece of work, this is what it looks like. Having said that, it's typical, it's normal, to include the VIM in the system under test in this case, because you have to configure the network somehow, and you also have to place a VM or two or three into the system under test as well. So we typically include OpenStack in the system under test, because it's hard to isolate it in this fashion. Then we looked at how you go about it: what do you consider when testing the platform, the NFVI, the virtualization layer, the vSwitch, and then the hardware, across compute, networking, and storage? Even back then, and by the way, this was one of the first collaborations here: this was built three years ago, but this document, and especially this chapter, was a close collaboration with the Yardstick project from OPNFV; it was built together, if you will. We started from the observation that how you test the platform depends a lot on what you're going to run on the platform. That was the thought back then, and it's increasingly the thought now; I've seen a lot of literature out there saying that platforms can be tuned to different types of applications depending on what the workload is. The two main examples of that: is it a user plane centric type of application, a switch, some sort of gateway or firewall, versus a control plane application, in 5G or 4G an MME or something like that? Because that totally changes how the platform will perform and what will compete for which type of resources: are the workloads memory intensive, compute intensive, storage intensive? So it was thought from the beginning that the workload type, and the operations it performs, could impact how you configure the platform and therefore will definitely impact how you test it. By considering the workload operations, the type of workload, you drive the metrics that you're looking for, which in turn drive the test cases that you'll run in order to benchmark the platform. For instance, latency is a really big deal in the user plane, latency and jitter, let's say; those are really big deals in user plane centric types of applications, but for a control plane application like DNS, an MME, or DHCP, not so much. Different metrics for different applications. Now, getting more specific about user plane, data-centric types of applications: this is from OPNFV. They started looking into it and asked, what can affect the test? What can affect the results, and especially the repeatability of those results? It's things like time: over time, you'll get different results and you don't know why. Minor changes to the platform: identical nodes, in quotation marks, implementing the same DUT, device under test, configuration might change the results and impact repeatability. Of course, different management tools, and certainly different test tools, will impact that as well. So OPNFV started a testing campaign because, as I'll show you on the next slide, they were having a hard time getting repeatable results. That was the major problem: they'd run five tests and get vastly different results each and every time, especially for network benchmarking, user plane benchmarking. So we started looking into covering time and multiple nodes, and the initial focus was on improving the search algorithms.
What I mean by that is, RFC 2544 prescribes a binary search: you keep stepping the load up and up, and the moment you get one packet of loss, you're done, that's your maximum throughput. And we thought maybe that's no longer appropriate. These are just a couple of examples of results that were obtained by, I think this was VSPERF, yes, it is VSPERF, where over time you'd get these wildly differing results. You'd get some kind of consistency, then the next day, whoop, nothing the same, so it was far from predictable. The different lines here are all different packet sizes, with the smaller packet sizes achieving higher packet rates, obviously; it's mega-frames per second on the left, over time. The red and green are the smaller packet sizes, starting with 64 and 128 bytes. It was just hard to predict. The lower packet sizes kind of worked out, but here in this configuration, which is physical to a VM and back down to physical, and I'll have a drawing of that later, using VPP, which is a technology from FD.io, and DPDK, it was again really hard to get repeatable results. And in benchmarking, you want that: you want to trust that whatever you obtain, you can obtain it again and again and again. Without that, it's kind of useless and pointless. So that's when we started our research as well. At ETSI NFV this became TST 009. When the proposal was introduced to us, the author said, well, RFC 2544 is kind of meant for dedicated boxes, and it's so old now that it can drive a car in the United States and even buy a beer in the United States. It's getting a little old; it needs a refresh. So we accepted it and took on the work. We said we have to modernize it, basically, because this is a shared platform and you have to look at things differently; it's not a dedicated box, a router or a switch or anything like that. The document looks at how we can do that. It defines benchmarks, and this is just a summary, I'll go into the details; it defines a bunch of test setups, test tool requirements, and also methods of measurement, meaning how you go about doing this. For the benchmarks, we concentrated on four: throughput, latency, delay variation, and loss. You'll see later that there's a bunch of variants for each one of them, but these were the initial focus; these are the big four. For each of the benchmarks there's a definition, there's background saying what it is, there are named parameters, the scope, meaning what you are measuring, units of measure (for throughput, for example, frames per second or megabits per second), sources of error, and even a reporting format. As an example, the possible parameters for the throughput benchmark are listed here: offered load, frame size, and step size when you're doing the test (that's what step refers to); the trial repetition interval, meaning how quickly you try one level of load after another; the trial duration, how long each trial is; the loss ratio; and the maximum number of trials. All of this is defined for each and every benchmark.
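To make that concrete, here is a minimal sketch of the classic RFC 2544-style binary search that TST 009 starts from, expressed in terms of the parameters just listed. It is only an illustration of the procedure described above; run_trial is a hypothetical hook into a traffic generator, not the API of VSPERF, NFVbench, or any real tool.

```python
# Minimal sketch of a classic RFC 2544-style binary throughput search.
# run_trial() is a hypothetical traffic-generator hook: it offers `rate_fps`
# of `frame_size`-byte frames for `duration_s` seconds and returns the
# measured loss ratio for that trial.

def rfc2544_binary_search(run_trial, frame_size, line_rate_fps,
                          duration_s=60, resolution_fps=1000,
                          allowed_loss_ratio=0.0, max_trials=20):
    """Return the highest offered load (frames/s) whose loss stays within
    allowed_loss_ratio (zero for the classic throughput benchmark)."""
    low, high = 0.0, float(line_rate_fps)
    best = 0.0
    trials = 0
    while high - low > resolution_fps and trials < max_trials:
        rate = (low + high) / 2
        loss_ratio = run_trial(rate_fps=rate, frame_size=frame_size,
                               duration_s=duration_s)
        trials += 1
        if loss_ratio <= allowed_loss_ratio:
            best, low = rate, rate   # trial passed: search the upper half
        else:
            high = rate              # any loss beyond the allowance: go lower
    return best
```

Note how a single lossy trial immediately pushes the search down; that is exactly the behaviour that turns out to be a problem on a shared platform, as described later.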
Now, just quickly, let's look at the variants; what does that mean? Well, throughput has two metrics, plus one metric as a variant. Throughput itself means zero loss, no excuses. Then there's capacity: what is the throughput with an X% loss ratio? There's value in trying to measure that: what's the maximum when you can tolerate a packet loss here and there, expressed as a percentage? Latency is end-to-end, unidirectional latency. Then you have transfer time as a percentile, meaning what's the typical transfer time for 95% of my traffic, or 99% of the traffic, at a given rate, and then minimum, mean, and maximum transfer time. So the delays, pretty normal stuff. Delay variation: frame delay variation means you send a first frame, a second frame, a third frame, a delay is measured relative to each of them, and the question is how much the delay varies between them. Then there's inter-frame delay variation, whose old name is jitter: between frame one, frame two, frame three, frame four, what is that delay and how much does it vary? That can impact different types of traffic; it has a serious impact on, say, voice and video, but no impact whatsoever on HTTP, yet it's still something important. And then loss, obviously: what is the loss ratio? What is the loss ratio at a certain amount of throughput, meaning not the 100% maximum but somewhere less, and is any of it tolerable? What's the maximum loss count, and how much time can you go without losing anything, the loss-free seconds? The supported setups for this methodology are varied. The top left shows things like straight phy-to-phy, two Ethernet ports going straight through the vSwitch and right back out of the platform, and what's called PVP, so again physical ports but with a VM in the middle, up top. You can also have PVVP, meaning two VMs. Down at the bottom left, we started looking at somewhat more modern technologies as well, meaning an overlay: two vSwitches with some sort of overlay technology or some sort of networking bridge between them. And then containers as well: straight container to container within the same pod, or two containers in two different pods. Last but not least, acceleration or bypass technologies are taken into account and are still valid under these tests, as is more of a mesh formation where you have multiple PVPs. Then there's the definition of the methodology, the core procedures: how do you go about this? It starts with a method at the very top. You say, all right, I'm going to find my maximum frames per second for 64-byte frames, and I'm going to repeat that a few times; each time I repeat it, that's one set. Each repetition runs a test, and a test, given that it's a search, you're looking for something, is typically implemented by an algorithm. Each test has multiple trials: you'll try at one mega-frame per second, two mega-frames per second, three, and you keep going until you hit the maximum. So this is really a definition of terms: you apply a method at the top level, with one set of constraints, you repeat it a few times, and each repetition is one test that has multiple trials to try to find the maximum value.
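As a rough illustration of that terminology only (the names below are mine, not the document's data model), a method might map onto code something like this, assuming some search function, for instance a wrapper around the binary search sketched earlier, and the same hypothetical run_trial hook:

```python
# Illustrative mapping of the method / set / test / trial terminology onto
# code: one method applies fixed constraints, repeats the search several
# times (each repetition being one set containing one test), and each
# offered-load attempt inside the search is one trial.

from dataclasses import dataclass

@dataclass
class Constraints:
    frame_size: int        # bytes, e.g. 64
    duration_s: int        # seconds per trial
    loss_ratio: float      # tolerated loss ratio (0.0 for throughput)
    max_trials: int        # upper bound on trials per test

def run_method(search, run_trial, constraints, repetitions=5):
    """Repeat the same search `repetitions` times and keep every result,
    so repeatability can be judged rather than a single maximum reported."""
    results = []
    for _ in range(repetitions):          # each repetition = one set / test
        max_rate = search(run_trial, constraints)
        results.append(max_rate)
    return results
```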
But this is where we get into the fun stuff. When you had a dedicated box, with a dedicated application in that box, you pretty much knew that once you started seeing loss, even one packet, you were done. You might then zero in with a finer granularity than one megabit per second, but you knew that you had reached what we call resource exhaustion: it just can't run any faster, it just can't go any faster than this. In the new world, because it's shared, you go up, up, up, and that's what the tables are illustrating, by the way: is resource exhaustion achieved? No: false, false, false, false. Ultimately you get to a point where that question is true. So they said you can look at a test as a question and answer. You've got a traffic generator offering loss-free traffic, you've got the receiver, and there are now two questions instead of one. Is this resource exhaustion? Maybe not. But then did something happen: a transient process, an interrupt, a hiccup? Because it's a shared platform, did something happen in there to cause loss before resource exhaustion was reached? And that may be true. That leads to what a researcher in theoretical computer science calls half-lies: you come up with a negative result when, in fact, it's half positive, so it's a half-lie. That's the concept we're running with. So what we're trying to do is isolate that: it's not just resource exhaustion, something else is going on here, and what can we do about it? It's because of the nature of the shared platform. So we started looking into binary search with loss verification, improving the search algorithms to take this into account. The whole goal is to separate the concept of resource exhaustion, you've hit the max, from losses that are due to transient events in the platform, due to the fact that it's shared. And when I say shared, I don't only mean two VNFs; the platform is shared with the virtualization layer itself, with the OS. There's a whole lot of stuff going on. So it was the luck of the draw: if you ran a test just when one of those hiccups happened, all of a sudden, boop, your test ended, and the algorithm saw that as your maximum performance. But if you happened to run it when that process, that transient, that hiccup wasn't happening, you were all good, so you kept going. The objective here is simply to say: all right, if I've got loss and the loss isn't too big, represented here by Z, then keep trying. If you run a trial and you have 10,000 packets lost, okay, we're done, that's just too much anyway. But if you have some amount of loss that you can define as Z, instead of just stopping, try again; if at first you don't succeed, run the same test, same input, same stimulus, again. If it works, keep going. If it doesn't work, and you define a maximum number of repetitions as well, two in this example, you stop, because you don't want to stay stuck there forever. It's really that simple a concept: let's take a look at this and give it another chance. When I presented this back home to our engineers, they went: no, no, no, that's unacceptable, the fact that it's a virtualized platform does not excuse it for losing packets. And that's right, and that's not the point. We're not trying to find an excuse for a virtualized platform; we're just trying to isolate and separate the two concepts of transient losses versus resource exhaustion.
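Here's how that idea might look as code. This is only a hedged sketch of the loss-verification behaviour described above, under the assumption that run_trial returns a lost-packet count; it is not the pseudo-code from TST 009 nor the VSPERF or NFVbench implementation, and the names and defaults are illustrative.

```python
# Sketch of a binary search with loss verification: a trial that loses a small
# number of packets (at most z_packets) is repeated before the rate is declared
# a failure, so a transient platform hiccup is not mistaken for resource
# exhaustion. run_trial() is the same hypothetical hook as before, here
# returning the number of packets lost during the trial.

def verified_trial(run_trial, rate_fps, frame_size, duration_s,
                   z_packets=100, max_repeats=2):
    """Return True if the offered load should be treated as sustainable."""
    lost = run_trial(rate_fps=rate_fps, frame_size=frame_size,
                     duration_s=duration_s)
    if lost == 0:
        return True
    if lost > z_packets:
        return False                      # too much loss: resource exhaustion
    for _ in range(max_repeats):          # small loss: give it another chance
        lost = run_trial(rate_fps=rate_fps, frame_size=frame_size,
                         duration_s=duration_s)
        if lost == 0:
            return True                   # clean repeat: original loss was transient
    return False                          # loss persists: treat as exhaustion

def binary_search_with_loss_verification(run_trial, frame_size, line_rate_fps,
                                         duration_s=10, resolution_fps=1000,
                                         z_packets=100):
    low, high, best = 0.0, float(line_rate_fps), 0.0
    while high - low > resolution_fps:
        rate = (low + high) / 2
        if verified_trial(run_trial, rate, frame_size, duration_s, z_packets):
            best, low = rate, rate
        else:
            high = rate
    return best
```

The only difference from the classic search is in verified_trial: a single small loss no longer ends the search immediately.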
You also keep the trial short to try to avoid the transients, so you try to sit between those arrows, but that trial duration is better determined once you've done long duration tests; I'll deal with those in a moment and you'll see why. For this series of tests, we focused on frame size and IMIX, the number of repeated tests, the number of repeated tests where the outcome changed and how often that happened, and then we tried to get a metric on consistency as well. And the reason, by the way, let me step back for a second, the reason you want to isolate resource exhaustion from the transient processes is that the two are dealt with in totally different ways. Resource exhaustion: you're done anyway. But the transient processes, if you can characterize them, if you can isolate them, then you can tune the platform to avoid them for the workload that you're looking at, in this case user plane applications, and deal with them separately. Now, tuning a platform typically means give and take: it'll be really good at the user plane and maybe lousy at something else. But since we are looking at user plane, straight throughput, frames per second, raw speed here, this may be appropriate if you dedicate those platforms to that type of workload. So you conduct long duration tests at or near what you found to be the maximum zero-loss throughput level, and then you characterize the transient events. What's their average frequency and period, how often do they occur? What's the impact when they do occur: is it a huge amount of loss, or a little bit? And consider that there may be multiple loss-event signatures as well. All of this you can send back to the platform designers, who can then tune the platform to at least minimize them, and then you restart the tests to find out where resource exhaustion really is. Typically, this has led to immediately better results when we were experimenting with it.
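As a purely illustrative sketch of that characterization step, suppose the long-duration run gives you per-second loss counts (the real tools report loss in their own formats); grouping consecutive lossy seconds into events and summarizing their frequency and impact might look like this:

```python
# Illustrative characterization of transient loss events from a long-duration
# run near the zero-loss throughput level. Input is an assumed list of
# per-second lost-packet counts; output summarizes how often loss events occur
# and how big they are, the kind of signature you'd hand back to the platform
# designers.

def characterize_loss_events(per_second_loss):
    events = []                       # (start_second, duration_s, packets_lost)
    start, lost = None, 0
    for t, loss in enumerate(per_second_loss):
        if loss > 0:
            if start is None:
                start = t
            lost += loss
        elif start is not None:
            events.append((start, t - start, lost))
            start, lost = None, 0
    if start is not None:
        events.append((start, len(per_second_loss) - start, lost))

    if not events:
        return {"events": 0}
    starts = [s for s, _, _ in events]
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return {
        "events": len(events),
        "mean_period_s": sum(gaps) / len(gaps) if gaps else None,
        "mean_loss_per_event": sum(l for _, _, l in events) / len(events),
        "max_loss_per_event": max(l for _, _, l in events),
    }
```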
Now, this is how the collaboration worked. We would put an idea in the document, and all our drafts are public, so we'd ping the OPNFV folks. They'd take a look at it and do two things: they'd review the document and come back with a bunch of comments, but they'd also implement a prototype of the search algorithm we were proposing, run the tests, and send us feedback. We'd tune our material, and we went back and forth about 15 times like this, and it ended up producing really, really good results. The tests were run using VSPERF and NFVbench; those were the main tools used to prototype these results. Now, I don't have the results with me, and you'll see in the next actions that they're forthcoming, but there was an immediate impact in finding out what truly is the maximum performance of whatever platform we were playing with, and secondly, we could easily repeat those tests many, many times, which is exactly what we were looking for. With this slide I just wanted to illustrate, again, I keep jumping up and down about it, how much of a collaboration effort this was. This is the exhaustive list of everybody who participated. The top half is people from ETSI and my team, and the bottom half is OPNFV people and even some FD.io people; Maciek is from FD.io as well. Alec Hothan from Cisco participated quite a bit in this. But the two main guys who did this, and again, I am only the messenger here, these guys are the real experts and did the real work, are Al Morton from AT&T, who is also the chair of the IETF Benchmarking Methodology working group, and if you do a quick search on Al Morton and RFCs, you'll see that he has 34 RFCs to his name and about nine active drafts; the guy is a rock star when it comes to benchmarking. And the prototyping itself was done by Sridhar Rao from Spirent. He did most of the work within OPNFV to try this out, prototype it, run all the tests, and give us back the results. As for follow-on work, we're not done. This is published, but we're opening it up again and already gearing up for the next version. The main things we want to do are: re-examine the container setups and see if we can modernize them, because that's very much an evolving field right now; add new material, the summaries of the tests run by OPNFV that I was mentioning before; complete another search algorithm, the NDR/PDR (no drop rate and partial drop rate) binary search, which is mentioned in the document but not detailed yet; add new methods for long duration testing, taking advantage of these new search algorithms; and add a new metric variant for loss, where the timestamps of the loss events would be collected and correlated with system events, which would help the designers isolate why this is happening. And that's it, 30 minutes, just my target. So if you have questions, I can entertain them or direct them to people who actually do know the answers. Yes, sir. One question: in the test setups you had, physical to physical, VM to VM, it looks like you're not testing between VMs distributed over different servers. Is that true? Because that's definitely another setup. In a sense, yes, but the point was to test one server, so one platform, not inter-platform communications. That's the scope of these tests for the time being, because it's mainly the vSwitch. Anything else? We've got another document, by the way, that would do that; it looks at path implementation, so TST 004, if you look that up, and that one's pretty interesting as well. True, true. But this one wasn't so much about VNF performance as platform performance, which is why we're trying to isolate it to one server. Anything else? I want to understand what it is you were actually measuring. You had an application running inside a VM, and you were measuring how it behaved? All it would do is route: if there's a VM involved, all it does is forward the packet, or send it right back to the test system, and it's kept very bare so as to introduce the least amount of delay. So you were really trying to characterize how the platform behaves, not the application? Correct, correct. This is NFVI, platform testing only, no applications. And if there has to be a VNF in there, one really important thing is that, if it introduces an error, it at least introduces the same error for every single test. All right, thanks a lot. Have a great day.