Well, welcome to Continuous Regression in high-end CPU development. My name is Travis Lazar. I'm here from Ampere Computing. My focus is primarily on continuous performance regression and all the data analytics that go into making that possible. Ampere Computing may not be very well known at a conference like this, so a little bit about what we do. We're a high-performance CPU development company, not to be equated with HPC, which is a separate industry, but we build high-end CPUs for cloud data centers and the use cases surrounding them. We build ARM 64-bit microprocessors, again really focused on the high end. Our current product is a 32-core, 3-gigahertz part, and you can play with them: we've got a couple on Packet, which is a public cloud offering, so you can actually go get access to an ARM 64 server part there. Our CEO is Renee James. She came from Intel, where she had a pretty decorated career over 20 years, most recently as president. She left Intel and decided that she wanted to change the game with how silicon development was done, how microprocessors are built, and all the culture that goes into that, and so Ampere was born. We're a relatively young company; I think we're 18 months old at this point, so pretty young for a silicon development company.

And today, I'll talk a little bit about our continuous regression system. I want to put this up here to say that continuous regression is related to CI/CD, and I want to set some common language for this particular demonstration. We've all heard of CI/CD and DevOps; it's a pretty common practice these days. Continuous regression I view as a subset of that, and the focus of continuous regression is taking the entire set of your historical results and applying some sort of statistical analysis on top of it, so that every new data point that you add makes your system a little bit more accurate. This is different in that, and I'm generalizing here a little bit, in a typical CI/CD system you have a list of unit tests or functional tests and they either pass or fail, and those are fixed criteria set when the functionality is built. For continuous regression, especially around performance, that's not good practice. It's not good to say, hey, 1,000 megabytes per second: below that is a failure, above that is a success. That's not the whole picture when it comes to performance; there are too many factors involved. So from a continuous regression standpoint, it's really a moving bar. As new results get added, the pass/fail criteria change based on what we're learning about the system and the performance improvements that we've made to the system.

This is important for a couple of reasons for us, and it's a really big problem for us. We're a silicon development company; we build hardware. One fairly unique thing about hardware is that it's unchanging. Once you install it in a data center, we can't go take things out or put things in. The hardware is the hardware, and once it's installed, it's very expensive to go and install a new version, and it's expensive to build a new version. So our hardware is an element that we can't change once we sell it and once it gets deployed into a data center. But the things that happen in the data center are extremely varied.
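To make that moving-bar idea a bit more concrete, here is a minimal sketch of what a history-driven pass/fail check could look like. This is my illustration, not Ampere's actual implementation; the window size and the three-sigma cutoff are arbitrary assumptions.

```python
# Hypothetical sketch of a "moving bar" pass/fail check: instead of a fixed
# threshold, the acceptable range is derived from the recent history of
# results for the same benchmark/OS combination.
from statistics import mean, stdev

def moving_bar_check(history, new_result, window=50, k=3.0):
    """Flag a regression if the new result falls more than k standard
    deviations below the mean of the last `window` results."""
    recent = history[-window:]
    if len(recent) < 5:          # not enough history yet; accept and keep learning
        return "baseline-building"
    mu, sigma = mean(recent), stdev(recent)
    if new_result < mu - k * sigma:
        return "regression"
    return "pass"

# every accepted result gets appended to history, so the bar keeps moving
history = [1020, 1015, 1034, 998, 1027, 1040]
print(moving_bar_check(history, 910))   # well below the recent band -> "regression"
```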
A common picture that we might see is something that looks like this, where we install in a data center, and you're all familiar with many of these pieces of software, and there's infinite variation in the versions that are installed as well as the way that they're configured. So when we look out at the world of our install base, we see an infinite problem space. To put in perspective how big this problem space is and how much it can change, I took a look at 20 pretty large open source projects, just 12 months of the history of these projects. There were 200 million lines of code across the projects that I sampled; again, these are big, active open source projects like the Linux kernel. Over the past 12 months, there have been 1.2 million commits by 31,000 contributors. So this is a pretty substantially high amount of change. For us, this means that for every part we deploy and every part we push out into the world, we have to expect that there's going to be continuous change in the software stack and in the software ecosystem. And so from our standpoint, our major goal is to maintain quality throughout that continuous change. By quality here, I mean both functional quality, does it work, as well as performance quality, does it work well? Our silicon can look bad if software starts to break, and it may not necessarily be something that we directly did, but it's our responsibility to support that ecosystem and to make that ecosystem strong.

To that end, we developed the system this talk is about, called TARS, the Totally Automated Regression System. This is a set of tools and infrastructure that we have in-house that's responsible for automating all the testing, analysis, provisioning, et cetera. It's an entire stack of software, and I'll show every component of that here in a minute. This system adds a ton of value for us. We find a lot of problems early, especially around firmware development, or when we get early builds of software and we see that something breaks or performance changes. We also get some marketing nuggets from it when performance goes up more than we expected or unexpectedly, and it helps us keep a handle on what's going on in the software ecosystem on our hardware. This system could be even more valuable with open source partnerships, and that's the door that's been opened for us: we want to invite open source projects to come and collaborate with us. There are a couple of motivating factors here. One is that we are ARM hardware, and the ARM ecosystem is less mature than other hardware ecosystems. Because of that, it can be very expensive or difficult to get access to high-performing ARM hardware. A Raspberry Pi is very easy to get, it's very cheap, you can go buy one at Target for $35, but we're talking about 32-core, three-gigahertz systems with 512 gigs of RAM and big hard disk deployments, and systems at that scale can be very difficult to find in the ARM ecosystem. So we want to work with open source partners to make that more accessible, especially around some of the performance information that we're gathering, which can be its own problem space.

The system that we've built works in effectively three phases. The first phase is provisioning and configuring, and this is always done on bare metal.
We do virtualized testing, so we test virtual environments, but we always, always, always start with a bare metal provision. This is for control and reproducibility. You can go deploy a system on EC2 or some other kind of virtualized environment, but you're going to be sharing that system, so when we do testing, we need to ensure that we have full control and complete understanding of what's going on on that system, or our test results get tainted. Second, we go to testing and benchmarking. This is the meat of what we do, and it's actually the easy part: we go and we run a series of performance tests and we get data. That data then gets analyzed and action is taken, and I want to emphasize the action is taken portion. I think it's very easy these days to collect a mountain of data, but if you're not doing anything meaningful with that data, then why collect it, right? Why spend time and resources to do that? This is broken down into eight sub-steps, and this is what we'll demonstrate here.

First we deal with scheduling. This is a screenshot from our scheduling dashboard, and there are a couple of things I want to point out. When you go and provision a new test set to be run, you need to know a couple of things. The first thing we look at is what tests are we going to run? We have a standard test set that's the baseline of information that we think we need to know to understand the performance of our system and the ecosystem. This standard regression is run as often as we can, but we do have other test sets that may be targeting specific workloads or specific customers or specific software. The second thing you need to know, outside the test set, is what OS are you going to run it on? Here I've highlighted one item, which is an iPXE-booted Ubuntu 19.04. This is run on Packet. We have an internal cloud that we run in our labs, and we test on the Packet public cloud as well because there are different configurations. You can actually go and provision a system on Packet if you'd like; you can provision any OS you want through their iPXE infrastructure, or you can use one of their pre-built OSs. And the last thing that we identify is what type of test we are running. I highlight this because we get a ton of data from our system here, especially across different OS distributions, and sometimes we want to go and compare, we want to change one thing about the software stack or one thing about the software configuration and we want to know what impact that has on the performance, but we don't want to clutter or taint our regression results. So we can run these tests and take advantage of all this automation and infrastructure that we've built without influencing the pass/fail criteria of the actual test set. Again, we're getting more with less; we've already built this, so we might as well use it for some other things, and so we identify the test type.

Once a provision is identified and the system kicks it off, here's our dashboard of what tests were run and what tests are running. In the bottom right, we've got the actual physical system pool, so you need to go and allocate a system, and that system can't be used for anything else, again because we want to keep our test results quality and we don't want to taint them by running two benchmarks at once or two performance tests at once. Aside from a physical system, like I pointed out on the previous screen, you need to know what OS you're running and what actual tests.
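As a rough illustration of what a scheduled run has to capture before anything kicks off (the test set, the OS image, where it runs, the test type, and eventually an allocated system), here is a hypothetical sketch; the field names are mine, not actual TARS internals.

```python
# Hypothetical record for a scheduled run; keeping the test type explicit is
# what keeps one-off experiments from polluting the regression pass/fail history.
from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class ScheduledRun:
    test_set: str                      # e.g. "standard-regression" or a targeted set
    os_image: str                      # e.g. "ubuntu-19.04", provisioned via iPXE
    pool: str                          # "internal-lab" or "packet-public-cloud"
    test_type: str                     # "regression" or "experiment"
    system_id: Optional[str] = None    # filled in once a bare metal system is allocated
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

run = ScheduledRun(test_set="standard-regression",
                   os_image="ubuntu-19.04",
                   pool="packet-public-cloud",
                   test_type="experiment")
```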
So that test set gets expanded; in this case, there are 140 tests that will get run here. The order is randomized because, again, if you run tests back to back to back, you're not entirely sure whether you've warmed up the caches in a specific way or done some things that may impact future testing, so we always randomize the order, and we have some criteria about how we write tests to make sure that we're not changing system-wide configurations that might impact downstream tests. And then we have some pretty comprehensive logging and metrics that we can get at once the system actually begins the provisioning process. Here there are a couple of things to notice. One, there are some red boxes here, so we have some failure points before we actually get test results, and this can give us some valuable information about the system. On the top, in a red box here, the provision actually failed. We have found that early firmware builds can break a bare metal installation process, and that's a problem, but we can catch it before it goes out the door. So this type of automation is not just about the results that you get out of it. Sometimes just by building and maintaining the infrastructure, you can find problems with your hardware or your software stack, and then it gives you a place to go debug them. So the stability of this process is almost as important as the results we get out of it.

Once the system decides to provision, you could do something like this. This is a pretty typical kickstart and iPXE boot script. You can use any kind of bare metal provisioning flow that you want; there are a ton of tools out there that do it. We use iPXE on Packet and PXE internally. When this process completes, you end up with an OS. This example is based on Linux, but it could be any PXE-compatible OS that you'd like. At this point, the system is pretty vanilla. We'll do a package update so we get the latest version of every package, because we are testing software change in the ecosystem, but other than that, we don't do any configuration at this point. The system just boots and it's in a default state. Then we hand over configuration to a piece of software called Ansible, which I think is a pretty common and well understood piece of software. There are a lot of other tools that do this, but we use Ansible. This configures our system to now be able to run performance tests. This configuration step doesn't do anything for a specific test, but we have to install performance monitoring tools and the actual test runner itself and get the tests onto the system, and that's what Ansible helps us do. This ensures that we can have reproducible results. If we find an anomaly and we want to take a system back to the start state so we can reproduce the result, we have the Ansible installation scripts and profiles in source control, and so this again gets us to a reproducible state: I can always get a future system back to the same state that a historical system was in.

We then move on to test execution. We use the Phoronix Test Suite, which is an open source, pretty well established, pretty mature test framework developed by a guy named Michael Larabel. Openbenchmarking.org is where the actual tests and test results get stored if you use it out of the box and you choose the upload option. It's an awesome tool because it gives us a standard for collecting results. So it's got standard test writing formats.
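As an illustration of what that standard output buys you, here is a hedged sketch of a small result parser. The element names follow my reading of the Phoronix composite result XML (Result, Identifier, Scale, Data/Entry/Value) and may not match the real schema exactly.

```python
# A minimal "write one parser" sketch for Phoronix-style result XML.
# Element names are an approximation of the composite result layout,
# not an authoritative schema.
import xml.etree.ElementTree as ET

def parse_results(path):
    rows = []
    root = ET.parse(path).getroot()
    for result in root.iter("Result"):
        benchmark = result.findtext("Identifier")
        scale = result.findtext("Scale")          # e.g. "MB/s"
        for entry in result.iter("Entry"):
            rows.append({
                "benchmark": benchmark,
                "scale": scale,
                "system": entry.findtext("Identifier"),
                "value": float(entry.findtext("Value")),
            })
    return rows   # ready to be loaded into a database for analysis
```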
There's a Phoronix way of writing benchmarks, and it outputs things in a standard XML format. So you can write one parser, and as long as every test that you run is always run through Phoronix, you get results in a really standard way. This helps with shareability and with developing visualizations; you only have to do it once. We do not use the open source, out of the box Phoronix test library. There are roughly 300 tests now in Phoronix. You can just run phoronix-test-suite benchmark stream and it'll go download the Phoronix wrapper of STREAM, run it, and give you results, and for things like FIO it could take four days because it runs 4,000 variations of FIO. There are a couple of limitations there that we had to work around, mostly around reproducibility and communicability. Take FIO, for example: there are thousands of ways to run FIO, and if you go tell a counterpart or a peer or a customer, hey, run FIO and tell me what results you get, well, you actually have to include a shell script of how to run FIO, how to configure the system, and what version of FIO to download, and that can become cumbersome. So what we did was create a set of test extensions that are all Phoronix compatible, so they all run under the Phoronix framework, and they're enumerated. Here's a subset of our FIO tests; take, for example, FIO 26. FIO 26 will forever run with the same version of the FIO benchmark, and it will forever run with the exact same command line parameters to FIO. You could do that in a shell script or in email or whatnot, but it's much easier to go and tell field engineers or performance architects, hey, run FIO 26, it'll tell you how we perform in this vector, it'll tell you if we fixed problem A. So again, it creates a common language and an easier way of sharing performance results across teams, and if you've done any kind of performance testing, one huge bottleneck is communicating exactly how you ran a test, how you got a result, and what it means, and this helps us set that context and create a common language.

Once these tests run, remember they're in a standard Phoronix output, so there's a standard XML output that Phoronix creates, as well as system configuration information, and it gets uploaded off the system, unstructured. Compute is expensive, especially bare metal compute on these full systems, so we get our data off the system as quickly as we can so we can recycle the system and do more testing. We process it and structure it, so we put it into a database so we can actually go and do our analysis, and then out of that comes something that looks a little bit like this. This is a set of a little over a thousand test results from one benchmark, one configuration of a benchmark. The red line here indicates the full test set best fit line, so you can see we have a slope slightly to the top right. This particular benchmark seems to be impacted by kernel updates in a good way, and so it's good to see a positive performance gain. The colors here indicate OS distribution, so you can see some clustering around those OS distributions. We don't compare OSs to one another because, like I said, we're looking at the health of the performance of the operating system, so the OS distribution and version itself creates the baseline, and one OS being more performant than another is not necessarily an issue. It's a design decision, it's a timing issue. When was the OS released? How long is it being supported?
What kernel version is it on? Those things all impact performance. It's not a bug, it's a design decision. One really great example of this is CentOS: out of the box, tuned is set to balanced, which is a very conservative power management profile, and I wouldn't expect performance to be very high, but it's optimized for power management, not for performance, and so those are not bugs. We don't view those as issues, they just are. If you zoom in on a specific OS distribution, now you start to see something more meaningful. This is one OS distribution for the same benchmark, the same performance set, about 150 results, and this line is a little bit steeper in terms of the performance trend. Your human eyes can recognize this cluster in the top right, and mathematically, or with a visualization, you can see it in the slope. I've added three lines here: the last 10, 25, and 50 results, and then the red line is everything, and you can see that most recently we have a pretty significant performance increase. So one of the things I might look at is, well, what changed on the system? That's very important information here, because performance changed, so what caused it? First order of business is to overlay the kernel updates. Luckily there's only one kernel update in this entire set of data, and when you overlay the kernel change, you see that it correlates pretty nicely to the first result in our new baseline, so this is an indication of where the performance change came from. It helps us focus and understand better where performance differences are coming from in the ecosystem, and now we don't waste a bunch of time chasing down things that it's not. So again, it's about data that helps us focus, with less manpower. If this were in the opposite direction it would be a much bigger deal; we like to see performance gains, but this highlights how we might find problems.

Another example of something we might see out of the system is bimodal result sets. I don't have an answer for why this is; I haven't had a chance to debug this particular problem. You can see that half the time our results are around 60 and half the time they're around 30. This was caught because of a very, very high coefficient of variance, which is basically the standard deviation as a percent of the mean, and when we have high standard deviation we view that as a big problem. Consistent performance is almost as important as high performance. There are some workloads where that's not true, but for the most part our customers rely on our performance being stable, and so where we see issues like this, it's a big problem. We would not have caught this if we weren't doing as much continuous testing as we are. This problem only shows up on one test on one OS distribution, and my gut is that it has to do with power management profiles and how they're interacting with the system. So this is an interesting result set that we could see out of all this data.

We take results like this and we correlate them into an OS score. This is a health score. Don't take this to be literal; this is on our test environment that I mess with all the time, and the colors looked good, so I brought it to show the variance that we can see in the ecosystem. We need a good way of taking these thousands of tests, hundreds of thousands of results, and figuring out what to look at first, so we up-level this into a weighted score of OS health, and again, this is based on the OS itself.
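As an aside on two of the checks just described, the coefficient of variance that flagged the bimodal result set and the best-fit slopes over the last 10/25/50 results, here is an illustrative sketch of how they might be computed; the numbers are invented.

```python
# Illustrative versions of two checks: coefficient of variance (std dev as a
# percent of the mean) for flagging unstable/bimodal results, and least-squares
# slopes over the most recent N results for spotting baseline shifts.
import numpy as np

def coefficient_of_variance(values):
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std() / values.mean()

def recent_slopes(values, windows=(10, 25, 50)):
    """Best-fit slope of the last N results for each window size."""
    values = np.asarray(values, dtype=float)
    slopes = {}
    for w in windows:
        tail = values[-w:]
        x = np.arange(len(tail))
        slopes[w] = np.polyfit(x, tail, 1)[0]
    return slopes

results = [61, 59, 30, 62, 31, 60, 29, 63, 30, 58]            # bimodal-looking set
print(f"CoV = {coefficient_of_variance(results):.1f}%")        # very high -> flag it
print(recent_slopes(results, windows=(5, 10)))
```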
There's a baseline that the OS itself creates based on the performance that we see over the first N results, and then as that performance changes, for good or for worse, we can correspond that to a specific score, and we can even break that down into subcategories. So you can see database, IO, language, memory, et cetera are their own categories, which helps us to identify more specifically where problems might be coming from. This also gets correlated to a timeline, so we monitor this over time. This feeds into pattern matching, so we're looking for patterns across the industry as well, which can identify systematic problems. Some of that comes from lack of availability of ARM hardware and Ampere hardware for testing and for building, and we can find problems like that as we see patterns in the data. Again, we wouldn't get there if we weren't testing as often, and on as diverse a software set, as we are. Underneath this is a significant amount of statistics. I won't drill into any of these too much, but one of the things that we're trying to do is identify patterns; you can see most of the data here is grayed out, but we might see a consistent pattern of performance improvements or decreases, and we can pop those out. So again, we're trying to bring our engineers' eyes to the things that we know matter and that we know need human intervention, and we're trying to automate some of the visualization here to pop these data points out in a more useful way. Each result here gets its own score as well, so we check scores on a per-benchmark level.

We do compare across operating systems, but again, not to identify problems; this is more of an educational thing. So if you look at, say, Red Hat versus CentOS: CentOS, again, uses the balanced performance profile, whereas Red Hat will use throughput-performance, and a lot of the impact there is on CPU frequency scaling and where the resting frequency is. We can compare results and get an idea of the impact of those settings and the way that those OSes are configured, because they could be a binary-exact distribution, but the configuration is going to be very different. So the performance profile is one, and then page size: on ARM there are two common page sizes you can use out of the box when you compile your kernel, a 4K page size and a 64K page size. All the Red Hat based distributions use a 64K page size, whereas Ubuntu uses a 4K page size, and that can have a pretty significant impact on some performance profiles and in others no impact at all. So doing comparisons across OSes is more of an informative and educational exercise for us rather than anything determining health or pass/fails.

At the lowest level, for every test result, we can look at charts like this. This is based on perf data. The top row is all cores coalesced into one, and then you can see the 32 cores on this system top to bottom, so we can see where cores are doing work. This is a heat map of IPC, so how much work is each core doing for each clock cycle, and you can see where this benchmark might be looping over something or moving files.
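As a rough illustration of that per-core IPC view, here is a small sketch; how the instruction and cycle counts are gathered (perf, in the talk) is assumed rather than shown, the counts are invented, and the text rendering is just a stand-in for the real heat map.

```python
# Per-core IPC for one sampling interval: IPC is simply instructions / cycles.
# The rendering maps each core's IPC onto a character, a crude stand-in for
# one row of a heat map.
def ipc_per_core(samples):
    """samples: {core_id: (instructions, cycles)} for one time slice."""
    return {core: (ins / cyc if cyc else 0.0) for core, (ins, cyc) in samples.items()}

def render_row(ipc_by_core, buckets=" .:-=+*#"):
    """Map each core's IPC (roughly 0..4) onto a character for a quick visual."""
    chars = []
    for core in sorted(ipc_by_core):
        idx = min(len(buckets) - 1, int(ipc_by_core[core] / 4.0 * (len(buckets) - 1)))
        chars.append(buckets[idx])
    return "".join(chars)

slice_counts = {0: (3_200_000, 1_000_000), 1: (400_000, 1_000_000), 2: (2_100_000, 1_000_000)}
print(render_row(ipc_per_core(slice_counts)))   # one row of the heat map
```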
We're working on similar charts for IO, network, et cetera, so this is really the beginning of this perf analysis, and we're playing with ways that we can look at the underlying performance of our hardware, because you could change a workload in a very subtle way and see where your bottlenecks move, which is a really interesting exercise that can tell you where your architecture might be deficient for a specific workload. Beneath this, and using the same set of data, we can look at something like this. This is a full system view, again based on perf. This is IPC; above zero is user space, below zero is system space, and this gives us an idea of what the benchmark might actually be doing and what the system is doing in response to that. You can zoom in on one of these data points, and it's cordoned off by thread, so you can see the thread names and what they're doing. Again, this is data that we can use to debug problems or get more insight into why things are behaving the way they are. Another aspect to understanding performance changes is obviously the packages and the versions that are installed on the system, so we track the entire set of packages that's installed as well as their change over time. Here, for Fedora 30, you can see we're tracking a number of updates or additions to the package set, and again, if performance changes in a Python benchmark, you could go look and see, well, did any of the Python packages change? So this is another layer of data that we can look at to debug or find issues.

There's a ton more data in the system that I unfortunately don't have time to talk through, but I did want to get through some of the cool techniques that we're using and the automation that we're building. The software ecosystem out there is so large that you can never test everything in every way, but you can start to make a good dent in it, and you can plant flags that give you indicators and use manpower to go and dig into those indicators a little bit more. So as I mentioned before, we are looking for open source partners. If you're part of a code base or an open source development project and you're interested in how you're performing on Ampere and ARM hardware, then reach out to me; I've got cards, or I'll throw my email address up on the screen. Effectively, the way this works is an in-development code base can be integrated into the system, where TARS will test, analyze, and regress it using the same techniques that I showed here, and we'll deliver results in whatever way makes sense for the project. It could be a PDF file or it could be a set of raw data in CSVs. We're very open to open source partnerships, and we think it's a valuable step toward making the software ecosystem stronger, especially in the ARM space, which may not be fully mature yet, and that's one of our priorities. So if you are interested or you want to talk about the system a little bit more, here's my email address, it's travis@amperecomputing.com, I've got cards if you're interested in that, or you can connect with me on LinkedIn. Really appreciate everyone taking the time to come and watch this talk. I'll be around for the next hour if anybody's got specific questions. Thank you. I guess I could take questions now. Sure.
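Going back to the package-tracking layer mentioned a moment ago, here is a small illustrative sketch of the diffing idea; the snapshot contents are invented for the example.

```python
# Snapshot the installed package versions with each run and diff consecutive
# snapshots, so a performance change in, say, a Python benchmark can be lined
# up against any Python package updates.
def diff_packages(before, after):
    added    = {p: v for p, v in after.items() if p not in before}
    removed  = {p: v for p, v in before.items() if p not in after}
    upgraded = {p: (before[p], after[p])
                for p in before.keys() & after.keys() if before[p] != after[p]}
    return added, removed, upgraded

run_42 = {"python3": "3.7.3", "glibc": "2.29", "openssl": "1.1.1b"}
run_43 = {"python3": "3.7.4", "glibc": "2.29", "openssl": "1.1.1c", "zstd": "1.4.0"}
print(diff_packages(run_42, run_43))
```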
Yeah, so the question was whether TARS is a proprietary technology. It is an internal-only system right now. There are components of it that are open source, and our intent is to make a number of components, like the test library that we've been developing, open source. Some of it just runs in our internal data center; the PXE infrastructure, for example, is pretty standard out of the box, and it's behind all of our security and whatnot, so we can't expose that. But I don't know what the future is in terms of making it publicly accessible. We've had talks about it, but there's obviously some sensitivity around performance and benchmarking, especially involving software that's not yet released to the public. What's that? Which components? Yeah, I mean, from the perspective of supporting a wide range of tests it seems applicable to hardware design, but the same methodology and the same test infrastructure can be applied to in-development code, as long as you have a set of performance tests that are intended to test that code. You know, a lot of in-development projects have very simple benchmarks that ship with the project. Python, for example, has import tests, which are basically a baseline test for how well the Python build imports other Python libraries, and it's a fairly simple test. As long as you have a set of performance tests that give you an indication of the full library's performance, then I think this kind of an approach adds value regardless of the size of your software project, because inevitably you care about performance at some level, and you're going to be gated by the hardware, the architecture, the compiler, et cetera, and understanding that in different environments adds value regardless of size, not just for chip development companies. We just care about automating this problem in a big way, because the number of software projects that we interact with is effectively everything that runs in the data center, and so we have to automate this, but I do think this methodology applies to any level of software development. Yeah, well, there's bare metal behind everything, right? I mean, there's no such thing as truly serverless. So, yeah, thank you.
So yeah, it's a really great question. Most of the data that informs our chip design and our architecture comes from the perf data. When we look at a workload and we look at silicon architecture, we look at where the bottlenecks are, and that can inform our cache architecture, or how many cores we need to support the common workloads in the data center. So I think a lot of that low-level perf data, especially around a fully baked workload, iperf is a good example for networking, or you run a web server on it, WordPress or a LAMP stack or something like that, and you want to understand where your bottlenecks are, are they in compute, are they in IO, are they in networking, those types of things are probably what inform our CPU development more than anything, not the raw performance numbers. It's nice for us to know where performance may be failing, but that's more of a software problem; when we answer the question of why performance is failing, you need a specific answer, right? Like, I'm evicting too many pages, right? And so is that a cache problem or is that a software problem? I mean, there are a million possible answers, but the low-level perf data is usually what informs our hardware architecture. And the second part of your question I forgot. Oh, to contribute. So this particular project is fairly new in terms of actually being mature enough to glean meaningful results from; I would say we're three to four months old in terms of getting real high-quality data out of it. Most of what we do in terms of communicating with open source projects is, if we find a problem, we typically work with them to find a global fix. We don't target anything Ampere-specific, except maybe driver support or whatnot; it's all really ARM ecosystem focused rather than Ampere-specific. When we go work with an open source project and we've identified something, it's never a "can you fix this for Ampere," it's "hey, here's a problem when you build with the ARM target," or something like that. Those are the kinds of fixes or optimizations that we might go after, nothing specifically for Ampere in those contexts. I don't think anybody wants that; everybody benefits when the entire ecosystem gets better, and I don't think it's good for us or for the software project to go and start ifdef'ing Ampere into various things. But yeah, yeah, exactly, and I think there have been some conversations going on in our ecosystem about getting developers onto ARM equipment. There's no replacement for doing development on the native architecture, and so I think there's a lot of work going on, and we're working with ARM and others to try and figure out how to better get developers access to ARM hardware, so that you can develop and test natively on ARM hardware, which is much better than cross compiling or doing it as an afterthought. Thanks, Peter; he's our developer advocate. So yeah, you need a load generator. In our standard rack deployment for the test environment, ignoring the rack management that sits on top, we pair systems one to one with a load generation system. It's always a different architecture, right? You don't want to test yourself with yourself, or you could end up not catching the same problem. So we typically have a load generation system paired one to one with a system under test, and that load generation system has all the relevant software on it, with a direct
100 gig network connection to the system under test, so you can drive basically any kind of workload, like an iperf workload or a web server workload, stuff like that. So we have a one-to-one pairing. If you need more than one load generator or more than one system under test, then we have a special lab in Raleigh that handles that on a more case-by-case basis, where they might set up one-off configurations that are more custom in that way. But for two-system tests, where you only have one system under test and one load generation system, our standard rack deployment has that: every system under test always gets paired, not with the same physical unit, but they're copy exact, so every test that gets run with a load generator runs against the exact same load generator setup across the board, so we'll never get skewed results in that way. And we do monitor to make sure that our load generation is not peaking, that it's not maxing out the entire load generation system and maybe bottlenecking; you don't want bottlenecks in your load generator. So we do monitor that, and we have some real-time metrics that come out of it, but we don't track and plot to the same degree as our systems under test. We do a pretty substantial amount of testing on the load generation elements themselves, so we understand the limitations of our load generating servers and we make sure that we don't exceed those on any test. So we do a lot of work to ensure that the load generation itself is never a concern; I don't actually remember a time where we've narrowed anything down to a problem with the load generation itself. That's a pretty stable element. Cool, another question. We also have some customers and partners that we work with on a regular basis who will give us their workloads, and those may be based on open source projects that they use, but it may be their sort of secret sauce configuration of that, how they go and configure that open source project. In that case it's a partnership, sort of a private relationship, but we may collectively go to an open source project and inform them of a problem with ARM, or hey, if you compile with this flag instead of that flag you'll get an additional 10% performance; we do things like that. But a lot of it is us going after software that we feel is important. We are talking with a couple of people right now, it's not public yet, but we're talking with a few people now who are asking us to help them do this kind of regression versus them building the infrastructure themselves, because we've done a lot of heavy lifting here. So there are cases where people come to us, but again, this is a relatively new project, so right now, I think this is the first presentation on it, just trying to get the word out there about the kind of things that we're doing. We hope it turns into a two-way street where we're going to open source projects and open source projects are coming to us; that would be an ideal state.