Okay, so I'm Shelly, and that's about as much of an introduction as I'm going to do right now. I'm here twice today; the other talk is at 5:20. I really do thank everyone for being here, and for the opportunity to come and talk to you. I think about verification, and I think about it a lot. This talk really is about how I think about verification. I want to make it better than the second-class citizen it so often is: "We have to run tests. I guess we'll run tests." I want to come at it from the proposition of: how do we make testing better? How do we make it easier? What can we do as an open community to do this?

Really, these are early days for me looking at deep learning from that perspective. I'm not an expert. I know there are some at this conference, whom I'll be seeking out; if you're one of them, I'd love to talk to you after. Over the next year, essentially, myself and the team I work with are going to be running some experiments and determining feasibility. A lot of this talk is about the early stages of that, and the preparation for it: reviving some old projects we worked on as a team in the past that got to a certain level and then had to be parked, because we didn't have the right technology to move forward with what we wanted to accomplish. All of it with the goal of making tests better. It takes a lot from Andrew Ng's notion of the virtuous circle of AI: how do we apply a technology to make the thing we're working on better? That's the virtuous goal here.

I'm not going to spend a whole lot of time on what deep learning is, because that's something you can go out and find for yourself if you haven't already. It's so loosely based on the function of the brain at this point that the analogy is probably no longer very useful, but let's go with it anyway. Through all of your senses you take in inputs; those feed into a neural network where, at each node or neuron, decisions are made to produce an output. The same goes for an artificial neural network that you build to model or represent your real-world problems.

There are a couple of modes, a couple of ways of going at this. Commonly it's a supervised model: you train a model by giving it data, you check whether it performs on the training set about as well as a human would on the task, and then you update the model so that it progressively learns and gets better. There's also an unsupervised mode. It's math heavy, which is a lot of fun, even if it's been a while since you've visited that math. I'm not going to spend lots of time on this because we're running short on time, but essentially functions are applied at each neuron in the network, and depending on what type of decision is being made at those points, some functions are better to apply than others.

The other thing to know is the comparison: how did a human do at this task, versus how did the model do on your training set, versus on the development data set you're using? You measure those errors and compare them to know whether you have high bias or high variance, and that tells you what your next step is to make the model better (there's a minimal sketch of that comparison just below). But typically in modern deep learning, a lot of experts will say that the way you make it better is simply to throw more data at it. That matches the human experience: most people learn from experience, and the more experience you have, the more you typically know.
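As a minimal sketch of that error comparison, with invented numbers rather than anything from a real model:

```python
# Minimal sketch of the bias/variance check described above.
# The error numbers are illustrative, not from any real run.

def diagnose(human_error, train_error, dev_error):
    """Compare error rates to suggest what to work on next."""
    avoidable_bias = train_error - human_error   # gap to human-level performance
    variance = dev_error - train_error           # gap between training and dev sets
    if avoidable_bias >= variance:
        return ("high bias: try a bigger model, train longer, "
                "or rethink the architecture")
    return ("high variance: gather more data, add regularization, "
            "or simplify the model")

# Example: humans triage a failure correctly 98% of the time (2% error),
# the model gets 7% error on its training set and 15% on the dev set.
print(diagnose(human_error=0.02, train_error=0.07, dev_error=0.15))
# -> high variance: gather more data, ...
```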
However, I do want to advocate for beginner eyes on things. We've heard a lot today about people coming in, giving their background and how long they've been around. I could do the same, but I'm not going to, because I want this community, this open community, to be open to people who are new and who have new ideas. I really, really think that's how we keep a vibrant community. But that's an aside.

Really, the process is: select the problem, determine some features, gather your examples, and then train, adjust, train. Where do you know you could apply deep learning? The guidance is that anything a human can do with about a second of thought can be automated with AI. But there are other criteria that make this tool easier to apply. Where are you data rich? In our case, where do you have ideas you tried before, or solutions to problems you had, but couldn't get far enough with because you couldn't process the information or reach the solution you wanted? Where you have parked ideas might be the right place to apply it. And I'm a big advocate of thinking about the output of that model: what are you trying to predict or to know? For me, in a verification world, those outputs really have to drive next actions. Okay, so I could categorize a test failure; I could use a deep learning model to categorize and help triage tests. But then what? For me, it's about fitting it into the whole flow of how we work.

So what can we feed it from a verification space? This list was already on someone else's slides. There's a bunch of static material we can pull: static analysis and code reviews. There's a bunch of dynamic data we can pull. I'm very involved in testing at the AdoptOpenJDK and Eclipse OpenJ9 projects, and we have copious amounts of data; in fact, this talk is partly to let you know that, and we're going to try to prepare that data so it's available to people doing machine learning projects. We have some collaborations coming up this year with universities. And then there's the peripheral stuff: how long did a test take to run, what schedule are you on, was the machine recently updated? A lot of different things.

Okay, a quick audience participation exercise if everyone's willing. Looking at this failure, which category would you say it belongs to? I didn't include subcategories, but is it an infrastructure problem, something on the machine? Is it a problem with the test material, or is it a problem with the JDK? Good, good guess. All right, how about this one? Probably some people can't see it: the error is that the stack size specified is too small, specify at least 328K. And one more, my favorite demo one: wrong value for the Java vendor property, it says AdoptOpenJDK but expected Oracle Corporation. This one is actually in the OpenJDK regression test suite, and I will call it a test bug. That was the one-second-of-human-thought exercise; there's a small sketch below of automating that kind of keyword triage. There are also plenty of defects found in testing that take longer than a second, where you have to drill down into more data.

But let's quickly talk about what's happening in my world. At AdoptOpenJDK, we test four of those five implementations, and within my role at IBM as test lead for IBM Runtimes, we also test an IBM SDK. We test many, many versions of Java.
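Here's a minimal sketch of that one-second categorization done with keyword rules; the categories mirror the ones above, but the keyword lists are illustrative, not the actual AdoptOpenJDK triage rules. A trained model would replace the hand-written rules, but the input and output have the same shape.

```python
# Toy keyword triage: map raw failure text to a coarse category.
# Categories and keyword lists are illustrative only.

RULES = [
    ("infrastructure", ["no space left on device", "connection refused",
                        "agent went offline", "hostname lookup failed"]),
    ("test material",  ["stack size specified is too small",
                        "wrong value for java.vendor",
                        "test timed out waiting for setup"]),
    ("jdk",            ["segmentation fault", "hs_err_pid", "core dumped",
                        "internal compiler error", "assertion failed in jit"]),
]

def categorize(failure_text: str) -> str:
    text = failure_text.lower()
    for category, keywords in RULES:
        if any(k in text for k in keywords):
            return category
    return "needs human triage"

print(categorize("The stack size specified is too small. Specify at least 328K"))
# -> test material
print(categorize("wrong value for java.vendor property: AdoptOpenJDK, expected Oracle Corporation"))
# -> test material
```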
Obviously some of those Java versions are now parked, but we also test the Loom, Panama, and Valhalla branches, so we count those as versions too when we're adding this up. And then platforms: each of these implementations supports a certain number of platforms at the moment, so I'm factoring that in as well. It's loose math, but it means we're typically running about 87 million tests a day. And that's before you throw in the variants; we haven't turned on variant testing, and I'll talk about what that is. We don't run all of them in the open yet because we don't have enough machine resources, so this is a call-out to any sponsors who want to donate; essentially we could ramp that number up a lot with more machines.

All right, a quick revisit of some of the past work we've done. We played around. I challenged our team, which is quite a small team, to do something fun: let's not just do the triage work, let's not just grind away in the test and verification area, let's think about how we can make our jobs better, make things easier, find more bugs, and know that our tests are good. Some of what we did is actually open sourced now, and some of it is parked material. Under the open-source category, we have all of the code for the test result summary service and a comparison tool opened at AdoptOpenJDK; we still need to fire up the live instance of it, and I can show you a quick demo of what it looks like when you run it. We also have a benchmark engine tool open sourced there (I forgot to put a circle around it), which is really about making it easier to run performance benchmarks and look at their output. You can probably see where I'm going with all of these services: we take raw data, munge it a bit, and make it easier to consume and easier to feed to other services.

These two, though, we never really found a place for in terms of feeding them into something else. They're very interesting; we were looking at them as a static set of things. One said: I can predict where bugs live in this source code, based on a research paper, and I can have a tool that rates all of the files in your repo and gives each a score, where the higher the score, the more likely there are bugs to be found in that file. The other said: when we crash, either in testing or when a customer or user reports it through a GitHub issue, we can actually take that core file, browse through it, and easily see the data stored in there, and it's rich in data.

All right, looking at the core analytics quickly; let me see if my network is still holding. What we wanted to do in this case (I'm not showing all of the data right now) ties into one of the things most interesting to me, because we were doing a lot of combinatorial test design. That is to say: think about how many different inputs you want to put in, and how they can be varied, to best give you functional coverage with your tests. I'm not talking about code coverage here, which says I stepped through this many lines of the code; code coverage is really only good for showing you where you have huge gaps. Functional coverage is what you really want: I can step through a line of code and still not find the bug in it, because I didn't put the right inputs into that method (there's a tiny example of that right after this).
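A tiny, made-up illustration of that last point (toy code, not from any real test suite): the test below executes every line of the function, so line coverage reports 100%, yet the bug survives because the interesting input was never passed in.

```python
# Made-up example: full line coverage, missed bug.

def stack_size_kb(option):
    """Parse a -Xss style value like '328k' or '1m' into KB (toy version)."""
    value, unit = option[:-1], option[-1].lower()
    factor = 1 if unit == "k" else 1024        # bug: silently treats 'g' like 'm'
    return int(value) * factor

def test_stack_size():
    # Every line above executes, so line coverage reports 100%...
    assert stack_size_kb("328k") == 328
    assert stack_size_kb("1m") == 1024

test_stack_size()  # passes

# ...but the input that exposes the bug was never used:
# stack_size_kb("1g") returns 1024 (treated like '1m') instead of 1048576.
```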
But in this case, my interest for this project, for this viewer, and for how we take it and apply it in our pipeline, was to ask: across this data set, which is just under 400 dump files, what are people actually using? When I go and browse through it, I can see that certain values were used, but having these command-line options as individual inputs isn't actually that interesting to me. What's more interesting is what sentence of these words was passed into the system: what combination of inputs was used? (There's a small sketch of generating those combinations at the end of this section.) The other thing we added, and I don't even think it's running on this instance right now, is search: if I want to search through a data set, can I put in a set of command-line options and find the resulting files that were using them? That alone can give us numbers to guide us in testing. It can say: if these were reported from users, we now know that many more users were using this combination of inputs, so we had better have those inputs in our testing. And that goes back (oops, sorry, that's a different one) to the presentation and a couple of the other services I mentioned.

Okay, so bug prediction; I'm just going to mention this briefly. All the references are at the back of these slides, but essentially the theory, and they've shown it holds, is that the more recently a source file has been changed because of reported defects, the greater the risk of more defects in that same file. There's an algorithm you can use to calculate a score for all of the files. This data is from back in, I think, 2015, but you can see that if we wanted to focus on improving testing, we would start adding functional tests around certain files. We can browse by component across the different pieces of the JDK to see where the whole thing needs to be ramped up. All right, I'll show the view afterwards when I move over.

And this service, which was originally called the Modes Dictionary Service: how many modes do you want to be testing with? It's not a pretty service, but what it really does for us is that as we run tests, it adds the modes to a dictionary, so that we know certain tests are using certain inputs. Why do it this way? We could just have a flat file that someone maintains, or a script that does this, but this way the entire development team can test free-form and we still know what's been tested, and if a crash or an exception happens, we can start to correlate it with some of this. So, again, this was an experiment for us. The other thing it turns out to be useful for, as I'm finding out now, is to help feed a deep learning service; that's where it would fit into our story. We have a bunch of services taking raw data, refining it, and making it available to other services. In particular, when we create our models and have them running, the tests running in the open-source projects, the Jenkins servers, all of that raw material can be collected, prepared, cleaned, and then presented in a refined way to other services. That is, in fact, how the test result summary service works.

So, areas of interest: what problems do we want to try to solve with machine learning? Obviously, I have an interest here. We had a prototype, in that picture you saw, which is a test generation service.
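On that combination-of-inputs idea, here is a minimal sketch of pairwise combinatorial test design, assuming a made-up set of option values (these are illustrative, not the real AdoptOpenJDK variant lists). A greedy pass picks full combinations until every pair of option values has appeared together at least once, which needs far fewer runs than the full cross product.

```python
from itertools import combinations, product

def pairwise_suite(factors):
    """Greedy pairwise covering: pick full combinations until every pair of
    (factor=value, factor=value) across different factors appears at least once."""
    names = list(factors)
    uncovered = {((a, va), (b, vb))
                 for a, b in combinations(names, 2)
                 for va in factors[a]
                 for vb in factors[b]}
    suite = []
    while uncovered:
        best_row, best_gain = None, -1
        for values in product(*factors.values()):
            row = dict(zip(names, values))
            gain = sum(1 for (a, va), (b, vb) in uncovered
                       if row[a] == va and row[b] == vb)
            if gain > best_gain:
                best_row, best_gain = row, gain
        suite.append(best_row)
        uncovered = {((a, va), (b, vb)) for (a, va), (b, vb) in uncovered
                     if not (best_row[a] == va and best_row[b] == vb)}
    return suite

# Illustrative option values only -- not the real variant lists.
factors = {
    "gc_policy":       ["gencon", "balanced", "optthruput"],
    "compressed_refs": ["on", "off"],
    "jit":             ["default", "count=0"],
    "platform":        ["linux_x64", "aix_ppc64"],
}

suite = pairwise_suite(factors)
full = len(list(product(*factors.values())))
print(f"{len(suite)} combinations cover all pairs (vs {full} in the full cross product)")
for row in suite:
    print(row)
```

This greedy pass isn't guaranteed to find the smallest covering suite, but it shows the shape of the reduction.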
With that test generation service, the idea is not just: can I look at a signature of a method and generate a stub test? It's also: can I generate inputs based on combinatorial test design? We have a working prototype of that, but there's a lot more that can be done now that I understand where we can take some of these pieces of data and feed them into it. One of the very interesting projects we're embarking on this year is with Professor Leather at the University of Edinburgh, doing fuzz testing to verify compilers. As was mentioned in an earlier talk, that also turns out to be really good for identifying security vulnerabilities, because really what we're doing there is trying to crash, break, and bring down the JVM. And of course, bug prediction. I've only mentioned that one little service, but obviously there are other things that can indicate whether a bug is likely in some piece of code. That's of interest to me partly because we're very short-handed: human triage is painful and awful, and we really can elevate this story with a lot of the new tooling that will come out of this. For me, that would be very useful in my day-to-day work. Almost all of my time is spent in open projects, with very few people. I invite the entire room to join us; I really would love to work with all of you. But I also don't want anyone to have to do the drudge work of triage. I want to help get us to the next level so no one has to bear that burden themselves. Obviously there's an application in analyzing performance too. We do a lot of performance testing, and it would help to have tools that can predict whether a pull request will improve performance or not, and possibly by how much. And then there are smaller, side things like optimizing machine usage: if we have a lab, a farm, where we're running a bunch of tests but not utilizing the machines very well, this is a solved problem; there are models we can apply.

All right, got to stay focused here. This isn't a complete model, obviously. It's really just me saying that there are a lot of inputs we already have. There are the command-line options, which I'll call the variants that were used. There's the failure expression: how did it fail? Did it hang? Did it crash? If it crashed, I actually have a core file, so I can go and grab the GC mode, the active threads; there's a lot more I can pull if that was the expression of the failure. There's age: if we didn't get to triaging it on day one, how long has it been around? That helps me work out which pull requests are in the thing I'm testing. We know the version, implementation, and platform. We know a lot that can be fed in. And if you think about how we triage something, we take a lot of that information every day to try to decide what the next best action is, so in this model, maybe that's an output. Just as a reminder, from one single data set you can derive many models. One output you might want is categorizing the defect, or even naming the next best action. But another thing I would like to know is a rating of the value of the test: how useful was that test? Because there are some tests that, for us, are duplicates, and some tests that maybe aren't as effective as others. Being able to see that is a very powerful thing; there's a small sketch right after this of deriving more than one of those outputs from the same encoded data.
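A minimal sketch of that one-data-set, many-models idea, assuming triage records shaped like the inputs above. The records, feature names, and labels here are invented, and scikit-learn's DictVectorizer plus a decision tree stand in for whatever model actually gets used.

```python
# Toy illustration: one encoded data set, several derived models.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented triage records -- in practice these would come from the
# test result summary service / Jenkins artifacts.
records = [
    {"variant": "-Xgcpolicy:gencon",   "expression": "crash",  "age_days": 1,
     "version": 11, "impl": "openj9",  "platform": "linux_x64"},
    {"variant": "-Xjit:count=0",       "expression": "hang",   "age_days": 4,
     "version": 8,  "impl": "hotspot", "platform": "aix_ppc64"},
    {"variant": "-Xgcpolicy:balanced", "expression": "assert", "age_days": 2,
     "version": 11, "impl": "openj9",  "platform": "windows_x64"},
]
# One label per record above, for two different outputs.
categories   = ["gc", "jit", "test material"]
next_actions = ["open gc issue", "rerun 100x with same options", "fix the test"]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)          # same encoded features for every model

category_model = DecisionTreeClassifier().fit(X, categories)
action_model   = DecisionTreeClassifier().fit(X, next_actions)

new_failure = {"variant": "-Xgcpolicy:gencon", "expression": "crash",
               "age_days": 3, "version": 11, "impl": "openj9",
               "platform": "linux_x64"}
x = vec.transform([new_failure])
print(category_model.predict(x)[0], "|", action_model.predict(x)[0])
```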
So, back to the upcoming projects I already mentioned. This one is kind of cool, because creating a good fuzzer by hand is time-consuming and very slow. What you really want to end up with is generating nearly valid code; I'm talking about a generation-based fuzzer here, not a mutation-based one. In fact, Professor Leather has done this for OpenCL already, and we're essentially going to bring it into the Java space. The notion is that we can have deep learning go and scrape GitHub repos for Java code and learn to build nearly valid code, code that's really good at finding compiler problems. That's what we're embarking on this year with that collaboration, and I won't go into much more on it.

Sorry, what I also wanted to do is quickly show this, because I probably won't have a chance to later in the other talk: this is actually a running instance of that test result summary service that we have. There are really two components to it. On the front, what you get is an aggregate view of test results, taking all of those different platforms, implementations, and tests and presenting that data in different ways. But we can also grab all of this refined data and feed it into models. And then, of course, it has a component for looking at performance benchmarks a little more easily.

All right, last part of the talk: what are the plans going forward? I've mentioned that most of my time is spent working in the open. I'm focused on testing AdoptOpenJDK binaries, and I also help with the functional testing at the Eclipse OpenJ9 and Eclipse OMR projects. We've applied a lot of this kind of learning and testing to those projects, and we have a constant desire to continuously improve and invest in testing. That goes all the way from finding better tests, to writing better tests, to measuring them to see what we can cull. It might be one thing for me to say, yes, we run 87 million tests a day, but in a way I should be ashamed of that; my job should be to figure out how we can reduce that number by half and still have full functional coverage. I can put those challenges out for myself and my team, and we are constantly looking at ways we can improve.

So we're building our skills, especially in the deep learning arena. We're laying the groundwork for that deep learning service: we're preparing data and encoding it so it can be fed into models and have math applied to it, if you will. And we're going to observe and measure. We can do this already; there's no technical reason not to start gathering the data, observing, and measuring. It's easy enough to tell the service to pull artifacts directly from Jenkins (there's a small sketch of that at the end of this section). It's easy enough to then serve the model up so that, once it's trained, you can feed it live data and get a prediction. And what do we do with that prediction? We can have the next service tell us what the next action is. If we predicted that it was a bug in the JIT, well, I know exactly what a JIT developer is going to come and tell me: can you rerun it 100 times with this set of options? They don't have to come and tell me that now; we can actually automate that.

So there's all of that. My goal is to leverage all of the useful stuff into the open projects. I really don't care to do anything not in the open these days. Those open projects are so much fun to work in, but they're also the place where everything's happening. Everything interesting, at least in my world, is happening in the open.
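As a rough sketch of that Jenkins step, assuming a job that publishes JUnit-style test results: Jenkins' standard REST endpoints serve build metadata and the test report as JSON, which is enough to start assembling records like the ones sketched earlier. The server URL and job name below are placeholders, and the fields kept are just a plausible subset.

```python
# Sketch: pull a build's test report from Jenkins and flatten it into records.
# JENKINS and JOB are placeholders; add auth and error handling for a real server.
import json
from urllib.request import urlopen

JENKINS = "https://ci.example.org"          # placeholder server
JOB = "Test_openjdk11_sanity.functional"    # placeholder job name

def fetch(path):
    with urlopen(f"{JENKINS}/job/{JOB}/lastCompletedBuild/{path}api/json") as resp:
        return json.load(resp)

build = fetch("")                 # build metadata: result, timestamp, duration...
report = fetch("testReport/")     # JUnit report, if the job publishes one

records = []
for suite in report.get("suites", []):
    for case in suite.get("cases", []):
        records.append({
            "test":     case.get("className", "") + "." + case.get("name", ""),
            "status":   case.get("status"),     # e.g. PASSED / FAILED / SKIPPED
            "duration": case.get("duration"),   # seconds
            "build":    build.get("result"),
            "when":     build.get("timestamp"),
        })

failed = [r for r in records if r["status"] == "FAILED"]
print(f"{len(failed)} failures out of {len(records)} test cases")
```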
We'll continue to bring more testing into the open, and we'll also continue to refine and revise it. I think that's essentially where I'll end. These are some of the papers and other material I referenced quickly, so if people are interested, they can go check them out. Do people have questions? Sorry, there's no time for questions. Okay, sorry. All right, thanks everyone; find me later.