The Future of Reproducible Research Powered by Kubeflow, for KubeCon and CloudNativeCon Europe 2022.

About me. First of all, thank you for coming to my talk. As is customary among our people, the second slide is about me and a quick highlight reel of my credentials. My name is Trevor Grant. I am from the Humboldt Park neighborhood of Chicago, and this background image is the skyline from the top of the sledding hill in the park. You get this cool effect when the sun goes down and all the buildings turn orange. At any rate, I thought it was cool.

First and foremost, I'm the PMC chair of Apache Mahout. And while not the second most important thing, a fun fact is that I'm also on the project mentorship committee of Apache Streams and Apache Community Development, and I'm a member of Apache SDAP (incubating). All cool projects, please check them out. From all of this Apache activity, you might have guessed that I'm also an Apache Software Foundation member. But wait, there's more: I'm also an author of Kubeflow for Machine Learning: From Lab to Production, available from O'Reilly Media.

And lastly, since I believe our job shouldn't define us — well, I guess I don't have my jobs listed; I've got a couple of my side hustles there at the bottom. On some versions of the schedule I'm listed as working for Arrikto, but I am not anymore. They are a very cool company and I have nothing but nice things to say about them, but we have parted ways. Now I'm into importing various family and cargo trikes from China, via Alibaba, to Chicago. Check out the R&D section of my website — there's a post about building your own e-trike from a kit, and I'm hoping in the next day or two to have a blog post about how to rig up a lights-and-horn kit. I'm also working on dash displays with Raspberry Pis. You can check out the website, the blog, and the Redbubble store, but the key thing is I'm doing this to get cars off the road, not to make money. So if you want to partner on something, hit me up. If you want to see all my work and launch the next Tesla of e-bikes, cool — I mean, maybe give me a nice cushy job as an engineering consultant. But mainly I'm just stoked to get greenhouse gases decreasing and cars off the road.

So a good question is: how does this relate to Kubeflow? And the answer is, it doesn't. But my main hustle the last couple of months has been this little guy — his stage name is Merlin. And since I don't need any internet randos stealing his identity, let's just say he was born around eight weeks ago. Why am I showing you this? Well, obviously the main reason is to get some oohs and aahs and otherwise endear myself to you, the audience. But also, normally I make jokes in my talks. I took a stand-up class with a friend in early February with the express plan of writing jokes, but I didn't get them written and I didn't get to test them out at open mic nights. And you might be thinking: what's that? I thought you were just naturally funny. I'm not. I put a lot of work into making funny jokes for my talks. A couple of years ago, I did a talk for ODSC East, and I had a banging tight five that went with that talk. If any of y'all are thinking about doing talks — which you should, no matter what level you're at, all the way from junior dev to C-suite, you should be giving talks — my best advice is to have five minutes' worth of jokes you can do at an open mic, and then also do them at local meetups. Okay.
At any rate, the point being, I didn't get around to writing any amazing jokes. But here's what today's talk is going to look like. I'm a little over a week late turning in the recording, so some to-dos slipped into production, like me coming up with a fun new title for this slide. And you might have noticed that I didn't really stylize this deck very much. But okay, the overview of today's talk: we're just finishing the introduction now. Next, we'll lay out some motivation on why all this content is important, through clever use of memes and Wikipedia articles. We'll provide an example of a pipeline we published for an article in a peer-reviewed journal. And then I'll berate you with some calls to action, and possibly we'll have time for conclusions and Q&A. One thing I know I am horrible at is estimating time, so I would guess there will either be no time at the end or there'll be way too much time. There's only one way to find out: stay tuned. Also, I should be in a chat room that you'll have access to, answering questions, so don't feel compelled to wait till the end.

So, motivation. A hilarious article — you can click the link; I'm not really sure how all this will work, but I think the slides will be available somewhere too. This Jeff Leek person said new methods and analyses without software are just paperware. And I love that. But now let's progress on to memes. In theory, you're already motivated by this topic: as a KubeCon attendee, you've passed up so many other high-quality talks to see this one, or as an internet rando, you found this on YouTube and are watching it instead of practicing accordion or whatever other fun thing you could be doing. But in the spirit of telling a compelling story, I'm still going to give some motivation in internet meme form. Here's our first meme. A quick Google image search for reproducible research memes will indicate how serious a problem the lack of reproducibility is. Or, if you like to consume information via journal articles as opposed to internet memes, here are some other articles I've stumbled across to illustrate my point. Nope, the Wikipedia one doesn't count. However, what I will be doing is using the Wikipedia article as a framework for a high-level overview of the problem.

So, the replication crisis: what is it? To save you having to go read the Wikipedia page, let me give you a quick summary. This is super useful, as the problem of reproducibility is larger than the remedy that Kubeflow provides. Psychology and medicine have the biggest problems with reproducibility, but other social and natural sciences are also affected. We'll talk about medicine in a few slides, but the punchline is: if you aren't convinced this is an issue, you can Google "replication crisis" and see tens of thousands of articles, videos, blogs, and memes about why it is.

So first, let's talk about some of the causes — why we have the crisis. Again, I'm basing a lot of this on the Wikipedia article. The commodification of science: what does that mean? Philip Mirowski argues in his 2011 book, Science-Mart, that science is for sale on a market like other goods, and as it has commodified, quality assurance has collapsed. That is to say, as private companies push to fund science, they're really not investing enough in their QA practices. But it's not just capitalism driving bad science.
Academia, which is notoriously insulated from capitalistic pressures, still pushes faculty to publish or perish — that is, publish papers or look for a new job — and there is publication bias against reproducing prior work, which is probably why grad students end up getting stuck doing it. More on this later. Then there is the case of straight-up fraud and deception. This includes researchers not being blinded to the control versus experimental groups, as well as cherry-picking results and all the other things that lost you points in high school. But most hilariously, there was a survey done among 2,000 psychologists in 2012 where about 90% of the respondents admitted to using at least one questionable research practice in a published work. But — and this is the hilarious part — the survey itself used some questionable practices. So take the whole thing with a grain of salt. Or maybe that's really the gold standard of making my point: even studies about questionable research end up using questionable practices.

Then there are statistical issues. When I was in grad school, I had a part-time job tutoring for a statistics class being taught at the local community college. The professor had students attempt to reproduce results from some study. This girl came in; she was working on an article related to physical therapy, since that was the field she wanted to get into, and she was having a hard time getting the answers the authors came up with. To short-circuit a long story, of which I've forgotten a lot of the finer details anyway: she couldn't replicate the result because she was using the correct standard deviation formula and the authors had used the wrong one — the population formula — in their work. And since their results would have been significant either way, I don't think they were being fraudulent. I think they were just bad at statistics. Other times, statistics can cause issues when there are only a handful of people being studied. That creates an issue called low power: even if an effect is there, you won't see it, since you don't have enough people or subjects to look at. And then you also have the base rate of hypothesis accuracy, which was a new one on me, but it makes a lot of sense. Say someone rejects a null hypothesis at a 95% confidence level. There's still a 1-in-20 chance that was just the luck of the draw of the test subjects. That's also why your statistics teacher always harped that you never prove anything with statistics; you just fail to disprove, also known as failing to reject the null hypothesis. But we grow up and we forget these little semantic pearls of wisdom.

So, the primary consequence of concern when someone produces an incorrect result and no one detects it is the risk that the incorrect result will be canonized and other results will be built on top of it. But there are other consequences too. As far as political repercussions go, I can speak more to the issue in the US than in Europe, but anecdotally it's common among climate change deniers, tobacco and automotive lobbyists, and others. The way the story goes: there is a legitimate reproducibility crisis, but some actors will exploit it and say, for instance, that a study from the '60s and '70s showing car pollution was a primary cause of acid rain can't be reproduced because the data set is no longer available. Ergo, the study can't be reproduced; ergo, it's invalid; ergo, car pollution doesn't really cause acid rain and never did.
I think you can see some of the logical missteps there, but that's the short version of it. Now, politicians, in theory, in a republic, are supposed to mirror the concerns of their citizens. So, in a very tangential thread, we also see consequences in public opinion and perceptions, which in turn also affect policy. Say someone starts crowing about how a study showing that the sky is blue can't be replicated. So people start thinking that the sky isn't really blue. Then "the sky isn't blue" spreads all over Twitter and Facebook, and the 24-hour news cycles pick it up, with some networks claiming that science has proven the sky isn't blue, and other networks claiming the sky is always blue and clouds and nighttime don't exist, and anyone who thinks otherwise is a long list of mean words. Things really spin out of hand. We see this play out, unfortunately, almost weekly, it seems, anymore. But the name-calling doesn't stop with the talking heads on CNN and Fox News. The replication crisis gained the most attention in psychology. A professor from Princeton said that anyone who calls out research that can't be replicated is a "methodological terrorist" and that criticism should only be expressed in private or by contacting the journals directly. In essence, her response to a serious issue in her field was to start crowing about how no one should talk about the issue — which, I mean, is a solution, but not a great one.

So, according to this Wikipedia page, there are four major buckets of potential remedies. The first bucket is reforms in publishing. Remedies in this bucket include metascience, which is the study of science itself; presentation of methodology, not just results (our solution kind of fits along those lines); results-blind peer review; and preregistration of studies — that is to say, you explain how you're setting up your experiment and the journal says yes or no to publishing the results before you've actually done the experiment. And finally, some folks suggest using something like Google Scholar to track how often studies have been replicated and what the results of the replications were.

The next bucket is statistical reform, which includes using smaller p-values. Going back to the odds at a 95% confidence level: you have a 1-in-20 chance of just randomly selecting people who will show an effect exists when it doesn't. So they're saying, okay, instead of 5% being the gold standard, 1% should be the gold standard, which would mean there's only a 1-in-100 chance of seeing the results by chance. (There's a quick simulation of this right after this section.) Another remedy would be to do away with p-values altogether, and also with words like "significant" and "non-significant," since most folks don't really remember their Stats 101 course or what those words mean. And the last statistical reform mentioned is to use larger sample sizes, which is always a good goal for us. There are probably a lot of big data users in here, so that's usually not much of an issue, but there it is for completeness.

The replication effort remedies, in essence, call for more funding of replication studies, and for students to be required to do more replication studies. I remember having to do replication studies in grad school. It was not fun, for many reasons.
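Here's that quick simulation of the p-value thresholds I promised. It's a toy sketch of my own (not something from the Wikipedia article), with made-up numbers: both groups are drawn from the same distribution, so every "significant" result is a false positive.

```python
# Toy simulation: run many experiments where the null hypothesis is TRUE
# (both groups come from the same distribution) and count how often a
# t-test still comes back "significant" anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_subjects = 10_000, 30

false_pos_05 = 0
false_pos_01 = 0
for _ in range(n_experiments):
    a = rng.normal(0, 1, n_subjects)  # "control" group, no real effect
    b = rng.normal(0, 1, n_subjects)  # "treatment" group, same distribution
    _, p = stats.ttest_ind(a, b)
    false_pos_05 += p < 0.05
    false_pos_01 += p < 0.01

print(f"alpha=0.05: {false_pos_05 / n_experiments:.3f}")  # ~0.05, about 1 in 20
print(f"alpha=0.01: {false_pos_01 / n_experiments:.3f}")  # ~0.01, about 1 in 100
```

Tightening the threshold to 1% buys you fewer flukes, at the cost of needing more subjects to detect real effects.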
Back to why those replication studies weren't fun. The authors usually didn't fully specify what their data prep steps were, or maybe they didn't do QA on their own experiment, or their data was private and it would be too costly or impossible to reproduce the data set. The solution there is to involve the original author: when the original author is involved with a replication attempt, it's successful about 91% of the time, compared with 65% when the author isn't involved.

And the final bucket is a bit more meta: changes to the scientific approach. The first thing in this bucket is to use triangulation, not just reproducibility, and it's a good option. The idea is: if X is true, we should be able to demonstrate X in a number of ways. In the case of research, then, if some other researcher asserts X is true, instead of trying to reproduce their research, you find some other method for testing whether X is true. Another remedy in this bucket is to stop using linear models for everything. The reality of the situation is that not everything has a linear relationship, and researchers should feel more free to try other models besides linear regression. I'm guessing a lot of folks in this crowd would agree with that. Then there is the remedy, especially related to publication bias, that replications should seek to revise and extend current theories, not merely replicate them. For instance, if you replicate a well-known study but also add an interesting finding, that would be a more publishable paper than simply four pages of saying, yep, it worked the way the original person said it did. And finally, the remedy most closely aligned to our proposed remedy: open science. Open science is sort of an umbrella term around open data, publishing your code to Git, and other things. We're expanding that and saying: don't just publish your code, publish your full pipeline.

Now, a slide based on my own experience trying to reproduce other people's work — so that not everything comes straight off a Wikipedia article I read. As a grad student, I remember reproducing academic papers. The authors always seemed to leave out the data prep they did entirely, or they would try to confuse you with lots of formulas. And candidly, I did a lot of that too when I was working on my MBA. If I didn't have time to do enough research, I'd just put a lot of scary math in the middle and hope the professor wouldn't check my math — and the MBA professors rarely did. So it worked. Now, I'm not sure if this audience is more academic or business oriented, and I feel like the wiki page mainly addresses the academic side of the problem. But the other reproducibility issues I hit in the wild are normally when other programmers either throw something over the fence to me, or they left a long, long time ago and I have to dig around in their mess. But the programmer I loathe most of all, the one who's the worst about throwing trash over the fence to be sorted out, is past Trevor. The me of the future hates the me of today, just like I loathe the Trevor of 2019 — especially when I open up some code that Trevor of 2019 was working on and spend three days trying to figure out what the hell he did and why. I recently had another run-in with past Trevor, which I will use as an illustrative example.

Now, what we did. I'm going to give the short version of this talk, but if you're in Valencia, definitely head over on Wednesday to see Holden Karau give the full version of this talk.
It's also chapter nine in a book that I'll be plugging later on, and a peer-reviewed article, which I'll also be plugging later on. So here's the elevator pitch of the summary I'm about to give. I'm putting this up because I'm going to speak fast and I think there are probably a lot of non-native English speakers here, so this will save you having to mess with the playback speed controls. The short of it: in the early days of the pandemic, everyone was scared. There were no solutions, and no solution was out of bounds. Various emergency rooms turned to CT scans and ultrasounds to detect ground-glass opacities, which were a hallmark of COVID; the technique had been used in years past for rapid pneumonia detection. CT scans deliver high doses of radiation. Low-dose CT scans deliver lower doses of radiation, but they produce noisy images. We used Kubernetes, Apache Spark, Apache Mahout, and Kubeflow to denoise these CT scans.

So, March 2020: people are dying; Spain, Italy, and New York hospitals are being overrun with COVID. You can see the dates here — or maybe you can't really see it, but the date was March 28, 2020. This guy is wearing plastic wrap with some sort of cootie shield. It looks silly now, but it was scary back then. Nobody really knew what was going on. Not everyone who thought they had COVID actually had it, and it took a long time to get test results back. This was another one from March 28, 2020: PCR tests took three days to return. People were trying to come up with rapid tests, but we see from March 26, 2020 that only detecting 60% of true positives was, at the time, considered promising — and those weren't even widely available yet; they were like coming-soon teasers. We kind of take for granted that you can just get a 15-minute test now, but that wasn't the case in early 2020. And that compounded the issue of hospitals being overrun, because everyone who comes to the hospital thinking they have COVID doesn't actually have COVID, but they had to sit there for three days waiting for their tests to come back. Here we see something from March 23rd: creativity in finding new ways to rapidly detect COVID, with equipment already available at the hospital, was at a premium. This article refers to doctors at some of the early hotspots in Italy and Spain using ultrasound to check lungs quicker than a PCR test could come back.

More on why CT scans are great — or at least why people thought they were, especially back in the early days of COVID. But CT scans have issues, the biggest one being the radiation dose they hit you with. A typical thoracic CT scan — a CT scan of your chest region, which is how they would diagnose COVID — gives you about 6.1 millisieverts of radiation. There's a chart to put that in perspective; it's not horrible, but it's pretty high for a diagnostic procedure. And another metric to put it in perspective: over the course of your entire life, you're really only supposed to get about 400 millisieverts of radiation. So, yeah. Now, low-dose CT scans have been around since the '90s and early aughts; they were developed explicitly for detecting early lung cancer. They give about 1.4 millisieverts of radiation. However, the trade-off for that lower radiation dose is that you end up with a noisier image. And when I say noisier, think of the static on a TV channel if you grew up with an antenna — or, if you had cable, the channels your parents didn't pay for. Yeah, just static on TV channels.
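Before we get into how we actually did it, here's a toy sketch of what denoising-by-SVD even means — synthetic data and plain NumPy, my own illustration rather than anything from the real pipeline: build a low-rank "image," add static, then reconstruct it from only its top few singular values.

```python
# Toy SVD denoising: a low-rank "clean image" plus Gaussian noise ("static"),
# reconstructed from only its k largest singular values. Synthetic data only;
# the real pipeline does this distributed, with Mahout on Spark.
import numpy as np

rng = np.random.default_rng(0)
size, rank, k = 256, 5, 5

clean = rng.normal(size=(size, rank)) @ rng.normal(size=(rank, size))
noisy = clean + rng.normal(scale=0.5, size=(size, size))

# Keep the top-k singular triplets, discard the rest (that's the "noise").
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("relative error, noisy:   ",
      np.linalg.norm(noisy - clean) / np.linalg.norm(clean))
print("relative error, denoised:",
      np.linalg.norm(denoised - clean) / np.linalg.norm(clean))
```

Crank k too high and you keep the static; crank it too low and you start losing signal — which is exactly the over-denoising trade-off we'll see in the results.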
So now, to denoise images: the methods were developed in parallel with the low-dose CT scan technique. To do some hand-waving, the way you denoise a CT scan is basically the same method as doing a principal component analysis. However, if you've ever done a lot of work with principal component analysis, you know you have to do a matrix inversion — and again, I'm trying to go fast here, but I did try doing it with NumPy just for fun, and it threw an error saying it would need 500 gigabytes of RAM. Now, you can get computers with 500 gigabytes of RAM, but they're kind of pricey to rent, and we're wizards of open source, so let's do something else.

Now, the data source I used: there was a Brazilian radiologist who got CT scans from 10 patients from Wuhan, China and posted them to coronacases.org. Since then, they have messed with the metadata, so you can't pull those images anymore. You can view them at coronacases.org, but if you try to download them, they don't work right anymore — or at least in my experience. And that's a ground-glass opacity — I think. I'm not a radiologist, so let me be very upfront about that.

So why was it critical to use open source for this? I doubt I have to sell this too hard to this audience, but we didn't even realize yet how much socioeconomic disparities were going to affect sub-Saharan Africa and other poorer areas of the world. We see here a headline from January of 2021 in the Wall Street Journal about how COVID-19 has widened the gap between rich and poor. Open source software can level the playing field; it can be distributed much more quickly than proprietary software. I don't really have time to soapbox this out completely, but I hope you understand why and when and how free and open source software is great.

So, the pipeline. The gist of what happens here: S3 buckets have the DICOM images; the DICOMs are loaded onto a persistent volume claim; pydicom turns them into a numerical matrix, which is loaded into Spark as an RDD, which is then wrapped by a Mahout DRM (distributed row matrix). The reason we're using Mahout — we'll talk about it a bit later — is that Mahout has a distributed stochastic singular value decomposition, which in essence allows us to invert the matrix and get two matrices out, which lets us do our denoising. S3 buckets, we know what they are. The main reason for them is that, at least on Google's Kubernetes offering — I believe for sure when this was written, and I believe it's still the case — you don't have ReadWriteMany options on persistent volume claims. I'm going to assume that since this is KubeCon and everyone here is somewhat familiar with Kubernetes, you know what that means. When Spark is writing, it's writing from every executor, so you have to have an S3 bucket to read and write from, and that's why we use S3 buckets. You do take a performance hit when you're using S3 buckets, just as a note, but if you've got to use Spark, that's usually a good solution when your Kubernetes cluster doesn't support ReadWriteMany.

So I'm going to take a big breath and start going a lot faster, because I realize I'm not tracking well on time. pydicom: a Python library, great for reading and easy manipulation of DICOM images. Apache Spark: a great distributed engine; it requires an operator to run on Kubernetes, and there can be a number of fun and unique challenges to doing Spark on Kubernetes — I'll let you figure those out in your own choose-your-own-adventure.
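But before moving on, here's a rough sketch of the ingest step I just described. The mount path and app name are placeholders, and in the real pipeline the resulting RDD gets wrapped as a Mahout DRM on the Scala side — so take this as the Python half of the idea only.

```python
# Sketch of the ingest step: read DICOMs from a mounted volume with pydicom,
# flatten each slice into a row vector, and parallelize into a Spark RDD.
# The mount path and app name are placeholders, not the real pipeline's.
import glob

import numpy as np
import pydicom
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ct-denoise-ingest").getOrCreate()

rows = []
for i, path in enumerate(sorted(glob.glob("/mnt/dicom-pvc/*.dcm"))):
    ds = pydicom.dcmread(path)                       # one DICOM file
    rows.append((i, ds.pixel_array.ravel().astype(np.float64)))

# One (key, row-vector) pair per slice; in the real pipeline this RDD is
# wrapped by a Mahout DRM (distributed row matrix) before the SVD step.
rdd = spark.sparkContext.parallelize(rows)
print(rdd.count(), "slices loaded")
```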
Mahout is a library that sits on top of Spark. Spark does have matrix support, but not distributed matrix support; Mahout does have distributed matrix support, and I'm a maintainer of Mahout, so of course I wanted to rope it into this talk and my book and my journal article and everything else. So there we go. Distributed stochastic singular value decomposition: Nathan Halko wrote a thesis on this a few years ago. It allows you to distribute a singular value decomposition, which was important for what we were trying to do, and there are some cool graphics about it.

Visualizing the results: to be clear, I don't think these were low-dose CT scans to begin with. Really, what this is showing more than anything is that if you over-denoise — as you can see in image D at the bottom right corner — you start to lose signal as well as noise. But yeah, so that's that. A big point of this is: if you want to know how to do this with Kubeflow, that's the chapter nine example in the book that some friends and I wrote. If you want to read the article about why I was doing all these things, it's a free and open article on Noble Research — feel free to check it out. Or, once again, if you just want to see the full talk, go check out Holden's talk on Wednesday at 17:25 if you're in Valencia, or I'm sure it'll exist somewhere on this YouTube channel as well.

Okay, so what? When I was working at Arrikto, I dusted off the old code and made sure it still worked. It does. One issue I hit was that the data set had changed — as I was talking about earlier, the metadata changed and the DICOMs didn't load right. However, you can put any DICOM images in an S3 bucket, aim the pipeline at it, and it should run and clean them if needed. So here's the thing. Let's say you have some deep learning model you wanted to use to detect COVID. Cool: with Kubeflow, you can just add a step at the end that does that. But what if you want to detect lung cancer? Cool: you can just add a step at the end for that. What if you're a hospital that, given a low-dose CT scan, wants to check for COVID and lung cancer and tuberculosis? You can just add three steps at the end. And what if you have an idea and you want to see whether denoising helps neural nets detect some sort of malady? You can feed in an image, have your algorithm look for the malady on the raw DICOM, and then again on a denoised DICOM. The point being: since I published a pipeline, you can spend your time extending my results, as opposed to trying to figure out how the heck my code ever worked in the first place.
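To make that "just add a step at the end" idea concrete, here's a minimal sketch using the Kubeflow Pipelines SDK (v1-style DSL). Every container image, name, and argument here is a hypothetical placeholder, not the actual published pipeline:

```python
# Hypothetical sketch: a denoise step, then one detector step per malady.
# Images, names, and arguments are placeholders.
import kfp
from kfp import dsl


@dsl.pipeline(name="denoise-and-detect",
              description="Denoise CT scans, then run detectors on the output.")
def denoise_and_detect(bucket: str = "s3://example-bucket/dicoms"):
    denoise = dsl.ContainerOp(
        name="denoise",
        image="example.com/ct-denoiser:latest",      # placeholder image
        arguments=["--input", bucket, "--output", f"{bucket}/denoised"],
    )
    # "Just add three steps at the end": one detector per malady.
    for malady in ["covid", "lung-cancer", "tuberculosis"]:
        detect = dsl.ContainerOp(
            name=f"detect-{malady}",
            image="example.com/ct-detector:latest",  # placeholder image
            arguments=["--malady", malady, "--input", f"{bucket}/denoised"],
        )
        detect.after(denoise)


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(denoise_and_detect, "pipeline.yaml")
```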
So, calls to action. Assume replication: assume someone, somewhere, someday — possibly you — will need to replicate your study. Be a cool person and help them by documenting all the steps in your paper. That's one call to action. And you might say, well, that's why I published my code. And I would say, okay, cool — and if this were an academic conference, I'd let it go at that. But by virtue of the fact that you're at KubeCon, I'm going to assume, A, that you get this joke, and B, that by extension you understand why publishing code isn't enough. In case you don't, let me very explicitly lay out some reasons.

The environment can be different. Perhaps whatever package you used to calculate standard deviation was doing it wrong. Or worse yet, what if they fix it later? Or what if it used to be correct, but then some regression was introduced? Software isn't a dead document; it's a living thing, and it can change over time. You can set a seed to make your stochastic processes deterministic — but how many people actually remember to do this? You can also check out — just Google it — low-background steel. It's wild, and it will make you forever doubt the accuracy of any of your analyses ever. So maybe don't look it up. Cool stuff, though, and a good reason why sometimes computers don't work the way you'd hope they would.

I touched on this in the comic on the last slide, but your laptop environment is like a snowflake: it's very special and unique, like the billions of other laptop snowflakes, and no one will ever create another environment quite the same ever again. So instead of using just a requirements.txt and calling it a day, make a Docker container, which will also version-lock things. In addition to breaking changes that might be introduced between, say, TensorFlow 1.4 and 2.0, there are also a thousand little things happening in the background of your computer. What version of the CUDA drivers are you running? What flavor and version of Linux are you running? Et cetera, et cetera. So Docker containers help make your code a lot more reproducible.
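As a minimal sketch of what that version-locking might look like (every base image, package, and version below is a placeholder I picked for illustration, not the actual stack from the paper):

```dockerfile
# Pin the exact base image: OS flavor plus Python interpreter version.
FROM python:3.9.12-slim-bullseye

WORKDIR /app

# Pin exact dependency versions so the environment resolves the same way
# next year as it does today; an unpinned requirements.txt will drift.
RUN pip install --no-cache-dir \
        numpy==1.22.3 \
        pydicom==2.3.0 \
        pyspark==3.2.1

COPY denoise.py .
ENTRYPOINT ["python", "denoise.py"]
```

Anyone with Docker can now rebuild essentially the same environment without ever touching your snowflake laptop.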
Assuming someone will want to replicate your work and that they don't have access to your machine, Kubeflow provides a nice framework for reproducing results. Now, that said, what is Kubeflow and how will it help? From the website: the Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

If you're using Kubeflow just for reproducible research, it's probably a bit overkill, candidly. Kubeflow does model serving; you can bolt on feature stores, metadata tracking, and a host of other things. But if your goal is just reproducing research, you probably won't need all that. However, it is easily deployable on various cloud providers, so there may be some value there. And again, I'm no longer associated with Arrikto, but they have a cloud deployment that runs for about 50 cents an hour, or you can use Canonical's Charmed Kubeflow. Those are probably your two easiest ways to get Kubeflow up and running. And the pipeline itself you can just download as a zip file; it's published to a Git repository as well. This advice was earned over an untold number of hours trying to do Kubeflow installs. It can be a very painful process, and your time is probably better spent elsewhere. If you've got three bucks, just fire up Kubeflow, take all your code, and put it into a pipeline; that's probably the best way to go about it. And I would strongly recommend that, because you have better things to do than install Kubeflow. So shake out your couch for a few dollars in change and just get it set up. And I'm saying that with no kickbacks from anyone.

So, in conclusion: reproducibility is the cornerstone of science. Do everyone else a favor and make your science uber-reproducible. For more information on Kubeflow, obviously you should buy our book; we make tens of cents for each copy sold. Also, check out the paper, and if you feel like doing some cool stuff with DICOM images, you can extend my research and cite me — I don't get paid for that either, but it makes me feel happy. And if you're trying to reproduce my work — as we said earlier, things are much more reproducible when the original author is involved — or if you just want to chat about something else, or if you're really into electric bikes and trikes and want to do some cool stuff with that, or if you just need more friends on LinkedIn, whatever, that's cool: there's how you can get a hold of me. And I should have been available in the chat this entire time for Q&A, so hopefully people had questions and got them answered. So thanks again for coming to my talk — and I think I actually nailed the timing perfectly, so great. Have a good one, and thanks again.