Live from Washington, D.C., it's theCUBE, covering .conf 2017. Brought to you by Splunk.

Welcome back inside the Walter Washington Convention Center. We're at .conf 2017 in Washington, D.C. The nation's capital is alive and well and thriving. I'll tell you, it's a little warm out there, almost 90 degrees, but the hot topics are inside here, Dave. There's a lot of heat in this city.

Yes. Especially this week.

A lot of hot air.

Yeah, absolutely. We'll just leave it at that. Politics aside, of course. Joining us is Ben Miller, who is the director of high throughput screening at Recursion Pharmaceuticals. Ben, thanks for being with us here on theCUBE. We appreciate the time. First off, let's talk about the company, what you do, and then what high throughput screening means and how that operation comes into play when you have this great nexus of biology and engineering that you've brought together.

Yeah, so Recursion Pharmaceuticals is treating drug discovery as a facial recognition problem. We're applying machine learning concepts to biological images to help detect what types of drugs can rescue what types of diseases. And we're one of the few companies that's both generating and analyzing our own data. As the director of the high throughput screening group, what I do is generate images for our data science teams to analyze. That means growing human cells in massive quantities, perturbing them with different types of disease reagents that cause their morphology to change, and then photographing them in the presence and in the absence of compounds, so we can see which compounds cause these disease states to revert to a more normal state for the cell.

Okay, and HTS, then, that's your high throughput screening. Can you walk us through that, if you would?

Yeah, so HTS is a general term used in the pharmaceutical industry to denote an assay that is executed at very large scale and in parallel. We tend to work on the order of multiples of 384 experiments per plate. So we're looking at hundreds of thousands of images per plate, and we're looking at hundreds of plates per week. When we say high throughput, we mean six to ten terabytes of data per day.

Just extraordinary amounts of data.
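[Editor's note: as a rough sanity check on the throughput Ben describes, here is a minimal back-of-envelope sketch in Python. Only the 384-well plate format and the six-to-ten-terabytes-per-day figure come from the interview; the plate count, imaging sites, channels, and image size are illustrative assumptions.]

```python
# Back-of-envelope arithmetic for the stated throughput. Everything marked
# "assumed" is an illustrative guess, not a figure from the interview.
WELLS_PER_PLATE = 384            # "multiples of 384 experiments per plate"
SITES_PER_WELL = 4               # assumed imaging sites per well
CHANNELS = 6                     # assumed fluorescence channels per site
PIXELS_PER_IMAGE = 2048 * 2048   # assumed camera resolution
BYTES_PER_PIXEL = 2              # 16-bit TIFF

gb_per_plate = (WELLS_PER_PLATE * SITES_PER_WELL * CHANNELS
                * PIXELS_PER_IMAGE * BYTES_PER_PIXEL) / 1e9

PLATES_PER_WEEK = 600            # assumed; "hundreds of plates per week"
tb_per_day = gb_per_plate * PLATES_PER_WEEK / 7 / 1e3

print(f"~{gb_per_plate:.0f} GB/plate, ~{tb_per_day:.1f} TB/day")
# Under these assumptions: ~77 GB/plate and ~6.6 TB/day, consistent with
# the six-to-ten-terabytes-per-day range quoted above.
```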
Yeah, gotcha. Okay, so the mission, as we understand it: you're looking at very rare genetic diseases, and your goal is to find cures for up to 100 of them over the next 15 to 20 years. And that's why you're going through these multiple examinations of vast amounts of data, human data.

Yeah. There's been a trend in the pharmaceutical industry over the last several years where the number of dollars spent per drug developed keeps increasing. It now takes over a billion dollars to bring a drug to market, and every year it costs more. We believe we can change that by operating at a massively parallel scale and by analyzing image data at a truly deep level, looking at thousands of different features per image instead of just a single feature.

Yeah, I mean, that business has this vicious cycle going on, and you guys are trying to break it.

Yeah, exactly.

So what's the state of facial recognition, Ben? I've had mixed reviews about it, right? Because I rave about it, oh my God, Facebook tagged me again, it must be really good, but then others have told me, well, you know, it's not really as reliable as you might think. What has your experience been?

You know, the only experience I've had with facial recognition has been like yours, on Facebook and things like that. What we're doing is looking more at cellular recognition, being able to see differences in these cellular morphologies. And I think there are some unique challenges when you're looking at images of thousands of cells versus images of a single person's face.

Okay, so you've taken that concept down to the cell level, and it's highly accurate, presumably, right?

It's highly reproducible is what I would say, yeah.

Okay, so it takes some work to get it to be accurate, and once you get it there, then you can reproduce it. Is that right, or how does the sequence work?

Yeah, I mean, there are two sides to this coin. One is how consistently we can produce these images, and the other is how consistently those images represent the disease state. My focus is on making the images as consistent as they can be, while recognizing that the disease states are all unique. So from our perspective, we're looking at thousands of different features in each image and figuring out how consistent those features are from image to image.

So paint a picture of your data stack, if you will, from the infrastructure on up to the apps, and where Splunk fits in.

Sure. I guess you could say our data stack actually begins at hospitals around the world, where human cells are collected from various medical waste samples. We culture those up, perturb them with different reagents, add different potential drugs back to them, and then photograph them. So at the beginning of our stack, we've got biological reagents that are mixed together, and then photographs are generated. Those photographs are actually TIFF files, and we have thousands and thousands of them. They're all uploaded into Amazon Web Services, their S3 system. We spin up a near-infinite number of virtual computers to process all of that image data within a couple of hours and then produce a result: this drug makes this disease model look more like a healthy one and doesn't have other side effects. We're really reducing the thousands of dimensions in our image down to two: how much does it look like a healthy cell, and how much does it just look different than it should?

And where does Splunk fit into that stack?

Yeah, so all of the instruments that are generating that data are equipped with Splunk forwarders. Splunk is pulling all of our operational data from the laboratory together and marrying it up with the image analysis that comes from our proprietary data analysis system. By looking at the data we're generating, how many cells we're counting, how bright the intensity of the image is, and comparing that back to which dispenser we used, how long the plates sat at room temperature, et cetera, we can figure out how to optimize our production process so that we get reliable data.

And you're essentially storing machine data in the Splunk data store, and then you have an image database for the photographs?

Yeah, and the image database is incredibly large. I wouldn't even guess at its current size.

And what is it? Is it something on Amazon's service?

Yeah, right now all of our image data is stored on AWS.
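[Editor's note: a minimal sketch of the ingest step Ben describes, microscope TIFFs pushed to S3 for a fleet of analysis workers to fan out over. The bucket name, key layout, and directory structure are hypothetical, not Recursion's actual schema.]

```python
# Hypothetical sketch: upload one plate's TIFF images to S3 so that
# downstream workers can shard the analysis by plate. Not Recursion's code.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "example-screening-images"  # hypothetical bucket name

def upload_plate(plate_dir: Path) -> None:
    """Push every TIFF from one plate's imaging run to S3."""
    for tiff in sorted(plate_dir.glob("*.tif")):
        # The key encodes the plate ID so workers can partition work by plate.
        s3.upload_file(str(tiff), BUCKET, f"raw/{plate_dir.name}/{tiff.name}")

upload_plate(Path("/data/plates/PLATE_000123"))  # hypothetical path
```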
See, this is one of those interviews, Dave, where the subject matter, to me, kind of trumps the technology, because I want to know how it works. But you need the technology, obviously, to drive it. So I'm trying to figure out: you're taking human cells, you're taking snapshots in time, and then you're looking at how they react to certain, you said, perturbations. But how does that picture of one person's cell reacting to a reagent compare to another person's? What are you using? How does your data analysis provide you with insight? Because Dave's DNA is different, and my DNA is different from everybody's in this building. So ultimately, how are you combing through all that data to make sense of it?

Yeah, that's true. Everybody has a unique genetic fingerprint, but everybody is susceptible to the same sets of major diseases. And really, that's the billion dollar question: how representative are these individual cellular images of the general human population, and will the effects that we see at a cellular level translate into human populations? We're very close to clinical trials on several compounds, and that's when we will really find out how much proof there is in this concept.

Okay, and you can't really predict it. Do you have a timeframe, or is it just sort of keep going, keep getting funded, until you reach the answer? Is it like survive until you thrive?

I personally don't maintain that kind of timeline. My role is within the laboratory, producing the data as quickly as we can. We do have a goal of curing or treating 100 different diseases in the next 10 years. It's really early days, we're about two and a half years into that goal, and it seems like we're on track, but there's still a lot of work to be done between now and then.

Okay, so it's all cloud, right? And Splunk is throughout that stack, as we talked about. How do you envision using it differently? Are you trying to get more out of the Splunk platform? What do you want to see from Splunk?

Yeah, that's a good question. I think right now we are using the rudimentary, basic features of Splunk. Their DB Connect app and their Machine Learning Toolkit are both pretty foundational to the work that we do. But right now a lot of our data models are one-time use. We do a particular analysis to find the root cause of a particular problem, we learn it, and that's the last time we use that model. So continuous implementation of data models is high on my to-do list, as well as just ingesting more and more data. We're still fairly siloed: our temperature and humidity data is separate from our machine data, and bringing all of that into Splunk is on the list.

Why are your models disposable? It sounds like that's not done on purpose; it's more of some kind of infrastructure barrier.

You know, we're really at the cutting edge of technology right now, and we're learning a lot of things that people haven't learned yet, things that in retrospect are obvious. To figure out the true cause of a particular situation, a data model or a machine learning model is really valuable, but once you know that key salient fact, you don't need to keep track of it over time. You don't need to keep relearning that when your tire pressure is low, your car gets fewer miles to the gallon. You have the answer.

Right.

But there are a lot of problems like that in our field that have not been discovered yet.
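[Editor's note: a minimal sketch, not Recursion's code, of what moving from disposable one-off models to the "continuous implementation" Ben mentions could look like: persist a fitted root-cause model and keep re-applying it to new runs. The feature and column names are hypothetical and assumed to be numeric-coded.]

```python
# Hypothetical sketch: instead of discarding a root-cause model after one
# analysis, persist it and keep scoring new laboratory runs against it.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["dispenser_id", "minutes_at_room_temp", "cell_count"]  # hypothetical
TARGET = "mean_image_intensity"                                    # hypothetical

def train_and_save(history: pd.DataFrame, path: str = "qc_model.joblib") -> None:
    """Fit a QC model on historical runs and persist it for reuse."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(history[FEATURES], history[TARGET])
    joblib.dump(model, path)

def score_new_runs(runs: pd.DataFrame, path: str = "qc_model.joblib"):
    """Reload the saved model and predict expected image quality for new plates."""
    return joblib.load(path).predict(runs[FEATURES])
```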
Okay, but if I can infer from your answer, you do see the potential for some kind of ongoing model evolution for new use cases?

Yeah. In the extreme situation, we have a set of hundreds of operational parameters that go into producing this image of cells, and then we have thousands of cellular features that are extracted from that image. There's a machine learning problem there: what are the optimal parameters to extract the optimal information? That whole process could be automated to the point where we're using machine learning to optimize our assay. To me, that's the future of what we want to do. [Ed.: a sketch of this closed-loop idea appears at the end of the transcript.]

Were you with Recursion when they brought in Splunk?

Yeah, yeah.

Okay, did you look at alternatives? Did you look at maybe rolling your own with open source? I mean, was that even feasible? I wonder if we could talk about that.

Yeah, so I'd already been introduced to Splunk at my previous job, and at that previous company, before I'd heard of Splunk, I was starting to roll my own. I was writing a ton of Perl scripts and all of these regular expressions, and searching network drives to pull log files together, and I thought maybe there would be a good business model behind that.

You were building Splunk.

Yeah, and then I found Splunk, and those guys were so far ahead of the things I was trying to do on my own in the lab. So for me, it was a no-brainer. But our software engineering team is really dedicated to open source platforms whenever possible, so they evaluated the ELK stack. Some of us had used Sumo Logic and things like that, but for me, Splunk had the right license model, and I could get off the ground really rapidly with it.

What about the license model was attractive to you?

Unlimited users, and only paying for the data that we ingest. The ability to democratize that data, so that everybody in the lab can go in and view it without my worrying about how many accounts I'm creating, is really powerful.

So you like the pricing model?

Yeah.

Some users have shared their views on the pricing. I saw some Wall Street concerns about it, but the folks we've talked to on theCUBE today have said they like the pricing model, that there's value there, and you're sort of confirming that.

Yeah.

You're not concerned about the exponential growth of your data causing your license fees to go through the roof?

Yeah, I mean, in the laboratory, the image data that we're generating is growing exponentially, but the operational parameter data is growing more linearly.

Right, okay. So it's under control.

Yeah, for our needs it is.

You're not paying for the images; you're paying for the metadata.

Yeah.

Well, it's a fascinating proposition, it really is. We're very eager to keep up with this, to keep track and see the progress, and good luck with that. We look forward to having you back on theCUBE to monitor that progress, all right, Ben?

Great, yeah, thank you.

Very good, thank you so much. Ben Miller, joining us from Salt Lake City. Good to have you here. Back with more on theCUBE here in just a bit. You're watching our live coverage of .conf 2017.
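[Editor's note: to close, a toy sketch of the closed-loop assay optimization Ben outlined: learn a mapping from operational parameters to an image-quality objective, then search the parameter space for settings that maximize it. The parameters, ranges, and data below are entirely synthetic and illustrative.]

```python
# Toy sketch of ML-driven assay optimization: fit a surrogate model from
# operational parameters to a quality score, then search for good settings.
# All data here is synthetic; nothing below is Recursion's actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Pretend history: 3 operational parameters (e.g. seed density, dwell time,
# reagent dose, all scaled to [0, 1]) -> one scalar quality score distilled
# from the thousands of per-image features.
X_history = rng.uniform(0, 1, size=(500, 3))
y_history = -np.sum((X_history - 0.6) ** 2, axis=1) + rng.normal(0, 0.01, 500)

surrogate = RandomForestRegressor(n_estimators=300, random_state=0)
surrogate.fit(X_history, y_history)

# Search candidate settings for the best predicted quality; in a real loop
# the winning settings would be run in the lab and fed back into the model.
candidates = rng.uniform(0, 1, size=(10_000, 3))
best = candidates[np.argmax(surrogate.predict(candidates))]
print("suggested parameters:", best.round(3))  # should land near 0.6 each
```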