Galvanize campus in San Francisco. It's theCUBE, covering the Apache Spark Maker Community Event, brought to you by IBM. Now, here are your hosts, John Walls and George Gilbert.

And welcome back to San Francisco as theCUBE continues our coverage here of the Apache Spark Maker Community Event. I'm John Walls, along with George Gilbert, and we're joined now by Seth Dobrin, who's the director of digital strategies for Monsanto. And we're going to be talking about some fascinating case studies, some work that you've done, Seth, at Monsanto. Thanks for joining us. Good to have you on theCUBE.

Thanks for having me. I appreciate it. Yeah, first time.

Before we get into what you guys are doing in terms of a wide variety of applications right now within your company with Spark, let's talk about just digital strategies in Monsanto. Your primary responsibility is pretty much, what's the umbrella there for you?

Yeah, so my primary responsibility now as the director of digital strategies is really helping Monsanto to define what it means for a company in agriculture to become digital. How do we operate differently as a company? How do we interact differently with our customers? And even just kind of basic, how do employees behave differently in a digital company and digital ag? And then thinking about how do we apply tools like data science to solve some previously intractable problems?

Yeah, I love the fact that you're here, that Monsanto's here. I mean, just because you think about agriculture, right? And this agrarian society, we're back in the 1700s, 1800s, and we all have this pretty old-fashioned notion of ag. And here you are very much at the cutting edge and applying this tremendous technology. In general, what are you finding to be the big benefit or the big value in terms of applying this level of technology to the services that you provide your clients? Farmers, basically.

Well, farmers and ourselves, right?
So I think the biggest advantage of some of these newer technologies, once you put all these different open source platforms together, is really being able to address some questions that were just previously computationally impossible to address correctly, or even at all. So there are some problems around environment where we're trying to classify or better understand smaller environments around the world, where you could not bring in things like weather and topography and different soil types and different levels of fertility in the soil on a meter-by-meter basis around the globe. With things like Spark, you can do that. We couldn't tackle genetic problems about really understanding our total pipeline, which is composed of millions of individuals, or nodes in our graph database, and they have billions of relationships. And being able to understand the genetic relationship, or to do a basic imputation problem on those families, to be able to predict what the genetics of individuals that aren't genotyped, that we don't have genetic information on, are, based on relatives that do. And this was on the stage.

And if I can interrupt, when you're talking about individuals here, you're talking about seeds. Seeds, plants, I mean not people, but all the seeds that of course you're providing and helping develop, a higher caliber, higher yield kind of seeds.

And it's actually more than we're helping to develop. So our pipeline, like a pharmaceutical pipeline, has orders of magnitude more entries at the beginning than it does at the end. And so when we're talking about the beginning of our pipeline, we're talking about tens of millions of individuals that we look at every year to generate new products five years from now. And so the size of the pipeline is massive. And from a scientific perspective, I mean I'm a geneticist by training, and from a scientific perspective, in fact I'm a human geneticist.
We can do things in ag that you can't even imagine being able to do in humans to understand problems, to do things like we're talking about here, leveraging these technologies to solve problems about how do you impute the genetic information about completely or very distantly related individuals. The populations just don't exist in human genetics. And we can create them.

If you're using this incredibly rich representation of all these individuals in terms of plants that are manifested in the genomes of these seeds, help us understand what that poor Russian, Gregor Mendel, when he was just like matching up peas or whatever it was, tell us how many orders of magnitude richer and faster you can solve these problems.

Yeah, so I guess I'll tell you the difference in order of magnitude of the problem, right? So I'm gonna oversimplify this quite a bit, actually. So Gregor Mendel was looking at a single gene that represents a single trait, okay? And let's say what we really care about in corn is that we wanna maximize yield. And let's just, for the sake of argument, say that there's 20 genes that contribute to yield, and there's actually many more than that. We don't even have a very accurate number, but we know there's many more than that. If you had 10 of each of those genes in two different plants and you wanted to create a single plant that had all 20 of those genes, you'd need to make about a billion different crosses. So I'd need to cross a billion different seed together in order to get one plant that had that combination of 10 genes. But it gets even worse, because to find that one in a billion plant, you actually need to plant about three trillion plants.

This is, Gregor would be doing this. He'd need to plant about three trillion plants. He'd be very busy for a long time, George.

So how big is three trillion plants, right?
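As a back-of-the-envelope check on the figures above, here is a minimal sketch of one way a number on the order of three trillion can arise. The model is an assumption and is not stated in the conversation: 20 independent (unlinked) genes, a first-generation plant carrying one favorable and one unfavorable copy of each, self-pollinated, and a 95% chance of seeing at least one offspring with two favorable copies of all 20. The function name `plants_needed` is hypothetical. It does not try to reproduce the "billion crosses" figure.

```python
import math


def plants_needed(n_genes: int, confidence: float = 0.95) -> float:
    """Offspring to grow so that, with the given confidence, at least one
    carries two favorable copies of every gene.

    Assumed model (not from the interview): each gene is inherited
    independently, and each offspring has a 1/4 chance per gene of
    receiving two favorable copies.
    """
    p_ideal = 0.25 ** n_genes  # chance one offspring is the ideal plant
    # Solve 1 - (1 - p)^n >= confidence for n; log1p keeps precision for tiny p.
    return math.log(1.0 - confidence) / math.log1p(-p_ideal)


if __name__ == "__main__":
    odds = 1 / (0.25 ** 20)
    print(f"ideal plant is about 1 in {odds:.2e}")
    print(f"plants to grow for 95% confidence: {plants_needed(20):.2e}")
```

Under these assumptions the ideal plant is roughly one in a trillion, and about three times that many plants are needed to be reasonably sure of finding one, which is consistent with the scale quoted in the conversation.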
So the corn belt, every year, there's about two to three trillion corn plants planted in the corn belt, which encompasses Nebraska to Indiana to Ohio. And so you'd need to walk that entire space and know what that plant looked like. But with leveraging molecular techniques and things like data science, right? In fact, leveraging data science and molecular techniques together, we can identify that plant in the lab from a seed, and then we can put that seed in the ground and it'll grow.

So out of the trillion, you can find the one by traversing this graph database using, I assume, molecular markers. Yeah. And Spark is doing the analysis.

Spark is doing the imputation, because we don't have complete sequences. So doing a DNA sequence of anything right now that's about the same size as human, which corn is, costs roughly $1,000. And so we don't have all of these plants sequenced completely. But what we do have is some very high-density, some very deep sequencing done on some of them, and then we have others that have very shallow sequencing. So we'll sprinkle a few hundred markers here and there, so DNA markers. And then we'll use Spark to impute that whole genomic sequence down from those ones that we have that are completely sequenced and up, leveraging all the markers, and we can basically fill in the blanks and say, with some level of confidence, this sequence matches this sequence, which has this observation. And so what Spark does is it takes all that information, and it takes the observations that we have on all those plants, from all of them, and it combines them together, and it says, we know if you have this combination of genes, this is the trait that you're gonna see manifest itself in the plant.

And in what period of time? I mean, what kind of time frame are you talking about here?

So we have DNA from plants, and we have observations from plants going back 60 years.
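The "fill in the blanks" idea described above can be illustrated with a toy sketch. This is not Monsanto's method and it does not use Spark. It assumes genotypes are lists of allele codes, with `None` for markers that were not assayed, and it fills gaps from whichever fully sequenced reference agrees best on the observed markers. The names `impute` and `reference_panel` are hypothetical, and the data is made up for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def impute(
    sparse: Sequence[Optional[int]],
    reference_panel: Sequence[Sequence[int]],
) -> Tuple[List[int], float]:
    """Fill unassayed markers (None) from the best-matching reference.

    Returns the filled genotype and a crude confidence score: the share
    of observed markers on which the chosen reference agrees.
    """
    observed = [i for i, allele in enumerate(sparse) if allele is not None]
    if not observed:
        raise ValueError("need at least one observed marker")

    def agreement(ref: Sequence[int]) -> float:
        return sum(ref[i] == sparse[i] for i in observed) / len(observed)

    best = max(reference_panel, key=agreement)
    filled = [best[i] if allele is None else allele
              for i, allele in enumerate(sparse)]
    return filled, agreement(best)


if __name__ == "__main__":
    # Two deeply sequenced individuals (illustrative data only).
    reference_panel = [
        [0, 1, 1, 0, 1, 0],
        [1, 1, 0, 0, 0, 1],
    ]
    # One shallowly genotyped individual: only three markers observed.
    sparse = [0, None, None, 0, 1, None]
    filled, confidence = impute(sparse, reference_panel)
    print(filled, confidence)
```

A production system would weigh pedigree relationships and many references at once and would run distributed; the sketch only shows the matching-then-filling step and a simple notion of confidence.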
And then I guess the question would be, when you do a run to try and figure out what's the one that's gonna have the highest yield, how long does that run take? Or is it not one run?

So we did one run, and I don't recall how long it takes, but now we do just updates. So every time we have new data entry, it updates in near real time. And so, thinking about this graph database, those imputations actually feed back into this graph database. And if we find an error in the graph database and it gets corrected, it's gonna run back through that imputation engine, and in near real time it'll correct the mistake and it'll re-impute all of the genotypes that are related to that.

That's a lot to wrap your head around. Yeah, right. I'm still, I'm just, I'm blown away by, you know, so I'm at two, three trillion combinations, basically. Manifestations of this combination work. And in that, you can find the magic one, you know, because of this new technology. As opposed to, you know, I mean, what would you do before?

So 10 years ago, when I came to Monsanto, we developed a technology where I could take a piece off a seed, right? And this is when we did all these calculations, how I know the 20s, right? And the billions and trillions. And we could run some lab tests, like you see on CSI, that would tell from that piece of seed what the molecular component of it was, right? And so we would have to actually physically take a piece off of each of these seeds five, 10 years ago to do this. Today, based on all that data we collected in the past, we don't even have to do anything for that first round. That whole first cycle happens in Spark, essentially, right? So that whole first cycle is done in silico. And we don't have to plant a single plant for that whole first year of testing. We just say, select these seed and then plant them into the next round.
And so we're saving field work, we're saving labor, we're saving natural resources, so we don't have to water and fertilize and put pesticides down on these plants. And so all of this happens completely in silico now.

This is like, well, this is difficult for a humanities major. So the part where you've got sort of from a trillion down to one, do you have enough observations yet to say this is how much the yield has increased?

Yeah, so the trillion to one is a hypothetical, right? So we don't do a trillion to one, but we reduced the odds probably down to five to one. So we've reduced it quite a bit. And we don't know yet the exact phenotypes or characteristics that the plant will exhibit, but every time we go through this, we learn, right? But we do know that for this first round, we'll do some actual testing later in the pipeline to validate this, and all that feeds back in. So all those observations that are correlated will feed back into our imputation pipeline and we'll update that.

So you've done basically like a corn genome project, right? So you broke it down into all these components and identified the most significant traits, right? And so what works, what doesn't, and now on top of that, you've been able to do these computations that tell you what's the super ear of corn, right? Basically, so you're creating super corn, if you will, I mean, with the help of Spark, that eventually gets you to a higher yield, more optimal breed or variety, basically.

Yeah, and so we're leveraging data science to increase the amount of what we call genetic gain, which is the year-over-year increase in yield.

Do you have an expectation for before and after, in terms of before you started experimenting with yield as a trait to where you might get, in terms of 10%, 20%?
Yeah, so with just leveraging this, we expect to see a 4% to 5% increase in genetic gain, so yield increase year over year, on top of other things that we've done to increase that genetic gain.

Can you tell us just briefly, because we have just a couple of minutes left, some of the other things where you have moved from being like a seed business to being an agricultural advisory business?

Yeah, so we're here in Galvanize, right? In downtown San Francisco. About two years ago we bought the Climate Corporation, which is right around the corner from here. And what the Climate Corp does is it's a data science company, and they were originally very much into weather and predicting weather for agriculture. And Monsanto at the time had some efforts to build some advisors for growers, and we bought the Climate Corp with the hope of having them really take their expertise and talent that they have in data science and apply it to provide advisors to our growers, to help them understand what seed they should plant, at what density they can plant them (so you can also increase yield by planting more seed on an acre), when to add fertilizer, or when different weather events were gonna happen. And so that's our biggest foray into digital ag. And the Climate Corp represents our digital agricultural face to our customers. And so they're the ones that go out to our customers with these advisors, leveraging these tools, leveraging the output of Spark and R and some big models that they do to predict these things.

And if you put all these digital technologies together, what uplift in yield have you seen so far, and what do you expect, you know, going out a few years?

So that's a really good question, but we've only been doing this a couple of years. And unfortunately in ag, right, you start developing a product and it takes five to seven years to get it to market. And so we don't really know what the output of that is in our growers' hands.
What we do know is that we test things in our pipeline and they perform much better in our pipeline than they do in our growers' hands. So right now we're working with the Climate Corp to understand how do we better align what we're seeing, or normalize what we're seeing, in our development pipeline with the growers' fields. And that gets back to what we talked about before, with different environments and micro environments that live in a grower's field. And they're doing many, many hundreds of acres and we're doing small amounts. And so it's really trying to normalize what we see versus what they see. And so we have a guesstimate, but I don't think it's completely accurate yet.

In our software world, the vendor would call that user error and the user would call that a bug.

Yeah, I think we call it a bug. We call it black gold syndrome, because we always historically picked the best places to plant our trials. And so we always got the absolute best results, but we need to match with what the Climate Corp is delivering to our customers and develop products so that when they tell someone they're gonna get something, they're giving them the right answer.

Right, I call it success. Yeah. I mean, in the environmental gain you have, the cost gain you have in the yield, I mean, all of it, it's just tremendous upside, and we really appreciate your sharing the technology, and I'm sure the folks later on tonight, I'd say get ready to answer a lot of questions. All right. You're gonna be a very popular man, Seth.

Thanks, well, thanks for the opportunity. I appreciate it.

You bet, right, thank you. Seth Dobrin from Monsanto. We'll continue from San Francisco in just a bit. You're watching theCUBE.