I'm a research scientist at the Carnegie Institution's Geophysical Laboratory. And I'm Smaine Zeroug, research director with Schlumberger. Hi, I'm John McGaughey. I'm president of Mira Geoscience, a consulting and software company in Earth modeling in the mining business. Hi, I'm Paul Johnson. I'm at Los Alamos National Laboratory, geophysics and geoscience in general. Hello, I'm Dan Connell, vice president of business development and technology with Consol Energy and a member of this committee. Good morning, I'm John Sirac, I'm a professor in geosciences at Virginia Tech. Yeah, good morning, my name is Steve Enders, I'm currently department head for geology and geological engineering at the Colorado School of Mines, but moving into a leadership role as director of their frontiers initiatives. I'm Lance Waller, in the department of biostatistics and bioinformatics in the School of Public Health at Emory University, and a member of the Board on Mathematical Sciences and Analytics. I'm John Hole, professor of geosciences at Virginia Tech. I'm Russell Hewett, I'm a computational scientist and professor of mathematics at Virginia Tech. I'm Deb Glickson, I'm a senior program officer with the Board on Earth Sciences and Resources here at the Academies. I'm Carmen Agouridis, I'm an extension associate professor at the University of Kentucky and a member of the committee. I'm Dorothy Merritts, geologist, professor at Franklin and Marshall College and a member of this committee. I'm Joel Renner, geologist, retired from the Idaho National Laboratory and now a member of the committee. Good morning, Jerry Meis, I'm an environmental regulatory and litigation practitioner here in D.C. and a member of the committee. Good morning everyone and welcome, I'm Elizabeth Eide and I'm with the National Academies staff. Good morning, I'm Bill Vandermeer, I'm a project officer with the Geothermal Technologies Office, DOE. I'm Elizabeth Metcalf, I'm with the Geothermal Technologies Office in the U.S. Department of Energy. Hi, Sean Porse, also with DOE in the Geothermal Technologies Office. Zach Ruhn, fellow with the Geothermal Technologies Office. Chris Moses, director of the Geology, Geophysics, and Geochemistry Science Center in Denver with the USGS. Harvey Thorleifson, state geologist of Minnesota. Thomas Halsey, chief computational scientist at ExxonMobil. Tyler Kloefkorn, program officer here at the Academies for the Board on Mathematical Sciences and Analytics. Laila Howard, intern for the American Geosciences Institute. Christopher Keane, director of geoscience profession and higher education at the American Geosciences Institute. I'm Jackie Richards, I'm also an intern at the American Geosciences Institute. I'm Sophie Hansen, also an intern at the American Geosciences Institute. I'm Sarah Ryker, I'm the USGS acting associate director for energy and minerals. Good morning, my name is Darin Damiani, I'm the carbon storage program manager for the U.S. DOE Office of Fossil Energy. How do you do? I'm Kelly Rose, I'm a geology and geodata scientist with the National Energy Technology Lab. Hello, I'm Grant Bromhal, senior fellow for geological and environmental systems at the National Energy Technology Laboratory. Jennifer Bauer, geodata scientist at the National Energy Technology Laboratory. I'm Dave Cole, professor in the School of Earth Sciences at Ohio State University. Okay, well, we have really great attendance here, and do we have some folks online as well? Do we?
Okay. We're going to try to remember to keep them in mind, so again, remember to use your mics; they'll also have a way to flag us, I guess, if they have questions or input or things like that. So that's good. Well, I talked a little bit about the committee being a convening space, and that's kind of what we do, but one of our key roles is to be an incubator of ideas. So twice a year, for those of you who are familiar with our meetings, though some may not be, we meet in the spring and in the fall, and we look for a topical idea to be the focus of our meeting, to engage this Earth resources community. Hopefully we're finding ideas where there's some component that would be useful to advance the understanding of, or to carry forward. The purpose of the meeting, then, is to try to sort through that and see where it might lead in terms of other types of National Academies work that might delve into the topic in more detail. At the end of our meetings, our committee generates a little one-page or page-and-a-half document, and we keep a catalog of those ideas. This topic started a few years ago as just subsurface data, which was more focused on all the variety of data sets that are out there and how to access them; that issue perhaps is still out there. But it's evolved into the whole area of machine learning: you have all these data sets, they're all of different kinds of quality, and how do you turn that into value? So we're continuing to develop that idea, and I'm very much looking forward to today advancing our understanding of aspects of this topic. Before we get into the agenda, let me just share how our day goes. This kind of information gathering and synthesis goes through mid-afternoon, then we take a break, and then we're going to invite everybody who wishes to come back together for the last session, which is really the session that asks: okay, from a National Academies standpoint, what can we do with this, how can we take it forward, and are there folks that are interested? I share that now so that as we work through the day you can jot down some ideas, so you're ready to come back for that last session and help us figure out how we can work together as a community to really take this topic forward with the National Academies. I guess with that, I will turn this over to Dan, who's going to be our moderator for the keynote and first session. Thanks very much, Jim, and again, good morning to all of you, and thank you for joining us here today on behalf of the entire committee. As mentioned, I'm Dan Connell. I work for Consol Energy out of Pittsburgh. We are primarily a coal mining company, so certainly understanding subsurface data and innovative ways in which we can use it is very much of interest to us from an industrial perspective. I'm looking forward to learning a lot here today, and I think we're going to kick things off in a very strong fashion with the keynote speaker and then the first panel we have assembled.
We're going to kind of seamlessly flow from the keynote into that first panel and then, hopefully just before lunch, conclude with a discussion-type format. So if you have questions that are very specific to any one of the talks, we'll allow a minute or two after each talk to get those taken care of, but otherwise I think our intent is to have a more interactive panel discussion following all of the presentations. For our keynote today, we're very fortunate to have Paul Johnson join us from Los Alamos National Laboratory. On our industrial panel, we have John McGaughey from Mira Geoscience, who's going to bring us a mining and mineral exploration perspective on opportunities in subsurface data and machine learning; Smaine Zeroug, who's going to bring us the oil and gas E&P perspective from his work at Schlumberger; and then Shaunna Morrison, who I think is going to give us a very interesting perspective from her work at the Carnegie Institution's Geophysical Laboratory, at the intersection of Earth, space, and life sciences, as she describes it in her bio. So it should be a rich discussion with a lot of perspectives. What we're hoping to do here, and Paul is going to lead the way in kicking this off, is to really better understand what all of this means. You hear buzzwords: machine learning, big data, artificial intelligence. You tend to think, boy, these are solutions to just about any problem. And I think there are tremendous opportunities to answer a lot of questions with the data and computing technologies that are rapidly developing. But getting an understanding of what all this means specifically in the context of subsurface data, where the true opportunities lie, and where maybe there are questions these are not the best solutions for, is a key part of the discussion today. Also, in particular with some industrial perspectives, we hope to dig into the questions around data sharing and publicly available data, some of the limitations just surrounding getting data into the right format and making it available for use in machine learning and AI applications, and maybe where progress has been made and where there are still challenges to be solved in that area. And then finally, I think all of the speakers are going to be able to share with you some real-world success stories, and maybe some real-world challenge stories, about application of these technologies, and where there may still be gaps that require further technology development that the Academies can consider. So with that introduction setting the stage, I'd like to welcome Paul Johnson to kick things off today. Hi, good morning. It's really an honor to be in this building; I've never been here before, and it's quite awe-inspiring. To move on to my presentation: the goal here is not to offend everybody, but I think I'll succeed in doing that. Meaning, it's really geared at people who know very little to nothing about artificial intelligence slash machine learning. For those, I hope this is something you'll learn from; for those who are experts, just bear with me, knowing that there are people in the room who are really not very familiar with these topics. So with that, I'll go ahead. I'll start with a brief introduction and background to machine learning, maybe a little more than brief.
I was going to talk a bit about supervised learning applied to laboratory and fault physics, some of the work we're doing, but I've elected to drop that for lack of time. That's something I can talk to you about offline, or you can read about in articles; that includes Cascadia. So we'll go from an overview and introduction to machine learning to a perspective and path to the future. There have been quite a number of technological shifts over the last 100 or so years that have been revolutionary in geoscience and have had enormous impact. This is a list I put together; it's probably not comprehensive, in fact I'm sure it's not. But if we look back a little over 100 years ago, we were learning about the age of the earth from radiometric dating. The magnetometer was invented in the 1930s and then towed behind battleships, et cetera, during World War II; it had enormous impact later, and we'll get to that. Spacecraft and satellites in the early 60s: we had a tremendous number of interesting and important discoveries, the origin of the K-T extinction, eventually GPS, and especially Earth imagery, which was sort of the first thing that came out, like Landsat, for example; many of you remember that, and these data are still with us. Then, stepping back to World War II, oceanic research vessels: as I mentioned, these were the vessels pulling the magnetometers behind them and discovering that there were magnetic stripes in the basalt in the ocean. That led to the proof that the theory of plate tectonics was correct; it was sort of the final straw in terms of the information needed to prove it, so that was huge. Then we look forward to widely available computers in the 1980s. I got my first desktop in the early 1980s, and I was thrilled; it changed everything I did. The invention of the World Wide Web in the late 80s and early 90s. Energy technology advances such as horizontal drilling in the 90s, which has completely revolutionized this country in terms of energy extraction and energy independence; it's truly remarkable. Fast computers starting in the 80s; dramatic advances in waveform inversion, important to people in this room; large-scale simulation, et cetera. By the early 90s or so we had GPS and InSAR, and those have also dramatically changed how we view the earth: our ability to image displacements, primarily faulting. Quick displacements, such as on a fault, we can image; slow displacements are harder, but it can be done. And gaming and GPUs bring us to the present, and the importance they have for artificial intelligence and machine learning. At present, I don't know if you can read that, but the confluence of machine learning, big data, and super-fast computers has led us to where we are today, and to the focus of this meeting. I won't go into details of the scales of our problems, and again this is probably an incomplete list, but we look at very quick-scale phenomena such as earthquake faulting. We have longer-scale phenomena: fluid and gas extraction, leakage, and fluid flow in reservoirs over much longer time scales. We have induced earthquakes, again on short time scales; we have volcanic eruptions on time scales that can bridge many different scales; and Earth deformation due to tectonic and anthropogenic forcing.
These are indeed long-term effects, and important to us in terms of understanding what's going on underground when we do extraction or injection of fluids. And fault slow slip, which is very important in regard to the tectonics of the earth. Let's see, I'll try to actually point here; I'm pointing over there and many of you can't see, so I'll do this instead. So, machine learning: let's go to some background. This is thanks to Rich Baraniuk at Rice University; I love this image. I mentioned the confluence of big data, big computers, and deep architectures; really, that's where we are today. In the last five to seven years it has just exploded in terms of the evolution of deep architectures, the computers are getting bigger, and the big data is getting massively bigger very rapidly. As everybody in this room knows, there's tremendous hyperbole and tremendous promise at the same time. Data availability has increased by at least 100- to 1000-fold, and key algorithms have improved 10- to 100-fold. Hardware speed has improved by at least 100-fold in the last two decades alone. And this is a fairly profound statistic: 90% of the digital data in the world today has been created in the past two years alone. That's just remarkable. And, well, there's the singularity, and where does that lead us? There's a lot of discussion at government levels and in scientific circles about what's going to happen. How is society going to deal with this? How are countries going to deal with this new frontier we may be facing? None of us knows how this is going to pan out, but we have to think hard about it, and it's really important. At the same time, people worry right now that computers will get too smart and take over the world. But the real problem currently is that they're too stupid, and they've already taken over the world. If you think about your experiences phoning your doctor, phoning Verizon, phoning AT&T, phoning an airline, what is it like? You're stuck in some infinite loop where you rarely get to talk to a person. These are driven by programmers, but they are computers. So that's where we are today, in fact, in everyday life, for many of the advances that have been made. This is also a figure I stole from Rich Baraniuk at Rice: the hype cycle for emerging technologies, as of 2018. On the bottom axis we have time, and on the y-axis we have expectations. This is typical of the hype cycle; it's not just associated with what's going on today with artificial intelligence and machine learning. But we see that we're somewhere right in here right now, in particular deep learning. We're possibly at the cliff here, and possibly very soon at the trough of disillusionment. This is just bound to happen, right? This whole domain has been so hyped, and so overhyped, that it's got to happen, because not all of our expectations are going to be met. Some will, some won't. There'll be a lot of bad work out there and a lot of good work out there, and it'll be hard for your average person to tease out which is which. But following that is the slope of enlightenment, and then the plateau of productivity, where we hope to get sooner rather than later. So that's what we might expect over the next decade. I just saw this yesterday. I was on the airplane coming from France, reading Le Monde, sort of the New York Times of France.
I was reading the science section and found this interesting article on machine learning. This is just the number of individuals working in these areas in different countries. You can see the US is, in terms of pure numbers, if we can believe these numbers, well ahead even of China, which really surprised me, then the UK, Canada, Japan, Australia; and then you see Germany, France, India, and Italy here. That just made me feel somewhat comforted, the fact that we have so many people working in this area. It doesn't mean we're better than other people, but we have a lot more people working in the area, and that's got to help us. So let's get into a little bit of the details. Classically trained scientists are domain experts. We know a certain domain, and the scientific method means we pose a problem, we hypothesize a solution, and we look for a means to get there. It may be as simple as doing Fourier analysis, as an example, that will take you from that hypothesis and the data sets you're looking at to some solution and interpretation. In the world of machine learning, looking through the machine learning lens, you're posing similar questions, but they tend to be harder questions. That is to say, you may have some data set that you're trying to map to some variable of interest. In our case, for example, we might be studying seismic data and asking: what can it tell us about fault displacement at the Earth's surface? There's no simple mapping, as it turns out, and that's where machine learning comes in. Basically you're building a function in this space; we'll call it f of x, and this function is built using training data. It builds itself, trying to determine whether or not it can make such a mapping. So it handles much, much more complex problems, and it can explore a function space that is vast. The function space being: maybe it's frequency content, maybe it's the moments of the signals, the variance or kurtosis, maybe it's some chemical signatures; but it's combining all of these in some simultaneous manner and iterating on that to build this model, to see if it can make the mapping. And that's the great power of machine learning, especially for supervised learning problems: you can solve problems you could never hope to solve before we had machine learning. At the same time, you can get yourself into a lot of trouble, and we'll come back to that. But that's the general idea, and that is really the difference between what we've done classically in science and what we're doing today. This function space I talked about is here: the function space where we live as researchers is somewhere in this red circle, and the function space that the machine learning algorithms can explore is this much larger space, and it's getting bigger all the time. So this is conceptually what we're talking about in terms of the function space. Now let's talk a bit about supervised learning. There are broad categories here: unsupervised and supervised learning. The idea in supervised learning is that you have an input data set; here I'm using acoustical or seismic data, or just some time-series signal that you've got. And as I mentioned earlier, you're interested in some quantity of interest. This could be some chemical marker. This could be some physical effect on a fault. This could be the permeability of an underground reservoir.
This could be displacement in that underground reservoir. That's the quantity of interest, and you're trying to determine whether the signal or signals you're looking at can make that mapping, by building this f of x function. One does this not directly, but by building what is known as a feature space, and that feature space is comprised of statistics of the signal you're looking at. The operations that are developed, or built, inside the f of x are based on these features. That's a very rough and somewhat naive introduction to how this works, but it roughly captures how it's done. The quantity of interest is known as the label; that's the jargon of machine learning. The input data are the input data, and the machine learning algorithm is the function I've just described. There's also unsupervised learning, and we'll get to the full array of machine learning approaches fairly soon. In unsupervised learning, the idea is that you don't know anything about the data. You have this data set, and you don't necessarily know what to do with it, or you want to see what an algorithm can find out about it. What is contained in there? Is there information in there that you're not aware of? In this case, the computers teach themselves, and they model an underlying structure in the data if it's there, or a distribution in the data if it's there; that's the hope. There is no correct answer and there is no teacher; that's the idea. The algorithms are truly left to discover interesting structure in the data. A simple example is classification or clustering. Many of you in the room have probably used something like k-means clustering, and that's an example of this kind of unsupervised learning. Deep neural networks can be used for both unsupervised and supervised learning. Here was a really fun example, published in about 2012, so it's now quite old, but this was a huge advance: Google's artificial brain learned to find cat videos. They used a very large number of computers, and they never told the computer during the training, "this is a cat." So they didn't use labeled data themselves. Jeff Dean, a Google fellow, blah, blah, blah. But the cat videos were in many cases labeled online. The idea is that you throw a ton of data at the algorithm, and the software automatically learns from the data. Ultimately it did what we do: it started to find cats. Deep learning, for unsupervised and supervised applications, is much, much more powerful than traditional machine learning algorithms. But it's not always what we should turn to, because it's opaque. You do not know what the network did to get to the solution, or if you do, you had to work very, very hard at it. So if you care about how you got to the solution, these aren't necessarily the best approaches to take. There are other approaches that are transparent, where you can tease out exactly how the f of x got to the solution. But there are many applications where we just don't care. I like to think of our telephone: when we hear a song on the radio, we want to ask Siri or Shazam, what is that song? We don't care how the neural network gets there; we just want the name of the song and who the artist is. So it's really application dependent. There are details about neural networks and deep learning, et cetera, but we won't go into that, just for the sake of time.
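To make that supervised setup concrete, here is a minimal sketch on purely synthetic data; the signal model, the feature choices, and the random-forest learner are illustrative assumptions, not anything from the talk. Statistical features are computed from windows of a time series, and a model, the f of x, is trained to map them to a quantity of interest, the label.

```python
# A minimal sketch of supervised learning on time-series features (synthetic).
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_window(q):
    # Hypothetical physics: signal amplitude grows with the hidden quantity q.
    return (0.5 + q) * rng.standard_normal(500)

def features(x):
    # The "feature space": simple statistics of the raw time series.
    return [x.var(), skew(x), kurtosis(x), np.abs(x).max()]

q_true = rng.uniform(0, 1, 400)                         # labels (quantity of interest)
X = np.array([features(make_window(q)) for q in q_true])

X_tr, X_te, y_tr, y_te = train_test_split(X, q_true, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)   # builds f(x)
print("held-out R^2:", model.score(X_te, y_te))
```

The point of the sketch is only the structure: raw signal in, statistical features computed, a learned mapping to the label, and evaluation on data the model has not seen.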
So I'll call this the zoo of techniques that machine learning is comprised of. There are many, many, many. When you hear the words machine learning or artificial intelligence, it's really an array of many different kinds of approaches one can use. And herein lies the risk, or the challenge, which is to say: which one do you use? I encourage anybody in the room who's doing hands-on work, and those of you who are supervising people doing hands-on work, to think very, very hard about this very problem. Because if you run off using the wrong approach to solve a problem, you might get an answer that looks realistic, or that you think is correct, and it could be absolutely wrong, and you won't know it. Here's where the domain expert, that's us, and the machine learning expert, that's some of us, come together. The best way, in our experience at my lab, is to bring together people who are really well versed in machine learning with the domain experts who are asking the question and showing the machine learning experts the data, and they work together toward a solution. That's how you address this problem of deciding which algorithm or algorithms you're going to use. Now let's talk a little bit about data approaches; Elizabeth asked me to, this is one of the topics she wanted me to hit. I would say most data is appropriate, though there are probably people in this room who can think of data that is simply not appropriate. Data quality is absolutely key. If you throw bad data at a machine learning algorithm, it's not going to be any better than throwing bad data at a domain expert doing some sort of analysis. It doesn't improve the quality, although there are ways to improve quality if you're focused just on that question. This idea of labeled data for supervised learning is truly fundamental. You have to have good, reliable, labeled data sets. And that means you've had to have somebody go in, an individual almost certainly, and identify what something is, as an expert in the area, to create those labeled data. If you have poorly labeled data, say 20% of your data is bad, that can lead to serious problems. So machine learning is appropriate when you're working in an unknown and unexpected function space that must be explored. That is to say, these bigger problems where you're looking for a way to make this mapping we've talked about in supervised learning, for instance, but you don't know how to get there; that's where machine learning really excels. When are other approaches appropriate? Well, simple questions may require simple approaches, okay? This is not a simple question in itself, and there's no simple answer. But if you can just take an FFT or do a chemical analysis and get the results you want, that's probably enough. If you're trying to tease out more information from the data sets you have, that's where machine learning comes in. It may be that there's more information contained in there than you thought, which is what we're discovering in the work we're doing, and that's been truly remarkable. It's a data-driven learning experience about the physics of the problems we're working on. So what topics can machine learning help answer in geoscience? Nearly every topic in geoscience applications; off the top of my head, I can't think of a single problem it couldn't help with. This is just a list of some.
There are examples here of geoscience applications that include geologic mapping, on that top row, from Anya Reading and her associates in Tasmania. There's simulation work, by DeVries and others, in terms of improving simulation and doing simulations in entirely new ways, without necessarily doing tomography; the tomographic model is being built inside the machine learning algorithm. Inversion: this is work by Maarten de Hoop, Gupta, and others; this is tomography done in radically different ways, also using deep learning. And discovery: this is some of our work, where we're discovering new physics by looking at signals in new ways based on the machine learning approach we're using. When does machine learning fail? Well, there's nuisance variation in images; that is, imagine changes in location, pose, viewpoint, and lighting, particularly with images: geological images, images of humans, cats, et cetera. These kinds of variations, one can imagine even in satellite imagery, can lead you into problems. Then there's non-stationary data. What that means is that you have an evolving system, where the data you collect today are system-dependent for that instant in time, and tomorrow, or next year, or 10 years from now, the system has evolved, so the data are different because the physics of the system are different. You have to know that about the system before you start throwing this kind of tool at these problems. Once again, if you try to apply machine learning to a non-stationary problem, you can get yourself into real trouble, although there are ways to deal with it if you're working with people who know what they're doing. Then there's randomness, or entropy: you cannot learn a pattern that does not exist, okay? You throw data at a problem, you're trying to learn a pattern, and maybe there is no pattern, but you work hard enough at it that you get one anyway. So again, a note of caution. A lack of training data: this is a really important problem, and a reason machine learning can fail. If you don't have enough data to learn on, then for many of these approaches you can't solve the problem. That's a very important thing to know as you go in to apply these tools. And overfitting: this is a chronic and classic problem in machine learning, and again something you want a machine learning expert to help you with. So that's the bottom line there.
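As a toy illustration of that overfitting caution, an assumed example rather than anything from the talk: a high-capacity model can score nearly perfectly on a small, noisy training set and still fail on held-out data, which is exactly why the train/test distinction matters.

```python
# Overfitting in miniature: compare a modest and a high-capacity polynomial fit.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 25)[:, None]
y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(25)   # small, noisy training set
x_test = np.linspace(-1, 1, 200)[:, None]
y_test = np.sin(3 * x_test[:, 0])                          # the true underlying signal

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    print(f"degree {degree:2d}: train R^2 = {model.score(x, y):.2f}, "
          f"test R^2 = {model.score(x_test, y_test):.2f}")
```

The high-degree fit tracks the noise in the training points; the held-out score exposes it.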
In terms of going to the future, these are things that we're involved in. There are workshops at the NeurIPS meeting, as it's called; this is the biggest meeting in machine learning annually. Next year it's in Vancouver. It's very hard to get into this meeting: last year registration opened up, and 11 minutes later it was closed. 9,000 people attended, and registration opened in the middle of the night, of course, in some countries; imagine what that's like. But it's an amazing meeting, and everybody there is under 35, which is really something for somebody my age. There are special sessions happening all over the place now; we know they're going on at AGU, the Seismological Society of America, SEG, IEEE. We know there are focused industry meetings as well. We host a meeting ourselves, on machine learning in solid earth geoscience; we had the second one this year, in Santa Fe last March. The first meeting was the year before, in Santa Fe in February, and the next one will also be in March next year, in Santa Fe. Please come if you're interested. It covers both basic research and application to industry. Also, there are contests, and these are going to really help drive this field forward. One thing we did, in collaboration with Laura Pyrak-Nolte at Purdue and with funding from the Office of Science, was to post a competition on Kaggle. It just closed this last week. We had one of the largest responses they've had in the history of Kaggle; I think in the end we had about 5,000 competitors playing. Right now they're deciding on the winners. And then the idea is: what do we do with that information? We start to try to collaborate with the very best teams. They seem to be interested in geoscience, but they're probably not geoscientists; how do we engage them? That was the whole idea behind this. And so, in conclusion, how do we proceed? Before I get to that, I just want to make one really important point. Machine learning is a set of tools; that's what it is. This is the new tool in our toolbox. It's enabling us to do new things, and it's going to enable us to do new things in the future, sometimes profound; we're already seeing that. So that's the thing to keep in mind. Developing the tools themselves is for the people who do that; that's not what we do. We're the users; we're not developing new algorithms. And it's something we need to impart to the young people we're working with today: this is something they have to learn, because in five years everybody will be using these tools, so they've got to do it now. That's part of this conclusion. Do I have time to wrap this up, or should I just stop, Elizabeth? Okay. Sorry, Daniel. So the idea here is: how do we move to the future? Science and education: I just mentioned the educational aspect. Conferences and workshops: I mentioned those. Joint work with machine learning experts: that's this portion in here; we've talked about that. New machine learning architectures: this is something we can contribute to, but most of us will not be doing that kind of development; there are some people who straddle the two domains, but they're quite rare. And then, and these are really important in my mind: open access, open-source software, open data. This is very difficult for industry; how can they share data without giving away something they don't want to give away? Even in the world of open research, getting hold of data that was publicly funded can sometimes be very challenging. But at the same time, there are a lot of data clearinghouses out there where you can download massive amounts of geophysical data easily. IRIS is an example for seismic data; you can get great satellite data for gravity; you can get InSAR data, et cetera. Then, benchmark data sets. This is related to the competition, or you just simply have a benchmark. This has taken place in many fields, including industry; industry has kind of led this aspect of science, in that you post benchmark data sets and people compete to see if they can produce the known result without knowing the result in advance. These things are going to advance our field, in terms of our understanding of the solid earth in general, as we apply this new suite of tools. Thank you very much. Thank you very much, Paul, for that excellent introduction to our day.
I think you really established a framework to put us all on the same page for the ensuing discussion. We're going to segue right into the panel now, and Paul will be participating in our panel discussion at the conclusion of the session, so he can certainly take questions at that time. Kicking things off, as I said, with the mining and minerals exploration perspective, is John McGaughey. John? Yeah, thank you. It's a pleasure for me to be here as well. I'm not sure where we are on the hype cycle in the mining business; I would say somewhere near the top. There's a lot of action at conferences around machine learning, a lot of sessions that are very well attended. Most of the work, to my mind, isn't that interesting yet, but I suppose that will come in time. We've been using machine learning in a couple of mining-specific applications for a few years, and that's what I want to share with you today: how we think about it, how we put data together for it, what the challenges are, and where I think we have further opportunities in the future. The two specific applications I'll be talking about are, first, mineral exploration, and second, geotechnical engineering; in geotechnical engineering, specifically looking at the possibility of forecasting geohazards of various types. This is a conceptual picture of what we're trying to achieve in mineral exploration, usually now in 3D, sometimes still in 2D, but the equation shown here is very much the f of x equation that Paul showed in his presentation. In this case, we're trying to combine different data sets that come from different sub-domains within the geosciences; we're looking at geology, geochemistry, geophysics. We tend to have a low number of data sets; we might have a few tens of data sets if we're lucky. Those data sets have many variables within them. And the game here really is to combine all that data somehow into a function with which we can predict the probability of an ore deposit occurring somewhere in the subsurface. The picture in 3D is really a picture of that: those cells, or voxels, are cells that have been determined to have a high probability of mineral deposit occurrence, and then we target a drill hole on that. So this is very much the supervised learning approach that Paul talked about; the label here, to use that same jargon, would be the target. And the way this is approached is to look at areas where we understand where deposits occur, look at the data sets that have been gathered around those deposits, and use machine learning to try to build this function so that we can apply it in new areas. We'll switch now, just conceptually, to the geotechnical problem, and I'll be going back and forth a little bit, because one of the things I want to show is that even though these specific problems are quite different from each other, and have different audiences within the mining industry, how we go about setting things up is actually quite similar. The geohazard problem looks like this. We again have this f of x that we want to solve. In this case, it's a four-dimensional problem: the hazard is very much a function of time as well as space. Things are changing in a mine all the time. The data sets are changing; many of the data sets are a function of time. Some of them, like geology, are not a function of time, but many of them are. Stresses are changing. Excavation geometries are changing. Microseismic data, which is routinely collected, is changing.
Lots of things are changing, and the hazards, of course, are evolving as well. So we have to be able to produce these hazard maps, as we call them, regularly in time. For us, that typically means daily; in some cases in coal mining, maybe more like hourly. When we're modeling hazard, we model it not throughout three-dimensional space, but on the rock interface, because it's the rock interface where the failure occurs. If you have a landslide occurring on a slope, or a rock burst in a mining tunnel as pictured here, that's happening on the rock interface in the tunnel. The benefits, surely, are immense. In terms of mineral exploration, discovery rates have plummeted, the thinking being that the more obvious near-surface deposits have largely been found, and now we have to search deep, or under cover, which is difficult. The problem itself, though, is that we have to identify the location of an ore deposit at the core of a very complex natural system. These systems are the things we're really trying to sort out with machine learning, and that, I think, is where a lot of the current challenges are. The way machine learning is typically applied to mineral exploration today, when you see the talks at conferences, et cetera, is that the machine learning is looking through the space to try to assess, everywhere in that XYZ space: are you an ore deposit? Is this cell, or this point, an ore deposit? That really is probably not the most appropriate question, because we know the ore deposit is small relative to the system in which it sits, and detection of the system itself, and understanding where you are in the system, is actually a much more important problem. This really gets to my broadest point today, which is that how these problems are set up for machine learning is the greatest determinant of success or failure, much more important than the machine learning algorithm itself. Just a note, without going into details, on how we now approach the problem in mineral exploration. This is a cartoon, a cross-section of the earth; the ore deposit is A, and those zones or domains around it, B, C, D, E, are zones of alteration. These zones together comprise the mineral system, which, if you're lucky, will be kilometers to tens of kilometers across, and that's what we're trying to detect. We're trying to sort out where we are in that system in such a way that we can vector toward its core, and that's how we approach these problems now. In fact, we use a mixture of unsupervised and supervised learning to do this; Paul talked about that distinction. In unsupervised learning, you don't have labeled data or training data, and we use unsupervised learning to bring data sets in and try to understand structures within those data sets that will determine groupings of data that we can assess as likely representative of the alteration domains. So we use unsupervised learning, in other words, to create the labels A, B, C, D, and E that we then use for the supervised learning, and then we train the system to recognize those domains. If you have new data, if you're in a new area, you have a better chance of success by asking, not whether the new data correlates to an ore deposit at this location, but whether the new data is within a mineralized system. And if so, is it proximal, medial, or distal to the core? In the best circumstances, you may get an indication of the targeting vector toward its center.
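Here is a minimal sketch of that two-stage idea on synthetic stand-in data; the feature matrix, the cluster count, and the choice of a decision tree are illustrative assumptions, not the speaker's actual workflow. Unsupervised clustering proposes the domain labels, then a supervised, transparent classifier is trained to recognize those domains in data from a new area.

```python
# Stage 1: unsupervised clustering creates domain labels (stand-ins for A..E).
# Stage 2: a supervised, interpretable classifier learns to recognize them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
# Stand-in for multivariate geochemistry/geophysics samples (rows = cells).
X = rng.standard_normal((1000, 6))

# Stage 1: k-means proposes five groupings of the data.
domains = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Stage 2: train a transparent model on those labels, then apply it to
# data from a new, unexplored area.
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, domains)
X_new = rng.standard_normal((10, 6))           # data from a new area
print(clf.predict(X_new))                       # predicted domain (0..4, i.e., A..E)
```

In practice the clusters would of course be assessed by a geologist before being treated as alteration domains; the code only shows the mechanical shape of the pipeline.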
Switching back to geohazard, again, the parallels are strong. We're trying to identify the location, and in this case the timing, of a phenomenon, the hazard, which can be of many different types. It could be a roof fall. It could be a slope failure. It could be a rock burst. These all result from very common systems with both natural and engineered components. It can be applied to purely natural slopes as well, but for us in mining, there's a combination of the natural geology with the engineered components on top of it. We have to assemble evidence from the interpretation, and this is key, of numerous sparse, heterogeneous, time-dependent, non-colocated data. As we say, our problem is not so much big data as it is messy data. We have data that is not only messy in the way described on that slide, but messy in that a lot of it is analog. It's typically a mess when it comes to us, and a lot of the work has to be done up front on the data. Again, the challenge for us is much more the problem setup than the machine learning algorithm itself. The conceptual framework for the geohazard problem looks something like this, and I use this drawing for a reason, because this is how data typically comes to us; this is what I mean by messy data. When people on the mine site are showing us when excavation geometries occurred, as often as not we get something, still in 2019, done with a colored pencil, and we have to do a lot of work on that data to get it into shape before we can start to apply these techniques. This is from one of the deepest mines in Canada; you can see 7,680, which means 7,680 feet below surface. The way we approach this, looking for rockburst hazard, is also indicated conceptually by those red dots, because what we do is digitize the supporting structure on which we're trying to compute the probability of the hazard. We typically just digitize center lines along the tunnels, with typically a meter or two of spacing between those points. What we then have to do is get all of the data that we think may correlate to the hazard onto those points, and that is the toughest thing. This is the so-called feature engineering: the generation of the variables, or features, that you're going to correlate to the labeled data, or training data, which is the experience of the hazard. In this case, we have this typical grab bag of different data types; forgive the acronym there, RQD is rock quality designation, which is fracture density, more or less. But we have many, many different data types, typically a few tens of them. I want to make the point again that our biggest problem, in my view, is how we get all of this data together onto these data supports that we want to correlate to the thing we're looking for, whether that's a mineral deposit or a geohazard or something else. It's just a huge outstanding problem. It's a problem that existed before the advent of machine learning, and it just has a brighter light on it now, because we want our data to be good, and we're correlating all of these different variables, at all of these different points, to the target. But how well do we understand what those variable values are at those points? We are always moving from sparse, heterogeneous data into these models.
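As a sketch of what that feature engineering onto data supports can look like in practice, with hypothetical geometry and variable choices: for each digitized center-line point along a tunnel, compute the distance to the nearest mapped fault and the local microseismic event density, two columns of the eventual data fusion table.

```python
# Feature engineering onto digitized center-line points (synthetic geometry).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
points = rng.uniform(0, 1000, (5000, 3))     # digitized tunnel points (m)
faults = rng.uniform(0, 1000, (50, 3))       # points sampled along fault traces
events = rng.uniform(0, 1000, (2000, 3))     # located microseismic events

fault_dist, _ = cKDTree(faults).query(points)              # nearest-fault distance
density = cKDTree(events).query_ball_point(points, r=100.0,
                                           return_length=True)  # events within 100 m

# Two engineered columns of the data fusion table, one row per support point.
features = np.column_stack([fault_dist, np.asarray(density)])
print(features[:3])
```

The real task is, as John says, far messier: the inputs are sparse, heterogeneous, time-dependent, and often analog to begin with; the sketch only shows the last, easy step of attaching derived variables to the supports.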
So we may go from an airborne magnetic data set, as shown at upper left, to a 3D grid of magnetic susceptibility, as shown at upper right, through an inversion function which typically incorporates smoothness constraints as a mathematical convenience. On the geological side, we'll go from map data or drill hole data, typically now through implicit modeling routines, to a model of the subsurface which is the exact opposite of that: a geological model of discontinuities, formational contacts, and structures, the assumption being that properties in the earth are not smoothly distributed but discontinuous. That is probably more correct, but in the geophysics we're still working with smooth models. These sorts of inconsistent assumptions in the modeling are actually the biggest bugbear, I think, in the successful application of things like machine learning, because we're trying to correlate things that we don't control well as inputs. I won't get into the machine learning mechanics other than to show this data fusion table, as we and others call it. This is the representation we have to get to for all of these problems, whether it's mineral exploration or geohazard or anything else. It's a flat ASCII file, typically a big CSV file, in which we have rows of variables, and then we have the labeled data on the right, which could be a hazard, or an alteration domain in a system. Those observations, those rows, could be every point of a 3D grid, or every digitized point along the mine tunnels; in the end we typically have hundreds of thousands to millions, or even tens of millions, of rows and a few tens of columns. Then, with the machine learning algorithm, whether it's neural networks or logistic regression or whatever algorithm we apply, we're trying to understand the patterns among the variables in the rows with the occurrence of, as I've put it here, the X or the O: the X being the thing we're looking for, the mineral deposit perhaps, or the hazard, and the O being where it's not occurring. I do want to make the point here, though, that for our domain, at least in these two application problems, we definitely want certain algorithms. Paul talked about the opaqueness of deep neural networks versus the transparency of some other algorithms. We want the transparency, and that's because when we show the final results, the mine engineers or the geologists want to understand why we're saying some spot in the earth is more deserving of a drill hole target, or of enhanced support in the mine, than another spot; they really want to understand what went into that. In machine learning in general, we'll take a table like this and put it into data space, where we'll look at where those X's and O's line up with respect to the variables. If I just show it for two variables, two columns of that file, I get the X's and O's plotting somewhere, and what I think any machine learning algorithm is trying to do in principle, though they all go about it in different ways, is to find the areas of the X's in that data space. Of course, in practice the number of dimensions in the data space equals the number of features or variables, so it can be a very tough problem, but at the end of the day this is what they're trying to do. And some of them work exactly as shown on the slide: they'll do a brute-force search through the data space and look for so-called rules, which are bounded intervals of variables that describe where you have more training data.
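Here is a minimal sketch of such a data fusion table and a transparent algorithm fitted to it; the column names and the synthetic label are assumptions for illustration only. With logistic regression, the fitted coefficients give the engineers some insight into why a location is flagged, which an opaque deep network would not.

```python
# A toy data fusion table (rows = observation points, columns = variables,
# plus a label marking occurrence, the X's, vs non-occurrence, the O's).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 100_000
table = pd.DataFrame({
    "mag_susceptibility": rng.standard_normal(n),
    "fault_distance_m":   rng.exponential(500, n),
    "event_density":      rng.poisson(3, n).astype(float),
})
# Synthetic label: occurrences cluster where fault distance is small.
label = (table["fault_distance_m"] < 100) & (rng.uniform(size=n) < 0.5)

# A transparent algorithm: the signed coefficients indicate which variables
# drive the prediction, something mine engineers can interrogate.
clf = LogisticRegression(max_iter=1000).fit(table, label)
for name, coef in zip(table.columns, clf.coef_[0]):
    print(f"{name:20s} {coef:+.3f}")
```

Rule-induction methods of the kind John describes go further still, returning explicit bounded intervals of the variables, but the transparency motivation is the same.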
We then turn that into a result, which is typically a color-coded heat map. In this case it's mineral exploration at a regional scale; this is actually from Australia, and the long axis is 80 kilometers, so it's truly regional. Warm colors are more likely to be sitting at the core of a mineralized system, in that interpretation. The training data are the black dots, so you can see that the training data are not all that numerous. For geohazard, we also give the people on the mine a heat map. This is looking at a deep underground mine; you see the tunnels and the stopes from which the ore is extracted, and the red areas are those that are more hazardous. You can actually see here, with the red areas being sort of semi-linear, that they follow faults cutting through the area. So, solution requirements, at least for the domains, or subdomains, we're looking at in mining: although the particular mechanics or methods of AI or machine learning are not unimportant, I think the focus needs to be elsewhere. The focus needs to be on how the geoscience problem is set up for AI, and that is where deep domain knowledge and a domain-specific supporting framework are required for tackling these problems. That's something we've been working on over the last four or five years, building a system. It's really driven by the needs on the mines. If we're doing geohazard, we need to be updating the hazard estimations, say, every 24 hours, and that means we need to be actively taking in data from all of these different sources: convergence stations, microseismic data, stress data when it's available, all the different kinds of data, into one data warehouse where that data fusion table can be constructed. People can get at it, they can see hazard updates, they can get reports emailed to them. It has to be a system that works autonomously; people on the mine aren't going to sit there with complex software at a desktop and run it. And this is a little bit of what that looks like in a flow diagram: we have data coming in, typically from file folders or other databases on the mine, connected live to this data warehouse where that data fusion table is built. Spatial computations, if they're required, can be done on the servers; in other words, proximity to a fault, or microseismic event density, those sorts of things, are computed autonomously as new data comes in. The machine learning is not done on the system, though; the machine learning is still done by experts offline, but the so-called rules or relationships are encoded back in, so the people on the mine can get updates. So, in conclusion: that f of x equation that Paul talked about earlier can be constructed, and it can be solved, and we're doing that in those two different domains. It requires integrated, multidisciplinary 3D and 4D modeling; that's probably the hardest part of it. It requires compilation of that data and the model components into a data fusion table, connected to the statistical machine learning applications. And finally, if you're going to do it in anything like real time, you need an integrated data management, model update, and analysis application to handle the flow of data and reporting. That's it; thank you very much. Thanks very much for that, John. We'll move along now to Smaine, bringing us the E&P oil and gas perspective. I'll just start by thanking the organizers for having invited me to speak on this topic.
It's quite an honor to do it before this audience. My talk will be somewhat semi-technical. I read the motivational text for this meeting carefully, and I figured it's actually better to take a helicopter view of certain aspects. So I'll talk about the context within the industry right now, insofar as the problems we're looking at; I'll touch on data, and I'll actually spend some time on data, because there is no successful machine learning without the 80% of the time people will spend doing the data curation. I'll speak about the environment, if you like, and I'll highlight that Schlumberger is doing something tangible about it, because what I could read between the lines of the motivational text is that there's almost a question of what the oil and gas industry is doing about this, what it's contributing. So I'd just like to say that, as far as Schlumberger is concerned, we are bringing some tangible and viable solutions that we hope will extend beyond industry users. I will do some machine learning, presenting some examples that give us promising and encouraging results on adoption, but that also highlight the pitfalls, the risks, and the challenges, and I'll conclude with some remarks. Insofar as the context: I think there is literally a digital transformation going on in the oil and gas industry right now, and one can think of the leveraging of the convergence of, and advanced progress in, a number of enabling technologies: sensors, computing power, high-performance computing from the cloud to the edge, analytics, robots, and also communication, or connectivity. That leveraging is directed toward two fronts, if you like, that are somewhat competing for resources and funding. One of them concerns us, which is how subsurface data can be leveraged, within a particular environment, along with machine learning, so it can inform decision making about what's happening beneath the surface in particular, but also environmental monitoring, and exploration, to accelerate discovery in the exploration process. The other front is the leveraging of these technologies to bring about gains in operational efficiencies, which can translate into gains for the bottom line. This is quite important for the oil and gas industry given the business context it has been in for the past few years, and the reason I'm highlighting this is simply because it competes in terms of resources and funding. Now, I heard from Jim that the question of types of subsurface data has been discussed here in the past, so I'm going to go rather fast on it, but I'll share with you, very briefly, the types of data that we deal with, in terms of images and other types. In terms of images, we're dealing with data coming from multiple scales, quite a spectrum, running from the nanometer scale, where we do microscopy on thin sections, all the way to the other extreme, at the right-hand side here, where we look at low-frequency electromagnetic probing, like controlled-source EM, or seismic, which gives quite coarse resolution but a large extent of what the basins and the reservoirs are. If I were to focus on some of the key technologies that the teams I work with concentrate on, those are images and data obtained by lowering devices into boreholes, as well as by shining ultrasound or EM waves at the surface.
The picture shown here is in fact a whole panoply of the physical measurements that are used, spanning NMR, electrical, and density measurements, all the way to measurements that, as you lower the frequency, tend to see deeper into the formation, at tens of meters perhaps, and then at hundreds of meters as well. This is extracting physical parameters that speak to the static but also the dynamic properties of rock formations. Data comes as time series, like production data and liquid flow logs; as depth logs, that is, the physical properties of the rock expressed as a function of depth; and as geological outcrops, which are often used: we go on field trips, and we get to see analogs through the outcrops, which we keep in our minds and actually use in interpreting what we see through the physical measurements lowered into the borehole. Last but not least, there is the text in reports, interpretation reports; this is extremely useful, and I'll come back to it. So this is just to say that the data we deal with in the oil and gas industry encompasses multiple physics that need to be integrated, at multiple scales, and there is a diversity in its character as well. I'm going to go fast again on this data, because I'm sure many of you, if you have attended meetings here in this building, have discussed it, perhaps in great detail, but just to say, for the subsurface data that we deal with, a couple of things to think of: it's extremely heterogeneous, but rich in properties, and it's not big data; it's somewhat sparse and low volume, and also sparse, if you like, in the physical properties that it brings. One can think of seismic as being big data; it's not. Paul just reminded us of what we as humanity generate, and the internet is many orders of magnitude bigger than the largest marine seismic acquisition job, which might run to about a hundred terabytes or so. And when it gets to the seismic, let's not forget that it gets reduced to sometimes a couple of gigabytes, or perhaps tens of gigabytes, of use for the end user, through an image of the subsurface, as I'm showing here. The other component that I'd like to highlight is well logs. In terms of volumes it's minimal, minuscule: a few gigabytes per well. We do have millions of logs out there, but only in a few areas, like the US and the North Sea; if you go outside of these regions, there really is a great deal of sparsity, owing to the fact that the wells are quite sparse, or sample a field in a sparse manner, and for most of these wells the logs that are available tend to give us only some basic physical properties. Now I want to highlight the elephant in the room, which everybody talks about when it comes to data and machine learning: it takes a long time to curate the data, the ETL, extract, transform, and load, and we need to keep that in mind all the time. For seismic, it literally takes sometimes a year to go from the raw data all the way to a cube that describes, say, parts of the Gulf of Mexico. For well logs, you could ask a petrophysicist, and they would tell you that at least 80% of their time is spent just preparing the data, before they start the integration process and some interpretation around it using whatever tools they have. That has to do with a variety of factors that I'm sure you have discussed here:
the poor quality of some of the data and the diversity in quality, sometimes owing to the calibration of the sensors used; there are problems with calibration, and there is often missing data in these logs. One example: if you have a well with washouts, the sonic slowness does not get established and you have a missing velocity over that particular interval. Depth matching is another. We acquire physical property data by lowering these devices, but often through multiple passes in the same well, on wireline or while drilling, and that leads to mismatches between the various curves that you need to put on the right depth; depth is surprisingly hard to get right.

I want to go back to one input I highlighted: text. It is extremely valuable in the sense that it captures the context of whatever is being interpreted or described in terms of quantities, and we would like to mine it, to extract from it, because to some extent it holds the label we are after, a label you can use in supervised learning. There are plenty of inconsistencies, though; I think John just showed one, in how people annotate, and how you digitize and capture that presents really significant challenges.

I put a slide here just to pause and remind everybody, and I am sure many of you are aware of this, in the words of Michael Stonebraker. Two days ago I attended the annual meeting of the corporate alliance program at MIT's computer science and artificial intelligence lab. Professor Stonebraker, a Turing awardee who works in relational database research, said that the data curation problem is really the 800-pound gorilla, and, speaking to representatives of the roughly 80 corporations that are part of that alliance, that you really ought to put your best people on it.

I am part of teams of scientists who have been tackling data and using machine learning over the past few years, much as Paul mentioned, to complement our arsenal of deterministic approaches for interpreting subsurface data, and to build solutions that people in operations can use to support decision-making: accessing the data in a scalable fashion and asking whether it is structured enough to start applying machine learning and AI. It is a big, big challenge. At the end of the day, what we kept asking for is what is highlighted here in blue: we would like to work in an environment, underpinned by a software architecture and high-performance systems, that orchestrates access to the data and to the algorithms that interact with and operate on it, visualization included, through a user-friendly experience that still caters to the requirements of the business and the regulators. By that I mean security, accessibility, an environment that allows collaboration between the various stakeholders, and scalability. That is what we ended up wishing for, and it is something that propagated within Schlumberger over the past several years; the company transformed itself to try to provide a solution of that sort.
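To make the depth-matching and missing-data issues above concrete, here is a minimal sketch of the kind of preparation a petrophysicist's tooling might automate. It is not Schlumberger's workflow; the curves, the sampling rate, and the assumption of a single constant shift between passes are all invented for illustration. It estimates the depth shift between two passes of a gamma-ray curve by cross-correlation and flags a washed-out sonic interval as missing rather than zero.

```python
import numpy as np
import pandas as pd

def best_depth_shift(curve_a, curve_b, max_lag=40):
    """Return the sample shift to apply to curve_b (via np.roll) that best
    matches curve_a, found by maximizing normalized cross-correlation."""
    a = np.nan_to_num((curve_a - np.nanmean(curve_a)) / np.nanstd(curve_a))
    b = np.nan_to_num((curve_b - np.nanmean(curve_b)) / np.nanstd(curve_b))
    lags = range(-max_lag, max_lag + 1)
    scores = [np.sum(a * np.roll(b, lag)) for lag in lags]
    return lags[int(np.argmax(scores))]

# Two passes of a gamma-ray curve over the same interval, 0.5 ft sampling.
rng = np.random.default_rng(0)
depth = np.arange(1000.0, 1200.0, 0.5)
gr_pass1 = 60 + 25 * np.sin(depth / 7.0) + rng.normal(0, 2, depth.size)
gr_pass2 = np.roll(gr_pass1, 6) + rng.normal(0, 2, depth.size)  # cable mismatch

shift = best_depth_shift(gr_pass1, gr_pass2)
logs = pd.DataFrame({"DEPT": depth,
                     "GR_P1": gr_pass1,
                     "GR_P2": np.roll(gr_pass2, shift)})  # pass 2, aligned

# Flag a washout interval where sonic slowness was never established, so
# downstream velocity work treats the gap as missing rather than zero.
logs["DT"] = np.where((depth > 1100) & (depth < 1110), np.nan, 90.0)
print(f"estimated shift: {shift} samples ({shift * 0.5:+.1f} ft); "
      f"missing DT samples: {int(logs['DT'].isna().sum())}")
```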
Working with the best technology partners out there, we ended up going back to Palo Alto, a place we had deserted in the '80s, when we used to have an artificial intelligence lab there. We went back about five years ago and created a software technology center, and we transformed how we look at software technology. The outcome is an environment, what we call the data ecosystem, that speaks to the challenges I just highlighted, and it is shown on this slide. The attributes I want to highlight in particular are that it is an environment that allows users to collaborate and gives them access to tools; they can bring their own tools and their own data; and the environment is open and extensible. Perhaps I can expand on that in the panel discussion.

I would like to highlight two examples, which as a user my teams are quite excited about, of what can be achieved in an environment like this. The first was rolled out just three days ago at the EAGE conference in Europe. It focuses on allowing operators who are exploring to significantly accelerate the discovery of hydrocarbons through screening and ranking of exploration opportunities: accessing seismic and similar data at the basin level, as well as wellbore data, available from Schlumberger but also from additional providers, and finding the tools to curate the data and process the seismic, as well as bringing in their own AI tools to operate on it, within an environment built for productivity and efficiency. One example of how going to the cloud and using this DELFI environment has helped us: we have managed to take a typical, substantial processing job that in the past would take us about 12 months and reduce it to two months, through the use of high-performance computing in the cloud but also a number of machine learning techniques that remove the manual tasks. Excuse me.
On wellbore interpretation, I have a movie here that shows things in place. This is really a demo of a task where a user goes to a particular area of interest, picks out particular wells, finds the tops of the formations, and does a correlation across multiple wells. This is an environment that allows that, and as I said, as users my teams are quite excited, because we no longer have to do this on our own desktops, with the data on our own premises. So much for the examples.

I want to move on now to machine learning and share, again at a helicopter view, some of the lessons learned. I think Paul touched on this, and John as well, so I will not repeat what they said, but I will highlight what we see, as scientists in a service company embracing machine learning and complementing our deterministic approaches, in terms of where machine learning brings value to us. First, it brings automation of algorithmic solutions; it brings efficiencies, not just in the interpretation algorithms but also in the curation itself, which again is the 800-pound gorilla in the room. We are also seeing a number of machine learning techniques help us integrate across physics and across scales and do the interpretation more effectively, reducing the non-uniqueness and the uncertainty we would otherwise face without integrating the data. We gain insight into the data, and Paul highlighted this quite well, for problems that are difficult to approach with deterministic methods; insight that, as a domain expert, can lead you to a better way of resolving the problem. We are also looking at augmenting the capabilities of our experts, and non-experts, through approaches such as causal graphical models, or causal networks, to bring in something cognitive, somewhat mimicking the reasoning workflow that an expert follows. This is not to say we aspire to remove the expert from the loop; I think we are far, far away from that. Rather, it is to help them conduct more interpretation jobs so they can focus on the tough questions as opposed to the laborious and perhaps simple ones. Natural language processing is also something we are using, and we are definitely encountering a variety of challenges, but we are making some headway in extracting value out of the reports, which as I said are quite challenging. In particular, we often generate models of the subsurface, and we would like to label them: teach an algorithm, through what it has seen in other interpretation reports and through analogs to the data we are looking at, whether geology or a log, to attach a geological label, a geological context, to that inferred model.
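As a taste of what mining report text involves, well short of the NLP the speaker alludes to: a toy, rule-based sketch that pulls (top, base, lithology) labels out of an invented report snippet. The report wording, the vocabulary list, and the regular expression are all assumptions for illustration, not the production approach.

```python
import re

# Toy report snippet (invented); real reports are far messier.
report = """Interval 2,310-2,355 m: fine-grained sandstone with thin shale
interbeds, good oil shows. Interval 2,355-2,410 m: massive shale, barren.
Top of the Brent-like reservoir picked at 2,310 m."""

LITHOLOGIES = ("sandstone", "shale", "limestone", "dolomite", "turbidite")

# Pull (top, base, lithology) triples: a depth interval, then the first
# vocabulary match found in the text that follows it.
pattern = re.compile(r"Interval\s+([\d,]+)-([\d,]+)\s*m:(.*?)(?=Interval|\Z)",
                     re.S | re.I)
labels = []
for top, base, desc in pattern.findall(report):
    lith = next((w for w in LITHOLOGIES if w in desc.lower()), "unknown")
    labels.append((float(top.replace(",", "")),
                   float(base.replace(",", "")), lith))

print(labels)
# [(2310.0, 2355.0, 'sandstone'), (2355.0, 2410.0, 'shale')]
```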
I also want to highlight the way we organize ourselves, which perhaps has not been addressed by other speakers. The way we now approach the use of machine learning and the leveraging of data is to discipline ourselves, first, to confirm that there is indeed a business outcome, a business question to address, and not do machine learning for the sake of machine learning. One thing we have also been doing increasingly over the past few years is seeking partnerships with clients that bring the expression of that business use case, but also data, and the domain context of that data. Sometimes the data is already curated; sometimes we work together to do it. And the teams we have internally tend to be, just as Paul highlighted, made up of domain experts and data scientists. That is a recipe that seems to have been working so far.

Challenges and risks: there is a long list, but I will just highlight two, bias and interpretability. What I mean by bias is that almost daily, as we attend conferences or look at our own projects, we see that what we are doing with machine learning, often supervised learning, is interpolating within the dimensional space that the data represents. That is where the bias is: you have a learned model that is biased to the data you ingested and cannot extrapolate. Interpretability is becoming quite important in our industry. As physicists and engineers we are not ready to accept an inference or a prediction provided by a black box; we would like to understand how it came about. That is very true for our clients and customers as well, especially when the decision is critical: they would like to be at ease with how the algorithm arrived at its prediction. The way we address this is through a great deal of injection of our domain knowledge and physics, and if I had time I would go through the variety of ways we do that. I will highlight just one, transfer learning: we train on synthetic data coming from a particular measurement, for which we have simulations of that measurement within a rock-formation context; we generate data, we train a neural network, and then we do transfer learning on real data. There are a variety of ways to do that, and we have used several.
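Here is a minimal PyTorch sketch of that transfer-learning pattern: pretrain on plentiful simulator output, freeze the feature layers, and fine-tune only the head on scarce "real" measurements. The network shape, the stand-in simulator, and the data sizes are placeholders, not Schlumberger's actual models or physics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def simulate(n):
    """Stand-in measurement simulator: 16-sample tool responses mapped to
    one rock property (e.g., porosity). Cheap, so synthetic data is plentiful."""
    x = torch.randn(n, 16)
    return x, x[:, :4].mean(dim=1, keepdim=True)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1))

def fit(model, x, y, params, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# 1) Pretrain on abundant synthetic data from the measurement simulator.
xs, ys = simulate(5000)
fit(model, xs, ys, model.parameters())

# 2) Transfer: freeze the learned feature layers, retrain only the head
#    on a handful of (expensive) real, labeled measurements.
for p in model[:4].parameters():
    p.requires_grad = False
x_real, y_real = simulate(40)              # pretend these are field data
x_real += 0.1 * torch.randn_like(x_real)   # real data differ from synthetic
print("fine-tune loss:", fit(model, x_real, y_real,
                             model[4].parameters(), epochs=100))
```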
I also want to highlight that there are challenges in the deployment of machine learning products or services, and there we are just scratching the surface. You have a model, it starts producing inferences and giving you results, and you put it in a production line. If it is being used by experts, the experts are seeing and correcting its output; you would like to take advantage of those corrections and feed them back to retrain the model. How do you do that without suddenly discovering that your system is not capable of it? The other thing to keep in mind is robustness, in particular to adversarial effects.

Two examples, very quickly, of successful machine learning. The first is supervised learning on a cube of seismic data, where the problem is to pick the boundaries of salt-body objects. This has helped us reduce the time it takes to do the salt picks from 80 days to literally a few hours, after supervised training on particular slices of the cube that people had picked, then applying the model to the rest. It does take only a few hours, but still about two weeks to QC and sometimes adjust things. The second example is generative modeling, to produce 3D stratigraphy, where we use generative adversarial networks, which have seen quite a bit of success. What I am showing here is from a paper by NVIDIA that generates beautiful-looking faces of people who do not actually exist, simply by feeding the network a large number of images of beautiful-looking but real people. So we figured: instead of teaching the GAN to produce faces, how about stratigraphy from cubes? That is what we ended up doing. This is an example of generating turbidite channel-levee systems in various realizations. One thing we did is make it produce realizations that also honor the data we have; in this case we have 10 wells, and the wells tell you where the sand is and where the shale is, which roughly discriminates channel from levee. So you end up producing a number of realizations that still honor the sparse data that you may have.
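For flavor, a bare-bones GAN loop in PyTorch on toy one-dimensional "facies columns", just to show the adversarial setup plus a crude way of honoring a well observation by filtering realizations. The toy data generator and the rejection step are my simplifications; conditioning the generator itself, as in the work described, takes considerably more machinery.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_columns(n):
    """Toy 'stratigraphy': 32-sample vertical facies columns in [-1, 1],
    a smooth background plus one sharp sand-filled 'channel' unit."""
    z = torch.linspace(0.0, 1.0, 32)
    cols = 0.5 * torch.sin(6.28 * z + 6.28 * torch.rand(n, 1))
    tops = torch.randint(0, 24, (n,))
    for i in range(n):
        cols[i, tops[i]:tops[i] + 8] = 1.0
    return cols

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32), nn.Tanh())
D = nn.Sequential(nn.Linear(32, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = real_columns(64)
    fake = G(torch.randn(64, 8))
    # Discriminator step: push real toward 1, generated toward 0.
    loss_d = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # Generator step: try to make D call the fakes real.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# Crude conditioning: keep only realizations whose value at a "well"
# position matches the observed sand (> 0.5); plain rejection sampling,
# not the conditioned generator the talk describes.
samples = G(torch.randn(500, 8)).detach()
honoring = samples[samples[:, 10] > 0.5]
print(f"{honoring.shape[0]} of 500 realizations honor the well observation")
```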
Okay, let me conclude with a few remarks. The first is on the context the industry is in: it is really focused on bringing performance and efficiency through the adoption of enabling digital technologies, and not only is it looking at leveraging the subsurface data, it is also looking at what happens at the surface and trying to bring efficiencies there. On the subsurface data itself, I think significant challenges lie in making it readily available for ML at large scale, and I am not talking about project by project, where people spend and focus effort on one dataset, but something that can be done in production mode. It does require, as I said, a comprehensive approach, more of an ecosystem framework, that allows you to do a number of things and still meet the requirements surrounding you. And I hope you will allow me to say that Schlumberger has done something we hope is tangible and viable, and that could be one solution for the whole industry. Machine learning opportunities and challenges abound, and there are early successes that really encourage us to continue in this direction, with domain knowledge and physics as key enablers of those successes. I would say, though, that our experience so far has led us to believe that what we call AI, or deep learning, is still not mature enough to tackle some of the complex problems we deal with. Complexity here means, on one hand, the complexity of the environment and of the data in which it is expressed, and on the other, the complexity of the interpretation and the reasoning by which we reduce all that to answers. That is what I mean by lack of maturity. As I highlighted, there are partnerships across the industry to accelerate data projects and to learn how to do these things efficiently. Lastly, and I think Paul mentioned he went to a conference where the majority of attendees were below 35, what I have seen, in particular at Schlumberger, is a great deal of enthusiasm being injected into our industry, attracting a young generation of engineers and scientists who are quite at ease with the digital and eager to make an impact on a century-old industry. It is really refreshing to learn about these digital enabling technologies and to work with folks who teach us something. With that, thank you.

Thank you very much, Smaine. And last but not least, we will have Shauna Morrison with the Carnegie Geophysical Laboratory share her perspectives.

Great. It was really encouraging to see how everyone seems to be converging on some of the same problems and opportunities; we did not coordinate that, so it is really encouraging to see it happening naturally. Thank you all for inviting me here. I am really excited to have the opportunity to share some of my work and some of the work of my colleagues, as well as some of my ideas about the needs and opportunities in data-driven discovery. Much of the work you will see today is part of the Keck-sponsored Deep-Time Data Infrastructure project, where we were tasked with better understanding the co-evolution of the geosphere and biosphere from a data-driven perspective, a more holistic approach to answering questions that require multiple domains. That project ended about a year ago, and the next step is the 4D initiative, Deep-time Data Driven Discovery, where we are specifically thinking about the evolution of planetary systems. Essentially we have brought together a large network of over 200 different scientists across earth, space, life, and data science. It has been mentioned many times that it is very important to have domain scientists working with data scientists, and I could not agree more; we are looking to do that, but also to bring in different fields of science so we can answer these bigger questions that we will not be able to answer if we stay stovepiped in our own realms.

Let me take a moment to explain why I am really interested in machine learning and data science. I am a mineralogist and a crystallographer; I spent a lot of my PhD in the lab doing very, very technical things, nothing to do with data science. But over the past 10 years I have seen my field of mineralogy go from a very descriptive field to a much more predictive one. Usually in mineralogy we find minerals in the ground, or make them in the lab, and we characterize them and report them. Over the past decade or so, though, we have started to be able to do things like predict the number of missing mineral species; as I will show you in a bit, we are working on algorithms that can predict the locations of mineral species we did not know were there, which I think is of particular interest to this group; and we can start thinking about questions like whether the diversity and distribution of minerals on Earth's surface is a planetary-scale biosignature. I will not get into that today, but if anyone is interested I would be happy to talk about it. And of course we are relating mineralogy to many other things: tectonics and supercontinents, mineral resources, and we are even parlaying some of these techniques into paleobiology applications.

I will run through some of the data resources quickly, give you a couple of case studies, little vignettes of projects we have been working on, and then talk about some of the needs and opportunities we are facing in data-driven discovery. For data resources, we have the RRUFF project, a huge mineral library and database housed at the University of Arizona containing chemical, spectral, and diffraction data. There are about 5,500 mineral species known today, and this project has characterized about 3,500 of them across about 10,000 samples, so it is a pretty big data resource.
Also as part of the RRUFF project we have the Mineral Evolution Database. We were talking about data curation, and I will come back to it in a minute: this takes a really long time, and what you are seeing here is countless hours of work. It was started by Bob Hazen at the Carnegie Geophysical Lab, and Josh Golden, a graduate student at the University of Arizona, has been helping and really leading the initiative of going to the primary literature, reading the papers, extracting mineral, locality, and age information along with the broader geologic context, and populating the database, all by hand. He has been working on this for, I think we are approaching, 8 years now. So when people say data curation is one of the main issues, I really agree. Currently we have nearly a million mineral-locality pairs, and over 160,000 of those have age data, so it is actually a pretty huge database. We work closely with Jolyon Ralph at mindat.org, which I assume many of you are familiar with: a mineral locality database with a big wealth of information, about 95% of it from the primary literature and about 5% crowdsourced. I really like the flexibility of being able to add to it, but again, about 95% comes from the literature, and it holds over a million mineral-locality pairs. We also heavily rely on Kerstin Lehnert, who has a number of databases through her work with IEDA. EarthChem is a huge wealth of whole-rock geochemical information; I work in planetary applications, and she has also developed MoonDB, and we are now working with NASA to develop the astromaterials database, where we can start bringing together all of the planetary materials. PaleoBioDB is a huge database of paleontological work, much of it done by Shanan Peters, who has a number of other interesting applications, including GeoDeepDive, where he is using machine reading of PDFs to tackle some of the problems Smaine was just mentioning.

As I said, I am a mineralogist and a crystallographer, so I like to think about problems from a mineralogical perspective, and I want to give you some context for the case studies I am about to show. The mineralogical frameworks I like to approach problems with are mineral evolution and mineral ecology, fields pioneered by Bob Hazen at Carnegie. Mineral evolution focuses on the changes in Earth's mineralogy through deep time, which come about through the changing combinations of chemical, physical, and sometimes biological processes that are different at every stage of planetary evolution; it gives us a temporal framework for thinking about things. We can go to the Mineral Evolution Database and actually look at how these minerals change through time. Here on the y-axis we have age, and on the x-axis the number of copper mineral-locality pairs, the number of occurrences of these copper minerals, colored according to oxidation state: the reduced copper, 1+, in blue, and the more oxidized copper 2+ mineral phases in green. There are a couple of things to notice. First, let me zoom in: after 2.5 billion years ago, the Great Oxidation Event, when we started to have the rise of oxygen in our atmosphere, we start to see a shift in the ratio of the more oxidized green copper phases to the more reduced blue copper phases.
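The tabulation behind a plot like that is straightforward once the mineral-locality-age pairs have been curated. Here is a small pandas sketch with invented rows and hypothetical column names, splitting copper occurrences at the Great Oxidation Event and computing the oxidized fraction in each era; it is an illustration of the idea, not the database's actual schema.

```python
import pandas as pd

# Hypothetical extract from a mineral-locality-age table; the real Mineral
# Evolution Database has ~1M mineral-locality pairs, ~160k with ages.
df = pd.DataFrame({
    "mineral":   ["cuprite", "chalcocite", "malachite", "azurite", "covellite"],
    "age_ma":    [2900, 2700, 1800, 1600, 2100],
    "cu_charge": [1, 1, 2, 2, 1],          # oxidation state of copper
})

bins = [0, 2500, 4000]                     # split at the Great Oxidation Event
labels = ["post-GOE", "pre-GOE"]           # ages in Ma, so low bin = younger
df["era"] = pd.cut(df["age_ma"], bins=bins, labels=labels)

counts = (df.groupby(["era", "cu_charge"], observed=True)
            .size().unstack(fill_value=0))
counts["oxidized_fraction"] = counts[2] / (counts[1] + counts[2])
print(counts)
```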
We are always going to have reducing conditions, so we are always going to be creating reduced copper phases, but we start to see an increase in the ratio of oxidized material after the Great Oxidation Event. So this can be thought of as a proxy for the oxidation of our atmosphere and, among other things, for the bioavailability of certain elements to the environment. A couple of other things to notice here. The first is that we see pulses of mineralization associated with supercontinent assembly, a little pulse with each of them. You might notice that Rodinia's is a little bit shorter, not quite on trend with the others, and that is true not only in copper but in most of the mineral species you look at; it is also true in the geochemical data. My colleagues Chao Liu and Simone Runyon have been working on characterizing exactly why that is. That is where these big data resources become really powerful: for recognizing issues and then digging in and trying to figure out what they are. We had no idea when we first looked at this that this would be the case. If anyone is interested, I actually have these skyline diagrams for all the elements.

Now let us talk a bit about mineral ecology. Mineral ecology takes more of a spatial stance: here we are looking at the diversity and distribution of minerals on Earth's surface today. It follows an LNRE, Large Number of Rare Events, distribution, in which most minerals are rare, occurring at only one or two, and definitely fewer than five, localities on Earth's surface, and a few are very common, things like quartz and feldspar. What we can do with this trend is model it: we model the frequency distribution and generate an accumulation curve, which is what you are seeing here on the left, and with the accumulation curve we can predict the number of missing mineral species. I am glossing over this quickly, but if you are interested in the math we can go into it later. In 2016 we predicted that there are 145 missing carbon minerals, and this sparked the Carbon Mineral Challenge, which has been led by Dan Hummer at Southern Illinois University; as of last week, 30 new carbon minerals have been found, a number of which were predicted in that paper. So we can really start making some predictions here, which is pretty exciting.
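The LNRE modeling behind that accumulation curve involves more than a few lines, but the flavor of "estimate the species you have not seen from how many you have seen rarely" can be shown with the classic Chao1 richness estimator. This is a stand-in I chose, not the method of the 2016 paper, and the rarity counts are invented.

```python
from collections import Counter

def chao1(locality_counts):
    """Chao1 lower-bound richness estimate from per-species occurrence counts:
    S_est = S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of
    species seen at exactly one and exactly two localities."""
    f = Counter(locality_counts)
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if f2 == 0:
        return float("inf")
    return len(locality_counts) + (f1 * f1) / (2 * f2)

# Invented rarity profile: most species at 1-2 localities, a few very common.
counts = [1] * 120 + [2] * 40 + [3] * 15 + [10] * 5 + [500] * 2
est = chao1(counts)
print(f"observed {len(counts)} species, estimated total {est:.0f}, "
      f"missing ~{est - len(counts):.0f}")
```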
So I am really interested in mineralogical frameworks because, as I said, they provide a spatial and temporal context for asking questions in Earth and planetary science: questions like, what did Earth's earliest environments look like? How can we deconvolve and characterize complex geo- and biosignatures? Are there biosignatures in the diversity and distribution of minerals, as I mentioned earlier? And can we predict the locations of mineral deposits and environments on Earth's surface? Here is where we really started partnering with data scientists: we worked with Peter Fox's group and Marshall Ma's group at Rensselaer Polytechnic in New York; some of those faces are familiar to some of you in this room.

The first vignette I would like to give you is our foray into network analysis. Just to orient you: our networks are mineral networks; each node represents a mineral species, and each link between nodes represents a co-occurrence, meaning they occur at the same locality on Earth's surface. These come together to form a network of mineral co-occurrence, which could be built for an element, a deposit type, a planet, things like that. The first example is copper minerals; we are going to stick in the vein of copper for a bit. Each node represents a distinct copper mineral, colored according to composition, specifically the presence or absence of sulfur and oxygen, and sized according to frequency of occurrence, which you can loosely think of as abundance, although it is not exactly that. Minerals are linked if they occur together, and the length of the link is inversely proportional to their frequency of co-occurrence: if two things are close together, they occur together more often; if they are farther apart, they occur together less often. A couple of things probably pop out at you right away. Without coding anything about chemistry into this network, the only thing coded in is co-occurrence, we see strong chemical segregation: sulfides tend to stick with sulfides, oxides and carbonates tend to stick together, and sulfates stick together as well. What this means is that we actually have a chemical trend line going through the topology of this network, I think it is oxygen fugacity, and you can also see sulfur fugacity, and that was somewhat unexpected when we started delving into this. It turns out we can find trend lines like these throughout different types of networks, which is really exciting for us.

Here we are looking at a carbon mineral bipartite network, which is slightly different: we have two types of nodes. The colored nodes represent carbon minerals, colored according to age of first occurrence and sized according to frequency of occurrence, similar to abundance, and the black nodes are their localities. So we are seeing a snapshot of the relationship between carbon minerals and the localities they form at on Earth. There is a lot to say here, and this may be an image of a planetary-scale biosignature, ask me about that later, but what I would really like to point out is that there is an embedded timeline: red is the oldest age, moving into blue as the youngest. If you start in this sort of vase here at the bottom, the most common minerals here in the center happen to be the oldest, and we move up into the orange, going younger, into the yellow, and out into the blue, radiating outward. So we essentially have an embedded timeline. Mineral networks are not the only place we have found embedded timelines; we have also found them in paleobiology networks. Here we are looking at fauna through time, with Cambrian fauna in blue, Paleozoic in red, and modern fauna in black, and as you would expect with a biologically evolving system, there is a timeline from ancient into modern. But what we noticed is that there are pinch points, and each of these pinch points is associated with a mass extinction event. It is really interesting that we could see this in the network topology: you want to see the things you already know in your visualizations and your algorithms before you can trust that what you are getting out of them that you did not know is valid. So we were really excited to see this.
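Building such a co-occurrence network from a locality-to-minerals table is simple to sketch. Here is a toy version with networkx, where edge weights count shared localities and a weighted spring layout pulls frequent co-occurrers closer together, loosely echoing the "closer means more co-occurrence" layout described; the locality table is invented.

```python
import itertools
import networkx as nx

# Toy locality -> minerals table (invented); the MED or mindat.org would
# supply the real one.
localities = {
    "loc1": ["chalcopyrite", "bornite", "pyrite"],
    "loc2": ["chalcopyrite", "bornite", "covellite"],
    "loc3": ["malachite", "azurite", "cuprite"],
    "loc4": ["malachite", "azurite", "chalcopyrite"],
}

G = nx.Graph()
for minerals in localities.values():
    for a, b in itertools.combinations(sorted(minerals), 2):
        # Edge weight counts co-occurrences; a weighted spring layout
        # pulls frequently co-occurring minerals closer together.
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# Node "size" proxy: the number of localities each mineral occurs at.
occurrence = {m: sum(m in v for v in localities.values()) for m in G.nodes}
pos = nx.spring_layout(G, weight="weight", seed=1)

print(sorted(G.edges(data="weight"), key=lambda e: -e[2])[:3])
print(occurrence)
```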
Then we went to a different network, this is my colleague Drew Muscente's work, the trilobite network, and we see an embedded timeline here going from the oldest in red into the youngest in blue. But what we found is a pinch point here that did not correspond to a previously known mass extinction event. My colleague was really puzzled, so he went to the literature and really dug in, and found that a number of years ago someone had hypothesized that perhaps there was a mass faunal turnover, but had not dug much into it. So he was able to dig into it, and indeed determined that there was a mass faunal turnover event at this time. It was really encouraging to see that we can actually discover things using these networks. We are also exploring different ways of looking at these networks; here we are using a virtual reality network, this was at the last AGU, where you can actually explore the network with it around you. It is a pretty cool experience. I was a bit skeptical at first and thought it was kind of a gimmicky thing, but once I played with it I think there might actually be some uses, so we are still developing that.

Just a couple of other projects to mention. The first is a fantastic piece of software: since we are talking about subsurface data, we are really interested in understanding the drivers of mineralization through deep time, so we have teamed up with the University of Sydney's EarthByte team. They have a number of amazing pieces of software, and if you are not familiar with them please go to their website, including open-source paleo-tectonic reconstruction software, so you can look at geodynamic models and paleo-tectonic models and overlay your own features on top of them. In this case I am showing copper and uranium mineralization through time, but this could be really any parameter affected by tectonics or geodynamics. If you are interested in mineralization, as some of you in this room are, this is a great thing, and if you are interested in oil and gas I am sure there are applications here as well; this is a new partnership that we hope to push forward.

I would also like to mention affinity analysis. I am particularly excited about this: we are aiming to predict the location of a mineral, a deposit, or a geologic environment, and some of the speakers today have already touched on models like this. Affinity analysis is a recommender system that characterizes multi-dimensional co-occurrence relationships and creates probabilistic models for future or currently unknown occurrences. This is exactly what Amazon does when it gives you recommendations for what you might like to purchase based on what you have already purchased: it has a huge user database on which it can train its model, and then it makes predictions from your input data. We have mineralogical co-occurrence data, which is what we are initially basing our model on, but just as Amazon can predict you better if it knows your demographics, we can also start to include geologic information in our models. Right now we are just starting out with mineral co-occurrence, and what we hope to be able to do, and mathematically this is a big problem, is predict the location of an individual mineral species, a mineral assemblage, or a set of characteristics that typifies a deposit.
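Before the pairwise example that follows, here is a sketch of what such an affinity score can look like at its simplest: a tiny invented occurrence table and a plain lift statistic, my choices rather than the actual implementation, ranking localities that lack a target mineral by how strongly their observed minerals co-occur with it elsewhere.

```python
from itertools import combinations
from collections import Counter

# Invented occurrence data: locality -> set of observed minerals.
occ = {
    "siteA": {"galena", "sphalerite", "fluorite", "wulfenite"},
    "siteB": {"galena", "sphalerite", "fluorite"},
    "siteC": {"galena", "quartz"},
    "siteD": {"quartz", "calcite"},
}

n_loc = len(occ)
single = Counter(m for s in occ.values() for m in s)
pair = Counter(frozenset(p) for s in occ.values()
               for p in combinations(sorted(s), 2))

def lift(a, b):
    """How much more often a and b co-occur than if they were independent."""
    p_ab = pair[frozenset((a, b))] / n_loc
    return p_ab / ((single[a] / n_loc) * (single[b] / n_loc)) if p_ab else 0.0

def score(target, minerals):
    """Average lift between the target mineral and a locality's minerals."""
    others = [m for m in minerals if m != target]
    return sum(lift(target, m) for m in others) / len(others)

# Rank localities that lack the target as candidates for finding it.
target = "wulfenite"
candidates = {loc: score(target, ms) for loc, ms in occ.items()
              if target not in ms}
print(max(candidates, key=candidates.get), candidates)
```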
And this does not have to be "there's gold in the ground"; it could be vegetation on the surface that happens to signify a certain kind of geochemistry, geologic settings, or planetary environments. We have not gotten this fully up and running yet; manpower is a problem, people are trying to finish PhDs and such, so getting it done is a bit of a struggle. But we have actually implemented a pairwise version of this recommender system on Mindat: instead of looking at all the correlations across the board, as full affinity analysis would, we just looked at pairwise correlations and tried to predict where we could find a currently unknown occurrence of a mineral species. When we did this, we were able to predict that there should be a new locality of wulfenite at the Cookes Peak district in New Mexico, and it turns out there was: a collector actually went out there for us, found it, and confirmed it. So even with a very pared-down, basic statistical method we were able to make a successful prediction, and I think once we scale up to the full version we will be able to do a lot. I am really excited about that.

Now I would like to talk briefly about some of the hurdles in data-driven discovery. The first is creating interoperable, FAIR data resources. This is a huge, huge problem; a lot of data is dark, and that is definitely true in geology and also in planetary science. There is also a general issue of lack of funding. This is not true in industry, where you can fund what you think is important, but it is very true in academia: we have EarthCube, which generated a big framework for housing data, but it is very hard to get funding to actually generate these data resources. And there is a general lack of incentive to make your data open: we do not get tenure, or funding, or really anything, for making our data open and available, yet it can take a significant amount of effort. So I think we need to start thinking about ways to incentivize that, things like data journals and data DOIs, and in the case of industry, showing companies the benefits of making their data open and sharing it amongst each other; Australia is a great example, where the mining data is much more open. The problems are different in industry and academia, but it is fundamentally the same issue. Then there are small, sparse datasets. This is really true in any sample-based science, and using machine learning on small, sparse datasets can be really tough. And there is developing user-friendly machine learning interfaces: a lot of us did not come up learning how to code, and the learning curve for coding, I will tell you, having learned only in graduate school, is very steep; and our needs are often very specific and very custom, so it is hard to develop platforms that are universal.

But now let us talk about the opportunities, which I actually think are exactly the same things I just listed as hurdles. We have an incredible opportunity to create these interoperable data resources, and a lot of organizations are taking advantage of it. The Deep-time Digital Earth initiative, mainly coming out of China with China's national science foundation, is aiming to essentially make a Google of geology; I mean, they really want to bring together everything. And this does not mean going to the data manually and putting it all together; it means linking existing databases, hooking up with the organizations and researchers that already have these databases, and finding ways to bring them together.
I think this also creates an amazing opportunity for cross-disciplinary, cross-organizational collaboration that would not have happened otherwise. We also have the opportunity to develop new natural language processing and machine reading applications. I mentioned Shanan Peters at the University of Wisconsin, who has been working heavily on this; GeoDeepDive is his system, and definitely let me know if you would like a link to it. He has millions and millions of PDFs that he can run his natural language processing and parsing scripts over to extract information, so people are making some headway here, and I think it is an amazing opportunity. There is still going to be some manual curation, especially at this stage; NLP is not quite there yet, and machine reading is not quite there yet. But I think that is actually an opportunity to train students: they learn when they read papers, and in developing the Mineral Evolution Database I have seen how many students have essentially become geologists through the process of reading the literature and extracting that information. For small, sparse data, we have the opportunity to modify existing machine learning algorithms and to develop new statistics and new math for understanding how to get what we want out of these small datasets. Once we do that, it may give us access to problems we would not have had access to otherwise, because we could not answer them with our traditional methods. It can also limit cost: if we are thinking about going to an asteroid, or drilling in deepwater Gulf of Mexico, that is costly, and if we can learn to answer our questions with smaller datasets in a robust way, we can certainly save a lot of money. And we have the opportunity to develop more user-friendly interfaces, and in doing so we can help train that next generation to become more literate in coding, and give upcoming students and the upcoming workforce a really strong foundation in statistics and math, which I think is really important. So, in conclusion, I think we have a great opportunity to bring together our data resources, to leverage all the work that has been done by data scientists, and to partner with data scientists to begin answering some of these bigger, tougher questions that we cannot answer on our own. With that, I would like to thank you all for listening, and thank the sponsors and my many collaborators.

Thanks very much, Shauna. Dan, if we could, let's make sure we provide a little bit of time for some discussion here, so we'll delay lunch until 15 past, okay?
Sounds great. We have certainly covered a lot of ground and gotten some great perspectives, so let's open it up for Q&A, and if you have a question, please use the microphone. Be brave and ask the first question.

This is for anyone who wants to take it, though Shauna's team may be the most appropriate: what are the national security implications of this, particularly when you look at parts of the supply chain for certain minerals related to batteries, and Chinese efforts to monopolize lithium, graphite, and cobalt? What is the flow of information between the global community and China, and are they ahead in certain areas, like those minerals?

Yes, this is a big issue, and a big topic of conversation as we move forward and create partnerships in other countries that have these kinds of strategic interests. A lot of these data resources are open and accessible; we have our scientific purposes, but they can be used for other purposes. So certainly when you are partnering, when you have data that is not open and is sensitive, marked as sensitive by the federal government or by an agency, the USGS or what have you, you have to be really careful about the access that is given. Are they ahead of us? China is ahead of us in certain aspects of curating data; they are certainly investing in it significantly more than, say, the US or Europe, and their national science foundation is squarely behind it. They have this huge initiative, they want to bring everything together, and it really is first for scientific exploration, but certainly when you bring together these large data resources there can be other implications. So I think you have to tread carefully and be thoughtful moving forward. It is a hard question, so thanks.

Thanks to all the speakers, great material, with a lot of overlap, a lot of independently converging points. On the role of data curation, there are a couple of things I would be interested in people's thoughts on. One: okay, it is really important, and that is good, but some of the examples we hear, here and in other places, are graduate students doing this. So part of it is to elevate data curation from a task that has to get done before the data science begins to something someone can be expert in and very good at; it is not a job that is ever solved, it is part of the pipeline. The other is interoperability: the 800-pound-gorilla quote, put your best people on this, I thought was great, and the idea of tidy data, of making data easy to work with. It is not going to be one database to rule them all, but an idea of how you live with data from different places, some of it big and some small, some good and some bad. So I like the idea of DOIs for data, and DOIs on the literature about what you did with your data; the stuff we bury in supplemental information in publications now is often critical for reproducing this. I would like some more thoughts on that. I am all in, it is really important, but I would like to see where you see this being elevated, or where it should be.

I'll take a first shot at that. We are in a stage of transition; I think everybody in the room realizes that, in terms of how data are kept, how they are presented in publications, and their accessibility to both the public and private sectors.
There are growing pains here, and that is one of the points we are discussing. But private companies are taking this on and making money at it. I can think of many examples; we have a couple in Santa Fe, in fact, near Los Alamos. Companies are doing exactly that: their whole livelihood depends on downloading certain kinds of data, getting it into a consistent format, and then selling it. That is probably a really good solution, because there is a motivation there, a money-making motivation, and without companies like that I think it is not going to happen, or not nearly as quickly. So as long as there is money in it, that is great. But if you are a company, that is one thing, you might be willing to purchase the data; if you are a basic research scientist, that is another issue: can you afford it, and if you cannot, what do you do? That is another side of the problem, and I do not think there is a simple solution at present. Some of these companies are working with scientists and providing data and platforms for free at the moment, which is really good, but their end game, of course, is to get you hooked as a user on their platform, familiar with their data, and then wanting to buy it. That is just savvy business, but if you are working on basic research funds it may present a challenge in the future. In short, though, I think private industry is taking up a lot of this.

I could add something to that as well. Clearly, as you said, Paul, we are in a time of transition, but in our work with mining companies, which typically store their data in Excel spreadsheets or on paper, the act of accepting a solution based on machine learning, and doing the data integration properly, drives them to change that, because they have to take, say, the blasting records or the support installation records out of the filing cabinets and get them digitized. That is, as people say, 80 to 90% of the job in any of these exercises. But once they see the value of that and get on that path, the machine learning really drives that change. It is slow, for sure.

Just a few things. I fully agree there is no clear-cut solution to this; it is evolving. What I would highlight is that within Schlumberger, what we hope is to leverage what any particular individual does for the benefit of the rest, against payment if it is a user from within the industry, and that requires doing it within a particular environment, what we call an ecosystem, that is not separate from what you do to access the data, collaborate with others, and bring your own AI or your own algorithms into it, and that is secure. With the advent of what we are proposing internally, and also outside, in terms of a data ecosystem, these questions can be pondered and practiced, and perhaps an effective solution will rise up, at least in the domains of data that we deal with.

Yes, this is kind of an oil and gas question. We have had decades of attempts to standardize data formats for subsurface data, but we are still in a situation in 2019 where most of our subsurface data is native to particular applications, often imprisoned in those applications, not searchable outside them, and often in proprietary formats. It has been interesting in the last
couple of years to see the DELFI initiative take hold at Schlumberger, and of course there are also non-commercial efforts with some of the same goals, such as OSDU. What do you think is different this time, and what do you think our chances of success at addressing this industry problem are?

A few elements of an answer here. I have not really spent time absorbing the full extent of DELFI's capabilities, but one thing I would say is that the comment you made, and the problem you pose, have been heard internally, and the point about DELFI being open really speaks to that. The hope is that the openness of the format we are suggesting, recommending to the industry, will lead to its adoption by all, so there is a standardization in the way data, or data creation, is approached. What I mean by openness here is that the foundational code relating to the data format, and to how data gets ingested, is made open source and visible to all interested parties who would like to use such an environment and bring their own data onto that system in a secure manner, so that only they, and those they authorize, have access to it. I hope that provides some answer on the openness of such an environment.

Something I did not hear come out much in the talks, though it underpins them: there is an emphasis on open data and FAIR data, and we have seen huge strides even in the last few years in that realm, where at least in the academic and government research worlds people are starting to understand the power of publishing their data. I agree with Shauna that it is a huge effort, but there is a middle ground that does not often get discussed, and I think there is a huge opportunity for this community to help bridge it: anonymized data. You saw in some of the graphics that were up there that when you have a massive dataset you are not interrogating a single point anymore; you are looking at the whole to inform something specific you are going after. Finding a way to get folks who are concerned about their data for legitimate reasons, proprietary, poorly structured, whatever, to share it in an anonymized way with end users who have very specific needs: there is not a great resource for that bridge yet. I am wondering what the committee thinks about the opportunity there, and about possible ways to make that bridge happen more quickly, because I think it could accelerate people's comfort with data going open, and get more and more data out there and into the machine learning and AI outcomes that could happen as a result. On the anonymization front, what do we need to do to help build trust, and what resources are needed to make it happen?

It seems like one for industry. We are not doing research ourselves on how best to analyze or address the privacy issue, but we are following closely what is being done in academia. The immediate solution is really the provision of an environment that is secure and private, so that only the users who are authorized to access the data have that access, with guaranteed confidentiality. But again, I know of a multitude of research efforts in academia that try to address this: how to pass data through a black box that would
anonymize it and make it usable by a larger community without revealing any attributes that relate to privacy. We are following that very closely, and to the extent it becomes viable from a business point of view, we will adopt it as well.

Just a comment about the 800-pound gorilla in the room, the data quality, data integrity, and data accessibility, and "put your best people on it": I am not sure I agree, because it seems to me it is not so much put your best people on it as put the right people on it. Your best people may not have the mindset to really address those big data issues. So it is more a matter of finding the right people, empowering them, and giving them the resources to address those issues. And having said that, it seems to me that within academia and within government research, the question then is how you reward those people appropriately, because the system, at least as I am acquainted with it, is not geared toward elevating those people and appropriately rewarding them.

I think this really comes back to what you just mentioned: we need to make a space and an incentive for these people. There are people who enjoy doing this, people for whom this is really what they want to do on a daily basis, and that is not me. Not everyone who is a scientist or researcher is going to be interested in curating data; we need to stop forcing people like me, who do not want to curate a database, to curate a database, and find the people who want to do it. We need to make that a job for them, not just push it off on graduate students and undergraduates: make the job, pay them well for it, and really incentivize the position. That has to come from a higher level: within companies it needs to come down from the top, and within academic research it needs to come from NSF, from NASA, from high up, giving us the funding and the space to make it happen, because at this point that structure does not exist. But I completely agree.

Just to follow on, I really agree with what you said, Shauna, and funding agencies will have to allow us, at least in basic and applied research funds, to support people to do exactly this. It has to be written into grants, and it has to be understood by funding agencies that this is part of what you are paying for. If you have that, you can hire people to do it; you do not have to do it yourself, or have graduate students do it. It is just a question of convincing our funding agencies, and I think some of them are in the room and they get this, so it is a question of time until you and I no longer have to work on those problems. And the other point you made, about getting some sort of reward other than your paycheck, in other words published DOIs, so that there are references to your data and people searching Google Scholar find your name and see citations because of the data you curated: I do agree with that.

I agree with you that you ought to put the right people on it, and by "right" here, you may need domain knowledge as well to curate the data properly. The way I envisage it, and I mentioned that we are teams of scientists trying to leverage subsurface data and develop solutions that help in decision-making: right now we are leaving that curation to a number of digital center teams around the world, but I am
pretty sure that as we start working with this curated data at scale we will find that it was not curated properly, and we will get engaged in it and put our good people on it, rather than only turning them loose on doing machine learning and AI on top of it. Why? Because it is a fundamental problem to get right: if you do not get it right, then your learned model will not be the appropriate one, will not be effective; as the adage goes, if you put garbage in, you get garbage out. At some point it has to be recognized as the gorilla, and I do not think it is well recognized; people tend to rush to doing machine learning and AI on attractive problems, on the assumption that the data is there and structured, and I think emphasizing this will help address it better.

Just a quick question that follows on Paul's comments. We are talking about machine learning and the subsurface, but the subsurface also interacts with other systems: atmospheric systems, surface systems, industrial systems, economics. And where I am going with that, thinking about the funding entities and how they are organized, is that it seems way too narrow to think of this simply in the context of funding that might come from, say, NSF funding or DOE funding and such. There is perhaps some room to elevate it a bit and start thinking about broader digital applications that are really cross-cutting, not just in the subsurface; in some cases, for example, connecting digital applications in the subsurface and at the surface with digital twin applications in industrial systems, which is what a lot of the oil and gas industry is doing right now. I am not sure there is a question here, but my suggestion is to think big rather than think little.

A question from the perspective of an undergraduate: Shauna mentioned that a lot of people in the geosciences do not learn how to code until they are graduate students. What types of skills do you think are really necessary for undergraduates in the geosciences to be learning, and what kinds of learning opportunities should universities be creating for these students? Is it just learning how to code as an undergrad, or taking classes in algorithmic thinking? What do you think is really necessary for us going forward?

So, certainly not just learning to code, but learning to think about data science problems: really applied classes, classes where you are actually thinking about problems and are given a strong underpinning and foundation in data-science thinking and in statistics, and of course in coding, which is an integral part of the toolkit. I think when we do this, what we will end up with is what you were talking about: people who know how to do data science, who know how to do data curation, who understand why those things are important, but who also know how to use a microscope, how to look at a thin section, how to think about stratigraphy, things like that. So at universities we have the opportunity to train exactly what we need for this data curation problem, and people, because if we
just give these problems to data scientists who don't understand anything about our data, it's hard for them to get out what we need, and so we need people that bridge that gap. So I think, yeah: general data science courses, foundations in statistics, understanding what happens when you're dealing with big numbers, and how to be robust when the numbers become small, things like that.

I just have one quick comment. I'm really impressed with the things I've seen here in the subsurface; I've worked in the subsurface for many years. One of the big secrets, and all companies do this with new people, is they bring them in and throw them in Utah someplace in a near-shore setting, and let them see the near-shore environment, the facies, all the things that are in the near-shore, because that's where your porosity is: you get sandstones and beaches and all these kinds of things, point bars and fluvial systems. Then as the years go by you build on that information; when you're working in oil and gas you learn to build on it and understand where the oil and gas is in each given setting. In the conventional setting it's in the near-shore and the point bar systems and things like that. Now as we get more into unconventional things, and I've gone around the world and looked at unconventional resources, what I see in the unconventional world is just what you showed a little strat model for at the start, and that is turbidites; that is what operates in unconventional resources. You can go to the Permian Basin, and you can see material coming back to New Mexico, coming off into the Delaware Basin, and it's turbidite. You can drill a well over here and a well over here on this side, and you can see similarities at exactly that moment in time. I mean, these are real turbidites, and you can pick log character and correlate them for miles and miles. So if you understand turbidites, if you get a turbidite model for the Delaware Basin, the whole system, you're onto something. The thing here is you have to learn the basics first as a young person; you've got to learn how to crawl first, you've got to get this information in your head. If somebody wakes you up in the middle of the night and tells you to talk about point bar deposition, then you need to talk about point bars; you need to be able to calculate point bar geometries and sizes and the equations for that, all kinds of things. But the main thing is this: when I go to the South China Sea, I understand turbidite deposition. I know about those highs off to the southwest, and all that sediment came off into the South China Sea, so I know where to focus, and detailed turbidite models like the turbidite lobes and channels you showed, that's what you want to focus on. After you get a little skin on your bones, get some skin on your bones, get the different models, take this kind of information and stretch it just as far as you can, and that's where you want to go.

Thanks, John. We have one question here, then we're going to take one online, and then I think break for lunch.

Just a quick comment from about four questions ago: there are archivists. It is a whole job, curating large collections; not librarians, but archivists. So that's something that can be twisted into a useful direction. What I'm curious about is whether, through your program managers or your colleagues in the domain sciences or your engineers or even management, you have encountered any sort of barriers, impediments,
or skepticism with respect to accepting the results of the techniques that you're applying. We're all here because we're enthusiastic about what can be done; what sort of resistance are you facing, and are you overcoming it?

So as I mentioned in my presentation, for the work we do in the domains we work in, in mining, the geologists and the engineers need to understand how the data flowed through the process to the prediction. So even if you can get better prediction statistics out of a deep neural network, the technique is opaque, and it's not going to be acceptable, at least in our industry; we need transparency. They want to see each of the input data, what levels they were at, and they want to apply their own domain understanding so they can make a decision on how to act on that prediction.

Yeah, I think in academia, in geology specifically, we have a lot of pushback and resistance, and I think part of it is because we're not taught this when we go through undergrad or even grad school. So it's very unfamiliar, and people are justifiably skeptical because they don't understand these algorithms, they don't understand what you're actually doing; so I think there's an appropriate level of skepticism. It's interesting, because I work across domains: you don't have that skepticism in astronomy, because they understand how to use machine learning, they have to by the nature of their data. This is also true in biology; you don't have the skepticism in biology, although when we show them our machine learning they kind of pat us on the head and tell us that's cute. So it's very domain specific, and in part it's because in geology, with the exception of seismology and geophysics where you have a lot of data, we generally have a very small amount of data, and so our data hasn't necessitated us moving forward in that way. So there's a healthy level of skepticism that I encounter in research.

My experience has been that there was a great deal of skepticism several years ago, but that's receding now. In terms of resistance, it's always with us, and in my opinion it's healthy to resist jumping to conclusions; you'd like to see whether there are other ways of evaluating a result, of gaining confidence in what it is. So that's been our experience.

Alright, and a question online. Elizabeth? There's a question: Eric, could you put up the graph; it's actually for John, with regard to the graph of your model. The question is, I was wondering if you could talk a bit more about the process used to generate the model in this particular graph.

Sure. So this model is created using a class of geological modeling techniques called implicit modeling, and that's been a kind of revolution in the last decade or so in the geological modeling applications that are available, in both oil and gas and mining. The difference between explicit, the older way, and implicit, the newer way, is that in explicit modeling you're manually stitching together, usually with a mouse, digitizing control lines and things like that, to build the wireframes that are the fault or contact surfaces. In implicit modeling you are generating a field throughout the 3D volume, shown in color on that horizontal slice, and isosurfaces through that field are the derived contacts and fault surfaces. The derivation of that function, the 3D field from which you're extracting the isosurfaces, is done from the point geological information. So what you're seeing in that picture are structural markers, strikes and dips of structural observations, which can be off-contact as well as on-contact, and the markers, the well markers or drill point intersections on the contacts. That's how the function is derived in 3D, and then the surfaces are extracted. So that's what implicit modeling is. Thank you.
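To make the implicit-modeling idea concrete, here is a minimal sketch, not the speaker's actual software: it assumes you have point observations carrying a signed "above/below contact" value (zero on the contact itself), fits a smooth scalar field through the 3D volume with scipy's RBFInterpolator, and extracts the zero isosurface with scikit-image's marching cubes. All coordinates and values below are invented for illustration.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from skimage import measure

rng = np.random.default_rng(0)

# Illustrative point data: x, y, z of observations, each tagged with a signed
# distance to a synthetic undulating contact (0 would mean "on the contact").
pts = rng.uniform(0, 100, size=(60, 3))
vals = pts[:, 2] - (50 + 5 * np.sin(pts[:, 0] / 15))

# Fit a smooth scalar field through the 3D volume from the point data.
field = RBFInterpolator(pts, vals, kernel="thin_plate_spline")

# Evaluate the field on a regular grid covering the volume.
n = 40
axis = np.linspace(0, 100, n)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
vol = field(grid).reshape(n, n, n)

# The derived contact surface is the zero isosurface of the field.
verts, faces, _, _ = measure.marching_cubes(vol, level=0.0)
print(f"extracted contact surface: {len(verts)} vertices, {len(faces)} triangles")
```

In a real workflow the signed values would come from well markers and structural observations rather than a synthetic function, and the interpolant would honor orientation data as well; the point here is only the field-then-isosurface structure of implicit modeling.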
Alright, let's have another round of applause for our keynote panelists, and I will hand it back over to Elizabeth.

Thanks again to the wonderful set of speakers this morning; that was really, really good, and we really appreciate your input and all the questions too. We have another chunk of time in the afternoon to do the same with another great set of panelists. In between now and then, I've heard some tummy rumbles, so it's a good time to break for lunch. Everyone is invited to join us; we have a rather large group here, which is fabulous, and we have space for everyone to eat at a table. We encourage networking and interactions, so I would encourage you, if you'd like, to use other parts of the building, or, if you dare, to step outside; there are some tables on the patio in the front of the building, and it is warm out there today, for those of you who aren't from the DC area. Take the time to interact with one another over your lunch. I'm making a creative decision here, because the networking and conversation is really an important part of what we do, and we're a large group; we can't get you all through the line in a reasonable amount of time otherwise. And to those of you online as well: we're going to start at 1:15 sharp, so if you're not here in the room you will miss the opening talks. We'll abbreviate the introductions, get to the speakers right away, and talk with the panelists afterwards. So please help yourselves to lunch, interact and engage, and thanks again for the great enthusiasm for our talks today.

This morning we heard about machine learning for subsurface data from somewhat different sectors, the oil and gas business, et cetera. This afternoon we would like to pick up where they left off, basically the same sort of thing, but looking at realms of research and research-to-application approaches. Basically what we're going to do is ask: where are we now, what does the GPS say; where do we want to go with this; and what do we want to do with the research? That's the same thing we've been talking about all along. I had planned to do a little bit more up front, but in the interest of time I think we'll let these people tell their stories. The first speakers will be Grant Bromenthal and Kelly Rose, who are going to work together on this if I'm not mistaken. Grant will discuss some of his approaches with machine learning and how they capture value from those approaches; he will possibly talk about some neural network things and how they use those, but that's what we've been talking about all morning, so I'm just going to leave it alone and let Grant and Kelly take it.

Alright, thank you. Thank you, John, and thanks to the academies again for inviting us to come here and talk; it's always a great opportunity to be here, and something that you don't forget. So I'm going to talk a little bit about how, in FE and in NETL, we've been thinking about using machine learning, and I put the title of this as Science-Informed Machine Learning, because that's a particular emphasis of what we're trying to do. Machine learning we think of as a data-driven approach, and that's a lot of the focus, but there are also ways, and you heard a little bit about this, of filling in some of the gaps in smaller, not-so-big data sets, and this is one
of the approaches we're talking about. This is the standard US government disclaimer, which basically says that if I say anything incorrect or stupid, they disavow all knowledge of me. Okay, so on to it.

This is just an example of how machine learning has had an impact on another sector of the economy, in driving safety. Our traditional systems are passive systems: you have videos that show how to do driving safety, you have some of the testing. That has morphed, through the use of machine learning and computers, into virtual learning, where you can now learn a lot of how to drive without having to step into a car; for some of the driving tests you take now, you don't have to go into a car at all, you can do that in a computer-simulated environment. Real-time data: probably most people now have something like this on their car, where it tells you, if you back up or if you're driving along, watch out, you're crossing the line, or someone else is, so beware. And of course there's a lot of discussion of eventually getting to autonomous control, which personally I would love, to be able to read while driving a car, or rather not driving the car, but we're not quite there yet; that's where people are trying to go.

Thinking along the same lines in the subsurface, what's the promise of this? We have the traditional approaches where we gather data from logs, well and seismic data, et cetera; we bring those together into a traditional reservoir simulation model, and that may take months or longer to pull together. We heard examples this morning of twelve months to do some of that, which, when you apply some of these machine learning approaches, can come down to two months, and we're suggesting there's promise to move that along even faster. There are virtual learning environments where you can bring this geologic data together and look at it in a careful way from multiple different sources. Here's an example of a drilling rig where they're actually using a robot arm to control the drill rig; this is one example of where it's being applied in the drilling arena, but you can also see data collection systems that feed information back in real time from various parts of your well or reservoir system. And this is a control panel from, I think, a Shell site in Malaysia, but the idea is that you can look at the different parts of the field and treat the subsurface as an engineered system. That's kind of anathema to the way we've understood these systems for a while, but the promise is that we can get to the kind of place where we can take real-time data and actually make optimization and operational decisions quickly.

So, as Paul mentioned in the morning, we're really at a confluence of data, computational capability, machine learning, and a lot of novel sensors that have been developed and are being applied. The traditional approaches, which are kind of in the gray there, involve collecting data from the field and doing lots of computationally intensive simulation; the idea behind bringing in these empirical models, these data-driven machine learning models, is to improve the speed of this. And with that you get autonomous monitoring systems, you get knowledge and
understanding of these systems much more quickly, you're able to set up virtual learning capabilities, and you're able to get to things like real-time visualization: as you collect the data you're able to understand what's happening in the subsurface, and you can turn that into the ability to have some kind of control of the subsurface quickly. Those are the things we're looking towards in the DOE and in NETL right now.

I mentioned this workshop; a lot of these things came out of it. It was held last summer at Carnegie Mellon, and some of its points have been mentioned already, but there are a few recommendations that came out of it that I thought I'd share with the group. It's really critical, and this may have been mentioned by every speaker so far, so maybe I don't have to hit it too hard: if you're going to look at this area, you need to engage both data scientists and geoscientists. I was talking with the chair of the computer science department at Carnegie Mellon, and he said, you know, we've got a lot of great computer scientists here who can create the next great algorithm that does absolutely nothing valuable if they don't have someone who understands what it's being developed for. So you can't just throw the data at data scientists and expect to get what you want.

I think this was mentioned by someone on the panel this morning, but it's really important to use an outcomes-based approach. When you're applying machine learning and data analytics, it's important to know what the application is, what target you're looking for. That helps you narrow the focus and understanding of the work, and helps you identify what data is important, what data isn't, and where the gaps are. This slide lists a few possible outcomes that are important from a fossil energy perspective.

Access to large data sets is critically important; I'll let Kelly Rose talk a little bit more about that in a few minutes. And this is a really important thing that came out of the discussion at the workshop: in the morning you heard talk about small, sparse data sets; well, sometimes we have large sparse data sets. We have the paradoxical situation where we have lots of data that we can't figure out how to manage and move around, and yet we don't have all the right type or kind of data that we need. So sometimes that data needs to be augmented with physics-based simulation or other similar capabilities, bringing field data, laboratory data, simulation data and synthetic data all together, and you have to have some way to manage all those different types of data so that you can analyze them appropriately with your data analytics. In some cases this is straight data analytics, a straight machine learning algorithm; sometimes it's a combination of physics-based simulation and an empirical model, where you go back and forth, and you improve the speed of one and the accuracy of the other by doing that. And then finally, new data management tools and approaches will be needed; I think we all recognize that one of the big challenges is getting data from various sources.
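One concrete reading of that back-and-forth between physics-based simulation and empirical models is residual learning: run a cheap, approximate physics model, then train a data-driven model on the mismatch between its output and the observations. The sketch below is a minimal, invented illustration, not the workshop's prescription; the "physics model," the data, and all names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

def cheap_physics_model(x):
    # Stand-in for a fast, approximate simulator (e.g., an analytic solution).
    return 2.0 * x[:, 0] + 0.5 * x[:, 1]

# Synthetic "field observations": the physics plus an unmodeled nonlinear effect.
X = rng.uniform(0, 1, size=(500, 2))
y_obs = cheap_physics_model(X) + np.sin(4 * X[:, 0]) * X[:, 1] + rng.normal(0, 0.05, 500)

# Train the empirical model on the residual the physics model misses.
residual = y_obs - cheap_physics_model(X)
ml = GradientBoostingRegressor().fit(X, residual)

# Hybrid prediction: fast physics plus a learned correction.
X_new = rng.uniform(0, 1, size=(5, 2))
y_hybrid = cheap_physics_model(X_new) + ml.predict(X_new)
print(y_hybrid)
```

The design point is the division of labor: the physics model supplies speed and extrapolation behavior, while the empirical model supplies accuracy where the physics is incomplete.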
I don't know when the situation is going to turn, when the reticence of large companies who collect this data will shift so that they look at it less as something they have to hold and maintain as proprietary and more as something they find value in sharing, but I expect that time will come. Regardless, you're going to need these kinds of frameworks that allow you to collect lots of data from a variety of different pieces, and here's just one example that came from a recent Geo's article.

In the Office of Fossil Energy we're trying to invest in this area in a number of different ways. We have our field laboratories, several of them, focused mostly on unconventional resources, but we're also performing laboratory work, looking at the detailed geochemistry and physics within the labs, alongside these data-driven approaches.

I have just a couple of examples of what we've done here. This one is a more purely data-driven approach: something you're able to do is actually predict the production behavior of a well over a long time. This is from a shale, for example the Marcellus. For those of you who don't know, in a conventional formation we can use reservoir simulation or type-curve analysis to predict the long-term production behavior of a reservoir; for shales, trying to do that with conventional simulators has generally been a failure in the business, and trying to do it with type curves has just been really bad. So here are some examples where, with data-driven approaches, where you have ten years of data from wells across the field, combining geologic and engineering properties together in forming these models, you can predict the behavior of a new well, given the information you're collecting there, out to a particular time. What you're seeing here: you see it drops off at this longer time; well, that's because that data hasn't been collected yet. This is the data that's been matched so far, and this is the prediction. One of the questions in shales is how long these wells are going to continue to produce, and these are some of the questions we're trying to help get a better answer on.

And then I just want to talk about something I don't think I've heard anyone else here talk about: the development of proxy models. This is where, instead of using your reservoir simulator, which may take an hour, or several hours, or a couple of days to run a detailed physics simulation of the system, you can train a neural network, in this particular case, to do what that reservoir simulator does, based off of simulation runs, and to produce those same runs in just seconds. That's what we've been able to do here; the focus of this one was CO2 storage, but it shows the concept, and you're able to use it to help probe the uncertainty of the system, because you can now run hundreds or thousands of simulations and study the consequences of different kinds of initial geologic variables. With that, I'm going to turn it over to Kelly to talk a little bit more about some of our data curation, data management, data collection, et cetera.
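As an illustration of the proxy-model idea, here is a sketch, not NETL's actual workflow: train a small neural network on input-parameter and output pairs generated by a slow simulator, then use the cheap surrogate to sweep thousands of scenarios for uncertainty analysis. The toy "simulator," the parameter names, and their ranges are all invented.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

def slow_reservoir_simulator(params):
    # Stand-in for an hours-long physics run; params = (porosity, permeability).
    phi, k = params[:, 0], params[:, 1]
    return phi * np.log1p(k) * 100.0

# A few hundred training runs of the "expensive" simulator.
params = rng.uniform([0.05, 1.0], [0.35, 500.0], size=(300, 2))
outputs = slow_reservoir_simulator(params)

# Train the surrogate once...
proxy = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000).fit(params, outputs)

# ...then probe uncertainty with thousands of near-instant evaluations.
scenarios = rng.uniform([0.05, 1.0], [0.35, 500.0], size=(10_000, 2))
pred = proxy.predict(scenarios)
print(f"P10={np.percentile(pred, 10):.1f}  P50={np.percentile(pred, 50):.1f}  "
      f"P90={np.percentile(pred, 90):.1f}")
```

The payoff is exactly what the speaker describes: once trained, the surrogate lets you study how uncertain geologic inputs propagate to outcomes, something that would be impractical with the full simulator.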
I only have two slides, and the goal is to keep you awake post-lunch. I was invited two years ago to give a talk to a similar, maybe the same, group at the National Academies, but in relation to the 4D subsurface, and we've heard that theme come up a lot today; it was part of this broader conversation the National Academies has been evolving around data and the subsurface. One of the things that's important, and we've heard it, is that the subsurface isn't static. On geologic time scales, the millions and billions of years, we know of course it's evolving, but as humans engineer the subsurface, and interact even with the surface, we are continuously changing what's going on down there. So there's an increasing need for data: to resurrect data, to utilize data, and to find new science-driven, physics-based models that will let us leverage that data to improve prediction of what's going on in the subsurface.

The pyramid on the left of the slide reflects what we've heard in many of the themes today: everybody wants to leap to the top of the pyramid, they want to immediately inform and analyze with big data, machine learning, AI. But the majority of the challenge (the Wall Street Journal just published an article, and they're not the first; I've seen many of these over the last five years) for us in the geosciences and the engineering sciences is at the bottom of the pyramid. We've got to find the darn data, we've got to move the data, we have to do language translation, unit conversion, integration, and every last one of us wants something different. So our group for the last decade has been working not on the top of the pyramid alone; we're subject matter experts, a geo-data science team that brings together computing scientists, software engineers, statisticians, geostatisticians, ocean modelers, engineers, geoscientists, et cetera, to work this entire span of the pyramid, and that's what we're showing here.

The project at the bottom left, where it says Carnegie Science Award, is actually the most recent. We were asked by an outside group: hey, you guys are really good at data detective work; we want to be able to reduce fugitive methane emissions from sources, and quickly. It was a consortium of oil and gas companies and the UN Environment Programme, and they said: we need to know what the oil and gas infrastructure framework is; can you do it in three months? They knew we had a smart search tool, a big data, machine learning tool. It's not the sexy, glamorous analytical side of the pyramid; it's the non-sexy, non-glamorous bottom side. We took that tool and trained it with our subject matter experts: we know the oil and gas industry, we know the terms, we know the phrases, we know the data formats, we know the types of data. We showed it the data sets we had, we parlayed that into search terms and phrases, thousands of them, which became a training data set for the search algorithm, and we turned it loose on the worldwide web, drilling down into FTP sites, zip files, everything else. It was messy, it was an alpha tool; that's okay, we're building it. It brought results back and the team parsed them; we also traversed the world and put eyes on it, for that validation, that trust: do we trust what we're getting back from this machine learning tool? And in three months, minus the US, we had the global footprint, including an analysis of gaps: where don't we have open-source data? The group we were helping said: that's awesome, and we told you not to do the US because a major top-ten university was doing it, but they didn't do what you did, so please go do the US. And a month later we turned it around.
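The core of that kind of data-detective search can be illustrated with a toy relevance scorer; this is purely illustrative and not the actual EDX tool. The idea is that subject-matter experts supply weighted terms and phrases, and candidate documents or file listings get ranked by how strongly they match. Every term, weight, and file name below is hypothetical.

```python
import re

# Expert-supplied terms and weights (illustrative, not the real training set).
TERM_WEIGHTS = {
    "wellhead": 3.0, "pipeline": 2.5, "gas processing plant": 4.0,
    "api number": 3.5, "shapefile": 1.5, "infrastructure": 1.0,
}

def relevance(text: str) -> float:
    """Score a document by weighted counts of expert search terms."""
    t = text.lower()
    return sum(w * len(re.findall(re.escape(term), t))
               for term, w in TERM_WEIGHTS.items())

# Candidate "documents" as they might come back from a crawl.
docs = {
    "site_a.txt": "Shapefile of pipeline and wellhead locations with API number fields",
    "site_b.txt": "Annual report on corporate governance",
}
for name in sorted(docs, key=lambda n: relevance(docs[n]), reverse=True):
    print(f"{name}: score={relevance(docs[name]):.1f}")
```

A production system would add crawling, format detection, and human validation of the returns, which is exactly the "put eyes on it" step the speaker describes; the scoring loop is just the seed of the approach.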
There's so much data out there; everyone has been touching on that, and the speakers this morning did a fabulous job of highlighting the problem. But we as a community can traverse this pyramid more effectively if we're able to be honest that the 80% at the bottom of the pyramid actually needs to be inverted. Machine learning and AI may have the most impact in the near term if we put our eyes and our thoughts and our innovative brains on the bottom half of the pyramid, because there's a lot of scientific expertise about how to inform and analyze and optimize at the top, but the data chunk at the bottom is one of the heaviest lifts. We've applied that in other application spaces: for improving prediction of subsurface properties; for using a fuzzy logic model to improve prediction of the interaction between unknown and known faults and fractures; for wellbore integrity, in this model up here; and for forecasting induced seismicity probability, just using data and big data.

The resolution matters too, and that's something you've heard as well: there's always going to be uncertainty, but what matters is communicating where you're more certain, which is what this gridded model at the top is doing. This is an output of subsurface pressure variability in the Gulf of Mexico: we have much greater certainty where the grid cells are small, and more uncertainty, more pressure variability, where the grid is big. We shouldn't be afraid to communicate that kind of information going forward. Our team is chipping away at chunks of all of this.

One of the other aspects that Grant mentioned, and that has come up in the conversation here: there are really important investments going on now for collecting data, curating it, transforming it, making it interoperable and accessible, and bringing it together, but it's all happening in these siloed-off houses: portals, individual projects, universities, industry, et cetera. At least within the FE mission space, under EDX, FE funds a lot of research that is relevant to its mission space and to other people's application spaces. So this summer we're turning a whole group of summer interns on to training and labeling and trying to get information from those federally funded projects that we've been curating for the last decade within EDX: trying to get them out of the data-lake mode of individual data sets and into something a little more integrated. It's not going to be picture perfect; it'll be more like the Alberta EOS article that was shown, where it'll be easier to find those data, but over time we're hoping to bring more and more resolution to that database, and also more and more tools to the FE user community and anyone else who might find them of use, so that they can access that information. It's all machine learning and big data driven under the hood, because no one person or group is able to handle the volume of information that's out there and transform it. And then, being able to eventually plug and play what we're doing with NSF EarthCube, DataONE, the British Geological Survey, the Albertas of the world, the new initiative out of China that was mentioned: the opportunity there is huge. But the important thing is getting out of that 80% of data resurrection and integration,
so that we can actually do things with it at the tip of the pyramid, and invert the pyramid so that the 20% is finding the data you want and getting it into the format you want, and the 80% is spent doing really fun science, because that's the part most of us want to geek out on. So that's a quick update from the talk two years ago and some of the things that I think thematically are resonating with the conversation today, and I suspect the rest of the speakers this afternoon will hit on them as well. Thank you.

The next speaker is Bill Van denier; he's a project officer with the geothermal technologies office. Bill's project is applying machine learning, advanced algorithms that identify patterns and make inferences from data, to assist in finding and developing new geothermal resources. If successful, this will lead to a higher success rate for exploratory drilling, greater efficiency in plant operations, and ultimately lower costs for geothermal energy. Thank you.

Okay, thank you for the introduction. As specified, I'm Bill Van denier, I'm with the geothermal technologies office. During the day I mostly focus on project management, but I do have a personal interest in machine learning, so what I've tried to do here is make a presentation that is worth everybody's collective time, and I see some common themes in my presentation relative to others we've seen so far today, so I'm going to take that as a good sign.

With that, a quick advertisement for geothermal energy: our big claim to fame is that we provide baseload power; it runs steady, safe, 24/7. Ultimately the goal of the geothermal technologies office is to help take geothermal from being a resource that's utilized, in the US anyway, in isolated spots in the western US, and make it a resource that can be used throughout the western US and hopefully throughout the nation at some point.

Okay, just a quick diagram here of your standard geothermal power plant. What we're seeking, our targets if you will, are heat and permeability underground. When we find that heat and permeability, we bring the fluid up through our production well, run it through our power plant, a standard turbine setup usually, create electricity, and then re-inject that water back into the same aquifer. We do have other program areas besides electricity generation, but that is the main focus of our program.

Okay, some barriers and challenges. First off, for machine learning, and I think we've heard this talked about a lot today: just making sure that you have the availability of quality, labeled data sets. Also, within any industry, and within geothermal, we have certain industry entities and operators that are more willing to share their data and certain operators that are less willing, so proprietary data sets can sometimes cause issues in terms of barriers. And then for geothermal energy itself: we are a small community and a small industry relative to other renewable technologies, most notably solar. On my next slide I'll get into this a little more, but geothermal, I would say, has particularly high risks on the front end, in the early phases of project development, and particularly high costs during drilling phases, be it exploratory or production drilling; field work in general costs a lot of money. These are areas we like to focus on with our R&D and our program, and with machine learning as well. So, as promised, here's the risk and cost: as
you can see, the highest risk for a geothermal energy project is at the beginning, when you're characterizing your resource and doing your test drilling; after you've done that test drilling, the risk falls off significantly throughout the remainder of the project, when you're doing your production well drilling, construction of the power plant, et cetera. In terms of cost, that's the dashed line on the diagram, and you can see the cost bumps up most significantly during those field phases, when you're drilling or doing other field operations, and also during construction of the power plant, of course. This is a key point of the talk, and something to keep in mind as we go through the next few slides.

Okay, here's what we have in terms of currently available data and analysis tools. First up, we have the National Geothermal Data System; this is an effort that started, I'm going to say, around 2010, when we made an effort as an industry to digitize our data and encourage folks to upload it to the system so that others could use it. We also have the Geothermal Data Repository, a similar open community data system where, in particular, geothermal technologies office projects, DOE projects, upload their data while their projects are in progress and when complete. We also have a techno-economic analysis tool called GETEM, and a GIS-based tool for looking at geothermal prospects called Geothermal Prospector.

With the National Geothermal Data System and the GDR, one thing I will note, based on the discussions earlier today, is that we do have a curation team. Sure, we could probably have a bigger team and more funding put towards it, but we do have a curation team, and the data exists; you can access it. One thing I would also say, though, is that this system was created in a time before machine learning had such emphasis, so though we have data content models, for example, that provide some structure to the data as it's being entered, we may need, as a program, to go back and look at those and see if they're going to be useful from a machine learning perspective.

We've talked about seismic analysis quite a bit today, so I won't spend too much time here, but naturally seismic data, given the volume, and given the fact that in many cases a human may have already been through that data, reviewed it, maybe even labeled it, provides a good opportunity and a good place to start for machine learning.

Alright, we've had two recent funding opportunity announcements within the past year that touched on this topic. We had one specific machine learning FOA; we as a program are just getting into machine learning, so what we wanted to do is take the first logical step, if you will. The objectives of this FOA as a whole were, basically: continuing to develop those community data sets in the National Geothermal Data System and the GDR, to make them more structured and ready for machine learning; and developing some programs that will help us decide where we might want to drill future wells, whether in a greenfield or an existing geothermal field. Also, one of our biggest issues is that probably most of the geothermal systems that remain to be discovered, at least in the U.S.,
are hidden; they have no, or very little, surface expression, so we want to find those hidden geothermal systems. Another objective was taking a look at power plant data: again, there's a multitude of data there, so how can we use the data that's been collected through the years to better optimize power plants? And the last thing: prediction and detection of trouble events, trouble events during drilling where you're losing circulation or having downtime for another reason; how can we head those off?

We had another funding opportunity announcement called EDGE, for efficient drilling in geothermal energy, and we had what we're calling two cohorts of selections. The first cohort of EDGE selections was basically drilling R&D focused on the actual drilling technology being used; the second cohort focused more on projects looking at ways to process existing data and to create partnerships with the oil and gas industry, and many of those projects had more of a machine learning aspect to them. So we do have some crossover between the machine learning announcement and the EDGE announcement. This is a list of the projects selected from the machine learning announcement as well as the EDGE announcement. I won't go into details on any of these, and I'm not sure I could remember all of that anyway, but we can at least categorize them. On the top left we have a handful of projects, again, looking at the data that exists in NGDS and GDR: labeling it, making it more structured, making it higher quality. The second category: additional seismic analysis. The third category, on the lower right: exploration aids, projects that are helping us decide where we might want to site our next well, for example, based on all the data we have about the field or the geothermal reservoir that's already operating. And on the lower left, the power plant data; several projects in that category. As mentioned, we also have a couple of SBIR awards focused on the seismic and subsurface signals realm.

Okay, so that's what exists, and now I'm going to get a little more speculative. This does not necessarily represent the thoughts of the geothermal program as a whole; these are more my personal thoughts. But along the lines of what we've seen today: in the future, as a long-term vision, if we can have tools that help us when we're out in the field, during those high-risk and high-cost portions of our efforts, they can help us make decisions, maybe even as we're drilling a well. If we can have a recommender system, if you will, that says: if you want to drill quicker, maybe you should put more weight on that bit, or put weight on the bit sooner, or change your mud weight; things like that would really help us decrease our cost during those high-risk portions of work.
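To make that recommender idea concrete, here is a deliberately simple sketch. The thresholds, variable names, and rules are hypothetical, not a GTO system; the one property the sketch is meant to show is that every recommendation is clamped to stated operational limits, which is the safety constraint mentioned below.

```python
from dataclasses import dataclass

@dataclass
class RigState:
    rop_m_per_hr: float   # rate of penetration
    wob_kn: float         # current weight on bit
    mud_weight_sg: float  # mud density, specific gravity

# Illustrative operational limits; real values come from the rig and well plan.
WOB_MAX_KN = 180.0
MUD_SG_MAX = 1.6

def recommend(state: RigState) -> str:
    """Rule-of-thumb recommendation that always respects equipment limits."""
    if state.rop_m_per_hr < 5.0 and state.wob_kn < WOB_MAX_KN:
        target = min(state.wob_kn + 10.0, WOB_MAX_KN)
        return f"increase weight on bit toward {target:.0f} kN"
    if state.rop_m_per_hr < 5.0 and state.mud_weight_sg < MUD_SG_MAX:
        return "WOB at limit: consider raising mud weight within the well plan"
    return "hold current parameters"

print(recommend(RigState(rop_m_per_hr=3.2, wob_kn=140.0, mud_weight_sg=1.25)))
```

A learned version would replace the hand-written rules with a model trained on drilling records, but the clamping to operational and safety limits would stay exactly where it is.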
Some of the challenges we would need to keep in mind: when you're on a remote construction site like that, you may be dealing with limited resources, internet connectivity, or computing power. Any recommendations naturally need to be in line with the operational limits of the equipment, and we have to make sure we're not creating any safety issues for personnel based on those recommendations. Also, our drill rigs in geothermal are usually a little smaller and not quite as new as the oil and gas drill rigs, so we have a lot of analog controls, manual operation, kind of old-school equipment if you will. It's been mentioned today: the willingness of personnel to implement recommendations from a machine learning system, and balancing those recommendations with the existing expertise in both personnel and physics-driven methods. And then last, cybersecurity is definitely a point of focus at DOE right now, so we'd have to keep that in mind as well.

I think most of you are probably familiar with a drilling site, but just in case: the rather dim photo on the upper left shows a double track leading to a lonely, lonely drill rig; these are the kinds of sites we're working at, in Nevada in that case. The upper right photo is an exploration well drilling operation; as you can see, a small pad, not a lot of resources. Moving down to the bottom, that's a production well pad, and you can see there are more resources: there's more space, and on the right of that bottom photo there are trailers, where people are staying, living and working; there are computing resources, power generators, et cetera. So there's more to work with in a production scenario. Here are some of those analog controls I was referencing earlier; so, how do we take a potentially detailed and complex machine learning recommendation and translate it into pushing buttons, pulling levers, et cetera?

And this is my last slide. As we've gone through the slides today, some of the questions on my mind, at least: the data from the National Geothermal Data System and the GDR, is it going to be sufficient for machine learning? Is there enough data, is there enough big data there? Hopefully, after we go through this round of projects, that data will be structured further and labeled better, but is it going to be useful? Are there additional pathways to leverage expertise from mining, from oil and gas, from other subsurface industries? I'm sure there are. And then, ultimately, we as a program probably need to be a little more structured about how we would want to integrate machine learning into our long-term vision. A final plug: we did have a vision study released last week for geothermal, which goes through what we would expect geothermal energy potential to be under various funding scenarios over the next several decades. It's called GeoVision, so you can use the search engine of your choice to look up "geothermal GeoVision" and check it out. With that, I will conclude, and thanks for your time today.

The next speaker is Wendy Harrison of the Colorado School of Mines; her fields of expertise are geochemistry and hydrology. The Colorado School of Mines works with the mineral industry to develop mineral resources, increase exploration success rates, and advance mining operations while cutting costs and minimizing financial risk. Thank you; it sounds very impressive.

So, as always, it's a delight to be here; it's nice to be able to shut the door on the busywork of life and engage in some intellectual discussion, a lovely opportunity. In this short review I'm going to introduce our center, the Center for Advanced Subsurface Earth Resource Models. The motivation for forming the center is to make contributions in a globally competitive mineral industry and in material security, with all the aspects of mineral resource discovery and recovery; but these are also related to materials, manufacturing, recycling, and they embrace economics, politics, and social
license. So you're going to say, oh, what are you talking about? We should never forget that our science serves society, and I think we should keep in the back of our minds that what we talk about today has a bigger picture. I'm going to provide some examples of active pre-competitive research that address many of the topics we've heard about this morning and this afternoon, but I just want to say one thing about pre-competitive research: what does that mean? This is something we haven't really talked about today in our language. Pre-competitive research is the research that an industry-government group would like to have as the foundation for advances they will make that are proprietary; it's work that serves a shared need at a base level, which will then be used independently according to each user's interests. How do I advance this? Yeah, oh, there we go, I got it, thanks; it's like opening my car door.

So what is our center? It's a collaboration among industry, government, and universities, with the purpose of transforming the way geoscience data is used in various aspects of the mineral resource industry, disseminating this knowledge to the center members, and, finally and importantly, something we haven't talked about today, addressing the critical industry need for trained and prepared employees by educating future research engineers and scientists. That's an important part, and I'll come back to it.

Alright, I want to talk a little bit about the background, because it's also informative. We are an awardee of the National Science Foundation, through their program of Industry-University Cooperative Research Centers. They provide us organizational expertise and funding for center development and operation over three five-year phases; we're in the first of our five-year phases. This is a mature, successful program: it's operated for more than 40 years, and NSF has enabled the establishment of more than 90 science and engineering centers. Our center, which was just established last year, is the first of two now within the geosciences directorate, so we are pushing new territory, and we're excited to be doing so. The members of our center support the research, with NSF providing some support for the organization, and they can be both large industries and small businesses, federal agencies and national labs; we have Tom Crawford from the USGS here with us today, and he's a member. We have two sites in our center, which is an NSF requirement; broadly, our sites are Virginia Tech along with Mines, and we have some Virginia Tech participants with us today; I'll come back to that towards the end. Between the two universities our team has 26 faculty from seven disciplines, with a breadth of expertise that includes spatial statistics and mathematics, high-performance computing and visualization, and geoscience fields that include mineral deposits, geochemistry, mineralogy, petrology, geophysical inversion, potential field surveys, analytical methodologies, and geophysics instrumentation. So you get it, right? It's a lot of interdisciplinary potential and need. The membership is interesting; it's not similar to the way universities normally do business in consortia. The fees from the members support center research activities; the members prioritize the research agenda and direct the progress of the research. 90% of their fees are directed towards research, which means the indirect cost to members is less than 10%, so this is a very good investment. Members get access to graduates, institutional
expertise, and a royalty-free license to new discoveries.

So we have some opportunity space that enables the center to be successful, and the first part of it is the set of challenges within mineral exploration, which have discovery, development and economic characteristics. Since some of you don't know too much about the mineral industry, there are a few things here I'll point out. The first is that there's a low global success rate in new fields, meaning new exploration. You can read all those numbers, but look at the column on the right-hand side: probability of discovery, always less than 1%, and sometimes, for a world-class large deposit, 0.07%. That feeds into comments I would make about prolonged project development time, an average in some areas of perhaps 18 years, with 9 years from discovery to onset of mining. That means the discovery costs are huge, billions of dollars before there is return on investment; there's a lot of expenditure up front, and we'll come back to this in a little bit.

Equally challenging as the economics and business side: the work the center will do has to do with everything we've already heard about today, with data, data integration, data scaling and conversion. There are very large data sets, which potentially allow the application of machine learning technology; very dissimilar data types; varying spatial resolution; wide distribution. I think there are some similarities here to things we've heard this morning, but there are some differences too. Ore bodies, and we heard a little bit about this too, are highly irregular in 3D space and in grade; the highest grade isn't necessarily in the middle. There are huge data scales: micron-size mineral grains, nanoparticles of gold perhaps, all the way to the geophysical surveys that need to be taken during the exploration and assessment phases. Individual mineral resource companies maintain very large databases, and these databases include time-dependent adjustment as mining progresses and resource estimates are updated. However, and this is entirely understandable if you think about the financial investment, there's a lot of protection of the data because of the economic risk, and that's a topic we can sympathize with. In general, in the field of mineral exploration, ore deposit models that are universal are not common; there are some, but often a new target starts from scratch, so academic discovery of basic principles is somewhat inhibited.

The third piece of opportunity space, which I haven't really talked about, so I'm bringing it to the table today, is the workforce crisis that exists in the extractive industries. You can read all my bullets, but the bottom line is really that there's a lot of evidence the mineral industry is aggressively promoting novel approaches, including machine learning, AI, robotics, drones and so forth, modernizing exploration and production at a very aggressive rate. The industry needs real-time decisions; exploration and production from remote sites is desired, with a minimal workforce and reduced costs and timelines, and you can sympathize with that based on the information I gave you earlier in the presentation. The current workforce has invaluable real-time experience; these are people with extended experience working in traditional ways with the discovery and use of mineral resources, but the industry needs new employees who are experienced in machine learning and the various other skills that lead to experience and
development. So one of the big challenges, which we also haven't talked about much, is that we have to learn how to integrate these two sets of experience: the on-the-ground experience of the long-time employees, and what our new students and graduates bring to the table. I think you noticed this morning that we had a great question from one of our students about how they need to prepare, and I hope some of you will have the opportunity to share your comments and thoughts with them.

So now let's turn to the technical scope. We have research projects with some interesting features: all of them cover multiple scales, with each individual project integrating data across several orders of magnitude, and all can potentially advance using machine learning methods. The ten listed on the side are ten projects that were carried to an industry advisory board meeting, where the members selected what was of interest, and there are groups: the orangeish color represents integration of mineralogy, petrology, geochemistry, strength and structural information; the group in blue is projects having to do with understanding fractured rock environments; and the group in green addresses developing innovative methods for using large data sets in exploration and mining. The ones that are funded and currently active are in the bold colors, and I'll talk briefly about some of those. An important part of this whole program is that member companies are often providing site access, data, and student internships, and I think that addresses some of the questions raised about proprietary data.

I'll overview these quickly. The three projects I'm going to pick have been chosen because they show how machine learning can be applied to challenges in different stages of the mine life cycle: unsupervised learning in ore body delineation; prediction of the monetary value of each mine block for optimizing mine design, which is supervised learning; and greenfield exploration, which is currently using applied statistics and is still undecided about whether machine learning is a valuable approach.

So I'll start with machine learning in the context of mine planning. For each of these topics I have two slides organized similarly; I can't say too much in detail, so if you want to know more, ask questions. The goal of this project is to develop geologically informed clustering, with the hope of radically improving ore body delineation, reserve estimates, and mineability, and that has significant implications for discovery. This is a multidisciplinary research team involving geophysics, geology, applied statistics and computer science. The ultimate goal really is to, well, I'll pass on that; let's get to the nuts and bolts, that's what you want to hear. The research method is focused on making full use of multiple data sets, in this particular example for gold-silver deposits, provided to us by one of our member companies, to develop training sets and validation sets for machine learning. The original intent was to use supervised machine learning and deep learning algorithms to construct the ore body in 3D, but you will see that I've written, in progress, "develop data format for machine learning: unsupervised." I didn't say that before, because this is an example where the initial choice of algorithms was discovered to be inappropriate and a change was made; it's a good example of iteration between data availability and the algorithm.
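As a toy illustration of unsupervised learning for ore body delineation, and a sketch over invented data rather than the project's actual method: cluster multi-element assay samples in feature space, then inspect how the clusters map back to spatial domains. The element suite, concentrations, and cluster count below are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic drill-hole assays: columns are Au, Ag, As concentrations (illustrative).
background = rng.lognormal(mean=[-3.0, -1.0, 1.0], sigma=0.4, size=(400, 3))
ore_zone   = rng.lognormal(mean=[ 0.5,  2.0, 2.5], sigma=0.4, size=(100, 3))
assays = np.vstack([background, ore_zone])

# Cluster in standardized log-space, a common choice for geochemical data.
X = StandardScaler().fit_transform(np.log(assays))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for k in range(2):
    med = np.median(assays[labels == k], axis=0)
    print(f"cluster {k}: n={np.sum(labels == k)}, median Au/Ag/As = {med.round(2)}")
```

The "geologically informed" part the speaker describes would go beyond this: constraining clusters with lithology, structure, and spatial continuity rather than assay chemistry alone.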
The second project I'll briefly summarize uses hyperspectral core scanning data and supervised machine learning to determine the quantitative mineralogy of core, and then to predict rock physical and mechanical properties. That in turn allows you to define the monetary value of each mine block, because we can integrate metal value, energy, and environmental cost; the most effective mine block isn't necessarily the one with the highest grade ore if it's going to be the most expensive one in energy use, in terms of comminution. This is another multidisciplinary team, with expertise in quantitative mineral analysis, geology, applied statistics and computer science. The research method here is also to use machine learning approaches to establish relationships between hyperspectral data and quantitative mineralogy, developing training data sets; the intent is then to upscale to drill core, producing continuous three-dimensional downhole mineral modes, which in turn will be related to the structural and mechanical properties these cores provide, and built into 3D block models. In essence, the training data is to take thin sections, the kind most of you know about, maybe five of them, with quantitative SEM mineralogy at a small pixel size of 5x5 microns, crosswalk that with hyperspectral data at a bigger pixel size, and then move up to making this a continuous rather than a discrete set of information. So far the training set data is prepared and the team is working on algorithm selection.

The third project has to do with greenfields exploration, and here the project, and we heard a little bit about this as well this morning, is to look for distal signatures and vectors of hydrothermal systems in carbonates: what out in the unmineralized environment tells the direction in which the undiscovered ore will be found? This is an interesting project, with a multidisciplinary research team again: statistics, field geology, chemistry, and industry collaboration; our data and access are provided by London Mining, a center member. On this little cross section, which is from the mine we're allowed to access, the red areas are known mineral deposits related to volcanic activity, hydrothermal activity, with mineral deposits in the volcanic units; but the interest of this research team and the supporting members is what we can find in the carbonate rocks, which are blue on this diagram, that provides evidence of what direction the mineral deposits should lie in. Although we know something about exploration targets in volcanic systems, we don't know about them in carbonate systems. What this group has done is start to develop statistical spatial-trend search methods to get the vectors, and they're applying these methods to a known data set, so they know where the target is, and then assessing whether machine learning is an appropriate tool for pattern recognition within a data set from multiple deposits. Now, what we see here, and we also heard about this this morning, is that the data set is not sufficient for a universally applicable methodology. You'd say they need more data; well, more data is not necessarily the answer. The data set they have here, in three dimensions, is 200,000 analyses. You might say that's nothing in terms of Google and Amazon, and that's true, but 200,000 analyses for a single project is almost overwhelming.
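To illustrate the vectoring idea in that third project, here is an invented, minimal example; the real work uses far richer spatial statistics. Fit a planar spatial trend to a pathfinder-element concentration and read the gradient as a direction toward the hidden source. The element, coordinates, and numbers are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic samples: x, y positions in meters, plus a pathfinder element (e.g., As ppm)
# that increases toward a hidden source to the northeast.
xy = rng.uniform(0, 1000, size=(200, 2))
as_ppm = 5 + 0.02 * xy[:, 0] + 0.03 * xy[:, 1] + rng.normal(0, 3, 200)

# Least-squares planar trend: as_ppm ~ a*x + b*y + c.
A = np.column_stack([xy, np.ones(len(xy))])
(a, b, c), *_ = np.linalg.lstsq(A, as_ppm, rcond=None)

# The gradient (a, b) points up-trend, i.e., toward increasing concentration.
azimuth = np.degrees(np.arctan2(a, b)) % 360  # 0 = north, 90 = east
print(f"exploration vector: azimuth ~ {azimuth:.0f} degrees (gradient = [{a:.3f}, {b:.3f}])")
```

The caution the speaker raises applies directly here: if the samples were collected for a different purpose, the fitted trend can reflect the sampling pattern rather than the geology, which is exactly the bias problem described next.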
What they have discovered so far, and this is another interesting aspect of machine learning, is that the data set, huge and good to start with, isn't sufficient for developing this universally applicable methodology. There's bias, indicated by the statistics: the collection methodology for this target, where the data were collected for a different reason, has yielded data that are not appropriate for solving the problem through machine learning methods, and I think that's quite an interesting takeaway. Not all data is what you need, nor has it necessarily been collected properly. In fact, one of the questions this team will answer right now is whether we're going to end up with a situation of overfitting, meaning that the model is too constrained to be applied elsewhere. That too is a very valuable analysis of the role of machine learning in subsurface data.

So these three projects, oh, I wasn't paying attention, these three projects really show nice examples of subsurface data analysis, and they specifically show that data sets, regardless of size, can be insufficient; that maybe different samples must be collected; that the algorithms need to be adjusted or developed; and that just having the data and the math isn't sufficient. I think it's important that we recognize the feedback here: it's not just data in, machine learning output; it's an iterative process.

Some of the challenges we have in this particular application of machine learning in the subsurface: the end goal is increasing success rates and decreasing development time, and that takes a long-term vision and sustained investment in innovative research. We'd like a well-defined subsurface problem suited to these approaches, and we heard that already, a well-defined problem, plus an interdisciplinary team. We need all kinds of things about data: there's minimal industry-wide sharing and limited open source; we have huge variations in spatial resolution and distribution; acquisition of new data is very costly and very slow; the data must be appropriate, meaningful, and accurate; and we need to know whether the data is going to begin as structured or unstructured, since usually the unstructured leads to the structured. Then there are some interesting challenges in the deliverables that we also haven't touched on: how are we going to visualize and interrogate the product when it's in the workplace? The workplace experts aren't us sitting in this room, so how are we going to make something that's usable, and how are we going to make updates that are rapid, so that our predictions can be used in a close-to-real-time environment? And we have to remember, as we work with one foot in industry and one foot in academia, that there's a requirement for fast deliverables in industry, and that really makes us rethink how we do our research in academia. Published earlier this week, and I read it yesterday on the plane, is a comment in Nature with some observations relevant to our discussions today about data being an important field of its own and needing appropriate recognition. I quote: "when reused by others, the impact of collective data is multiplied." It's worth reading. You can learn more about us at our website. I'd like to introduce center director Rick Wendland, who's with us online; Mines site director Thomas Monecke, who's with us online; and Virginia Tech site director John Schomack, who's with us here. My job is to think about education and outreach, and we have some of our faculty here as well. Thank you very much.
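Picking up the overfitting question raised in the talk above: the standard check is to hold data out and compare in-sample fit with cross-validated performance; a model that fits its own data nearly perfectly but transfers poorly is over-constrained. A minimal sketch on synthetic data, illustrative only and not the center's workflow:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 200)

# An unconstrained tree can memorize its training set...
deep = DecisionTreeRegressor(max_depth=None, random_state=0)
print("train R^2:", deep.fit(X, y).score(X, y))                 # near 1.0
print("cv    R^2:", cross_val_score(deep, X, y, cv=5).mean())   # much lower

# ...while a constrained one generalizes better.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0)
print("cv    R^2:", cross_val_score(shallow, X, y, cv=5).mean())
```

For spatially clustered geochemical data, plain random cross-validation is itself optimistic; splitting by spatial block or by deposit is the sterner test of whether a model will apply elsewhere.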
Okay, great, thanks. It's a pleasure to be here. Lots of information today, everyone's brains are full. The faces I see remind me of driving my kids home from the pool many years ago, when our middle child put his towel over his head and said, "I'm closed." So if you would, take a second and open your mind. Being the last speaker in the afternoon, and the statistician, I've got a few strikes against me to start out, but I'll try to be engaging. I took the minimal slide recommendation seriously; I have three slides, but I brought some supplemental information that I'll share with you.

So I'm in a school of public health; what am I doing discussing the subsurface? I'm a spatial statistician; we do a lot of environmental exposure assessment. I also came to statistics in a roundabout way. My father's a statistician, so I was determined not to do that. I was a journalism major for a semester (that's another story), became a math major with a computer science minor, and did graduate work in operations research, large scale simulations; and large scale in the mid 80s was nothing compared to what it is now. But then my advisor had a research project on cancer, finding cancer clusters around hazardous waste sites, and so it was putting together a bunch of types of data that weren't necessarily meant to go together to try to learn something new, and that's what data science is all about, I think. The GIS community sort of struggled with some of this for a while; it doesn't mean they have answers, but we're used to using other people's data and worrying about that. So I was department chair for nine years and stepped down, and they rewarded me for that by saying, why don't you lead our data science planning efforts for the health sciences center. So I go to a lot of meetings talking to people, trying to get them to work together, and one of our logos (which has not been vetted yet, so you can't steal this, because they haven't told me I can use it) is the double arrow in between data and science. The idea is that data informs science, and we hear that a lot, but also that the science should inform data, and that message has been loud and clear here today: how do the subject matter experts interact? The real collaboration happens in the middle, and that's really where all the good stuff happens. Interdisciplinary collaboration is like traveling internationally: the money is a little bit different; the rules of thumb of how you get the check in a restaurant, when do you order, how long do you stay, when do they serve dinner; these are all things you need to learn moving from one to another. If you're not open minded, you're trying to turn them into you and they're trying to turn you into them, and there's just a lot of friction and not a lot of production. So this is where I'm kind of coming from. Those two arrows took a lot of thought; they took a couple minutes. I'm sure it'll look better when a graphic designer puts it together; there'll probably be some curvature and more speed and action to it, but that's what I could do in PowerPoint.

All right, this is a figure. I moved to Emory 21 years ago and met with John Richardson, who is a toxicologist at the EPA region in Atlanta. We were talking about GIS and he goes, "it's just like this," and he grabbed this scrap of paper. Now, the story would be a lot better if I could just tell you he took the back of an envelope and drew this; it would have been
such a better story if this was on the back of an envelope. But he drew this picture, and I said, brilliant, can I take it to my GIS class? Oh yeah, sure. So I've been using this in the GIS class for 20-plus years, and then I started presenting it to data science people and they go, wow, that's really cool. I said, it's just boxes and arrows. But let's think about this (I did give it the whirling vortex name; John didn't). We start in the upper left: we have questions we want to answer. In this setting we may be trying to find deposits, we may be trying to find safety measures, we may be trying to assess productivity. And before we actually do anything, we should take one step down and ask: what would be the perfect data for this? We'd like to have lots of test drilling, we'd like to know lots of things about it. But we have to compromise and go to the right: here's the data we can get and the tools we do have. Now, compromises are made in those steps. These were brought out in a couple of presentations today; John mentioned going from data to models, and there were different assumptions between the geophysical and the other track, so those things are often not discussed too much. But we get to the data we can get, we apply some methods to it, whether those are statistics or machine learning or whatever, and we answer some questions: the questions we can answer, with the data we can get, with the tools we do have. And we often don't compare that to where we started. How close is what we answered, or which part of the original question did we answer? And then you go around again. If I present this to an undergraduate statistics class I get the question, well, how do you know when you're done? I say, you're never done; this is how research works. The question is not whether we got to the end, it's whether we got further along than where we started. Do we know more afterwards than we did when we started, and is it actionable? So this is a helpful way to think about it.

And so things like machine learning, or new tools, come along, and the first generation of science is showing that you can do it: you're doing bigger data than anyone's done before, you're doing it faster and more efficiently, and there's a lot of competition to be the first one to do something. I think the second generation of science and application is much more about how you should do it. What's the best way to do it for our question? So the subject matter knowledge may not be driving what the algorithm is, but how it's used, and the choices of the data you have. We tend to like to think we march around the analysis, and at some point we apply it: we're down in the right hand side, we've got data, it looks like all the pieces we want, we run it through something, we get an answer, and we're sort of expecting the angels to sing and truth to be revealed. And that's just not how it is. We get something from it, and we want to understand how it works. So we had some nice conversations today about transparency of the algorithm, and all of us who are immersed in this have used the figures that you saw multiple times today; we use them all the time too: there's a bunch of boxes or circles, there are arrows that split from them, and there are more, and they cross over. If you show that to the subject matter people, it doesn't make it any clearer to them, but we look at it and we say, oh yeah, deep learning. So there's a gap between helpful figures for us and helpful figures for them. They don't need to know every
little choice that's made, and a lot of times we may not know how every one of those connections works, but we understand the big picture of generally what it's doing. And so if it feels like a black box, it's a little frightening; first of all, it sort of devalues "what's my contribution to this?" There was a similar meeting of the climate scientists that have the large scale global circulation models for climate change, and they were saying, well, machine learning might help out with that, but we're modeling every little molecule in the atmosphere and how much sun is coming in and what's happening in the ocean, and that's real science, we're learning how the system works. But maybe you just need to learn: when do I make a decision, what's the next thing to do? Trying to have the computing do the complete understanding of the science and give you information for policy may be a big ask.

So as a department chair I would go to my dean and ask for things, and they send you to leadership training, and one of the things you learn is your leadership decision-making style, the Birkman scale; I don't know if anyone's done this before, but you get four colors. It's like Myers-Briggs for administrators, but it's colored because we can't keep track of letters. So I'm between a blue and a yellow: that means I like to get all the data, I like to sort it out, I like to make a table, I make a list of pros and cons, and I draw things and I cross them out. My dean is a red: you give him a problem, you tell him what you're going to do, and it's do it, get on with it, never look back. If I go in with the stacks of paper I've spent all this time on in my office (here's all the data I collected, and you'll see this connection here, I think this one is more important than that one), I've lost him. If I go in and say, I looked at it all, I think we ought to do this: great. I have to communicate to him in the language he speaks for us to move on to the next thing. And we have to learn how to present our science as well: what do they need to know to make the decision at hand? Some of that is hard for us to do, because I like the details, I like thinking about this stuff; that's why I go to work. Oh, there's a new data set, how's it going to fit? That's the puzzle, I like that. Some of the people I'm working with don't care at all; they just want to make sure I don't make a mistake. So for me not to make a mistake I have to live in my world, but I need to present to theirs.

So first of all, data science is not data alchemy. If people aren't seeing that there's something good happening in the middle, that's what they're afraid of: that it's just a number. And sometimes it will make mistakes, and we need to understand what could possibly go wrong, what are some of the things that could drive it. Some of the medical researchers I work with were classifying radiology images for cancer, like pre-cancerous tumor development, and it did great, it was detecting them all. And that's because the radiologists had tagged all the tumor ones with a little mark in the lower right hand corner. It was so good because it had the right answer: it was cheating. So yes, it found something in the image that was an indicator of whether it was a tumor or not, but they weren't going to get that the next time. So we had to think about that. We need to model what we're seeing, and model how we're seeing it.
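A minimal sketch of the kind of leakage check that story suggests, with a hypothetical model and hypothetical image arrays rather than the researchers' actual code: if blanking the suspect corner collapses accuracy, the classifier was reading the annotation, not the tissue.

```python
import numpy as np

def occlusion_check(model, images, labels, region):
    """Compare accuracy before and after blanking a suspect region.

    `region` is a tuple of slices into each image, e.g. the lower right
    corner where the marks were: (slice(-32, None), slice(-32, None)).
    `model` is any classifier with a predict() method; every name here
    is an illustrative placeholder.
    """
    masked = images.copy()
    masked[(slice(None),) + region] = 0.0      # blank the patch in every image
    acc_full = (model.predict(images) == labels).mean()
    acc_masked = (model.predict(masked) == labels).mean()
    return acc_full, acc_masked                # a large drop signals label leakage
```

A large gap between the two accuracies is only the cheap first test; it does not prove the model is sound, just that one known shortcut has been ruled out.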
Wendy had some great examples of this in some of her illustrations: that we may not be collecting the data that lets us talk about what we want to talk about. So that's kind of my compromise of the data you can get. But we can also start thinking about it. What scale are we observing stuff at? What scale do we want to make decisions at? What scale is the process we're working on really happening at? Where does the mineralogy happen, how are the crystals formed, at what scale, under what conditions? Is it really small time and small space, or is it large scale? I think there needs to be a healthy humility about what can be missed in science. We don't get rewarded for being humble about our work; we tend to push it as the new fast thing, the new best thing, but we need to say: I did the best I could, here's what I got, here's what I think it means, does that make any sense to you? Scale is essential. This goes back to modeling how we see it: there's the scale of what we're looking for, geographic and in time; there's the scale of what we actually see, in the data and the technology we have for seeing it; and there's the scale of how we might extract it. So the decision we're going to make, or the action we're going to take next, may operate at a larger scale than what we're looking for, because we're trying to get it out of a bigger unit.

David Donoho has a nice paper on 50 years of data science, and he talks about the science of data science. The science piece is understanding how it works: how can I customize it for my application, how do I make it work better? So "the science of subsurface data science" sounds a little bit redundant, but we're trying to bring the phrase back around: what can we learn? Because data science does things differently; machine learning is doing things differently; it's not just replacing calculations we do with computers to make them faster. An eye-opening experience for me, talking to a lot of data science people, was how computer translation, how Google Translate, kind of works. For a long time the algorithms were trying to take each word, change it into another word, rebuild the syntax, understand the syntax; thinking about it like a human would: what does this word mean, how do the verb and the noun go together? What really scaled it up was this: there are an awful lot of translations out there, so why don't I just look at the 3,000 times Don Quixote has been translated into English, find the phrase I'm looking for, and see which translation is the most popular? So Google's not understanding what it's changing; it's just matching. It's finding a cat because of the way a cat looks, not because of the essence of a cat. And it worked better. We don't want cars to drive like people; we're not just trying to make them faster. We've seen things about automating some of the discovery, like discovering new materials, running a lab, having a computer decide; it can find new things we might not look for. There's the Go example, of a move that freaked out the Go master; the computer found it because we're so trained to thinking the way we're thinking. We want to be able to search the large space of possible answers, but we also want to narrow it down to the ones that are going to help, and that's a tension; it's not an automatic thing. You know, there have been a lot of articles about facial recognition,
and they're talking about the Amazon recognition algorithm and the police department in New York City: what if we put all the public cameras on this, we could find those felons that are just at large. And then they're surprised that the first two or three hits are not really felons, it was a false positive. Oh, that's really mysterious, we'll just have to make the algorithm better. But having taught biostatistics in the health field, I'll tell you the quintessential homework problem we give every new master of public health student on Bayes' theorem: what if we have a test with 99% sensitivity and 95% specificity, should we have everybody take it? When I was in school in the 1980s, you know, '80 to '87, we filled in the blank with: what if we have an HIV test that's that sensitive, should we test everyone in the United States? And then it evolved to thinking about mammograms for women under 50 and PSA tests for men under 50, for breast and prostate cancer specifically. And there was a lot of "oh, it's not very sensitive." It's not really that. It's that the prevalence, the number of people with the outcome you're looking for in the lower age group, is so much lower that even though there's a small chance of a false positive, most of the positives you get are false. We teach this as a calculation of Bayes' theorem, like an algebra thing, and everyone plugs it in, but our intuition is wrong; probability is why gambling works, because you don't have a good intuition for it. But really you can make a better picture of this. Let's say I have 100 people and I have 99% sensitivity: that means my chance of the test being positive, given I have the disease, is 99%. That's good, and we can test that in a lab, because we put in tissue samples with and without the disease. What we really want to know is: what's the probability of having the disease given my test was positive? That flips it around. So how do we do this? We have 100 people; say the red are the people whose test is positive, and then you get one blue person: they really had the disease, but the test was negative. Now take 100 people without the disease. Specificity is not detecting it when you don't have it, which is kind of a double negative; it's only specific to the disease. So at 95%, of those people who should all have the blue result, 5 of them got a red. So the total reds are mostly the people who really have the disease, plus a few people without. But if the prevalence of this disease is such that most everybody is blue, I'm going to get to the point where 5% of the blues is actually more than 99% of the people with the disease. So: there aren't that many felons at large in New York, and we're looking at everybody's face, so for most of the positives, the probability of actually being a felon, given the system identified you, is really low. Some of the machine learning examples talk about sensitivity and specificity of 99.99999, and conceptually we think that's ridiculously accurate, but we have to think: how prevalent is what we're looking for? Where's the needle in the haystack? The chance of finding a needle in a haystack is relative to how big your haystack is. So if I can shrink my haystack down, by saying let's only look at clubs that only felons can belong to, for instance, I'd have a better chance of finding it.
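A minimal worked version of that homework problem; the 1-in-1,000 prevalence is an assumed number for illustration, not a figure from the talk.

```python
# Base-rate sketch: even an accurate test yields mostly false positives
# when the condition is rare.
sensitivity = 0.99    # P(test positive | condition present)
specificity = 0.95    # P(test negative | condition absent)
prevalence = 0.001    # assumed: 1 in 1,000 actually has the condition

true_pos = sensitivity * prevalence                # truly sick, flagged
false_pos = (1 - specificity) * (1 - prevalence)   # healthy, flagged anyway

# Bayes' theorem: P(condition | test positive)
ppv = true_pos / (true_pos + false_pos)
print(f"P(condition | positive test) = {ppv:.1%}")   # about 1.9%
```

At this prevalence, roughly 98 of every 100 positives are false, which is the felon-spotting problem in a single number.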
So if I can be smart, I can learn about machine learning. We need to learn how to roll these things out, and right now we're in a stage where a lot of amazing things are happening with it, but learning how to use it better, I think, is the interesting part. And for us here, it's the subsurface: how can we apply this in the subsurface setting? Using the subject matter experts may not be about tweaking the algorithm; it may be about figuring out the data and finding the best chance of finding what we want to find, and I think we're all thinking about that.

The idea of open and proprietary access: in terms of working with a lot of different spatial data groups, this is a group where that's more of an issue. I work in health, and there are a lot of proprietary confidentiality issues, but you can get anonymized data with a lot of discussion. Now, is it anonymized enough? If I know your age, the zip code you live in, and something about your health, I might know who you are. The New York Times had a nice piece on location based services: if you watch the car go from the mayor's house to the city hall every day, you're pretty sure that's the mayor, and then one day he stops at this clinic to treat this issue; okay, now you've got a front page story. So we're learning how this works.

Putting my picture back up there: one of the issues that came out this morning was that the real challenge is setting up the question, and I think that's really important. Another point that was made is that data science and machine learning and deep learning are tools. If you have a problem that looks like a Phillips head screw and you have three tools to fix it, a wrench, a hammer and a screwdriver, all three of those are capable of driving that screw into the stair that's making noise. They may not be efficient, but I can drive a screw with a wrench if I hit it hard enough, and it works even better if I have a Phillips screwdriver. Everyone has used the wrong tool for a job in frustration, I'm sure. So I think we want to think about things like that: it is a tool, it is a powerful tool, it is a tool that does some things very well, and we're turning the questions we have in the upper left into questions we can answer with that tool. Classification: it's phenomenally good at that. If we can turn our question into classification, we can do a great job, and it can fit in really well. If we have a problem that's not just classification, where things are changing and we don't have a big background set of data for it, it's not going to do as well. So we need to get some intuition: this is a good setting for it, or a helpful setting; this is a setting that's going to need enhancement.
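Going back to the classification point, a minimal sketch of recasting a subsurface question as classification; the well log features and lithology labels are random stand-ins, and scikit-learn is an assumed tool, not any particular project's workflow.

```python
# Hypothetical lithology-from-logs classification sketch.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
logs = rng.normal(size=(5000, 4))        # stand-in: gamma ray, density, neutron, sonic
litho = rng.integers(0, 3, size=5000)    # stand-in: sand / shale / carbonate labels

X_tr, X_te, y_tr, y_te = train_test_split(logs, litho, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))   # per-class precision and recall
```

With random stand-ins the report will hover near chance, which is the point: the tool is only as good as whether the question genuinely is a classification and whether labeled data exist to train it.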
I think we need to learn how to learn, and that's where that last bullet comes in, and I'll close by talking about the question I mentioned earlier: what else should I take, is it just coding? The language you learn today will not be the language you're using 10 years from now, so it's really important to learn how to learn languages, and I think classes on software engineering and algorithms, how to take a problem and break it down into pieces, are what you're going to use over and over and over again. My Pascal skills were really, really good at one point, but my software engineering class is the one I use for almost everything I do. They gave us a problem that was too big to understand and a group of people to work with, and we turned it into smaller problems we could fix, and then we had to come full circle: did what we put together really cover all the pieces? We do that on every project we work on, whether we say it or not. So get with a group of people that are doing that and thinking about it. And I do think it's also important to see seminars by data science people talking about geoscience, as well as geoscience people talking about data science. Another interdisciplinary yardstick is: who are you trying to impress with your work? If I'm trying to write a paper for other statisticians, I may not have to be as careful about some other things; but if I'm trying to write a paper for a geoscience journal about statistical applications of machine learning in geoscience, I'd better have my act together, and to me that's an ultimately more useful collaboration, but harder to do. Those are just several thoughts; I was given the task of, well, can you summarize some things and talk about a bigger picture, so I hope I've covered some of that. And I'll end with this: my son is getting a master's degree in digital humanities, and he emailed me that whirling vortex diagram, and I thought, my work is done. Thank you very much.

Thank you very much, and thank each and every one of you for being here. I think we've been asked to have a break at this point, is that not correct? We'll do two questions, then we'll do a break, because there are probably a couple of people that are dying to get something on the table. Okay, question? Yes.

You made an offhand remark about climate science that I think actually sheds light on a lot of the conversation today. We talk about subsurface data, and often the purpose of subsurface data is subsurface models. This is a matter of great sensitivity to me, because we probably have more than a thousand people in my company engaged in subsurface modeling activities, and yet if I walk over to the gas station up on Virginia Avenue and I go in and ask them to sell me a subsurface model, they'll call the police. The fact of the matter is that subsurface models in our industry have no direct economic value; they only have value through the decisions they support. And often you find that subsurface models that PhDs work on for years are used to decide: should we lease a 100 kbd boat or a 150 kbd boat? So one skepticism that I bring to conversations about subsurface data is that the decision that's being supported, either at a business or a social level, is often not discussed, and the nature of that decision can very much influence your modeling strategy and where you're willing to cut corners.

I think that's an excellent point that academics, myself included, sometimes overlook, because we tend to think, don't you want to know what's really driving this? For some decisions, that's not really what you want to know. That's not to say it's not the right thing to do, but focusing on getting the information from available data to provide actionable answers: we overlook that goal in modeling, in my opinion, in a lot of settings. And this climate meeting was very similar, in the sense that the discussion was: why are we modeling, what are we using it for? The intention that came up in that meeting was understanding how the climate works and measuring things that may or may not drive its changing, as opposed to how you regulate air pollution. They're related questions, but in terms of the information and the feedback you want out of the model, it's different; not diametrically opposed, but different. Another analogy was talking to a doctor about inhaled steroids for asthma, and personalized medicine. So maybe the genomic features of everybody in here could be measured (I may not be able to do it, but a doctor could) and used to personally prescribe some
little tweak to absolutely everything, and everyone gets a perfect response, or 99% of you get a perfect response; whereas with the inhaled steroids, we don't really know how they work, they just seem to work pretty well for most people. At what point do you need the additional information to have the effect that you want to have? This can be billions of dollars in large decisions, or it can be as simple as which inhaler I'm going to get, and whether my insurance is going to cover personalized medicine versus inhaled steroids. If my health outcome is the same, the decision may not actually be based on what's driving this disease, or the effect in me, but the decision can still make the change. I don't know if that's getting at it, but I think what you're bringing is a valuable perspective: why model in the first place, and then how do you model to get to the answer you need?

Just to reinforce the point that you made: I mentioned that workshop we had last summer, and one of the key outcomes was that you should be looking at what your outcome is, what decision needs to be made, before you go in, like the whirling vortex diagram. What is it that you're focusing on, what kind of data do you need, do you have enough of it, and how can you fill in the gaps in what you don't have, in order to get the answers to the questions that you need to address? Otherwise you can solve the wrong problem: you can go off and spend lots of time and resources getting something to the nth decimal point when you just need single precision information.

I guess I'd like to hear a comment about the variability of these algorithms. I'm a water-rock interaction geochemist, and there's a whole range of programs one can use to model systems, but they don't all give the same answer. So I wonder about that in this particular arena: are there standard platforms one could use to always get the same answer, or are we going to see different answers with the same kind of input using different platforms?

One thing I'll put out there, as one of the statisticians: I think the variability is critically important; there's a lot of work on uncertainty quantification and so on. But in terms of modeling, I think we need to think not of the model but of the structures of the model, and how the choice of the type of model can influence this. In almost every competition for the best model, what usually happens is that a couple of them start to get ahead of the others on whatever the performance measure is, and they combine forces and end up winning. That's the Netflix challenge, a lot of these Kaggle-type things. The CDC did a flu prediction competition, and their goal when they set it up was: let's find the modelers who do the best job and use their model every year to allocate flu resources. What they realized is that they really wanted two or three models: if they all tell you the same thing, that reinforces the decision; if one of them says this and two of them say that, that's important information. It may not pin down the exact numbers, but it says we don't know enough to really pin this down, so we need to make our choices accordingly. So that's the kind of interaction I've had on multiple models giving different answers.
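A minimal sketch of that two-or-three-models idea; the data and model choices here are illustrative stand-ins, assuming scikit-learn: fit different model families and look at where they disagree, rather than crowning a single winner.

```python
# Compare two model families on the same held-out data; disagreement,
# not just accuracy, is the decision-relevant signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=2000) > 1.0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {"linear": LogisticRegression(max_iter=1000),
          "forest": RandomForestClassifier(n_estimators=200)}
preds = {name: m.fit(X_tr, y_tr).predict(X_te) for name, m in models.items()}

agree = preds["linear"] == preds["forest"]
print(f"models agree on {agree.mean():.0%} of held-out cases")
# The cases where they disagree flag "we don't know enough to pin this down."
```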
Yeah, so I guess it depends on the situation. There are a lot of different algorithms you can apply. A lot of the time, if you're going to use a solely data driven approach, as opposed to a predictive physics based model, then you kind of know what your outcomes are: you have to train those algorithms on some data, so you know what the outcome is that you want, and your algorithms are either better or worse at informing you about what that comes out to be. When it comes to then incorporating physics based models, that's where you get closer to having variability in the results.

I agree with those comments; just to add a little bit: geothermal sponsored an effort to do a code comparison study over the past few years, and that was a really valuable effort. Again, we weren't looking for the best model or the best geothermal code, for example; we were looking for which codes might work most optimally in certain situations. And I think when you bring in machine learning algorithms, you could probably do a machine learning comparison study effectively, and then you can even take a step beyond that and ask: at what points do I want to give certain weight to the physics based solutions versus the machine learning solutions? And that could vary too, depending on the complexity of the situation.

Okay, with that I'd like to thank our panelists and our audience, and we'll bring the session to a close. I know we're a little bit behind schedule; we're going to take a 10 minute break, ring the bell at about 2:59, and start promptly at 3, or something like that. I guess I'm giving you 12 minutes total, but be back in 10. The next session is kind of the discussion of how we carry this forward, and I think it will allow us to continue this conversation, so if you have some questions, there's an opportunity to bring those out in this last session. Thank you very much; take a quick break and we'll reconvene.

We're a couple minutes behind, so we're going to get everybody reassembled here and get started, and they'll mosey in. We're going to pull up, we have a PowerPoint; oh, it's up now. Okay, so we're transitioning out of our meeting, and we're going to work through a little bit of background. We want to have some conversation, and we'll have time for this discussion, but part of it is (and we had some really interesting discussions during the break) about where we are going, and I think that's what we want to spend some time talking about here. So in the next hour and a half, I'll share a little background on the roundtable process, Elizabeth's going to talk about some of the National Academies processes and what might be possible as part of this topic area, and then we have a couple folks that have participated in roundtables who can take a couple minutes to give some insight on how that's worked in the past. Then we want to have a bit of a discussion and get everyone that's here, your conversation and thoughts, on what you think might be the future of carrying this topic forward, particularly in the framework of a National Academies roundtable. Am I missing anyone? Okay. So our objectives for the meeting: we'll review some of the status around the roundtable (I think I've said most of this already), and if we can get to some initial ideas for topics and themes for a roundtable, that would be great. And particularly, we have a couple people that have been involved in these conversations over time: if your institution has already made a decision, and that may be at a
preliminary level, it would be very helpful to know that at this point in time. So what is a roundtable? A roundtable is one of the National Academies' processes for convening individuals: a neutral, credible forum focused on a specific topic, that meets on an ongoing basis and can look at a range of issues out in that area. The roundtable will meet as a roundtable periodically, but one of the things it will typically do (in past roundtables I think it was maybe one to two a year) is hold workshops, which are one-to-two day events where you take a very focused look at a specific area the roundtable is focusing on. For instance, in the unconventional roundtable they did a workshop on flowback and produced waters, and this is a proceedings volume from that; it's not a National Academies consensus study, but it's a compilation of the information that was presented, so you create a record that can be useful in carrying those issues forward. Here was one on onshore unconventional hydrocarbon development, this one was on induced seismicity, and this one was on legacy environmental issues. Those were created, and then at the end of the roundtable they decided: hey, we created this body of information with the unconventional roundtable, and it would really be useful to move it out more. So they did some focused regional workshops to share the information that had been collected over the three years of the roundtable. That's just one example, but it is the roundtable itself that drives the agenda. Some of you have asked, well, how do we decide what to do to get it started? The decision at this point is simply to join, to be part of the roundtable; it is the roundtable that decides the agenda, the focus areas for workshops, and how to move forward. Okay, and it's cross-institutional. One of the things the roundtable offers is that members will initially decide where they're going to focus, but after a workshop or two they can decide to change direction because of what they've learned in those workshops. I think there were a couple of issues that came up in the unconventional roundtable that weren't foreseen in the initial meetings; for instance, the regionally focused workshops at the end were never envisioned at the beginning, but they said, gee, we've created this body of information, we want to get it out there. So it allows the flexibility to adapt and address the issues, especially if you don't know exactly what the direction should be at the beginning. We do want to have a cross section of participation on the roundtable. I mentioned earlier that we need to have some government agencies as the anchor sponsors (that's part of the way the National Academy of Sciences works), but there is room for industry members, for NGOs, for professional societies, all of those, and in fact we would hope to have that kind of participation, that broad cross section of interests, on a roundtable. And again, the workshop is a very public kind of session; the roundtable would also meet as a roundtable in executive session, to flesh out the direction and where it's going. Anything else on the roundtable I need to cover, Elizabeth? Okay, so let me turn it over to
Elizabeth to talk about some of the possibilities for this one.

Thanks a lot, Jim. I'll move really quickly through these slides to try to get us all thinking about the subsurface data issue: is a roundtable appropriate, could the Academies' role in that space be helpful, and in what way? To get you thinking about that, there's a little short write-up (I think you might have seen it in the participant packet that you might have picked up on a table), and that's just an outline of what a particular roundtable on subsurface data could be. It's an idea generator, a way to get started; obviously, moving forward, if we were able to get the appropriate support for the roundtable, we'd work through exactly what that scope would be. But as Jim pointed out, it's designed to be a flexible mechanism, so you don't have to have everything mapped out in terms of what the roundtable would accomplish over the time period it would be active. And when Jim mentioned cross-institutional: you heard from one of my colleagues, and Michelle Schwalbe, who's sitting down there at the end of the table, oversees the Board on Mathematical Sciences and Analytics, which also has a Committee on Applied and Theoretical Statistics. We've been working with them as we've developed this topic, because it's obviously a topic that needs to bring together the data scientists and the subject matter folks to have those conversations. So we are having those conversations inside the Academies, and that's part of the vision going forward; I'll ask Michelle to jump in at any point if she has something to add.

So why subsurface data? I think this morning and this afternoon brought forward some of those points for us. Subsurface data are critical for identifying and developing natural resources, for monitoring and mitigating natural and environmental hazards, and for infrastructure development. We don't know enough about the subsurface in specific or even general terms, and we have a pretty vast data set to use; being able to harness those data for those uses would be amazing. We're starting to do that now, but we can always do more. We've already talked about the collection, ownership, curation and analysis of these data, widely dispersed in public and private hands, and about the activities: as we've heard just from a small cross-section of who's doing work in this area, there's amazing work going on, but not everyone is always able to talk to one another and exchange ideas and information. That leads to the third point: there's not a lot of data sharing, or anonymization of the data so that it can be accessed; common data formats and scales are problematic; and meeting common data collection, curation and analysis standards is a challenge, and so forth. Those are all problems in this space where a group such as a roundtable, bringing together experts and sectors across these various fields, could make some inroads. And there are great opportunities to take better advantage of some of these data analytics and data science approaches, if we can move forward with both the discussions and the actions surrounding subsurface data issues.

So, previous work: Jim has already mentioned that we are just finishing up the roundtable on unconventional hydrocarbon development. It ran from late 2015 and will run through the end of this October. We had 13 sponsors for that roundtable, four national workshops, and proceedings volumes, some of which Jim mentioned; we developed videos, and we did the
regional events. We also have at the Academies, in this area of subsurface data, consensus studies and workshop oversight that we conducted, not necessarily connected with that particular roundtable on unconventionals, and some of the products from those activities are below in the slide; one of the reports on the left is from work that some of Michelle's group did. We also have the Academies' extensive convening experience on complex issues with broad stakeholder participation, and I think the stakeholder participation piece is something that's really advantageous with the roundtable format, because you get that cross section of views; even just this day today was a small vignette into the possibilities of that sort of approach. The tasks that a roundtable might undertake, this roundtable in particular (and again, this is just a sampling list; the eventual sponsors of the activity would help define what the specific tasks could be, and we would seek help from others in the community with whom the roundtable would engage): to gather, examine and share information, data and approaches on the barriers and opportunities in this work with subsurface data; to identify and help advance the activities, discussions, communication and exchange of ideas that would have broad value to the sponsors and the key stakeholders; informed decision making and strategic thinking, which is a big point for both the sponsors and the community involved with the roundtable; and developing trust relationships.

Something I can share with regard to our unconventional hydrocarbon roundtable: as I said, there were 13 sponsors, but there were 25 members. The sponsor members are obvious, because they sponsor the roundtable, but we also brought in additional volunteer experts to complement the expertise on the roundtable, and I think some of the folks in this room (Wendy and Dave, for example, and Jan, who's here) might be able to share some of their experiences with that cross-sectoral group as part of the roundtable itself. The roundtable, through its activities, then developed its own community, if you will, who attended the roundtable meetings, listened to the webinars and webcasts, and took part in other ways in the roundtable activities, and that group numbered in the many hundreds. So over the course of several years there was a community that built up around the activities of that roundtable, to try to advance the issues the roundtable was examining.

So again, focus areas, just some examples; these would be determined later as we narrow down the exact scope and put up the bumpers, if you will, around what the roundtable might address. It could look at barriers; it could look at data cataloging, current practices, incentives and solutions; and it could learn from other fields or communities, which has come up a little bit here. Lance brought forward some of that in his presentation at the end: the commonalities between different fields, some of the challenges the different fields have faced in this space, and what solutions they've come forward with. Some of the challenges probably don't yet have solutions, but that's part of the value of being able to bring together people from different disciplinary fields.
There's a communication, outreach, education and awareness component that shouldn't be overlooked in this area, because you have a lot of people who are intensely interested in the topic, but there's a whole suite of folks outside who don't understand or don't know about its power. There's also the student population, and others coming up in the field, who can learn about or take advantage of the information that a roundtable is able to generate. And then, obviously, there are new partnerships, collaborations and self-initiated activities. Just kind of lurking around here today, I heard some of the conversations you all were having, and I saw business cards being exchanged and people saying, oh, I met someone, I'm going to talk with them afterwards because we have work in common; that sort of thing is collateral, if you will, from these kinds of activities, and I think the roundtable experiences we've had at the Academies have helped to generate that sort of environment, which in this space could be very advantageous.

So, in some detail, the concept we have right now, though again this all depends on what the scope for a roundtable on subsurface data might be: we've envisioned something around 25 members, again a combination of sponsor members and volunteer experts, and, approximately, based on some of our previous activities, two workshops a year. We can do webinars, and there are other ways to do outreach, so we wouldn't be limited to just workshops or webinars. We're proposing a time frame of about three years; that's what we did with the past roundtable, which actually extended into four, for a variety of reasons. Many roundtables at the Academies (don't worry, I'm not thinking along these lines right now) have been around for decades, because there's been interest in those particular fields. There's no reason that something has to be two years or five years or ten years; it depends on the spirit and the interest of the roundtable. Sometimes there's a natural sunsetting period, sometimes there's a reason for them to go on, but right now we're envisioning about three years, to see where it heads and determine along the way whether there's need for redirection or, again, reason to wrap things up at the end of that initial period. The staffing, as I mentioned, draws upon several Academies boards and committees, to take advantage of the cross-sectoral and disciplinary expertise we have in the institution.

Just quickly, about audiences and products: the audiences are something where we have an advantage at the Academies, because we can reach out to a broad suite of people, in federal and state government, industry, the research community, NGOs, the non-technical public, and even international communities, as appropriate. The products can be varied; this is not an exclusive list, but workshop proceedings, webcasts, videos, various sorts of short highlights, infographics, a whole suite of different products. It really depends on the needs for those products from the activities of the roundtable. And there's something that's harder to put numbers on, because it's the relationships, and helping to inform and nucleate other activities, which you can't always tie directly back to the roundtable, but you see the link, the chain, that helped to create the relationships that triggered some other types of research or analysis. One thing the roundtables don't do is issue advice or
recommendations. The roundtable itself isn't writing these documents; it contributes the information to help bring those activities to light, but the writing is done by staff, staff rapporteurs, and our communications colleagues. And just to wrap this part up: a roundtable convenes participants from a range of sectors; it's widely inclusive in seeking broad input in public meetings; a big hallmark is that it's a really agile mechanism for addressing a rapidly moving topic like subsurface data; and it can be regionalized and tailored in an ongoing way, where the roundtable members really drive the agenda and the direction it takes. So with that I'll wrap up and turn things back to you, Jim.

Okay, thank you, Elizabeth. I thought, before we go into a discussion on this topic, it might be useful, since we have a couple people that were very active in the unconventional roundtable, to ask each of them for a couple minutes of comments. Wendy, I'll put you on the spot first, because Wendy not only participated, she played a leadership role for the whole unconventional roundtable, so we'd appreciate your thoughts on some of the opportunities that roundtables offer.

Well, thanks. The roundtable took me by surprise, I'll be really honest. It's exactly as Elizabeth said: it has an uncertain initiation, but it rapidly picks areas of interest to this broad group and pursues them very aggressively. So it was quite open ended to begin with; it wasn't in the end. And there's another thing that surprised me and that I greatly enjoyed: this was a diverse group of people. I just said we always have to remember our science is in context, and this is a really nice context in terms of the people we interfaced with, from NGOs to other universities to some of the local, state and national organizations. It was tremendous, and I think it allows everybody to see a different perspective. We had two or three gatherings here in DC, but the third thing that I really liked about this roundtable was that we went out, and we took a message that was relevant to different parts of the country. For example, with unconventional hydrocarbons, we met in Pittsburgh one time and then in Midland, Texas, and those are two completely different environments in which to discuss unconventional hydrocarbons. I imagine, Elizabeth, you have gathered information about how the audiences perceived us, but as a participant, this was enormously stimulating, to me and, I presume, the rest of the group; everyone was very engaged in it. So I suppose that's it: I have nothing bad to say about it, but I was very surprised in a pleasant way; I didn't know what to expect. Thanks, Wendy. Dave?
Yeah, I thought it was a wonderful, broadening exercise. One general message I took from the roundtable was the importance of defining some of the key areas to attack and zeroing in on those; eventually that provided a really rich landscape to talk about. And this is a different problem in front of us than unconventionals, which was about problems in a different arena. There was also the desire to connect with communities, with policy makers and decision makers; I thought the town halls, for lack of a better word, were really effective, and illustrated the power of the roundtable in terms of how to actually take the workshop information and move it toward some guidance, along with informing people. When I think about lessons learned, it struck me that, in the context of the workshops and the town halls, it really distilled down to sort of a SWOT analysis: we looked at strengths and weaknesses of these particular areas; opportunities were identified with respect to panels and people that talked in areas that needed to be explored more; and then we looked at threats, in terms of what really provides barriers and difficulties in realizing how to communicate this important topic area. So I think the roundtable that might emerge from this particular meeting will be somewhat different, right? It's not about the public so much, and yet the outcomes from machine learning, in terms of how they're applied, actually do affect people, particularly how this activity is applied to industry outcomes and how it affects education and learning. It's a different kind of community, but I think some of the same things we did with the unconventional roundtable (and Wendy can correct me) really provide a nice definition of what we might try to achieve with this one, in terms of lessons learned. It was really well organized, as usual, by the National Academies; it was a wonderful experience, and a lot came from it. And my final take is the ability, through the National Academies and the way they structure this, to connect people and catalyze conversation; that was the key. That's what I envision will happen with this one: catalyzing a much more directed conversation that really has meaning and teeth, that people come away with in a positive way.

Thanks, Dave. I'm going to put Mr. Mayer on the spot. Yon Mayer joined us for this session. Yon is with Resources for the Future, and Yon was very active in that roundtable, so, Yon, we'd appreciate any thoughts.
I think everybody knows that Resources for the Future is essentially a nonpartisan economic think tank; we're not a science organization. I was at the three seminars we had here in DC, and participated, I think, from almost the first meeting of the roundtable group, but did not go out to the three events, one of which was in Denver, in addition to Pittsburgh and Midland. I support everything that Wendy and Dave said. It was an extraordinary experience for me as an individual to listen to and learn from the vast variety of skills and experiences in the room: government, non-government, industry. And what that group was particularly focused on, and I think helped to do, was to try to put out into the public domain information about unconventional development, about which there is enormous ignorance. So I think this helped. It's not the only thing that's needed, but it's a piece of what's needed to bring our country along to understand one of the great opportunities we have; but it's got to be done responsibly.

Thanks, Yon. I think we're going to transition into the conversation a little bit, but is there anybody else that wanted to speak about the past roundtable experiences? I don't want to miss anybody. Then we'll transition. Okay, Dorothy, you had something you wanted to add on that?

I heard about the past roundtable through our meetings over the last six years I served on the committee. I did attend part of one of the meetings here in DC, and I attended online one of the workshops early on. This is not my field per se, unconventional hydrocarbons, but I certainly picked up on the excitement, and listening to everyone today, it occurs to me that it's much like drawing more and more entities into the whirling vortex, and then speeding up that whirling vortex as much as possible. That's what the roundtable seems to me to be able to do: to draw more people into it, very much like that vortex, in many ways.

Good analogy, good analogy. And I see Lance smiling back there; he's pretty excited. I'm excited about that slide; I see a lot of use for it. Okay, what I'd like to do, and this is kind of a transition here: DOE was a key sponsor of the unconventional roundtable, along with some other agencies, and I think they've indicated that they'll hopefully have some interest in this new endeavor. So I wanted to offer, I think Grant was willing to make a couple comments; Yinka's here, I don't know if you wish to; or I'll even prevail upon others, because, see, I'm part of that former DOE fraternity, and so is Doug Hollitt; Doug was actually at DOE for some of that time. So we'll start with Grant, and if anybody else wants to jump in, either from their DOE experience or after, we'd welcome them.

So I guess what I can add to this discussion is that DOE is definitely interested; this is right in our space. In fossil energy we're looking very closely at developing machine learning capabilities for a lot of these different problems related to carbon storage and to improving recovery from oil and gas reservoirs, particularly in unconventionals, but also enhanced oil recovery and things like that. A big challenge is exactly what you are proposing to talk about, and it's one that's been around for a long time: how do we get the data that is available, and how do we get it in formats that other people can use? And I think this is something we've seen even hearing people in the
industry talk about the challenges with data that exists out there. I've heard people talk about, well, my company bought that company, which bought that company, and so I have the same data from the same field in four or five different formats depending on when it was taken, and I can't figure out how to get those formats to talk to each other. Some of those things are being worked on now, but there are still a lot of challenges in standardization, in encouraging sharing, maybe anonymization (Kelly talked about anonymizing the data), finding ways those data can be shared. One of the things DOE has invested a lot of energy in is high performance computing and the development of simulation capabilities for a lot of these subsurface systems, and if we want to apply these machine learning approaches to these systems, which is what we're trying to do, we need to figure out how to get access to valuable data that we can use our capabilities on. So for fossil energy, certainly, I can see there being a lot of interest in exactly what you're talking about: finding ways of standardizing data, sharing data, and catalyzing this discussion, because it has to be a discussion between industry, the researchers and government about what can be done in the near term and maybe in the long term. Thanks, Grant.

I think we're going to open it up. I know a couple of people are on a tight schedule, so I want to make sure they have an opportunity. I know you potentially had to leave early; did you want to add anything? You had some thoughts at the break that you might want to share.

Okay. I do have a thought, not so much along the lines of what I shared with you at the break. It's more something that I've been thinking of as I interact with universities around where I work: the expression of a need for data sets to train and educate their students in machine learning. Often I hear the question of whether there are data sets from the oil and gas industry that could be provided to universities, and indeed there are. I'm familiar with the data sets that have been released from the North Sea by certain countries, like the Norwegian Petroleum Directorate and, more recently, the UK Oil and Gas Authority. I would say those are wonderful data sets; everyone is focusing on them, they're becoming the classics. I know there may be some data sets in the US that are also public; I think Wyoming is one of them. But they are not so easy to use and to work with, and they do require some domain knowledge, if you like, on how to go about understanding the context, and perhaps even curating them. These publicly available oil and gas data sets could provide the ground, actually, for data that can be downloaded by university groups interested in educating and exercising machine learning on subsurface data. If I digress here a bit: that actually presents a great deal of challenge relative to what the students are practicing on now. When you do a degree, let's say a master's degree in data science or so, you often go through (and that's what we've seen through interns that we
Thanks, Grant. I think we're going to open it up, but I know a couple of people are on a tight schedule, so I want to make sure they have an opportunity. I know you potentially had to leave early; did you want to add anything? You had some thoughts at the break you might want to share.

Okay. I do have a thought, not so much along what I shared with you at the break; it's more something I've been thinking of as I interact with universities around where I work, and the expressed need for data sets to train and educate their students in machine learning. Often I hear the question of whether there are data sets from the oil and gas industry that could be provided to universities, and indeed there are. I'm familiar with the data sets that have been released from the North Sea by certain countries, like the Norwegian Petroleum Directorate and, more recently, the UK Oil and Gas Authority. Those are, I would say, wonderful data sets; we're focusing on them, the classics. I know there may be some data sets in the US that are also public; I think Wyoming is one of them. But they are not so easy to use and work with, and they do require some domain knowledge, if you'd like, on how to go about understanding the context and perhaps even curating them. These publicly available oil and gas data sets could provide the ground, actually, for data that can be downloaded by university groups interested in educating and exercising machine learning on subsurface data.

If I digress here a bit: that data actually presents a great deal of challenges relative to what the students are practicing on now. When you do, let's say, a master's degree in data science, you often go through several projects, as we've seen through the interns we hire, where you practice machine learning, but often on data that is available in the public domain with universities: transactional data and social media data. Everybody does that almost to death, and the students don't get confronted with the richness and the challenges of data sets such as ours, coming from sensors with all their imperfections.

So going back to the thought I had: perhaps this roundtable could discuss, and perhaps people could help articulate, a collaboration between universities on one hand and oil and gas operators on the other, and perhaps the national labs would also join in, for some sort of win-win situation where each one brings in, if you'd like, their capability and time in kind around the data sets I mentioned that are publicly available, and works out solutions to particular problems that you decide on. So that's one proposal, if you'd like.

Great. I want to remember we do have some folks online, so, Eric, I assume they have a way to tell us if they want to weigh in; for those of you listening online, we want to encourage you to be part of this dialogue as well. And you're going to find out a little of the method to my madness: if I talked to you during one of the breaks, you are in danger of being called on. So, Thomas, we had a really great conversation this morning about who was missing at the table; could I prevail upon you to share some thoughts on that?

Okay, yeah. I actually do have a few points. The first is that it has not been mentioned today, but there is a major data formatting and curation initiative in the industry called the Open Subsurface Data Universe. Close to a hundred different companies have signed up, including, I think, pretty much all of the major operators and most of the major service companies, including Schlumberger. I'm not aware, and maybe I've just missed it, of any national laboratories or governmental bodies participating in this consortium, but if the labs are interested in where common data formatting and common curation standards are going in the industry, they might look it up; it's under the sponsorship of The Open Group as an open source initiative. An interesting feature is that among the most active participants in this consortium are cloud service providers. The cloud service providers are in the process of setting up major oil and gas verticals; they believe that management of oil and gas subsurface data is a major business opportunity, and they are actively promoting the ability of their cloud architectures not only to facilitate the management and curation of data but also to offer native machine learning capabilities that can be applied directly to data stored in their clouds.

I have a couple of other observations. The first is that there is a lot of talk about subsurface data, and one question of scope I would ask is this: is the scope of data to be shared or discussed restricted to United States onshore and the economic zone waters of the United States, or is there an ambition to go broader than that? If there is, one has to be aware of the fact that every country has a different attitude toward the degree to which its sovereignty includes sovereignty over subsurface data in its territory, and that is a matter of great sensitivity in a number of nations, including nations in this hemisphere.
The second observation I would make is that I have heard today, from what I would characterize as the research community, a number of statements of the form: well, if we could only get industry to share their data, we could do so many interesting things. That is a proposition that has zero value for my shareholders when stated that way. A couple of years ago I actually led a team that looked at under what circumstances we would be willing to share different data types; we looked across the corporation at upstream, downstream, and petrochemicals data types, and the answer was: it depends. The degree of proprietary sensitivity is very different depending on the specific data type you are looking at, the degree to which revealing it can reveal proprietary methods and technologies, and the degree to which revealing it can compromise an exploration or development business plan. I found that a very interesting exercise. I think it would be interesting to have conversations about where sharing more data potentially brings value to commercial enterprises, because I think there are some cases like that, but for us, such a conversation has to be framed around the value that sharing can bring to us. Most of the data we acquire is very expensive to acquire, and by and large we are not interested in giving it away unless we see that as a monetizing act.

Thank you. Let's open it up. Doug?

I just wanted to echo, maybe in a slightly different way, what Thomas is saying. Publicly traded companies have liabilities and exposure in a number of different areas, so there is probably a need to dig even deeper: if data is made available, is there a liability on the environmental side, on public disclosure? There are some very complex issues here that actually go outside of subsurface and really go to corporate governance that would have to be considered, and that's almost a sidebar piece of this. But I also agree with the notion of making sure that this is framed for a complex set of stakeholders: why should you care, why is this important, not just to the subsurface community, because we do a pretty good job talking to ourselves, but to a much broader population that really should care about the geosciences, about the subsurface, and about what we're doing with big data or machine learning types of data treatment. Articulating that is really important, and it's something we struggle with a lot; we explain it really well to ourselves, but I don't think we reach out.

The other part, maybe connected to that, is ensuring that this effort is interlaced in certain ways with some of the big initiatives that are sitting out there. I was talking to Wendy: there's a Commerce Department study that came out, I think yesterday, on critical materials and critical minerals, establishing a framework for a cross-governmental approach to tackling that problem. When I read through it really quickly, there's sort of an assumption of, well, we know where all this stuff is, we just have to go get it. So explaining how this interlaces with a very rapidly oncoming issue, and this is just one of a lot of issues going in that direction, why should funding agencies care, why should the public care, and making sure that's constantly right out in front, is really, really important.
I think the academy does a really good job of that, but in this case, because we're talking about something that is, pun intended, a little bit out of sight, out of mind, it's especially important.

Maybe the last one ties to how we use this data. To me, an ongoing problem is making sure that we do a really good job of explaining statistics. Folks tend to want a single answer when we talk about the subsurface and what that prize is, what you get out of it. If somebody asks for a resource number, we tend to give them a distribution, and the general population hears one number. In this case I think it becomes increasingly important to be able to explain what the statistical treatment of the data in the subsurface is, how we do it, what it means, and how you want people to understand it. This may be a lost cause, it's always hard to explain statistics, but it's really important as you go into this next generation of utilizing and really exercising really big data.
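A minimal sketch of the "distribution, not one number" point, using a toy Monte Carlo volumetric estimate; all input ranges are illustrative and not drawn from any real prospect or study.

```python
# Minimal sketch of why a resource estimate is a distribution, not one number.
# The volumetric inputs and ranges below are illustrative only.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

area_km2  = rng.triangular(5.0, 12.0, 25.0, n)   # prospect area, km^2
thickness = rng.triangular(10.0, 30.0, 60.0, n)  # net pay, meters
recovery  = rng.uniform(0.05, 0.20, n)           # recovery factor

# A toy "resource" in arbitrary volume units: area x thickness x recovery.
resource = area_km2 * thickness * recovery

# Industry convention: P90 is the conservative (90%-chance-of-exceeding) case,
# so it corresponds to the 10th percentile of the simulated distribution.
p90, p50, p10 = np.percentile(resource, [10, 50, 90])
print(f"P90={p90:,.0f}  P50={p50:,.0f}  P10={p10:,.0f}")
```

The spread between P90 and P10 here is several-fold, which is exactly the information that is lost when the public hears only a single number.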
Thanks, Doug. And feel free, if there are open seats at the table that you'd like to move up to, please do. And gentlemen, we don't mean to leave out the geothermal program at DOE, so we would love to hear from you as well.

That builds perfectly off of my question, which is this: in the subsurface we have mining here, we have carbon here, we have geothermal here, and then there's the tool. One of the reasons Virginia Tech is interested is that machine learning is really a hot button for us; we have this new innovation campus coming up here in the D.C. area, and we're going to bring all these people here to do these cool things, machine learning, data science, and so on. So I'm trying to understand broad versus specific, and I'd like some opinions from the group on where this goes, because to me those are different problems, but they're solved somewhat the same way. It's unclear to me where we have one big elephant to eat one bite at a time. I'd like to hear some discussion around that.

Okay. And Elizabeth, maybe you could address that, because I think we haven't talked about your other hat, where you're on the water board, and where some of the issues such as water and environmental issues and some of those other aspects could be folded into this.

I can say a few words; it's a great question. I oversee two different boards at the academies. One is the earth science board, the Board on Earth Sciences and Resources, which this committee sits under, and the other is the Water Science and Technology Board, so I'm always thinking about both of them, and water is very important; subsurface fluids and rocks, I don't separate them, they're part of that system overall. Conceptually, looking at the broad suite of what's been brought up, on why one should care: our understanding of the subsurface is not what it could be, if we abstract from the specific problems the oil and gas industry wants to solve, or the mining industry wants to solve. We don't understand a lot about the subsurface, and understanding it could probably help us generally with a lot of different sorts of problems. That affects the extractive industries community, the water community, the waste disposal community, and the geotech community, because they're trying to build stuff and stick stuff in the ground, and we need to know about rock properties: how a rock is going to behave, what happens when we do this to the ground, whether we see changes that we have to monitor, and over what timescales we monitor them.

So in the broad, gee-whiz sort of way, we would like to be able to think about the roundtable helping us look at a suite of interesting problems in the subsurface that could have applications across broader scales, while using certain more focused sessions, if you will, to address a particular problem that has interest from oil and gas, or the carbon area, or the water community, but could have broader applications. As I saw this morning, and I don't want to speak for everyone in the room, there are learnings to be had from looking at someone else's problem; sometimes it's easier to see someone else's problem clearly than your own. So if we can get interest from some of those other sectors in the roundtable, I would like to do that. But because a roundtable really does drive its own agenda, if we don't have strong representatives from some of those other sectors on the roundtable, it would be a question of whether we would be able to go down some of those other paths. So I would say yes, hopefully, because I think there's a possibility to learn a lot there, and I wouldn't want to leave those things out, but it will depend a bit on the membership we're able to bring. Does that help? Yeah, okay, perfect; join the table.

You said there's no advice or lobbying that comes out of these. I would wonder if the roundtables could express needs; for example, one can imagine formatting or data-sharing standardization tools as one we've already started to discuss, but also needs in terms of infrastructure for the broader community to make steps forward. There's a difference between advocating and making recommendations versus pointing out where gaps are. Where is the boundary on those in terms of the role of these roundtables?

That's a really good question, and we wrestle with it as a staff all the time, with all of our projects; even advice giving can quickly move into the advocacy framework, so there's a lot of gray. But in terms of what a roundtable can do through its various activities and the products from those activities: if the activities are structured well, gaps and needs come out of them. We always try to structure activities so that they come to that stage, so that at the end of the day, when people are saying, well, this is what I heard, and this is what I heard, you get that set of common points. Those do come out of the activities, again because of the way we structure them.

If I could add: you could envision, and I'm not saying this would happen, that there is a specific focused area where we say, gee, this needs a National Academies consensus study, and a consensus study, which would then launch into another significant task, is a process in which you can come up with consensus recommendations. It seems natural to me that identifying infrastructure needs is a really obvious result of some of these.
But part of what this committee has struggled with for a couple of years is that we're not at a point where we can define that well enough to suggest or propose that this would be the way to do it, and so that's the evolution: the roundtable allows this community to find how that further work should be done.

Tom Crawford with the USGS Minerals program. On Doug's comment about the recently released critical minerals strategy report: an offshoot of, or a related action to, that is this new program that USGS launched just this year called Earth MRI, the Earth Mapping Resources Initiative. It's a kind of three-legged stool of a program involving airborne LiDAR, geologic mapping, and airborne geophysical surveys, and the intent of all of it is to better inform the nation about its critical minerals potential. We're talking about looking underneath the surface of the earth and doing that better, more intelligently, more efficiently, so what you're talking about with this roundtable feeds very directly into that. And we're faced with the issue that this program is going to be generating some pretty large amounts of data that we're going to want to serve up in the right format and make available to the public at large. So I'm quite confident that USGS would be an enthusiastic participant in these roundtables; I don't see how we couldn't be.

Thanks, Tom, and that's very helpful. We want to continue the discussion, but it is helpful for the committee and for the academy staff to hear if your organization is really interested in being included in this discussion, and I'll remind people of that when we get to the wrap-up, so that Elizabeth, who is keeping notes, will have a list of folks to follow up with. So, others that wanted to contribute? And I'm not opposed to calling on committee members either. Carmen, you look like you're ready to go.

I had a question, and you touched on it a little when you were talking about the water board. I was thinking back to some of the work we do focusing more on environmental protection. For instance, in some of the coal mine areas we would look at techniques for identifying strata with high connectivity, and could we then use that information as a way to change how you would do some of the mining plan or some of the reclamation practices? As I was listening to some of this, I was thinking: what is being done, or being looked at, to use this kind of information for more ecological protection as you're doing things like that?

Doug, just another set of thoughts. I was maybe not surprised, but you mentioned earlier that the unconventional effort took three, almost four years; not the study, but the process took that much time. I'm just thinking of what's going to happen over three or four years within this broad topic; it's going to move and change a lot. I guess I'm asking: how do you make sure that what you're doing on day one is still relevant at the end of the fourth year? Big data approaches to the subsurface are really a revolution in their own way, I think on the same scale as shales have been in the oil and gas production sector.
Now granted, that took longer than four years, but the rate of change was extremely fast, sort of hard to hang on to. I think the same thing is going to happen here, and that needs to be baked into the process in some fashion.

Just to add: I think it was interesting that even with the unconventional roundtable we saw that evolution. When they met, they really only planned out the first year, the first couple of workshops. They had a number of ideas on where to go, they settled on a couple, and they planned those first two workshops, and then after the first workshop they reworked their plan. So I think it's somewhat baked into the roundtable process, but that's probably an important point to highlight, particularly when you think about the membership of the roundtable and making sure that you have the diversity of members to build in that evolution.

That's a good point, and I would echo that comment, because when I look at that unconventional roundtable, when we started, water reuse was just taking on some new dimensions, but by the end of the roundtable we were hearing how water reuse, and its impact on economics and also the environment, had really matured over just that three-and-a-half or four-year period. So that's an excellent example of the fast pace Doug is talking about.

I just want to make a quick anecdotal comment: I think this particular roundtable could be the glue, or the carbon-carbon bond, that links what some of us used to call SECURE Earth. Some of you were part of SECURE Earth back maybe 15 years ago or less, where we tried to imagine connecting all the different subsurface arenas in a holistic way to move forward how to model them; it kind of emanated out of the national labs, out of Bo Bodvarsson's group, and out of Berkeley. Here now we have an interesting development with machine learning that can actually bring all of those areas, water, contaminant transport, oil and gas, geothermal, into focus in terms of how we move forward to really understand them. And when I think about the unconventional roundtable and its impact, all of those areas impact our economics, impact people. So there is a connectivity that I hadn't really thought much about when I started listening to these conversations today, but now it really comes full circle for me: this machine learning could be the glue that connects all of those in a way that's meaningful, that really moves science and technology forward. So I really endorse where this could go; again, just anecdotal.

I noticed in the document at the back of our little booklet, and also in Elizabeth's presentation, that the words machine learning are absent. So are you envisioning, or is this room envisioning, a roundtable that is about subsurface data, or about what we do with the subsurface data?

It's a good point, and it shows the evolution of this: when it started a couple of years ago, it was very focused on what the data sets are and how we get access to them. It actually ties into a conversation that Samin and I had at the break that he didn't mention: the importance of keeping machine learning and the subsurface data issue linked together, and that it is important to look at both of those issues. I think that was one of the opportunities we saw as we started working on this meeting as a committee.
Samin, do you want to add any more on that? I don't know if there's more to add, but I wanted to give you credit, because one of the things that came out of today for me was the value of that linkage of those issues. It doesn't mean, for instance, that the roundtable might not have one workshop focused on subsurface data and a separate one on machine learning, but you still have to keep both areas together in the process.

Following along the lines of what Thomas mentioned about the Open Subsurface Data Universe, something else to be aware of is that SEAM, the SEG Advanced Modeling group, has a project set up that will kick off in about a month and a half, called the SEG AI cooperative project. Its three main focuses are: defining global formats and standards for data exchange in cooperative projects and research communication; defining important AI challenges in oil and gas geophysics, and creating and distributing suitable test benchmarks for public use to assess the results of AI applications on the data; and establishing and managing cloud collaboration networks for the advancement of AI in applied geophysics. This seems very related to the discussion, so it's something I think you should be aware of.

And I think the intent here is to be a catalyzing force, not in any way to do original work, so this would hopefully be mutually beneficial with those kinds of initiatives.

Two things to note; I don't know if they would end up being relevant or of interest. Another area with strong proprietary data-sharing challenges, but of a different sort, is the healthcare electronic medical record world. It's largely driven by Cerner and Epic, the two big companies, but it has spawned a whole computer science research crosstalk kind of network; the FHIR ("fire") standard is one example. I don't think it's a direct parallel, but it's another world where there are legitimate financial reasons for not sharing and yet there's interest in sharing in some way, so that may or may not be of interest for discussion. And another area, since anonymized data was mentioned a couple of times: there's also differential privacy, which is a stricter restriction; there's very theoretical computer science research about it, but the US Census is trying to put it in place for distributing data from the 2020 census, and it's been very controversial. John Abowd, the chief scientist at the Census Bureau, has been very open about talking about it and about trying to make it practical for their world. Making it practical in this world would be a whole other thing, but his perspective on that, and on how to communicate it, might be of interest too. So: differential privacy for data sharing, and the electronic medical record comparison; those were the two things that came to mind.
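For readers unfamiliar with the idea, a minimal sketch of the Laplace mechanism, the textbook building block behind the differential privacy just mentioned; the counts and epsilon values below are illustrative only.

```python
# Minimal sketch of the Laplace mechanism for differential privacy.
import numpy as np

def dp_count(true_count: int, epsilon: float, rng=np.random.default_rng()) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record changes
    it by at most 1), so adding Laplace(1/epsilon) noise satisfies epsilon-DP.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon means a stronger privacy guarantee but a noisier release.
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(1_000, eps), 1))
```

The controversy the speaker alludes to lives entirely in that trade-off: the same epsilon knob that protects individuals also degrades the released statistics.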
Thanks, David. Can I ask you, as someone who individually represents a state geological survey: USGS has weighed in, and it's not that USGS doesn't have a long history of storing subsurface data, but obviously the state geological surveys have been doing that for a long time too. Do you have some thoughts on the value of engaging state data?

The primary thing that comes to mind is that every state is different, and we've got 50 states. In some states there are vast quantities of oil and gas data stored in the state geological survey; in other states that's embedded in a regulatory agency. Some states have made great strides in digitizing all of their subsurface data; in other states it's still in file cabinets and boxes that are perhaps dusty and eaten by cockroaches and rats. Seriously, every state is in a different place in terms of its regulatory structure and its commitment to digitizing and modernizing its data system. I see that as a major hurdle to overcome: this is public information, but it's not public if you can't get it. So the variability in accessibility is going to be an issue, and probably both an opportunity for data access and an opportunity for machine learning to try to quality-manage that old data.

I love your sunny outlook.
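A minimal sketch of what "machine learning to quality-manage that old data" might look like in practice: an off-the-shelf anomaly detector flagging suspect values in newly digitized records for human review. The data is synthetic and the columns are hypothetical; this is one possible approach, not a prescribed method.

```python
# Minimal sketch: flagging likely transcription errors in digitized records.
# Synthetic data; column meanings are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Pretend these are digitized (depth, temperature) pairs from old paper files,
# with a few keypunch-style errors injected.
records = np.column_stack([
    rng.uniform(0, 3000, 500),   # depth, meters
    rng.normal(25, 5, 500),      # temperature, deg C
])
records[::100, 1] += 200         # simulate transcription errors

flags = IsolationForest(contamination=0.02, random_state=0).fit_predict(records)
suspect = records[flags == -1]   # -1 marks likely outliers for review
print(f"{len(suspect)} records flagged for human review")
```

The point is not that the detector is right; it is that it triages decades of file-cabinet data so the scarce human experts only re-check a small fraction of it.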
Yeah, I would echo that statement. From a DOE perspective, representing geothermal: we did have a follow-on effort building out a national geothermal data system, where we were working with many states, maybe not every state, to populate that data, and we definitely ran into the same challenges. But we would be supportive of continued efforts to collect that data and standardize it, working within the state guidelines. The other thing I wanted to loop back to: we were talking about ecological impacts and human impacts of subsurface activities. In geothermal we are often working in the western USA, where water is scarce, so something we're definitely interested in is not consuming potable water in particular. And the other item, which has been a bit transient, is induced seismicity as it relates to geothermal; that's come up more than once as we look at well stimulation. There have been international issues with some stimulations, in Basel and in a recent case in Pohang, Korea, and it's something we always want to keep our eye on. From a technical standpoint, induced seismicity is in many cases a diagnostic tool we use to understand where our fluid is flowing, but we need to manage it from a public standpoint and reassure the population that we are not trying to cause any earthquakes that would be sensible or cause a problem for the community. So I think those ecological, biological, and human impacts are also very important.

Thank you. I want to make sure everybody has an opportunity to engage, but I don't want to keep us here if we've covered the ground, so let me ask whether there are some additional comments that folks want to make; if not, we'll move a little into next steps. But Yinka would like to make a comment.

Well, thank you. Just one quick addition to what has been said here, particularly in regards to machine learning. You realize that the stakeholders we are dealing with are very diverse, and the biggest portion of that is the public. There's a different conception out there about machine learning in terms of taking jobs from people. Is it reasonable for the roundtable to think about this or address this down the road? It's something that may come up in your public engagement. And on that note, I think the unconventional roundtable was easy to sell, because it's resources you're talking about, and there were some controversies surrounding that; one main purpose of the roundtable was to alleviate the concern the public has about developing the unconventional resources. In this particular case it would be harder to sell, so I'm not sure whether, with machine learning added in, the roundtable would lose the ability to do just that. I'm still struggling in the back of my mind about that.

Just one quick point: the comment earlier about the SEAM project on the oil and gas side made me think that there's nothing equivalent on the mining side, or I think in any of the geosciences outside oil and gas. The idea of having some benchmark synthetic models, with a sufficient level of detail, with synthetic data sets derived from them, that can serve as benchmarks for any of the researchers or people inside companies doing research with these machine learning algorithms, would be fantastic. If this kind of forum can do anything to facilitate that in any way, I think that would be tremendous, because then you actually have some metrics by which you can measure these algorithms.

Thank you. I had the same thought, and I was going to contribute that too, and share with you something from a talk I attended a few days ago; it has to do with benchmarking data for looking at the performance of deep learning algorithms at recognizing objects. There is this data set called ImageNet that is used widely, where teams, from academia in particular, compete to see whether a deep network can achieve better performance than humans, and indeed it's been said that one of them achieves about 96% performance, compared to around 93% or so for a human. There is a professor at MIT, Boris Katz, who has been busy over the past few months creating a new data set of the same objects that are included in ImageNet, except that, through crowdsourcing, he asked for the photos to be taken somewhat unconventionally. He then took a bunch of these deep learning algorithms to see how well they perform, and they failed miserably, about 46% below that level of performance. This just highlights the fact that ImageNet reflects a particular way of taking images, biasing, if you'd like, the multi-dimensional space; the deep learning algorithms learn that bias and perform very well, but when presented with data outside that realm of parameter space, they fail miserably. So just to re-emphasize: in terms of topics and themes the roundtable may address in the future, in addition to synthetic data, think of real data that could be used as a benchmark for the performance of deep network algorithms, which hopefully can also address the questions of bias and interpretability, I think. So thank you, and I'm sorry, I'm guilty of not using my mic.
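A minimal sketch of that benchmark-bias failure mode on toy data: a classifier that learns a shortcut feature present in its training set looks excellent on data drawn the same way, and degrades sharply when the shortcut is removed, which is the same pattern being described for ImageNet-trained networks. Everything here is synthetic.

```python
# Minimal sketch of distribution shift / shortcut learning on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(n, spurious_strength):
    y = rng.integers(0, 2, n)
    sign = (2 * y - 1).reshape(-1, 1)
    core = rng.normal(0, 1, (n, 1)) + 0.8 * sign   # weak but genuine signal
    # A "background" feature that tracks the label during training but not in
    # the new benchmark -- the analogue of conventionally framed photographs.
    spur = spurious_strength * sign + rng.normal(0, 1, (n, 1))
    return np.hstack([core, spur]), y

X_tr, y_tr = make_data(2000, spurious_strength=3.0)   # strong shortcut available
model = LogisticRegression().fit(X_tr, y_tr)

X_in, y_in = make_data(1000, spurious_strength=3.0)   # same bias as training
X_out, y_out = make_data(1000, spurious_strength=0.0) # shortcut removed
print("in-distribution accuracy:", round(model.score(X_in, y_in), 3))
print("shifted-benchmark accuracy:", round(model.score(X_out, y_out), 3))
```

The gap between the two printed accuracies is the toy analogue of the roughly 46-point drop described above, and it is exactly what a well-designed subsurface benchmark should be able to expose.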
Any other themes or issues that people would like to put forward? Okay, well, I think this has been a really great conversation about how we might get started, so I'll cover a couple of next steps. Getting those topics and themes down is important, so if you have other thoughts or information, feel free to submit them to Elizabeth and we'll fold them in. This committee is meeting in executive session tomorrow, and we're going to try to take all this information, because one of the things we need is our page and a half covering both subsurface data and machine learning and capturing these thoughts and ideas; that will then be turned into more of a scope paper for a roundtable, and we'll go through a process. We already have some folks who have expressed interest, but I would ask: if you are interested, please make sure that Elizabeth or some of the other staff knows of your interest, or feel free to mention it to me, and we'll put you on a list of folks who are interested, to make sure you're getting follow-ups on this.

As I mentioned, we need to have some agency sponsors. One agency that is not here, and that we will follow up with, is NSF, which has in the past added some very thoughtful input when we've had these conversations and has very much expressed interest in being part of the discussion. And what's become really clear as you look at this: we obviously have sponsors in the Office of Fossil Energy and the geothermal office in the renewable energy group, but there may be others at DOE, because, as was mentioned, this is such a cross-cutting issue, so we'll work with those two groups at DOE. We'll work with USGS, but my guess is there could be interest from some of the other entities at the Department of the Interior, so we may want to get some guidance from USGS or others about engagement there. Those who have been involved in some of the roundtables before are clearly obvious ones for the committee and staff to reach out to, but as we discussed, it's going to be important to make sure we have the right representation, so we would look for ideas from folks on who that should be. And Thomas, maybe I could prevail upon you for a contact with some of the cloud service companies, because we made some attempt to reach out to those folks and didn't get that lined up before this meeting; as with most organizations, it's a matter of figuring out the right place to reach in and the right referral, so that would be much appreciated, because I think that's important.

Equally important, as we talked about: there are a number of academic institutions around the table, but I think Jan, who has left, has been a conduit into the NGO community, and even though this may be a little different from the unconventional roundtable, I think there could be a very interesting engagement there, whether it's through some of the foundations that participated in the unconventional roundtable and are very supportive of sustainability, like the Mitchell Foundation, or some of the others that see the opportunity this offers for science to help inform good decision making; they could potentially be supportive and add that broader public policy voice. So those are all issues we will work on. The last point is that the committee, working with the staff, will develop a real prospectus, and it's a bit of a chicken and egg there, because we'll probably start reaching out while the prospectus is still a bit of a work in progress as we talk to people and get feedback. So that's the path forward.

Just to give you a sense, and I don't think we mentioned this today: we had over 40 people participating online and about 50 people in the room. That, I think, really communicates that there is significant interest in the topic; we generally don't get quite that large an online audience for a meeting,
and this has been very good in-person attendance today, so I appreciate that. Elizabeth, any other comments on the next steps?

What I'll share primarily follows on from some of Jim's comments. In the past roundtable, in order to get the scope of the task in line, we did a similar process: we had a one-pager that we started with, and that was not the one-pager that we ended with. We had a conversation like this one, where we gathered together a lot of stakeholders with interest in the topic. This is always where it gets awkward, because, thankfully, Jim handles most of this part, but there's obviously a need for financial support to start the roundtable. There's also a whole suite of others involved in the roundtable process, both those who helped us get the last roundtable up and running, and those who, if they didn't become volunteer members of the roundtable themselves, pointed us toward those other volunteers and helped us identify other people. So there's a community aspect to developing what that task should be, and we'd really appreciate continued engagement from those of you with an interest in helping develop this into something that could be viable, could get some traction, and then eventually get up to speed. It doesn't mean you're making a financial commitment, that's what I want to say; it means that you're interested in the topic and in helping us frame this in a good way. All the questions you raised are very good, and we don't have answers to all of them yet, obviously, because it's a very big space. You don't want to make it too big, so that it's unwieldy; you have to give it enough room so that there's flexibility, but enough direction so that you can actually launch and try to accomplish some things in the first year, the second year, and the third year, depending on how long the roundtable goes. So we really would appreciate those of you who are interested staying engaged and helping us continue to think about this.

We'll do another round of adjustments, and for those of you who'd like to take part, I'll be reaching out to you after the meeting, and I won't be a pest, I promise; your inboxes look like mine, I'm sure. If you are interested in continuing with the work, we'd really appreciate that feedback, at whatever level of engagement in terms of contributing to the framing you're capable of or would like to do, because at the end of the day it ends up being more of a community document, having had input from this range of experts. I don't have all of the perspectives; no one in here has all of the perspectives. It's like machine learning, isn't it: we get the different perspectives and we start to narrow down to a common solution from the different lenses that you each share. So again, we'd appreciate that very much, because we'll end up with a stronger product at the end that we can try to match with the different needs of the eventual sponsors of the activity. This conversation was really fabulous for us, for me, to take forward, and I know we'll be able to make some significant and valuable changes to the one-pager that we have, but we still have a little ways to go there.
Okay. I want to conclude by thanking everybody for your engagement; you were with us all day, and it's not typical that we come to 4:30 and still have just as many people in the room as we had when we started the day, so that is a grand indication of the interest in the topic. Thank you for your attendance, and thank you to all the speakers and the moderators. We will look forward to engaging with you and moving this forward. I mentioned that we look at ourselves as an incubator, and we're pretty happy when we can get the little chicks out of the incubator and have them start flying, so hopefully we're moving in that direction. Thank you, everyone, and stay calm.