I'll start with an image. Let's see if the pointer works. This is an image of a tumor. It's a red-green-blue image, something that any image analysis person deals with. The red, green, and blue channels from this image, or a traditional histopathology image, could tell you something about the morphology of cells, which would tell you something about the way to treat those cells. And a great deal of effort goes into analyzing tumors based on morphology and red-green-blue images. But instead of this image only having three colors, it has about a million colors behind it, and I'm only showing three slices. So although the spatial resolution of this image may be considered poor compared to an iPhone or something, the chemical depth of information you get from a mass spectrometry image is many, many orders of magnitude greater than what most photonic techniques give you. So we call this hyperspectral imaging, and we'll talk a bit about it today. Let's see if we got lucky. No pointer. No clicky. I'm having technical difficulties with my slides; I think it's the clicker. That's okay, I'll just not use it. It froze the computer. But my slides are online, so you can just read them and I'll go get a coffee. Okay, let's see if we can bring it back. So Stefan asked me to start with a slide about who I am. Physically I work at a place called the Joint Genome Institute in Walnut Creek. It's about 12 miles east of here. It's quite nice. There's parking. It's free. But I've been working at LBL for about eight years, and my main project is OpenMSI, but you can check out some of our other production-level stuff on our BioRack organization, and then some really bad ideas on my personal GitHub page. I'll tell you a little bit about what's happening in my spare time. Most of the work I do is data analysis, and I work with a large number of people at Berkeley Lab, and with some of the biggest programs in the biosciences at the Lab, like JGI, JBEI, and ENIGMA.
So I've shown a ton of people here, but of the two projects I'm going to talk about, one is Metabolite Atlas. It was built by a friend of Stefan's named Steven Silvester, but before that it was actually built by this guy, Terrence Sun, when he was a 16-year-old working at NERSC. He's really smart, and now he works for Google, believe it or not. The OpenMSI team is mostly myself and Oliver Rübel, and most of these people are data generators that we work closely with, and we teach them all how to program. The people we work with use these IPython notebooks to analyze their own data with minimal help from me. So, spatial gradients are important. I don't have to tell this to an imaging audience, but we do imaging to understand spatial gradients and find the information in them. In my work, I use it for high-throughput screening, where we array out samples at extremely high density. Each sample can be about the size of a human hair, printed on a chip. This entire image of about 1,000 samples is about a centimeter by a centimeter. So we can make very high-density arrays of samples. We also use it for histopathology, as I was talking about earlier. More recently, we're looking at microbial interactions. This is really interesting. You've heard about the National Microbiome Initiative launched by the White House. There's a lot of interest in understanding how molecules are sent from one species to another to tell them, hey, I'm here, or hey, I'm going to destroy you. Feed me, or I'll kill you. Drug metabolism is also a very common modality for mass spectrometry. People designing drugs want to find out what that drug becomes in a living organism. So mass spectrometry imaging raster-scans a laser across a sample. At each location, you're generating a plume of ions, and the ions are recorded by a mass analyzer, generating a spectrum.
So at every position you have something like this, which makes a really big cube that looks something like this — mostly very sparse data. So you have m/z, the mass-to-charge ratio, by x and y. These tall peaks are molecules. We can slice this data in m/z and make an image of a molecule, or we can look at a particular location and see the spectrum recorded there. So these files can be just ginormous — that's the proper term for it. This is the printed sample I was telling you about earlier. This is an experiment where three of us went around and touched different chemical surfaces and then touched the mass spectrometry surface. And this file is literally a terabyte — one file. So one day's work, one terabyte. Maybe that doesn't sound so big now with all the sequencing efforts and things; everybody's generating big data. But the data can be pretty big — definitely too big for someone to analyze it themselves on their personal computer. So in summary, you're basically raster scanning: at hundreds, thousands, even millions of locations, you're acquiring spectra. New detection strategies can make this go a lot faster. Right now it takes about a day to take a pretty big image, but hopefully that'll be changing in the near term with, basically, cameras that can image spectra. You can imagine how that might work. That'd be pretty cool. So we call it hyperspectral, not multispectral. And we would say multimodal, because in fact you can take different types of spectra at each location — many locations have more than one spectrum, which is a little complex. It's also multimodal because we combine other images in the analysis workflow. So this is a simple example: a brain of a mouse. These are showing two different ions — in this case, two different lipids, compounds that might come from your diet. And you can see the spatial distribution is very distinct.
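To make the cube concrete, here's a minimal NumPy sketch (toy dimensions, assumed purely for illustration) of the two slicing operations just described: fixing an m/z bin to get an ion image, and fixing a pixel to get a spectrum.

```python
import numpy as np

# Toy hyperspectral cube: x by y by m/z. Real cubes have on the order
# of 10^5 or more m/z bins, which is what makes the files so large.
rng = np.random.default_rng(0)
cube = rng.random((20, 30, 1000))

# Slice along m/z: a 2D ion image for one mass bin.
ion_image = cube[:, :, 512]
assert ion_image.shape == (20, 30)

# Slice at one (x, y) location: the full spectrum recorded there.
spectrum = cube[5, 10, :]
assert spectrum.shape == (1000,)
```

In real data the cube is too sparse and too large to hold in memory this way, which is exactly the problem the storage layout discussed later addresses.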
These types of cells have a different lipid composition than those types of cells. We can also see small molecules like amino acids. We can see drug molecules. And we can even see proteins and lipids — larger molecules. People usually prepare their samples in different ways to see these different types of molecules, rather than seeing them all in one shot. The way we do this is we raster-scan some sort of laser beam or ion gun across a sample. Typically this blows tiny little holes in the sample. Everywhere it blows up, you're doing a desorption. Unfortunately, most of the things you're desorbing are neutral; they're invisible to you, and they just make your instrument dirty so you have to clean it. Some small fraction of them are ionized during the desorption and ionization process, which allows you to detect them on a mass spectrometer. The most common way we do this at Berkeley is called NIMS, or Nanostructure-Initiator Mass Spectrometry. There's a small, very porous nanoscale surface that absorbs all the heat when you shine a laser beam on it. It gets extremely hot. In the pores is a vacuum-compatible liquid — imagine something kind of like honey. It's high viscosity because it's going into high or ultra-high vacuum. It gets hot, it explodes, and this is what blows the molecules off the surface and into the gas phase as ions. Because it's multimodal, we can record a spectrum of these desorbed ions, and we can also do some pretty amazing things with the instrument once the ions are inside it. We can use electric fields to select specific ions, hold them in the machine for some period of time, and blast them apart further. This is called MS/MS, or MSn. Here you're actually trying to get a characteristic fingerprint of a molecule. You select a peak, blow it to smithereens, and see what small pieces emerge from its decomposition. These instruments have exceptional mass accuracy and resolution.
You may think of the mass of glucose as about 180; in mass spectrometry, you don't think about it that way. You need to add four decimal places to every mass you're recording. The mass of an electron is about 0.0005 Da — that's about our uncertainty for most of the measurements we make. The accuracy is extremely high. And in more modern instruments, the ability to resolve two peaks that have almost the same mass keeps getting better and better. New instruments can also separate molecules on a millisecond time scale by having them flow through a carrier gas. This is called ion-mobility separation. So now at each location, you're not just recording a spectrum; you're recording a multi-dimensional histogram in ion-mobility drift time and mass. The files were big before. Now they're ridiculously big, and the data is big data — what is it, volume, velocity, variety, veracity, all those Vs of big data? This is the main problem most people face: you have all these different things that you need to put together into one analysis workflow. This is where people have really struggled. I would spend all my time interviewing scientists I was working with, building workflows for them to do a very simple task. These could be things like PCA, making some slices of their data, visualizing certain ions, calibrating the data, registering the data to another data source. And we do a bunch of funny things with isotopes that I won't go into, and this high-throughput spotted-sample screening. So I'm going to present to you the dilemma that we're ultimately facing right now. What if Dr. Frankenstein had the most advanced mass spectrometer? What would he do? He would have to use a supercomputer, is what he would do. If you only recorded the MS1 spectra of a human brain, you'd be in some hundreds of megabytes to low terabytes, depending on the resolution — how many slices of the brain you did your mass spectrometry on, and how many pixels per slice.
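To make the scaling concrete, here's a back-of-envelope sketch. The pixel counts, bin counts, and float32 storage are all assumptions for illustration, not any instrument's actual spec.

```python
# Hypothetical acquisition parameters (illustrative only).
pixels = 200 * 200        # raster positions in one image
mz_bins = 130_000         # mass bins per spectrum
drift_bins = 200          # ion-mobility drift-time bins
bytes_per_value = 4       # float32

ms1_only = pixels * mz_bins * bytes_per_value
with_mobility = pixels * drift_bins * mz_bins * bytes_per_value

print(f"MS1 only:          {ms1_only / 1e9:.1f} GB")
print(f"with ion mobility: {with_mobility / 1e12:.2f} TB")
```

Under these assumptions one dense MS1 image is about 20 GB, and adding a drift-time dimension multiplies that by the number of drift bins; layering MS/MS at each location multiplies it again, which is how a single file reaches the regime discussed next.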
So if you only did that — but that would depend on the instrument you're using. These more modern instruments are going to produce bigger files because they have higher mass resolution. It would depend on the spatial resolution and the type of ion beam you're using. So MS1 only, you're looking at low terabytes for a human brain. There's not a lot of information there, so you're probably going to take MS/MS images also. You need to fragment those molecules to know the exact thing you're looking at, and you see an immediate blow-up of the data with those MS/MS images. If you add ion mobility, and ion mobility with MS/MS, you're in the petabyte regime for one file. So obviously no one is taking data like this, for lots of reasons, but I would like to say that they could. They could do this with the tool I built with Berkeley Lab called OpenMSI. You would need a supercomputer, obviously — you can't do it on your laptop. So right now with OpenMSI, people acquire this multimodal data, transfer the data to the supercomputing center we use — it used to be in Oakland; now it's right up there in that nice new building — and process the data automatically at NERSC. Then the user can analyze their data and explore it interactively in a web browser. You can check it out at the bottom link. You just click around on the images, and you can see spectra, slice the spectra, and make images. But more importantly, they're now using those web services and the NERSC environment with IPython notebooks to write analyses of the data and reuse those analyses to do things more advanced than just exploring it. So how did we build this? I'll go into just a little bit of the detail. We had to build a file uploader — this is kind of how the whole thing works, and I'll talk about all the pieces. This was really a nightmare, and I don't think it should have been. But what do I know? I just came into this as kind of a hacker type.
But getting data to go from a group in some random lab to a supercomputing center is harder than you think. You'd think you would just get some off-the-shelf widget, plug it into your website, and you're good to go. That was not the case, partly because the files were so large. We were using a service called Globus to transfer the files, and it was very difficult to communicate that to the users. Once they upload the file, though, it gets picked up immediately by our system, and they can submit a request to NERSC through this website saying, hey, convert whatever file format I uploaded into an OpenMSI file — which I'll tell you a little bit about soon. So that converts. And then, because we were using a shared resource, sometimes people would wait seconds and sometimes days for their file to start running. So we had to build a little job manager to let them know it's going to be a while, or it's already started, or your job has failed. Once the file is completed, you see all the files that you've ever made and all the metadata about their conversion. You can share them with other people using this little web interface: click on manage, and you can let other people analyze your data from anywhere in the world. They do have to have a NERSC account. And, most important, you can click around, see the spectra, and explore the data. Before we built OpenMSI, the competing commercial software would take hours to load one image. That's because it had to do these serial read operations. The data was stored spectrum, spectrum, spectrum, spectrum. But a person wants to read an image, so they have to read a tiny bit, skip, read, skip, read, skip. It would take hours. So we simply did some clever things in HDF5 with data duplication and chunking that made it go really fast. And people can also analyze their data with these IPython notebooks.
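The chunking idea can be sketched with h5py. This is a toy illustration of the trade-off, not the actual OpenMSI file layout: a chunk shape that spans part of every axis keeps both image reads and spectrum reads from degenerating into the skip-read pattern just described.

```python
import numpy as np
import h5py

data = np.random.default_rng(0).random((50, 60, 1000)).astype("float32")

with h5py.File("msi_demo.h5", "w") as f:
    # Chunks spanning x, y, and m/z serve both access patterns: an image
    # read (fix m/z) and a spectrum read (fix x, y) each touch only a
    # bounded number of contiguous chunks on disk.
    f.create_dataset("cube", data=data, chunks=(10, 10, 100),
                     compression="gzip")

with h5py.File("msi_demo.h5", "r") as f:
    image = f["cube"][:, :, 500]     # one ion image
    spectrum = f["cube"][25, 30, :]  # one spectrum

assert image.shape == (50, 60)
assert spectrum.shape == (1000,)
```

The actual system also keeps duplicated copies of the data optimized for different access patterns, trading disk space for read latency.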
So this is kind of the whole workflow that we're, you know, having people do. Success story: people at Washington State didn't know anything about computers. They owned computers, but they didn't use them for anything other than checking their email. They now analyze all their data effortlessly through OpenMSI, and they take some of the best data you've ever seen. The world would never have known about their experiments without this, because it wouldn't be online — and they also just wouldn't have ever gotten their papers out; they couldn't have analyzed the files they were taking. Their files are quite large, about 50 GB each, something like that. It's actually a botanical medicine for treating, of all things, genital warts. They're trying to express it in a plant for purification. There you go — this is a root ball of a plant where they can have it make whatever this molecule is. I think podophyllotoxin. Yeah, so we had to have a good computing environment, and the people in this room were largely responsible for it. Danny was in some of the early kickoff meetings, just bringing people together. I'm sure if BIDS had existed then, it would have also been part of the roadmap, too — just bringing the right people together in the room, getting going, making a new file format, making an API, and then doing something useful with it. So this is something like what the ecosystem looks like now at NERSC. This slide changes every year because they buy a new computer and get rid of an old one. But data gets uploaded here daily to this large multi-petabyte file system. It's highly interconnected through magic to the supercomputers, so you can immediately see your data from any NERSC system — whether it's a web service or a supercomputer, it sees the data on the project file system. To me, that's a really good idea.
You know, the file format — you can see this slide online, and it's in the paper, too — but basically the file format was also kind of the magic thing. By doing this duplication and chunking in HDF5, it let us read the data really quickly, so you can read images or spectra without much perceivable delay. And that's what our main paper was about: just showing how fast this can be done. And of course now everybody's doing it that fast. The API is very simple. We put a fair amount of thought into keeping the API minimal. I think it's a pretty good design: we have a rich parameter space for using the API but a really small number of commands. There are these five commands, and they're the only commands in the API. So it's pretty easy to document and to teach people how to use it. This is the web API, not the file API. There's an example of it. So we can get these large images — well, for us this is a big image, but it's only hundreds of pixels on a side. It has 130,000 mass bins, though, so it is kind of big in that dimension. And we're looking at millisecond timing to go from NERSC to a laptop and get these images and spectra. For smaller images, it's faster. And then if you pull the slides, you can hit these shortened URLs. These actually go to some IPython notebooks that you should be able to run yourself in Python 2.7. This week was my first week in Python 3; I'm getting used to it. Print and how strings work are really the only differences I've hit, but I'm sure there's more to it. Anyway, you can check out these notebooks. They should work for you. You should be able to pull data from OpenMSI and do some image processing on it — simple stuff, like thresholding the image or changing the color scales on the images.
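As a sketch of what calling such a web API looks like, here's how you might build a request URL for one ion-image slice. The command name and parameter names below are illustrative assumptions, not the documented OpenMSI API; check the site linked on the slides for the real endpoints.

```python
from urllib.parse import urlencode

BASE = "https://openmsi.nersc.gov/openmsi"  # gateway from the slides

def slice_url(filename, mz_bin, fmt="JSON"):
    """Build a hypothetical image-slice request: one ion image at m/z
    bin `mz_bin`. Parameter names are assumptions for illustration."""
    params = {"file": filename, "z": mz_bin, "format": fmt}
    return f"{BASE}/qslice?{urlencode(params)}"

url = slice_url("brain_section.h5", 512)
assert "qslice" in url and "z=512" in url
```

The design point is that a handful of commands like this, each with a rich parameter space, covers image slices, spectra, and metadata without a sprawling API surface.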
But the point of the manuscript that these notebooks accompanied was to say, hey, we can use notebooks to share this with people. That was really the point of the paper. It wasn't really how advanced the image processing was — pretty routine stuff: k-means, PCA, that kind of thing. Okay, so I'm going to shift gears just a little bit. Much more common than mass spectrometry imaging is mass spectrometry coupled with liquid chromatography. Everything about the mass spectra is still the same — we acquire mass spectra in large abundance — but there's no imaging. Instead, what would be the spatial dimension is now time. You're recording spectra as a function of time, and you record a lot. In time, the molecules are being separated by a combination of solvents and a stationary phase. You have a solvent composition flowing over a column, and it's pulling molecules off by their physicochemical properties. So this is widely used. There are probably 100 people within a couple miles of right here doing this. For mass spectrometry imaging, there may be one at the ALS, maybe two. There are not very many. So liquid chromatography is very common. And guess what? It looks kind of like images too — you can make it look like an image, at least. Here, time is going vertical; these vertical bars are time. You can see here that a molecule comes out in time and then it goes away. As it comes off the column, you record a signal, and then it goes away in time. And the axis going across the page is the mass. So we can record these things with unbelievable throughput. For my other project, which is called Metabolite Atlas, we upload 1,000 files a month from just Berkeley Lab. It's a much different scale in terms of the complexity and the number of samples. So you can see from here that if you wanted to say, you know, what's this molecule — good luck. How are you going to do that? You might be able to record its mass.
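The image-like view of an LC-MS run can be sketched by binning detected ions into a 2D histogram of retention time versus m/z (synthetic data, with assumed ranges):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy LC-MS run: each detected ion is a (retention time, m/z) pair.
times = rng.uniform(0, 20, 5000)    # minutes
mzs = rng.uniform(100, 900, 5000)   # mass-to-charge

# Rows are retention time, columns are m/z: the image-like view
# described above, where a molecule shows up as a bright streak
# at its m/z that appears and then fades in time.
img, t_edges, mz_edges = np.histogram2d(times, mzs, bins=(200, 400))
assert img.shape == (200, 400)
assert img.sum() == 5000
```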
You'll be able to record its time. You'll be able to record maybe a fragmentation spectrum for it. But the main thing is that most of those peaks are degenerate garbage, and your main process is cleaning up the garbage. All these things — this, this, and this — are the same molecule, but they're different forms of it. Maybe it's gaining a sodium ion. Maybe it's breaking a bond. Maybe it's forming an ion cluster. We call those things adducts. Going from these adducts to actual molecules is a large reductionist process: we go from hundreds of thousands of features to hundreds of authentic identifications. This process is an unsolved problem that I think could really transform all areas of technology if it were solved — medical, environmental, cosmetics, you name it. So we've written a lot of papers about this, and this has been kind of a career goal of mine that this year has really catalyzed into something somewhat meaningful. We have a nice database, and we have a Python API to analyze the database and to analyze a bunch of HDF5 files at NERSC. And we can do things like store a reference fragmentation spectrum for a molecule. So I can say: this is the fragmentation spectrum of glucose. We can say: this is the retention time of glucose. Beyond that, we can say: this is the retention time for glucose in this file. Beyond that: this is the retention time for glucose in this file, from this sample, prepared with this method. So we call these sample- and method-specific metabolite atlases. Now that we store all this information about all these features, we hope — through a hackathon we hope to have here at BIDS — to build a learning process so that when new files are uploaded, the user immediately gets all the benefit of these assertions that have been stored by the previous scientists, the good ones at least, without requiring this careful curation. So how do we identify molecules? There are really three ways.
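Here's a toy sketch of what one of those layered assertions might look like as a record. The field names and values are illustrative assumptions, not the actual Metabolite Atlas schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AtlasEntry:
    """One assertion about a metabolite, optionally scoped to a
    sample and preparation method (hypothetical schema)."""
    name: str
    mz: float
    rt_min: float                        # retention time, minutes
    fragments: List[float] = field(default_factory=list)  # reference MS/MS peaks
    sample: Optional[str] = None
    method: Optional[str] = None

atlas = [
    # Generic assertion about glucose ([M-H]- m/z shown as an example).
    AtlasEntry("glucose", 179.0561, 1.2, fragments=[89.02, 59.01]),
    # More specific assertion: same compound, scoped to a sample/method.
    AtlasEntry("glucose", 179.0561, 1.4, sample="root", method="HILIC"),
]

# The method-specific entry is the more precise assertion when it matches.
hits = [e for e in atlas if e.name == "glucose" and e.method == "HILIC"]
assert hits[0].rt_min == 1.4
```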
The first choice is a third-party spectral library or an in-house spectral library. The downside is you just can't buy most molecules. Of the known molecules, you can buy, you know, thousands of them, and there are many, many more. So we would like to go first-principles: we would calculate the way a molecule would fragment. This really can't be done for electrospray ionization right now. The physics is too complicated, or the computers aren't powerful enough, or somewhere in between — I don't actually know why we can't calculate it for electrospray. You can do it pretty well for electron-impact ionization. So instead, what we do is use hybrid methods that depend on large graphs. The basic idea is you take a molecule, you fragment it, you fragment the fragments, and the fragments of the fragments, and you build this gigantic graph that's the complete enumeration of all possible fragmentation paths a molecule might take. For small molecules like this, that's fairly computationally feasible to do on a regular computer. It takes a few seconds; you can generate a few thousand paths from the parent all the way down to tiny bits. Given these fragmentation paths, you can then ask: okay, did I detect a fragment? Did I detect a fragment of a fragment? A fragment of a fragment of a fragment? And you can use statistical tools to score that very high or very poorly. So this is the real problem: for this molecule it's easy; for molecules like this, it's very hard. This molecule would take a couple of days on a node of Edison just to build that tree, just doing the complete graph enumeration. We're trying to do that for all of our molecules so that we can have some automated scoring process, and we'd like to scale this up a lot. I think this could become a much easier problem if we did things a little differently. The other thing we do is visualize this data. People want to visualize their data in the context of knowledge.
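The enumeration idea can be sketched on a toy "molecule" — a string of atoms where breaking a bond splits the string. This is only a stand-in for real bond-breaking on a chemical graph, which is what makes the real problem explode combinatorially for large molecules:

```python
def fragment_paths(mol, depth=3):
    """Enumerate fragmentation paths of a toy linear molecule (a string)
    by breaking one bond at a time, up to `depth` generations."""
    paths = []

    def recurse(frag, path, d):
        paths.append(path + [frag])
        if d == 0 or len(frag) < 2:
            return
        for i in range(1, len(frag)):      # break each bond in turn
            recurse(frag[:i], path + [frag], d - 1)
            recurse(frag[i:], path + [frag], d - 1)

    recurse(mol, [], depth)
    return paths

paths = fragment_paths("ABCDE", depth=2)
# Scoring then asks: which enumerated fragments did we actually detect?
assert ["ABCDE"] in paths                 # the parent itself
assert ["ABCDE", "AB", "A"] in paths      # a fragment of a fragment
```

Even this toy version grows quickly with molecule size and depth, which hints at why full enumeration for a large molecule can occupy a supercomputer node for days.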
I know about one molecule, but I didn't detect that molecule. I detected something else, but I need to visualize what I don't understand in the context of what I do understand. So consider a Thanksgiving study where you ate one food and it induced sleep, and another food had no effect. Let's say you identified serotonin and tryptamine. Well, in these networks, where you can group metabolites by the similarity of that substructure I was telling you about, you would immediately see things you might recognize — tryptophan, melatonin, different derivatives of serotonin — all in the same family of the graph, because they have the same underlying chemical structure. So this is an important tool, and when you start pasting all of biochemistry onto tools like this, you generate very large trees, which you can't quite see here because it's so minuscule. But the point is you can now start to paste data onto these graphs to help understand one result in the context of something you do understand. So I'm going to end with this. I have no idea how long I was supposed to talk for. But anyway, we're also really interested in bringing people on. We've had great luck hiring consultants. We pay money — that's kind of a nice thing. So if you like open source software, you might like to get paid for making open source software. Send me an email; we'll try it. But mainly if you're interested in the science, because there's not a lot of oversight in this — it's not like a regular job, usually, for the people we've had good luck with. So some independent interest is important too. Send me an email. Thanks for your time. It's a pleasure to be here.
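As a sketch of the grouping idea, here's a toy similarity network where "substructure" similarity is faked with character bigrams of the compound names. Real networks use chemical fingerprints of the structures themselves; the metric and threshold here are assumptions purely for illustration:

```python
def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(a, b):
    """Jaccard similarity on name bigrams -- a toy stand-in for
    chemical-substructure similarity."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B)

names = ["tryptophan", "tryptamine", "serotonin", "melatonin"]

# Draw an edge between metabolites above an (assumed) threshold.
edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if similarity(a, b) > 0.25]
print(edges)
# → [('tryptophan', 'tryptamine'), ('serotonin', 'melatonin')]
```

Even this crude metric clusters the related names into families; with real fingerprints, a detected-but-unidentified feature lands next to compounds you do recognize, which is exactly the "context of knowledge" use described above.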