I'm going to be talking about the intersection of my two main affiliations now, which are the Open Connectome Project and the Institute for Data Intensive Engineering and Science. And that is data intensive computing, which is the subfield of high performance computing that's focused on problems that are IO bound, that are limited by your ability to examine and process data.

So let's start with what the Open Connectome Project is. It spans data collectors, data scientists, machine learners, computational vision, and statistical data processing. The goal is to create a computational forum for data analysis of massive data sets. For us, this mostly means 3D electron microscopy mouse brain image data to date, although we're not at all married to that data. We have mouse, monkey, and are about to have some human brain data. We have CLARITY data. We have array tomography as well. But what we want to do is take these massive data sets, so right now we're at 100 terabytes of data, and link them with high performance computing in a way that you can do computational vision at scale to extract brain structure. We also want to provide the ability to examine this brain structure through spatial queries that ask for things like clustering, distances, and volumes of structure. And the ultimate goal is to create a fully annotated graph that you can do statistical analysis on.

Now, my research agenda. I thought it was my research agenda; it's not. Barack Obama made it clear that I'm doing his research agenda in the last two years. This happened in March a year ago, when he made an announcement about big data. The way that the Office of Science and Technology Policy works in the US is kind of interesting: it's actually an office with no authority. They just give a mission and say a number, and then the world scrambles to make it happen. And so the president said, do big data, and I said, ooh, I've been doing big data for 10 years, that's great. And then the president this year said, do brain. And this is either a large initiative or a small initiative, depending on which side of the pond you're from. But it is the US initiative. And so I'm being driven, or pulled asunder, not by my own will, but by the president.

And so brain data is big data. I think we know this, but some numbers really reinforce it. We can talk about the fundamental size of the brain: 100 trillion synapses, 10 to the 15th neurons; sorry, the other way around. But I also like to characterize the fundamental data size of the task of imaging the human brain at, let's say, four nanometers by four nanometers, which is arguably a resolution where you can see all the fine-scale structure you need to. And that's about an exabyte of data. So that's a lot of data to manage, and that's the size of the data we want to process. Right now we have several projects that are imaging data at about a terabyte a day. And this data is really rich and high-dimensional. It's 3D, but we also add time series and multiple channels. There are many imaging modalities and it's multi-scale. So this is just a daunting data problem.

This is a movie that I think gives a sense of what this scale looks like. It goes from the scale of a human brain, which is Bobby's brain, which is in that head there, to zooming in to four nanometers. Now, we store about 11 zoom resolutions for this data, but this is all about 10 or 12 orders of magnitude.
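To put those numbers side by side, here is a rough back-of-the-envelope sketch. It only re-uses the figures just quoted (an exabyte target, roughly a terabyte a day of new imagery, about 100 terabytes on hand), so treat it as illustrative arithmetic rather than a pipeline calculation.

```python
# Back-of-the-envelope scale check using only the figures quoted above.
# Illustrative round numbers, not measurements from the pipeline.

TB = 1e12   # bytes in a terabyte
EB = 1e18   # bytes in an exabyte

current_holdings = 100 * TB    # ~100 TB stored across projects today
acquisition_rate = 1 * TB      # ~1 TB of new imagery per day
whole_brain_target = 1 * EB    # rough size of a human brain at ~4 nm voxels

print(f"Current holdings: {current_holdings / whole_brain_target:.4%} of an exabyte")

days = whole_brain_target / acquisition_rate
print(f"At 1 TB/day, an exabyte takes {days:,.0f} days (~{days / 365:,.0f} years)")
```

The gap between those two numbers is one way to read the "daunting data problem" comment above.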
And so it gives a sense of really just how finely detailed it is and how much data we're going to need to accumulate to get from this level of detail to, hopefully not, Bobby's brain. Bobby Kasthuri is a postdoc in Jeff Lichtman's lab at Harvard and is one of our key data providers.

Okay, now I'm going to change gears and I want to talk about big data and data intensive computing. I mentioned that data intensive computing is this data analytics high performance computing field, and I equate that with big data. That is not necessarily a reasonable equation; there are many other things that are big data, such as machine learning, scalable statistics, sensor fusion, and semantic integration, and I need to be cautious not to use big data to mean just the problem I care about, which is processing vast amounts of information.

We have a history of big data at Johns Hopkins. I very fortunately walked into one of the great data science projects that was going on, which was started at Microsoft Research in collaboration with Johns Hopkins by Jim Gray, a Turing Award winner. We built this machine in 2006 called GrayWulf, which was really the model for a data intensive computer. It stored a petabyte of data at the time and it could look at 80 gigabytes of data a second, which was a really staggering rate in 2006 and still is. And this won the Storage Challenge Architectural Award at Supercomputing.

Every great computer needs a great application, and this computer was co-developed with the Sloan Digital Sky Survey, which is a catalog of the output of the Sloan telescope: a map of the sky in the visual frequencies. This was done by Alex Szalay, and this has become the most important telescope in the world in the sense that you can always get time on it. There are never clouds; it's never day. We have 6,000, sorry, we have 15,000 registered users, which is notable because there are only 6,000 professional astronomers in the world. And so the Sloan Digital Sky Survey is really an archetype of what a good data intensive application is. It takes a massive science data set, imaging data; it processes it to enrich it, pulling out things like clusters and frequency filters and doing image processing on it; and then it provides it to the public through a web service where they can do deep analysis and discover lots of new things. It's been the basis for projects such as Galaxy Zoo, a crowdsourcing project that's created 20,000 galaxy classifications.

So this is where we started with data intensive computing, and at the Institute for Data Intensive Engineering and Science we've branched out. The model is to take a scientific domain and the capability to do deep data processing and combine them in a way that meaningfully transforms that domain. This chart is just a sort of family tree of the Open Connectome Project. We've worked on oncology; half of these databases are related to astrophysics; there's environmental science; and the parent of the Open Connectome Project is the magnetohydrodynamics database, which wouldn't necessarily seem to have a lot in common with it. This is a database that captures the high resolution output of a magnetohydrodynamic simulation. It has much in common in that the data in both cases is high-dimensional, spatial, high-resolution, and needs to be studied in depth.
And I think it's a good example of the capabilities you'd like to provide for connectomics. I'd point out that studying data after it's been generated, whether it's been imaged or simulated, has great value. We found this in magnetohydrodynamics, where we were able to go back and look in a dataset that was collected for a completely different reason and identify and verify stochastic turbulence that was just there in the data. It was an 80-year-old hypothesis; it just got resolved; we published it recently. And I can barely understand the paper; all I did for this result was build a data system that allowed people to study the data.

And now we have a next generation machine, and this is the machine that hosts our connectomics data: the Data-Scope. It is a 10 petabyte machine, it does 500 gigabytes a second of IO, and this is for a machine that only has 90 compute nodes. So it really is a data intensive machine by that definition. For comparison, Titan, which was the world's number one supercomputer until really recently, does about 720 gigabytes a second of IO and has about 3,000 times the compute power of this machine. So the point of this machine is to be able to access data and bring it to processors where you can run streaming algorithms on it. And we have a 40 gigabit per second upload, so this is the workhorse for all of our datasets. We collect datasets from people that are willing to contribute them to the Open Connectome Project, we ingest them and make them available, and you can run analyses on this machine.

All right, so now back into the Open Connectome world. I wrote the following language recently, which essentially is my goal for the Open Connectome Project: I'd like this compute infrastructure to be a place where I can put my statisticians and my experimental biologists together. So that's my goal, which is I want to write software that people use to characterize brain structure. When people use this software and do interesting things, I've succeeded. Now, that is my goal; that is not necessarily my collaborators' goals. As I mentioned, this goes from people that do electron microscopy to people that are just statisticians. I do want to point out a couple of my key collaborators. One is Jacob Vogelstein, to maintain this very US-centric theme that I'm presenting. His goal in life is to mitigate neuroscience surprise: to create biologically inspired computing architectures that keep the US defense infrastructure ahead of the world when it comes to tasks like computer vision, image processing, and anomaly detection. And then my other key founder in this is Josh Vogelstein, who's a statistician who refused to take this question seriously and wouldn't give me a statement of purpose.

So this eye chart is the founding figure. I include it to confuse, which I think it will accomplish, but also to give a sense of the depth of the pipeline we put the data through at the Open Connectome Project. This chart is organized into sort of three layers: workflow at the top, our data products in the middle, and at the bottom how users, public users, interact with it. We'll actually stream data from the internet into our computer, where we'll do alignment and color correction. And then, having done that task, we'll both tile the data for a sort of Google Maps-style interface, as well as build volume databases, which are our key data products.
These are databases that, rather than representing the data as image slices, represent it as a bunch of small, compact cuboids. That's effective for reconstructing subvolumes and lower-dimensional regions and extracting regions of interest. And that's actually really the goal. Once we've done this volume indexing, we can run vision algorithms against it that will do things like detect synapses and do automatic segmentation. Then the products of that vision will be co-registered with the original data, and then you can do statistical analysis and hopefully do graph extraction. And we have a complete pipeline working. The graphs it extracts, we don't love yet, but what they are is big graphs that were extracted from large data sets.

So we provide a handful of services; these are focused just on the data services. To understand the Open Connectome data infrastructure, I like to break it into three tasks. One is to cut out data: to subset a region of image data or annotation data, which are labels on images, and to look at a specific range of data or a specific subset of the dimensions. The second is to annotate data: after you have cut out data, you might want to apply labels to that data that describe what it is. So we'll look at all the synapses in an image and then label those regions as synapses. That writes labels into a spatial database and also creates objects that represent the metadata in an object database. And then you can query that object database, which is to ask questions about the objects themselves. That's where you can do things like pose spatial queries and look at volumes and distributions.

Okay, so with these services we can then build data products, and I'm going to present a couple of data products. The first one is a Google Maps-style interface. The viewer we're using here is CATMAID, which was developed by Albert Cardona and his team at Janelia Farm. We use CATMAID as a front end for our databases, so we can dynamically load data from any of our databases into CATMAID. This displays some of Jeff Lichtman's EM data overlaid with annotations. Those annotations are themselves a database, and those annotations are linked to objects.

And so this slide shows what object representations provide. Once you've represented all your data as objects, you can then query the objects individually. In this case, I'm showing you a full field and also a query; this is a sample RESTful query that you don't need to be able to read (it's too small for you to read, which is probably intentional) and that just extracts the dendrites. So you can do a task like this: you can look at the data in a field, and then you can say, I just want to look at the subset of the data in the field that are dendrites. And then you can ask, how long are they? How big are they? What's their volume? How far away are these dendrites from the synapses that connect to them?

With the same registered data and cutouts, we can plug that into 3D visualization frameworks. These are just kind of fun to look at. One of the ways we've found useful to visualize the data is to take a 2D image plane, pass it over the volume, and build up the annotations and raycast them as we do it. And then this other video shows that these annotations are actually objects and that you can manipulate them as objects: you can click on them and examine their metadata. The screen will flash a bit here.
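To give a concrete flavor of the cutout / annotate / query services described above, here is a small Python sketch of what a client-side interaction can look like. The host, URL layout, token, and field names are illustrative placeholders, not the project's documented REST API; the real interfaces are the Python and MATLAB bindings mentioned below.

```python
"""Hypothetical client sketch for a cutout/annotate/query service.

The endpoint layout and names below are made up for illustration; they are
not the Open Connectome Project's actual API.
"""
import io
import numpy as np
import requests

SERVER = "http://example-connectome-server.org"   # placeholder host
TOKEN = "demo_project"                            # placeholder project token

def cutout(token, resolution, x, y, z):
    """Fetch a 3D subvolume (a 'cutout') as a NumPy array.

    x, y, z are (start, stop) pairs in voxel coordinates at the given
    zoom resolution.
    """
    url = (f"{SERVER}/cutout/{token}/npz/{resolution}/"
           f"{x[0]},{x[1]}/{y[0]},{y[1]}/{z[0]},{z[1]}/")
    resp = requests.get(url)
    resp.raise_for_status()
    # Assume the service returns a compressed .npz with a 'volume' entry.
    return np.load(io.BytesIO(resp.content))["volume"]

def annotate(token, synapse_voxels, metadata):
    """Upload labeled voxels plus object metadata as a new annotation."""
    payload = {"voxels": synapse_voxels.tolist(), "metadata": metadata}
    return requests.post(f"{SERVER}/annotate/{token}/", json=payload)

def query_objects(token, object_type="synapse"):
    """Ask the object database for all objects of a given type."""
    return requests.get(f"{SERVER}/query/{token}/",
                        params={"type": object_type}).json()

# Example: grab a 512x512x16 block of image data at resolution 1.
# volume = cutout(TOKEN, 1, (2048, 2560), (2048, 2560), (100, 116))
```

The commented usage line mirrors the workflow just described: cut out a subvolume, label what you find, then query the resulting objects.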
And then we can look at the objects themselves and you can highlight individual objects. Now, these are interfaces to the data products, but the data products really drive the capability. This is our MATLAB API. I bring this up because we have a companion poster that will be presented tomorrow that talks about how you can use MATLAB to interface with all these services, extract data from the Open Connectome Project into MATLAB, and essentially write your own scripts. The goal here is that we provide both Python and MATLAB bindings to all of our data products, so that you can program in the environment that you'd like and use massive, distributed data more or less seamlessly. There's a whole poster about this, so I'll not dive into it.

Okay, so those are our services and some of the data products. With that, I'll give a couple of examples of how you might use these data products. This movie shows a subset of 20, actually 19.5, million synapse detections that we took from a four-teravoxel, that's a four-terapixel, volume. This was done on data collected by Davi Bock in Clay Reid's lab. The way this was done is by taking this entire image region, breaking it up into cubes, and then processing those cubes in parallel. We ran this on 192 cores and it took slightly less than three days. This was an exercise in scalability.

Architecturally, what this looks like is that we have this data cluster running on the Data-Scope, our large machine at Johns Hopkins. We then connected that up to a small compute cluster at the Johns Hopkins University Applied Physics Laboratory, which, despite sharing the Johns Hopkins University prefix and being only 30 miles away, is an internet distance apart. That means it has internet latency and internet bandwidth. And we were able to demonstrate that you can do deep data processing by doing cutouts across the internet, doing processing locally on your HPC cluster, finding those synapses, uploading them to a database, creating a new derived data product, and marking those as annotations. This is a visualization of the way that pipeline runs. It shows the data overlaid with a whole bunch of synapse detections, and those synapse detections are the small colored dots.

Another example of what we can do with the Open Connectome Project is scalable image processing. We can take an entire image stack, run algorithms that process the entire EM image stack, and do so globally. This example is going to be color correction, which is an approximate iterative global Poisson solver that was implemented by one of my colleagues. To show the goal of this color correction: the left hand panel is the image as collected in the Z-stack, and you can see that there are anomalies, not in resolution but in exposure, from slice to slice. The right picture shows how these are cleaned up after solving this Poisson problem. What we do is smooth all the low frequencies, but that produces a relatively low quality image. We then take the high frequency data and add it back in, to both correct for the color and preserve edges for edge detection and computer vision algorithms. This ran against four terabytes of data and takes about six hours to complete. So this video shows the outcome of the color correction process; the left hand panel is the original data, which is flashing.
We've corrected that color in the middle panel, but that also mutes a lot of the high frequencies, which is bad for algorithms that want to do computer vision. And then we add back in the high frequencies in the right panel. This shows the same process at a little bit higher resolution: it shows both how the color correction process washes out the high frequencies and how we reintroduce them. So this is one algorithm for the task. The reason it's notable is not necessarily that it's the perfect or the right algorithm, but that it is a large data processing task that we can conduct on the server.

So, a couple of comments now on the design of the Open Connectome Project. The Open Connectome Project data architecture is what I would call a scale-out NoSQL architecture. If you're into cloud, data intensive data processing, those buzzwords all make sense. If you're not, no problem: scale-out means that we're taking lots of nodes, putting data on lots of nodes, and then doing queries across lots of those nodes, so we can incrementally add storage. Right now we're at seven data nodes and we're bringing another four online. Scale-out often refers to hundreds of nodes in the cloud literature; I find that it's more effective for us to build high-capacity storage nodes and have tens of them. Our typical node is about 20 terabytes. And then NoSQL refers to the fact that it's not a traditional relational data store. We trade away relational properties to get this cutout performance, where we can do both dimensional subsets and spatial subsets of the data.

We also split our data across two types of nodes. We have what I'll call standard storage nodes that host image data; those are implemented with disk storage for capacity reasons. We also have SSD nodes that store annotations, and the workload is very different. If we think of the synapse-finding example, when you do the cutouts you're extracting terabytes of data; the data is spatially co-located, so you want good sequential performance, high throughput, and massive storage. For writing all the synapses back, we need to do 20 million small random writes to 20 million objects. And so we do that by building a special class of node that supports really high throughput rates. On this particular node we got a million IOPS, that is one million I/O operations per second, out of commodity hardware. We're going to report on that at Supercomputing this year.

As for the data itself, most of the data we've seen has been 3D EM data and annotations. I also want to note that we do support time series data. And this particular picture is about flexing our multi-channel muscles. This is some array tomography data from Stephen Smith, and it shows that each different color is a different channel of the same data set. So we typically have six-dimensional data: time series, channels, projects (you may have multiple annotations of the same data set), as well as the imaging dimensions. We take that data, and we get it as a whole bunch of images. Some of our biggest images are 200,000 by 200,000 pixels per slice. That's not a very useful image representation for us, so we reorganize it into cuboids, which are just small, compact chunks of data in three, four, or five dimensions. We typically include x, y, z, and time in the cuboids, but not channel or project, for a variety of reasons. We then build a resolution hierarchy of those cuboids. And so we have all this data in a compact, dimensional representation at multiple resolutions.
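As a minimal sketch of that cuboid reorganization and resolution hierarchy, assuming a made-up 16x64x64 chunk shape and simple in-plane subsampling (the production chunk sizes and filtering are different and not specified here):

```python
"""Minimal sketch of the cuboid reorganization described above.

The chunk shape and the number of pyramid levels are illustrative values,
not the production configuration.
"""
import numpy as np

CUBOID_SHAPE = (16, 64, 64)   # (z, y, x) voxels per cuboid -- illustrative

def to_cuboids(volume):
    """Break a 3D volume into fixed-size cuboids keyed by grid position."""
    cz, cy, cx = CUBOID_SHAPE
    cuboids = {}
    for z in range(0, volume.shape[0], cz):
        for y in range(0, volume.shape[1], cy):
            for x in range(0, volume.shape[2], cx):
                cuboids[(z // cz, y // cy, x // cx)] = \
                    volume[z:z + cz, y:y + cy, x:x + cx].copy()
    return cuboids

def resolution_hierarchy(volume, levels=4):
    """Build a zoom pyramid by repeatedly downsampling x and y by 2."""
    pyramid = {0: volume}
    for level in range(1, levels):
        prev = pyramid[level - 1]
        # Simple 2x2 in-plane subsampling; real pipelines filter or average.
        pyramid[level] = prev[:, ::2, ::2]
    return pyramid

# Usage: a fake 64x256x256 image stack, cuboid-ified at every zoom level.
stack = np.random.randint(0, 255, size=(64, 256, 256), dtype=np.uint8)
store = {level: to_cuboids(vol)
         for level, vol in resolution_hierarchy(stack).items()}
```

The dictionary keyed by grid position is the piece the next step builds on: those keys are what get ordered along a space-filling curve.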
We then take those cuboids and we order them according to a space-filling curve. This just says that when we lay the data out on all these disk drives, we lay it out in such a fashion that data are clustered by their spatial location according to a space-filling curve. This has the property that it minimizes the number of IO requests. As an example, this image, which our service can generate on the fly in a couple hundred milliseconds, would in the original image stack take one pixel from one scan line of each image. So the image we can produce in 100 milliseconds would, if we had to extract it from the raw data, involve scanning over the entire image stack and take hours. And that's really the value proposition of this dimensional indexing.

This shows the actual curve we use, which is called the Z-curve, flattened into two dimensions. The point of this is to show that the colors correspond to the location on disk, and that this curve localizes compact regions of data onto small regions of disk. This has the property of speeding up access to subvolumes by a factor of two at minimum, and up to thousands, which was the example I just gave, where you'd otherwise be extracting a single pixel from every scan line.

The next NoSQL principle is sharding, which is how we achieve scale-out. So we now have a big data set, we want to get this data set onto multiple nodes, and we need a principle for doing this. The principle is sharding. This is also a term from the cloud community. Sharding typically takes an index and spreads the index across multiple nodes, and it's self-organizing, so that if you add more nodes, you can further refine the index. We do sharding right now manually in the application, because we have the Z-curve, and that Z-curve is the index we need to distribute. This figure demonstrates how we'll take a Z-curve, divide it by ranges, and place it on four nodes. This process has been effective for us at both scaling storage and scaling throughput: we've gone from having data sets spread across one node to data sets spread across eight nodes. But it's sort of the most primitive example of the scale-out storage principle, which is to be able to add nodes without having to change your software or your configuration. And that's the value proposition of scalable storage: it is incremental and dynamic, and you can deploy it on the cloud. That's actually one of our next tasks, to create a cloud-deployable Open Connectome Project, so that we can host this data on the cloud, or so that you can essentially build your own Open Connectome data storage stack without using our hardware if you want. Adopting scale-out storage will also benefit us in that we'll be able to push some functions, such as indexing and these cutouts, into the storage system.

Right now we're evaluating two different technologies. One is Apache HBase; the other is SciDB. These are both really great products that I think will be transformative to data management for connectomics, and for spatial data, period. The state of the world right now is that neither one of them is complete for our feature set, so I put them up there as ones to watch that we'd like to adopt for our design. Right now, in Apache HBase and other NoSQL stores like it, such as Cassandra, you have the ability to get spatial locality and implement ordered partitions, so we can do the type of Z-curve indexing we want. But it doesn't have any computational capabilities.
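Since the Z-curve drives both the on-disk layout and the manual sharding, here is a compact sketch of the key computation and a range-based shard assignment. The 21-bit coordinate width, the shard count, and the helper names are assumptions for illustration, not our production code. Keeping keys sorted and splitting them by range is exactly the ordered-partition behavior we want from the underlying store.

```python
"""Sketch of Z-order (Morton) keys and range-based sharding.

Assumes 21-bit cuboid coordinates (enough for ~2M cuboids per axis); that
width and the shard count are illustrative choices, not production settings.
"""

def _spread_bits(v):
    """Space the bits of a 21-bit integer so they can be interleaved in 3D."""
    v &= 0x1FFFFF
    v = (v | (v << 32)) & 0x1F00000000FFFF
    v = (v | (v << 16)) & 0x1F0000FF0000FF
    v = (v | (v << 8))  & 0x100F00F00F00F00F
    v = (v | (v << 4))  & 0x10C30C30C30C30C3
    v = (v | (v << 2))  & 0x1249249249249249
    return v

def morton_key(x, y, z):
    """Interleave x, y, z bits: spatially nearby cuboids get nearby keys."""
    return _spread_bits(x) | (_spread_bits(y) << 1) | (_spread_bits(z) << 2)

def shard_for(key, num_shards=4, key_bits=63):
    """Manual sharding: split the Z-curve into equal key ranges per node."""
    range_size = (1 << key_bits) // num_shards
    return min(key // range_size, num_shards - 1)

# Nearby cuboids fall in the same key range, so a cutout touches few nodes.
for coord in [(10, 10, 2), (11, 10, 2), (1000, 10, 2)]:
    k = morton_key(*coord)
    print(coord, hex(k), "-> node", shard_for(k))
```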
Whereas SciDB, which is the database that's being co-developed with the LSST telescope, has some really nice array storage, slicing, and assembly capabilities, where it can do this dimensional slicing and data assembly dynamically, but it doesn't support ordered partitions. So I think both of these technologies are really quite promising for adoption.

One of the hard tasks in dealing with connectomics data is extracting objects. After we've run a vision algorithm and identified a bunch of objects, we need to then go find them. Essentially, you want to query the spatial extent of a structure; you need to do volume queries and distance queries. And it turns out that most traditional spatial indexing techniques, which come from geographical information systems and similar applications, perform pretty poorly. R-trees and their variants are the classic method, and they have the property that, regardless of which R-tree variant you use, when the bounding boxes of your objects intersect, R-tree performance degrades. This is really a problem for neural structures, because you'll have really long, sparse, skinny data. For example, the largest dendrite in the data set that we store for Bobby Kasthuri and Jeff Lichtman has, let's see if I get this right, about 10 million pixels in a 100 billion pixel region. So that's less than 1% selectivity. We need an indexing technique that can dive into this data and efficiently find these objects that have large spatial extent, and R-trees don't work. So we're exploring some sparse indexing techniques that are also built on this Z-curve technology, in which we look at enumerations of the Z-curve and represent that as an index: it says this object exists only in these cuboids, and then we can just go read that subset of cuboids.

We've got performance numbers now, and I throw them in to give a sense. I don't love these; they're not optimized. But what they do demonstrate is that on a single node, we can do 150 megabytes a second of cutouts through memory and over 100 megabytes a second when we have to go to disk. This means that when we compose these nodes, we can right now reach a gigabyte of data a second. It's not optimized for the hardware, and it doesn't compare well with the 500 gigabytes a second that the machine can do, but it is a high-throughput data processing task. Now, the scaling behaves differently as a function of what you're doing. If we're getting the data from disk drives, it's IO-limited, and that's going to scale with the number of disk drives we have, so it scales on a per-node basis. If the data is in memory, because you're cutting out data that you've already looked at before and it's in cache, then the overhead is the reassembly of the data from its cuboid representation into the organization your query requests. That's memory-bound and scales with the number of cores we have.

All right. So I've been at this for about two years in the Open Connectome Project. I'll start at the bottom: I love this project. It's great. This is the best scientific data I've seen. It's my fourth discipline; I'm done with astrophysics; I'm still a little bit into computational fluid dynamics. But what makes connectomics data so wonderful is that it's got really deep and rich pipelines. We're talking from instruments to vision to analysis.
Just the data processing task requires that the data systems connect all these different communities in a meaningful way, and that makes it a really cool infrastructure to work on. The data themselves are very complex, and not complex in just one way. They're big, they're diverse, they're sometimes sparse; there are matrices, there are graphs. And I'm just talking about one type of data, not talking about multimodal data and ontologies and atlases. So there's a whole level of complexity that we're not even touching on in the Open Connectome Project. And the neuroscience community gets big data. One of the things that came out of the BRAIN Initiative in the US is that the money and the interest is not in doing data analysis right now, but in developing the tools to do data analysis. And that's actually the challenge: we need informatics tools before we solve the problems. We're not ready to just go solve the problems. So it's a really wonderful topic area for me as a big data scientist.

And I'll finish with my team. Most of this talk has just represented people or tasks at the data infrastructure and data management level. But again, it does span data collection. We work with the Deisseroth lab at Stanford and with Stephen Smith, and with Jeff Lichtman, and with Janelia Farm. We owe a lot to the people that develop software tools, such as Stephan Saalfeld and Albert Cardona, and to people that do statistical analysis, such as Carey Priebe and Joshua Vogelstein, and many others. So it's a really wonderful community. Thank you.

Thank you. Thank you for providing a Formula One car for neuroscientists. OK, now it's time for questions and comments. Any questions or comments?

Thank you. Spectacular, amazing, fantastic talk; I'm blown away. So one of the things about neuroscience, and especially about neural circuitry and neuroanatomy, is that the features a lot of people talk about at a high level are ensembles of all of the stuff that you're putting together at the low level. And so listening to you present your talk, I'm struck by how informative your work could be to us and to everybody else in the community who hasn't dived down into the depths of the individual synapses yet. If you had access to all of the data down to the four nanometer scale and you could represent everything else, could you help us develop and think about the kinds of representations, for example, of neural circuitry, based upon how these things combine? So, for example, say you're talking about an individual projection between two brain structures, and there's a statistical pattern that maps brain structure A onto brain structure B. Your system and the data that you have within your system kind of provides a really straightforward vehicle to start thinking about how one would build quantitative functions to do that. So could you comment? Could you offer some insight? What do you think?

So that's a wicked question. I'm going to answer it a little bit backwards, which is about what we build and deploy and how we go about it. I would say that our development strategy is immersive, in that we work very closely with people that collect and analyze data, and that the world is better, in my opinion, when those people are different people, and we're converging to that point. I will build a function for anyone that will use it, and that is actually a pretty extreme gateway: just having data is not interesting enough for entry into the Open Connectome Project.
If you have data and you want to analyze it in a meaningful way, and use scalable services in a way that you can't without a data-intensive computer, then I care. On a lot of the tough issues (representation, metadata, searchability, querying) I feel like my project is very primitive when I compare myself to what I've seen in the poster session and what I know of my collaborators here. There are a lot of people working on multimodal data and representations who are thinking about it at a higher level and doing a better job. The focus for us has really been on supporting vision, because in this particular world, which is electron microscopy connectomics, we have a throughput problem in terms of representing segmentations and detecting objects, and the algorithms for it are not very good or accurate. So I'm asking questions like: how can I build the best system to support computer vision? Now, that is a very narrow view of the world, and my view is getting bigger as I come to understand that multimodal and multi-scale data and physiology all need to be part of this data environment. But I'm going to have to work in a bigger data community than I am right now to solve those problems. Thank you.

Hi, of course I would never question your sincerity in your project. But as I see all your petabytes of data and your storage facilities, a kind of question: what is the ownership of this?

Okay, so I appreciate that you've never questioned me, as does my mom. I'm a man of marginal integrity. So, the machine itself has been sponsored by the Sloan Foundation and the National Science Foundation. It was built as a Data-Scope, and so this is an analogy for a microscope: it is a machine that is to do deep analysis of massive data sets. And connectomics is one of the named applications, one of six named applications, for this machine. So right now I'm in this luxurious place where, if you have data and a problem, I can send you a whole bunch of disks or we can stream it over the internet, and I can take the data. That's not a sustainable model. The question of how you sustainably fund large scale data management is, I think, one of the thorniest questions in big data science. I've participated a lot in looking at data sustainability proposals, or data sustainability as part of proposals and centers, and the answers are never satisfying. The only thing we have going for us in data management when it comes to sustainability is that the data we're processing today is going to occupy just a small corner of the next machine we build. Let's take the Sloan Digital Sky Survey as an example. This project is now almost 20 years old. We just got a sustainability grant to keep it alive. That's, first of all, not the greatest thing in the world: you have to get a sustainability grant to not drop the data on the floor. But the first data release of the Sloan Digital Sky Survey was 600 megabytes, and the second data release was 1.3 terabytes. As long as we're living in this exponential world, we can sort of skunkworks our old data sets into the corner of our next machine. I don't love that answer either, but it's an answer.

And the NSA are not tracking our every move in your database?

Into my database? I have no idea.

Have you ever considered using crowdsourcing, or was there a need for that, to do the image segmentation?

So crowdsourcing is, I think, incredibly important to addressing this segmentation issue.
And this is something that has been expressed as a thesis in EyeWire; Sebastian Seung's website is all about that. And I think that people that do automated segmentation algorithms and use them right now are in a place where they feel like crowdsourcing is the way to get data that is high enough quality to be used. Our project is currently doing a crowdsourced synapse classification task with, I think the number is, 400 high school students. One of the big questions about crowdsourcing is how do you capture the hearts and minds of people to log on to your site and do the task? By routing it to high school students, we have a captive audience. They require a little bit higher touch. If we take astrophysics as an example, Galaxy Zoo was a crowdsourcing platform built on top of the Sloan Digital Sky Survey that was incredibly successful. But astrophysicists are some pretty dedicated, strange people that don't value their time. So I can't speak to neuroscientists, but they may be pretty good at organizing their time. I certainly hope not.

Okay, you got the rest? Just out of curiosity, you've probably been looking at a couple of data sets over and over and over again. Is there any point where you're saying, look, reanalyzing this chunk of data doesn't make any more sense; we need fresh data to get something new out? In a way, is there, in your mind, a point where it's like, look, put that data to rest, we need new stuff, and let's not reanalyze it because we're not getting anything new out of it?

So no, there certainly is a need for new data, and the notion that you can make a single canonical data set that will answer all the science questions is nonsense. I am amazed at how quickly data becomes irrelevant in this field, just in the connectomics world, where you have preparations and stains and polymerization techniques, and some mesoscale, light-based techniques are going to compete. So all of that is true, and there may be a case for discarding data. I feel like, for the Open Connectome Project, the criterion we need to justify its existence is that individual data sets that capture interesting features are too large to analyze on single- or multi-core machines in a workstation environment or a lab.

Okay, thank you, Rector. Thank you.