Thank you. Let me start by introducing myself. In addition to being the director of the San Diego Supercomputer Center, I wear a bunch of other hats, and they all somehow fold into this talk. I have been the executive director of the Open Science Grid since 2015, and you will hear about that part; I'm a co-PI of the Pacific Research Platform with Larry Smarr, which you heard about a few years ago; and I am the PI of a follow-up infrastructure that brings both OSG and the PRP together into a national infrastructure, which is what I'll be focusing on. On top of that, I am actually a professor of physics who uses all of this infrastructure for my own research. So I'm a user, I'm a provider, I'm a service provider; I touch the infrastructure that I'll be talking about from a variety of different angles. And of course, as a physicist I'm a geeky scientist, and I will not be able to hide that completely in this talk. So, voila, let's get right into it.

My talk is about the path towards an open global cyberinfrastructure enabling digital research. I want to start by putting three reasons on the table for having an open cyberinfrastructure; later I will define what exactly I mean by this, show some use cases, and talk about how this all comes into place. At the end, I will make an appeal to all of you to join this ecosystem and use it as a foundation for building even more interesting things. My three reasons are democratization of access, openness for an open society, and big science as a team sport. Let me go into what I mean by this.

In terms of democratizing access, I really want to quote the bottom-up initiative called The Minds We Need. You can see their website in the corner here, and I've taken this literally verbatim from that website, because it's the most eloquent description of why democratizing access makes sense and is necessary. The objective is to connect every community college, every minority-serving institution, and every college and university, including all urban, rural, and tribal institutions, to a world-class and secure research and education infrastructure; I call that a cyberinfrastructure. And with particular attention to institutions that have been chronically underserved. Now, why are the chronically underserved so important? It's very simple: the next generation of geniuses, or genii, or whatever the plural of genius is, may come from anywhere. They're just as likely to come from the thousands of non-R1s and colleges as they are to come from the R1s. As a result, if you think about it from a very high level, what we want to achieve is efficiency of human capital. And that's ultimately an economic imperative, an intellectual imperative, a human imperative. We will do best if we manage to make opportunities available to as many people as we can. If you want to do this, then you have to democratize access, because at this point in time we have a massive gap, a massive opportunity gap, and there are a number of recent reports on this. The NSF did the Missing Millions report, which some of you may have read. There's this initiative whose words I have on my slide. But the bottom line is that at this point we're leaving people behind, and it is in our interest to catch up and not leave people behind. And that is not even talking about all the other reasons of equity and inclusion and that sort of thing.
There's a strict economic and intellectual reason in addition to all of the equity and inclusion issues. So this is in essence the vision that I want to present, and everything else in this talk is just detail. My long-term vision is to create an open national cyberinfrastructure that allows the federation of CI across all 4,000 accredited degree-granting higher education institutions, nonprofit research institutions, and national laboratories. In other words, my mental model is that anybody who engages in open science should have the ability to federate their resources, compute, storage, et cetera, all of their cyberinfrastructure, into a national federation; call it a federation, call it an ecosystem, call it what you want. We want to provide the underpinnings of this. I look at this as the necessary fourth leg in addition to open science, open data, and open source, which we're all very familiar with: I am talking about open infrastructure as an additional objective. And by infrastructure I mean literally anything that has hardware-y kinds of characteristics. So there's open compute, there's open storage, content delivery networks, there are devices, instruments, IoT, all kinds of things. I envision a future where wireless intersects with open infrastructure, where you have devices that can be hooked into this infrastructure in order to create data that then gets computed on, stored, et cetera, within that infrastructure. What I'm going to talk about in this talk is some of the foundational principles that I think this has to have, and then the state of the art: where we're at, who's using it, and where we're going next.

Now, I wanted to add one more thing that I normally don't talk about. Normally I just leave it at this: openness for an open society sounds corny, a sort of nice one-liner. But recent events made me realize how important openness actually is. In a funny way, speaking for myself, before the Russian war against Ukraine I sort of took openness for granted. Now it seems pretty obvious that I can't take openness for granted anymore, and that all the things I just talked about, which ultimately make us more efficient, are also the tools that will let us win the economic war against forces that are authoritarian. I encourage you to think about this, and I won't go into it any further. I took this from Wikipedia, and the way I connect it is that knowledge is never completed but always ongoing. Ultimately I will argue that the creation of knowledge increasingly requires multidisciplinary and multi-institutional teams, and this notion of science as a team sport is something that will come up on several of my transparencies.

Let me give you a starter kit, so to speak, on big science as a team sport. On this slide I have put four science collaborations of varying sizes. Let's see, can I use a pointer? Yes. This here is a picture of the IceCube counting room at the South Pole. It's an instrument deployed inside the South Pole ice: a cubic kilometer of ice, instrumented in order to detect the highest-energy neutrinos from extragalactic sources. This is the largest collaboration on the slide, maybe five or six hundred collaborators, and you see its global distribution on this map. The next one is XENON; I'll talk more about that in a second.
There is VERITAS down here, and next to it SPT-3G, which is also at the South Pole, neighboring IceCube. All of these are instruments at scales from 10 million to almost a billion dollars in instrumentation costs alone. What they have in common is that they are large international collaborations, from a couple dozen institutions to hundreds of institutions. They are big science as a team sport, both interdisciplinary, or multidisciplinary, and multi-institutional; that's what it takes to get this kind of science done. It's the science that I do myself. I work at the Large Hadron Collider in Geneva, Switzerland. My collaboration comprises 208 institutions across 48 countries, with a few thousand researchers involved.

So what do I mean by open infrastructure? Let's talk a little bit about this from the perspective of principles. The first principle is the power of sharing. When I think of open infrastructure, I want to create something in which any participating institution is able to share, dynamically, any fraction of its resources with any other. Given that I just argued that science is a team sport, multi-institutional in addition to multidisciplinary, this principle allows collaborating researchers to pool resources and therefore benefit to the greatest extent possible from in-kind contributions, both nationally and internationally. For every one of the collaborations that you saw earlier, that is an essential part of how they assemble the resources needed to get the science done: in-kind contributions from the collaborating institutions, in addition to funded project money with which they buy resources for the common good of the organization. In addition, in order to accomplish the democratization of access, we want institutions to be able to share for the common good of all. And in particular, to democratize access, funders like the National Science Foundation, but also others, may stipulate nationwide sharing as part of a solicitation. In fact, if you are familiar with the CC* set of solicitations from the NSF, that is actually written into the solicitation: 20% of the resources that you buy with the money you get, you have to give to the kind of organizations that I'll be talking about. Meaning, you have to join the open infrastructure and give that share away for the common good of all, nationally.

And how do you make this happen? In order to be able to create something like this, federation is a foundational principle. What do I mean by this? Federation to me means distributed control. It means that resource owners determine the policy of use for what they own, and resource users, or consumers, determine the policy of what they're willing to use. The first line should be immediately obvious: we live in a capitalist society, that's a given. If we don't guarantee that first principle, you're never going to be able to federate resources, because the owners ultimately control the resources and must be able to take them back at any moment; otherwise this whole thing doesn't work. The federated system then matches consumers to owners, respecting both sets of policies. And that is the core foundation that then allows you to grow into the kinds of structures and systems that I'm aspiring to, ones that would include thousands of institutions.
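To make the matching principle concrete, here is a toy sketch in Python. The resource names, communities, and policy fields are invented for the example; production systems such as HTCondor express the same idea with ClassAd matchmaking rather than code like this.

```python
# Toy illustration of federation as policy matching: an owner policy and a
# consumer policy both have to be satisfied before work lands on a resource.
from dataclasses import dataclass, field

@dataclass
class Resource:
    owner: str
    gpus: int
    allowed_communities: set = field(default_factory=set)  # owner policy
    preemptible: bool = True        # owner can reclaim the resource at any time

@dataclass
class Request:
    community: str
    gpus_needed: int
    accepts_preemption: bool = True  # consumer policy

def match(request, resources):
    """Return the first resource that satisfies BOTH sets of policies, or None."""
    for r in resources:
        owner_ok = request.community in r.allowed_communities and r.gpus >= request.gpus_needed
        consumer_ok = request.accepts_preemption or not r.preemptible
        if owner_ok and consumer_ok:
            return r
    return None

pool = [
    Resource(owner="campus-A", gpus=8, allowed_communities={"osg", "icecube"}),
    Resource(owner="campus-B", gpus=4, allowed_communities={"ligo"}, preemptible=False),
]
print(match(Request(community="icecube", gpus_needed=2), pool).owner)   # -> campus-A
```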
Now, the hard part in the end lies behind this question, or in the answer to this question: how can it possibly scale to support on the order of a thousand institutions? I want to talk a little bit about the core challenge, the structural challenge, that we have at hand here and have to solve. The structural challenge is a mismatch between what is required for growth in resources and growth in consumers. Resource owners want a minimal threshold for participation, because if the threshold to participate is too high, it's not worth their while to get connected. Think of it as: first comes the plumbing. The plumbing costs effort, but once you have the plumbing, you can actually develop long-term growth in how you use it. Nobody is going to join the plumbing if it's too onerous, too effort-intensive, to join. So the bar has to be very low in order to make that work. At the same time, resource consumers want a minimal threshold of participation as well, and they want a rich set of services. A rich set of services typically means a lot of effort by somebody, and somebody other than the actual researchers who use the infrastructure. That is fundamentally a tension, and that's the biggest problem in creating what I'm talking about. The technologies are easy in comparison. It's all the social stuff, the structuring, and the creation of scalability by matching these disparate needs for growth; that's really hard.

Now let me talk a little bit about the state of the art, both by showing you how the concepts I've just talked about are implemented in practice and how big these systems have gotten at this point. I'm going to take two examples that I'm intimately involved in, the Pacific Research Platform and the Open Science Grid, and I'll first put them in perspective with each other, because they sit at complementary places in the stack. So let me talk about that first. In my mind, there are three ways to build open infrastructure, and they distinguish themselves by where in the stack you connect and integrate the institutions. You can connect the institutions very low in the stack, so that an institution that joins doesn't even have to operate the operating system; that's what we do in PRP, and I'll talk more about that in detail. Or you can connect very high in the stack, at the cluster batch system and storage system layer; that's what OSG does. The two distinguish themselves in that if you join low in the stack, you give away all control: once you give away your operating system management, you've given away everything. If your institution has a cybersecurity or other control stance that would not allow that, then you need the effort to operate the batch system and the storage system, and then you can join the federation very high in the stack via OSG. By having these projects work with each other and figure out the interoperation, we can create something that gives institutions options. Different institutions will join at different layers in the stack at different points in their engagement. They might join first at the lowest level and say, I'm going to set aside an enclave here, join you, and let you do everything, just to get my feet wet. And later they might build up the other pieces and join at the higher level, because they want that control, the cybersecurity control and the policy control that it allows them to have.
And invariably there are going to be some institutions we will never reach, even though reaching all of them is my goal. Some institutions will have a mismatch between the effort they have and the desire they have: the desire for control costs money, it's that simple, and some institutions just won't have the money but will have the desire for control. When those two things mismatch, you have an economically impossible situation that you can't solve, and therefore those institutions won't be able to join.

So let me now talk more about the complementarity and the implementation, and again point out, by analogy, what we're really doing. We're implementing a bring-your-own-resource philosophy; the open infrastructure is really all about bring your own resource. The way my colleague Tom DeFanti coined it, we're doing somewhere between soup kitchen and potluck supercomputing. Potluck is when you have a party and everybody brings a dish: you bring your own resources to the party, you get to eat off everybody's table, and we all share. That is the ideal we're aspiring to. At the same time, in order to facilitate growth, we're also open to the soup kitchen model: basically, you can come to us and eat for free. You can bring your scientists, you can bring your researchers, we give them accounts, we give them training, et cetera, and make them successful with their science on this open infrastructure.

OSG is focused on campus cluster integration. The Pacific Research Platform is focused on individual-node integration instead of clusters: literally individual pieces of hardware are integrated into PRP's Kubernetes infrastructure, and voila, off you go, you have an integration point. In the following, I'll introduce these two models in a little more detail. The Pacific Research Platform pioneered integration at the Kubernetes and IPMI layers. For those of you not as geeky as I am, IPMI is a mechanism for doing remote installation of operating systems, and Kubernetes is a mechanism for doing remote deployment of containers in order to operate services of various kinds. So PRP allows you to join either at the IPMI layer, where we run your operating system, or at the Kubernetes layer, where we run services on top of your OS in the pods you provide us by joining our Kubernetes cluster. And in order to be successful, PRP realized that if you are starting that low in the stack, you have to go all the way and even provide people with a shopping list of what to buy. So basically every year we put together and validate hardware configurations; you can think of them as appliances. We validate appliances and put the exact part numbers on the web as exemplars of the kinds of things you can buy in order to achieve a certain functionality as you integrate into our system. The global hardware integration at this point spans 30-plus institutions; I think the last time I looked it was 35. Out of those, 11 are minority-serving institutions, 8 of them in California, given what you'll see the map looks like, and 6 are institutions in EPSCoR states. I've also put up some more information on who uses it. This infrastructure has its origins in the explosion of machine learning: the original funded project included gaming GPUs, deploying lots of cheap GPUs in order to support machine learners, mostly computer scientists but also engineering and other fields like that. OSG has its origin in big science, in a way.
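From a researcher's seat, the point of joining at this layer is that hardware contributed by dozens of institutions shows up as nodes in one Kubernetes cluster. Here is a minimal sketch of that view, assuming you have been issued a kubeconfig for such a cluster; the region label and the NVIDIA GPU resource name shown are common conventions, but the exact labels vary from cluster to cluster.

```python
# Sketch: the federated hardware seen as one Kubernetes cluster.
# Requires the 'kubernetes' Python client and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()                     # reads ~/.kube/config by default
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    name = node.metadata.name
    region = node.metadata.labels.get("topology.kubernetes.io/region", "unknown")
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{name:40s} region={region:15s} gpus={gpus}")
```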
So the two organizations I've been talking about actually come from very different user communities and with very different ways of doing this. This gives you a map of where it started. PRP originally was a California project first and foremost; it was expanded along the Pacific Coast and somewhat inland, and it has since grown into something that is basically global. We have hardware in PRP in Australia, Korea, Europe, all over the map.

Now switching to the OSG model: institutions operate their own compute and storage clusters, and OSG provides software and services to allow integration of those clusters. The way the sausage is made is that we basically build an overlay batch system on top of batch systems. Instead of submitting jobs to batch systems, we submit batch system daemons to batch systems all over the world; there are about 200 batch systems that we submit to globally. Those daemons then call home to a resource pool, the science is actually queued in our queuing system, and as resources become available we farm the jobs out across the globe, where they run. We create a common runtime environment, in part via containers and in part via pre-installed software, and voila, science looks the same and runs the same all over the globe, and you've created a federated infrastructure. Then we can make pools for different communities. We have one pool for all of open science that we operate in the spirit of eating our own dog food: given that we provide software and services, we must also operate that software and those services, in order to show others how it's done. And people can use us at various layers: we operate entire pools for some organizations, while for other organizations we operate only pieces of services and they operate their own pool. So you can think of OSG as a tool chest. All of these tools we operate ourselves for our instance, but all of those tools you can use for your instance, and the tools can interoperate and federate with each other.
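Here is a toy sketch, in plain Python, of the overlay idea just described: user jobs are queued centrally, while placeholder daemons started by each site's local batch system call home and pull work. The site names are invented, and the production implementation is HTCondor-based pilot jobs (glideins), not code like this.

```python
# Toy simulation of an overlay batch system: pilots pull work from a central queue.
from collections import deque

central_queue = deque(f"science-job-{i}" for i in range(6))   # user jobs, queued centrally

class Pilot:
    """A placeholder job that a site's local batch system started for us."""
    def __init__(self, site):
        self.site = site

    def call_home_and_run(self):
        # The pilot asks the central pool for work and, if there is any,
        # runs it inside the site's allocation in the common runtime environment.
        if central_queue:
            job = central_queue.popleft()
            print(f"{job} runs at {self.site}")

pilots = [Pilot("SDSC"), Pilot("Wisconsin"), Pilot("Nebraska"), Pilot("Amsterdam")]
while central_queue:
    for pilot in pilots:
        pilot.call_home_and_run()
```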
That's the map of OSG's deployment. As you can see, it's a heavily US-focused organization, but it has green dots all over the map worldwide. A green dot is an organization, and there are 149 listed here. I'll say on the next slide what I mean by organization, because one of the strange things in OSG is that even just the schema, if you wish, even how we name things, is actually a challenge: how do you conceptualize what something is, given the flexibility that our tools provide? What we call things is non-trivial, in some funny way. So there were 64 US institutions that contributed compute power last year, but we count institutions, organizations, and clusters all separately. Here's an example, using myself: UCSD is an institution. It has two green dots on this map, one for the San Diego Supercomputer Center and one for my own group's physics cluster; I run a cluster of about 10,000 cores and five petabytes of disk space in my group for my science, and that is its own green dot on the map. In addition, in that last year actually six clusters contributed from UCSD, I think five of them from... actually, that's not even right: three of them from SDSC and three of them from my group, built at various times. And these clusters themselves were varied. Two of them were entirely inside the commercial cloud: you basically deploy an entire cluster in the commercial cloud, interface it with OSG, and voila, you have federation, with the cluster in the cloud owned by a research group. And then another one is PRP itself, which is actually a cluster inside OSG, and it itself spans 30-plus institutions. So you have a sort of scaffolding, something conceptually re-entrant: you can build a cluster that then joins as a cluster, even though the cluster itself is built by federating at a much lower level, and therefore you can federate federations, in a way. And we actually do this. The bottom line here, in terms of democratization of access to cyberinfrastructure, is that 26 of the 64 US institutions are either a minority-serving institution, in an EPSCoR state, or a non-R1. To give you an example of a non-R1 that is actively participating: the American Museum of Natural History in New York City is a participant. It federates its own resources, and its researchers do research on OSG. The museum itself is a research organization; the research arm of the museum is collaborating with us, both providing resources and providing consumers.

The next thing I want to talk about is the OSG data federation. There is one thing that is fundamentally different between data and compute, and that is locality of reference: data has to be integrated where it is. In our compute integration we actually run the API that integrates the cluster away from the cluster, and then log in remotely via SSH into the cluster to submit to the batch system. Here, by contrast, we need service orchestration and service deployment that is geographically distributed and that we operate ourselves. So OSG uses PRP, and the PRP concept, in order to have global deployment of services, global deployment of data origins, global deployment of data caches. We've built, in essence, something like a content delivery network. You can think of it as YouTube, but it's not YouTube in the sense that you have to upload your data; it's YouTube where you can join your data and join your hardware into the system. You bring your resources, the resources have the data on them, and voila, the data becomes available, and we are responsible for operating all of the caches to make data access transparent. I'll talk more about this in a moment. Right now we have 10 data origins. One of them is an open science data origin for the entire community; anybody can come and bring their data. The other nine are basically community origins, dedicated to certain communities that have joined their data into our global namespace in order to benefit from us.
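A toy sketch of the read path in such a data federation: a job asks a nearby cache for a file by its global name, the cache fetches from the origin on a miss and keeps a copy, and later reads are served locally. The paths and hostnames are invented; the production caches are XRootD-based services, not code like this.

```python
# Toy read path for a content-delivery-style data federation: caches in front of origins.
ORIGINS = {  # global namespace -> where the authoritative copy lives
    "/ligo/public/O3/strain.h5": "origin.ligo.example.org",
    "/xenon/raw/run123.root":    "origin.xenon.example.org",
}

class Cache:
    def __init__(self, site):
        self.site = site
        self.store = {}                       # local copies, keyed by global name

    def read(self, path):
        if path not in self.store:            # cache miss: go to the origin once
            self.store[path] = f"bytes of {path} from {ORIGINS[path]}"
        return self.store[path]               # later reads are served from the cache

cache = Cache("cache.chicago.example.org")
cache.read("/ligo/public/O3/strain.h5")       # pulled from the origin
cache.read("/ligo/public/O3/strain.h5")       # served locally on the second read
```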
Let me at this point mention two science use cases. I've talked a lot about concept, philosophy, and implementation; now I want to talk a little bit about who uses this and what they do with it. I picked one big-science and one medium-science example, and I'll later give you examples from individual scientists, all the way down to individual undergrads, who can and do use this infrastructure. So let's start from the top. There is a global quest these days to understand the most violent events in the universe via the measurement of gravitational waves. The physics Nobel Prize a few years ago went to LIGO for this work. We worked with LIGO from before the Nobel Prize; in fact, the very discovery that earned the Nobel Prize was confirmed on our computers. They ran it at least twice: once on their own infrastructure and once on our infrastructure, with two independent teams, to make sure they actually understood what they were doing. What's special here is that the very science requires multiple instruments in multiple locations around the world, because not only do you want to detect the wave as it flies by the Earth, you want to pinpoint where it came from; and in order to pinpoint where it came from you have to triangulate: you measure in three different locations and then point back to the source. However, the fact that gravitational waves are polarized, they actually have an XY polarization, means that you need at least four instruments in order to be guaranteed to be able to pinpoint the source, because if the polarization aligns badly with an instrument, that instrument doesn't see it. So there is a fundamental physics reason for an infrastructure that is global and can be joined by instruments that are funded by different countries and are completely independent. LIGO is US-funded by the NSF; Virgo is European-funded. The two have nothing in common on the hardware end of things; they collaborate, but they build their instruments completely independently, and they even use different technologies. And KAGRA in Japan has even less in common: whereas LIGO and Virgo have conceptual similarity, KAGRA uses a different detection concept on some level. All of these different instruments bring their data into our namespace from different locations in the world: Virgo brings it in from Europe, LIGO brings it in from the US. The actual disk space is US-resident versus Europe-resident, but the namespace, and the access, visible to jobs all over the world is identical, so they can run on the same data together in order to do the science via our infrastructure. In fact, if you look at some of these caches in odd locations, the outliers are basically put there because there are LIGO collaborators in Australia, there are KAGRA collaborators in Korea, there are Virgo collaborators in Europe, and the caches were deliberately placed near those institutions such that the LIGO data gets cached there and the latest relevant data is always available. I'll have more on that later.

Then I'll give you a mid-scale science use case. So far we've been talking billion-dollar investments; LIGO and the gravitational-wave program are multi-billion-dollar investments. Now we're talking about the XENON collaboration. The XENON experiment is probably in the neighborhood of 10, 20, maybe 30 million dollars; it is an order of magnitude or two away in size from what LIGO is, and yet they have the same challenges. It's an instrument inside Gran Sasso, inside a mountain, with all of the overburden of the mountain so that essentially only dark matter penetrates all the way through. Dark matter is weakly interacting, and they're searching for weak recoils of nuclei from dark matter. Think of it this way: the Earth flies through a dark matter halo in the galaxy, and as we fly through, that dark matter hits their detector, and they want to detect it. This has never been done before, so they're basically building something to discover something that is known to exist, and they're in a race for discovery. They have about a 20-institution collaboration to build this, and they've been at it for some 20 years, because they've built generations of instruments in the same laboratory, one after the other, each basically an order of magnitude larger than the previous one.
The two top-cited papers are in the neighborhood of 2,000 citations, and that's really big-impact science. So now, what's their problem? Their problem is that they need the in-kind contributions from collaborators across the world. They have the instrument in Italy, the tape archive in Sweden, disk storage in seven locations across the Netherlands, Italy, Israel, France, and the US, and they have compute resources mostly in Europe and the US, plus an allocation on an NSF-funded supercomputer at the San Diego Supercomputer Center. So we integrate their globally distributed in-kind computing and storage, plus the allocated NSF parts, and they can then use the system as if it were one batch system and one data distribution network, and we provide the services that make this happen.

Now let me talk a little bit about how you can join your data. One way is obvious: you can just bring it onto our storage. The less obvious one is: how do you federate a file system at your institution with us? Given that people like to hide their file systems behind firewalls, the mental model we have is that you put a dual-homed Kubernetes server on the firewall and export onto that file server only the pieces of the file system that you want to export into the federation. On that server we then install all of the software that does the magic; we operate the software that is the magic, you just operate the file system, done. In this model you can choose the kind of functionality you get. In the most limited way, if you're paranoid about making space accessible to the outside world, you can just export read-only, and then you export your data and nothing else. Alternatively, you can say: this part of my file system gets exported read-only, so everybody who has access credentials should have access; this part should be writable by certain privileged people, and they can then use it as their home base to bring the data that they produce on the distributed infrastructure back home and reuse it for future computation. There is therefore an integration of the locality at home and the globally distributed infrastructure, in order to share as much as possible.

This is a bit of a geeky detail: we have a strict separation of namespace and physical server space. That has the advantage that you can, out of band, move things around, retire file servers, even move a data set across the country, and the users will never notice, because the global namespace doesn't change and can be served from multiple locations with multiple replicas of the same data. That gives enormous power to build interesting structures and, for example, replicate across continents and not rely only on the caching.
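The namespace/physical-server separation can be pictured as one level of indirection: the global name never changes, while the list of physical replicas behind it can be edited out of band. A toy sketch with invented hostnames:

```python
# Toy: one global namespace, many movable physical replicas behind it.
replicas = {
    "/xenon/processed/v8/events.parquet": [
        "disk.nikhef.example.nl",             # replica in Europe
        "disk.sdsc.example.edu",              # replica in the US
    ],
}

def open_global(path):
    """A job only ever sees the global path; any live replica will do."""
    for host in replicas[path]:
        return f"reading {path} from {host}"  # first reachable replica wins

# Retiring a server or moving a dataset is an out-of-band edit to the replica
# list; the global path, and therefore user code, never changes.
replicas["/xenon/processed/v8/events.parquet"].append("disk.newsite.example.edu")
replicas["/xenon/processed/v8/events.parquet"].remove("disk.nikhef.example.nl")
print(open_global("/xenon/processed/v8/events.parquet"))
```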
This is my slide about the top users of the data federation. What you see here, ordered by project, is the data read, the working set, which is the unique data accessed within the year, and then the ratio of data read divided by working set, which is the re-read multiplier. I've ordered all users within the last year by the size of the data read, the top being the largest. If you go through this list, there are the obvious ones, like LIGO, that this infrastructure was partially built for, and there are non-obvious ones: MINERvA, NOvA, and DUNE are large collaborations whose instruments are either already online or, in the case of DUNE, the next billion-dollar experiment in the DOE high energy physics experimental program, which is just being built; as they go through their construction and R&D phases, the infrastructure is part of their R&D efforts. In addition, you see that LIGO makes its data public after 18 months for the global community: there is a part of the namespace that is private to LIGO, which they use themselves, and there is a part of the namespace that is public for all of open science, and data effectively moves from the private to the public part 18 months after it was taken. You can see that both the private and the public parts are heavily used. Then you have individual groups, like the computer architecture lab at Tufts University, essentially a single faculty member and his group doing computer architecture research. Or down here, let's see if I can... there is the Steward Observatory at the University of Arizona, which is basically an astronomy shop, and they do data analytics on the infrastructure. Then there is REDTOP, an individual PI who is developing an idea for a next-generation particle physics experiment; he develops that idea with his students and postdocs and creates a collaboration of his own on our infrastructure. And then you have Malkrist, which is quantum chemical machine learning, and biomedical informatics, and microbial genome sequence data. In essence, we have pretty much every large scientific domain represented in one way or the other. There are economists on here, there are political scientists on here, we have the social sciences, and there is a lot of life science, because genomics in one way or another involves a lot of data, so that ends up being a lot of what's happening. Interestingly enough, there's a lot of evolutionary biology, which I didn't really know existed as a field before people started talking to us about it. So all kinds of things happen, a lot of them as individual groups, individual PIs, sometimes individual grad students. We have an in-person summer school every year that accepts, preferentially, one person per institution or one person per group per year, so we try to spread the goodies around, in a way. The graduates from that summer school invariably become researchers using the infrastructure, so we have a lot of single grad students who bring the knowledge they learned at the summer school into their groups, become the trainers of their groups, and thereby grow the scale.

So what's next? In 2022, let me stop my alarm before it beeps at me, what's next is that we got funding from the NSF to spread this systematically across the continental USA. We got a project funded, it's actually called the Prototype National Research Platform, and it does what you see here on this map; this is basically the first figure on the first page of the proposal. The key in this context is that we are building hardware in three places: we deploy compute and storage, basically GPUs and storage hardware, on the west coast, in the middle of the country, and on the east coast. Then, when I wrote the proposal, I asked myself: what's a reasonable distance over which these caches work well, based on our experience? And I said 500 miles; 500 miles is the max I'm comfortable with for cache access, for all kinds of reasons, like latency, and so forth. Then I said: let's draw circles on the continental US, and how many caches would I need to get coverage? I moved them around, overlaid them on the Internet2 network backbone, and picked Internet2 locations that give me coverage, more or less, of the continental United States. That's what gave you this picture.
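The 500-mile rule of thumb is easy to check with great-circle distances. A quick sketch with approximate city coordinates, showing why the San Diego to Houston stretch leaves a hole:

```python
# Rough check of the 500-mile rule of thumb using great-circle (haversine) distance.
from math import radians, sin, cos, asin, sqrt

def miles(a, b):
    """Haversine distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(h))           # Earth radius of roughly 3959 miles

san_diego, el_paso, houston = (32.7, -117.2), (31.8, -106.4), (29.8, -95.4)
print(round(miles(san_diego, houston)))       # ~1300 miles: more than two 500-mile radii apart
print(round(miles(san_diego, el_paso)))       # ~630 miles: outside San Diego's circle
print(round(miles(el_paso, houston)))         # ~670 miles: and outside Houston's, so El Paso sits in the gap
```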
And so we're collaborating with Internet2 to place caches at their points of presence, in locations that make national coverage more or less work. It's not perfect. You can see that there's a little gap between San Diego and Houston: they're well over 1,000 miles apart, so the two 500-mile circles don't touch, and El Paso lies more or less in the gap, if you wish. There's also a gap that you don't see: Miami is more than 500 miles away, so we have some other projects that fill in some of these holes; one project actually places a cache in Miami to get the lower part of Florida covered, and so forth. So we're collaborating across multiple projects to basically tessellate this in. We have a cache in Guam that we're deploying, we have one in Hawaii that we're deploying, so we're growing this out of the strength of multiple projects, with one project giving the bulk of the spread, in a way.

And what's the federation model? We're also introducing something that goes beyond what OSG did in the past. We've noticed that hardware costs are actually not the limiting factor for most institutions; the limiting factor is operations support, sysadmin support, all of the stuff above the hardware. Moreover, if you think about it, operations cost scales something like logarithmically with hardware cost, whereas hardware cost is linear in the volume deployed: two systems are twice as expensive as one system. So to the extent that we can create a single team that automates the crap out of this deployment and is really expert at operating remote hardware, we can grow the infrastructure cheaply, by letting people join cheaply: they only have to pay for the hardware, we provide all the other support, and it does not cost us linearly in sysadmin support. That's the basic model behind the proposal for spreading this out.

But if we're doing this, then people don't just want to buy compute, they also want to buy storage, and they may not want to buy storage and run their own file system, because then they need sysadmins and data admins. So we're offering a concept where you can buy your storage and join a regional system, and therefore we are effectively offering a distributed storage infrastructure, based around Ceph, within a regional context. Fundamentally, we're trading off usability and performance here: file systems are beautiful to use, everybody knows how to use a file system; from a performance standpoint you want all of the storage in the same data center, but from a usability standpoint you want these regional storage deployments to be one system, to get large file systems and then use them for the user community. So we're allowing ourselves to deploy in multiple locations in a region: whoever owns the storage owns the hardware and puts it into their Science DMZ, and voila, that is the birth of this concept, and we're going to see how well it works. We're in the process of deploying this right now: we have regional storage in Southern California, we have regional storage in Northern California, there's regional storage on the east coast and in the midwest, and so we're in the process of covering regions with this, similar to the caches.

The other thing that's new here is that, when I ask myself as a domain scientist, transparency is nice as long as it works well; however, often I actually want to be in control of where my data is located. Ideally I want to be able to say: this part of my file system, everything below this part of my path, should be distributed on the east coast and the west coast, and that part of my path should just be in the middle of the country, because I have resources in the middle of the country to do this and resources east and west to do that. That kind of user-level granularity in decision making is introduced with this system for the first time. We're going to see how well it works, and we will work with communities, with individual scientists, who want that kind of functionality.
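A toy version of that per-path placement control: the user maps path prefixes to the regions that should hold the data, and the longest matching prefix wins. The paths and region names are invented; this is just the flavor of policy the system is meant to support, not its actual interface.

```python
# Toy per-path placement policy: the longest matching prefix decides the regions.
placement = {
    "/home/jane":              ["central"],           # default for everything under my home
    "/home/jane/analysis":     ["east", "west"],      # hot analysis data on both coasts
    "/home/jane/analysis/tmp": ["west"],              # scratch stays in one region
}

def regions_for(path):
    """Pick the policy of the longest prefix that matches the path."""
    best = max((p for p in placement if path.startswith(p)), key=len)
    return placement[best]

print(regions_for("/home/jane/analysis/2023/skim.parquet"))   # -> ['east', 'west']
print(regions_for("/home/jane/notes.txt"))                    # -> ['central']
```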
The other thing that is important: we have full interoperability across the legacy PRP, the new NRP, and the OSG. That is really important, and I'll get back to it multiple times. But before I go to my concluding slide, I want to impress upon you, in particular because this audience is really the audience that I would like to see work with us next, that all of what I've described is just the beginning. When interesting science gets done, you need a hell of a lot more than just a storage layer. In a funny way, I look at myself as a plumber. The networking guys are the real plumbers; I'm the service plumber. I plumb services together to make structures that are national and global. Above my plumbing there's a whole lot of other stuff that people need. For example, a shout-out to my colleague from SDSC, Melissa Cragin, who talks later today about FAIR: all of the FAIR stuff, metadata, discovery, all of these kinds of services need to exist on top of the foundations that I've talked about in this talk. What I'm really saying is that we are right now missing a large variety of high-level data services that we will never get involved with. The people involved in the projects that I've talked about are much lower-level people; they're basically plumbers, high-end plumbers, very well paid plumbers, and they create infrastructure foundations, but they're not the kind of people you need in order to build all these higher-level data services. I'm envisioning that libraries have a role to play there, that curation has a role to play there, curation in the sense that it's end to end: all the way from the data, to the containers that encapsulate the runtime environment, to the output that is ultimately the knowledge created, all of it packaged in a digital book that can live in our infrastructure and can be used for training the next generation of scientists, by learning how to reproduce a published result and then expanding on that result with the tools that were used to create it. There is a space here for the libraries to use what I've described, build on it in interesting and novel ways, and ultimately take the next step in the democratization of access, because there is all of the training necessary to really make access useful, all of the applications necessary as templates that people learn from, and all of that is outside the scope of the projects that I'm involved in. It's other people, it's people like you, who would be the ones working on those kinds of things, in my mind. And again, the very openness of the CI that we create makes it possible to have an ecosystem where other people can write proposals to funding agencies, all funding agencies, not just the NSF, and build on what we're creating, on what the NSF supports, on what the bring-your-own-resources philosophy grows into beyond what the NSF supports, and therefore build something much more intricate and much more impressive than what we've done.
And that's where I want to leave you: it takes a cyberinfrastructure ecosystem to be able to build and operate the open infrastructure that will ultimately achieve our joint vision of democratizing access for all. We're encouraging all of you to join us, and if you want to get in touch with me, that's my email address. Thank you.

This is really interesting work, and very necessary. I'm Geneva Henry from George Washington University. What struck me throughout, and I thought you were going to go there at the end, is that there's a piece of cyberinfrastructure, if we think of this as really the scholarship production infrastructure, and that's publishing. One of the big challenges we've had in trying to get publishing open is the ownership of the knowledge. You went pretty much there, about the knowledge we get out of the research that's done on this infrastructure, but I think it's going to be really critical to recognize that the publications have to be part of this larger infrastructure, which then drives questions of ownership and governance. It's one of those things where we have not been able to crack the nut of ownership, and the publishers, who do have the resources, are still in control, even though some of them are moving to more open access. In libraries we're spending millions and millions of dollars to make the knowledge that we produce through these kinds of infrastructures, and that we peer review, available to people, and we're buying it back. So it seems like until we can get to a cyberinfrastructure that includes the whole enchilada, the entire research process, we're going to have a challenge. It also seems that the universities, the R1s, the AAU institutions, need to step up at the highest leadership levels to the ownership of that full production of knowledge, which includes the publications, and take that over, so that there is stability and some level of governance. So I'm curious to hear how you have thought about publication and where that fits in with all of this.
Yeah, and the short answer is that that's exactly where publication fits in, meaning: for good reason, I work as a plumber. Intellectual property and everything around it, and the need for those things that get funded by our collective richness to become collectively available and open, is a can of worms that sits on top of the infrastructure we build. It's a can of worms that must be addressed, but it's a can of worms that I've declared above my scope, in a way. And I totally agree with you that, when you think of there being an intellectual stack in addition to a technical stack, the intellectual stack includes the publications at the top, and those have to be open as well. But there I sort of plead ignorance and allow myself to focus on the lower levels of the intellectual stack, if you wish, and I'm cheering you on as you address the upper levels of the intellectual stack and fight the good cause of open science and open publications. Does that make some kind of sense?

Hi, Curt Hillegas from Princeton University. As I look at your accounting of how things work, and I haven't looked as much at storage, but if you look at computational infrastructure, the hardware cost is only about a third of the lifetime cost of a system if you keep it three to four years; if you keep it longer, it actually gets worse. And that remaining cost is dominated, for us, not by sysadmin cost but by facilities cost: power, cooling, the amortization of the data center, and such. So that means that people who are contributing now are contributing a lot more than just the percentage of hardware, correct? Where is the incentive for people to contribute their hardware if they're not getting back nearly as much as they're giving? It really inflates the contribution people are making.

Correct. I try to slide this under the radar screen, and I fully acknowledge it. Depending on where you are in the world, power and cooling cost anywhere from the third you mentioned on up; in Europe, in my field, where each country buys and operates its own hardware, people typically say 50-50: 50% of the lifetime cost lies in the hardware itself, the other 50% in just operating it. In a way, you can think of that as the institutional contribution, and what you get out of joining is access to everybody's spare cycles. What we found is that there is an enormous willingness to buy hardware in order to have, by buying it, first access, preferred access, access when you need it, but to give everybody else access to that hardware when you're not using it; and when you do need hardware, you have extra access to everybody else's. So those hungry enough and clever enough to batch their work such that it can run at night will benefit from the hardware of those who are rich enough to afford interactive access during the day. There is, in a way, an implicit benefit calculation that each and every institution makes for itself: is it worth it? And the is-it-worth-it calculation has parts that are just dollars and cents, and parts where it is worth it because I'm supporting science X of my researcher Jane Doe, who wants to be part of, and contribute to, a collaboration in order to have scientific standing. When you have a collaboration as large as LIGO or the LHC, or even smaller ones like XENON, there is an acknowledged need that we collectively have to contribute in-kind resources, and there are many ways of contributing in-kind resources.
You can contribute effort to the collaboration on the instrument, which requires you to send people to Italy; that's costly. Instead, you could contribute computing resources to the collaboration. That already buys you in, and it gives you a competitive advantage, because you're competing in addition to collaborating: you can advance your scientific mission by having those resources, using them, and sharing them with your collaborators. In my own field, I got involved in computing to a large degree to establish a competitive edge for myself scientifically. I own hardware because it allows my group to do LHC data analysis more competitively than I could without it. So it's a complicated calculation, does that make sense, and it has lots and lots of facets. And the power bill, the energy bill, is a very interesting one. I should probably not be saying this, but from a PI perspective, power is free at the university, it doesn't cost anything, space is free, cooling is free. That's the way the university rolls, and that gives very interesting incentive structures.

Let me repeat your question in case it wasn't picked up: a lot of these arguments work best for the collaborations. That's true. However, individual PIs can transition from the soup kitchen to the potluck. They start out at the soup kitchen, where they 100% just benefit from other people's money in terms of hardware investments, and they can then write proposals with their IT groups on campus to programs like CC*, and thereby migrate from soup kitchen entities to potluck participants. So what I'm trying to create is a hell of a lot of gray everywhere, so that there is enormous flexibility: become a soup kitchen entity, become a collaboration member, sometimes bring your institution with you, sometimes just bring yourself, sometimes bring the hardware under your desk. Well, not quite: it has to have open access, so to speak; from a networking perspective there are some restrictions. But on PRP we see an awful lot of the hardware being contributed by individual researchers, because we have a double-whammy advantage. NVIDIA doesn't sell gaming GPUs to data centers; they claim they're not high enough quality. We use gaming GPUs all over the place; everybody buys them for doing machine learning and puts them under their desk. So for individuals who buy, we basically give them the option to buy 8 gaming GPUs in a single system for cheap, dirt cheap in comparison to what you would buy in the cloud; there is a factor of about 8. That means that as long as your duty cycle on the hardware you own is at least one eighth, you come out ahead. Almost everybody who does machine learning uses more than one eighth of their hardware, but nobody uses 100%. So we have an enormous amount of hardware that's just sitting there idle at night, because the machine learners aren't using it: they bought all of these gaming GPUs that they primarily use interactively, because that's their workflow, and all of those GPUs are harvested at night by IceCube. So there is a strange ecosystem of mutual benefit at play here, where the economics is often interesting and non-trivial. Does that make sense? Any other questions? Thank you.
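For concreteness, here is a back-of-the-envelope version of the duty-cycle argument in that last answer; the prices are purely illustrative, and only the factor of about 8 comes from the talk.

```python
# Back-of-the-envelope break-even for owning gaming GPUs versus renting in the cloud.
owned_cost_per_gpu_hour = 0.25                             # illustrative amortized cost, not a quote
cloud_cost_per_gpu_hour = 8 * owned_cost_per_gpu_hour      # the "factor of about 8" from the talk

def owning_wins(duty_cycle):
    """Owning wins once your own use exceeds about 1/8 of the hours you bought."""
    effective_cost = owned_cost_per_gpu_hour / duty_cycle  # cost per hour actually used
    return effective_cost <= cloud_cost_per_gpu_hour

for d in (0.05, 0.125, 0.5):
    print(f"duty cycle {d:5.3f}: owning is cheaper -> {owning_wins(d)}")
# 0.050 -> False, 0.125 -> True (the break-even point), 0.500 -> True
```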