Good morning, everybody. My name is Vas Vasiliadis, from the University of Chicago. I'm going to be talking to you about the Globus project. A couple of things today. Just by show of hands, who's familiar with Globus? Some of this might be a little repetitive, I apologize. I'm just going to run through a high-level view of what the Globus service is and how it looks from a researcher's perspective, and as part of that I'm going to do a brief, live demonstration. I'll talk about some common use cases, how it's being used both in sort of an interactive mode as well as in some automated research data management flows. I'm going to spend a little bit more time on how it's being used in the data publication context, because obviously that's more relevant to you folks. And I'm going to wrap up with some of the things we're doing on the sustainability side, because that's something that's been core to our mission from day one. Day one was about ten years ago, and this was sort of the picture on campus of what people's data looked like in their labs. I think one of these was actually on our campus, I forget which one. I show this slide because I'm particularly intrigued by the numbering scheme, which you can't see from the back: each of these drives is numbered. So there's obviously an index somewhere that is being used to track that data, and one wonders what happens when the grad student who has that index in an Excel file leaves the institution. So obviously the question is: in this sort of environment, how do we move, share, and describe data, make it reproducible and discoverable, and really, how do we facilitate data stewardship throughout the research life cycle? We came at this from Globus, and here's just a quick snapshot of where Globus has come from. As a technology, it's been around for the better part of two decades now.
The early days were fraught with peril; for those who might have tried using Globus back then, it was a very painful exercise. So about eight or nine years ago we launched a service to really make Globus accessible to anybody. No deep tech experience required. Some milestones in particular: again, I'll talk to sustainability at the end, but we are approaching our 100th subscriber. At least I'm hopeful that will show up sometime next month. And we are well on our way to becoming fully self-sustaining. We're well over 50% at this point, but we still do have a lot of grant funding. The way I like to summarize what Globus does is that it really acts as a bridge between data and people, both within the organization and between institutions. At its core, what we try and do is present a unified view of data, irrespective of what type of storage system it's on and where that storage system lives, be it on campus, in a national facility somewhere, in the public cloud, and so on. And then we try and make it as easy as possible for researchers anywhere to share with their collaborators and also to make their data sets more widely available to the community, either in the form of public repositories or in their own storage, again on premises or in the cloud. So, the core functions of Globus. We started out basically as a file transfer tool. The idea here is that a researcher can come along and say, move some data from a storage system somewhere (and as I say here, that might even be an instrument; in many cases that is the case) to some other machine. That request goes into Globus, and at that point the service takes over. The researcher can go off and shut their laptop down, and the data just moves between those two systems, with Globus monitoring things and making sure that it can recover from any transient errors and so on.
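For readers who want to see what such a transfer request looks like under the hood, here is a minimal, stdlib-only sketch of the JSON task document that the Globus Transfer REST API accepts. Field names follow the public Transfer API as I understand it; the endpoint UUIDs and paths are placeholders, and a real submission would also need a submission_id and an OAuth bearer token.

```python
def make_transfer_document(source_endpoint, dest_endpoint, items,
                           label="demo transfer"):
    """Build a Globus Transfer task document; items is a list of
    (source_path, destination_path) pairs."""
    return {
        "DATA_TYPE": "transfer",
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "label": label,
        # The service itself retries transient faults; these flags just
        # control the email notifications mentioned in the talk.
        "notify_on_succeeded": True,
        "notify_on_failed": True,
        "DATA": [
            {"DATA_TYPE": "transfer_item",
             "source_path": src,
             "destination_path": dst}
            for src, dst in items
        ],
    }

# Placeholder endpoint UUIDs and paths.
doc = make_transfer_document(
    "aaaaaaaa-0000-0000-0000-000000000001",
    "aaaaaaaa-0000-0000-0000-000000000002",
    [("/ds083.2/grib2/", "/projects/climate/grib2/")],
)
```

In practice most people would use the Globus CLI or Python SDK rather than building this document by hand; the point is that a transfer is just a small, declarative job description that the service executes and retries on your behalf.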
So it's really a fire-and-forget kind of mentality, even when you're moving small data sets, but typically it's with the larger data sets that it becomes critical to make sure things are progressing as they should. When you want to share data, the researcher selects the data they want to share and sets permissions for who can access that data via Globus. This is not changing anything on the underlying storage system. Globus has this philosophy of really not touching the storage: the storage is yours, you administer it and configure it and set policies on it, and Globus just sits as an overlay on top of that. So we have these permissions that are sitting above your storage system, if you will, and then another researcher at another institution can come along to Globus and access the data the same way. And then when you get to a more formal publication situation, what you can do is attach metadata to this collection or data set, optionally put it through some kind of curation step, and then we have a search facility whereby others can come along, discover and reuse that data, and pull it down to their systems by the same means. All of this, for the most part, is available just via a web browser. We do have other interfaces I'll talk about briefly later on. I say it's accessible on any storage; that's not strictly true. We don't support everything, but we are working hard to add additional connectors to support essentially any type of storage system. And, very importantly for many people, they can use the service with their existing identity, so they don't have to create yet another username and password just to access the service. So with that, I'm just going to switch out of slides and do what my mother always said I shouldn't do, which is live demos. So let's hope this works. Okay, so here's the home page. I'm going to go log in, and here is where you'll see... can everybody read that at the back?
Should I make it a little bigger? Okay. So this is our federated identity system; we call it Globus Auth. We have about 560, maybe close to 600 now, different identity systems that we trust. These come to us through different federations like InCommon and eduGAIN. If you don't see your institution listed here, you can always use Globus ID; that's our own username and password system, so you can go and create a Globus username and just access the service that way. But for most people, hopefully they have their institution listed here. When I click continue, this will actually redirect me out to my identity system at UChicago. The thing for us is we don't want any credentials flowing through Globus if we can avoid it. So I will log in here with my UChicago ID, and we have two-factor authentication enabled, so I will get a code on my phone here and approve that. Then I'll be redirected back to Globus, and I'm logged in and looking at what we call the file manager. So this is our main screen. Maybe I should make that a little bigger so we can read it in the back. We have this concept of endpoints and collections. An endpoint is a system that has the Globus software installed on it and can be accessed via the service. I'm just going to access one of the storage systems that I have data on, which is our high-performance cluster on campus called the Midway system. You'll see when I click on that, it logs me right in. That's because, wherever possible, Globus will try and do single sign-on based on the credentials that I've given it. In this case I logged in with my UChicago ID; this system recognizes it, so all is good. I'm going to actually do a quick transfer between two other systems. A common use case for us is people accessing large public data sets like the NCAR Research Data Archive, which stores, I think, upwards of 30, maybe 50 petabytes of climate research data. So actually, let me click on a bookmark that I have there.
So here is a data set that's about a terabyte in size. Let's say I wanted to move this to one of the other systems I have access to, which is at Argonne and is called the Petrel system. So I'm going to go there, and then transferring is pretty straightforward: I select the files I want to transfer, and down here I click the start button. And that's it. You'll see up here it says a transfer request was submitted successfully. Globus now has this job to do, and it will go off and do it, and if it runs into any issues it will try and work through them. If it needs some intervention from me, it will send me an email. So for instance, if this is a very long-running transfer (and we've had transfers that run upwards of months), my credentials may expire, in which case Globus would say, hey, go and log in to this system again so the transfer can continue. And it will pause and resume things automatically, so at this point there's nothing I need to do further; I can just go about my business. So that's, in the simplest terms, what transfer is about. The more interesting thing, I think, at least, is the ability to share, and I'm going to go back here to my Midway system for that. Let's say I wanted to share this folder with all of you here, say I had a group of you set up in Globus. I can select that and click share, and it prompts me to create a share. Let me just call it demo share, though I think I have another one that I created by the same name. And now I can grant permissions to anybody that I want to. I can only grant permissions that I have; in that folder I do have read-write access, so I could give you access to read and write files there. I can select an individual user, a group of users, or I can make it more open and say anybody who logs into Globus has access to it. That's essentially a public share, if you will, except that you do have to log into Globus to get access to it.
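Under the hood, each of those permission choices becomes an access rule attached to the share, layered over the storage rather than written into it. A minimal sketch, assuming the access-rule document shape used by the Globus Transfer ACL API (the principal UUID would normally be a Globus identity or group ID):

```python
def make_access_rule(path, permissions="r",
                     principal_type="identity", principal=""):
    """Build one sharing rule; permissions is "r" or "rw"."""
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    return {
        "DATA_TYPE": "access",
        # "identity" shares with one user; "group" with a Globus group;
        # "all_authenticated_users" is the "anyone logged in" case.
        "principal_type": principal_type,
        "principal": principal,   # identity/group UUID; empty for "all"
        "path": path,             # directory paths end with "/"
        "permissions": permissions,
    }

# The "public share" from the demo: read-only, any logged-in user.
public_share = make_access_rule("/demo-share/", "r",
                                principal_type="all_authenticated_users")
```

The rule only ever narrows within what the sharer already holds, which matches the "I can only grant permissions that I have" point above.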
So I do have a demo user for this purpose, so I'm going to search for that. If you don't know the individual's username or ID or anything, you can just put in their email address; Globus will send them an email, and they can go into Globus and it will automatically create an account and all that, so they don't have to worry about it. So I can add permissions here, and this user now has access: they have read access, down here, to that folder. If I go back to the file manager as that user and access the share, you'll see that's the set of files they're looking at on this system. This was the directory I shared, so they're kind of locked into there; you can see they can't navigate up from there. It's a nice, clean way of giving someone access to your data even though they may not be part of your institution or may not have an account on the storage system where the data resides. Actually, speaking of identity: as I said, we do have a federated identity and access management system here, so I can link a number of identities. I don't know if you can read those in the back. Here is my UChicago one; likewise I have an Argonne identity, and I have my University of Michigan ID from back in the day. So depending on what people know me by, they could share with me using any of those identities, and I could access the data irrespective of how I'm logged into Globus. So that's just a quick walk-through. Let me go back to some slides here. So, how do all these systems become available or accessible via Globus? It's done using software called Globus Connect. There are two major versions of it. One is a personal version that you can run on a laptop, on a single-user system. It doesn't require any special permissions to install, and it handles all the networking cruft, firewalls, etc. So pretty much anybody can just plug and play that, and they have an endpoint on their laptop. More interestingly, there's Globus Connect Server.
This is the thing you would put on a storage system, and it gives access to anybody that has a local account on that system. They can then come in via Globus and access that system with whatever permissions they have. As I mentioned earlier, we do support a lot of different systems. Any POSIX-compliant system is supported out of the box, and we have a number of connectors, on the left, that we've released over the last three or four years. Next up for us is a connector for Box. That's one that's been asked for by many people for a while now, so we're trying to get it wrapped up soon. There are a few others in the pipeline, and we keep adding to this as we keep getting requests; we've had requests for OneDrive and various others. So we do try and keep up with what our users are asking us for. You just saw me walk through some part of the web interface. There is also a command line interface, which is more interesting for those that want to script things and automate things further. And then ultimately, behind all of that, is a set of REST APIs, so if you want to access Globus programmatically, you can just talk to those directly. We've got lots and lots of sample code out there showing how people are doing that. The command line interface is actually a little more full-featured than the web UI, because there are a lot more options on some of the commands exposed through that interface. It's open source code that you can actually go look at. One of the ways people are using this is to integrate Globus into their workflows. And further, they're using the Globus platform to build science gateways, data portals, and various other applications that are part of their research workflows, which lets them manage data in a more streamlined fashion. So here's a very high-level view of some of the Globus services.
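As a rough illustration of what talking to those REST APIs directly looks like, this stdlib-only sketch builds (but does not send) an authorized call to the Transfer API's endpoint-search operation. The bearer token is a placeholder you would obtain from Globus Auth, and the URL follows the Transfer API as I understand it:

```python
import urllib.parse
import urllib.request

TRANSFER_BASE = "https://transfer.api.globus.org/v0.10"

def endpoint_search_request(query, token):
    """Build an authorized GET against the endpoint_search call."""
    qs = urllib.parse.urlencode({"filter_fulltext": query})
    req = urllib.request.Request(f"{TRANSFER_BASE}/endpoint_search?{qs}")
    # Every Globus API call is authorized with an OAuth 2 bearer token.
    req.add_header("Authorization", f"Bearer {token}")
    return req

req = endpoint_search_request("Midway", "PLACEHOLDER_TOKEN")
# urllib.request.urlopen(req) would return a JSON document listing
# matching endpoints, given a valid token.
```

The CLI and SDKs are convenience layers over exactly this kind of call, which is why anything you can do in the web UI can also be scripted.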
What you saw me demonstrate just now was Globus Auth, which is the identity and access management piece that underlies everything, and I showed you transfer. I did not show you search or identifiers. Transfer and sharing are the two key ones, and then there are additional services built, if you will, on top of those. But I do want to spend a second just talking about Globus Auth, because it really is a critical piece of the puzzle here for us. It's essentially based on the OAuth 2 standard and OpenID Connect, for those that are familiar with those terms. Beyond allowing people to access Globus with existing identities, we've tried to create a set of services that you can leverage in your own applications. So if you're building a data portal, you don't have to go and build your own username and password management, accounts database, and so on. You can hook into the Globus Auth service with literally a handful of lines of code using standard libraries, and then anyone that has an identity from one of those identity providers that I showed you earlier can access your application, assuming you've given them access, right? So it's a nice, clean way to make your apps accessible to a broader community. You can also use it to secure your own APIs if you're building those; we have some folks that are a little more sophisticated using Globus Auth to secure their own APIs so that other services can call those as part of the same flow. As we've gone through the last few years, it's become clear that it's not sufficient to have these point-and-click interfaces. They're good for ad hoc work, but they fall short when you get to scale: large research projects need to do the same tasks over and over in an automated fashion. So we've built additional facilities to enable that.
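To make the Globus Auth piece concrete: a portal using it just redirects the browser to the standard OAuth 2 authorize endpoint and later exchanges the returned code for tokens. A minimal sketch of building that redirect URL, where the client ID and redirect URI are placeholder values you would register with Globus Auth:

```python
import urllib.parse

AUTHORIZE_URL = "https://auth.globus.org/v2/oauth2/authorize"

def authorization_url(client_id, redirect_uri, scopes, state):
    """Build the OAuth 2 authorization-code redirect for Globus Auth."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",   # authorization-code grant
        "scope": " ".join(scopes),
        "state": state,            # opaque value for CSRF protection
    }
    return AUTHORIZE_URL + "?" + urllib.parse.urlencode(params)

url = authorization_url(
    "my-portal-client-id",                   # placeholder client ID
    "https://portal.example.org/callback",   # placeholder redirect URI
    ["openid", "profile",
     "urn:globus:auth:scope:transfer.api.globus.org:all"],
    state="xyz123",
)
```

Because this is plain OAuth 2 / OpenID Connect, any standard client library works; the transfer scope shown is what lets the portal then move data on the user's behalf.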
A very common automation need is things like scheduled backups, or replicating data on a regular basis, so you've got scripts that run and automatically kick off transfers that put your data elsewhere. Data distribution is another very common use case: for instance, a lot of campuses have next-gen sequencing centers and other things like that, where they'll pull data off the instruments, put it onto some storage system, and, using the sharing mechanisms, make the data available to their users. And in many cases, as I said, they're building custom data portals, and some of the capabilities of Globus are built into those portals, both for data movement and for sharing. One of the biggest examples is the NCAR Research Data Archive that I was transferring from earlier. If you go and browse that archive, when you select a data set you'll see, typically (not for all the data sets, but for most of them), an option to use the Globus service to transfer it. NCAR was one of our very early users of the Globus platform, and they continue to build capabilities into their repositories.
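The scheduled-backup pattern mentioned above is typically just a cron job resubmitting the same transfer document with a sync level set, so only new or changed files actually move. A sketch, with placeholder endpoint UUIDs and field names following the Transfer API as I understand it:

```python
def make_sync_document(source_endpoint, dest_endpoint, src, dst):
    """Build a mirroring transfer suitable for a nightly cron job."""
    return {
        "DATA_TYPE": "transfer",
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "label": "nightly mirror",
        # sync_level: 0 = copy files missing at the destination,
        # 1 = also compare size, 2 = also compare mtime, 3 = checksum.
        "sync_level": 3,
        "DATA": [{"DATA_TYPE": "transfer_item",
                  "source_path": src,
                  "destination_path": dst,
                  "recursive": True}],   # whole directory tree
    }

job = make_sync_document("SRC-ENDPOINT-UUID", "DST-ENDPOINT-UUID",
                         "/lab/sequencer/runs/", "/archive/runs/")
```

Since the service handles retries and notification, the script itself stays trivial; all the reliability lives on the Globus side.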
And what's becoming probably our biggest use case of late is pulling data off of instruments, be they light sources (high-resolution light sources like the Advanced Photon Source, or the ALS out at Berkeley), high-res microscopes, and so on. We've got lots and lots of instruments generating lots of data. What's particularly critical in these use cases is that the researcher only has access to the instrument for a short period of time, and then they have to get the data and get out, right? So they have to pull the data off of that instrument in a reliable way, because in many cases it's very hard, or in some cases impossible, to replicate that experiment; it might be a one-off sample or what have you. So they want to pull that data off and put it into some other system, perhaps for analysis or for sharing with their collaborators. How do you do that securely and reliably? We've got a number of different use cases where instruments are enabled for Globus access, and there are automated mechanisms for pulling the data off those instruments as it becomes available.

A good example of this is at the Advanced Photon Source out at Argonne National Lab. We have a researcher, Bobby Kasthuri, who is studying the brain; he's undertaken to map the brain, to build out the connectome, as they call it. A massive undertaking. He takes images; they talk about it as using a deli slicer, so they'll take mouse brains or octopus brains, slice them very, very finely, push them through one of the light sources, and gather all the data. Then there's a flow behind this, built using the Globus platform, where after the imaging the data are pulled off to an acquisition server; they go to another server where they're pre-processed; and then they're sent to a system on the other side of the Argonne campus, at the Argonne Leadership Computing Facility, where someone looks at them. They do some initial reconstruction to see if the image is of adequate quality, and perhaps they might make some adjustments to the instrument. Once everything is good, they'll do a full reconstruction on that same system at Argonne and then move the reconstructed data to yet another system called Petrel (you saw me move some data there earlier). There they attach persistent identifiers and essentially publish the data for the rest of the group. Much of the group lives at the University of Chicago, about 35 miles away, and in between, as Bobby says, science happens, right? So there's some magic there, obviously, but as far as the infrastructure goes, all of that is enabled by Globus, and this is the kind of thing we're starting to see more and more of as a requirement out there.

There are also many use cases where Globus is helping people with their data management plans. You're able to pull from diverse systems: maybe you pull raw data from your instruments, you pull processed results from some HPC cluster, and then perhaps you've got some additional data on your own laptop, documentation, code, what have you. You can put all that together into a data set, as we call it, and the data set goes into a collection that has a number of policies defining who can publish data sets into it, whether or not curation is required, what types of metadata are required, and so on. Once it's published, we'll mint an identifier; it can be a DOI, it can be an ARK or a Handle, what have you. And obviously the data set is then available to others.

So the current version of the data publication service, what we call version 1, has been around for probably going on four years now. This was an interesting exercise for us, because when we first built this service we really didn't understand the needs of the community, quite frankly. We assumed that for most people a nice turnkey application would do the trick: you bring your own storage, that's where your
repository lives, you decide where that is; then we provide some predefined schemas and we mint these identifiers, so you can pull these data sets together and publish them in your repository, and we're all done. The reality, obviously, was that that's not the common use case. That use case is out there (we do have upwards of a couple thousand users and a few hundred data sets published), but it hasn't had the adoption that we had hoped for.

So about two years ago we started looking at what the other types of use cases in research data publication really are. There's citable data, where your metadata tends to be more standard and you require strong persistent identifiers like DOIs. Then there's community data, where the schema is agreed on by the community; some of these data sets have more fine-grained metadata because they are discipline-specific. And then within the institution, again, there are multiple domains and lots of different storage systems. Trying to handle all of that from a publication standpoint was quite a challenge for the initial system. And all the while, we still want to support active research data, so it's not just about creating these immutable snapshots and then we're done; it's about how you manage data as it evolves, as the schemas and things change, and so on.

So what we did, starting about 18 months or thereabouts ago, is take that publication service and break it up into a set of microservices that allow you to essentially build your own flows. If you have an existing repository and you want to integrate some of these capabilities into it, so you can add data to your institutional repository, you can now do that. There are a number of services; we have two of them so far, beyond Auth and transfer, which were already there. We do have a search service. This is actually built on Amazon's Elasticsearch service, and the nice thing about it is that it's schema-agnostic; as our product head says, just give us your unwashed metadata and we'll do the best we can. We'll index essentially whatever you give us; it just has to be in a machine-readable form, basically a JSON-type document, so we can index it. Very importantly, we overlay the same access control mechanisms that we have in the rest of the Globus service on search. What that means is you can decide who can actually search this data, who can see this index or these indices. It can be an open index, an open search, but in some cases, where perhaps there's some PHI data, you can say that for this data set, for this index, you only want people with certain permissions to get to it. Globus Search also creates a set of facets, so you can filter by those and sort and organize your data that way, and it provides a rich query language, so you can write your own search queries and build those, again, into your own data portals and so on.

Then the other service we've added is the identifier service. Again, this allows you to mint different types of IDs. When you create an identifier, very importantly, it's created within your own namespace; we're not in the business of maintaining DOI namespaces or anything like that. An identifier has all these different attributes, right? It's versioned, you can control its visibility, and every identifier points to some kind of landing page with a link to the data. So again, you can build this into your own data flows, data portals, what have you.

The service we're working on right now, and we should have a first release of this probably in the next month or so, is the Globus Automate service. This is taking all these individual services and allowing you to compose them and use them in a flow, so you can define some automated set of steps that are triggered either manually or through some type of event. Perhaps data showing up on a capture device, on an instrument, will kick off a transfer flow, and then data showing up out of some analysis
tool will kick off some kind of publication flow, where that data gets its metadata, it's indexed, and it gets an identifier, and so on. So all of that will soon be automatable, and you can tap into it and build your own flows.

Just some other applications here, and then I will wrap this up because we have a few minutes for questions. We've also done some integration with JupyterHub; as many of you are aware, Jupyter is becoming sort of the de facto tool for interactive data science. We've built integration into JupyterHub such that, again using their existing identities, your researchers or your students can log into a JupyterHub instance that you've set up on campus, and within those notebooks they then have access to all these Globus services, with the same fine-grained security model that I've been describing. This example here is from our Materials Data Facility at Argonne, where they built flows whereby they query this big data set of materials data, move it to Petrel, do some analysis using the Parsl library, which is a parallel Python scripting library, and then push the results out; they publish them and make them available to the materials research community. Other examples: this one is from the Wellcome Sanger Institute in the UK. They've integrated Globus (they've actually had this in place for quite some time) so that you can send them your genomics data and they'll run their imputation tools on it. Globus is also used extensively in national cyberinfrastructure: Compute Canada and XSEDE, the two big providers of supercomputing resources in Canada and the US.

I just want to talk briefly about sustainability, and then I'll open up for questions. Over the past nine or so years since we launched the service, we've had some reasonably good adoption, and thank you to our sponsors that have made all of this possible: we've had grants from those agencies as well as from some private foundations, and we continue to get support from the university and from Argonne National Lab. But more importantly, as I said at the beginning, we've tried to build sustainability into our core value proposition. What we want to do is make Globus self-sustaining, and we decided to do that with a freemium model, like you would see in industry. Parts of the service, in particular the transfer component, are free for anyone to use, but most of the other features that I've talked about do require a paid subscription. We do have, as I said, approaching 100 subscribers, and thanks to them we're starting to get to the point where we can support everything without relying on grant revenue, because grant funding is obviously not geared towards supporting operations, right? We are, in a sense, in the business of providing a production-grade service to the community, even though we're this group within the university, so we're kind of an odd duck in some sense. But we're doing it from the perspective that we want it to be available for a long time. There are different levels of subscription: standard, and high assurance, which gives you all those additional protections for PHI and so on, and there are add-ons. I'll leave it there. Sorry I haven't left too much time for questions, but I'm happy to take any questions that you have, or show you anything else. Okay, thank you very much.