Welcome to another edition of RCE. This is Brock Palen. You can follow me on Twitter at BrockPalen, all one word, and you can find all that information on the RCE website at www.rce-cast.com. Also here again is Jeff Squyres, who is again across the country from me, so we are doing all this remotely. I'm again in my little recording booth, the size of a porta-john or something. So thanks, Jeff, again, for helping out. Yeah, we're back to kind of passing things back and forth, so if things don't sound as smooth as the last show, it's because we're separate again. It's always easier when we have that once-a-year chance to sit together and record a show; it's a lot easier when you have all the visual cues of human communication and not the limitations of Skype. But we're back to our regularly scheduled program. Well, I guess I should say it's not quite regular, because December is always a down month. We're all recovering from Supercomputing. Heck, I've even been really slow on my blog; I think I've only gotten one or two entries out so far, but I've got to get another one in before Christmas. We're recording right before Christmas here, so this will probably be the last episode of the year, right, Brock? Yes, this will definitely be the last one of this year, but we will be back with more; we've got things scheduled coming up after the new year. We'll resume a much more regular schedule in January of 2011. All right, so Brock, what have we got on tap for today? So our guest today is Paul Millar, who is in Hamburg, Germany, so there's a big time zone difference here. He is one of the authors of dCache. I've never actually used dCache; it was something I ran into at SC. So we'll let Paul introduce himself and then we can go from there. So Paul, welcome to the show. Why don't you say a little bit about yourself? Hi, my name is Paul Millar.
I've been working on various grid-related projects for a rather large number of years now. Before being at DESY I was working at the University of Glasgow, and then about three years ago I moved over to Hamburg, and I've been working on dCache ever since. All right, well, let me kick it off here: give us a description of what dCache is. I went and looked at your website and read a little of the material there, and maybe you can explain right off: how is this different from, say, a parallel file system? So with dCache you have this idea of combining lots of storage together in kind of a RAID-0 style. It differs from a regular cluster file system by having more emphasis on the management aspect, so you can steer where the data is placed in the cluster that you have. It also has tape back-end support, so you can write files to tape if you need to. And it also supports lots of different protocols, so whereas a cluster file system will typically be mounted locally and that's it, dCache actually supports lots of ways for people to access the data. I see, so it's more of a data blob, where the blob is accessible in a variety of standardized protocols that many people can be aware of, but you're not limited to just one, you know, open/close/read/write kind of protocol. Well, it's actually mostly open, read, and maybe delete. We don't actually support modifying files, so you are operating on files, but once you've written a file, that's it: it becomes immutable. I see. What's the rationale for that? So, if I make sure I understand you, this is kind of a write-once philosophy: data goes in and is kind of locked in, like a roach motel. Exactly, yes.
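The write-once, read-many model Paul is confirming here can be sketched in a few lines. This is purely illustrative Python, not dCache code; the class and method names are invented for the example:

```python
# Illustrative sketch (hypothetical names, not dCache code): a minimal
# write-once store. Files can be created, read, and deleted, but never
# modified -- the "roach motel" model described above.

class WriteOnceStore:
    def __init__(self):
        self._files = {}

    def write(self, name, data):
        # A name can be written exactly once; after that it is immutable.
        if name in self._files:
            raise PermissionError(f"{name} is immutable once written")
        self._files[name] = bytes(data)

    def read(self, name):
        return self._files[name]

    def delete(self, name):
        # Deletion is allowed; in-place modification is not.
        del self._files[name]

store = WriteOnceStore()
store.write("run-001.dat", b"detector readout")
print(store.read("run-001.dat"))          # b'detector readout'
try:
    store.write("run-001.dat", b"edited")  # rejected: no modify support
except PermissionError as e:
    print(e)
```

Because a stored file can never change, a system built this way never has to invalidate caches or chase stale replicas, which is the simplification discussed in the rest of the conversation.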
So the rationale, where it comes from, is particle physics. With particle physics experiments you have a large amount of data being written out, and this is data that is never going to be changed, because it's what you've measured from the equipment. Later on, people want to understand the data better and do analysis runs, and these again create files that you don't want to change. So because there's no requirement to add the ability to modify files, we just didn't implement that feature, and it makes our lives a lot easier in providing dCache. So you mentioned dCache was motivated by particle physics. Was there a specific project that funded the actual dCache into production? Yes, there was a particular project that was a joint project between DESY in Hamburg and the Fermilab facility near Chicago. The two groups were producing software, or wanted to produce software, that took what the physicists were already doing in the way of managing their data and tried to provide that as a service the physicists could then just use. So this is around about the year 2000 or 2001; we had the prototype deployed here at DESY, and then in early 2002 it was in production at DESY and Fermilab. So can you actually write the data through dCache? Just open and write, but then you can't modify? Basically, yes. What the team found was that the protocols that were available, such as NFS, were not fast enough to provide the necessary throughput. In particular, they had single points that were bottlenecks: an NFS server sits at a particular point, and that becomes a bottleneck.
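The single-point bottleneck Paul describes here is the motivation for the redirection design that comes up later in the conversation: a central node answers only the small "where is this file?" question, and the bulk data then flows directly between the client and the machine holding it. A minimal sketch of that idea, with entirely hypothetical names and an in-memory stand-in for the pools:

```python
# Illustrative sketch (hypothetical names, not a dCache protocol):
# a head node only maps file names to pools; clients then fetch the
# bytes from the pool directly, so no single server carries all traffic.

POOLS = {
    "pool-a": {"run-001.dat": b"event data"},
    "pool-b": {"run-002.dat": b"more events"},
}

def locate(path):
    """Head-node step: answer which pool stores this file."""
    for pool, files in POOLS.items():
        if path in files:
            return pool
    raise FileNotFoundError(path)

def read(path):
    """Client step: small metadata request, then direct bulk transfer."""
    pool = locate(path)        # cheap lookup on the head node
    return POOLS[pool][path]   # heavy byte traffic goes pool -> client

print(read("run-002.dat"))  # b'more events'
```

The metadata step is cheap, so many clients can be served concurrently while the heavy byte traffic is spread across the pools; this is the same separation of metadata from data paths that pNFS later standardized.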
So because of these issues there wasn't anything really appropriate, and the team was developing replacement protocols to allow faster throughput. So at that point you couldn't mount dCache and do normal POSIX open/close/write? No, but you have a library that you can use which has all the functionality you would have apart from modify, so you can write data in, and when you close the file it becomes immutable. But you can link applications against this library and then use it as you would a normal file system. So you mentioned, I think you said earlier, that it makes your lives a lot easier. Given that you can't modify the data, what kind of optimizations and code simplicities do you get out of that model, compared to a read-modify-write kind of model? Pretty much the simplicities you might imagine: we don't have global locks or a locking structure that we need to support, and we don't have to support synchronisation, because for us the files never change. So we don't need to worry about removing stale versions of a file, things along those lines, really. Now, can you get better bandwidth, or are the optimizations more on, say, the management side? Before you can, you know,
decide to start streaming a file, you can always just decide to start sending the data directly, because you never need to check whether it's stale or anything like that. Or are there other optimizations in terms of, say, bandwidth? So with bandwidth, the main optimization we have in dCache is that we can talk directly to the machines where the data is stored. That's not actually an optimization that results from making files immutable; Lustre, for example, also supports this, but it's something we needed to use custom protocols for. The benefit of immutable files is, I think, mostly a management issue, and the fact that we can afford to cache files quite aggressively. So dCache itself, it's not a file system? It sits between the client and your real back-end file systems? It's kind of both. For the storage where files are stored, you have to use a file system there; you can't just access the block-level device itself. So typically somebody would take a partition that they want to use with dCache, format it with XFS or whatever, mount it as you would a normal file system, and then run dCache on top of that. But dCache itself supports protocols that allow you to mount dCache as a global namespace on your client, so you can actually access dCache as you would any other file system. Okay, so you can mount it, but dCache itself doesn't hold the data; the data is always written to some set of back-end NFS servers or Lustre servers or tape. Does it care what it's writing the data to? Not so much. I mean, it could be XFS, it could be ext3, it could be ZFS. We have people that are using GPFS and Lustre as the back-end storage.
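As Paul describes, a pool node doesn't need anything exotic from its backing store: it keeps each file's payload as an ordinary file on whatever mounted file system it is given (XFS, ext3, ZFS, and so on), named by an ID. A hypothetical sketch of that layout; the ID scheme and directory names here are invented for illustration, not dCache's actual ones:

```python
# Illustrative sketch (hypothetical layout, not dCache's real on-disk
# format): a pool stores each payload as a plain file, named by a
# unique ID, on an ordinary mounted file system.

import os
import tempfile
import uuid

class Pool:
    def __init__(self, mount_point):
        # The pool only assumes a mounted, POSIX-ish file system.
        self.data_dir = os.path.join(mount_point, "data")
        os.makedirs(self.data_dir, exist_ok=True)

    def store(self, payload):
        file_id = uuid.uuid4().hex   # stand-in for a namespace-assigned ID
        with open(os.path.join(self.data_dir, file_id), "wb") as f:
            f.write(payload)
        return file_id

    def fetch(self, file_id):
        with open(os.path.join(self.data_dir, file_id), "rb") as f:
            return f.read()

pool = Pool(tempfile.mkdtemp())
fid = pool.store(b"raw detector output")
print(pool.fetch(fid))  # b'raw detector output'
# Listing pool.data_dir directly would show one file named by its ID.
```

This layout is also why, as comes up later in the interview, you can mount the backing file system directly and find your file there, if you know which ID to look for.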
So it is really flexible in that respect. So then, is dCache a client-side-only kind of service, or do you have a server side as well, so you can move the data remotely, with the client side on a different node? dCache is a server-based solution. We do provide a client, because of sort of historic reasons, but the server component also provides NFS 4.1, for one, so you can mount dCache with the client being just the Linux kernel. I see. Or the Solaris kernel, or whatever. Now, for the transports you use to talk from your client to your server: what networks do you support? I'm assuming you support Ethernet, but do you support things like InfiniBand or iWARP or other kinds of transports? Currently, no. A typical deployment is in what we call high-throughput computing clusters, where the analysis work or the production work being done by physicists is highly parallelizable; it's the sort of embarrassingly parallel workload. So the low-latency requirements that you would have with a more tightly coupled problem don't apply, and a site will typically not deploy InfiniBand or the other lower-latency networks. So you mentioned that dCache shows up as NFS 4.1. We had Peter Honeyman of the CITI group on this show a while back, and 4.1 sounded like it was still heavily in the works. How is that affecting your work of supporting it? Well, yes, it is. Since about 2006 our technical lead on the project, Tigran, has been involved with the NFS 4.1 development process, and I think dCache is perhaps the first storage element to support 4.1 pNFS. And particularly it's pNFS
that's of interest, because this allows us to redirect the client to where the data is stored, to get the throughput we need. But the big problem for us with NFS 4.1 is that the clients aren't available yet; they're still going through the process of being added to the Linux kernel. So although we are kind of ready to go with 4.1, we're still having to backport the patches from the newer kernels into older Linux kernels, package that up, and try to provide it for Red Hat Enterprise Linux, or the scientific rebuild of that, which is Scientific Linux, just to allow people to try out NFS 4.1. So we're kind of waiting on the clients to catch up. Early on, when we asked what dCache was, you mentioned that dCache acted like a RAID-0 to get performance across all these underlying storage pieces, but you also mentioned a tape back end. Does dCache do any type of HSM work, or does it rely on other pieces of software for that? So dCache doesn't provide a stager, the component for writing to tape; it assumes that a site will have something like that already. And the interface to this is actually quite nice and simple, so you can interface to lots of different stagers. At DESY we're using a stager called OSM, but people have used dCache with TSM and HPSS, and sites have also written their own stagers to provide this kind of functionality. You can even interface to things that are not tape: one of the things you can do is interface to cloud storage, and push data into a cloud and then pull it out again as you need it. Does it support floppy drives too? I don't think so, not at the moment. So let me switch a little bit here. I'm a software guy, so my questions tend toward the software side. What does the client access look like? Is there an API, and in which languages, or is this a command-line thing? How do you actually get to the data?
So we support a number of things. It's defined at the protocol level, and for us also at an API level. There are several standard protocols that we support, so the clients just talk the normal over-the-wire protocols: for example, HTTP for just reading or browsing, WebDAV for reading and writing as well, and FTP if you want to use that as a WAN transport. We actually support some of the grid extensions to the FTP protocol that allow you to send checksums of the data as well. So those are standard protocols, and the clients are your standard web browser or your standard WebDAV client; NFS is another standard example. We also support some specialist high-throughput protocols, such as dCap and xrootd, which have been developed in the particle physics community for really high-throughput work. For these there are support libraries which you can use, and there's an API that looks very much like a POSIX API, so you can just link an application against it and then use it. So then, I infer that your API is in C? Yes, the main dCap and xrootd APIs are in C. So, to get this high throughput, with dCache as a server-side product, can you run multiple dCache servers and have the clients kind of stripe data across them?
We currently don't support striping. The decision early on was to store complete files on a pool. So when somebody writes to dCache, there's a decision made about where that file should be stored, and then the client is redirected to that particular storage device, which stores the file. Including floppies, of course, because those are vital. So, you mentioned the protocols and the API. Is there any provision for streaming, and is that even useful to your user community? Say you have extraordinarily large files, and maybe you want to analyze only a part of one, so the client doesn't necessarily want to download the entire file; it only wants a chunk of it, or maybe it wants to see a part at a time, to have a rolling kind of analysis. Do you support streaming kinds of operations? So we support streaming, and also the first case, which we handle in a slightly different way. For streaming we have protocols that will give you the data as it comes; it's not really controlled, other than the client reading the data as it wants to, in that sense. But the first point is interesting, because you're absolutely right that clients don't want to read the entire file. Typically they want to cherry-pick the bits they want: they read the first bit of the file, which has some kind of inventory, and then, based on that inventory, they make a decision about which parts of the file to read. What we support is a form of vector read over a file, whereas normally, in the POSIX idea, a vector read is a scatter-gather operation.
Here, it's actually an offset-length list for the file, and the server will then consolidate all the bits that you've asked for and send them back, so you don't have to waste any bandwidth transporting data that you're not interested in. So I have a question about the internals of dCache. If I write a file through dCache, and it writes it to some underlying NFS server or a ZFS file system, and I then went and looked at that file system, if I mounted that ZFS file system, would I be able to find the file I wrote? Would it be recognizable, or can I only access the data through dCache? So the file is certainly there, and recognizable if you know what to look for. Whenever a file is created in dCache, it's given a unique number, an ID, and the pool node, the node that stores the data, will store the file's data with a file name based on the ID. So if you go to the right directory and have a look, you'll see the file there, with the ID as the file name. Okay. And then I have one last technical question, about whether someone could use dCache in a data replication space. Does dCache allow me to apply any policies? Like, I
write the data, but it actually writes two copies, because it knows that this underlying ZFS server is in the States and this underlying NFS server is in Europe, and I want it written to both places so the data are never lost. dCache sort of supports that. It's not quite as clear-cut as that, but it would certainly allow you to set that up. dCache has a lot of support for data placement and data replication, so you can certainly set up a dCache so that you have two copies of everything, and one of the features of dCache is that you can tag pools with metadata. So you can say that this pool is in one country and this pool is in another country, and the replication would choose to put the two replicas in different countries, in that sense. So does dCache have any knowledge or understanding of what type of media or file system it's writing to in the back end, so that it understands, say, differences in latency when writing and reading? So that if I do a request for something that happens to live on tape, would dCache say, yeah, I'll get that to you, why don't you come back in five minutes and then I'll have the data, or something like that? Certainly between disk and tape; this is one of dCache's main strengths, managing the flow of data onto tape and back from tape. With disks, now that we're getting SSDs as well as magnetic media, it's starting to become more feasible to use both in storage system setups, and this is something where we're looking into what we can do to support it. But certainly for tape, yes. So let me change direction yet again. Who uses dCache? You mentioned particle physics and things like that, but what does your user community look like,
anybody beyond particle physics? So certainly the largest user base is in the particle physics community. The WLCG is the Worldwide LHC Computing Grid, which is the resources and software for providing the compute and storage requirements for the experiments to do their analysis work, and there are something like eleven Tier 1s, and eight of those are using dCache to store their data. Some of the largest Tier 2 sites are also using dCache. But there are other communities. The sites that are using dCache provide storage for other scientific experiments, so at the SARA facility in Amsterdam they're providing storage for the LOFAR experiment, which is a very impressive radio telescope project. At DESY we're looking at increasing our support for photon science. These are people using a very, very bright, very, very fast X-ray laser that they're actually in the process of building here at the moment; it's kind of like having a stroboscope going off very, very fast. These people will be generating somewhere between one or two up to ten petabytes a year of data, which needs storing. So how many developers do you have who actually create and maintain dCache, work on new features, and fix bugs and things like that? At the moment we have, let me see, nine people working on dCache. They're not all strongly developer-focused, but of that group of nine the majority are developers. And are you all at one organization, or are you spread across multiple organizations? We're spread across different organizations in different countries. There are people working on dCache here at DESY, at Fermilab in Chicago, and also at NDGF, which is in Copenhagen. So that's a pretty large distributed team. How do you manage that as a project?
You know, getting all the different requirements and goals from the different organizations; I mean, you're doing similar things, and I assume you have similar goals, but everybody's got slightly different agendas and things like that. How do you manage that as a project? It is a challenge to manage when you have a distributed group of people. We have weekly meetings to bring people up to scratch on where everyone's at. We make good use of Jabber and email as communication technologies, and we meet up once or twice a year to get everyone under the same roof for about a week, so we can hammer out differences, figure out which way we want the project to go, and really just figure out where we want to be. So at these meetings, when you get together, how do you actually decide what the next feature added to the product is going to be? So far it's been the person hosting the workshop who makes the decisions on the agenda. The previous one was at NDGF, so the NDGF developers were deciding which parts were going to be discussed, and prior to that it was hosted at Fermilab. But that puts too much emphasis on it being one person; it's still a group decision about where we're going. There's a lot of discussion beforehand about what the plans are for a workshop, what we're going to achieve, and what's discussed. So what are some of the planned upcoming additions to dCache? So things that we've been working on recently, and are carrying on with, include scaling down dCache. Although dCache is very, very flexible and you can deploy it in numerous ways, the first time you install dCache it can be a bit of a daunting process.
So we would like to make it trivially easy for people to install a one-node dCache instance and have it up and running very quickly. This involves changing our configuration system and management and this kind of stuff. We're also looking at scaling up, because people always want to store more data, so we're looking at single points of failure and how we can try to remove the bottlenecks that are in the system at the moment. We're also trying to reduce the management load, the common operations that somebody running a dCache instance would have to perform to keep the system running well. So how big is the learning curve? Say I'm an administrator at a site, and one of my scientists comes to me and says, hey, I need dCache, because I'm doing some stuff that dCache is really good at. How easy is it to set up and manage if I really only know how to spell dCache at this point? How do I move forward? So currently we provide installation guides and material to get you started installing dCache, but it usually takes people a while to get really comfortable with running dCache. This is something we're hoping to improve over time. What is it that makes it complex or difficult to run or manage? Is it just the nature of the problem, that you have so many different stores and they can all be combined under one roof, so to speak, or something else inherent? It basically comes down to the flexibility of dCache, the fact that dCache can be used in so many different ways and with different deployments: whether you have dedicated read and write pools or combined pools, whether you have a tape system, what kind of tape system it is.
All of this flexibility can also be a negative thing, because you then need to set it all up. So part of the scaling-down project is to try to simplify the first install so that it comes with useful, meaningful default values and good instructions, and you basically install an RPM and it just works. So dCache was built for the particle physics community, which my sister was in for a number of years at Fermilab, and they do some strange stuff. What are some of the unexpected uses you've seen of dCache, some of the interesting examples? So there are several interesting uses of dCache. One of them, from before I came here, in the early days of dCache: the software was rapidly being developed, and the developers needed to test that dCache would work in production environments, but at the same time you wanted to provide a reliable service. So they had two dCache instances, one in front of the other, with the second one pretending to be tape for the first one. All of the normal operations would be handled by the first dCache, which was a stable, well-working dCache instance, and the crazy new stuff was being tested in the dCache instance behind the first one. So that's kind of a crazy use of dCache. Another one is the NDGF facility, which is based in Copenhagen but actually spans five different countries. So this has got parts of its dCache storage in Norway, Sweden, Denmark, Finland, and Slovenia, at different sites, six of them with HSM systems, and to the end user it looks like a normal file system; you can just mount it. So let me ask you a question I like to ask all software developers, just because I'm always interested in this: what source code repository do you use for development, and why? So currently we're using Subversion as a repository.
It's not our first choice now, because we would like to move to Mercurial. We have a number of binary blobs in there at the moment, external JAR files that we depend on, and we're in the process of removing those; once we've got rid of those, we'll be switching over to Mercurial. So you mentioned that organization that's spread across five countries and has six different tape systems underneath it. Is that the largest single collection of storage managed by a dCache instance, or is there something bigger? What's the biggest dCache install out there? So yeah, that one is big in terms of the geographic area it encompasses, but it's not actually the biggest in terms of the amount of data it stores. It stores roughly three petabytes of data, which is big, but it's not the biggest. The biggest is at Fermilab again. This is the CMS instance, which has some eight petabytes of disk and about twelve petabytes of tape. All of the disk is being used at the moment, and they're flushing data in from tape as it's needed by the experiment, so this is constantly changing. Okay, well, great. We appreciate your time. How would someone get involved in dCache, or download the software, or find some of those installation guides and things like that? The best place to go is to have a look at our website, which is www.dcache.org. You'll find the installation guides there on the download page. We also have a mailing list, a user forum, which provides an informal place where people can chat about any problems they're having, and you can ask other people for advice; you know, ask people that have been around for longer how to do a particular thing. And we also provide support as well, so if you're having a problem, you can always email us at support@dcache.org. Okay, Paul, thank you very much for your time. This show will be up soon, and thank you again.