Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online, and our entire back catalogue, at rce-cast.com. You can also find links to all the Twitters and the blogs and everything there, and other ways to get a hold of us. Also, please feel free to send in nominations for topics to have on the show. We have Jeff Squyres here as always helping us out. Jeff, thanks again for your time.

Hey Brock, it's another lovely spring day and another cool topic to talk about today. What do we got?

Today, this is actually something I ran into looking at, I think it was, the NERSC website. I was curious: okay, what the heck is this thing? And it is called SciDB. I will admit that the whole reason I dragged our guest on here today is because I want to figure out what it is. So our guest today is Paul Brown. Paul, why don't you take a moment to introduce yourself?

Hi, my name is Paul Brown. I'm the architect of SciDB. The firm I work for is the commercial wing, which is called Paradigm4. You can find SciDB at www.scidb.org and Paradigm4 at www.paradigm4.com; that's "paradigm", one word, then the numeral 4. Paradigm4 is the sponsor for SciDB and I work for them. As for my background, I've worked for a heap of database companies that started with the letter I: I worked for Illustra, I worked for Ingres, I worked for Informix, and I spent 10 years at IBM. And then when SciDB started up, I joined SciDB as the architect and general dogsbody.

So what is SciDB? That's a good place to start.

So SciDB stands for Science Database, and remember the name Paradigm4 for the company. The motivation for SciDB was the work of the late Jim Gray on improving data management tools, technologies, and platforms for scientists. Jim spent a lot of time at Johns Hopkins working on the Sloan Digital Sky Survey, and SciDB was kicked off out of the XLDB (extremely large databases) research community. My boss, Mike Stonebraker, and his colleague, Dave DeWitt, spent a lot of time walking around and asking scientists what they wanted in a database platform. And it sort of broadened out a little bit from just databases. We built SciDB to meet those requirements.

Now, when you hear the term database, we've all kind of been conditioned to think relational database, SQL, and now there's these young upstarts, the NoSQL crowds and things like that. Does SciDB fit in one of these two categories?

So the other thing about that word database is that it's a database management system, which is much older than relational, and I suspect it'll outlast the NoSQL guys too; it'll keep on going. If you look back, we had hierarchical and network databases, we had relational databases, we had object-oriented databases, we had XML databases, and now we've got NoSQL databases. SciDB sort of continues that grand tradition of coming up with a new name. But the signature difference between SciDB and what's gone before is that we've chosen arrays and matrices as the building block for our storage manager, and as the internal unit of data processing in our query engine. So when you think about SciDB: if you pick up a SQL database engine, you think relations; with SciDB, don't think relations, think arrays and matrices.

So internally, how do you do stuff, though? Would you say it's more closely related to a traditional SQL database, a columnar database like a Vertica or something like that, or a NoSQL object store?

So, think about almost all of the above.
So, at the risk of doing a bit of a lecture here: the data model is multi-dimensional arrays as the organizational principle. Every time you organize data, you store it in an array, and each cell in the array, each logical position, can have multiple attributes. So the first thing we do, like a column store, is divide the database up into what you can think of as one array per attribute; it's a vertically partitioned data store. The second thing we did was ask the boys and girls who'd been doing array processing for 20-odd years: well, what do you do with arrays? And they said: well, we chop them up into rectilinear partitions and store the partitions separately. We might spread them out over a file system or divide them up into a collection of HDF5 files. And then we do parallel processing by organizing our algorithms around access to each of these sub-arrays — each of these rectilinear chunks, as we call them — and around the rotation of data among the nodes that make up your massively parallel system.

So I guess the 10-cent tour is: yes, we're a column store insofar as we divide stuff up into one column per attribute of the array's cells, but internally we organize ourselves the same way people have done big array databases before. Except we do transactions as well, so there's a database piece in there: when you update the array, we make ACID guarantees. And we have a query language. It looks a bit like, if you remember, an old language from the late 60s and early 70s called APL. When you manipulate the objects, you do things like array multiplies or filters or array-vector products, that kind of thing.

What you just said there gave me a lot of questions. The first question, though, is you mentioned HDF5, and the fact that you're trying to store data a lot of the same ways that scientists may store data partitioned across a big parallel job or some sort of data-parallel job. Can you actually compute directly against data inside SciDB using, like, the MPI-IO functions or something like that?

It's sort of the other way around: we pulled the MPI framework inside SciDB. So supposing you want to do something like a big truncated SVD — that's a good example, because we did that. One of the things we did in our design and implementation process is we wandered down to Tennessee and talked to Jack Dongarra, and we sort of said: okay, how do the boys with belts and suspenders do this for a living? And we learned from him how you do block-cyclic, or block-partitioned, algorithms for linear algebra. We learned how you do block-cyclic rotation of data through the nodes. We looked at his ScaLAPACK implementation and the way it uses MPI. And we grabbed as much of that technology as we could and reused it inside SciDB. So if you've got a very large matrix — 60,000 by 50,000 is a size we run into on a regular basis — and you want to compute the top two components of a truncated SVD, that's the sort of operation you can just directly ask for. You literally say gesvd of array A, and we return you three arrays, your U, S, and V, and you can figure out the data from there. And we make it scale as best we can. They're all cubic algorithms in this space, but we make it scale by using as much of the hardware it runs on as we can.
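To make the chunking idea concrete, here is a minimal sketch in plain Python/NumPy — illustrative only, not SciDB's actual storage code, and the helper name chunk is hypothetical — of the two ideas above: one array per attribute (vertical partitioning) and rectilinear chunking of each array.

    # Illustrative sketch only: one array per attribute, rectilinear chunks.
    import numpy as np

    def chunk(array, chunk_shape):
        """Split a 2-D array into rectilinear chunks keyed by chunk coords."""
        (cr, cc), chunks = chunk_shape, {}
        for i in range(0, array.shape[0], cr):
            for j in range(0, array.shape[1], cc):
                chunks[(i // cr, j // cc)] = array[i:i + cr, j:j + cc]
        return chunks

    # Two attributes per cell -> two separately chunked stores, column-store
    # style. Each chunk can then live on (and be processed by) its own node.
    temperature = np.random.rand(1000, 1000)
    pressure = np.random.rand(1000, 1000)
    temp_chunks = chunk(temperature, (250, 250))   # 16 chunks of 250 x 250
    press_chunks = chunk(pressure, (250, 250))

A query that only touches one attribute over one region then only reads the relevant chunks of the relevant attribute's store, which is the point of combining the column-store split with the chunked layout.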
And I guess the thing to step back on a bit is that it isn't just that gesvd operation — general singular value decomposition. It sits alongside a very large list of other, you might think of them as more orthodox, data operations: filters and slices and betweens and shape changes and dimension-reduction algorithms and things like that. So the idea is to have this composable language, a functional language, that combines high-level linear algebra operations with lots and lots of low-level useful things: filters and aggregations and group-bys and regrids, windows, that sort of stuff.

So this is interesting. Did I hear you correctly say that you actually use MPI inside the database? And so therefore your database infrastructure could actually span tens, hundreds, or even thousands of nodes, depending on how many clients or types of jobs you're serving simultaneously. Is that a goal of what you're going for?

Yep, exactly. So at the moment, our biggest sites have around 200 to 400 physical nodes. You can do an awful lot of damage with 400 modern Intel 16-core boxes with, you know, 64 to 128 gig of memory on each of them; that's an awful lot of compute power. Usually because we just haven't had the capital, we haven't been able to push much beyond that, but there's no particular reason in principle, given the architectural design, that we couldn't. So again, from our backgrounds: I worked on high-end DB2, and I've worked on massively parallel Informix engines in the past. And it wasn't uncommon — it was a bit rare, we only had a handful of them — to run a big relational database engine, a big SQL engine, on a thousand nodes. It's not ideal, and really the principal difficulty you have is that for any sufficiently long-running job on a thousand nodes, you're going to have node failure. And the relational engines didn't deal, or don't deal, with the failure of a single instance or single physical node very well in that kind of configuration. So we borrowed a bunch of ideas from people like Hadoop to harden the system. If a single instance goes down, we have to halt the running processes, but we don't cease access: you can continue to ask queries of the system. We take the data, and we partition it and replicate it across multiple physical nodes, so the loss of any one physical node doesn't mean a loss of service; you can continue to ask queries. But yeah, the goal was to pick a bunch of technology that we knew worked from these massively parallel relational engines that scale quite well, and then to integrate that with the linear algebra and the array algebra that we knew we could get from ScaLAPACK.

This is interesting, because MPI — or at least MPI implementations — are traditionally pretty terrible at fault tolerance, and you're saying you've kind of worked around that. You have to forgive me; I'm an MPI implementer, so these things perk up my ears. I'm assuming you've worked around that by doing smaller individual jobs rather than one monolithic system that is all MPI all the time. Is that somewhere close to reality?

Yeah, that's exactly right. We only dip our toe in the MPI world when we know we have to. So, you know, for the very, very big SVD operations and matrix multiplies — dense matrix multiplies — MPI is the only game in town. It's the way to make that work.
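A hedged sketch of the replication idea just described: place every chunk on more than one node so a single node failure loses no data and queries keep running. This is illustrative Python, not SciDB's actual placement logic, and place_replicas is a made-up name.

    # Round-robin chunk placement with k-way replication (illustration only).
    def place_replicas(chunk_keys, n_nodes, k=2):
        placement = {}
        for idx, key in enumerate(sorted(chunk_keys)):
            # Primary by round robin; replicas on the next k-1 nodes (mod n).
            placement[key] = [(idx + r) % n_nodes for r in range(k)]
        return placement

    keys = [(i, j) for i in range(4) for j in range(4)]  # 16 chunk coordinates
    placement = place_replicas(keys, n_nodes=4, k=2)

    # If node 2 dies, every chunk still has a live copy somewhere else, so
    # reads (queries) keep working while the node is replaced.
    survivors = {key: [n for n in nodes if n != 2]
                 for key, nodes in placement.items()}
    assert all(survivors.values())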
We are vulnerable if the MPI system that we call out to has an issue or a failure; it just reports back to us, "yes, I failed," and we sort of have no choice but to reinitialize the whole job again. From the point of view of the person writing the queries, though, that's somewhat hidden from them. They put the query in, the query chugs along, and it might take twice as long to complete as it would ideally.

But the other thing, in the SciDB core, the piece that we do the work with — and there's an important point here that I don't think comes across as well in some of the science literature — is that we found, when we began using SciDB in anger, that in addition to those big dense operations that MPI and ScaLAPACK are really tuned for, an awful lot of science data falls into the category of being very, very sparse. And the algorithms you want to run in a very sparse matrix multiply or a very sparse SVD are quite different. So we've managed to get away with implementing sparse algorithms in the same framework: we still use the MPP framework, but for the sparse algorithms themselves we don't use MPI; we use our own internal secret sauce. And because the bottleneck in those algorithms doesn't tend to be the rotation — it tends to be the local block-to-block operations, given the way that sparsity can be factored out — you can condense the sparse chunks into a small number of larger units and operate on them locally. So we've been able to get a lot of the sparse algorithms working inside SciDB without recourse to MPI. When we throw the computation over the wall to MPI, if MPI fails on us, we report an error message and just try to repeat the job; that's the best that we can do.

The medium-to-long-term goal, though — you know, we don't have the same kind of momentum in the MPI community as a lot of other players do, but we have some basic ideas, and again, this is back to good old-fashioned database land. We know how to keep multiple nodes up and operating and computationally integrated, and this is stuff that we learned using elaborate Lamport clocks and heartbeat mechanisms, with periodic serialization to disk and checkpointing. We know how to do this stuff from SQL engines. So our thought was that once we get enough momentum, maybe we can help out the MPI community a bit by bringing some ideas from that highly reliable relational world to harden up the MPI infrastructure in the way that we use it, and then maybe help by putting some of that stuff back into the community. It depends; I don't know, given the other things that we do. SciDB is a transactional engine: we succeed or fail atomically, we have isolation levels — so two users have two independent perspectives of what's going on — and we support concurrent readers and writers, so you can be reading a data set at the same time somebody else is writing to it. We have that level of support. So it's not quite clear whether MPI, in the general sense, really needs what we're trying to do with it.
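To illustrate why sparse data lets them skip the MPI-heavy path, here is a small, hedged sketch (SciPy, not SciDB internals; block_matmul is an invented name) of a block-wise multiply over chunked storage that only touches occupied chunk pairs — the "local block-to-block operations" Paul describes.

    # Block-sparse matrix multiply over chunked storage (illustration only).
    from scipy import sparse

    def block_matmul(a_blocks, b_blocks, grid):
        """Multiply block matrices stored as {(row, col): sparse block} dicts."""
        out = {}
        for (i, k), a in a_blocks.items():
            for j in range(grid):
                b = b_blocks.get((k, j))
                if b is None:       # empty chunk: no work, no data rotation
                    continue
                prod = a @ b        # local block-to-block operation
                out[(i, j)] = out[(i, j)] + prod if (i, j) in out else prod
        return out

    # Two block-sparse matrices on a 3 x 3 chunk grid, one occupied chunk each.
    a_blocks = {(0, 1): sparse.random(100, 100, density=0.01, format='csr')}
    b_blocks = {(1, 2): sparse.random(100, 100, density=0.01, format='csr')}
    product = block_matmul(a_blocks, b_blocks, grid=3)  # only (0, 2) computed

Because almost all chunk pairs are skipped, the cost is dominated by the few occupied blocks rather than by moving data between nodes, which is why the dense, communication-bound ScaLAPACK/MPI machinery isn't needed here.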
But from our point of view, being able to put in a Lamport clock and keep-alive heartbeats on the MPI nodes, in a way that would allow us to reroute or restart queries from checkpoints — that would be a very useful thing for us to be able to do.

So, you said a lot of things there. You know, I'm a cluster admin; I used to have jobs that run for up to hundreds of hours at a time. And it sounds like you want people to do everything inside of it, but at the same time it sounds like you're still keeping a little bit of an interactive system. How do you really see SciDB mixing in with the regular, you could say, research resource portfolio? How do you think it should be used? When should we use traditional large-scale batch systems with scratch file systems and import data in, and when should we run directly on SciDB?

We've got a crack team of sales and marketing people trying to answer that question right now. We really don't know; it's very early days. We've only been in the market aggressively for — and I say aggressively, but we don't do a lot of this kind of marketing. Our emphasis tends to be on — and this is a bit of marketing speak now — being very lead-generation focused, so we'll attend user groups that have an interest in, for example, a tool like R, right? The open source R product. And you mentioned the difference between interactivity and batch processing. Most of the users that we have tend to be taking a tool like R or a tool like MATLAB, which at some level are fundamentally interactive, right? You don't run a week-long job inside R, unless you're doing something unnatural. So our focus historically has been on these folk who have, say, a 15 to 25 terabyte data set coming in. While it's possible that we might want to do an all-pairs covariance calculation on that kind of data set — which would take an awfully long time — typically what they want is to be able to interactively look at the data, examine it, and do sort of localized experiments. It's still big — still hundreds of gigabytes in size — but this stuff doesn't tend to run for a full week or 10 days. It tends to be the sort of thing where you do some sort of bootstrap, some sort of sampling. You do something quite sophisticated with it — we're not just doing SQL roll-ups and group-bys — something sophisticated enough that you need a tool with the linear algebra built in, with the linear algebra framework. And then perhaps you want to run a big job, but for the most part our batch stuff doesn't tend to be multi-day run times; it tends to be overnight or weekend jobs. The kind of workload where, if you come from the world of business data analytics — folk who do, you know, friends-and-family for Sprint, or load calculations and yield calculations on airline reservation systems — those jobs are big. They run, you know, 12, 24, 48 hours, but typically the actionable information has to be returned every couple of days.
So at the moment, I guess, on that spectrum, think about the SciDB framework as a tool for users of R or MATLAB, or people running little Python scripts, who run into scalability limitations quite quickly — right, you can't do that much with R — and who suddenly say: look, I'm getting an order of magnitude more data, I need to go big. We sit in as a platform that can run behind those interactive tools and happily do, you know, 48-hour jobs for them, as well as supporting a multi-user, very interactive query workload. Does that answer the question? Because I'm not quite sure if that touched on your point.

No, I think that did, because, you know, I was worried about concurrency. I don't know — the more you talk about this stuff, the more I think about the different ways people are using Hadoop: using certain add-ons like Hive and Pig, which have longer startup times, and then using something like Impala or HBase for quick turnaround for more interactive use. And it sounds like you're trying to almost solve everything. But it really sounds like it could be a useful thing for heavier-duty post-data analytics, like post-processing. Maybe analytics is a bad word to use, but that's the first thing that comes to mind: I have a bunch of data produced maybe someplace else, and now I want to slice and dice it, but it's bigger than the usual drag-to-my-desktop post-processing job.

Yeah, that's pretty much right. And it's actually amusing to us — we've watched the evolution of Hadoop over the last little while, you know, from starting out with a very heavy emphasis on MapReduce. We went out and we said: query languages are very useful, they're massively productive. And MapReduce — because it doesn't do peer-to-peer very well, and because it doesn't stream things particularly well, you have to use the file system — it's not really going to work in this interactive mode, right? As you mentioned, starting up a job is expensive. And now, having mumbled about that obliquely for the last few years, we suddenly notice that all of the Hadoop vendors — your Clouderas and your Cloudants and your MapRs of the world — are all out there building what look like massively parallel SQL frameworks. This is Impala: it doesn't use MapReduce, it sits atop HDFS, but essentially, if you stare at it hard enough, it looks like a pretty classic MPP SQL framework.

And that brings me, I think, to the key distinction. We're going to be kind of okay at that stuff, but we're not going to do any of the stuff that MapReduce traditionally excelled at. So if what you're doing is taking weblog information, or some sort of streaming information from a lot of sensors, and you've got to do some kind of on-the-fly conversion of it — it's unstructured, perhaps it's a JSON model or an XML model or something like that, and it's heavily string-manipulation focused — we're not going to be especially good at that.
And then if you look at the way the SQL guys are working: they're going to try to implement something pretty close to full SQL-92; they're going to add joins and subqueries and unions and intersections and divisions and all that good SQL framework. We're not really that focused on that either. We're not going to be a general-purpose query tool. It's really once you start hitting anything where expressing the problem in matrix algebra is the way to go — that's where we really shine, that's our goal, that's our target. So again: big matrix multiplies, all-pairs correlation calculations, SVDs, k-means clustering — anything that involves an underlying algorithm which is just fundamentally not embarrassingly parallelizable. A good example is GLM. Basically everybody can do a pretty good job at GLM, because GLM is embarrassingly parallelizable, but all-pairs Pearson is not; that's a cubic problem where you've really got to use ScaLAPACK to get it to scale (a short sketch after this exchange makes the cost concrete). So we're much more focused on the linear algebra piece than we are on the straightforward SQL piece. That said, we'll be okay at SQL, but I wouldn't try to use us for weblog analytics or something that's heavily textual.

All right, so you've covered a tremendous amount of ground there. It sounds like you are really differentiating what SciDB is for, compared to a lot of the usual paradigms that come to mind when people think about databases and big data. So if I could read between the lines here, having read your website and various other sources, it sounds like you really are targeting the science market, and you are optimizing for that case, and therefore it's kind of a new niche: the database as applied to scientific computing. And that allows you to get all kinds of speed-ups and optimizations. Is that an accurate characterization?

The way our crack marketing department describes it now: they refer to it as a computational database, with exactly that idea that it's all about science. And the curious thing is — I know you chaps are from a supercomputing background — an awful lot of science is now bleeding over into industry.

Okay, so how do users actually interact with this? Is it traditional SQL, or have you expanded it to work with a data-parallel type system? Can you basically write your own user-defined functions? What languages and ways can you interact with SciDB?

I know you chaps want short answers, so I'll try to keep it short. The first thing is that SciDB itself is extensible. It's a microkernel architecture, so if you know how to program in C/C++ you can, much like you can with Postgres, for example, add user-defined types and functions and aggregates, and even linear algebra operators — an operator like multiply, you can add that. At the second level up, we have this very definite perspective on declarative query languages, or query languages in general. Not so much for the science community, I think, because there's a higher-IQ crowd there, but often what you find with folks in industry is that they prefer the high-level query languages, largely because they need more productivity; they need return on their investment, and their investment is not just the hardware, it's the time of the people writing programs. So we've provided two, call them, query languages.
One is a functional language we call AFL, heavily influenced by APL, the original array programming language ("A Programming Language," as the acronym went). And we have a language called AQL, an array query language, which — it looks, or I shouldn't say looks, it smells like SQL, but as a point of technical divergence the underlying algebra is different: set-theoretic versus matrix algebra looks very different. At the top end, if you want to program in R or Python, we have two packages, SciDB-R and SciDB-Py, where you can basically run your R or Python code on the client side and talk to SciDB over port 80. And we also have a JDBC driver on the client side, so you can program in Java if you were born in a different millennium.

Real quick — I kind of mentioned it earlier — have you thought about making it possible to talk directly to SciDB from an MPI cluster?

Yeah, so the code's there; we just don't like to talk about it very much. And look, the reason is that from a marketing point of view we're trying to hide a lot of those details. If you work for Caterpillar, for example, and what you've got is 10,000 big pieces of industrial equipment, each of which has 10,000 sensors, each of which is generating a couple of 16 to 32 bytes per second — you run that for a year, you've got a huge amount of data, and making sense of that is a classic application of signal processing at the very high end. The executives who run those companies don't really care much about MPI; they just want their all-pairs correlations to work. They want to be able to use principal components analysis to cluster their machinery into different classes. So we don't talk about it much, because it doesn't enter into the conversation at the level of the folk that we talk to, and so we haven't documented it. But the entire product is open source, and if you want to grab it and download it, we do have this back end that will enable you to reach out to somebody else's MPI framework, and we surface a fairly thin operator inside the query language to drive the whole thing.
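Here is the all-pairs Pearson sketch promised above, in plain Python/NumPy (illustrative, not SciDB code): for m variables it reduces to a dense m-by-m matrix product, which is exactly the kind of operation that wants ScaLAPACK-style block distribution rather than a simple row-parallel split.

    # All-pairs Pearson correlation as a dense matrix multiply (illustration).
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal((2_000, 200))   # n observations x m variables

    z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize columns
    corr = (z.T @ z) / len(z)                          # m x m correlation matrix

    # Matches NumPy's built-in result, but the z.T @ z form shows the cost:
    # the multiply is O(n * m^2) and couples every column with every other
    # column, so it is not embarrassingly parallelizable.
    assert np.allclose(corr, np.corrcoef(data, rowvar=False))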
Now, changing direction a little bit here. We've talked a lot about these large queries and clusters and things like that. What kind of hardware do you typically run on — what servers, what networks, things like that?

So this can be a quick answer. We're currently available on Ubuntu and CentOS/Red Hat — any of the LTS releases of those platforms — and our next release is going out on the new Ubuntu 14.04. So that's the basic configuration; we don't run on Windows yet, we don't run on Mac, mumble. The rest of the hardware configuration runs the gamut. We test routinely on Amazon AWS clusters: we'll fire up a cluster of, you know, four of their middle-sized instances, run 64 SciDB instances on that, and off we go. But there's an important factor we've noticed with our customer base: they run the gamut all the way from people like NIH, for example, who have an awful lot of data in their repository, but at any one point in time only a relatively small set of it is active in the query base — their users are focused on a particular area, or they're doing lots of sampling and so forth, and they're not really emphasizing the sophisticated math yet — so that kind of configuration tends to be big disk, small CPU, small memory. On the other hand, on some of the NERSC clusters that you guys bumped into, there's not a lot of data there — it's maybe 10 terabytes — but they're doing awfully big things with it, so that tends to be very CPU- and memory-intensive. So it's an unfortunate how-long-is-a-piece-of-string question; it really depends on the workload. But we run the gamut from clouds, to a small number of nodes with lots of memory and infrastructure, to a large number of nodes with lots of local disk. We're really quite flexible in that way; we designed it that way.

So, a technical thing: you mentioned that you replicate data. Does that mean we don't have to worry about the underlying disk being reliable, kind of like Hadoop — you just throw raw disks at it and SciDB figures it out?

That's the design goal. That's where we are now — well, we're not quite there yet. We need to figure out how to do what's called replacement, and provisioning elasticity. Currently, if a node goes down and dies completely — yeah, I believe the admin line is "the magic smoke got out of the box" — if the magic smoke gets out of one of the boxes and you've got to put a new box in, the process for reinstalling and putting data back on that box is at the moment very manual. You can continue to use the cluster and run queries, but at the moment we stop writes. We say: look, we can't guarantee this thing, so we're just going to stop you doing writes until the node is back. The other, longer-term goal is to be able to do provisioning elasticity, which is: I'm about to do a job which I know is a cubic problem, and I know it's going to be 10 terabytes of data; I want to spin up, you know, 100 nodes which are compute-only nodes, run the job on the compute-only nodes, and then shut them down and put them to bed. We're not there yet. Everything is in place to do that; we've just got other priorities to do with quality and with simple things like load performance. But yeah, you're right: in a nutshell, we borrowed a bunch of ideas from the way the HDFS file system works to do the replication — and even older ideas: if you go back to the Andrew File System and some of the block-redundant file systems that were built in the 80s through the mid-90s, we borrowed a bunch of ideas from there as well.
So I'm looking on your website, and it looks like version 13.x is available these days. What kind of functionality is available in that? And actually, what does version 13.x mean? That implies a very old product.

So actually the latest version is 14.3, the one we just put out. We coupled onto the way Ubuntu does its naming, so it's basically year-dot-month as the release identification. And we did that because, if you look around, the world is at 0.9, 0.99, 0.999, and we just felt that it's — I wouldn't say dishonest, but it doesn't really reflect when a thing came out. "Is 0.999 the latest?" is a difficult question to answer, whereas having the year and the month number gives you at least an idea how old a particular platform or a particular product is. So yes, 14.3. The way it works out is that the first real release was back at the end of 2012, so I'd say 12.10 was our first — the release heard around the world. The process since then has been an incremental addition of functionality as users have requested it; we're very responsive to our modestly sized user base at this stage. In the latest 14.3 release we actually added query language macros, so you're able to use let bindings in AFL expressions. And that's an example of a thing which superficially looks easy, but when you've got a terabyte of data under management, getting it right ends up being very hard from a performance point of view. So that's an example of a piece of functionality added to help users write queries — the macros — it just takes a bit of time to get it in. And I guess the final thing is that, you know, we're very, very conscious — being database people, a lot of us, we come from the belt-and-suspenders community — we test the heck out of everything, and that's a very important value. There's a lot of stuff in the product that we don't talk about very much because, although the code's checked in, we're just not happy with the quality quite yet. So you'll begin to see a few of these things over the next few releases.

So let's talk about the way the community is set up. I notice there's a Community Edition — you mentioned the product's open source — and then there's an Enterprise Edition. What's the difference between those?

There's actually a funny story about SciDB in the background. I joined SciDB back in about 2010, 2011, and our original goal was to make this thing a legitimate science platform: to get the whole thing spun up with NSF funding and to try to follow the orthodox route to making this work. The trouble is that Mike Stonebraker was involved, and if you know anything about Mike Stonebraker, you're aware that he has a reputation — a justified reputation — as a very good database entrepreneur. That's to say, he starts companies that attempt to take ideas from the lab and put them out in the marketplace. So we found that, I think in part because of Mike Stonebraker's involvement, the funding agencies looked a bit askance at us and said: you need to try this with the venture capital community. So we took the blue pill and we talked to the venture capital crowd, and the venture capital crowd are very keen to get a return on their investment.
So the result is that we follow the same model as Sleepycat, the Red Hat community, and so forth: a dual-licensed model. There's an open source, freely available, free-to-use platform with no restrictions on it at all. You can grab it, you can use it, it scales forever, all the queries work, and so forth. That's the SciDB platform you can get from the scidb.org website. Then, of course, we're building the tools that make up the Enterprise Edition. It's not so much an issue with science data processing, but if what you're doing is one of those systems I mentioned — taking input from a very large number of sensors and trying to figure out relationships, or trying to detect errors in the sensor network, you know, are machines about to break down — that becomes a fairly critical piece of infrastructure. So having software that keeps it up and running and lets users and administrators know when there are problems, that comes at a premium. All that stuff is in the Enterprise Edition.

You mentioned open source multiple times in there. What specific license are you under?

Very good. So the SciDB open source is available under the Affero GPL, the AGPL. We had it under GPLv3 for a while, but we found that some folks got a bit nervous about that, so it's under the AGPL license. On the enterprise side, the licensing is the usual thing — it's the usual conflict with lawyers and venture capital folk — but I'm rather keen that, as much as possible, we put all of our source code in; users who buy the enterprise license get as much source code as we can legally give them. I just think that's a central value to how software should be developed. There are pieces of the enterprise platform that we've licensed from people like Intel, and Intel won't let us put the source code for that out, which is another reason we don't release that side under the GPL or any other open source license. I'm a little disappointed that we have to include the Intel libraries — this is the Intel MKL, the linear algebra libraries that they ship for their own hardware — but that's their prerogative, and we wanted to bring that stuff to our customers, so we did that deal with the devil.

Okay, so I'm a little curious: you talked about different ways of scaling a SciDB cluster, but in your mind, what is the largest SciDB cluster?

The largest at the moment — yeah, the largest at the moment is the one at NERSC, which, last time I recall, was a 400-node instance they'd spun up for some of the datasets they have there. There's a larger one under construction at the NIH, and the trouble with that number is it keeps changing a bit: they say they're going to go to 500, and then they cut back, and then they get bigger again. But that's going to be a 400-terabyte dataset of biomedical information. So it's changing every day. At the moment, the biggest one we have that's in what you'd call production condition is the NERSC platform, but there's a bigger one coming online sometime in the next two or three months.

Here's a random question, because I'm a developer, and whenever we talk to other developers I just love to ask this question to hear the variety of answers: what version control system do you use, and why?
So we use Subversion, and we use Subversion purely for inertia reasons: we started the development there, and it's adequate to what we need. I think the thing about any version control system — inside IBM I used a variety of the IBM proprietary ones, and the ones they bought, like ClearCase; I've used ones like Piccolo, which became Perforce; I've used SVN, Subversion; I've used GitHub — is that there is no perfect system. They all have their advantages and disadvantages, their drawbacks, their disastrous corner cases. Really it's a case of how far you can push the platform you have with the team you're working with. We're only five or six people at this stage; we're okay with Subversion. As soon as that changes, we'll pick something else, and the next thing we pick will be chosen to meet the specific problems we're encountering. There's a terrible tendency in software development to behave like a bunch of seven-year-olds playing soccer: look, there's a shiny-shiny over there, let's all go over there. This is not my first rodeo; I'm just planning to be a bit more conservative. Let's see which ones wash out, what the various pros and cons are, and then pick our way ahead on our tool-chain choices only as far as our headlights can see — not chase the latest thing because it's the latest thing.

All right. And then I wonder if you could give us a little preview of what are some of the things you're working on now. What's coming in future 14.x and 15.x versions?

Very good. So this is actually very, very exciting; we have very ambitious plans. The most immediate things we're working on, in the short run, are on the language side. We've got some very aggressive ideas about pushing the boundaries of what you can do with a functional language as a data language, both from the point of view of the language's power and from the point of view of the implementation. There are a number of things here, because the idea of a side-effect-free operator language is that you can concurrently, in parallel, compute subcomponents of the plan. I could give you a 20-minute conversation about this versus SQL processing: SQL processing tends to be dominated by pipeline operators — you're pumping data through a plan with relatively few things that block — whereas linear algebra operations often tend to block; there are a lot more operators in the plan tree where, before you can emit the first byte out, you have to take the last byte in. So this is both an opportunity and a problem for query planning in general. The language work we're doing is both to extend the power of the language — to do things more elegantly, with just more power in the language — and to figure out how we can build a runtime engine underneath this thing that takes advantage of all the nice things you get out of a functional language in the context of a database engine. That's quite novel; no one's ever done that before.

The second thing is that we're constantly working on the guts of the execution framework. When what you're doing is an awful lot of double-precision multiply operations, vectorizing your engine — organizing it so that you're taking advantage of modern chip technologies, modern chip ideas like the SIMD (single-instruction, multiple-data) instructions inside the cores of these things — ends up being an interesting engineering
challenge from the point of view of a data management platform, because you've got to figure out how to bundle up blocks of data and how to minimize movement between your L1/L2 caches and your RAM. That's an interesting challenge, and we're just trying to make the guts of the thing go like a bat out of hell.

The third thing is the elasticity model I mentioned a couple of times. We're very excited about the plans on that front. We want to be able to provide not simply the ability to stop the system, add nodes, and bring the system back up again, but a very dynamic provisioning model, an elastic provisioning model, where you can say: I'm about to run a big job, it's going to take those 100 machines — spin up 100 machines and run the job on the 100 machines. That's not going to be here quite yet; that's a little way out. But we're laying the groundwork even now in the engine to get to that point, so users and administrators will be able to say: there's a pool of 100 machines over there, I want to run this particular operation on that pool of 100 machines, and then release them once we're done. So that's the sort of stuff you can see looking ahead: query language improvements, performance improvements to make the thing go faster from the guts out, and this very ambitious goal we have of provisioning systems in a way that eases the burden on administrators and end users of figuring out physically where computation is happening.

Okay, Paul, thank you very much for your time. Where can people find information about both the community version and the enterprise version of SciDB?

So your best bet is the website and the forum we maintain. For SciDB information, www.scidb.org/forum — that's f-o-r-u-m — is where our community kind of hangs out and argues among themselves about the wisdom of our decisions. And www.paradigm4.com — that's p-a-r-a-d-i-g-m, then the numeral 4, dot com — is the commercial side. We've also got email lists set up — support@paradigm4.com and info@paradigm4.com — and those are the ones I pay the most attention to.

Okay, Paul, thank you very much for your time.

No worries, thanks. Really good conversation.