Thank you for coming. Welcome to my talk. I've got one of these really weird slots here, just before lunch, and I know everyone's hungry and wants to go out and get some of that pizza in the pizza van, but I'm going to bore you for an hour first and then you can get pizza. All right, so my name is Manik Surtani. I'm going to be talking about data as a service. I'm also going to talk about Infinispan, which is a new project of mine. Hopefully it'll be very, very exciting; it won't be as boring as I said. Right, so the gist of the talk really is all about cloud. This is the cloud track, as you know. Everybody likes cloud. Everyone thinks it's a good thing. I'm sure you do too; that's why you're here in the cloud track in the first place. And cloud is good. There are lots of good things about it. Yes, a lot of people think it's just a buzzword and a marketing thing, and that's true to some degree, but there are some very good things about it. So cloud is good, but at the same time, everyone also likes data and data storage. You can't really build a serious application without some form of data storage, right? Now, okay, so we like cloud. We like data storage. But unfortunately, cloud and data storage don't like each other. And that's really what this talk is about: why cloud does not like data storage, why data storage does not like cloud, and hopefully what you can do about it, right? Before I get into the talk: I know everyone likes Twitter, and we've got some awesome Wi-Fi here. I must say this is the best conference Wi-Fi I've ever seen. It really works, and works well. So I'm sure you're going to be tweeting about this. If you are, please use these hashtags. The Infinispan community is very, very excited about what we do. They follow us quite closely on Twitter, and I'm sure they'd appreciate it. Amazon EC2, Rackspace.
This is all very, very cool stuff, right? Now, I'm not going to go into too much detail as to why cloud is good and cool and why we all like it. I'm sure you all know that: pay-as-you-go, easy access to lots of resources, and stuff like that. But in general, you've now got a virtualized operating system. That's really the starting point of cloud. That's the point where you say, I'm going to use virtualization, whether it's on EC2 or on VMware or whatever. And now this stuff is elastic. I can fire up more instances as I need them, I can scale them back down when I don't need them anymore, and things like that. All very cool, right? But not enough. You still need to maintain all this other stuff on top of your virtual operating system. You still have the same pain points of maintaining your database on top of a virtual operating system. Your middleware, your front ends, and so on and so forth. So, in the evolution of cloud, the next step really is platform as a service. This is where people take things a few steps further: in addition to just virtualizing your operating system, we also start virtualizing your front end and your middleware. Now, this is all very, very good stuff, right? And before people start asking how come the data tier isn't virtualized too: I'm going to talk about that. Most platform vendors, most PaaS vendors, tend to just virtualize the first two things, and there's a reason why: it's actually quite easy, or relatively easy, to do. Of course, there are some vendors that virtualize the entire stack, and that's the good thing. That's the holy grail. If you look at things like Google App Engine, the entire stack is virtualized, all the way down to your data tier. But Google App Engine is actually a little bit of an anomaly. They're a bit of an exception; not everybody does that. And there's a reason why they don't do that.
It's actually quite hard to do. Where is your data actually stored, right? Now, I mentioned cloud doesn't like data and data doesn't like cloud. Virtualizing data is not quite as simple as cloudifying or elasticating or virtualizing any other layer. Virtualizing Apache is easy: you can shovel your state somewhere else and you have stateless instances. Your middleware, you can kind of do the same thing as well. It's not so easy with databases. So let's take a step back and think why. The main reason is that clouds are ephemeral, or at least you need to assume that clouds are ephemeral. What do I mean by that? Clouds are stateless, or they've been designed with the assumption that they will be stateless. I'm going to talk about EC2 specifically, but this actually applies to many systems. So, with EC2, you've got a virtual machine that you boot up based on an image, a Linux image or whatever you've got, and you get a fresh copy of that machine, right? Now, if that machine were to go down, any state that you might have stored on it is lost. It might come back again; you might bring it back up again. That's cool. But you've actually lost anything that you stored on it. Now, there are lots of things you can do to get around that. Amazon gives you a few other services, like EBS, Elastic Block Store. Anyone heard of EBS? People know about it? Okay, good. So you can do that: you can store stuff in EBS. It's a semi-persistent thing. They give you something called S3 as well, which is also kind of persistent. But they're not quite as trivial to work with. It's not just a case of: here's a machine, here's an operating system, I can just store stuff on it, and if it goes away, I can bring it back up and I haven't lost anything. That's not the case. So that's how clouds tend to achieve elasticity: they make this assumption that everything is going to be stateless.
If EC2 decides to reprovision one of your nodes because a physical machine died, they'll give you another one very quickly. You may not even notice, but what you will notice is that you've lost all your data on it. They're just going to boot up another copy of your image, and so on and so forth. It means that you now need to store all of your state somewhere else. So, why is it so hard to virtualize your data tier? I've been talking about state a lot, and that really is the big deal. Now, front ends and middleware also have state, well, transient state: HTTP sessions, perhaps; your middleware might store some stuff as well. But typically, most platform-as-a-service vendors get around the problem of state in your middleware or front end by pushing all that state into the data store. And yes, that's a very, very valid solution. It means that your middleware and your front end are now stateless, and that's awesome. It means that if you have more requests, if your site gets slashdotted or whatever, that's fine. You can deal with that. Just bring up a few more Apache instances. Bring up a few more middleware instances. You're running on EC2; you can bring them up very quickly, boot them up off an image, add them to your load balancer, and you're good to go. But that's putting a hell of a lot more stress on your data storage tier, and that's what you need to think about at this stage. That's the hard bit. So, I keep saying that virtualizing data is hard, but there are some public services that do exist, that give you virtualized data. Now, what are these services? I mentioned Google earlier; Google's one of them. Amazon gives you a couple of things as well. I've got RDS up there, but RDS actually is not a really scalable, true, virtualized data tier. Who's heard of RDS? Anyone know what it is? Okay, a few hands.
So, for those who don't know, the usual pattern of running a database on EC2, and I'm going to use MySQL as an example, since that's all the Amazon docs use anyway, is that you boot up a MySQL instance on an EC2 image and you attach an EBS volume to it. An EBS volume is supposedly a persistent volume in Amazon. So you attach that volume to this virtual machine and then you point all your data files to that volume. If the virtual machine goes down, you can bring it up again, point it back to that EBS volume, and you've still got your data. Except it's not that simple. It's actually much harder than you think, for a couple of reasons. One is that you can't scale this outwards. An EBS volume can only be mounted on a single virtual machine at any given time. So you can't actually have multiple virtual machines, all running MySQL, for example, all pointing to the same data store. That doesn't work, whether it's for high availability or even scalability. Also, EBS is not guaranteed. It's not guaranteed to actually be there. Even EBS can disappear, in which case you will lose your stuff. And again, the usual pattern that Amazon talks about is that you take snapshots of EBS and store them in S3. Now, S3 is the super-secure, your-data-is-always-going-to-be-around sort of storage service that Amazon gives you. The problem with S3, even though it's all really cool and really safe and very durable, is that it's slow. As you'd expect, there's always a trade-off. The API that you use to connect to S3 is either a REST API or a web service, both of which are notoriously slow for various reasons, mainly HTTP connections and things like that. And then on top of that, you build a layer. That can get slow. So you can use it for snapshots. That's cool. But snapshots... I actually hate snapshots. Snapshots are terrible. A complete false sense of security.
They make you think your data is safe, but you've still got these windows where you might lose stuff. The drawback with snapshots is that they make you go home and say, I don't need to think any more about the issues that might arise if I were to lose my environment at the wrong time. No, you've got to think about that. That's still pretty nasty. So, okay, that's RDS. Amazon also gives you something else, called SimpleDB. Now, SimpleDB actually is a scalable data service. I mentioned SimpleDB quite briefly here. Who's heard of a paper that Amazon once put out called... I forget the name now. Dynamo. Yes, Amazon Dynamo. Who's heard of Dynamo? Who's read it? It's a long and pretty complex paper, but it's very interesting. They talk about a truly scalable data grid system. SimpleDB is essentially their internal implementation of Dynamo. Now, the only problem with SimpleDB, again, is the API. Actually, the name as well. They call it SimpleDB, and that's probably the worst name they could have come up with, because anyone who's used it here can probably confirm that, A, it's not simple, and, B, it's not a database. Calling it SimpleDB is probably the worst thing you could call it. It's essentially a web service, an XML-based web service where you store key-value pairs. It's horrible to use. Who wants to write against a web service just to store key-value pairs? But it's the only service they actually guarantee to be properly elastic and truly durable in their environment. There are others as well. FathomDB is very similar to RDS, except it's not run by Amazon; it's run by somebody else. Cloudant is another service company; they're working on an elastic version of CouchDB. MongoHQ does the same thing with MongoDB. Things like that.
Basically, they manage the environment, they host it, they manage the replication, the snapshots, blah, blah, blah, and they just give you a service login where you can store stuff. But these are all public services. This is all stuff where you buy a login and you use it. And that's not good for everyone, for many reasons. Well, what about private clouds? Private clouds are important as well. When you talk about cloud computing, not everything lives in the public space. Very often you want to use your own hardware. Why? Well, because you have your own hardware, for example. You already bought it, maybe. You don't want to throw it away and then go and use EC2 and pay for stuff again. Or maybe because of security concerns. Maybe you're concerned about storing your data on a public service where Uncle Sam might walk in with a letter and seize all the hardware, and your data with it. What about data locality? You might want stuff local to where your users are processing things. You might not want to fetch data from across the Atlantic, or things like that. Right? So there are lots of reasons why you'd want to build your own data service on your own hardware. So how would you go about doing that? Let's start by looking at the characteristics of a data service. What are the characteristics that make a scalable data service? Firstly, you need elasticity and scalability, and you need high availability and fault tolerance. These are all things that you expect out of a cloud service. One of the reasons why people use cloud for all the other stuff, for your front end, your middleware, and things like that, is elasticity and high availability. You want to be able to scale out on demand when you're slashdotted and things like that. So your data service needs to be able to scale out as well. Right? You also need high availability and fault tolerance. This is all running on commodity hardware.
That's the whole point of cloud. You're going to be running on cheap commodity hardware, and cheap commodity hardware will break. It will break. Power supplies will blow up and things like that. What are you going to do? You need to have a highly available service so that you don't lose quality of service for your users. Now, there are other, optional features as well that you might want out of a data service. Things like transactional capabilities, XA, or MapReduce and things like that. But those are all optional extras, I would say. They're not the core of a data service. So how do you normally store data in your traditional three-tier-style app, like the one I showed? Now, I've actually got SQL and NoSQL on the same slide, but I'm not categorizing them together. What I'm trying to talk about here is essentially non-distributed databases. So non-distributed SQL and non-distributed NoSQL databases. And this kind of stuff you can't really use for a data service. Okay, let's start with databases. Traditionally, you'd use something like MySQL or Postgres. You don't really get all the stuff that I spoke about. You don't have elasticity. You don't really have high availability. Yes, you can cobble it together. You can use NDB as an engine and things like that. But that's not really something that's truly reliable or tried and tested. The other option is stuff like Oracle, Oracle RAC and things like that, which is all extremely expensive. And it doesn't run on commodity hardware either. You need special hardware. You need a SAN, all this other stuff. So it doesn't really work for cloud. And in the case of NoSQL databases, the ones that are not distributed, again, you've got the same problem. If it's not distributed, you're not really going to be elastic or highly available. So in general, stuff like this won't really work to build your data service. What can you use, then? What's the next step?
So I'm going to talk about data grids now. I'm going to change gears a little bit. Essentially, when I talk about data grids, I'm going to cover both in-memory and disk-based data grids. They're two fairly different beasts in terms of how they actually store things and how they optimize for latency. But in terms of distribution, or the way they deal with fault tolerance and elasticity, they're very, very similar. They follow similar principles and similar concepts. So, distributed data stores, by the very name, are inherently distributed. They're distributed by design, by the very nature of what they are. And that means you get a lot of things for free. Since the system is distributed, you're going to get high availability for free: there's always going to be another node somewhere, provided you have an adequate number of nodes altogether, to take over from a node that dies, that fails. And you're going to get elasticity, too. You can add more nodes to the system and it will scale outwards, and it will scale back down again as you shut nodes down when you don't need them anymore, and things like that. So you get all this stuff for free, and this is why I think distributed data grids are actually perfect for a cloud data service. It's a perfect building block. You get all the characteristics that you want from a cloud data service for free out of a data grid. And what about the API? How do your apps actually talk to your data service? A lot of NoSQL databases use things like key-value stores or document-oriented stores, which are very good for special-purpose, very application-specific, domain-specific needs, but in general it's a little bit too low-level. So, anyone here who's a Java developer has heard of JPA. Who uses JPA here?
JPA, essentially, for the non-Java folks, is your standard high-level, object-oriented way of storing data, of storing objects, in Java. It's an ORM. It maps high-level objects onto relational tables and things like that. But leaving the ORM part aside, this is the kind of API that you traditionally use. In Ruby, it's similar: you use ActiveRecord and things like that. You use high-level APIs. You don't actually fiddle around with SQL statements, right? You shouldn't, anyway, not in a high-level app. Essentially, these are the kinds of APIs you need. You shouldn't really be talking key-value. So when you're trying to build out a data service, you should think about stuff like this: what sort of APIs are you going to offer your middleware tier? So, shifting gears a little bit, I'm going to introduce Infinispan. How's everyone doing so far? Have I put anyone to sleep yet? All right, good. So what is Infinispan? I started Infinispan a couple of years back. It's an open-source data grid. It's primarily in-memory; we also spool off to disk, but it's primarily in-memory. It's written in Java, and Scala as well, even though it's not just for the JVM. I'll explain why. And there are two primary usage modes, two ways of interacting with Infinispan. One is embedded mode. Embedded mode is for when your app happens to run in a JVM. Now, I'm careful to say JVM here, not just Java. You could be writing a Groovy app or a JRuby app or whatever; that works just fine. You launch an Infinispan instance within your JVM, and if you launch a few copies of your app on different servers, they'll auto-discover each other. They'll form a cluster. They'll start sharing state, et cetera, et cetera. So that's embedded mode. And then there's client-server mode, where you connect to Infinispan instances remotely over a socket, over a number of protocols that we support.
So I'm going to give you a brief tour of some of the high-level, more notable features of Infinispan, and then I'm going to talk about how you can actually use Infinispan to build a data service. Let's start with embedded mode. Infinispan is primarily a peer-to-peer system. We use peer-to-peer protocols for auto-discovery, for sharing state, for group membership, things like that. Who's heard of a project called JGroups? JGroups is an open-source peer-to-peer library, and that's essentially what Infinispan uses. We're built on top of JGroups. We use JGroups for discovery, we use JGroups to share data, and things like that. So essentially, this is your typical embedded architecture. This is how you would use Infinispan in embedded mode. As I said, you start up a bunch of JVMs. Your app sits in the JVM. Your app launches an Infinispan instance. The Infinispan instances discover each other and start sharing data, and that's pretty much it. It's very simple. You automatically get all the features of Infinispan in your app, stuff like high availability. You can now load-balance requests across your app because data is shared everywhere, like session state and things like that. A lot of web frameworks, for example, tend to use Infinispan to share their transient state. A lot of web containers, like the JBoss app server, actually use Infinispan as well, to distribute HTTP session state or EJB session state, things like that. And this is how JBoss, like I said, achieves clustering. What does the API look like? Infinispan's got a number of APIs. The primary API is a very simple map-like API. If you use Java, the core Infinispan API actually extends java.util.Map. So if you know how to use a Map in Java, you know how to use Infinispan. It's very, very simple. But it's more than that.
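Since the core API extends java.util.Map, embedded-mode code really does read like ordinary map code. Here's a minimal sketch of that style; I've used a plain ConcurrentHashMap as a stand-in so it runs anywhere, with the rough Infinispan bootstrap shown in a comment (treat the exact bootstrap lines as an assumption, not the talk's own example).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapApiSketch {
    public static void main(String[] args) {
        // With embedded Infinispan, the setup is roughly:
        //   EmbeddedCacheManager cm = new DefaultCacheManager("infinispan.xml");
        //   Cache<String, String> cache = cm.getCache();
        // Cache extends java.util.Map, so the code below is identical either
        // way. A ConcurrentHashMap stands in here so this sketch is runnable
        // without Infinispan on the classpath.
        Map<String, String> cache = new ConcurrentHashMap<>();

        cache.put("session:42", "user=alice");           // store
        System.out.println(cache.get("session:42"));     // retrieve: user=alice

        cache.remove("session:42");                      // delete
        System.out.println(cache.containsKey("session:42")); // false
    }
}
```

The point the talk makes is exactly this: no new query language or client API to learn for basic use, just Map semantics that happen to be clustered underneath.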
Of course, we've got a few other higher-level, richer APIs, including asynchronous, non-blocking APIs, and we're also building an upcoming JPA layer as well. So, like I mentioned earlier with JPA, you'll actually have this JPA style of interacting with Infinispan to persist your data, as though it were MySQL or some other database. There are also other high-level APIs being discussed in the community, including ActiveRecord. Who's heard of a project called TorqueBox? Any JRuby people here? Any Ruby people here? TorqueBox essentially is a Ruby on Rails implementation on JRuby, running on top of JBoss. It's actually extremely popular in the cloud community. So, who uses Engine Yard? Anyone use Engine Yard here? Anyone heard of Engine Yard? Engine Yard is a hosted Ruby on Rails cloud environment where you can host your Ruby on Rails applications. It uses JRuby underneath; it uses TorqueBox underneath. The benefit of using TorqueBox, as opposed to Ruby on Rails on its own, is that you get all the benefits of a Java EE app server underneath Ruby on Rails, which particularly includes clustering, high availability, and so on and so forth. What else do we have in Infinispan? So again, like I said, a bunch of high-level features. Infinispan distributes data across its nodes using a consistent-hash-based distribution algorithm. The reason we do this is that consistent hashing gives us a very fast and deterministic way of locating data in a cluster without needing to maintain lots of metadata and without having to broadcast calls around the cluster to try and locate stuff, right? Everything happens on a single machine. You can locate stuff very quickly, in a deterministic fashion. And as a result, the system is self-healing. You don't have a single point of failure, and so on and so forth. Infinispan is also highly concurrent.
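To make the consistent-hashing idea concrete, here's a minimal hash-ring sketch. This is illustrative only: Infinispan's real algorithm differs in its details (number of owners, rebalancing on topology change), but the property the talk relies on, that any node can compute a key's owner locally and deterministically with no metadata lookups or broadcasts, is the same.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: nodes are placed at points on a ring of
// hash values, and a key belongs to the first node clockwise from its hash.
public class HashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        // A few virtual points per node spread the keys more evenly.
        for (int i = 0; i < 3; i++) {
            ring.put((node + "#" + i).hashCode() & 0x7fffffff, node);
        }
    }

    String ownerOf(String key) {
        int h = key.hashCode() & 0x7fffffff;
        // First ring position at or after the key's hash, wrapping around.
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return ring.get(tail.isEmpty() ? ring.firstKey() : tail.firstKey());
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        r.addNode("node-a");
        r.addNode("node-b");
        r.addNode("node-c");
        // Deterministic: every peer running this code computes the same
        // owner for the same key, with no cluster-wide lookup.
        System.out.println(r.ownerOf("user:42").equals(r.ownerOf("user:42"))); // true
    }
}
```

Adding or removing a node only moves the keys adjacent to its ring positions, which is why the system can rebalance incrementally rather than rehashing everything.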
We use MVCC-based locking on a per-node basis, which means you get concurrent reads and writes and things like that. Again, very, very performant, especially in massively multi-core environments. Persistence: I mentioned that Infinispan also persists to disk. It's not just in-memory, although it primarily stores stuff in memory; it also writes through to disk. Now, when I say disk, we've got an interface called the CacheStore interface, and we've got a few implementations that we ship with. One of the CacheStore implementations is just a file-system-based driver. Some of them write to BerkeleyDB. You can actually write back to a database as well, using a JDBC driver. We also have pluggable drivers for Amazon S3, for example, to persist onto something like S3. And, of course, you can write your own as well. Eviction and expiry: if your primary data store is in memory, you're going to have a problem. If you keep putting stuff in memory, at some point your JVM is going to crash with an out-of-memory error. You are going to run out of memory. So you need some form of eviction and expiry to be able to say, okay, I need to take stuff out of memory and put it onto disk, onto my CacheStore. Essentially the paging that happens in most operating systems. So we've got a couple of rather interesting, very interesting adaptive algorithms. Some of this stuff has literally only just come out of university research, like two years ago. We're one of the first implementations. If you want, I can talk about it in more detail later on. But yeah, go and check it out. Very, very interesting stuff. XA transactions: yes, we support XA transactions. And this is actually very interesting, because a lot of NoSQL databases try not to support XA transactions, or they don't support them, as a trade-off for performance, to be able to scale. Now, we actually think we can scale while still supporting XA transactions, because guess what?
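The adaptive algorithms the talk alludes to are more sophisticated than this, but the basic eviction contract, bounded memory plus evict-something-when-full, can be sketched with a plain LRU cache built on LinkedHashMap's access order. This is my stand-in illustration, not Infinispan's actual eviction code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded LRU cache: when full, the least-recently-used entry is dropped.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, i.e. LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // In Infinispan the evicted entry would be passivated to a CacheStore
        // (file system, BerkeleyDB, JDBC, S3) rather than discarded outright.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> c = new LruCache<>(2);
        c.put("a", "1");
        c.put("b", "2");
        c.get("a");      // touch "a", so "b" is now least recently used
        c.put("c", "3"); // capacity exceeded: "b" is evicted
        System.out.println(c.containsKey("b")); // false
    }
}
```

Pairing eviction like this with a CacheStore is what gives the OS-paging behaviour described above: hot data stays in memory, cold data moves to disk but is still retrievable.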
A lot of use cases need transactions. They tend to be transactional. So a lot of people want it, and we do do XA transactions. If anyone's ever heard of a group called Arjuna: this was a research group out of the University of Newcastle which then spun off into its own company, and they were building XA transactional engines probably well before Java, 20 years ago, and theirs is therefore one of the most mature transactional implementations around. We work very closely with them. They actually help us implement XA transactions within Infinispan, and so on and so forth. In addition to that, we're also researching new ways of providing consistency, or providing strong, transactional-like consistency, while making it perform even better. One particular idea we're working on at the moment is atomic broadcast. If anyone's familiar with atomic broadcast, there's lots of very interesting stuff around that. We're working with a couple of universities on it, one in Portugal and one in Italy, quite closely. So that's one to watch out for. That'll be hitting a release at some point soon. What else do we do? Yep, we do MapReduce as well. It's currently in a pre-release state; it's alpha quality. But do download it and try it out. We've got a very interesting API, in my opinion. A lot of people talk about how complex MapReduce is in the Java world. I know it's not very complex in the non-Java world, because you've got things like closures and dynamically typed languages where you can make MapReduce quite nice. But in a strongly typed and rather clunky environment like Java, MapReduce can be very complex. If you look at Hadoop's MapReduce implementation... has anyone tried to implement a MapReduce task in Hadoop? No? A few there? It is complex, it's non-trivial, and we're hoping that our API is actually very, very simple. We've taken lots of cues from dynamically typed languages and things like that. So, yeah.
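For readers who haven't seen the pattern, here is the shape of a map/reduce job sketched in plain Java: a map phase that emits (word, 1) pairs from each entry, and a reduce phase that sums pairs sharing a key. Infinispan's actual API (Mapper/Reducer interfaces, distributed execution across the grid) is richer than this; the sketch only shows the two phases on one machine.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Map phase: each input entry emits (word, 1) pairs...
            for (String word : line.toLowerCase().split("\\s+")) {
                // ...reduce phase: pairs with the same key are summed.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(List.of("cloud data", "data grid data"));
        System.out.println(c.get("data")); // 3
    }
}
```

In a distributed run, the map phase executes on each node against its local slice of the data, and only the intermediate pairs travel to the reducers, which is what makes the pattern scale.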
And yes, we also support querying. We support indexing and querying the stuff that you store in Infinispan. We use Lucene as the querying engine, and as the querying API as well. So that's embedded mode so far. Now, client-server mode builds on top of embedded mode. In terms of architecture, this is what it looks like. You start up a bunch of Infinispan instances in client-server mode; each one starts up in its own JVM, that's what that thing is, and each one opens up a socket and starts listening on it. Now, in this setup, your application does not share the same JVM as Infinispan; it sits outside. In fact, it may not sit in a JVM at all. Your app might not be a Java-based application at all. As long as it can talk one of these protocols over the wire to the Infinispan cluster, you can use Infinispan. Now, we support a number of protocols. Firstly, we support REST. We support REST because it's very popular in cloud environments, it's very easy to manage, it's a very useful protocol to support, plus it's very easy to build and very easy to implement. So we support REST. We also support memcached. Memcached, for those of you who don't know it, is a single-VM, single-server daemon process, and it's very, very popular. It's ubiquitous; most Linux distributions ship with it. What's also very interesting about memcached is that there's a client library for memcached for pretty much any language or any platform on the planet. And this means that, since we support the memcached protocol, pretty much any platform or any language can use Infinispan as well. Right? And then there's Hot Rod. So you're probably thinking, what is Hot Rod? Hot Rod is essentially a wire protocol that we started building specifically for Infinispan. It's an extension of memcached; it adds a few extra things on top of the memcached protocol. But let me talk about what the big deal is, or why we did it in the first place.
So memcached is a very simple protocol for storing and retrieving data, but we found a couple of shortcomings in it. Firstly, it's text-based, not binary. I know there's a binary variant of the memcached protocol, but that's not really well supported; there are a couple of other variants out there, and things like that. But specifically, the reason we found the memcached protocol falling short is that it's a one-way protocol. Clients talk to servers. Clients talk to servers and get results, and that's it. There's no way for a server to talk to a client. Now, why would a server want to talk to a client? There are a couple of reasons. If you want to maintain a dynamic list of which servers are available, that's pretty useful. Another interesting reason is built-in high availability and failover. You could actually build this into clients if clients knew where your servers were and how that was changing. And the last part is, like I said, Infinispan uses a consistent-hash-based algorithm to store data. If the clients could be made aware of which algorithm is in force, and of your server topology at that time, they could direct a request to the actual node which has the data, rather than having a node hop somewhere else to find your data for you. So you can optimize a lot of things if you have a two-way protocol. And that really is what Hot Rod is. It's essentially a two-way protocol where you can do smart routing, where clients can be built in a very intelligent fashion. So here's a quick comparison of the different endpoints, the different protocols that Infinispan supports in client-server mode. REST and memcached are both text protocols, as I said; Hot Rod is binary. In terms of client libraries, you don't need a client library for REST; you just use an HTTP client. For memcached, there are lots and lots of client libraries out there.
The big drawback with Hot Rod is that currently we only have a Java client. It's essentially a reference-implementation client that we've built. I know that someone is building a Python client, and I've also heard of a group building a .NET client. We're hoping to see more of these come out in the community. They're all clustered at the same time, because your Infinispan nodes are clustered, unlike memcached itself, where the back end is not clustered: if you lose your memcached node, you've lost your cache. But this isn't a cache; it's more than a cache. In terms of smart routing, like I just explained, Hot Rod supports smart routing; memcached and REST don't. But you can still build in failover and load balancing across any of these protocols; you just need to use different techniques. In terms of REST, if you were to use the REST endpoint, it's quite easy to build HA and load balancing. You just use any HTTP load balancer, like mod_jk or mod_cluster for Apache, or a hardware load balancer can be used as well. Memcached is a little bit limiting there. Most memcached client libraries allow you to provide multiple endpoints, the IP addresses of multiple memcached servers, and they will try to load-balance and fail over across them. But that predefined server list is static. So if some of those servers were to go away, and some new ones were to be introduced, you've got to either restart your client or reconfigure your client or something, right? And that's not always possible. Whereas with Hot Rod, that's all dynamic, because Hot Rod's got a dynamic view of what's happening on your server side. So let's sum things up. What is Infinispan, right? Is it a data grid? It's got all the characteristics of a data grid: it's in-memory, it's peer-to-peer, it's low-latency because stuff is primarily stored in memory, and it's a key-value store, right? So what is it? Or is it a NoSQL database? Because we do persistence as well, and we do MapReduce as well.
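The static-server-list limitation is easy to see in code. Here's a sketch of how a memcached-style client typically fails over across a fixed endpoint list: it can skip servers it believes are dead, but it can never learn about new servers without being reconfigured, which is exactly the gap Hot Rod closes by pushing topology changes to the client. The endpoint addresses are made up for illustration.

```java
import java.util.List;
import java.util.Set;

public class StaticFailover {
    // Pick a server for a key from a *static* list, skipping known-dead ones.
    static String pick(List<String> servers, Set<String> down, String key) {
        int start = Math.abs(key.hashCode() % servers.size());
        for (int i = 0; i < servers.size(); i++) {
            String s = servers.get((start + i) % servers.size());
            if (!down.contains(s)) {
                return s; // first live server at or after the hashed slot
            }
        }
        throw new IllegalStateException("all servers down");
    }

    public static void main(String[] args) {
        List<String> servers = List.of("10.0.0.1:11211", "10.0.0.2:11211");
        // With one of two servers marked down, traffic fails over to the other;
        // but a newly added third server would never be used until the client
        // is reconfigured and restarted.
        String s = pick(servers, Set.of("10.0.0.1:11211"), "user:42");
        System.out.println(s); // 10.0.0.2:11211
    }
}
```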
Or is it something else entirely? Because you also have querying support and transactional support, stuff that even NoSQL databases don't do, or many of them don't do. So what is it? In reality, it's all of these things. What Infinispan is, is a highly scalable data store that can be used as a database replacement, that can be used as a NoSQL database, and it will still support high-level things like querying and transactions. You don't have to compromise your programming model, or what you do on your app tier, to fit in a scalable database. So why is Infinispan sexy? I actually don't know who that is, just a caveat there. I'll give you six reasons. Firstly, it scales horizontally, scales outwards and back in again. Very important stuff. It's elastic in both directions. Secondly, fast and low-latency access to data. Data is primarily stored in memory, which makes it very, very fast to access things. Thirdly, it addresses a very large heap. If you've got a bunch of JVMs on a bunch of nodes and each one's got two gigs of heap, it gives you the aggregate view of the entire thing. And in Java, it's pretty cool to actually have a single data structure that looks like it can store a couple of dozen gigs of stuff, even though it's striped across multiple nodes. It's cloud friendly. It runs in EC2, it runs in Rackspace and things like that, and it'll run on a private cloud. It deals with ephemeral nodes because, like I said, it is distributed by design, and that helps you deal with things like that. And it can be consumed by any platform; it's not just for Java, not just for the JVM. And of course, most importantly, it's free and it doesn't suck. So let's revisit this little diagram over there in the corner, where we had the various parts of your three-tier architecture turning cloudy, and let's try and make that data bit cloudy as well. How would you do that?
One solution is to actually replace that single node with a bunch of Infinispan nodes, perhaps sitting in EC2 or on whatever virtualized hardware you have. Like I said, because they're distributed, you're going to get all the elasticity on that tier. Your middleware tier would talk to it using one of the three client-server endpoints. You could use all three as well; you don't have to stick to just one. You can actually have each of those nodes listening on all three of those protocols, which is an interesting setup if you have heterogeneous middleware doing different things, written on different platforms and stuff. And there you have it. That's how you end up achieving elasticity, high availability, and scalability in your data tier. And this is actually true of pretty much any data grid that supports the features I spoke about, not just Infinispan. So how would you actually start an Infinispan server? Well, step one is actually not there: it's to download the distribution. I'm assuming you've done that. Once you've downloaded the distribution, you've got your startup script, and you provide the protocol that you want to use. There's a bunch of other tuning options as well, stuff to tune the sockets you're using and things like that. The Infinispan configuration you pass in, that XML file, actually configures and tunes the in-memory node that the server endpoint uses. You get to tune things like what sort of JTA characteristics you want, what sort of transactional characteristics you want, what sort of locking characteristics you want, and things like that. The REST endpoint is slightly different. The Infinispan distribution comes with a WAR file, which is a Java web application. You deploy the Infinispan WAR file in your favorite servlet container, and that exposes an HTTP endpoint for REST. If you don't have a servlet container, there are lots of open source ones; there are some very good ones I can recommend.
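The XML file passed to the startup script has roughly this shape. This is a sketch in the spirit of the Infinispan 4.x-era schema; treat the exact element and attribute names as assumptions and check them against the sample configurations that ship with the distribution you downloaded:

```xml
<infinispan>
  <!-- Settings for the default cache on this node -->
  <default>
    <!-- Transactional (JTA) characteristics: how to find the transaction manager -->
    <transaction
        transactionManagerLookupClass="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"/>

    <!-- Locking characteristics: isolation level and lock-acquisition timeout -->
    <locking isolationLevel="READ_COMMITTED"
             lockAcquisitionTimeout="10000"
             concurrencyLevel="500"/>

    <!-- Consistent-hash based distribution, keeping two copies of each entry -->
    <clustering mode="distribution">
      <hash numOwners="2"/>
    </clustering>
  </default>
</infinispan>
```

You would then point the startup script at this file and pick the protocol to expose, something along the lines of `startServer.sh -r hotrod -c my-config.xml` (the script name and flags here are from memory; check the script's help output in your version).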
And essentially you use your web container to tune things like security, or the number of threads in your socket pooling, and stuff like that. So where are we headed? What's the future for Infinispan? That's my motorcycle as well, just another caveat. Our first release came out in February 2010, February last year. It was version 4. For those of you wondering why the first release is called version 4: I was kind of bored of everyone's first release being version one, right? Let's do something different. There's an FAQ on the website if you're interested; go and have a look. Version 4 was code-named after the beer we were drinking when we were actually coding it. It comes with all these features: the map-like API; the async API, which I didn't talk about, a non-blocking API, that's pretty cool; consistent hash-based distribution; write-through and write-behind to disk; all the storage stuff; eviction and expiration; some management tooling with JMX to actually monitor the nodes and see what's going on; a REST API; and all of that. 4.1, by which stage we had progressed to a different beer, was released in August 2010, with other very cool things like deadlock detection and stuff for distributed transactions. All the client-server stuff came out in this release. Lucene directory implementations too: if you use Lucene for indexing stuff, you can actually distribute your indexes using Infinispan. Stuff like that. What are we doing right now? Working very hard on 5.0. 5.0 is going to have some other cool stuff, including the JPA-like API and the distributed code execution, all the MapReduce stuff I mentioned. That's already out there in alpha quality at the moment. The goal is to actually try and have this as a final release by the summer. And for 5.1 and beyond, what are we doing over there? Dynamic provisioning. This stuff is all very cool. Some of it is pie in the sky; we're just playing around with a few ideas here.
Dynamic provisioning is quite interesting, where you can actually plug in some rules, like SLAs: I want to make sure that if I hit 80% capacity, we fire up some new nodes automatically and start distributing stuff and sharing state, or if I drop below 30% capacity, we shut down a few nodes, because I don't need so many, and stuff like that. That will all be very interesting to watch. Complex event processing as well. This is stuff that we want to do to add event notifications and things like that. It could be quite cool to watch. So to sum things up: essentially I was talking about how clouds are mainstream, they're here, blah blah blah, we like them; we like data; they don't like each other. Or rather, how elastic data is a hard problem to solve, how it's hard to achieve, and how data grids can help. I talked about Infinispan as an open source data grid, a viable solution as a data service for both public and private clouds. And with that, I believe I've got a few minutes for questions. There's a bunch of URLs up there if you want more information, including the project page and the project blog. Fork the project on GitHub and contribute code. We like that. It's all good fun. We're on IRC too; that's our IRC channel on FreeNode. With that, any questions? You mentioned that because it's peer-to-peer you don't have any single point of failure. Later you mentioned that you do XA transactions with two-phase commit, which means you rely on a single transaction manager for the whole transaction. What happens if the transaction manager fails? So the question is: we say we're peer-to-peer based and we don't have a single point of failure, but at the same time we participate in XA transactions, and XA transactions mean that if your transaction manager fails, does that mean you've lost your transaction? Is that your question? Essentially, when we deal with external systems like a transaction manager, that's not internal to Infinispan. That's an external thing.
If your transaction manager dies, that's a problem for the transaction manager, but we do have a solution for that as well. I mentioned Arjuna earlier. Arjuna has kind of been rebranded and renamed; it's now JBoss Transactions, the JBoss Transaction Manager, which in itself is distributed and highly available. As a recommendation, if you want true high availability, we'd recommend that transaction manager, because it has distribution capabilities of its own, so you don't have a single point of failure there either. So that's how we achieve that. Does Infinispan do data locking across multiple nodes, for example during transactions? I'm sorry, could you repeat that? Does Infinispan do data locking across multiple nodes? Yes, it does. So the question was, does Infinispan do data locking across multiple nodes? Yes. We have a very optimistic scheme around that, where we actually only acquire remote locks during the prepare and commit phases, not through the entire life cycle of the transaction, although that's again something you can configure: you can use eager locking as well. Eager locking will of course mean you've got less overall concurrency in the system, but it gives you more safety: fewer transactions potentially rolling back. If you can't acquire all the locks you need, your transaction will fail. The next question is, what's the overhead of using Lucene over Infinispan? Are you referring to using Lucene to index stuff in Infinispan? Well, there is an overhead, yes. What's the overhead? Is it a number? It's something you need to try out. It depends on how complex your objects are, the effort involved in indexing those objects, and of course, most importantly, how you tune and set up Lucene as well. The next question is: since we're using JGroups, how does that scale, essentially, specifically with regards to multicast? No, we don't actually use the multicast part of JGroups.
JGroups is actually very, very tunable and configurable; you can tune it to the nth degree. We only use a certain subset of JGroups for a few very specific things, primarily for discovery and things like that, not so much for the actual communications. For the actual communications we tend to use point-to-point TCP. Is Infinispan suitable for storing gigs of data? The question is, is Infinispan suitable for storing gigs of files? Are you referring to a single file of that kind of size, or in total? Yeah, I mean, I've seen people use it to store terabytes of data as well. So yes, in a word. Anything else? How do you cope with security features? The question is, how do we cope with security? Security is on the roadmap, and unfortunately we don't do very much with security at the moment. We do a little bit, since, like I said, we use JGroups underneath to actually talk on the wire. JGroups has a number of security features and we just benefit from those: JGroups can be configured to use certificates for authentication to join a cluster, and JGroups can also be configured to encrypt any traffic sent on the network. But in terms of user-level security features, so actually controlling access to stuff, we don't support anything yet, although it is on the roadmap. Okay, so with that, if there are no more questions, thank you all for listening, and I won't hold you back from your lunch.