Just a little disclaimer: CouchDB in itself has nothing to do with Ruby, and I'm not even a Ruby guy, I'm from the PHP school, so I might be the odd presentation out here, but I'm glad you all showed up anyway. There are neat things you can do with Ruby and CouchDB, but the topic of Ruby is not inherently embedded in CouchDB, so I hope you like the talk anyway. My name is Jan Lehnardt, I'm from Germany, which you can probably tell from my accent. I'm a web developer, like I said, doing PHP and MySQL, and I tend to keep an eye on emerging technology. That's how I came across CouchDB: I thought that if I had something like CouchDB, it would really, really make my life easier as a web developer. So that's how I got started with the project, and now I'm a contributor and help move the project along. That's me.

The basic premise of my talk is that CouchDB is an easy database, and it's easy on three accounts. The first is that it's easy to understand: there's not much magic going on, the concepts behind it are not that strange, so you can easily approach CouchDB. The second is that the way you use CouchDB, the actual programming against it, is relatively simple, so that programmers who are not that experienced can easily get into the world of CouchDB. And the third is that if you look over an application's lifetime, the demands on the database usually change. In the beginning it should be easy to program against and use; later, when you have a mature application, you usually want your database to be easily maintainable, scalable, and all these things. CouchDB makes that easy as well, so CouchDB is, from top to bottom, an easy database, sort of.

First off, CouchDB is not a relational database. When people say database, they usually mean relational database; CouchDB is not one.
What you usually do when you write an application is you have some sort of data model: you take the data objects you have in your application and split them up into different tables, design joins and queries and all this. A lot has to happen before you actually start writing the application, and that's just the data storage layer. In the end you just want to store data and get it back somehow, and if you want it to be future proof, you need to be really careful to do things right from the beginning. That's a whole lot of work, and CouchDB tries to make it easier.

CouchDB introduces the concept of a document. Documents are like things we have in the real world: a business card, a bill, a receipt. They are structured, sort of. Most of us have business cards, and there's a telephone field and an email field, so there's the same kind of information on them. But I might not have a fax number on my business card and you might; I might not have an office address and you might. Both things are still business cards, but they are structured differently. The same goes for receipts: from each shop you go to, the receipt will look similar, but different shops structure their receipts or bills differently. If you wanted to build, say, a financial overview application that analyzes your spending, you would need to come up with a schema that accommodates all the different structures of all these documents, and that is kind of hard. So CouchDB takes this concept of semi-structured data and lets you store it just as it occurs in the real world, or in your application,
and store it in CouchDB without the need to create a schema up front. There's no predefined structure; you just have a data object and store it in CouchDB.

How does it technically work? CouchDB uses JSON, the JavaScript Object Notation. It obviously comes from the JavaScript language, but it has since evolved to be available everywhere. You have an object in your programming language, a native object with native types, numbers, strings, arrays, or another object nested inside, and you serialize that into a string representation of the object. You store this JSON string in CouchDB. Then you can get the string representation back and deserialize it again into a native object of your programming language. So: take a Ruby object, convert it to JSON, store it in CouchDB; when you want to read the object again, read the JSON out and convert it back into a Ruby object. Or, if you want to integrate different systems, you can deserialize it into a Python object, or PHP, or Java, or any language that supports JSON. The nice thing about JSON is that it maps naturally to the native types, numbers and strings, dictionaries and arrays and lists, that all programming languages share at some point. CouchDB used to use XML, but we found that XML is a bit too verbose; it does more than we need for CouchDB, so we switched to JSON, which is a better fit for what CouchDB does.

This, I hope you can see it, is an example JSON document. It's basically just a list of keys and values, and values can again be strings, numbers, arrays, booleans. It's very simple to understand, easy to read and write, and easy to parse computationally.
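The serialization round trip just described can be sketched in a few lines of Ruby, using only the standard `json` library (the sample data here is made up):

```ruby
require 'json'

# A plain Ruby hash standing in for an application object.
person = {
  "name" => "Jan",
  "tags" => ["couchdb", "erlang"],
  "age"  => 30
}

# Serialize to a JSON string: this is what would get stored in CouchDB.
json = JSON.generate(person)

# Deserialize the string back into a native Ruby object.
restored = JSON.parse(json)
```

The same `json` string could just as well be parsed by Python, PHP, or Java, which is the integration point the talk mentions.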
You'll notice two special properties, _id and _rev. When you store something in CouchDB, you get back an ID; that's the identifier for that object, and if you give CouchDB that ID and ask for the data, it will give you back the data you put in previously. So each document in a CouchDB database has a unique ID that you can identify the data with. The revision is about change: when you change something in a document, say you want to change the age, you don't tell CouchDB to increase the age by one or something like that. Instead, your application creates a complete new document with the changed data and gives that to CouchDB, and CouchDB saves this new document as the latest revision of the document while maintaining the old one. So the revision ID changes for different versions of a document, and if you make a change, decide it was a bad change, and want to undo it, you can ask for the previous revision, or any revision, and save that again as the latest revision. So there's a mini version control system embedded in CouchDB there. That's what I already said about JSON.

The other thing is what I mentioned earlier: the machinery that spreads all your objects out into tables and reassembles them, which usually happens in some sort of object-relational mapper or database abstraction layer, is a messy thing that usually doesn't scale and has other problems; it gives us gray hair. With the approach of taking objects, serializing them, and storing them, we just don't need that. We do the serialization step and we're done saving data. That is a great simplification of things.

So far, that's documents. One more thing: CouchDB documents, like email, can have attachments. You can put in binary data, images, PDFs, associated with the document. So you can organize things nicely there. Okay.
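The update-by-whole-document model can be illustrated with a toy in-memory store (a hypothetical class, and revisions simplified to integers; real CouchDB revisions look more like "1-967a00df..."):

```ruby
# Toy sketch of CouchDB's update model: every update stores a complete
# new document under a new revision, and old revisions stay readable.
class ToyStore
  def initialize
    @revisions = Hash.new { |h, k| h[k] = [] }  # id => [doc, doc, ...]
  end

  # Store a full document; returns the new revision number.
  def put(id, doc)
    @revisions[id] << doc.dup
    @revisions[id].length
  end

  # Latest revision by default, or any older one on request.
  def get(id, rev = nil)
    rev ? @revisions[id][rev - 1] : @revisions[id].last
  end
end

store = ToyStore.new
store.put("jan", { "name" => "Jan", "age" => 30 })
store.put("jan", { "name" => "Jan", "age" => 31 })  # a whole new document
```

Undoing a bad change is then just reading an old revision and `put`ting it again as the newest one, exactly the mini version control the talk describes.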
That's documents. The next topic: the way to talk to CouchDB is over HTTP, and we fully embrace the REST principle. That means everything in CouchDB, each document, is a resource. We have the basic CRUD operations, create, read, update, delete, mapped onto the HTTP methods PUT, GET, POST, and DELETE. This is a very, very simple model. Everybody who knows HTTP, who works on the web, understands it directly. You don't need to learn a new database driver or anything, and every language or framework that supports HTTP can already talk to CouchDB. That's also why you don't need anything else there. Since everything that matters these days already understands HTTP, you can use CouchDB from everywhere: from your browser, from the command line using curl, anything. Again, very simple, very easy to use; everybody knows it, so there's no steep learning curve. And there's a lot of tool support: scaling HTTP, load balancing, proxies, caching, all of that is already solved for HTTP, and you can just use it with CouchDB, which is a great advantage.

Okay. So now we have a dumb object store. We can throw things in, get a key back, and ask for the things we saved under that key, and we have the nice revision thing. But that's really just a dumb data store. The power of a database is usually that you can do calculations on the set of data you have in there, manipulate it, sort it, and so on. That is what views are for. With views you can define subsets of your data, get all the things that have certain attributes. You can do collation: if you need a certain order of things in your application, views can do that. And you can do aggregation, that is, calculations based on the data in the documents: counts, averages, everything you need. The way views are defined in CouchDB is that you create a special design document.
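The CRUD-to-HTTP mapping can be sketched with Ruby's standard `net/http` library. The database name, document ID, and revision below are made up, and only the request objects are built here; actually sending them would need a CouchDB server running (conventionally on port 5984):

```ruby
require 'net/http'
require 'json'

# Hypothetical document resource inside a database called "mydb".
base = "/mydb/some-doc-id"

create_or_update = Net::HTTP::Put.new(base)          # PUT stores a document
create_or_update.body = JSON.generate("name" => "Jan")

read   = Net::HTTP::Get.new(base)                    # GET reads it back
delete = Net::HTTP::Delete.new("#{base}?rev=1-abc")  # DELETE names the rev
```

Sending any of these is then `Net::HTTP.start("localhost", 5984) { |http| http.request(read) }`, which is exactly why no special database driver is needed.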
And in that document, you put two functions, a map function and a reduce function; that's where MapReduce comes in, and I'll have an example of it in a second. What do you do with these? You, as the programmer, provide the map function and the reduce function, and CouchDB takes them. When you query a view (that's just the basic terminology here), CouchDB executes the functions you provided; you don't execute them yourself. CouchDB runs all the data through the functions and returns the calculated result.

Here's a simple example of how that works. We have a database with a set of documents in it. The first step is the map step. What map does is create a list of keys and values. In this case we have messages, emails, and we tagged them, and we want to create a nice tag cloud. For that, we need to know how often each single tag occurred in our data. So the map step runs through all the documents we have and just spits out all the tags. If a document has multiple tags, it puts multiple entries into the list. And as the value, it emits a one, which is the count of this single occurrence of the tag. That sounds weird at first, it's a bit strange, but that's how MapReduce works, and it becomes clear in a second. So what we have is a list of all the tags that occur in our database, with a one for each single occurrence.

Now, the reduce step. The left list is what we just saw. The reduce function gets, for each key, all the values emitted for it, and can then operate on those values from the map list on the left side. For the two tags that occur twice, we just sum up the values, which ends up being the count of that tag in our database; that's why we emit the ones. And this is a very, very basic example of how to do a calculation.
You can do really fancy things there to query for whatever you actually need. Since this is a programmer-centric conference, there's code; this is the map function, two and a half lines of code. What happens when you query the view is that CouchDB goes through the database and throws each document into the map function. The function receives a doc argument, which is our document; this document has an array attribute called tags, and we just iterate over it and emit (that's our internal function to create the list you just saw) the name of the tag and the count of one. That produces the list on the left. Now, the reduce function gets each key and an array of all the values, just like I explained; it loops over the values, sums them up, and returns the sum for that key, and we get the list on the right.

Why do we do it that way? MapReduce is a concept that is parallelizable. That means each step in between, and even the mapping and the reducing themselves, can happen simultaneously on different machines. If you have 100 million documents to count tags over, and ten machines running the calculation, that will be faster than a single machine. With MapReduce we can distribute subsets of the keys, do parts of the calculation on each machine, then combine the results and return them. That is a very powerful mechanism, and it's why Google is so rich, so we thought it was a good idea to implement.

Two more things about views. You probably cringed when I said that when you query the view, CouchDB goes through all the documents in the database. It does that once, to create an index, and each subsequent request reads from that index; the intermediate result gets cached. And if you change things in a CouchDB database, views don't get updated directly on the document change, so you don't pay a view index creation penalty on every write; the view index gets brought up to date when you query the view again.
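The mechanics of the tag-counting view can be mimicked in pure Ruby on an in-memory list of documents (sample data made up; in CouchDB itself the map and reduce functions would normally be written in JavaScript and run by the server):

```ruby
docs = [
  { "subject" => "hi",    "tags" => ["couchdb", "erlang"] },
  { "subject" => "hello", "tags" => ["couchdb"] }
]

# Map: emit a [tag, 1] pair for every tag of every document.
mapped = docs.flat_map { |doc| doc["tags"].map { |tag| [tag, 1] } }

# Group the emitted rows by key, then reduce: sum the ones per tag.
tag_counts = mapped.group_by { |key, _| key }
                   .transform_values { |rows| rows.sum { |_, value| value } }
# tag_counts => {"couchdb"=>2, "erlang"=>1}
```

Emitting a one per occurrence looks odd in this tiny case, but it is exactly what lets the summing step be split across machines and recombined.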
So there might be a lot of reads and writes happening, and the views just don't care. When a view is queried the next time, only the changes that happened between the last time the view was queried and now get integrated into the index and returned. This makes view lookups very fast and very efficient, and the internal structure of CouchDB is totally optimized for quickly updating the indexes incrementally. So, views are incremental and built on demand. And the reduce part is optional: if you just want a list of all the tags, duplicates included, you just leave out the reduce function and get the list that the map function returns. If you want that, that's possible as well. And again, going back to the HTTP interface: each view has a URI in CouchDB, and you can query it over HTTP with GET, so views are handled the same way documents are handled. That's again very easy to use from a programming perspective.

The next thing I'm going to talk about is replication. The basic idea here is that it's a good idea to have your data with you. We have internet access everywhere, except when we don't, or it doesn't work, or it's slow, or the data you have is just too much to send over the wire again, or you want to deliberately be offline so you're not disturbed by Twitter all the time. So you probably want to have the data with you. One scenario: you work for a big corporation, they have a huge internal customer database with all the associated data and applications, and you have a copy of that application on your notebook, so you can replicate the company's database onto your laptop.
You go to the client, maybe it's Apple, and they don't allow any GSM or Wi-Fi or internet access on site, and you're still able to use the application as if you were using the live database, because you have a real copy of the database to work on. What the application then allows you to do when you get back home is synchronize the changes you made with the master database at your company. In fact, the company might have a lot of representatives running around, and they can all integrate their changes back into the master database, and all the changes that happened there can then flow back to you, so you always have the latest data. That's the one scenario.

The other scenario is a web application. You can have a really cool database with all the nice views and MapReduce and all that, but if your hardware goes away for whatever reason, your application is down, and that just can't be. So you usually have multiple database servers to be safe against hardware failure, and you will probably have even more database servers as your application gets bigger, to spread read and write load. Replication helps you get all the instances of your database up to the same level of data, so that you can do load balancing and failover, and obviously backups are possible with that too.

The way replication works is not, as in the first scenario, a constant streaming of changes. It's more like an rsync operation: there are two sets of data, replication computes a diff, and then just applies the diff to the other database, so both end up with the same data. And this works not only between two nodes but between any number of nodes. With MySQL replication, you usually have a master server, or a set of master servers, that all the write requests go to, and then another battery of read servers, slaves, that all the reads come from. With CouchDB you don't need to set up such a fragile system.
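The rsync-like diff-and-apply step can be sketched as a toy function over two in-memory document sets (revisions simplified to plain integers here; real replication also has to carry losing revisions along for conflict detection):

```ruby
# Toy sketch of one replication direction: find which source documents are
# missing or newer than the target's copy, and apply just that diff.
def replicate(source, target)
  diff = source.select do |id, doc|
    target[id].nil? || target[id]["_rev"] < doc["_rev"]
  end
  target.merge!(diff)   # apply the diff; target catches up to source
end

a = { "x" => { "_rev" => 2, "v" => "new" }, "y" => { "_rev" => 1, "v" => "y" } }
b = { "x" => { "_rev" => 1, "v" => "old" } }

replicate(a, b)
# b now has "x" at rev 2 and the previously missing "y"
```

Running it in both directions between any pair of nodes is what lets a whole mesh of masters eventually converge on the same data.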
You can have any number of master servers, have them replicate back and forth, and have them all eventually come up with the same data. This eventual consistency is a topic Werner Vogels, the CTO of Amazon, wrote a great paper about; I really encourage you to look into it. That's the secret behind why Amazon is so massively scalable: they don't force the up-front consistency that a relational database requires. That paper was published after CouchDB was released, so there's no relation there, but it's a very powerful concept that is used in practice, very successfully.

The one problem with distributed replication is that if two or more instances of your database get a change for the same document and then want to replicate, that's called a conflict, and obviously that's a problem situation. What CouchDB does is that each node, upon replication, sees that there are so-and-so many changes in conflict, and it chooses one of the conflicting versions to be the winner. Each node in this replication environment comes up with the same winner, and they do that on their own: there's a deterministic algorithm, so each node can independently arrive at the same winner that everybody else does, without any group communication between the servers. So if there's a network outage, or the network is slow in between, we don't wait for everybody to agree on a winner; we all just pick the same one. And this is where the versioning system I explained earlier comes in handy: the winning version gets saved as the latest revision of the document, and all the losing revisions get saved as previous revisions. CouchDB then sets a flag on the document saying, okay, here's a conflict, and it needs to be dealt with, and usually the application can decide what to do. There's probably timestamping going on, and you can see that there's
stale data here and just throw it away, or we see that the winner that was chosen is not the data we want to end up with, and we promote one of the losing revisions to be the actual winner, resolve the conflict, and have that replicated through our cluster. Yes, question? [From the audience: if one copy of a document has one field changed and another copy has a different field changed, would the two be merged into one document?] The question is whether conflict handling works on the document level or on the field level: if the changes were in different fields of two copies of a document, would they be merged, or would there still be a conflict? At the moment there would still be a conflict. We might get field- or row-level conflict detection at some point, but that's not in at the moment.

So, where was I? Conflicts. All right: most conflicts can be resolved automatically by the application; the user usually doesn't have to see any of that. But where the application can't decide, it can show the user: okay, we have these versions of this data here, which one do you want to keep, and resolve the conflict from there. That is a pretty neat way of dealing with this replication problem. The idea is that your cluster of databases has a consistent view of all your data, so that the application can't ask two servers and get two different answers. There is consistency, eventual consistency again, in CouchDB. Okay, that's replication. The next topic is built for the future.
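The key property of the winner-picking step is that it is a pure function of the conflicting revisions themselves, so every node computes the same answer with no coordination. Here is an illustrative sketch; the actual rule CouchDB uses differs in detail (it compares revision histories), but the shape is the same:

```ruby
# Toy deterministic winner pick: given the same set of conflicting
# revisions, every node independently chooses the same one. The rule
# here (deepest history wins, ties broken lexically by id) is made up
# for illustration.
def pick_winner(conflicting_revs)
  conflicting_revs.sort_by { |rev| [rev[:depth], rev[:id]] }.last
end

conflict = [
  { depth: 3, id: "aaa", body: { "age" => 31 } },
  { depth: 3, id: "bbb", body: { "age" => 32 } }
]

winner = pick_winner(conflict)
# Equal depths, so "bbb" wins the lexical tie-break on every node.
```

Because the result never depends on the order the revisions arrived in, replicating nodes agree without exchanging a single extra message.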
The databases we use today were designed and built 20, 30 years ago, and the computing model back then was: a scientist has a set of calculations to run, he goes to this huge badass machine, sits down in front of it, types in his things, and gets a result back. The machine, the software, and the calculations are all built so that the answers come back as fast as possible, so that the next scientist can come in and work on his problems, because both the scientists and these machines are very expensive. This is fundamentally different from how we use computers today. We have lots of machines, lots of CPUs, lots of cores per CPU, and we have 10,000 users hitting a single server at the same time. That's massive concurrency, and it's fundamentally different from what the databases we use for these systems were designed and built for. I find it a great accomplishment, really astounding, that these systems actually do work and can handle this load they were not designed for. It's impressive, I find.

What CouchDB does is acknowledge the new hardware we have: that we use cheap hardware instead of huge machines, that we have multiple CPUs, that we have a lot of disk space, and that we have this concurrent usage model instead of the single, serialized usage model. Programming is all about trade-offs. CouchDB trades disk space for data consistency, speed, ease of use, and all that. CouchDB is still reasonable with disk space, but that's the trade-off we take, because disk is so cheap these days that we can spend a bit of it to get all the other cool things. So, like I said, relational databases are designed to execute the single query as fast as possible, as fast as the hardware allows.
A lot of thinking goes into designing your data model and your queries to get results back as fast as possible. CouchDB actually has reasonable, pretty good performance for the single query, but the design idea it is optimized for is concurrency: all of your 10,000 users eventually, in a reasonable time, get a result. Not just a hundred per second; a really massive parallel load can be handled with this database, instead of just a couple of hundred queries on one big machine.

How does CouchDB do that? CouchDB is written in Erlang. Erlang is a very cool programming platform; the language is a bit weird, but it's still nice. Erlang was written by Ericsson. Ericsson is a telco company, and when Erlang was developed in the late 80s and early 90s, they had requirements that were pretty unique for the software world. They needed to be able to handle a lot of concurrent telephone calls on a single piece of hardware, and in case there was a problem with one of those calls, say a bug in the software causing a crash, that single problem could by no means affect any of the other calls going on. You cannot just drop the connections; people would complain. That's just not a model that is allowed there. The other thing is, you shouldn't need to take the machine offline to make a software update. That means with Erlang you can have a machine running and, while it runs, update the system and fix a bug. This sounds a lot like amateur open-heart surgery, and it's really scary when you first hear it, but Erlang puts in a few mechanisms to ensure that nobody dies. So this is actually a very cool thing. And here's a bit of number bragging, but it's really impressive: Ericsson sells the AXD301 telecom switch with a guaranteed nine nines of availability, calculated per year.
That is about one thirtieth of a second of downtime per year, guaranteed. Frankly, I don't believe it; I have no idea how they pull that off, but they sell it with that guarantee, so apparently they can deliver it. It's really impressive, and they do it with all the cool features of Erlang.

How? Programming-wise, Erlang has this concept of very, very lightweight processes. These processes are an order of magnitude lighter, smaller, and easier than POSIX threads, for example. On this five-year-old trusty laptop, Erlang is able to spawn 10,000 to 15,000 processes per second and dispose of them again. So they're very, very small. And being a functional programming language, Erlang doesn't maintain global state for your application. That means one chain of processes can't change some shared data and cause another one to crash; that's why thread programming is so hard, and it's just not in there, not allowed. So it's very failure-resistant. The way it works, because we have these cheap processes, is that each module, which is sort of like a class in Erlang, runs in its own process, and when you call a function you actually send a message to another process. So all the function calls, or method calls, in your application actually run in different processes inside Erlang, and you end up with a long call chain of processes. Now, if there's a problem in the currently running instance of such a process, what you usually do in Erlang is just terminate this small process and report to the calling process. It can then decide what to do, and it may decide to crash itself, and you get this crash chain going up to the original calling code. This greatly simplifies code, because you don't handle errors: you just crash. You save something like two thirds of your nasty error-handling code. You just crash your current chain of processes without affecting any of the other things going on. So there might be a single problem somewhere, crashing,
and you can still serve the other 9,999 users in parallel. Erlang allows you to do that. I could go on praising Erlang for hours, but I'll stop here and go to the next point.

The storage model of CouchDB, that is, how bits and bytes actually get written to disk, is very, very clever. First, it is ACID-compliant. That basically means your data is safe and secure once the database says it has written it; that's one of the basic things a database should do, so I won't go into details here. The other thing that makes it so smart is MVCC, multi-version concurrency control. That, again, allows for concurrency. When you have a data store with a read process and a write process going on at the same time, you don't want the reader to end up with half the data from before the write and half the data from after the write. That would be garbage; you don't want that, obviously. The basic way of dealing with it is to say: okay, when there's a write, we block everybody else who wants to read or write until the write is done, and then we let readers in again. Or, while anybody is reading, nobody can write; at some point readers need to be blocked, and so on. With MVCC storage, the revisioning system comes in again. When you read a document in CouchDB, you only ever get to see the version that was current at the time you started reading. If in the meantime a new version is appended to the CouchDB database, the reader just doesn't care; it only gets to see the data that was current when it started reading. This also allows any number of parallel readers on a database, as many as your hardware permits, and every one of them gets only the data that was current at the time they started reading.
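The MVCC idea can be sketched with a toy append-only document (illustrative only, nothing like CouchDB's actual file format): writes append new versions, and a reader remembers which version was newest when it started, so later writes never change what it sees.

```ruby
# Toy MVCC sketch: old versions are never mutated, so a reader holding a
# snapshot sees a stable view no matter how many writes happen after it.
class AppendOnlyDoc
  def initialize(first_version)
    @versions = [first_version]   # append-only list of full versions
  end

  def write(new_version)
    @versions << new_version      # a write only ever appends
  end

  # A "snapshot" is just the index of the newest version at read start.
  def snapshot
    @versions.length - 1
  end

  def read(snap)
    @versions[snap]
  end
end

doc = AppendOnlyDoc.new("v1")
snap = doc.snapshot   # reader starts here
doc.write("v2")       # a concurrent write appends a newer version
```

Because nothing a reader holds is ever overwritten, no read locks are needed, which is what allows the unbounded number of parallel readers.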
The only thing we need to be aware of is that document writes need to be serialized; only one write can happen at a time. The writes get placed into a queue and worked off. The potential problem here is that your hardware isn't fast enough to deal with all the writes coming in, but that's the point where you need to partition the data out to multiple servers anyway. It's the same with any database: if the underlying hardware just isn't powerful enough, you need more hardware. That's a question? [From the audience: with that write queue you just brought up, what would happen if it did get full? What mechanisms are in place in CouchDB to let you pull back in a live situation?] So, what would happen in case the write queue of CouchDB got full. Let me think that through: at some point RAM would fill up with all the queued write requests, and the response time for write requests would just grow, and at the point where there's no more RAM to take in requests, new write requests would simply get denied. Sorry? A lot of badness, yes, a lot of badness. But the way CouchDB is designed, if resources fill up, it gracefully declines to accept more data, and then waits until resources are available again, rather than just crashing the whole database server. Your application would detect that writes aren't getting handled in a snap, and could say: okay, we have a huge server load, we have to wait a bit, and show the user a temporary page until all these things can be flushed to disk again. So you'd be able to monitor that, and eventually you would need more hardware to deal with your load. Another question?
[From the audience: could you bring up a new node, whatever you're calling it, on the fly? Have people done that: monitored the activity and, based on the load, brought up new nodes?] The question is whether people have done actual monitoring of a CouchDB database and then, on demand, brought up new nodes to handle the write load or read load. Not that I'm aware of. This is kind of a managing-CouchDB thing; I haven't seen anybody do it, but that's obviously a way to go. Yes, so.

One more thing about the data store. The way things get committed to disk ensures that after each write of a document, the entire database is in a consistent state. If during a write your hardware happens to fail, or the power goes off, or anything, your database is still in a consistent state. So when you restart CouchDB, when your hardware is back again, you don't need lengthy integrity checks that could take hours on a reasonably sized database; CouchDB is up and running again in no time. Data is always consistent on disk, which is pretty neat if you've ever dealt with a MySQL crash recovery. I like talking about the storage engine as much as I like talking about Erlang, but I'll keep it short in favor of time here.

We've got a few bonus features. In the end, people, actual people, will use your application, and we people use natural language to communicate. People will put natural language into your application, and they will want to search that data using natural language search. Natural language search, full-text search, is not a matter of comparing lists of characters; there's linguistics involved, and it's a very broad field of research and practice. Instead of doing that ourselves, we just leverage existing technology and use Lucene as a framework to create full-text indexes.
What CouchDB does is implement a full-text mechanism that reads CouchDB documents, stores them into a full-text index, and then provides a mechanism to query this full-text index using the native Lucene query syntax. This is set up in a way that if you don't happen to like Lucene, or don't happen to like our version of Lucene, which is a Java one, you can plug in any search technology you want; on the Mac, for example, Spotlight. You would be able to integrate that. So the actual full-text search engine you use is yours to choose. One thing I forgot to mention about the views: I said we use JavaScript functions because they are a natural fit for the JSON data we have. You're actually able to change the query language there, so you could write your map and reduce functions in Ruby or Python or any other language and have CouchDB work with that, without changing a lot of things, if you don't happen to like JavaScript for that situation or for your needs. The other thing is JAQL, which is a research project from within IBM that's going to be open-sourced. It's basically XPath for JSON, where you can query JSON structures and do nifty things with that. That's not actually in CouchDB at the moment, but this is something we're really keen on getting in. Okay, a little bit of history. Damien Katz is the guy who came up with all the cool ideas and most of the implementation. He used to work for Iris Associates, the company that created Lotus Notes for Lotus, which then got bought by IBM; it's like a $10-billion-per-year enterprise for IBM, very successful. Damien worked on the core database engine of Lotus Notes, which, with its concepts of documents and replication, is very similar to what CouchDB has. So there's some leverage of ideas there. Well, at some point Damien said, fuck this, we need something better here.
And he quit his job, took his wife and his newborn daughter, moved to a place near their families where it was cheaper to live, and lived off his savings for two full years to create CouchDB and release it as an open-source project. I wouldn't have done that, but it's very impressive. Yeah, Damien's an amazing guy. He now works for IBM again; they hired him to work full-time on CouchDB. And this is set up in a way that there are actually no strings attached. We were recently accepted by the Apache Software Foundation to become an Apache project, so we will be Apache CouchDB. All the code Damien writes for IBM is obviously copyrighted by IBM, but they grant it royalty-free, worldwide and all that, to the Apache Software Foundation. So in case IBM loses interest in CouchDB and uses Damien for something else, or fires him, all the things he did will still be in the open-source project, CouchDB. They didn't bind the software to themselves; they just hired the developer to work on it. And we release under the Apache 2.0 license now, and again, there are no strings attached. That's a good one. Anyway, I obviously didn't cover everything; I just scratched the surface, and there's a lot going on, a lot of things to talk about with CouchDB. I'll be giving two CouchDB tutorials, longer ones with programming examples and all that, in May and in June or July, at the XTech conference in Dublin, Ireland, and the Erlang eXchange conference in London. If you happen to be around and are interested, you can read about that on my blog when these things happen, so you can show up there. If you want to know anything else, the slides will be online at some point; just follow these links: Damien's blog, my blog, the official CouchDB site. And the last link is a presentation I gave at Mailtrust two weeks ago; they taped it.
There you can see me giving the same talk, only over two hours, with a bit more technical detail, if you like. I usually do a demo, but time's up now, so I signed up for a lightning talk. If you want me to, I can give a short demo of a CouchDB distributed application in the lightning talks; I'd need to prepare, I couldn't do it right now. Yes? Is there any concept of a transaction? The question is: does CouchDB have a concept of a transaction? Yes, it has. The first part of it is that a document write or update either happens or doesn't happen, so there's a single-document transaction. And you can do bulk POST requests to send batches of documents that succeed or don't succeed together. What it doesn't do is two-phase commit, or ensure that a set of changes happens on multiple nodes; you would need to implement that in your application. One limitation of the bulk POST request: you can't put a delete in there, so you would only have write operations; there are no delete operations in that transaction mechanism. More questions. Yes? Could you talk about production status, and maybe a little bit about the next few months? Yeah, a question about production status. At the moment we label CouchDB as alpha software, and by alpha we mean that we don't have all the features in that we want to have for 1.0, but there are systems running CouchDB at the moment that don't have any problems. So the features that are in are pretty stable; we just don't have everything in that we would want for calling it a beta or final release. Now that Damien is back working on it full-time, I think during the summer we'll go into beta and then have a release sometime after that, depending on how well that goes. Question? When a document changes, is it stored as a whole new thing, or just the differences? And what about deletes, how do you do that?
OK, the question is about how changes to documents get appended inside the database. Instead of changing the actual data on disk, we just append things to the database file. Is it a whole new record, or just the diffs? It is a whole new record; we don't store the diffs. At some point I think we will have the ability for you to send in diffs and calculate the new revisions from that, but we don't have that at the moment. The other question is whether we do actual deletes with this model. We don't; we just mark documents as deleted. So your database grows all the time, even if you delete data. And there's a process called compaction that runs periodically and eventually gets rid of all the old revisions and deleted data, to gain disk space back. That is where CouchDB wastes disk space in favor of speed and consistency. Sorry, question? The view implementation: how is it tied to the JSON structure? Is it a one-to-one mapping, where the view has to have the same fields as the document, or is there some kind of typing set up by the view? OK, the question is how the JSON structure is tied to what you have in the views. When you create the view, in the map and reduce functions you can actually define a JSON object, again with the fields you want to end up with in the view. Say you want a view with the names and nicknames of all the persons: you would just emit a new JSON object that has the structure you want, with that data in it. Then, when you query the view, you get this predefined object back with the data in it. Next question: do you have to define a view in order to see data? Not if you request single documents, obviously, but if you want ranges of documents, you create a view.
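The names-and-nicknames example can be simulated in a few lines: a map function runs once per document and emits the key/value rows that make up the view. This is a toy re-implementation of the map step in Python, not CouchDB's view engine, and the documents are invented:

```python
# Toy simulation of CouchDB's map step over JSON-style documents.
docs = [
    {"_id": "p1", "name": "Jan Lehnardt", "nickname": "jan"},
    {"_id": "p2", "name": "Damien Katz", "nickname": "damien"},
]

def map_fn(doc, emit):
    # Emit exactly the object shape the view should return.
    emit(doc["name"], {"name": doc["name"], "nickname": doc["nickname"]})

rows = []
for doc in docs:
    map_fn(doc, lambda key, value: rows.append({"key": key, "value": value}))

rows.sort(key=lambda r: r["key"])  # view rows come back sorted by key
print([r["key"] for r in rows])    # ['Damien Katz', 'Jan Lehnardt']
```

Querying the real view then returns exactly these predefined objects, in key order, which is what makes range queries over documents possible.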
If you're familiar with relational databases, think of a view as an index on a column. Have as many views as you want for all the data you need to get out. OK, there was another question. Specifically for Ruby, the only part I can't quite figure out is defining new views: where do the map and reduce functions go with the data? OK, the question is how to actually create a view, how do you put the map and reduce functions into CouchDB? We have this thing we call design documents: they are just documents that have a special URL, which CouchDB recognizes as a view-definition document. You put the literal code, the JavaScript code for those functions, into this document, and it has to have a certain structure. You can find examples on the CouchDB wiki. CouchDB then sees that as a view, reads this document, and applies those functions from there. The question over there, yes? OK, the whole thing about RDBMSes versus just storing data: is it reasonable to see CouchDB as an object database that just happens to use JSON? The question is whether CouchDB is just an object database that happens to use JSON. I must admit I'm not too familiar with what are called object or object-oriented databases. We obviously don't have all the full-fledged OO features; we just store objects into CouchDB. Damien calls CouchDB a document database, so that's a bit different. Why would you want to use a document database instead of an RDBMS? I should say, CouchDB is not meant as a silver bullet that solves all your problems. There are situations where a relational database is just the perfect fit, because your data naturally fits the relational model.
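A design document as just described is itself ordinary JSON with a special `_design/` id, holding the view functions as literal JavaScript source. Here is a sketch of what one might look like; the exact field layout and query URLs differ between CouchDB versions, so treat this as illustrative and check the CouchDB wiki for the current format:

```python
import json

# Hypothetical design document: the map function is literal JavaScript
# source stored as a string inside an ordinary JSON document.
design_doc = {
    "_id": "_design/people",
    "views": {
        "by_name": {
            "map": "function(doc) { if (doc.name) emit(doc.name, doc); }"
        }
    },
}
# You would PUT this to http://localhost:5984/<db>/_design/people and
# then query the view over HTTP; the exact URLs vary by CouchDB version.
print(json.dumps(design_doc, indent=2))
```

CouchDB reads the document at that special id, compiles the function strings, and applies them to every document in the database to build the view.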
But it happens that most data we use in the real world is not relational, and there's something artificial going on in transforming it to fit the relational model. You can then decide whether that model fits your data. Most of the data we use, like the business cards and receipts example, is data from the real world. My business card is not a pointer to the actual data that gets updated when I change my address; it's printed on the thing, it can get outdated, and people might end up with old revisions of my business card that are no longer valid, and somehow, as people, we can deal with that. This concept is just applied to the data store. Okay, more questions. A question on replication: does each database have to have the full data? For example, in a business you can have a client application; what if you have data in your database that shouldn't be accessible to all users? So, a question about replication and whether all the nodes must have all the data. At the moment, yes, but you will be able to define replication on a view level. So I would be able to say: replicate only the documents that match this view, so that you can send only a subset of the data to a replicating database. Also security, which I didn't talk about, will be implemented in that way, so that you have a mechanism to say this set of documents is not allowed to be read or written by this sort of user. Again, that's not in at the moment, but that's how it's going to work. Got a question? Sorry, what was the question? I think it was whether each database must have the full data.
Yeah, if you have a company-internal application and a client application, the client application obviously shouldn't have all the data from the company application, just a subset of it. So if you have a database with confidential information and only want a subset of that for the client, would you be able to replicate just that subset? I think that was your question. Sorry, could you touch on security? The question is whether I can touch on security. I would love to, but there is no built-in security in CouchDB at the moment. We will have a full-fledged security and authentication system, where documents have users and you can validate documents on write, so you have certain ways to enforce some sort of structure, and documents that don't match it get rejected on write. There will be read and write permissions and all that, but at the moment that's not in. So if you run CouchDB now, it should sit behind some sort of proxy layer and not actually be exposed to users. Yes? Are views intended to be used across a homogeneous data set, or can a view span documents with different schemas? The question is whether all documents need to have the same structure to be in the same view. No, they don't. For example, you have an address-book application, and you decided to have a name field along with all the other fields. At some point, you want to split it into a first-name and a last-name field. So you have an old set of documents that has just a name field, your application reads a view that has a name field, and the new documents have a first-name and a last-name field.
In the view, what you could do, for the object you store in the view, is concatenate the first and the last name and return that as the name, so that your application can still use it. So views are one way to aggregate semi-structured data. Yes? Are there plans to include something similar to triggers in SQL databases? The question is whether there are plans to add something like the triggers in SQL databases. We have that: there is a mechanism in CouchDB where you can register a standalone daemon, which you would need to write, that gets an update notification each time a database gets changed. That's actually how the full-text indexing already works. So you can get a notification each time there's a change in your database and then act on that notification; that already works. Good, another question. Any plans for integrating it with the browser, like Google Gears? Or is that a different model for the client? The question is whether there are any plans to integrate it with a browser, like Google Gears. We are not the ones to do that; that would be the browser vendors. But that would be a perfect usage model for CouchDB. Integration with the client: I will show that in the demo. Since it's an HTTP API, and you can put HTML documents inside CouchDB as document attachments that have a unique URI, you can request such an attachment inside a browser, have a web page rendered there, and then have AJAX calls talk to CouchDB. So you can have a full-fledged application served and running from CouchDB without any middleware. That's another usage model, which obviously you don't want to use without the security stuff in place. But for internal applications, this is already possible, and CouchDB comes with an administration program that works that way; it's just a jQuery/AJAX website where you can administer your CouchDB. Another question? I have a follow-up question: what if it's not just business cards you store, but different types of data? How do you tell the data types apart?
How do you determine the data types? The question is what happens if I have multiple data types, not only business cards. There's nothing that forces you to declare that: each document can have a different structure, and there's no way to tell up front unless you query for the document. We are still developing best practices for working with CouchDB. One would be to have a type field that tells your application which type of document you put in there: if it's a business card, have all of those in a view that lists all the people; if it happens to be an inventory item, say all the computers in the company, that would be a different type. So you could do that, or you could query on the structure of a document: if there's a person field, a person key, in the document, then you know it is a business-card entry, and if there's an inventory field, you know it's an inventory item. So you can run queries on the structure of your documents to determine which type each one is. But CouchDB doesn't enforce it. Obviously you would want to have some sort of mechanism there, and we will have validation functions. They are just functions you provide, like the map and reduce functions, and they get applied on write. A validation function can check for certain structural things to be there, and if it doesn't return true, it disallows the actual write, and the rejection gets reported back to the user. So you can enforce some schema there if you want to, but you don't have to. Again, that's not in at the moment, but we will have that. Yes, another question. Is there a time frame for 1.0? Like I said, we hope to have a beta during the summer and then, depending on how things go, a 1.0 sometime after that.
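The type-field convention and the planned validation functions can be sketched together. Everything here, the field names and the validation rule, is an invented example, not a CouchDB API:

```python
# Sketch of the conventions discussed in the talk: an explicit "type"
# field, structural sniffing as a fallback, and a validation function
# applied on write. All field names are my own examples.
def doc_type(doc):
    if "type" in doc:              # best practice: explicit type field
        return doc["type"]
    if "person" in doc:            # fall back to inspecting the structure
        return "business_card"
    if "inventory" in doc:
        return "inventory"
    return "unknown"

def validate(doc):
    """Reject business cards without a name, as a validation function might."""
    if doc_type(doc) == "business_card" and "name" not in doc:
        return False
    return True

docs = [
    {"type": "business_card", "name": "Jan"},
    {"person": {"name": "Damien"}, "name": "Damien"},
    {"inventory": "laptop", "count": 3},
    {"type": "business_card"},     # missing name: this write gets rejected
]
accepted = [d for d in docs if validate(d)]
print([doc_type(d) for d in docs], len(accepted))
```

In real CouchDB the validation function would run server-side on each write and the rejection would come back as an error response; here the filter simply drops the invalid document.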
Okay, if you've got any other questions, or they pop up later, just grab me in the hallway or anywhere, send me an email, just talk to me, I'm here. Thanks.