 I'm Daniel Austin. I'm chief architect at PayPal, and I'm also a mad scientist, as some of you discovered on Tuesday. In our talk today, I'm going to convince you all that the World Wide Web is nothing more than a very large NoSQL distributed data system. Notice that I did not say database. That's sort of tainted vocabulary in this context. So it's not really a database, but a data system. And we'll talk about that distinction here in a little bit. I actually have a clicker, so that's going to help a little bit. The big idea, I think I just told you this one, right? The Web is the largest NoSQL distributed data system. Actually, it's the second in chronological order. And we'll talk a little bit about that. When I was writing the slides, I debated whether I wanted to write more here. And then I thought, a big idea only needs a few words. So I left it where it was. In making this talk, as with every talk that I give, I sort of try to create a mind map, to map my own thinking on this. And because the mind map for this talk was so complicated, I thought I would share it with you. This was actually created in iMindMap 6. And you can see that it keeps on going. I kind of cut it off a little bit here. These are all the topics that we're going to talk about today. No, they're not one slide each. But I wanted to give you some idea of the complexity of the thinking that goes on here. And I don't want to leave you with the impression that I've solved every problem involved in this space. This is preliminary thinking. There'll be more interesting ideas, more conjectures about it as we go forward. So you're looking at a work in progress. But I wanted to give you an idea of how complex the thinking around these things really gets. Just to deal with a little bit of the history of this, the WWW was not the first large-scale NoSQL distributed data system. That privilege actually belongs to the DNS system, circa 1983. 
This was our first very large NoSQL data system. If you think about it, it's distributed. Basically we're using a bunch of flat files. And most people don't really think of it as a NoSQL system. The term NoSQL hadn't been invented at that point in time, and we weren't really thinking in those sorts of terms. But this was our first big data system. And the web followed closely in 1989. I put in a little quote here. This was from Tim Berners-Lee's original proposal for the World Wide Web with Robert Cailliau. He did this while he was at CERN. I was there roughly at the same time. The interesting part here was that I had underlined the idea that he had put databases in there. Now, this was from an era when we still put a dash in the word data-bases. I don't think anybody would do that any longer. But even when Tim was originally conceiving the web, I think that he understood, at least at an intuitive level, that this is really this very large distributed data system. So there's a certain history behind this. I also found out two days ago that somebody stood here at this conference three years ago and made a similar point, talking about the web as a very large distributed database. When I looked over his slides, they were almost entirely about RDFa. And so his slides are very, very different than mine. His concept was very, very different than mine. These are two separate talks. But because there is prior art, I wanted to make sure that I brought that up and credited this person. His name is Robert Tenerife. And I'm not sure where he works, because LinkedIn wouldn't tell me that much. So the web sits on three legs. I think every one of us understands this pretty well. HTTP provides transport. This is way, way up in the stack. HTML provides presentation around what we might think of as a result set, which is the result of your query. The URIs provide addressing. 
And most of this talk today is going to be about shortcomings of the URI system for addressing all of this data. It's fairly easy, I think, to see the web as a distributed data system. It's much more difficult when you start thinking of URIs as queries. And we'll talk about how REST queries form a certain sort of syntax, and what the ups and downs of that are. We'll talk about the ideas behind the query part of the URL and how that changes the ideas of addressing. There are a number of things around URIs that we're going to talk about in the talk today. But I want you to understand these three legs that the web stands on. In my career, I've taken a stab at trying to improve every single one of these at various times. I've been a member of the HTML group at the W3C now for 17 years. And I actually have a proposal out there with the IETF for reforming HTTP, and that was sort of folded into the HTTPbis work that's being done now. And so now I'm tackling URIs. And that's part of the talk today: I'm going to try and see if I can't inspire some of you to start thinking about how we might improve the URI addressing syntax, and do a little bit more thinking about queries rather than addresses. So that's a little bit of history and structure. Let's get into the main points. There are two words that I want you to take away from this slide: transitive and intransitive. There's all kinds of talk about hyperlinks that goes on out here. I want to simplify this, not get involved too much in the existing terminology from some of the previous work, and talk about how links are actually actuated on the web. There are really two kinds, transitive and intransitive queries. Transitive queries are usually for inactive content. You can think about these as the links that the browser clicks on for you. 
They belong to things like images, CSS, JavaScript, things that are loaded behind the scenes as subqueries to the initial query, which went to our base page somewhere, presumably, right? And the browser clicks on those for you. No user ever traverses that link for the PayPal logo or the Google logo or whatever. Those are transitive links. Intransitive links are the ones that you get to click on. And these usually lead to a top-level query, and we'll talk about the query structure in a minute, that then makes some subqueries. And those subqueries are mostly transitive hyperlinks. So as we go through this, I want you to keep this distinction in mind between transitive and intransitive hyperlinks. We click on some; the browser clicks on others on our behalf. It's a little more complicated than that if we want to write out graph theory equations and so forth describing these hyperlinks. But for our purposes, it's enough, I think, to talk about the links that are actuated by the browser as part of the presentation and the links that the user actually actuates through some act of will. So now that we have this vocabulary, let's have a look at how this actually plays out and how the queries actually go. Now, this looks like a sequence diagram. It does not match the UML specification for sequence diagrams. Nobody call me out on my UML here. I'm trying to give you some idea of how the query pattern works rather than adhere to the UML methodology. So the user clicks on something. And this is one of the first things I want to call out to you. Our queries go to WWWDB, and I'm going to call the web WWWDB throughout this talk, because I want you to think about it as a database rather than as a presentation system, which is how most people think about it. When the user clicks on this, the actual verbs in your query are part of HTTP, but the actual addresses for the queries are part of the URI system. Now, when we do a query in some SQL language, we don't do that. 
The verbs and the addresses are all in the same statement. The web splits these two functions between the actual verbs, GET, PUT, et cetera, and the actual address of whatever resources are going to be addressed or manipulated. So the user clicks on something. This results in a GET for something, a base page, index.html. That subsequently brings that data back. We get the base HTML back to the user agent. Notice that that line does not go back to the user, because the user has nothing to do with it at this point. The browser then goes about its business, makes a link list of all the objects in the page, those transitive hyperlinks, and then goes about downloading them, usually from the nearest CDN node. And you'll see I put little database icons here at every step, because every single one of these leads to a query into some database or another. Even if it's a caching database, it's still a database of some sort. Now, I don't want you to think it's some RDBMS and everything comes out of Oracle. We know better than that. But I did want to remind you that this is a data store. So we grab zero to N objects from the CDN, and those get returned and rendered here in the user agent. And then the user actually causes some event. Maybe they click on something to get a price, maybe for a movie. There could be zero to N of those as well. Then a query is made. And guess where that query goes? To some database somewhere, right? And it comes back to the user, and then we might have an ad on the page, right? And this is actually a transitive hyperlink. The user did not click on something that says, show me this ad for movies or something, right? The browser did that for you. And I sort of illustrated that here. You could have zero to N ads, hopefully N minus one, rather than the full amount. And then we get back these ancillary results. The basic idea here is that the query pattern goes like this. You make a request for a web page. You get the base page. 
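The pattern so far, the base page with its embedded transitive subqueries and the user's intransitive links, might be sketched in a few lines of Python. This is a toy illustration, not any real browser's logic, and the example page and URLs are made up.

```python
# A minimal sketch of the query pattern described above: the base page is
# control information, and the transitive hyperlinks embedded in it (images,
# CSS, scripts) are subqueries that the browser "clicks" for you, while the
# intransitive <a href> links wait for the user. The HTML is a made-up page.
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    """Separates links a browser follows on its own (transitive) from
    links a user would have to click (intransitive)."""
    def __init__(self):
        super().__init__()
        self.transitive = []    # subqueries: <img src>, <script src>, <link href>
        self.intransitive = []  # user-actuated: <a href>
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.transitive.append(attrs["src"])
        elif tag == "link" and "href" in attrs:
            self.transitive.append(attrs["href"])
        elif tag == "a" and "href" in attrs:
            self.intransitive.append(attrs["href"])

base_page = """
<html><head><link href="https://cdn.example.com/site.css"></head>
<body><img src="https://cdn.example.com/logo.png">
<a href="https://example.com/movies?title=Pacific+Rim">Showtimes</a>
<script src="https://cdn.example.com/app.js"></script></body></html>
"""

p = LinkClassifier()
p.feed(base_page)
print(p.transitive)    # three subqueries the browser actuates for you
print(p.intransitive)  # one link the user may click
```

Note that in this tiny page, as on the real web, the transitive subqueries outnumber the intransitive link and point at the CDN.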
You can think of this as control information, right? The control information has subqueries embedded in it in the form of image links and CSS links and this sort of thing. Those subqueries are transitive queries. You don't have to click on them. They're embedded in that base page. The user agent goes out to the CDN, maybe calls an API here; we'll talk about the role of APIs as subqueries later in the talk. It may actually go out to some third party. And this is sort of what I want to illustrate with this last part of the query: not all of these queries go back to somebody that owns that content; you can actually bring all this third-party content in. And we've been doing this for years. Everybody knows what a mashup is. But when you think about it in terms of the query pattern, as queries and subqueries, it starts making sense as a distributed data system, rather than as sort of a presentation system. So this is our query pattern on the web. And the statistics around it are pretty horrid, and I'll get to that later. But something like 95% of all bytes that are transferred on the web come out of the transitive hyperlinks, come out of the CDN. And only about 5% of the traffic is intransitive content that's actually dynamic and has something to do with what you were really making a query for. 5%. It's the 5% that doesn't get cached, of course. Right? So anybody that's followed this whole area for a long time will recognize the title of this slide. It's sort of a riff on something that Tim Berners-Lee said a few years ago. He published a famous article at the W3C, sort of trying to explain what HTTP URIs identify. And it came from an era when URIs were still associated with HTTP. So you can kind of see that it's been a little while. And so I wanted to return to that topic and sort of show that all of Tim's original ideas on what URIs identify were completely wrong. So there it is. Let's talk about it. 
So the original paper said, you know, a URI identifies a resource. Well, this is no longer true. We long ago broke the web in this case, right? A URI no longer identifies a single object. It may not even identify a single instance of an object, simply because that object may have been customized for you in particular. Right? That base data about, you know, the news or whatever it is may have been decorated in some way that's based on context data. Take Yahoo News, right? They know about you. They know what you like to read. And so your request, even though the query, if I took that query and examined it, would look exactly like my query, might return entirely different results. So you can see that we're not really dealing with, you know, SELECT ... WHERE here. Right? It's a different sort of creature. And that's why I'm talking about it as a data system rather than a database. Because the one thing that we do depend on databases for is to return the same answer to the same query repeatedly. Right? And this is not the case on the web. The same query can return different results for different people depending on context, time, location, a number of different things. Once again I wanted to return to that split in our syntax. The verbs are actually handled in HTTP; the addressing and the actual query string part are handled in the URI system. These two systems are very different, have different RFCs. We sometimes use one without the other. They're not necessarily coupled. Right? We can use HTTP without URIs. We can use URIs without HTTP. Right? This is what the whole URI system and naming schemes are for. So those two things are very different. And that split leaves us with a state management problem. And we'll talk about that in a little while. But one thing that I want to just inject here is that normally we talk about, you know, stateful databases. 
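The same-query, different-results behavior described above can be made concrete with a toy resolver. Nothing here is any real system's API; the URI, the base data, and the personalization rule are all invented for illustration.

```python
# A toy illustration of why the same URI is not a stable address on WWWDB:
# the result of a query depends on context data about the requester, so two
# byte-identical queries can return different result sets.
def resolve(uri, context):
    """Return the result set for a query URI, decorated by requester context."""
    base = {"/news": ["world headline", "markets headline"]}
    result = list(base.get(uri, []))
    # Context decoration, like the Yahoo News example: the site knows
    # what this particular requester likes to read.
    if "sports" in context.get("interests", []):
        result.append("sports headline")
    return result

same_query = "/news"
print(resolve(same_query, {"interests": ["sports"]}))  # three rows for this user
print(resolve(same_query, {"interests": []}))          # two rows for that user
```

The query string never changed; only the context did. That is exactly the property a classical database refuses to have.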
And one of the first questions I got when I was reviewing this talk with one of my colleagues was: so you're saying the web is a big distributed database, I'll buy that, but why NoSQL? And so I had to think about that a little bit, because it sort of seemed obvious to me. You know, there's not a lot of SQL in there, right? But I had to think about why it is a NoSQL database. And the real reason for that is because SQL manages state. Right? SQL is a layer-five protocol, and state is implicit. You establish a session, and your actions are implicit on what happened previously in that session. Right? We don't have that sort of thing on the web. Right? All this happens at layer seven in some stateless way. So we need a NoSQL solution, because SQL simply won't work. Right? There are a number of other reasons why we had to use NoSQL to do this. But that's the main one: we're not maintaining state. The URI actually encapsulates a resource as the object identified by a query. So that sounds a little bit like buzzwords, sounds a little bit like corp speak. I don't want to mislead you on it, but the idea behind that concept is that the URI identifies a resource in a particular state at a particular time based on the person making that query. It's entirely context driven. It's not a straightforward address in the sense of saying, okay, the document is located here at this IP address and we're looking for all the characters between, you know, line item 35 and line item 40, something like that. What we're really doing is encapsulating that entire object including its state. This is the whole idea behind REST, right? We talk about the state, and when we move the state or change it, we distribute that in some document form. The transitive and intransitive hyperlinks always go to different locations. This is one of the things that makes people crazy. Right? All the transitive links go to the CDN. Right? 
All the actual content that's active, that's dynamic, that actually interacts with people and results in an intransitive query, is in some different location. Right? This has a huge effect on the performance of the web page. Anybody who knows me knows that a lot of my work is done in web performance, and this is one of the main problems: the actual information that the user wants is often in a completely different location than most of the queries that were actually made. So, if it's a big data system: we're all familiar with Oracle, right? They made a big announcement here. Right? They're a big presence in this space. And we know that Oracle has two kinds of caching. Right? They have a query cache and a data cache. Right? And you can really make your work go a lot faster and work a lot better if you take advantage of that query cache and that data cache. And for WWWDB, our CDNs basically act as a caching mechanism, but it doesn't really work entirely the way that we would like it to. Right? So we have these local caching systems, you know, it's frequency based. Most of those queries go to the CDN, the transitive queries. Ninety-five percent of all the bytes come out of the CDN. Right? But only five percent of the content is what's really interacting with the user. Think about stripping down the result set: suppose you just made a query, as if the web were Oracle, and wanted a result set back instead of a web page, which is a decorated presentation object. Right? Think about getting a Google result set instead of a Google page of blue links. Right? I mean, that's a pretty simple example. That result set would have, you know, a row with the URL, and maybe a row with a description, and so forth. You could imagine sitting there at your monochrome monitor with your fixed-width font and looking at that as a result set. Right? There's really no difference between those two things. 
About ninety percent of all the queries (that's different from the bytes, right, because the size of the response to a query may be very different) in fact go to the CDN; it's, you know, static raster images. Right? So most of the bytes come back that way, but also most of the queries are made over to Akamai. And I talked to somebody at Akamai about this, talking about the query pattern and the number of queries that go to different places, and they think that it's unbalanced as well. And I was pretty happy to hear that they had thought about it and realized that having five percent of the information come directly from the web provider, wherever they host, right, while 95% comes from them and is basically optional, at least to some extent, right, is very unbalanced. So, APIs. We all have APIs. I work for PayPal. We have tons of APIs. Some of them are easier to use than others. I think that's basically true for everybody's APIs. It kind of depends on the age of the API and who's maintaining it and things like that, right, but we all have these. And if you work for a company that doesn't have an API, you know, I'm kind of surprised, because it seems like they're growing out of the woodwork. Everybody has one. We want to think about APIs as secondary queries embedded in the web page. They're usually dynamic. The URI functions as a selection mechanism. In our previous slide I was showing a query for a movie, and that query might return some listed movies that are actually showing in your area, right? If you type some movie name, Pacific Rim or something, into Google, what you'll get back is, you know, a list of all the movies that are playing in your area at some time. It's a selection mechanism. Almost always, APIs are user-activated and intransitive. There are a couple of cases where this is not the case. 
If you think about embedding a Google map into your web page, there's a query that's made to actually display that map on your web page that doesn't involve any user interaction. That's a transitive link, right? You make a call out, Google sends you a map, and then the user can interact with it. At that point we're actually interacting with those APIs through user events. So that's one case where those are really transitive API queries. But in most cases they're user-activated. And I put all of these folks here just to give you an idea of the number of people who actually have these public APIs out there. The other thing to notice is that my beloved employer is here. This is the only slide that actually has PayPal outside the copyright. But I felt obliged to put their logo somewhere. I wanted to make sure that it was displayed prominently. The point that I want to make here is that APIs need to be treated as subqueries. So what does the syntax of those queries look like? Let's say you have a REST API. Take Netflix; my esteemed former colleague Adrian Cockcroft, I think, gave the keynote to this conference a couple of days ago. Netflix claims to have a REST API. What do those queries look like? Does REST impose a specific query syntax on WWWDB? Are there queries that we can't make in a REST syntax? Does it work really well? Let's talk about REST as a query syntax mechanism. So we're all familiar with our little triangle here about how REST works, right? We have a certain number of verbs. These are constrained to HTTP: GET, PUT, et cetera. We have some nouns, which are the actual resources being addressed. This is something like, you know, employee 12345 at company X. And then we have some representation for that. It's going to come back in XML, it's going to come back in JSON, and so forth. This is basically the description of REST that we give to people these days. Whether that was what Roy intended, I don't know, but that's what we explain to people. 
And in fact that particular picture comes to us courtesy of IBM's training course on REST APIs. The important point for our thinking about WWWDB is that what we're really doing is providing context-based queries for specific resources. And that object is in a specific state. And we see this a lot at PayPal, actually. If you do a transaction with PayPal, and remember, this is a single transaction, you're asked to come in and log in. Now we've had a change of state. Now we transfer that change of state to the next service, which actually shows you: oh, my purchase is going to be five dollars, and I'm going to use, you know, your AmEx Platinum card, should you actually have one, to pay for it. And so now we've gone through that state change. Then the user clicks on something that says, yes, go ahead, you know, charge my card, take my money, pay for this object, and that's a third state change. And as we go through those specific changes, we record that in the state of some document somewhere. When we talk about this, REST really demands NoSQL. There's no way that you can do this in a stateful way, and in fact if you talk to any of the REST heads, they're all very much: we don't manage state. It's about state transfer. If you're going to authenticate, you need to authenticate on each and every request. There have been big arguments, I know, with the Amazon guys about authenticating into their CDN, CloudFront, which we mentioned earlier, and people have been very unhappy to find out that they need to send the authentication string with every request, because it's RESTful. It doesn't maintain state, right? It demands NoSQL, right? Due to state constraints. We haven't really thought about it that way, right? I see this gentleman sort of reacting here, and he's like: I hadn't thought about it that way, but it sounds like it might be right. We use query strings in our queries, right? In our URIs. Now remember how the URI is set up, right? 
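The authenticate-on-every-request consequence of statelessness can be sketched with a toy signing scheme. This is in the spirit of, but far simpler than, real schemes like the AWS request signing mentioned above; the secret key and the request fields here are entirely made up.

```python
# A sketch of stateless authentication: no session is ever established, so
# every single request must carry its own credentials. The signing scheme
# below is hypothetical, not any real provider's protocol.
import hashlib
import hmac

SECRET = b"demo-secret-key"  # hypothetical shared secret

def sign_request(method, uri, body=b""):
    """Compute the auth string that must accompany this one request.
    The next request, even an identical one, must be signed again."""
    msg = method.encode() + b"\n" + uri.encode() + b"\n" + body
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

# Two identical requests carry identical, independently computed credentials.
sig1 = sign_request("GET", "/payments/123")
sig2 = sign_request("GET", "/payments/123")
print(sig1 == sig2)  # deterministic per request, no server-side state
# The verb is part of the query, so changing it changes the signature.
print(sig1 == sign_request("POST", "/payments/123"))
```

The point is the shape, not the crypto: because nothing is remembered between requests, the credential travels with each query, which is exactly what upset people about signing every CloudFront request.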
We've got some protocol information, we've got some domain information, then we've got a path, and then we've got a query string, right? Now, the actual path is in some sense a query string in itself. If we look here, right, this is in itself a REST query. You're querying, you know, employees of example.com, and you want just the one, right? One, two, three, four, five. So that's a query. Now, you could easily refashion that with a query string, right? You could call, you know, employees, put a question mark here, and put, you know, employee ID equals one, two, three, four, five, right? So all you're doing is taking your query out of the path part of the URI and putting it into the query string. And there's been a lot of discussion about, you know, is it still RESTful if you use a query string rather than a path? The answer is yes. Those two things are semantically exactly the same. But for our purposes, what we really use the query strings for is range searches, right? So instead of going employee one, two, three, four, five, if I want to get all the employees whose numbers start with one, two, three, then I'm much better off using a query string for that range search than trying to make multiple REST calls based on a path syntax like that. Does everybody kind of see that? I'd have to make multiple queries using the path URIs, where I can do a range search rather easily with the query string. This is a point that was actually brought up by the people who built AWS; there have been several talks by them where they brought this point out. A lot of times REST leaves us in the position of making a lot more queries than we actually want to make. The better path for you, when you're querying WWWDB, is to use a query string if you want to do a range search. They're semantically the same; nobody can come up and beat up on you, saying, oh, that's not RESTful, right? 
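The two equivalent forms, and the range search that the query-string form makes cheap, might look like this. The URLs, the `employee_id_prefix` parameter, and the employee list are all invented for illustration; no real API is being described.

```python
# Path form vs. query-string form of the same REST query, plus a range
# search. Hypothetical URLs and parameters throughout.
from urllib.parse import parse_qs, urlsplit

path_form  = "https://example.com/employees/12345"
query_form = "https://example.com/employees?employee_id=12345"

# Both forms select the same resource; only where the selection lives differs.
path_id  = urlsplit(path_form).path.rsplit("/", 1)[-1]
query_id = parse_qs(urlsplit(query_form).query)["employee_id"][0]
print(path_id == query_id)

# The range search: one query string instead of N path-style requests.
range_form = "https://example.com/employees?employee_id_prefix=123"
prefix = parse_qs(urlsplit(range_form).query)["employee_id_prefix"][0]
employees = ["12301", "12345", "45678"]  # pretend server-side data
print([e for e in employees if e.startswith(prefix)])  # one query, many rows
```

With the path syntax, selecting everyone whose number starts with 123 would mean one request per employee; the query string expresses the whole range in a single query.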
It's still RESTful, it's semantically exactly the same, but you don't have to make multiple requests, right? Because the address of that data never changed. And this in some sense exposes some of the thinking around REST that might not be all that helpful, which is that this is no longer an address but a query. If we want to divide our URIs into an address part and a query part, use a query string. But the confusion between address and query kind of makes it a little difficult sometimes to design a good REST API and a good way of querying the WWWDB. We'll talk a little bit more about REST, and I really want to talk about how REST and the CAP theorem play together and whether the CAP theorem applies to all this. But before that, we have to talk about our friends in Mountain View who are busily indexing the web for us, right? Every good data system needs an index, right? NoSQL or otherwise. The Google guys provide this for us; so do Bing and Yahoo. This is why they call it a reverse-index search; there's no real mystery here. A good question, though: is Google a query cache or a data cache? Remember the Oracle analogy: Oracle has a query cache and a data cache, right? So is Google a query cache or a data cache? All right, who said that? No. So think about this scenario: I'm too lazy to type in www.netflix.com, so I just type Netflix into the browser bar. The browser bar sends me to Google, and Google has a nice link there to www.netflix.com, right? In that case I've pumped in a query, and Google is acting as a query cache, right? In other cases it might act as a data cache, right? And so whoever said yes over there was exactly right: sometimes it's one, sometimes it's the other, and sometimes it's actually both, right? Because sometimes, if you type in, oh, say, backpack into Google, you'll get a whole bunch of images of backpacks, right? So now it's delivered some data; that query has actually resulted in not a list of URLs but some actual data about backpacks. 
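That dual behavior could be modeled as two tiny caches in front of one search function. This is a toy, of course; the entries and the routing rule are invented, not how any real search engine works.

```python
# A toy model of the index discussion: the same service can act as a query
# cache (query -> the URL you meant) and as a data cache (query -> actual
# data, like the backpack images). All entries here are made up.
query_cache = {"netflix": "https://www.netflix.com/"}  # query routing
data_cache = {"backpack": ["img1.jpg", "img2.jpg"]}    # actual data

def search(q):
    """Answer a query from the data cache if we can, else route it."""
    if q in data_cache:       # acting as a data cache: return real rows
        return ("data", data_cache[q])
    if q in query_cache:      # acting as a query cache: return the address
        return ("redirect", query_cache[q])
    return ("miss", None)

print(search("netflix"))   # query cache: you get the fully qualified URL
print(search("backpack"))  # data cache: you get data, not a list of links
```

The "netflix" case is the lazy-typing scenario above; the "backpack" case is the one where the result set is data rather than blue links.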
So Google is both a query cache and a data cache, and it provides secondary query routing; that's what happened when I typed in Netflix and it gave me back the fully qualified URL. And then we have these alternative query indices. When I showed this slide to somebody, they said, oh, but what about Wolfram Alpha? Doesn't that do queries? Doesn't that return information? But that's not Google; that's not really an index onto the WWWDB, is it? So I did think about that a little bit. What role does Wolfram Alpha really play? I mean, I love Wolfram Alpha; I'm always pumping some question into it. And in my thinking about it, that was what turned out to be key: Wolfram Alpha returns information. I never get a list of links back from Wolfram Alpha, right? With Wolfram Alpha, I made a query and I got data back. It was a straightforward data query. They don't even decorate it very much; I might get a graph or a plot of something if I ask that kind of question, but it's basically a very simple set of data queries. And we've got some others of those: IndexMundi, which provides information about countries around the world, and Twitter in some sense, if you think about it. Are these guys here from Twitter? Anybody here from Twitter? I'm not going to say anything bad about you, don't worry. So the Twitter guys are well aware that they're actually back there running a reference database, because people go back into Twitter and search for things all the time. Somebody said this yesterday and somebody said that yesterday; we've got the police and the governments out there actually tweeting, you know, things as they happen. They too are providing an alternative query index onto the web. Their query index indexes people and data, right? Anybody here not have a Twitter handle? Okay, one person. All right, we're signing you up. But seriously, Twitter is actually an index on WWWDB, but it's onto the people rather than the data, so it's another different kind of query, right? 
So: Wolfram Alpha is a query for data, and IndexMundi is also a data query system. Twitter is actually an index onto a list of people like you and me, and it allows us to communicate with one another, and then it has some database back there where they store not all but most of it. Does the CAP theorem apply? So, everybody is tired, after three days of a NoSQL conference, of hearing about the CAP theorem. Anybody not absolutely tired of hearing about the CAP theorem? I think it's come up in every talk that I've been to here at the conference, so I didn't want to feel left out, so I have a couple slides on the CAP theorem as well. It actually plays a role here, and it's really important. Does the CAP theorem apply to WWWDB? Can we talk about partition, consistency, availability? Yes, but only partially, and we'll get to that in a little bit. So we can see, I think, that partition and availability in our 404 is pretty straightforward: an example partition of the database. Oh, it's unavailable. We've got a problem. DDoS attacks: these are sort of a vicious kind of query. Something is unavailable to me because so many other people are querying it right now, maybe in a spurious or harmful way, that I can't get at it. So those are pretty simple examples of the CAP theorem applying. Obviously, WWWDB relaxes the consistency constraint. Nobody even tries to be consistent. Right? The data that you got from the New York Times yesterday (remember, it's the same URL, www.nyt.com) produces entirely different results today, and nobody even really intends for it to be consistent. If you go to paypal.com from Brazil and you go to paypal.com from the US, you get two different pages. Right? Now, somewhere under there, buried beneath all the presentation, one hopes that the essential data characteristics of those pages are very similar. There should be some place to log in. There should be, you know, some data like that. But otherwise those queries are not the same. 
We don't even try to be consistent. We accept inconsistent queries, broken links. What do we get out of it? Well, it works pretty well. At least well enough that a lot of people have gotten rich and we can all go to big conferences like this and talk about it. Right? But it doesn't work nearly as well as it could. Anybody want to get rid of 404s? I'll propose a way to get rid of 404s at the end of this talk, and we would never have any broken links. Who thinks that's cool? I think that would be amazing. No more 404s, no more broken links. Anybody in site ops here? Okay, site ops people will love this. Right? We make that trade-off for real-time availability, being able to update things, but we can do a lot better. And a lot of the rest of the talk is about some of the drawbacks of the way that we do things now and how to fix them. So remember our three legs, our three-legged stool, right? This is how we fix the URI leg. First I wanted to point out just some drawbacks of the way the CAP theorem gets applied to the WWWDB. The first one is that all the data is not cached everywhere, right? So remember our query pattern again: you go to the base page, you get some control information, your browser processes that, and you get some data back from the CDN, right? So that base page is often not distributed. It's served directly from the servers of the company that provided it. PayPal.com's base page comes out of servers that we own and control. They're located in some specific locations and no place else; we don't distribute that page via Akamai or anything else. But the page that comes back to the user, that sits in their browser staring at them, is sort of a mixed query. Some of that data is distributed all over the world by Akamai or whoever it is we're using for a CDN provider. And some of it comes directly from a single source. So we have sort of a mixed distribution model here. Some of the data is distributed, some of it's not. 
And for the things that are not distributed, the CAP theorem clearly doesn't apply, right? If you only have, and I'm not suggesting anybody does this, but if you only have one point of presence, it's not really useful to talk about partition tolerance, because there's nothing else from which to be parted, right? And there's no point in talking about consistency, because there's nothing else to be consistent with. So the CAP theorem only applies partially when you have a partial caching model. I wanted to bring that point out. Consistency only applies to part of the queries. And if we think back to which parts of the query are distributed, remember I brought that up earlier and I said there's a really important point: only 95% of the data comes from the CDN, and the rest comes from the actual base page. This non-cached part is the intransitive hyperlink part, right? That's the part that we would actually like to be distributed. The whole thing is sort of upside down. The data that we don't care about, the optional part, primarily for presentation purposes, is most widely cached and distributed. Akamai does a much better job than you do of delivering data to the user, right? I don't care who you are; they do a lot better job. That's what they're in business to do, right? If they didn't do a better job, you wouldn't be using them. It's pretty simple a priori logic. They basically distribute most of that data, but they don't distribute the part that we would like to have distributed. That's one of the drawbacks. How can we improve things? Well, first off, who thinks their browser is a great data client? Anybody here from the browser manufacturers? I'm going to offend them. Okay, no browser manufacturers. Browsers suck as data clients, okay? Every single browser sucks at this, right? And I'm not talking about wanting to go back to seeing a result set with fixed-width fonts on some bright green monitor.
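The mixed distribution model can be made concrete with a small sketch. The hostnames below are invented for illustration: given the resource URLs that make up a rendered page, split them into the CDN-cached part and the origin-served part. The base page, the part carrying the intransitive links, is the one that comes from a single origin.

```python
from urllib.parse import urlparse

# Sketch (hypothetical hostnames): partition a page's resource URLs into
# the CDN-cached part and the origin-served part, mirroring the "mixed
# distribution model" described above.
CDN_HOSTS = {"cdn.example.akamai.net"}

def split_by_distribution(resource_urls):
    cached, origin_only = [], []
    for url in resource_urls:
        host = urlparse(url).netloc
        (cached if host in CDN_HOSTS else origin_only).append(url)
    return cached, origin_only

cached, origin_only = split_by_distribution([
    "https://cdn.example.akamai.net/logo.png",   # distributed worldwide
    "https://cdn.example.akamai.net/style.css",  # distributed worldwide
    "https://www.example.com/",                  # base page: one origin
])
assert len(cached) == 2
assert origin_only == ["https://www.example.com/"]
```

The irony the talk points out falls straight out of the split: the presentational bulk lands in `cached`, while the one URL we would most like distributed lands in `origin_only`.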
That's not the point. They do not do a great job as data clients, and part of the reason for that is that they're concerned with presentation. We've got HTML5, a brand new shiny spec. I worked on it, so disclaimer, right? I love HTML5, but I don't pretend it makes the web work any better, and I don't pretend that it makes it any faster. It makes some things simpler. Web storage makes it really easy to store stuff on the local disk. That's nice. It's not going to get rid of cookies. I hear people say, oh, it's going to get rid of cookies, we don't need to manage state on the web anymore because now we have web storage. And I just sort of giggle, right? It's not going to happen. As much as we detest cookies, we're not going to get rid of them any time soon, because the semantics that have arisen around them don't apply to web storage. There's nothing in web storage that says this data has to expire, or that deals with cross-domain rules, or any of the semantics that have accreted onto the cookie concept. WebSockets? WebSockets are just another way of querying wwwdb. They do not fundamentally change the underlying situation, right? How do we make these things better? We make the caching and the distribution better. I'll go through the rest of the slides, but let me finish talking about where I was: why data clients are so bad, how they can do a better job, how we can make the caching and the distribution of our data a little more effective. So let me ask, and without asking who you work for, does anybody actually have their stuff all in one data center? See, I got my hand up too. You don't know where I work. Right? So people do this, and it defeats the whole idea of being able to cache this stuff, right? Your users do not get a great deal. No matter where that data center is, point to someplace else, and those people get a bad deal, right?
We suffer from this too; we're working on it, and a lot of other people are doing so as well. We could do a much better job of distributing that content by distributing the intransitive links, those queries, and to do that we basically have to distribute the machinery that answers and responds to those queries. That means putting your company's machines in data centers closer to the user, in multiple locations closer to the user. We can't really rely on the CDNs to do this for us, which is kind of what we do now, right? We throw the static stuff over to Akamai from the origin server somewhere, and the dynamic content, which is what we'd really like to have cached, never makes it. How can we distribute that dynamic content? The way is to have multiple data centers in different locations, right? It's all very straightforward. Reforming the URI system is really key to this. If we can't make intelligent and efficient queries onto wwwdb, we're not going to be able to do very well. And the way to do this is twofold. First, and I just want to say this outright, this is going to be a little bit of a shocker for some people: the web maintains state. Let's just build state into it, rather than trying to pretend that we're stateless and then having to go and add all these things, cookies and all the other mechanisms we have for maintaining state. If we could do that, we could have stateful queries, and we could manage our URI queries much better simply by managing state. If we added state to our queries onto wwwdb, we would be able to maintain partial data sets on the user's machine. We could maintain partial data sets at the CDN nodes and ask for those selectively, based on the context and state of the user's queries. There are a number of things that we could do by adding state. Another thing that we can do, and this is sort of being done, is simply to add more storage at the level of the user client.
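What stateful queries could buy us can be sketched in a few lines. This is a speculative illustration, not anything the talk specifies; all names are hypothetical. If a query carried explicit client state, the server could return only the part of the result set the client does not already hold, which is exactly the "partial data sets" idea.

```python
# Speculative sketch (hypothetical names): a stateful query returns only
# the delta between the full result set and what the client already holds.
DATASET = {"header": "...", "nav": "...", "article": "today's story"}

def stateful_query(client_state):
    """client_state is the set of keys the client declares it already
    has cached; return only the part of the result set it lacks."""
    return {k: v for k, v in DATASET.items() if k not in client_state}

first = stateful_query(set())               # cold cache: full result set
delta = stateful_query({"header", "nav"})   # warm cache: just the delta
assert set(first) == {"header", "nav", "article"}
assert set(delta) == {"article"}
```

The same delta logic could sit at a CDN node as easily as at the origin, which is what makes explicit state attractive for distribution.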
Web storage: the reason that I brought web storage up earlier in my talk is that that's exactly what we're doing. All right, I think I've covered most of this part. I wanted to cover RDF really quickly because it always comes up in this context. People ask me about RDF, about the Semantic Web. Isn't it going to solve all of our problems? No, it's not. What the Semantic Web really does is allow us to address things by what they are, what they mean, as opposed to some generic address, right? And that's very helpful. It allows us to improve queries. It also slows things down, right? Nobody ever accused it of being really fast. Let me get back up here; of course now I have to go back through all the slides. SPARQL is a language for querying RDF, and it removes some of the query limitations that we had, but it's also not very fast. And nobody ever really intended for it to be very fast, or to be the real query language for wwwdb. My claim is that cloud computing is going to do more to improve the overall distribution of content on the web than any amount of semanticizing, romanticizing maybe, with RDF and the Semantic Web. I mean, I love the W3C, I love Tim, I've known him for more than 20 years, but this is an area where we strongly disagree. As I said earlier, browsers kind of suck as data clients. They're really built around presentation. Most of HTML itself is built around presentation, designed for browsing, not really thinking about querying. And of course, they're bedeviled with an endless amount of legacy issues. If you talk to the Microsoft guys, they'll tell you 85% of the code in IE is there to account for people's bad HTML, right? Not to format the pages, but just to deal with the errors in the HTML. REST does not mean fast. Okay, here I'm committing sacrilege again. Everybody is RESTful, right? It's a big thing in the news these days. But REST is not fast.
A semantic means of accessing things does not improve performance. Once again, if you use a domain model, you can limit the number of queries, but you may make some unnecessary requests. Enhanced query string semantics would allow for joins and arbitrary comparisons. Recognize that some queries require state, and just deal with it; go with the fact that it's stateful. Distributing the intransitive queries more widely would also help. Reform hypertext: enlarge the number of link types. Why do we only have one kind of link? This is what I was talking about when I mentioned fixing the 404s earlier. The trick to fixing 404s is to implement bidirectional links, right? A link actually has to have something at each end, but our links are only tagged at one end. If they were tagged at both ends, the browser could simply grey them out if the other end didn't exist. No more 404s, right? This has all been proposed. It's part of the XLink proposal at the W3C, which has been a recommendation now for nearly 10 years. Nobody's implemented it in the browser, and I'm really frustrated, because hypertext has not matured in the slightest since 1990, not one bit. I want to see some evolution in the hypertext space. Distinguish transitive and intransitive links, add bidirectional linking, enhance the semantics of the query string. Currently the RFC basically says you can put whatever you want in the query string, including pictures of Bugs Bunny. Right? I would like to see much better semantics around the query strings in URLs. Leaving it open like that leaves an open box that anybody can put anything in, and most people put something really bad in there, right? Much better to constrain the semantics to something sensible, not leave queries open-ended in the way that they are now by the spec. Anybody ever try to use hypertext on your phone? I mean seriously, right?
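Staying with the bidirectional-link point for a moment, the "grey them out" behavior can be sketched as a tiny rendering pass. This is an illustration of the idea, not any real browser API; `target_exists` stands in for whatever existence check a link database with both ends tagged would provide.

```python
# Sketch (hypothetical API): with links tagged at both ends, a browser
# could verify the far end before rendering and grey out dead links
# instead of letting the user walk into a 404.
def render_links(links, target_exists):
    rendered = []
    for link in links:
        state = "active" if target_exists(link) else "greyed-out"
        rendered.append((link, state))
    return rendered

live = {"https://example.com/a"}  # toy stand-in for a link database
out = render_links(
    ["https://example.com/a", "https://example.com/gone"],
    target_exists=lambda url: url in live,
)
assert out == [("https://example.com/a", "active"),
               ("https://example.com/gone", "greyed-out")]
```

The dead link never becomes a failed navigation; it simply renders inert, which is the "no more 404s" experience the talk is after.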
You know that part where you're trying to scroll, and you clicked on a link, and you went off over here, and then you had to go back and come back. Hypertext didn't work on your mobile device, right? Embedding those links inside the content just doesn't work very well. And if you think about it, a lot of the work that we do on the web is about hiding hypertext, right? We put all these nice buttons over here, and we've got the menus across the top, and all that stuff is there just to hide the hypertext from you. This brings into question whether hypertext is really a good idea. Has it gone past its sell-by date? Can we fix it? IPv6 and query routing: the IPv6 space is large enough for us to enable any number of query schemes, right? The IPv6 space is so large that we could give every page on the internet its own IPv6 address and never run short of addresses. Would you be able to find things through some DNS-like mechanism? I don't know. There's a proposal out there, an RFC out there for that right now. I did not write it, but it seems like an interesting idea. You could conceivably partition the IPv6 namespace in such a way that it addressed particular kinds of things, right? All the queries about people are in this range; all the queries about products are in that range. I'm throwing out some speculative ideas here; these aren't necessarily things anybody's working on. Scaling: every system has a scaling limit. Nobody really knows what the web's scaling limit is right now, but we know it's out there, unless we've somehow managed to fix the laws of physics. We know that the web has a scaling limit, and we may find it by finding the first resource that runs short. That's typically how we do these things in machines: it's the memory or the CPU or whatever resource we had the least of. It may be that the thing we have the least of on dub-dub-dub is good URIs. And that may be the place where the scaling work needs to start.
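The namespace-partitioning thought experiment is easy to make concrete with the standard library. The prefixes below are documentation addresses invented for illustration; the talk only floats the idea, it does not propose specific ranges.

```python
import ipaddress

# Speculative sketch of the talk's thought experiment: carve the IPv6
# space into ranges per kind of resource. The prefixes are made up for
# illustration (RFC 3849 documentation space); even a single /48 holds
# 2**80 addresses, more than enough for "every page on the internet".
PARTITIONS = {
    "people":   ipaddress.ip_network("2001:db8:1::/48"),
    "products": ipaddress.ip_network("2001:db8:2::/48"),
}

def classify(addr):
    """Route a query address to the kind of thing it is about."""
    ip = ipaddress.ip_address(addr)
    for kind, net in PARTITIONS.items():
        if ip in net:
            return kind
    return "unpartitioned"

assert classify("2001:db8:1::42") == "people"
assert classify("2001:db8:2::7") == "products"
assert ipaddress.ip_network("2001:db8:1::/48").num_addresses == 2 ** 80
```

Routing by prefix like this would happen in the network rather than in an application lookup, which is what makes the query-routing angle interesting.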
The semantic mapping, RDF and so forth, is very complex, and it requires a lot more work in the construction of the content than most people are ever going to be willing to do, honestly. Explicit state management would make this a lot more efficient. Final thoughts. Is anybody not convinced that the web is a big distributed data system at this point, after an hour of listening to me talk? Anybody not convinced? I'll be standing right out here; we'll talk about it. A URI addresses a result set of a NoSQL query. We have these two kinds of hyperlinks. We can add power and simplicity to our queries by reforming the URI syntax, making some changes. And there's a lot of evolution around HTTP and HTML at this point. Why is there no evolution in the URI space? That's where we really need to go. So that's it. Thanks, everybody, for sitting through the extra little bit of time.