So, hi, my name is Greg Lindahl and I'm the CTO at blekko, which is a new search engine. You might have heard of us. How many of you use search engines? This is sort of a funny thing to ask because, of course, everyone uses search engines. And since you're here to hear about NoSQL, there's, of course, a big connection here.

So how hard is it to build a new search engine? Oh, it can't be hard, Google did it, right? Google was a NoSQL pioneer because they had to be to get the job done. When they first sat down to index the internet, they realized: wait, this dataset doesn't fit onto one computer. It has to fit onto a cluster, and we're cheap. And so they went out and figured out how to do it, and out of that work came some research papers which inspired the Hadoop guys. So really, the entire NoSQL community is built on top of Google building their search engine. So how hard could it be today? Well, the internet is a lot bigger than it was back then, so it's a very different prospect to sit down and say: okay, we're going to have a startup, and we're going to go out and build a search engine that produces good enough results to survive.

So what are the requirements? Once you launch to the public, you need a big enough index to give interesting answers. If you're a general-purpose search engine, you want people to come and ask you all kinds of different questions. So five years ago when we started, we said two billion web pages is roughly the size of crawl and index we need to have on launch day. And that's more than one petabyte of storage.

There are two clusters involved. One cluster does crawling and indexing, and it's about 300 fairly beefy servers. The other cluster serves answers. When people type a query into the engine, they expect an answer back, and they expect it quickly. So you have two choices: you can have your index sitting on solid state disk, which five years ago didn't really exist yet, but we knew it was coming; or you can buy a lot of servers with a lot of RAM and put your index in RAM. And there's a big trade-off here. If you serve from flash disk, you have less total query capacity, because SSD is a lot slower than RAM. But the good news is it's a lot cheaper. So 300 machines with SSDs in them is the size of the serving cluster we figured we had to have for a 50 terabyte index at launch. If instead I wanted to serve answers from RAM, I would have ten times the capacity, which is more capacity than we needed, and it would cost ten times as much. So it was a no-brainer in the beginning to bet on this flash disk future.

And the nice thing about this business is that the hard part of building a search engine is not buying the clusters or crawling and indexing. The hard part is building a good product with good relevance. As far as the actual serving goes, the cluster I outlined with the 600 machines costs about $5 million to run for a couple of years. But if it were running at capacity, serving out as many queries as it could to people who live in the US, it would bring in $20 million a year of ad revenue. So while search engines are a capital-intensive business, they easily pay for themselves in terms of the hardware you have to buy. The hard part is having humans to efficiently implement all the algorithms that you need.
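To make those numbers concrete, here's a quick back-of-the-envelope script using just the figures above; the per-server cost is my derivation, not a number from the talk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Launch-day economics as stated in the talk; a sanity check, not
# actual blekko financials.
my $servers        = 600;           # 300 crawl/index + 300 serving
my $cost_two_years = 5_000_000;     # both clusters, run for ~2 years
my $revenue_year   = 20_000_000;    # US ad revenue at full capacity

printf "cost per server per year: \$%.0f\n",
       $cost_two_years / $servers / 2;          # about $4,167
printf "revenue vs. cost over two years: %.0fx\n",
       $revenue_year * 2 / $cost_two_years;     # 8x
```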
So you'll see people writing articles where they say, yeah, it's very difficult to break into search. You have to have 100,000 servers. You have to do this, you have to do that. You need billions of dollars in order to launch a new search engine. And we don't think that's true. We're attempting to prove that point.

So five years ago, we had to decide if we wanted to build or buy. Back then, it was obvious that a lot of the NoSQL projects were not only producing real value for people, but were on a sustainable trajectory. But we have unusual requirements. We wanted to do everything in the same database: crawling, indexing, and serving. And serving is a near real time operation, right? You want your 50th percentile response time to be under 400 milliseconds so people don't get annoyed and go off and use Google instead. You have to have a lot of uptime. And here you are, serving out of your NoSQL database. The typical NoSQL deployment five years ago was a batch operation; almost nobody ran them in real time. With a batch system it doesn't really matter if you have downtime, so people didn't work very hard on it. But we wanted four nines or more, hopefully. And we knew we had to use flash disks efficiently for cost reasons. So the bottom line was that if we took one of the existing NoSQL systems and tried to adapt it to our use, we'd end up adding a lot of code, because we're trying to do a fairly extreme thing with it. And the code we'd want to add would probably be strange to the rest of the community. If most of the community using a NoSQL database is batch oriented, then being the first people who want to use it in a near real time fashion means you have a conflict: you're adding features that other people don't see much value in. It can be challenging in an open source project to deal with that.

And so after thinking about all these different things, I came up with a couple of rules which I modestly title Lindahl's laws. They outline the two mistakes you can make when you're in this position. The first mistake you can make is to write your own database. With all this NoSQL activity, people have been working on all the different parts of the design space available for databases, and even if they haven't done exactly what you want, there's something that's pretty close. So you should just take that source code and run with it. That's one mistake you could make. The other mistake is the flip side of that: if you're trying to adapt an open source database to an extreme environment that's different from how everybody else wants to use it, that's also a mistake, and therefore you should write your own. So both ways of doing it are a mistake, and there's no right answer. We chose to write our own.

The second decision a company like us has to figure out is: to cloud or not to cloud? A lot of people these days don't run anything on their own servers. Their developers have laptops, they have servers at Amazon, and they're totally happy with that. And it can be very cost effective. One thing Amazon does really nicely is that if the amount of compute power I need waxes and wanes, I can add servers, I can delete servers, I can scale up and down, and life is good; I pay the minimum amount of money. That doesn't really work for a search engine. There are four reasons the cloud doesn't do it for us. The first is that when you're running servers in the cloud, only certain ratios of CPU to memory to local disk are available.
For example, it's hard to go out and rent a server that has 20 terabytes of local disk; people don't rent servers that look like that. And as it turns out, for a crawling and indexing cluster, it's really important to have a lot of spindles available, and you also want the capacity to hold the whole crawl.

The second problem is that even to this day, it's hard to rent something in the cloud that has solid state disk. What, three weeks ago Amazon for the first time started renting servers with solid state disks? But there's only one server type. Maybe that's the one they like the best, but nonetheless it's only one server type, so it restricts the available ratios of CPU to memory to disk even further. And it's very expensive, probably because most of the people running on these SSD machines are wearing them out pretty rapidly. Amazon doesn't charge you for how much you write to the SSD, so people are going to be lazy, the SSDs are going to wear out, and Amazon has to buy new ones. So they charge a fair amount of money for the one server type that has SSDs.

The third weird thing about search engines is that they actually don't ramp up and down quickly in capacity. There's about a 2x difference between daytime and nighttime in the number of queries you serve. However, the minimum size of the serving cluster is 300 machines, because the index has to fit on the SSDs. It doesn't matter if you have one user, 10,000 users, or 100,000 users; the minimum size is still 300, and therefore you can't ramp down. And if you want to ramp up and you've got a machine with a lot of local disk, it takes a very long time to copy 20 terabytes of data onto a brand new server (see the quick calculation below). So you're really stuck: you can't go up and down, you can't do the flexibility thing that helps make cloud services so cost effective.

And the fourth point is that there's an economy of scale to the cloud. Everybody knows that Amazon owns hundreds of thousands of servers, and clearly they have an enormous economy of scale. But if you look at their pricing, that goes to their bottom line; they make good money on cloud services, as they should. Whereas if I go out and buy 100 servers, 500 servers, 1,000 servers, eventually my economy of scale is going to be better than Amazon's. We knew on launch day we were going to have 600 servers (we actually had 768), so we were going to have a lot of servers. And we think the break-even between Amazon's efficiency and our efficiency is somewhere around 100 servers. So we were going to be way beyond that point at launch.

Which brings me to the final comment, and this is proven by the sad experience of some companies: if you plan on outgrowing the cloud and eventually having your own data center, it's better to do that earlier rather than later. There's one big example, and I won't name a name, of a company which had been using a managed data center, where somebody else came in and replaced all the hardware for them. They got to be big, built their own brand new data center, attempted to move in, and failed to move in. They ended up getting more managed servers, because they tried to make a leap that went too far: from getting a lot of services from other people to doing all of it with an in-house organization. They didn't leave enough time to spin that up.
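To put the ramp-up problem in rough numbers, here's the calculation promised above. The 20 terabytes comes from the talk; the 1 Gbit/s effective copy rate is my assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $disk_tb    = 20;   # local disk on one node, from the talk
my $gbit_per_s = 1;    # assumed effective network copy rate

# 1 TB = 8,000 gigabits (decimal units, as disk vendors use)
my $seconds = $disk_tb * 8_000 / $gbit_per_s;
printf "%.0f hours to fill one brand new server\n", $seconds / 3600;  # ~44
```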
And so we started on day one not being in the cloud, and actually we've never been in the cloud with any of our servers. Today we have 1,500 servers, and if we get to the size where we're a viable company that's cash flow positive, we'll probably be on the order of 15,000 servers. That's certainly a number you really don't want to rent from a cloud provider, because you can do it significantly cheaper in-house.

We needed a good SSD layer. It was very interesting, actually: the speaker before me, talking about Cassandra, mentioned some recent work they had done with an SSD layer, which sounds very promising. Here are our requirements. A good SSD layer should have the exact same semantics as the database it's caching. That means your poor programmers don't have to write additional code in order to use this cache. Next, in order to be cost effective: we don't use RAID for our disks. Like most people who have petabytes of storage, we keep three copies of each piece of data; that's more cost effective and operationally efficient for us. But if we have three copies on disk, we don't want to have three copies on the SSD. We prefer to have one copy on the SSD, which effectively triples its size. That's very important for cost effectiveness. Also, SSDs have a limited number of writes you can make to them. So in order to minimize the amount of writing we do when we update the data that's on the SSD, we actually hold the update in memory. We also write it to disk. And then every once in a while, which for us is once a day, in order to prevent that memory table from getting too big, we load the data off of disk onto the SSD again, throw away the memory table, and start over from scratch. Because our data on the SSD changes slowly enough, we can do this once a day; you could do it hourly if you needed to. So you get three benefits out of this layer: it's easy to program, because it has the same semantics; it's cost effective, for the middle two reasons; and finally, it has a very low write rate. If you look at SSDs that we've had in production for two years now, less than 5% of their total write lifetime has been used up. So these disks will last essentially forever in our environment; basically, they'll be obsolete long before they wear out.
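Here's a minimal sketch of how that layer behaves, with in-memory hashes standing in for the real storage. All the names are hypothetical, since blekko's implementation isn't public.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy model of the SSD cache layer: three copies on spinning disk
# (modeled as one hash here), one bulk copy on SSD, and updates since
# the last rebuild held in a RAM table. Reads merge SSD and RAM, so
# callers see exactly the database's semantics.
my (%disk, %ssd, %memtable);

sub write_cell {
    my ($key, $value) = @_;
    $disk{$key}     = $value;   # really three replicas on disk
    $memtable{$key} = $value;   # buffered in RAM, never written to SSD
}

sub read_cell {
    my ($key) = @_;
    return exists $memtable{$key} ? $memtable{$key} : $ssd{$key};
}

# Run once a day (hourly would also work): rebuild the SSD image from
# disk in one bulk pass and drop the RAM table. This is what keeps the
# SSD write volume, and therefore the wear, so low.
sub daily_rebuild {
    %ssd      = %disk;
    %memtable = ();
}

write_cell('url:example.com', 'crawl metadata goes here');
daily_rebuild();
print read_cell('url:example.com'), "\n";   # now served from the SSD copy
```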
I've been harping a little bit on programmer efficiency. As I said, the hard part of doing a search engine is sitting down and writing algorithms that produce an interesting product with relevant results for people. So it's very important to make sure that your system allows your programmers to be effective and fast. It helps if your language data types and your database data types have a nice one-to-one mapping, so you're not converting very often. The cache layer should have database semantics; we already talked about that. And the third, unusual point I'd like to make is that in a lot of environments, you can do things in two fashions: you can either run a MapReduce job across all of your data, or you can process the data as it comes in, in a streaming fashion. Normally, in many systems, the code you write for these two cases is different. We would like this code to be the same, for programmer productivity reasons. Our crawler and indexer started as a batch process, but like every other search engine, we went from that to a faster incremental batch process, and now we have streaming indexing. And we knew we were going to walk down that path, and we wanted to not have to rewrite all of our code when we did. And this is an unusual thing in a database system.

So, one feature we have that is unusual: some other NoSQL systems touch on this a little bit, but all of our data is stored as these things we call combinators. A combinator is a remote atomic operation on a database cell. Let's imagine I wanted to count all the hits on my web server, so every time a user comes by and does a search, I add plus one to a cell in a table. In a normal database this is a hotspot; it can actually slow you down. But it doesn't have to be hard, right? All these plus ones are being fired, but I can accumulate them. So let's imagine that every single web server, as users do searches, adds plus one, but accumulates them for 60 seconds before it does that write. If I get one user a second, that's a plus 60 when it actually goes out, so that's one sixtieth of the work to do that update. So instead of having the poor programmer write code that waits before it does that update, we use this combinator system, which automatically holds things and combines them together as much as it can before the result goes to the three places on disk where it's going to be written.

So that one's easy to understand. Imagine, though, that I want to set a value. If I set a value and somebody else tries to set the same cell, who's going to win the race? This is complicated. Also, what if I want to do set-unless? I want to set a value, but only if it was not defined before I did it. That's a similarly racy thing; it's an inverse of set, but it's also a race condition.

And we have a combinator called TopN, which is an ordered list of N things. N is used to keep it a finite size, because frequently I don't want to know every single instance of something. So for example, yahoo.com: we know of 7.8 million webpages that link to yahoo.com. A lot of them are porn websites with enter/exit links. If you hit the exit button, occasionally they send you to Disney, but usually they send you to yahoo.com. I don't know why. So we have a place in our database where we want to be able to quickly look up the most important incoming links to a web page, and we have that as a TopN where N is 2,500. A TopN stores a key, which in this case is the URL of the website that's pointing in; a rank, which is the rank of that URL, used to order the list and figure out who to drop when it fills; and some ride-along data. One nice thing you can do with a TopN is use time for the rank, and then it remembers the N most recent operations. And so we use all kinds of tricks with combinators. I'm afraid this is probably a poor introduction to the concept because it's so short; there are some blog postings, which I mention at the end of this talk, that discuss combinators and their use in search engines a lot more extensively.
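To make the idea a little more concrete, here's a toy sketch of the two combinators just described: a buffered add and a bounded TopN. All the names are illustrative; the real combinators run remotely, inside the database, against all three replicas of a cell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "add" combinator, client side: buffer increments and emit one
# combined +N per minute instead of one write per hit.
my %pending;
my $last_flush = time();

sub comb_add {
    my ($cell, $delta) = @_;
    $pending{$cell} += $delta;
    if (time() - $last_flush >= 60) {
        # In the real system this is a remote atomic add applied to
        # all three replicas; here we just print it.
        print "ADD $_ += $pending{$_}\n" for keys %pending;
        %pending    = ();
        $last_flush = time();
    }
}

# TopN combinator: a list of at most N entries ordered by rank, each
# carrying a key, a rank, and some ride-along data.
sub topn_insert {
    my ($list, $n, $key, $rank, $data) = @_;
    @$list = grep { $_->{key} ne $key } @$list;   # replace an existing key
    push @$list, { key => $key, rank => $rank, data => $data };
    @$list = sort { $b->{rank} <=> $a->{rank} } @$list;
    splice @$list, $n if @$list > $n;             # drop the lowest ranks
}

# The incoming-links case: N = 2,500, key = the linking URL, rank
# orders the list, and the anchor text rides along. Using time() as
# the rank instead would keep the N most recent entries.
my @inlinks;
topn_insert(\@inlinks, 2500, 'http://example.org/', 42, 'anchor text');
comb_add('hits/frontpage', 1);
```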
Another important system for us, in order to be near real time, is the repair daemon. One unusual thing about our NoSQL system is that we feel the poor sysadmin staff should not receive alarming emails all the time. If you imagine a RAID system: when there's a failure in a RAID set, it's a big problem, right? Okay, I had a spare disk in the RAID set, so it's going to do a rebuild. But if I have an additional failure in the RAID set, the RAID set is going to be degraded until a human comes by and sticks a new disk in. And if I have two failures after the first one, then I'm going to lose my data. So that means the poor sysadmin has to run to the data center and swap disks out, and that's no way to run an enterprise like this. If I have 300 nodes and 3,000 disks, that's a lot of disks; failures happen routinely, and handling them should be a routine operation. The database should rot gracefully as disks and servers fail, without any human intervention needed. By the way, this particular situation is why people who have large NoSQL databases, which I consider to be more than a petabyte of actual storage, use a three-copy system instead of RAID sets.

Secondly, there should be an administrative way to stop using disks and servers. For example, we have actually upgraded the memory on our production system while it was in production, which involves stopping some nodes, turning them off, opening them up, changing the memory out, and putting them back into service. With the right administrative tools you can actually do this, and it involves talking to the repair daemon while the cluster is in production.

So let's use a couple of these things together. We do system monitoring; Nagios is a system I love to hate, and we've written pretty much all custom things. We do the usual stuff: monitoring at the operating system level, load and swap usage, and database specific monitoring. Like most NoSQL databases, if there's only one copy of a piece of data instead of three, we're going to stop writing to it; we're going to apply write back pressure in order to avoid changing things that would cause additional work later. But that write lag shouldn't go on for too long, so we have monitoring on that. We have application specific measures, such as searches per second by users, pages crawled by the crawler, and so forth. But the interesting one is this last one: we monitor the system for errors.

So when a disk goes bad, what happens? Well, it could be that your NoSQL code makes a call to write to the disk, or read from the disk, and it gets an error from the operating system. And that's catastrophic; you've failed, you can't recover the node from it. You basically have to fail that disk immediately and then work on repairing it. That happens about half the time. The other half of the time, the operating system notices before the application does. So you watch syslog; the SMART system sometimes comes up with the error; sometimes the error occurs on read-ahead and doesn't recur when the sector is read again, but it's still a significant error. We find that about half the time, we discover an error at the system level before it's noticed by our database. In that case, we automatically drain that disk of the data it contains, and typically that can happen before the database sees the failure. That means about half of our disk failures we avoid seeing at all. And since the limit to scaling a NoSQL cluster to a large size is dictated by the number of disk devices, halving the visible failure rate allows us to have a cluster which is twice as large. Very handy, unusual in this day and age, and I hope more clusters start doing this.
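Here's a hypothetical sketch of that early-warning drain. The log pattern and the drain step are illustrative, not blekko's actual code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Watch syslog for kernel I/O errors and proactively drain a disk
# before the database itself trips over a failed read or write.
open my $log, '-|', 'tail', '-F', '/var/log/syslog'
    or die "can't tail syslog: $!";

while (my $line = <$log>) {
    # Kernel error lines generally name the block device, e.g.
    # "... I/O error, dev sdc, sector 123456".
    next unless $line =~ /I\/O error.*?\b(sd[a-z]+)\b/;
    my $dev = $1;
    warn "early warning on /dev/$dev, draining it\n";
    drain_disk($dev);
}

# Stub: in production this would tell the repair daemon to
# re-replicate every block the disk holds onto healthy disks,
# then retire the disk administratively.
sub drain_disk {
    my ($dev) = @_;
}
```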
So, experiences with the system: we've been running this in production for more than two years with actual users on it. We've had downtime, but all of our downtime has been caused by network configuration, bugs in our network firmware, or bugs in our database code; none of it was due to having too many disk failures. We have had a few instances where a failure of two disks in a row, that were unfortunately placed in the cluster, nearly caused us to lose data. This is out of 4,000 disks in a cluster. And this is the event that limits your scalability, right? Too many failures in a row could cause you to totally fail, and that's why it's a limitation on scaling.

So, more information about these topics. Sorry, I couldn't go into depth on a large number of things. We have a blog for the company at blog.blekko.com. But probably the more interesting thing for you folks is a three-part blog series on highscalability.com. If you go to your favorite search engine, which is hopefully blekko, and do a search for blekko on the highscalability site, you'll see three postings pop up at the top that cover a lot of these topics in more detail. And if you have any further questions, please feel free to contact me. There's my email address, and I'm also on Twitter as glindahl. So, thank you. And I have five minutes for questions. So, any questions?

The question was: is our NoSQL database homegrown? And the answer is yes, it's actually completely homegrown. We started five years ago, and it's interesting how we made a lot of implementation choices back then without actually studying the architecture of the existing open source NoSQL databases. And it was interesting that we made a lot of the same design decisions other people did. It's always nice when you've successfully reinvented the wheel the right way instead of making it oval or square. However, because of our unusual environment, where we were going to run near real time, with website uptime instead of batch processing and batch-processing uptime, we made some different decisions and built some of the systems in unusual ways. And that was the main point of this talk: to discuss what we did that was different from the usual. Right, yes, it is; and if we were making our decision again today, if we wanted to start a search engine today, we would probably end up with a different decision as to whether or not to adopt one of the existing NoSQL databases.

So, indexing. I mean, I'm not sure what there is to share; indexing involves a lot of black boxes. It's been well known for a long time that, for example, the words that describe a webpage best are the anchor text of the links that come in. So a link that says "blekko.com", the thing you click on to go to blekko, describes the website, and of course that's not very useful. But a link that says "Greg Lindahl's homepage", pointing to my homepage, means that "Greg" and "Lindahl" are very important words for that page. And that's the largest signal that's used. There's also the authority of the people who link, and this and that. So there's a wide variety of stuff, and ranking websites is a very complicated non-local operation. If I've got outgoing links, I need to turn those into incoming links on other websites. So there are a lot of non-local writes that go on, which stress the database heavily, and we do that efficiently. That's an example of something you need to do.
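As a sketch of what those non-local writes look like (hypothetical names; this just reuses the TopN idea from earlier): indexing one page fans out into a combinator write per outgoing link, against rows that live all over the cluster.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Indexing one page turns its outgoing links into TopN writes against
# each *target* page's incoming-links cell, wherever that row lives.
sub index_page {
    my ($url, $rank, $outlinks) = @_;   # outlinks: target URL => anchor text
    for my $target (sort keys %$outlinks) {
        # key = the linking URL, rank orders the target's TopN list,
        # and the anchor text rides along for relevance scoring.
        comb_topn("inlinks/$target", $url, $rank, $outlinks->{$target});
    }
}

# Stub standing in for the remote TopN combinator write.
sub comb_topn {
    my ($cell, $key, $rank, $data) = @_;
    print qq{TOPN $cell: $key (rank $rank) "$data"\n};
}

index_page('http://blog.example.com/post', 17,
           { 'http://example.org/' => 'a useful reference' });
```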
Other questions? So, the question was: what language or languages do we use, and why? A little unusually, most of our database is implemented in Perl, which is an uncommon choice in this day and age. A lot of people don't like Perl, and that's because they've seen horrible Perl that was written a long time ago and has been maintained by a series of people who don't really like Perl, and it's very ugly code. Our founding team at blekko has been writing Perl together for more than a decade, so we have a very strong Perl style and use it the right way. There are a lot of nice libraries available, and so on. Perl is an interpreted language that allows you to develop new algorithms very quickly, which is very useful in a search engine. And when you arrive at the finish line and you love your algorithm, if it's not efficient enough, you can always push it down into C or C++, and we have done this with a lot of our code. We took about one tenth of our Perl code and pushed it down into C and C++, and it got to be ten times larger, so we now have roughly equal-sized code bases in C/C++ and in Perl. We also interfaced a lot of C libraries whose functionality we liked into the Perl system ourselves. So we have a very large amount of code written by other people, and we've mainly been writing glue, and then sitting down and trying to figure out the right things to do in order to have good ranking. Some other team trying to do the same thing would probably pick something like Python or Ruby, and they would have the same costs and benefits and drawbacks.

So, Perl has this thing called CPAN, the Comprehensive Perl Archive Network, and we use 600 CPAN modules that do everything from A to Z. There's a lot of bit twiddling on very large vectors and that sort of thing, so it's really hard to say which modules are more notable than others. None of them are NoSQL things; all of them are underpinnings. And it's really nice to have a very wide variety of things available in CPAN to choose among when you sit down to solve a problem.

Excuse me? [inaudible question] So, one of our goals was to use one database for everything, and I'm happy to say that we actually met that goal: we really do use this one database for a wide variety of different things. The exception is analytics. For analytics, people want to do ad hoc queries, so we emit summary data into a MySQL database and use some standard tools there. But, for example, our user accounts, which is data you care a lot more about being consistent than, say, your web crawl, we actually do store in our NoSQL database, even though it's eventually consistent and not as tightly consistent as MySQL. So we were very pleased that we were able to do such a variety of things in one database. There are other search engine companies at scale that use several databases, and I think it has always ended up being a technical disaster for them. We knew about this in advance and didn't want to make that mistake.

So the question is: how many servers do we have, and how did we figure out the right number and the ratios of CPU to disk and so forth? When we started, we had no idea. And that was a little exciting, because the amount of money we needed to raise before launch was a big unknown, and that's always disconcerting. What we figured out over time was that we actually wanted a lot of disk capacity, a fair bit of bandwidth, and a lot of memory, quote unquote.
So our ideal server is a fat server: of course it's dual socket, with the usual number of cores you can stuff into two sockets, ten disk drives, as much memory as you can cram into the thing, and then a couple of SSDs. That ended up being a sweet spot which meant our programmers didn't have to worry too much about conserving memory. You can imagine: I could saw that server in half, or saw it in four, and I'd have four times as many nodes, and they'd probably be a bit cheaper, since four times the small-server cost would be a little less than the big-server cost, and so forth. But it makes the programming harder. So in the end, all those ratios worked out for us. I'm not sure it's the best solution, but it's one we can afford and one which leaves our programmers productive and not worrying too much about memory overhead.

Any more questions? I'm standing in between you and lunch right now, so okay, I think that's all. So thank you for listening, and thank you for the good questions. We're done.