My name is Dave, and I am from London, which is why I look very, very tired. Well, that's my excuse anyway, having flown over yesterday. So, I've been using NoSQL for quite a while. I started the Cassandra meetup group back in 2010, and I've used Cassandra quite a bit from then on. I still run the London meetup group. And I work for a company called Halo, who aren't launched on the West Coast yet, so you may not have heard of them. Halo is a taxi app: it's basically a way that you can get a licensed taxi to come and pick you up by pressing one button on your phone.

This gives you an overview of the features of Halo. On the far left you can see the cabs around you. You can press the "pick me up here" button, and then you'll see who's going to come and get you. You'll know who they are, you'll know what their number plate is, and you can press a button to ring them. At the end of the ride, you can just get out of the cab; the taxi will automatically charge your card for the amount. So you don't have to mess about with paying the guy, and you'll get an email receipt. That's Halo in a nutshell.

Some facts about Halo, to give some context to the talk. Halo is the world's highest-rated taxi app. We've got over 10,000 five-star reviews, and we've got over half a million registered customers now. We've launched in a bunch of cities. We started off in London, and we've spread out to places like Dublin, Madrid and Barcelona in Europe, and then over to the US: New York, Boston, Chicago, and up in Canada, Toronto. So we're spreading out. We haven't made it across to the West Coast yet, but it is in the plans. Oh yeah, I forgot about Tokyo as well; we're launching in Tokyo and Osaka too.

So Halo is growing. We haven't been around that long, about 18 months, and in that time we've already grown in terms of the number of cities, the number of passengers, the number of jobs done, the number of drivers on our network, the number of... you name it, we're growing in a lot of different dimensions. And this kind of growth presents some challenges.

So what's the talk about? Well, I'm going to go back to basics and ask: why did Halo choose to use NoSQL technologies? What was behind that decision? I want to give you a flavour of what the thinking was. I'm then going to talk about two of the NoSQL technologies that we're using. We're heavily invested in Cassandra, so I'm going to talk about how we use Cassandra, the use cases and what our setup looks like. Then I'm going to talk about Acunu Analytics, which is a NoSQL analytics solution built on top of Cassandra. And finally, I'm going to round off with some of the challenges of running NoSQL in an organisation, some of the things we found difficult and have had to work to overcome.

So first things first: why should you choose NoSQL? This is a quote from Andy Gross, who's one of the lead engineers on Riak. He gave this talk in London, probably a couple of years ago now, and he said that NoSQL databases trade off traditional features to better support new and emerging use cases. I think this is a good way of summing up NoSQL, and the key word here is trade-off: we're making trade-offs. This resonated with me at the time when I saw the talk.
I thought that's a very good way of putting it. So what sort of things do you trade off when you choose NoSQL? You're trading off more widely used and tested software. Cassandra seems relatively mature now compared to when I started using it, but it's still quite a young technology compared to something like MySQL, which has been around a lot longer. You're probably going to be trading off ad hoc querying: many of the NoSQL technologies provide a more limited set of querying capabilities. And finally, you're trading off a talent pool with direct experience, because when you come to hire engineers, you're probably not going to find someone who has used the technology before. So these are the sorts of things you're giving away.

So what do you get back? Well, these are the three headlines for me. I mean, there are other things as well, but high availability, scalability and operational simplicity are the things that I find attractive in the NoSQL stores, and they're the sorts of things that led Halo to want to adopt NoSQL. I'll go into that in a bit more detail.

This is how we ended up adopting NoSQL at Halo. When we launched in 2011, we were running on AWS; we're still running on AWS. We were running in one region, across a few availability zones, and we basically had three applications: two PHP/MySQL web apps and some Java services that did some of the heavy lifting. The whole thing had been built reasonably quickly by a team of three or four engineers. We had some resilience in our data store because we were using a MySQL multi-master setup, so if one of the MySQL boxes died, we could continue operating.

So what drove us to adopt Cassandra? Well, pre-launch, the focus of Halo was all about features. We needed to get the features done; we needed to get the basic app working so that we could launch. After we had launched, the focus switched to availability. We wanted it so that whenever you needed to get a taxi, you could pull out your phone, press a button, and a taxi is going to come and get you. We didn't want to be saying to people, "we are down for scheduled maintenance", or "I'm sorry, the database is unavailable at this time". We wanted people to be able to rely on Halo to get them a cab at all times of the day.

We had plans for international expansion and we wanted a single app. What we wanted was for you to be able to get to Heathrow, get on a plane to New York, get off at JFK, pull out your app, press a button, and the taxi comes and gets you. We wanted locality of data to go with that: when you're in New York, you're going to be running out of a data centre in the US; when you're in London, you'll be running out of a data centre in London. So we wanted this global expansion and we wanted to spread our data around the world.

We had expected growth, and we wanted a data store that wasn't going to impede us, so that we didn't have to worry too much about growth. Cassandra gives us that because it scales linearly. We don't actually have an enormous transaction volume, but we wanted to pick something that was not going to constrain us in the future. And then finally, prior experience: I had some experience of Cassandra before I joined Halo, and that familiarity fed into the decision-making process as well.
We adopted it largely through a sort of unilateral, developers-just-getting-on-with-things approach. This was when we were building Halo on a boat on the Thames in London, all of us in one little room. What we did was take the bits of the app that we knew we wanted to be global, break the functionality down, and move to more of a services architecture. We built the services so that they talked to Cassandra, and then we swapped them in for the old functionality. That's how we migrated onto Cassandra. We launched our Cassandra-backed services in 2012. We had a gradual roll-out, first of all to North America and then eventually to London and Dublin as well. Now all of our systems on the customer side are using Cassandra, and we're migrating more to it.

So, Cassandra at Halo. For another talk I did about Cassandra, I went round and asked a bunch of people in the company what they thought about it. The developer view was that it just works: you can use it and you don't really need to think about it. This is actually quite different from the early days of Cassandra, when it would not do that. It seems to be pretty reliable and people find it very easy to just get on with it. The other feeling from the development team is that you have to invest a little bit more upfront with Cassandra. As a developer, especially someone who has potentially not used Cassandra before, you have to spend a little more time thinking about your data, thinking about your data model, and actually doing the programming work to get it right, because once you've done that, the deployment story is really easy. You can deploy your service at Halo and you know that it will just work, because the database is on three continents.

These are two of the use cases at Halo that really power what we do. The first one is entity storage. We store all of our customer records in Cassandra. When you use the app, we'll look up your customer details from Cassandra. We store them in one column family called Customers, and we have a row key, which is a big number, your unique identifier as a customer, and then we have columns and values. This looks a bit like something you could imagine putting in a relational database: every row looks pretty much the same, they've all got the same set of columns, just different values, and the row key acts like a primary key. So this is the first use case.

The main consideration for entity storage, in terms of using it correctly, is that when you update a row, you should update only the columns that you've changed. The thing you want to avoid with Cassandra is reading a bunch of data, modifying it, and then writing it all back. With Cassandra you want to be mutating individual columns: you want to be saying "set this particular column to this value". If you do that, you're going to avoid race conditions and you're going to avoid overwriting someone else's changes. There's a minimal sketch of this pattern below.
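To make that concrete, here is a small Go sketch of the per-column mutation pattern. A toy in-memory map stands in for the Customers column family, and the row key and column names are invented for illustration; a real client such as gossie or Astyanax would issue the same shape of per-column mutation against Cassandra.

```go
package main

import "fmt"

// A toy stand-in for a Cassandra column family: row key -> column name -> value.
type ColumnFamily map[string]map[string]string

// UpdateColumns mutates only the named columns of a row, the way you'd issue a
// Cassandra mutation for specific columns. It never reads the row first, so two
// writers touching different columns can't clobber each other's changes.
func (cf ColumnFamily) UpdateColumns(rowKey string, changes map[string]string) {
	row, ok := cf[rowKey]
	if !ok {
		row = make(map[string]string)
		cf[rowKey] = row
	}
	for col, val := range changes {
		row[col] = val // one column = one mutation
	}
}

func main() {
	customers := ColumnFamily{}
	// Initial write of a customer entity (row key is the customer's unique ID).
	customers.UpdateColumns("1012886", map[string]string{
		"name":  "Dave",
		"email": "dave@example.com",
		"city":  "London",
	})
	// Later, update just the email: no read-modify-write of the whole row.
	customers.UpdateColumns("1012886", map[string]string{
		"email": "dave.new@example.com",
	})
	fmt.Println(customers["1012886"])
}
```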
This is another use case, and it's quite different: time series. This use case is perfect when you've got measurements, or immutable actions that are occurring. In this example we're storing a record of every time we send communications to customers. This is for customer services, or for individuals to find out what's gone on. In this use case, the row key is a day: it's a date, 2013-06-01. The names of the columns are now UUIDs, and they're actually time-based UUIDs, so they've got a time component within them. Cassandra understands this time component and will order the columns according to time. So what you end up with is one row per day, and one column per email sent within that day.

This is a bit of a departure. It shows where Cassandra starts to diverge from a relational store, because in this instance we haven't got a consistent set of columns in every row. In fact, the columns are all very different; the names of the columns are themselves a useful piece of information, they're actually the ID. You wouldn't design a MySQL schema where your column names were an ID. That would be insane: you'd have to do ALTER TABLE every time you emailed someone. But in Cassandra you can. So this is a good example.

To build on this, here is another index on the same data. As well as storing an overall view of what we've sent to everyone, there's another index, which is what we've sent to an individual person. In this case, the row key is the only thing that's changed: the row key is now an email address. So what we're doing here is storing everything that we've sent to a particular person under one row. What's actually happening is that when we write the data to the database, we write it twice, in two places. We're denormalising the data to match our read requirements. We've got two ways we want to read the data back, so with Cassandra we store the data in two different ways to satisfy both of those query patterns. This is a classic Cassandra pattern: you store the data as many times as you need to satisfy your requirements on read.

The main consideration for time-series storage is to pick a decent row key. If you had billions of things happening, you wouldn't want to store them all under a single day, because you'd have too many in one row. You might break it down so that you had a 10-minute bucket or something like that. That's the main consideration. There's a sketch of this dual-write pattern below.
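Here is a minimal Go sketch of that denormalise-on-write pattern, under the same toy in-memory assumption as before. The row keys and column-family names are illustrative, and a nanosecond timestamp string stands in for a real time-based UUID, which Cassandra's TimeUUID comparator would order chronologically for you.

```go
package main

import (
	"fmt"
	"time"
)

// Two toy stand-ins for Cassandra column families: the same email event is
// written to both, one per query pattern we want to serve on read.
var (
	commsByDay       = map[string]map[string]string{} // row key: "2013-06-01"
	commsByRecipient = map[string]map[string]string{} // row key: email address
)

func recordEmailSent(recipient, subject string, sentAt time.Time) {
	// Stand-in for a TimeUUID column name: sortable by time.
	col := fmt.Sprintf("%d", sentAt.UnixNano())

	// Denormalise on write: index 1, everything sent on a given day.
	day := sentAt.Format("2006-01-02")
	if commsByDay[day] == nil {
		commsByDay[day] = map[string]string{}
	}
	commsByDay[day][col] = recipient + ": " + subject

	// Index 2, everything sent to a given person.
	if commsByRecipient[recipient] == nil {
		commsByRecipient[recipient] = map[string]string{}
	}
	commsByRecipient[recipient][col] = subject
}

func main() {
	recordEmailSent("dave@example.com", "Your receipt", time.Now())
	recordEmailSent("alice@example.com", "Welcome", time.Now())
	fmt.Println(commsByDay)       // "what did we send to everyone today?"
	fmt.Println(commsByRecipient) // "what have we sent to this person?"
}
```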
In terms of client libraries, we're using Java, PHP and Go, so we're using Astyanax, which is the Netflix library, we're using phpcassa, and we're using gossie for Go. We're not yet using the CQL3 breed of drivers, which are quite new for Cassandra; these are the old-school Thrift ones, but we like all of them and they're pretty easy to use.

This gives you an idea of where we are in the world with Halo, and how far away everything is. The latency between the Asia region and London is about 350 milliseconds, so it wouldn't be acceptable to run with only one database in London: the Tokyo app just basically wouldn't work, it would be too slow. So this is one of the main reasons Cassandra works for us: it's very easy to distribute your database all over the world.

This is what Cassandra at Halo actually looks like. The yellow dots are machines on AWS. We're running two different Cassandra clusters, six machines per region, in three different regions: ap-southeast-1, us-east-1 and eu-west-1. The little grey dotted arrowed lines are the VPN links doing asynchronous replication between the regions.

So what happens is that when you get out in New York and we need your customer record, that will be serviced from the US data centre, and any changes will be written to the US data centre but then asynchronously replicated off to the other data centres. This is great because we can actually withstand the loss of a region for our Cassandra-backed services. If the entire US region disappeared, we could serve the traffic out of London with extra latency, but we could still do it, and we have actually done this on occasion. Normally, when someone messes up and kills a bunch of stuff in one region, we can just quickly flip the DNS.

We're running m1.large machines at the moment with provisioned-IOPS EBS. This isn't really a recommended thing to do, and we're currently looking at other ways of configuring our cluster. A recommended practice for Cassandra on AWS is to run off ephemeral disks, so we're looking at switching to that. We might switch to SSDs and just have fewer machines. There's a famous blog post from Netflix where they basically said they switched from having 20 machines running Cassandra with memcached in front of it to SSDs, and then they could run six machines with no memcached. That's probably where we'll end up going, because it makes life a lot easier if you don't have to worry about cache coherency. Especially with a global setup, it's quite difficult to do global cache coherency, so if you don't need to bother, that's great.

Multi-DC with Cassandra is just fantastic. It's so easy to get set up and running. We started off in one data centre, we built a new data centre, and you just connect them and they just work. I remember Adrian giving a talk about running this on a 747 or something, upgrading it mid-flight, and then landing in the... I don't know. He was giving a talk about how mind-blowing it is that you can take a live running database and just add a whole other region, and it just works. We also run OpsCenter, which is a free DataStax tool, and which I always mention because it's just worth doing: it gives you a pretty picture of your ring and lots of useful tools for managing the cluster.

The next NoSQL technology I'm going to go through is Acunu Analytics, which we're starting to use more and more. When we switched everything to Cassandra, there was one type of query that we lost the ability to do, compared to, say, the data we had in MySQL, and that's the analytical query. If you know any SQL, these are the queries where you're saying select count of this, sum this, average this, group by something. You just can't easily do those things with raw Cassandra. So, shortly after I joined Halo, we started to design a system so that when we moved to Cassandra we'd still be able to do this stuff. We started building our own analytical system in front of Cassandra, and that was quite hard work and we didn't do a brilliant job of it; then luckily Acunu released this tool, so we switched to using that. What Acunu gives us is the ability to do these queries again: you define pre-planned query templates, and Acunu will denormalise on write such that you can conduct queries like that. There's a rough sketch of the idea below.
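As an illustration of what "denormalise on write" means for analytics, here is a toy Go sketch. This is not Acunu's actual implementation; it just shows the shape of the idea, with an in-memory map standing in for Cassandra counter columns and the event fields invented for the example.

```go
package main

import "fmt"

// The analytics layer is told up front that we'll want "count of each event
// type, grouped by day" (a pre-planned query template), so on every write it
// bumps a counter for that (type, day) bucket. Reads become counter lookups.

type Event map[string]string // everything that happens becomes keys and values

var countByTypeAndDay = map[string]int64{}

func ingest(e Event) {
	// Denormalise on write for the template "count ... group by day".
	countByTypeAndDay[e["type"]+"|"+e["day"]]++
}

func main() {
	ingest(Event{"type": "email_sent", "day": "2013-06-01"})
	ingest(Event{"type": "email_sent", "day": "2013-06-01"})
	ingest(Event{"type": "job_completed", "day": "2013-06-01"})

	// Answering "select count(*) ... group by day" is now a direct read:
	fmt.Println("emails on 2013-06-01:", countByTypeAndDay["email_sent|2013-06-01"])
}
```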
Our setup looks a bit like this. Everything that happens in Halo we turn into an event, which is just a set of keys and values, a kind of map. These are generated both in the app and on the servers; basically, anywhere that anything happens, we generate these events. We then farm them all into a queue system called NSQ, which Bitly created in Go specifically for this task; they were looking to do a very similar thing, an analytical event stream, and NSQ is a fantastic tool. From NSQ we feed the events into Acunu, and Acunu stores the data in Cassandra.

Then we can do things like this. Anyone who's got experience of Cassandra will know that you wouldn't be able to do this with just raw Cassandra. This is an example where we're summing up our accept rate. We can take one type of event, an allocation event, where we offer a job out to a driver. Every job gets offered to each driver, and they can either accept it, decline it, or just ignore it. We can count up how many times that happens, and we can do things like group by day or group by minute, so it gives us quite a lot of power.

And then we can draw graphs. The blue graph shows customer demand: you can see how spiky our demand is; at rush hour in London, everyone wants to get a cab. And this is another example from our testing system, showing the number of drivers on shift. The top graph is quite interesting, because the way that Halo works is that drivers send us their location updates about every five seconds. Every driver pings us their location, so we have this event stream of something like 500 to 2,000-odd events per second of drivers sending in their updates. What we do is simply stick it all into Acunu, and then we can say "select the count distinct of the driver ID, group by minute", and that basically gives us how many drivers were active at that time of day, which is pretty cool. You can just draw a little graph of it, and below there's a heat map, which is built in as well. There's a rough sketch of that distinct-count query below.
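To show what that query is doing, here is a minimal Go sketch of distinct-drivers-per-minute. The in-memory sets stand in for whatever the analytics layer maintains in Cassandra, and the driver IDs are invented.

```go
package main

import (
	"fmt"
	"time"
)

// "select count(distinct driver_id) group by minute" over the ping stream:
// maintain a set of driver IDs per minute bucket as pings arrive, rather than
// scanning raw events at query time.
var activeDrivers = map[string]map[string]bool{} // minute bucket -> driver IDs

func recordPing(driverID string, at time.Time) {
	bucket := at.Format("2006-01-02 15:04") // group by minute
	if activeDrivers[bucket] == nil {
		activeDrivers[bucket] = map[string]bool{}
	}
	activeDrivers[bucket][driverID] = true
}

func main() {
	now := time.Now()
	// Drivers ping roughly every five seconds; duplicates within a minute
	// collapse into the set, which is what makes the count "distinct".
	recordPing("driver-42", now)
	recordPing("driver-42", now.Add(5*time.Second))
	recordPing("driver-7", now.Add(10*time.Second))

	for bucket, drivers := range activeDrivers {
		fmt.Printf("%s: %d drivers on shift\n", bucket, len(drivers))
	}
}
```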
So, the challenges. This is a big one: experienced team members, or the lack of them. I don't think there's any exception to this rule: every team member who's joined Halo had not used Cassandra before. We now have 50 engineers, and none of them had ever used Cassandra before they joined, so there are challenges there. The main challenge, I think, is that it's very easy to shoot yourself in the foot. With a MySQL database, or any SQL database, you can design your schema badly; you can end up in a situation where your queries run okay for the first week of operation, when there are maybe 100 rows, and then gradually implode. But an engineer who's had a reasonable amount of experience will often just naturally avoid those pitfalls, because they've got experience with the tools. With Cassandra, because people haven't got that experience, they can sometimes shoot themselves in the foot by doing things badly. So that's a challenge, and one of the things we've tried to do to get around it is to educate. We have internal talks about Cassandra, we explain how it works to people when they join, and we try to do things like peer-reviewing data models. We put a lot of energy into mitigating this risk, and the risk is that people use Cassandra terribly and it blows up in their faces.

This is the second challenge, and it's quite an interesting one. When I went round for a previous talk and asked everyone in the company what they thought of Cassandra, the management team's main response was this fear that we put all of our data into this database and couldn't get it back out. That was their perception of Cassandra, and actually it's not really true; there are lots of ways you can get your data out of Cassandra. To start with, we have no problem getting data out of it for the actual apps, because of the way we use Cassandra: you denormalise on write, you prepare for the ways you want to read it, you read it back. It's fine; it works great.

I think what this came down to was ad hoc requests for information. I'd almost think of it as debugging, for when things go a bit wrong: someone will come and say, "right, we need to get all the customers who registered between these two times whose accounts have these particular characteristics". Doing that when you've got 500,000 rows in Cassandra is not that easy, because it's not something we designed for up front, and Cassandra really wants you to denormalise on write. There are ways of solving it. The normal way is that you can use Hadoop, plug it directly into Cassandra, and write a MapReduce job, which is obviously insane and painful. But you can also plug in things like Hive; DataStax have an enterprise product where you can run a kind of integrated Hive on top of Cassandra, so you can literally type an SQL-like query, press enter, and it will go and deal with it all. We don't have that, so it's that ease of doing exploratory queries that we miss. I think it's also possibly the fact that engineers in the company will use it as a bit of an excuse. Fundamentally, when management come and tap you on the shoulder and say "can you just get all this data for me", you're busy; with Cassandra you can say "actually I can't, it's genuinely impossible", whereas with MySQL you'd feel a bit silly saying that. So I think there are a few different things going on here, and this is something we're working hard to change. Without an index prepared up front, the only honest answer is a full scan, as the sketch below illustrates.
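Here is a toy Go sketch of why those ad hoc requests hurt. The customer fields and the in-memory map are invented for illustration; in real Cassandra this scan is what turns into a Hadoop MapReduce or Hive job.

```go
package main

import "fmt"

// An ad hoc question like "all customers who registered between two times
// with certain account characteristics" has no prepared index, so every row
// must be examined. At half a million rows this is slow; at billions it needs
// a batch framework.

type Customer struct {
	RegisteredAt int64 // unix timestamp
	Locked       bool  // some hypothetical account characteristic
}

func adHocQuery(customers map[string]Customer, from, to int64) []string {
	var matches []string
	for id, c := range customers { // full scan: every row is examined
		if c.RegisteredAt >= from && c.RegisteredAt < to && c.Locked {
			matches = append(matches, id)
		}
	}
	return matches
}

func main() {
	customers := map[string]Customer{
		"1001": {RegisteredAt: 100, Locked: true},
		"1002": {RegisteredAt: 150, Locked: false},
		"1003": {RegisteredAt: 300, Locked: true},
	}
	fmt.Println(adHocQuery(customers, 50, 200)) // -> [1001]
}
```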
The next challenge with NoSQL is that I think it's easy to cause yourself a big data problem. This is an example where we stored every single point that every driver ever sent. This is a small subset; we must have three or four billion points now, and this is a few million of them. We just got them all out of the database and drew them on a canvas, and lo and behold, because drivers generally drive on roads, it plots a map of London, just from the data that they're sending us. You can see the Thames, you can see Hyde Park, and Richmond Park... not Richmond Park, the other one, the other park. Yeah, that one. But I think with NoSQL it's easy to get into this habit of "we can just store everything". It's technically possible: we'll have hundreds of nodes and we'll throw all this data in there, without really having an idea of what you're actually trying to do with the data. This is a pretty picture, but we don't really need it to run our business. So this is another challenge of NoSQL: avoiding that hoarding mentality of "I'm just going to hoard everything". Storage is cheap, but it's not actually that cheap; once you start adding lots of AWS nodes, you give Amazon a large wheelbarrow of money every month. So you need to keep that in mind.

So, our lessons learned, the sorts of things we've taken away from our adoption of NoSQL. I think having an advocate is really important. If you're taking on a new NoSQL database in your organisation, you need someone whose job it is to sell it internally, so that when people start, you can sell it to them and say, "look, this is good, and this is how you use it". If you don't do that, they'll join the company, use it badly, and become disaffected with it. You definitely need to teach people the fundamentals, because people are going to come into your organisation with no experience, so you need to address that and get them up to a level where they feel comfortable using it. We have regular internal seminars on Cassandra, and we put a lot of effort into learning it. The other lesson is: don't store stuff for the sake of it; do it if you need it, basically.

Explain the trade-offs when choosing NoSQL. This is quite an interesting one. The other management perception of Cassandra was that they weren't really sure why we'd chosen it. We didn't explain it well enough; we should have done a better job of explaining what we were doing, why we were choosing it, and the trade-offs involved in that decision. And then finally, provide solutions. I think this is a really important one. As developers, we provided a solution to the problems that we directly had, which was that we wanted to run Halo on three continents to power the app. But we'd neglected some other people in the business who had been able to use the data in MySQL, and once we'd switched, we didn't give them a way of doing some of the things they used to do. So we actually made the overall system have fewer features: they couldn't do something they could do before. What we should have done is put more energy in and provided a solution for that. We should have figured out how, when people needed to get 20 rows out of the database matching some criteria, they could do that easily. And we didn't.

So finally, and just quickly, in conclusion: we like Cassandra at Halo because it's got really solid design principles, it's got HA characteristics, and you can run it globally really easily. It's a very easy thing to deploy and operate; it's very operationally simple. All the nodes are the same, there are no special master nodes, they're just all identical, so it's very easy from an operational perspective to run this thing. The future for us is that we're going to continue to invest in Cassandra. We are migrating more and more of our stuff from MySQL onto Cassandra, mainly due to the multi-region support and the simple operational side. We're also trying to hire some people with experience of running Cassandra; that's quite hard work actually, we've been trying to do that for about three months now. This comes back to the talent pool thing; maybe we should just hire someone and get them to learn it, I don't know. We're going to focus on expanding our reporting facilities, and that sort of debugging capability. And then we're basically just going to grow. We're going to launch on the West Coast, so you'll be able to actually get a taxi via Halo when
you're in San Jose, and we're going to grow our engineering teams in Asia and London. Right, that's it, thank you. Any questions? Right at the back.

We're still switching. We started about a year and a half ago, so quite a while, but it wasn't something we were aggressively pursuing at that time. I think the aim now is to switch a lot of the rest of the stuff by the end of November, so we're pursuing it a bit more aggressively now.

Yeah, so sometimes people would write a little Python script that iterated over all the rows, but it would be very slow and suboptimal, basically. I think the normal way you do reporting with Cassandra is that you have another data centre that is a reporting data centre: you put one replica in that data centre, and then you run some kind of Hadoop MapReduce thing on top of it. So it's just a case of us getting that set up; we just haven't invested the time. I think if we did, it would make everyone's life a lot easier, so it's probably something we'll be doing. That's the normal way you do it, because then your reporting workload doesn't impact the workload for the transactional app flow.

Yeah, so they're type 1 UUIDs, and they have a time component. I think the type 4s are pure random; the type 1s have got a time component in them. And it's a bit messed up, because they actually reverse the bits, so the byte order isn't a natural time ordering. But Cassandra does all that for you: with Cassandra you just say "this thing is a time-based UUID", and it will know that and automatically order them in the right way, which is pretty cool. They're GUIDs, but if you generate two at exactly the same time, they'll still be unique, because only a part of it is time-based; it's got other bits in it that aren't to do with time. There's a short sketch of that timestamp extraction below.

Last question. Yeah, that's the plan. The vision is certainly to grow Halo beyond a taxi network, and some of the other decisions in the architecture are based on that kind of SOA approach, trying to make reusable chunks of services and stuff. But the Cassandra decision was really the multi-region thing; that really was a big selling point. If anyone's got any more questions, they can grab me after. Thanks very much, everyone.
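For the curious, here is a small self-contained Go sketch of what "reversing the bits" means for version 1 UUIDs, and how the embedded timestamp gets reassembled. The UUID bytes are made up; this mirrors the RFC 4122 layout rather than any particular library's code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// In a v1 UUID the 60-bit timestamp is stored LOW 32 bits first (time_low),
// then the middle 16 (time_mid), then the high 12 (time_hi_and_version). So a
// byte-wise sort scrambles time, and a TimeUUID comparator has to reassemble
// the timestamp roughly as below.

// Offset in 100ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
const gregorianToUnix = 122192928000000000

func v1Timestamp(u [16]byte) time.Time {
	timeLow := uint64(binary.BigEndian.Uint32(u[0:4]))
	timeMid := uint64(binary.BigEndian.Uint16(u[4:6]))
	timeHi := uint64(binary.BigEndian.Uint16(u[6:8]) & 0x0FFF) // strip version

	ticks := timeHi<<48 | timeMid<<32 | timeLow // 100ns ticks since 1582-10-15
	return time.Unix(0, int64(ticks-gregorianToUnix)*100)
}

func main() {
	// A made-up v1 UUID: version nibble 0x1 in byte 6, arbitrary other bits.
	u := [16]byte{0xd0, 0xe1, 0x6f, 0xa0, 0x0d, 0x3c, 0x11, 0xe3,
		0x8b, 0x4a, 0x00, 0x16, 0x3e, 0x30, 0x12, 0x34}
	fmt.Println("embedded timestamp:", v1Timestamp(u).UTC())
}
```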