Hey, everybody. Can everybody hear me okay? Okay, good. Cool. You can move up to the front. You don't have to be afraid. You can band together like a community. My name is PJ. I am here to talk about the rise of distributed databases. So a little bit about me. I'm here on behalf of Crate.io, who are the people who are kind of the stewards of CrateDB. And I'll talk more about that in a little bit. But it's a neat, you know, distributed NoSQL/SQL hybrid database solution. I work for a company called DevRelate, which is developer relations and community relations as a service. It's pretty cool. If you want to know more about that, or you just want a cool button that glows in the dark, come see me after the talk. I'd be happy to give you one. As DevRelate, what I do is I provide the service to a lot of different companies. These are some of the companies that I work with. And if you have questions about any of these, they all do a lot of different things, which is kind of the main thrust of what I do. So I'm happy to talk about everybody there. Also, shameless plug, I'm throwing together a polyglot conference called Code Days in Buffalo, New York. If you're interested, CodeDays.me, the CFP is there as well. It's open until April 30th. And early bird tickets are on sale for 129 bucks. So it's pretty much a steal. And you get to come to Buffalo, which I know a lot of people are going to say, why would I want to go to Buffalo? Believe me, Buffalo is a great, great place. So let's talk about what we're actually going to talk about. The focus of this talk is to kind of get the introductory idea of distributed databases and what they are, what that means. We'll look at where we came from, both from an application building perspective and a database perspective, what's been going on the last few years, last, you know, five, 10 years, which has been pretty exciting as far as all of these things are concerned. Where does NoSQL fit into all of this?
And what does that really mean? What in the love of everything is a distributed database? Because it sounds like a weird term, it's pretty new still. Who can benefit from this? You know, what are the use cases? And of course, where do we go from here? Because it wouldn't be a tech talk if we didn't talk about the future. So let's start with where we came from. As all of you know, as developers and technologists, if you will, technology moves at a pretty fast pace. At one point, it was totally reasonable to have something like this in the same facility that you were working in. It was probably in a closet. It was probably like a one meter by one meter room and there was one person who had the key. And it looked just as scary as this, unless you had like a really OCD person who was like, good at organizing cables and color coding things. But this is what it looked like. Right now, most people would look at this, especially young people when I've given this talk, and they say, my God, why would you do this? At the time, it made sense at the time to have your own server like that meant your company was doing really well. Those things were not cheap. To have your own server plus you were connected to an ISP plus maybe having more than one Dell blade in your stack, you were doing amazing. Now, the thought's repellent. But at the time, it was about making pet servers. You gave them names like Lord of the Rings characters or Isaac Asimov robots. And it was like, they were your friends. These are your servers. You never want them to die. They can never go away. You were controlling the flow. That was the main focus is you could control the flow. You weren't just throwing it out to a server that you didn't know where it was and hoping for the best. Or at least as much as your ISP would let you control. Databases have been equally impressive in growth. They started at a place in the beginning to gain data. 
For those of you who are a little bit older, you might be familiar with this. If you wanted to use the library's database, you went to the card catalog and you understood the Dewey decimal system. I'm willing to bet there's at least a couple of people in here who don't even know what that is. At any rate, even with computers, a lot of times you would have to go to a card catalog that would have a specific set of punch cards to query the data that you were looking for. After that moved on a little bit, you had to find the specific reel of tape that might contain a bit of the data that you were trying to query for. As time moved on, we moved to spinning disks and stacks like that. And it's interesting, but you still had to know what stack contained that data, what disk had what you were looking for. And the interesting thing here is the query. The query hasn't really changed. It's a basic statement. I need to select something that meets these sub criteria and I need that information. We didn't even have things like secondary joins and left outer joins and primary and secondary keys. It's a wonder we were even able to retrieve any data. I'll give you a perfect example of this. Databases moved so quickly that the data that was from the original Apollo missions, you know, going to the moon and getting all that information, most of that data is irretrievable at this point. Because we moved so fast from cards to tape to stacks of disks to modern techniques, we have a ton of information about the moon that we can't access. Maybe we should go back. There's a thought. But it wasn't easy, even with all these stacks. So, of course, our next step was to build an abstraction layer. We needed to create ways that people could query databases and have interaction with the database more and more in each of our, you know, industries. So in order to do that, we started to build tools like Visual FoxPro. Who remembers Visual FoxPro? I'm sorry. Access.
And this allowed, we all know Access. You don't have to raise your hand for that one. But it allowed people who weren't necessarily database administrators to go in and query things and look for them. You know, they were able to go in and say, oh, I need to know if all of my customers in Nashville have made purchases in the last 30 days so I can hit them up when we all meet at DrupalCon. And these are things that you could do with Access. It was real simple. You just, you know, first name, last name, email, latest purchase. And you could get yourself a good piece of information with a stylistically okay display of it. Push that to a spreadsheet, hand it to your sales people and say, go to town. It was awesome. Fun fact that I recently learned, I was doing this talk with a guy from Microsoft in San Jose just before another conference at a meetup. And he informed me that apparently Access is still one of the most widely used database abstraction tools in the world. For those of you who knew Access back in like the late 90s and early 2000s, you know how horrible a statement that is. Abstraction layers, though, they're interesting because they have positive and negative effects. Like I said, people who weren't DBAs could suddenly access the database. That's great. Now they don't have to go to the DBA and say, hey, listen, stop what you're doing. I need this list of customers. They could just do it themselves. On the negative side, this also meant developers started to move away from the database. And this is definitely a negative because skills started to atrophy in a lot of developers. I'll give you a good example. I was working at an academic company. We did surveys and logins and, you know, application data, stuff like that for colleges, universities, mostly health sciences schools. We moved to Ruby. I like Ruby. Ruby is pretty good. I'm sorry I'm saying that at DrupalCon, but I like it.
Before that, we were using Visual FoxPro, or rather, I was using Visual FoxPro. I knew how to write database queries because it's basically how FoxPro works. You write a database query, it dynamically generates ASP, you get a webpage. Cool. These Ruby guys, because of the abstraction Ruby had from directly interacting with the database, they could write these nice little statements that would usually get the query. But if something went wrong, they couldn't track it down. And these are developers. I'm not talking CTOs or the sales team who want a report. These are developers. And they'd already been abstracted so far away that they couldn't figure out how to do a database query. You know, a simple select, you know, select users from where, blah, blah, blah. And when I would do that, because things would break, because this is software, they'd be absolutely stunned and amazed. They'd be like, wow, you know SQL. It's like, yeah, because I'm old and it was necessary when I went to school. But yeah, I mean, so abstraction layers aren't always a positive thing. There's a negative connotation there. In the meantime, we started to get out of our gross little smelly closets with cords running all over the place and like, you know, little doohickeys up on this side. And if you touched it, everything would go out. We decided that probably wasn't the safest thing in the world, right? So we got into this idea of co-location facilities. And what a co-location facility is, for you kids out there who don't know what it is: it would be an office building or a warehouse somewhere that had stacks and stacks and stacks of servers. And that was great. You could easily say, you know, oh, I need to spin up a server. I'll slide one in there. We'll keep rolling things out and we'll grow that way. You could throw more hardware at the problem without completely overloading a closet and potentially setting buildings on fire.
That's awesome, except for some small caveats. Sometimes you'd have to wait to have access to your equipment because people had private data that they didn't want to share. So they'd be in their racks. You'd be waiting outside to get into your rack and if there's a problem, obviously that's an issue. The other thing is, if there's an outage for one person in your rack, there's an outage for everybody in the rack, which means something as simple as, oh, did I just trip over that cord? Oh, look what happened. Everybody lost their stuff and it wasn't great. As far as the actual environment, they were pretty much just as hot and sweaty as the closets, just kind of on a larger scale. But we were moving, people weren't naming their servers silly things and they were getting kind of divorced from the notion that the server is our friend. The server is a tool and it does exactly what we need it to do. So, hello. Good, thank you. So, it was a long road. It was a long road to get to modern computing. This is my example of modern computing. This picture is in every talk that I give because I love this guy. You know, he's straight smoking a cigarette right next to his printer. Brilliant, love it. But yeah, all these factors lead up to where we are. In the past five, ten years, things have changed so very much. We've moved from the basement to the boardroom. You know, people in technology aren't just, you know, low-level administrators sitting in basements, living in their parents' houses. Like, we're not going down to the colo and hoping for the best. We're not looking for a tape reel to get our data. We're not hoping that some piece of the card catalog is going to give us the information that we need. We're digital. We've moved forward. And when we look back, we can see the value of why we've done these things. A lot of times with technology, people say, well, why was that done? Well, it's pretty easy. You know, co-location facilities, things didn't work.
Our customers couldn't get access to things as quickly as we wanted them to. So we continue to move forward to make that faster. The last five or ten years have been amazing as far as growth. Like I said, co-locations, they weren't quite cutting the mustard as applications got bigger and bigger and bigger. Databases got huge. We needed information and we needed to put it somewhere. Someone came up with this great idea of the cloud. It gave us high availability, which if you're not familiar, that means that you pretty much have an application running all the time, 24-7. You know, you'll read about companies that have SLAs, like AWS. You are up 99.999% of the time. That's an amazing statistic. There's no way before the cloud we'd be able to do that. You had stability of a sort in the beginning. It's gotten a lot better. You had a reasonable operating system. That would be pretty easy. I mean, back then you were on IIS or maybe some weird distribution of Red Hat. Things like Ubuntu came along and made it easy to just spin up a stack. The cloud changed just about everything. It was rickety at first, but so were co-location facilities, because you were trusting someone else with information that they had no idea about. At least when we're using the cloud, we're working with other technology companies. They kind of have an idea of what you're trying to do. Everything became really interesting. Security became tantamount. Uptime became paramount. The cloud was here and everything from AWS to platform as a service took over what we thought was the way to do things. We could move away from the server. The hardware was less important. Building applications, building better databases, that was the key, and there were so many technologies. This is the whole post-dot-com startup boom.
People realized they could do things easily with open source software, build it, put it somewhere they didn't have to pay for hardware and start building something that made a difference or money, depending on what your goal was. Then in just the last few years, enter Docker. What a game changer. Things got crazy. You containerize your work. The whole concept of microservices started. You could build things piecemeal, destroy a container when you didn't need it, build another container for new features, and just roll things out so the environment was the exact replica of what you wanted it to be. Little tiny pieces. You didn't have to break a whole application because one line of JavaScript went wrong. And I'm sure we've all had that problem. This seemed to be the next evolutionary step in application development. Containers allowed developers to break things down and deploy daily, sometimes hourly, in certain situations. This is something you couldn't do back in the day. Sometimes, I mean, even with the co-location facilities, I remember a time when we had to actually physically put the code on a disk, on a CD, bring it down to the colo, load it up into the server, hope it worked. Those were fun days. Nowadays, an instance goes down somewhere in the U.S., not a problem. Another can be spun up in just a few minutes in the same region or another, or maybe on a different continent. Doesn't really matter where it is. This is a far cry from going all the way down to the colo or trying to fight your way through that closet of cords, figuring out what's going on, switching out physical hardware. All the while, your users, your clients, your customers are all sitting there wondering why the hell the site isn't up and why they aren't using some other service. Huge step in application development. But it's just a step.
You hear a lot these days about Kubernetes and Docker, containerization, and it being the golden hammer to solve all of our problems with application and data development. And it is great. It is amazing, but don't ever assume anything is like the end of the line when it comes to technology. There will be something else. There will be something cooler. Adoption of containers is far from universal at this point. Huge organizations move at glacial speed. I worked at IBM just a couple of years ago. And fun story, when I was there, this is two years ago, they were like, so agile, this seems like a thing. I'm like, have you guys heard of DevOps? They're like, what? It's like a post-agile philosophy. They're like, we're just getting used to agile. You can't really bring up these things. I'm like, how long have we been working on agile? Five years. And you still haven't figured out what it is yet? Great. Now, time out. This is supposed to be about databases, right? And I've really been focusing on the application side of things. And that's fair. The idea is we needed to know where we came from. If you don't know where you came from, there's no way to know where you're going. And to be honest, the relational database that came about in, you know, the late 70s, early 80s, even the early 90s hasn't really changed that much. Postgres and MySQL have totally had updates, but they basically work the same way that they've always worked. Some features, some speed, some security, but if you were to compare Postgres 1 to Postgres 9, you're not going to see a whole hell of a lot of difference. So in 2009, give or take, we start to see a revolution in databases. For the first time, people began shouting about and exploring the use of NoSQL solutions, non-relational databases, and using them in production environments. Now, to be clear, this is not where NoSQL started.
It started in the 90s with a guy named Carlo Strozzi, whose name I've never pronounced correctly, and he wanted a solution that was a non-relational solution. So he developed a concept of NoSQL. Also, so that we understand this going forward, what we're saying here when we say NoSQL doesn't mean no structured query language. SQL means structured query language. The assumption is that "no" means we're not using that anymore. But the "no" actually stands for "not only". So NoSQL is really building on top of what a structured query language already uses. So let's keep that in mind as we move forward. It wasn't an easy transition. There were politics. There was confusion. Everyone immediately assumed they were supposed to throw out their entire database and hope for the best. Questions came up. Was a document store non-relational? Could a full application handle things like sharded NoSQL? What the hell does sharded mean? Is it possible that a true 100 percent video web application could run 100 percent on something like Redis, which is barely a database? It took a little time for the dust to settle. Developers started to realize the benefits of running something that was much more lightweight than the standard, like, huge data warehouse or relational database. The old ways didn't die out though. By no means. I'm sure everyone in this room knows that they're not all on like MongoDB suddenly because this is a cool thing that happened in 2009. But a new concept started to emerge. People started to get ideas of ways things could be used. And at first it was a hybrid. You'd have kind of your NoSQL side. You'd run Redis to do some asynchronous tasks in the background, like send emails that need to go out for a nightly newsletter or something like that. Things that would take some of the burden off the application and the main database. But it started to gain traction. NoSQL solutions started to become safe and stable.
And slowly they became more popular and more common in the application life cycle. Now, many NoSQL stores, to be clear, compromise consistency in favor of availability, partition tolerance, and speed. Speed is the key here. Barriers to the greater adoption of NoSQL include the use of low-level query languages, so not your standard SQL querying language. A lack of standardized interfaces. That was a major problem in the beginning. Huge, huge investments in previous relational databases. So if you're working for a company that's been around since, you know, 2002 and they've always used Postgres, guess what? They're not going to jump up and switch to NoSQL. That's just probably not going to happen. And, you know, these are the hurdles to NoSQL. So it's hard to get that kind of adoption. With that in mind, we started looking at the next step, which is the distributed database. Or as I like to call it, the database of the future. I hope to watch this video at some point in time in like 10 years and be like, wow, that's the stupidest thing I ever said. It's just fun. I've been telling myself that this is the future. It's always weird. So I have to read this part straight up because this is a definition. A distributed database is a database in which storage devices are not all attached to a common processor. It may be stored in multiple computers located in the same physical location, or it may be dispersed over a network of interconnected computers. Unlike parallel systems, which is your standard relational database, in which the processors are tightly coupled and constitute a single database system, a distributed database system consists of loosely coupled sites that share no physical components, or at the very least, they don't have to. That sounds a little bit like containers, right? If you're familiar with containers, that's kind of how they work. Similar to those distributed application services, a distributed database integrates data logically.
It makes it seem as if all the data is stored in the same place and you can't tell the difference on the outside. This means we can separate functionality of the database, separate the schema into silos, and it lessens the burden on the application. So the application has better functionality. With a non-centralized database running a NoSQL or hybrid NoSQL RDBMS option, we start to see full flexibility and we're able to see great gains on the application side and safer parts of our database, more secure and better running. In the beginning, it was really about replication. People would kind of do these hybrid situations, distribute the database into pieces. You'd have your primary and secondary and it would really be about replication. So you could guarantee the data was always there. Which is the same thing we were doing with the application layer once we moved to the cloud. You had primary, secondary, maybe a few secondaries to make sure you could handle load. And that's only like the first step towards distribution though. You're essentially backing up 100% of your database, 100% of the time, or whenever you're doing it on a regular basis. You should be doing it on a regular basis. Can't emphasize that enough. Since each node is an individualized subsection and is only part of the total data store, there's less need to worry about processing time. And that's awesome. That's where we get our speed from. Things become lightning fast. It's easier to expand a data store, to build it up out of nothing. It's also easier to make it smaller if that's the need of the application. Things get faster. They grow exponentially. But at the same time you start to see savings in cost because you're only using the pieces you need to use. You're not using a huge database all of the time. You're taking up less space. That makes things really a lot easier. I mean, there are some caveats, though. On the negative side, distributed databases, they grow in complexity pretty quickly.
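To make the idea of each node holding only a subsection of the total data store a little more concrete, here is a minimal Python sketch. It is not how CrateDB or any particular product does it. The `ShardedStore` class, its method names, and the choice of hashing the key to pick a node are all assumptions made purely for illustration. The point is only that the caller talks to one logical store while the rows actually live on several nodes.

```python
import hashlib


class ShardedStore:
    """Toy key-value store spread across several in-memory 'nodes'.

    Hypothetical illustration only: each node is a plain dict standing in
    for a separate machine, and a hash of the key decides where a row lives.
    """

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        # The caller never needs to know which node holds the row.
        return self._node_for(key).get(key, default)

    def total_rows(self):
        return sum(len(node) for node in self.nodes)


store = ShardedStore(node_count=3)
for i in range(100):
    store.put(f"user:{i}", {"id": i})

# Looks like one database from the outside, but no single node holds it all.
print(store.get("user:42"))
print([len(node) for node in store.nodes])
```

Each lookup only touches the one node that owns the key, which is roughly where the speed comes from. A real system would also have to deal with replication, rebalancing when nodes are added or removed, and queries that span nodes, none of which this sketch attempts.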
It's important to keep a handle on the architecture side of things. So that has to become a focus of either your data team or your application team, depending on who is actually in charge of that. Additionally, the more fragmented the database becomes, the more of a concern concurrency becomes. And of course, this comes with pushing the envelope and moving towards new technology. It was the same way with cloud. People didn't understand how they could have different parts of an application in different parts of their architecture. It's a learning curve, something we need to focus on. All of this is happening as, you know, I say we usher in the era of big data, but it's really we usher in the era of understanding what the hell big data is. More data is being processed than ever before. We all know the stats they give, you know, there's so many selfies taken on Instagram every day. There's more photos in the past five years than the previous 100 and all this kind of stuff. That's all good information, but when it comes to big data, we're talking about four specific things. Everyone focuses on the first one, which is volume. Having just volume doesn't mean you have a big data issue. You also have to have variety, which means a lot of different types of data. Velocity, which means you're bringing in the data at an extreme pace. It's coming in constantly. And veracity, and this is the most important one. How much of your data is actually valid? To put a point on it, say you are running a platform as a service, and this story just comes off the top of my head, and it happens to be that people are spinning up instances and those instances are only around for 23 hours and 59 minutes so they can Bitcoin mine as much as they possibly can in that time, and then spinning down the instance before they get caught. That is not true data. That is not a picture of what your users actually use your service for. It is really important to your security team.
However, that is not what qualifies as veracity. So it's really a matter of knowing, with big data: do you have big data, or just lots of data and you think you have a big data problem? That's something for you to decide. It's different for every organization. One distributed solution, and I'm sure you're shocked I'm going to bring this up, is CrateDB, which deals with distribution in similar ways to the containerization of applications. The concept is to allow piecemeal changes to be made within the data schema while not at all affecting the overall operation of the database itself or the application that it's interacting with. This preserves necessary functionality and allows the application and database to perform much, much better than if it's just a regular standard application. And this is great. This is all wonderful information, but where do you use these things? Who uses distributed databases? One of the big things is IoT, and I'm not talking about like you have a drone and it's really cool to fly. I definitely need a distributed database. You don't. You really don't. I'm talking more on the industrial side, like large factories that have been updated. They're using IoT technology. They're using sensors to see if things are being built properly or robotic arms to make sure things get shifted from one line to another line, whatever that is. These applications have floods of information, gigantic tons of information going on all the time. It's impossible to follow that with a human. So that's one application, industrial IoT. The other one is data science and forensic applications. I don't mean like I watch CSI and they use this really cool database. I'm talking about applications that go in and say we've found your pen points. We know how to penetrate your system. Here's a great example and some wonderful log management on exactly how we did it. Things that break down why things aren't working.
A more topical thing, you know, kind of in the reality that we live in, is driverless cars. Tons of information. Maybe not enough information or at least not enough information to react quickly enough to not hit a lady with a bike. But still, lots of sensor input, mapping, geodata, camera inputs from all over the car, reaction to fluid levels, you know, can I clean the windows so that my driverless car, you know, my passenger can see. Power input, power output, all these things. Perfect case for a distributed database. You can actually have a database specific to every sensor on the vehicle and only worry about things when they start to go wrong for that particular database. So I'm running out of time, so let's talk about the future because we all love that. As time passes, you know, we kind of saw where we came from, so where are we going? I honestly don't know, but here are some ideas. Serverless. I'm sure we've all heard people talking about serverless. It's a really cool technology where you can basically take a piece of application code that doesn't, you know, affect the overall running of the application. It just needs to run every once in a while. Create a Lambda. That Lambda goes up to AWS and only runs when it has to. You don't even need a server for this. You just run that code. A great example of that is those little Amazon buttons that people have so you could buy more Tide Pods because you're cool. You want to show your friends how much you can eat. You hit the button. It sends you a box of Tide Pods. All that does is send a message to Amazon that runs a Lambda that says this order has been completed with the information here. As soon as the order processes, all of that code goes away. Absolutely amazing. So why not have some sort of serverless architecture for databases? It's a possibility. Quantum databases. This is brilliant. I came up with this idea myself. Don't know how it works. But it's one of the many things. People are looking at quantum servers.
So why not have quantum applications and quantum databases? I think that's a next logical step from serverless. Who knows how far that can go? I actually saw a talk by a guy who was talking about cloud computing using actual clouds. And I was like, this is ludicrous. There's no way. And he actually went through how there is a magnetic interference between molecules in the cloud that creates an electrical charge, hence lightning. If you can harness that lightning, which, okay, I'm really ready to see this now, why not be able to use that energy to actually hold information? And I was blown away. Okay, I give it to him. Sure, why not? Cloud, cloud, clouds. I'll do it. What about databases that are so large they hold the entirety of human knowledge, the Encyclopedia Galactica? The possibilities are endless. And the reason why they're endless is because of open source. The future is open source. And if you take nothing else away from this talk, this is what I want you to take away. Everything that we do, you can do yourself. That doesn't mean you alone, that means probably as part of a community. But you take an idea, you build it. And if it's a good idea, everyone's going to help. This is true whether it's a distributed database or a new way to do application code or hell, you know, 15 years ago, if you had told me there's this great way to social code, it uses git and it's called GitHub, I would have been like, get out of here. That's ridiculous. No one's going to use that. But, you know, once upon a time, we also had giant monolithic applications standing alongside giant monolithic databases. But that time has passed. We have microservices. Distributed databases are the microservices of databases. So we have a lot to look forward to and we all have the opportunity to make that. It's really up to us and that's where the open source comes in. Tie it all together and we can make a wonderful, wonderful new technology as time moves forward. And with that, thank you very much.
I'll be around for a bit if anybody has any questions. Feel free.