Welcome to today's PostgreSQL World Webinar, "Database Too Big to Query: What to Do." We're joined by Ivan Novick, Director of Product Management at VMware Greenplum, who will discuss data-driven decision-making and the value of real-time analytics. The historic difficulty has been analyzing huge volumes of data with OLTP databases. Greenplum is an open-source database software solution for next-generation data warehousing and large-scale analytics processing. My name's Lindsay. I'm one of the PostgreSQL conference organizers, and I'm gonna be your moderator for this webinar. A little about your speaker: Ivan is a Greenplum database champion, having worked on the Greenplum technology since 2009. He's experienced across the full stack of Greenplum technology, from internal development to user best practices to hardware selection and optimization. He leads the product team at VMware working on Greenplum and also meets regularly with Greenplum users around the world. Outside of work, Ivan lives in California and enjoys swimming, walking, and learning yoga. I always love to know that our speakers are human too. Welcome, Ivan. That's all from me. So with that, I'm gonna hand it off. Take it away.

All right, thanks, Lindsay. As Lindsay said, I'm Ivan Novick, and today we're gonna talk about what to do when your database is too big to query. Is that a real problem, and how are we gonna go about solving it in the future? So let's get started. I'll just let you read the disclaimer here about the legal issues and continue on.

So I'm gonna start out with a hypothesis. The hypothesis is that databases are continually growing in size, and that in order to manage continually growing databases, you should select strategic technology that can help you remain agile and effective as your databases continue to grow. So this is the hypothesis. And if we all agree that data is growing, and you can think about it from your daily life, right? Think about your cell phone and the amount of data generated on your phone. Think about the websites you access, think about the way you interact with the world now, and you'll realize that we are dealing with a natural force, an explosive natural force of ever-increasing data, which is accelerating, not slowing down. So for people out there who are used to managing 50-gigabyte databases, tomorrow you'll be managing 500-gigabyte databases. For people managing one-terabyte databases, tomorrow you'll be faced with 10-terabyte databases, et cetera. No matter who you are, the amount of data you're working with is continuing to explode, and that's gonna continue year over year for the foreseeable future.

Now, the details here are very important, and we're gonna get to why that is. When we're looking at the worldwide usage of databases, not every individual person in the world uses a database, but most corporations and most government agencies have at least one. Big corporations might have hundreds of thousands of databases, right?
And so when you try to understand how big all the databases in the world are, what you end up getting is essentially a bell curve. At the top of the bell curve is the average corporation or government agency, with a certain size of database; out on the right tail are your huge companies with huge databases, and on the left are your small companies with very little data and very small databases. What's happening is that the peak of this bell curve is moving. From my conversations with customers, I would say the mainstream of the world, the median, mean, and mode, has historically had something in the category of one terabyte of data, and that peak is quickly moving toward 10 terabytes.

Now, why is this transition from one terabyte to 10 terabytes really important? The reason is that when you run your database, there's a limit to how much data one computer or one virtual machine can store and process in a database, and generally speaking, that's about one or two terabytes. Once you start going to five, 10, and more terabytes, you're changing the very nature of the problem you need to solve: you now need software that operates across a network, not on a single computer. So what we're facing is the majority of corporations and governments transitioning from needing databases on single machines to needing networked databases that can span multiple machines and multiple virtual machines in order to holistically store and process that data.

Now, you might think there are other ways around this, which is to get rid of databases altogether and move to some new technology other than a relational database. You might think relational databases are old. Well, it's true that relational databases were invented in 1970, and you can see here the foundational paper from E. F. Codd at the IBM Research Laboratory in San Jose. But in fact, despite the buzz about other technologies, whether it be NoSQL, document databases, MapReduce, graph databases, or whatever else we may have heard about, there hasn't been a departure from the majority, 80 percent or so, of the world's data and databases leveraging relational databases. And a relational database, just as a reminder or refresher, looks something like the picture on the right: you store data in tables, those tables are typically normalized, and they can be joined together to answer any number of queries using a structured query language. So the point here is that relational databases continue to be the primary tool for storing and querying data. However, we're at a critical juncture where most people need to move from running a relational database on one machine or one virtual machine to running a relational database across a network of machines. And that changes the very technology that you need in order to keep everything in the world going.

So I have another hypothesis, which is that relational databases are foundational to technology strategy, for people, for governments, for corporations, but we must have ways to scale this technology to bigger data sizes. We're not gonna move away from relational databases to manage the growing data.
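Just to make that refresher concrete, here's a minimal sketch of two normalized tables joined with SQL; the table and column names are invented for illustration, not taken from the slides:

    -- Two normalized tables: customers and their orders.
    CREATE TABLE customers (
        customer_id int PRIMARY KEY,
        name        text
    );
    CREATE TABLE orders (
        order_id    int PRIMARY KEY,
        customer_id int REFERENCES customers(customer_id),
        amount      numeric
    );

    -- Join them back together with a declarative query:
    -- total spend per customer.
    SELECT c.name, sum(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name;

The data lives in normalized tables, and the structured query language reassembles it however the question demands.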
Instead, we're going to have new technology that can scale relational databases to the next level, to the next order of data size. So let's continue down this road: how can you actually scale databases bigger? There are two general concepts in scaling. One is called vertical scaling; the other is called horizontal scaling. Vertical scaling essentially means getting a bigger computer, and IT administrators have been solving IT problems by buying bigger computers for as long as computers have existed. It's the easiest solution: just get a bigger computer. However, there's a limit. Computers don't come in infinite sizes, and in fact, computers get prohibitively more expensive as you add power to them. You can buy a two-CPU, 100-core computer, but once you start asking for a 20-CPU, thousand-core computer, you're talking about a bespoke supercomputer that only people in exotic labs have access to; they cost millions of dollars, and nobody knows how to use or administer them. So the huge limitation of vertical scaling is that there just aren't very big computers available to buy, which leaves you, when you're trying to handle your bigger database, with the second solution: horizontal scaling, which means using more computers, not bigger computers, to handle the problem. And horizontal scaling is what we're going to focus on, because of the two, it's the only practical approach to the problem we face.

So just to recap on vertical versus horizontal scaling: vertical scaling is easy, because you're just buying a bigger computer and you don't have to change your code. The hard part is that it just doesn't get that big; you can't really buy that bigger computer. Horizontal scaling can get as big as you want by combining two, or hundreds, or thousands of computers, but it's more complex: what software knows how to manage those computers effectively within the context of a database, and how do you operate that? So horizontal scaling is more effective, but in most realities more complex to manage than vertical scaling.

Now, when we talk about horizontal scaling, the key word is sharding, which means dividing the data into different databases. So you've got, let's say, 10 different databases, each one stored on a different computer, and you shard, or separate, the data out, dividing it across the 10 computers. That's called sharding, and it's a proven technology; there's no big mystery here. Sharding really comes in a couple of different flavors. When it comes to OLTP, or transaction processing, you can think of a use case like Facebook: I've got a billion people's data, and each person's data is only a few kilobytes or megabytes, so I can divide that and say, well, I'll store 10,000 people's data in each database. Then all I need to do when I'm looking up a profile is figure out which database has that user, go look up the relevant information, and show their Facebook page or profile. It's a very simple thing: you just route to the correct DB and then return the correct data or update the correct data. That's great, but that doesn't solve our analytical problem.
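As a hedged sketch of that OLTP routing pattern, assuming 10 shards and using Postgres's built-in hashtext() function to pick one; the names here are invented for illustration:

    -- The application decides which of the 10 shard databases
    -- holds a given user by hashing the key:
    SELECT abs(hashtext('user_42')) % 10 AS shard_id;

    -- ...then it connects to that one shard and runs an ordinary
    -- point lookup there:
    SELECT * FROM profiles WHERE username = 'user_42';

One hash, one database, one row back: simple, and exactly why this style of sharding doesn't help when the question spans every shard at once.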
In our analytical problem, you're asking a different question. You're not saying, hey, what's Ivan's previous job? You're saying, for example: across all of our users, what's the average duration they've had an account on Facebook? Or, if I'm a supermarket chain: across every transaction, what's the average price of a purchase, in all of our stores, forever? So you're not going to one database and fetching a single record. What you're doing is writing a query in SQL, the structured query language, dividing that query into small chunks, and executing it in parallel across multiple computers. That's called massively parallel processing, or MPP, which is much more technically complex but required for online analytical processing. So as we talk about data analytics and databases, we'll be focused on MPP processing and how it solves our problem.

Now, another hypothesis for you. This one is a little bit of a change of pace, but very important: open-source technology matters, and PostgreSQL is a leading open-source relational database technology, which can be the foundation for a solution to the world's ever-increasing data and how to analyze it. You might ask why open source matters. Well, first of all, let's look at people voting with their feet. From a market-share point of view, there is a movement toward open source; this quote is from db-engines.com, and the number of use cases and the number of people using open source are growing. Now, why would someone want to store their data in an open-source technology? The alternative to open source is closed source, which means intellectual property owned by a corporation, with no access to the code or the technology, and maybe even no access to the implementation if it's provided as a web service. A database as a service could be completely closed source and completely proprietary, where not only do only the employees of the company providing the service have access to the source code and know how it works, but only they have access to the physical machines where the data lives. So you don't even have physical control over your data. The reasons people give for picking open source include the items listed here: security; quality of the implementation, because the code is in the public; customizability; freedom, which comes in many different flavors, but really means you own your own data and the code that's running on your data; flexibility; interoperability; community; cost, which is likely lower because the availability of the source code creates competition; and education, because it allows people to learn by studying how the system works. So there is a shift to open-source technology, and I think it goes in waves of popularity, where corporations and individuals decide open source is important, then step away and focus on other things. But fundamentally, I believe open source is important, that it gives you a lot of power, and that people are gonna continue to gravitate toward open source as we go forward into the future.
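Before we turn to PostgreSQL itself, here's a hedged sketch of that supermarket question from a few minutes ago, the kind of query an MPP system splits into chunks and runs in parallel on every machine; the table and column names are invented:

    -- Average purchase price across every transaction, in every
    -- store, forever. An MPP planner scans and partially aggregates
    -- on each worker, then combines the partial results.
    SELECT avg(total_price) AS avg_purchase
    FROM transactions;

    -- Or broken down per chain of stores:
    SELECT chain_id, avg(total_price) AS avg_purchase
    FROM transactions
    GROUP BY chain_id;

Notice the SQL itself doesn't change for MPP; what changes is the machinery underneath that divides the work.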
Now, as far as PostgreSQL: PostgreSQL is, I would argue, the leading open-source relational database technology in the world. It has been around for roughly 35 years and is under hugely active development. It has a very friendly licensing scheme, which allows a lot of innovation and worldwide collaboration, and it's super popular; I don't think I have to prove that to you, as you've probably used PostgreSQL yourself. But one thing you may not realize about PostgreSQL is that not only is it a great relational database, it is also, in a sense, an object-oriented programming model: it is a programmable database. And why would a programmable database be interesting, especially if what you're trying to do is use your database for analytics? Really, what I'm saying is that the programmability of an analytical database gives you millions of different scenarios and options for how you want to program that database, what you want to do with it, and how you want to embed logic and software into the DB, especially in large-scale systems, to generate analytics based on the data in the database. So PostgreSQL is a great technology to use.

Now, what does a programmable database actually mean? I want to give you some examples. First, user-defined functions. A user-defined function allows you to write your own software, which is executed when you query the database. PostgreSQL and Greenplum both support about five different procedural languages, and I think Python is the most popular one. This slide shows a simple Python function I created that takes two text inputs and does some of my own custom processing, which in this case is trivial, just extracting some substrings, but I'm creating my own logic. That logic gets wrapped into a function which is installed in the DB and can be run dynamically to generate new data points. So in this case, the name and the location are stored in the database, but the new column I'm generating is computed dynamically by my Python code. It could be anything business-relevant; it can be context-aware, and it brings the power of the entire Python stack and ecosystem to you as the user of PostgreSQL or Greenplum.

As a slightly more interesting real-world case, here we've got Twitter data stored in the database, and I'm showing the benefit of being able to import from the thousands of Python libraries that exist in the world. I'm going to import the sentiment analyzer from NLTK, the Natural Language Toolkit, which can detect from text whether people are angry or happy. That's code which I didn't write; it leverages best-of-breed Python libraries that already exist, and I'm pushing that processing into the database to analyze the text on the fly. This is just one flavor of what you can do with a programmable database: here I can actually sort all of the tweets by how negative their sentiment is. And by the way, this example was done years ago, so it's nothing new or topical regarding Twitter; it just happens to be a good dataset we had. But in this tweet, the person was angry; they were talking about Apple, and they got angry, and you can detect their emotion with a SELECT query. So we're able to query people's emotions with SQL, with very simple functions.
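Here's a hedged sketch of what such a sentiment function might look like as a PL/Python user-defined function. It assumes the plpython3u language is enabled (older installs may use plpythonu with Python 2) and that NLTK plus its vader_lexicon data are installed on the database hosts; the table and column names are invented:

    -- Wrap NLTK's VADER sentiment analyzer in a UDF.
    CREATE OR REPLACE FUNCTION sentiment_score(t text)
    RETURNS float8 AS $$
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        sia = SentimentIntensityAnalyzer()
        # 'compound' runs from -1 (very negative) to +1 (very positive)
        return sia.polarity_scores(t)['compound']
    $$ LANGUAGE plpython3u;

    -- Sort tweets from angriest to happiest, in plain SQL:
    SELECT tweet_text, sentiment_score(tweet_text) AS score
    FROM tweets
    ORDER BY score ASC;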
Here's another example, using geography. Again, I'm trying to demonstrate how a database can be programmable, and it depends what your business cares about. If you care about text and the information people are conveying in it, you could use the previous example. If you care about locations and where your data points are, let's say you're a shipping company or a logistics company, or you work for the postal service of your country or government, you may want to store data that includes location. There are modules, namely PostGIS, which can store vector, raster, topology, and geocoding data in Postgres and Greenplum, making the database programmable from the aspect of geography. And what does it look like? Just so you can visualize it a little: you insert, for example, polygons and points, and then you call functions. These functions do calculations based on distance, area, location, nearest neighbor, and all kinds of things you might imagine querying if your data had geographical points in it. To play this example out a little more: previously we were looking at tweets based on emotion; now we're looking based on location, because location is often stored in a tweet if it's enabled in Twitter. This slide is just showing the location of one tweet; the coordinates are roughly negative 83, 42, and if you put them into Google Maps they land somewhere in Detroit. But you can do things like search for tweets that came from within one mile of a given pair of coordinates. So get your imagination going and start thinking: if my database was programmable and had modules for text, geography, time series, graph, fraud, or any business use case I have, I could write the code that embeds that into the DB and lets me query it that way.

I can also create machine learning models, where I do predictions on data and use the database not only as a reporting database but as a predictive database. For example, I can train a linear regression model to predict house prices based on input data, and then just query it to predict the price of a house from new data using the historical model. That's another vector of how the database can be programmable: predictive modeling, using dozens of algorithms that come pre-baked in libraries and can be used to do machine learning on the data in your DB.
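Here are hedged sketches of both of those examples, in the spirit of the slides rather than the exact code shown; the PostGIS and MADlib functions are real, but the table names, coordinates, and model columns are invented:

    -- Geography: store a point, then search within ~1 mile (1609 m).
    CREATE TABLE tweet_locations (
        id   serial,
        geom geometry(Point, 4326)
    );
    INSERT INTO tweet_locations (geom)
    VALUES (ST_SetSRID(ST_MakePoint(-83.05, 42.33), 4326));

    SELECT id
    FROM tweet_locations
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(-83.05, 42.33), 4326)::geography,
                     1609);  -- geography casts make the distance meters

    -- Prediction: train a linear regression with Apache MADlib...
    SELECT madlib.linregr_train(
        'houses',                     -- source table
        'houses_model',               -- output table for coefficients
        'price',                      -- dependent variable
        'ARRAY[1, sqft, bedrooms]');  -- independent variables

    -- ...then "query the model" to predict a new house's price:
    SELECT madlib.linregr_predict(m.coef,
                                  ARRAY[1, 1500, 3]::float8[]) AS predicted_price
    FROM houses_model m;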
So, hypothesis: PostgreSQL is amazing. It's programmable, it's open source, it's mature; it's got almost every database feature ever described. However, Postgres itself is a single-computer database. It was designed back in the 1980s with the full awareness that it would run on one computer. Now, that one computer easily translates to one virtual machine or one Docker container, but it doesn't easily translate into a network of computers. So the question now is: how do we take the amazing capability of Postgres that we want to use to analyze data, and let it keep analyzing as the data grows bigger than one computer? And as we've shown, the majority of the world is stepping into a situation where one computer is not enough. That's where Greenplum comes in.

Before we get into what Greenplum is, I want to show you what some of the users of Greenplum are doing today. This doesn't mean you have these use cases, but it shows the art of the possible. There are users I know running tens of millions of queries per day in one DB. There are users storing trillions of rows of data in one table and sharing it with thousands of users; pushing data between multiple environments; doing thousands of updates and deletes per second; running a million analytical queries in an hour with an average response time of a second; and running all kinds of SQL-standard syntax, which can be extremely complex: complex window functions, recursive common table expressions, views, views on top of common table expressions, joins, joins on views on top of common table expressions on top of window functions, all the complexity that SQL can provide. And they're combining that with the advanced analytics and programmability aspects: machine learning, text search, graph analysis, JSON, geospatial, and time series. So all of these things are in the art of the possible. But because Greenplum is open source and Postgres-based, it's totally portable: you can run Greenplum at super small sizes, at the edge, in an embedded system, even on a Raspberry Pi, as we've shown on YouTube, if you have a use case for that. So Greenplum is not only for the people with tens of millions of queries or trillions of rows; it's really for the mainstream set of people who have suddenly found that one computer is not enough for them.

So, the key points of what problems Greenplum solves. It solves the problem of scalability, so you can handle any amount of data. It solves the problem of mature SQL features: new technologies get built all the time, but do they have the maturity to handle the full SQL standard and all the features you expect from a single-node database like Postgres, or from the equivalent popular closed-source database technologies? Do they have the maturity of the feature set to run millions of combinations and permutations of queries, and the experience to do that? And then, do they have flexibility of deployment options? If I want to run it in the cloud, on a virtual machine, in containers, at the edge, on my desk, on my laptop, does it have the flexibility? These are the problems Greenplum solves: scalability, maturity, and flexibility.

What I'm gonna do right now, for the technical folks out there, is give you a bit of a description of how this all works. Greenplum takes PostgreSQL technology, that single-machine relational database, and brings an entire set of open-source technology to let it scale for massively parallel processing. At a very high level, this includes a new and improved way to generate query plans, a new and improved way to distribute the data across a network of computers, and new and improved ways to pipeline data through the system as it runs and to federate the data with external systems.
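As a hedged sketch of the distribution piece, in Greenplum's actual DDL; the table and column names are invented:

    -- The DISTRIBUTED BY clause tells Greenplum how to hash rows
    -- across all of the segment servers, so a scan of this table
    -- runs in parallel on every machine at once.
    CREATE TABLE sales (
        sale_id   bigint,
        store_id  int,
        sale_date date,
        amount    numeric
    ) DISTRIBUTED BY (store_id);

If there's no natural key, DISTRIBUTED RANDOMLY spreads rows evenly instead; the goal either way is that every segment holds roughly an equal slice of the data.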
I'm gonna go through this in a little more detail now, but before doing that, a quick history of Greenplum. Greenplum was created as a startup back in 2003. Some of the key points you may have heard about were its integration with Sun Microsystems back in 2006, which had very powerful hardware for data processing; the acquisition of Greenplum by EMC and the creation of a data computing appliance back in 2010; the spin-out of Greenplum into the newly formed company Pivotal; the open-sourcing of the technology eight or nine years ago; and the major integrations and updates that have been made up to today.

The way the technology works is that queries come in through a master node, and the master is a very lightweight pre-processor that generates query plans; the data and the query processing live on the network of computers. So the first thing that happens when a query comes in is that a query plan is created, and this is the magic of distributed query planning. There are a lot of scientific papers published about Greenplum's query planning; it's based very deeply in computer science. Essentially, it takes the infinite space of possible ways to solve a query, intelligently and dynamically reduces that search space to find the most efficient plans, and generates the distributed query plan that can run the entire query in small pieces across the network. Those query plans are dispatched out to the worker nodes, the worker computers, over a standard TCP/IP network. Once dispatched to the individual worker machines, query execution kicks in and starts processing the data locally, in parallel, on every computer. As rows are processed, there's a dynamic pipeline where rows are pushed up through the query plan to the next stage, and efficiently moved between computers when necessary, for joins, redistributions, and other operations that require moving data over the network. All of it runs over standard TCP/IP or UDP protocols, on the commodity hardware, commodity switches, and commodity networks found in everybody's data center.

This is a transactional database. Greenplum has some additional transactional capabilities beyond what Postgres has, in order to make transactions work across the network, so that when you start a transaction, it spans all the data and all the computers in the system. And it is ACID, which means that by design you do not lose data, even in the event of hardware failure or disk failure; by design, things are committed when transactions complete, and they're stored in multiple places safely and securely. So by design, it's a distributed transactional system with the properties you'd expect from a database.
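You can actually see this distributed planning from the SQL prompt. A hedged sketch, reusing the invented sales table from before:

    -- Ask the planner for the distributed plan of an aggregation:
    EXPLAIN
    SELECT store_id, sum(amount)
    FROM sales
    GROUP BY store_id;

    -- In a Greenplum plan you'd typically see Motion operators
    -- (e.g. Gather Motion, Redistribute Motion) marking where rows
    -- move between segments over the interconnect, with the local
    -- scans and partial aggregates running below them in parallel.

The exact plan text varies by version and data, so it's elided here.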
It's also a high-concurrency system. Sometimes people build other analytical database technologies, and one of the easiest ways to cut corners in your architecture, and I'm speaking from 14 years of experience working on the internals, is to make it a low-concurrency system and put in extreme locking to reduce the difficulty of the implementation. But we have very advanced row-level locking in Greenplum, and we support up to hundreds of thousands of queries a second if they're small queries. We support not only inserts and selects, but fully support updates and deletes, like you'd expect from a relational DB. And it is redundant in terms of storage: we keep multiple copies of the data, and that's all transparent; you don't need to think too hard about it as a user.

Now, there are a couple of finer points when it comes to the basic principles of scaling a database to be large, and one really worth mentioning is partitioning. Many use cases for historical data are divided by time. Imagine I'm working at a retail chain, let's say a supermarket chain with 10,000 supermarkets, and I want to store the sales at every store for the entirety of time. What I'll do is break that table down into partitions to make it more manageable, and each partition can be a date range, in this case let's say one month of data. That allows very efficient storage and very efficient querying when not all of the time ranges need to be analyzed. Frequently you'll analyze the last month's or the last year's data, but much more rarely all the data or the old data. So this partitioning is really critical and can provide multiple orders of magnitude of speed-up when processing date-range data or anything similar to it.

And we can store that data efficiently. We can store row-oriented data, we can store column-oriented data, which are different formats; we can store external data in HDFS or S3; and we can compress the data using a variety of algorithms. These are storage properties that the end user doesn't need to know about, but the database administrator might, and they let you pack the data efficiently so you can scale. Because if a use case would require a hundred computers, but I can compress and partition it so it only requires eight, well, with eight computers my company can do that job. If I go ask for a hundred computers, they may just say, forget about it, we'll go blind without the analytics. But if I can pack that data down to eight computers instead of a hundred, okay, now we're talking; let's do this work. And you can have different strategies for how you do that: maybe keep the recent data less compressed on faster storage, and the older data more compressed on slower, cheaper storage. There are different policies and ways to go about it.

Essentially, the way it works is that when you query the data and your filter indicates what you want, the database immediately eliminates all the other partitions. So if you have 500 months of data stored and you're only querying one month, you'll process only one five-hundredth of the data, making the query 500 times faster than if you hadn't partitioned. This is really huge. And the data in an MPP system like Greenplum is also distributed: if you have 10 computers, the data is divided across 10 computers. If you combine the 10-way distribution with the 500-way partitioning, now you're talking about chunks of data that are one five-thousandth the size of the original database, so you're looking at query processing 5,000 times faster.
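In Greenplum's classic partitioning syntax, the supermarket example might look like this hedged sketch; the names and dates are invented:

    -- Monthly range partitions over three years of sales,
    -- on top of the hash distribution across segments:
    CREATE TABLE sales_history (
        sale_id   bigint,
        store_id  int,
        sale_date date,
        amount    numeric
    )
    DISTRIBUTED BY (store_id)
    PARTITION BY RANGE (sale_date)
    (START (date '2020-01-01') INCLUSIVE
     END   (date '2023-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month'));

    -- A filter on the partition key lets the planner skip every
    -- other partition entirely (partition elimination):
    SELECT sum(amount)
    FROM sales_history
    WHERE sale_date >= date '2022-12-01'
      AND sale_date <  date '2023-01-01';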
So if you're sitting there with a single computer trying to query this database, maybe without partitioning, and your report takes, let's say, 20 hours while the business requirement is two hours, doing it with more machines and more partitions is gonna easily get you where you wanna be.

Now, indexes are another key point. Indexes make the database searchable, not just analyzable. Postgres and Greenplum come with a wide range of index types. In computer science terms, indexes are data structures, and they come with algorithms for how they're searched; every data structure and every algorithm has its own performance characteristics depending on the nature of the data and the query you're running. Having a wide variety of indexes lets you search all kinds of structured, semi-structured, and unstructured data in efficient ways, looking for the things you want to find. So now you're talking about a database that's analyzable, searchable, programmable, and updatable in transactions, and that can store rich data types: not just integers, numbers, and strings, but complex documents like JSON or XML, and binary data like whole Word documents, images, videos, network traffic, IoT data, IT log data, geographic data, graph data. These are rich data types which can be stored and programmed for analysis. Going back to that picture of the black hole or the Milky Way at the beginning: the use cases and the complexity of the situation are exploding, and you can manage that explosion, capture all that information, and have it at your fingertips with the proper database approach.

Federated query allows you to merge and blend data stored in the DB with additional data sets stored in other systems, so you can query all the data in one database together with data from other technologies and other systems. And streaming data allows continual updates into your database, creating real-time data ingestion and an always-up-to-date database, constantly updated programmatically through RabbitMQ messages or streams, Kafka streams, or third-party tools like StreamSets or Informatica: continually updating data that's programmable and giving you an always-fresh view of the situation.

So when you put these features and use cases together, what you get with Greenplum is, first of all, obviously something scalable from small to large, and obviously portable. From a feature-set point of view, you're getting business intelligence analytics; 100 percent SQL; programming with R, Java, Perl, and C; semi-structured data with JSON and XML; row-oriented and column-oriented storage; full interaction with the Postgres ecosystem and community; super-fast data loading and querying; a compute grid you can use for calculations and computations, as well as machine learning, deep learning, time series, and graph; federated query processing; text analysis; geospatial; image recognition and OCR; and integrations with streaming systems, Spark, NiFi, Kafka, et cetera. So you've got a truly massive feature set available with this platform.
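As one hedged sketch of "searchable, not just analyzable": a GIN index over a JSONB document column, which Postgres supports and, to my understanding, recent Greenplum versions support as well; the names are invented:

    -- Rich, semi-structured documents stored alongside relational data.
    CREATE TABLE events (
        id      bigserial,
        payload jsonb
    );

    -- A GIN index makes containment searches over the documents fast:
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    SELECT id
    FROM events
    WHERE payload @> '{"device": "sensor-7"}';  -- find rows by JSON content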
So, wrapping up with the conclusion, I want to go back through these hypotheses. I think we have real-world evidence that they're true, and I hope this presentation has helped get you on the same page that these ideas actually hold. Hypothesis one: databases are growing in size. That's true, and strategic technology is needed to remain agile and effective today and tomorrow. Hypothesis two: open-source technology is important, PostgreSQL is a leading open-source relational database technology, and it's a great choice to learn and to build around. Hypothesis three: PostgreSQL is programmable and has amazing feature sets, especially for analytics use cases, but we need a way to scale it as the data grows in order to do analytics in the new world. And finally, Greenplum Database provides that capability. So I highly recommend that you invest in learning Greenplum and building it into your future roadmap, as an internal roadmap for leveraging Greenplum for data management and analytics, so that you have an open-source, scalable, portable, and powerful technology, with all the data you need at your fingertips, efficiently, for the future. So with that, check out our website, greenplum.org, stay in touch with Postgres Conf, and keep informed; there'll be more to hear later. Thank you so much.

Ivan, that was wonderful. I really, really appreciate the time and the thought that you clearly put in. So I think we're good to wrap up. I wanna thank you for all of your time, effort, and thought, and thank all of our attendees for spending a little bit of their day with us. I hope to see all of you on future Postgres Conference webinars. Cheers, thank you. Thank you.