Who has heard about Greenplum before? Hands up, please. OK, quite a few. That's good. I'm going to tell you what Greenplum is, how we made it open source, what roadblocks we had along the way, and why this entire process took so long. My name is Andreas. I'm from Germany, working for Pivotal. I'm also on the board of directors of PostgreSQL Europe — we are the people who move the European Postgres conference to another country every year. I joined EMC in 2011 and moved back and forth between EMC and Pivotal, and now I'm finally at Pivotal again. OK, in February 2015, Pivotal announced that they would make all of their products open source. Not only Greenplum — all of them, everything they have. It's a nice move. And after this announcement, nothing really happened. By the end of October 2015 — so eight or nine months in between — from the outside, still nothing seemed to happen. But of course, we did not just sit around doing nothing. We had plenty of work to do, and that's what I'm talking about here. First, I'll tell you what Greenplum is: the history of Greenplum and how it compares with Postgres. Then why Pivotal decided to go open source after all, and what the other challenges and lessons were that we learned along this journey. At any time, if you have a question, just raise your hand and I'll try to answer it. OK, for those of you who haven't heard about Greenplum: it's one of these Postgres forks. And we have many, many Postgres forks — some are still around today, some died along the way. Greenplum diverged way back in 2007, so it has already been around for a very long time. It really aims to be a data warehouse and analytics product, something Postgres traditionally is not very good at. Postgres is an OLTP database.
If you want one single row from your table, you want an index supporting your query so you get that row back as fast as possible. That's not how data warehousing works. In a data warehouse, you want to store all your data, possibly forever, and you want to run your queries across your entire data set. An index is probably not helpful there. It's quite a different workload compared to Postgres. It also means that you store terabytes or even petabytes of data in your database. That's not something you can fit in a single box, so obviously you need to spread it out across multiple boxes. And that's how Greenplum works. The name we have for this is a massively parallel, shared-nothing database. Sounds a bit like marketing, so let's have a look into it. If you look at Google Trends for data warehousing, you can see that the term "data warehousing" itself is declining over time. It was stable from the 80s through the 90s and early 2000s, and then it went down. The next trend we see, beginning around 2012, is that people no longer want a data warehouse where they run reports over the weekend or at the end of the month. They want instant answers. They want to run a query now, and they want the answer now. That's where the term "analytics database" comes in: instead of that one query over the weekend which takes six hours — and maybe it finishes, maybe not — you want your answers now. Greenplum is not the only player in this field. You might have heard about a few others: Teradata, the company which basically started this entire data warehouse space. There's Exadata from Oracle, and Netezza, which belongs to IBM. Netezza, by the way, is yet another Postgres fork — they forked off, I think, 7.0 or 7.1, so it's even older than Greenplum. And there are quite a few more players. So what is this massively parallel, shared-nothing thing? This is how a typical Greenplum cluster looks.
On the top, you have two servers: the master and the standby master. The master is the server you talk to as a client or as an application. It speaks SQL to you like a Postgres database, and it hides everything beneath it — all the network stuff, all the different servers you see down there. Usually you don't see them when you run a query; Greenplum takes care of these problems for you. The right box, the standby master, is the only box which has nothing to do in this entire scenario. It's just there in case the master fails and you have to switch over. The middle line, the dark gray box, is what we call the interconnect. It's basically the network layer over which the master and the segment servers down there talk with each other. The faster, the better. Usually we have a 10-gig network deployed here, but anything that speaks TCP — or actually UDP — is fine for us. And down there we have what we call segment servers: one, two, three, as many as you want. We can easily work with a few dozen or a hundred segment servers here. And with a few hundred segment servers, you can see that you get quite a number of CPU resources, RAM, and disk in this entire setup. So we're not only talking about terabytes — we're talking maybe petabytes of storage and hundreds or thousands of CPU cores. On every segment server, we have a number of segment databases. It usually equals the number of CPU cores we have, but that's not a hard rule. The segment databases are basically Postgres instances coordinated by the master. So let's say we have eight segment databases per server and eight segment servers — then we have 64 segment databases running in this entire cluster, plus one on the master, so it's 65. Good. Where's the data? The data is spread out across all of these segment databases. Greenplum takes care that the data is sharded across all of these segments.
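To make the sharding concrete, here is a hedged sketch of how a distribution key is declared in Greenplum's DDL — the table and column names are invented for illustration:

```sql
-- Greenplum extends CREATE TABLE with a distribution clause; rows are
-- hashed on the key and spread across all segment databases.
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    amount      numeric(12,2),
    sale_date   date
) DISTRIBUTED BY (sale_id);

-- Without a good natural key, round-robin distribution is also possible:
-- CREATE TABLE events ( ... ) DISTRIBUTED RANDOMLY;
```

Picking a distribution key that spreads rows evenly matters: a skewed key would leave a few segments doing most of the work while the rest sit idle.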
And every query you run in this database will run on all of your segment databases, across all of your data. Basically, you're utilizing 100% of the CPUs in your segment databases — so the more you have, the better. What does a typical query look like? Usually you don't just have a SELECT COUNT(*) from a table; you have joins and aggregations included, and that's where Greenplum comes in and handles the query for you. As an application, you only have to send a query to the database, and Greenplum takes care that all the joins and aggregations happen internally. As an application or as a DBA, you don't have to care about this. Okay, last marketing slide. We also have a number of features you don't see in Postgres today, like what we call polymorphic storage. We have column-oriented tables, where every column of a table is stored in its own file. That doesn't make sense if your table only has two, three, or five columns. But in a typical data warehouse, you have something like 500 columns in a table. And if you have to read 500 columns but only need three of them, you're basically wasting I/O on the other 450 or 480 columns. If you only have to read the three files for the three columns you really need, everything goes faster. We also have partitioning in place. You can say ALTER TABLE ... ADD PARTITION, DROP PARTITION, this kind of stuff — and you don't have to create triggers to move your data into your partitions; Greenplum handles this for you. We also have parallel data loading. Data loading doesn't have to go through the master; every segment can load in parallel. And of course, if you aim to utilize 100% of your resources, you have to have some kind of resource management in place. So we can define how many users can run queries, and how much of the CPU one specific user or group of users can utilize. Good. Any questions so far?
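A hedged sketch of what the storage, partitioning, and resource-management features just described look like as Greenplum DDL — all table, role, and queue names here are invented, and exact option spellings may vary between versions:

```sql
-- Column-oriented, append-only storage: every column of the table is
-- stored in its own file, so reading 3 of 500 columns touches 3 files.
CREATE TABLE sales_facts (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
    -- ... imagine a few hundred more columns here
)
WITH (appendonly = true, orientation = column)
-- Partitioning handled by the database itself, no triggers needed:
PARTITION BY RANGE (sale_date)
(
    START (date '2016-01-01') INCLUSIVE
    END   (date '2017-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- Partitions are managed with plain ALTER TABLE:
ALTER TABLE sales_facts ADD PARTITION
    START (date '2017-01-01') INCLUSIVE
    END   (date '2017-02-01') EXCLUSIVE;

-- Resource management: cap how many concurrent statements a group of
-- users may run, then attach a user to that queue.
CREATE RESOURCE QUEUE reporting_queue WITH (ACTIVE_STATEMENTS = 5);
ALTER ROLE report_user RESOURCE QUEUE reporting_queue;
```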
As I said before, Greenplum was created, or founded, around 2004 — late 2003. This was when Postgres 7.4 was released, a very long time ago. Anyone here ever used a 7.x release? A couple of you. It was merged out of two companies in the DC area but quickly moved over to Silicon Valley. We also had one of the chief architects from Teradata as one of the first employees in Greenplum. He basically designed what Greenplum is today. Okay, moving on: Postgres 8.0 was released. Anyone remember why Postgres jumped from 7.x to 8.x with the version numbers? Because of the Windows port. Then we got 8.1, and you see at the top that nothing really happened on the Greenplum side, but the company grew. Postgres 8.2 was released — we're in 2006 now. Then finally Greenplum came out with version 3, and that's about the time they stopped merging with Postgres. So from 7.4, where they started developing Greenplum, up to 8.2, where we are in 2007, they always merged the latest version — and then they decided to stop. We'll come back to this fact in a few moments. Greenplum 3.x also got us a cooperation with Sun. Back then, remember, we had a company called Sun, which today belongs to Oracle. To this day we have customers running Greenplum on Sun Solaris on Oracle hardware. For some reason they like it; can't understand why. Greenplum 3.x also got what we call logical replication. So back then we already had a need for replication in the database — and if you look at the timeline, Postgres got replication about three years later, with version 9. So in 2009, Postgres 8.4 was released. In 2010, Greenplum 4.0 was released, and this was the time the company was bought, or acquired, by EMC. And now Postgres finally got replication in place. By the time Greenplum was acquired by EMC, we already had offices around the world and a few hundred employees. It also got us things like what we call the Data Computing Appliance: EMC provides a rack of servers with the Greenplum software on top of it, along with support and everything.
Greenplum 4.0 changed replication. We have a file-based, or file-system-based, replication now: basically, every I/O you do on a segment is transferred, or mirrored, to that segment's mirror. We also got something which is called the Data Science Team; more about this in a moment. Moving on in the timeline, we are in 2011 now. We got two new versions of Greenplum, 4.1 and 4.2, and Postgres 9.1. We released something called MADlib, which is a package of analytics functions. It works on both Greenplum and Postgres — you can go to the MADlib website, download a package for Postgres, install it, and you get all kinds of advanced analytics functionality which you can run on your tables, on your data, in your database. And if you look at the Google Trends line for data science, it really took off at the end of 2012, beginning of 2013, and at some point it was called the hottest new job in Silicon Valley. 2011 is also the time Hadoop took off. I don't know how many of you ever used or worked with Hadoop. Okay, I like it. It's also the time we made a big interconnect change: we switched our interconnect to UDP, because we figured out that with a few hundred servers and a few thousand segment databases, TCP doesn't scale well anymore. So we had to do something different, and we decided to run the interconnect over UDP with some kind of packet verification on top. If you know UDP, you know that packets can be lost. On the other hand, we have a stable internal network where we expect not to lose many packets, so UDP works quite well for us. In 2012, we got Postgres 9.2; in 2013, Postgres 9.3. And that's about the time EMC and VMware decided to spin off all of the software companies they had acquired over time. You know, EMC is a hardware company, well known for storage and data centers. VMware is a virtualization company. All this acquired software didn't really fit into these two companies, so they decided to create something new. It's called Pivotal.
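The MADlib functions mentioned above are called as ordinary SQL. A hedged sketch, assuming MADlib's linear-regression module as the example — the table, column, and output names are invented:

```sql
-- Train a linear-regression model directly on a table in the database.
SELECT madlib.linregr_train(
    'houses',          -- source table
    'houses_model',    -- output table holding the fitted model
    'price',           -- dependent variable
    'ARRAY[1, size_sqm, num_rooms]'  -- independent variables (1 = intercept)
);

-- Inspect the fitted coefficients and goodness of fit.
SELECT coef, r2 FROM houses_model;
```

The point of the design is that the data never leaves the database: on Greenplum, the training runs in parallel across all segments.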
And then, beginning of 2015, here we are. Pivotal announced: everything we have — GemFire, which is an in-memory database; Greenplum; all the Hadoop products we have; Spring, the Java framework; and a few more things — everything goes open source. It's a very huge effort. This was announced in February in San Francisco. And looking at Greenplum: well, between February and the next milestone, which was in October in Vienna — what happened in between? First of all, a number of new people joined our company. You might have heard about Heikki from Finland; he's a well-known Postgres contributor. Dev Kramer joined our company, and Daniel Gustafsson from Sweden, who is another of the board of directors of PostgreSQL Europe. Adri Sharma joined, based in India, and a few more. And this team went on and made sure that we could finally release Greenplum as open source. The first product Pivotal released was Apache Geode, previously known as GemFire. It went to Apache as a new incubating project and got the name Geode. Then in October, we released Greenplum Database, and in January of this year, we released Apache HAWQ. HAWQ is basically the Postgres SQL engine running on Hadoop: they ripped out the entire storage layer and replaced it with HDFS. Yet another cool way to use the Postgres code for something new. So this is the commit history on GitHub for Greenplum — I don't know if you can see the numbers. It stops somewhere in 2007, which is where 8.2 was released, and it starts again in late 2015. That's the big jump on the right side. And this is possible because Heikki used the Postgres 8.2 base and merged everything from Postgres on top of it. Basically, what you see here is the history of Postgres until Greenplum forked off, and then in 2015 it starts again with the new development we have. It also means we have quite a number of well-known committers in Greenplum. First of all, on the top left, is Bruce Momjian.
And a few other well-known people contributed to Greenplum as well, because of this Postgres history we have in place. So what are we doing now? The next step, which we are working on right now, is merging the more recent Postgres versions — 8.3 being the first one, and it's already almost done. Then we move on until we catch up with a more recent Postgres version. It doesn't mean that Postgres and Greenplum will merge into one product at some point, but it means that the two products will stay as close as possible. Good. So why did we even consider going open source? Here's a quote from one of our top execs: opening things up — not only Greenplum, but every product we have — is incredibly healthy for the company. It forced us to reconsider all of our products, and it's quite a number of products. It really forced us to clean up everything: code, version control, testing, wherever we host something, governance of a project — everything across the board. We had to reconsider everything. And now we are in a situation where it's much easier for us to actually communicate with people outside of our own company. Customers, or other developers who want to join the project, know that everything we have is open source. It's much easier for us to communicate with everyone. One of the reasons — probably the main driving force — for going open source is that customers more and more request that a product they want to buy, or want to buy support for, is open source. If you looked 10 years ago, the usual checkbox on an RFP was: does it support Oracle, or does it support IBM? These days it's: is it open source? If you're not open source, well, you are out. That's the main driving force in this market these days. Customers do not want to be locked into any vendor.
They want a choice: whenever something goes wrong with one vendor, they want to take their product and their applications, go to another company, and get help there. Pivotal itself — the name comes from a company acquired by EMC called Pivotal Labs. These people live open source; it's basically a consulting company. So we had another major driving force inside the company to go open source with everything we have. Not a bad thing to have. Okay, what happened along the way? How much time do you have? One of the first things we had to consider was the license. It was actually a major roadblock for us. We couldn't use the Postgres license, because the Postgres license has no patent clause, and EMC and VMware over time had acquired quite a number of patents on technology in Greenplum. Just using the Postgres license would put everyone into a situation where they can't really use Greenplum, because there are patents on the code. And in five years, maybe someone acquires Pivotal. Remember what happened with Sun and Oracle? The moment Oracle acquired Sun, the lawyers went in and started that whole mess around Java with Google. No one wanted to repeat this scenario. In the end, after lengthy telephone calls and discussions with lawyers, we came up with the choice of the Apache license for Greenplum, because the Apache license is compatible with the Postgres license and it has a patent clause in it. So we can't use any of the patents we have in Greenplum against anyone using Greenplum, which I think is good for everyone. The other major choice we had to make was the project owner — how do we steer this project in the future? One of the first steps we took was to approach the Postgres project and ask if they would take on this project, but they refused. Postgres is a single-product community, and they don't want to take on yet another product, which is okay.
The other choice we had was: do we want to bring this project into the Apache Software Foundation, like we did for a number of other products — GemFire, HAWQ, and so on? In the end, we decided against it, because there's not much communication between Postgres and Apache. If this were an Apache project, there would basically be no more communication, no more code exchange — nothing between Postgres and Greenplum. So we ended up steering the project on our own, for now. Another big challenge: if you have a closed-source product, you don't care about customer names in your source code, in comments. If you want to go open source, you need to get rid of all of these names in the source code. And with 1,000-plus customers, that's quite an effort. We spent several weeks going over the source code, finding all possible spellings of customer names, and we removed everything. Another problem: everything was tied into our VPN, our internal network. Build system, test system, code repository — everything was only accessible from inside our own VPN. Even moving everything to GitHub was a challenge, because now all the test systems, the bug tracker, and so on needed to access GitHub and no longer our own internal systems. And we didn't have just one internal code management system — we had quite a number of them, Perforce being one. I don't know if anyone here has ever worked with Perforce. I know your pain. So we had to clean up all of this and move everything to GitHub. We did it in a way that everything was moved to GitHub beforehand, but the repository was private, and then in Vienna we just switched it to public. Test systems — Ashwin can maybe tell you more about this than I can, but how many different test systems did we have? Four or five? In Postgres, if you run make check, it runs one regression suite.
In Greenplum, we had four or five different test systems in place, plus some external systems picking up every code change and running them. At some point, it took more than 24 hours to run a full test cycle. Obviously, that's not very good if you want quick turnaround, because you have to wait more than a day to get a result back. So we had to work on this. Heikki was one of the main driving forces here and fixed quite a lot of it. He got rid of the long-running tests and broke down many of the big tests we had. For some tests we had gigabytes of test data in our repositories — and as an external developer, you don't want to download 20 gigabytes of test data just to run one test. And of course, all of it was only accessible inside the Pivotal network. We are now at, what, around one hour of test time if you want to run a regression test in Greenplum? It can still be improved, but we are quite happy with the current situation. CVEs. We had to follow up on every incident known for Postgres. We actually have a security engineer just doing this. So we followed up on every single CVE for Postgres, starting with the early 7.x releases up to 9.5, and made sure it's either fixed in Greenplum or we are not affected. That was quite an amount of work. We also used two external code-scanning products to make sure we have no — or at least not many — bugs in the Postgres code inside Greenplum. Using those for the first time — I don't know if any of you have ever used one of these services — gives you many false positives back. For example, with the Postgres memory management, there is no free() call for memory you allocate; it's a false positive in both of these tools. So you have to teach the tool: okay, I don't have to free this, my memory management takes care of it. But the first run gives you something like 5,000 false positives. We also used a service called Black Duck. Anyone ever heard of it? It scans for any kind of license violation.
That is, whether you have any license in your code which is not compatible with what you're using. It got us into a situation where we found some code which is GPL — but it's dual-licensed. It's the QuickLZ code, right, Ashwin? Yes, QuickLZ, that's right. It's dual-licensed: you can either buy a license or you have to use the GPL. So we bought a license, but we can't really put that code on GitHub — at least not in binary form. So we had to fix this as well. Then, of course, you have some very caring parent companies, EMC and VMware, where especially EMC is not well known for loving open source. And this entire nine-month effort — you have to split your people's time between working on this open source project and working on regular customer requests and new features. You can't just sit down and spend nine months making everything ready for open source; you have to make some money as well. And last but not least, there's also some tension between Greenplum and Postgres. The day we released Greenplum as open source in Vienna, a mail was posted on the Postgres hackers mailing list warning everyone that there might be a license issue between Greenplum and Postgres, and please don't use it. Okay — that's a very good start for a project. Fortunately, this was resolved after a few hours, and everyone agreed it's not really an issue, but it was one of the challenges we had to face. Okay, nine months is also not enough to get everything in order, so we still have some open challenges we want to fix. Mainly, our documentation has so far been decoupled from our source tree: we have a dedicated documentation team. That also means for every new feature we write, there must be communication between the developers and the documentation team — new tracker requests get opened, they have to update the documentation.
Our goal is to have all the documentation in the GitHub tree as well, so that developers can provide documentation for every new feature as part of the pull request they submit. We are not there yet. It was also a big culture change, and this one is still ongoing. If you are a company that worked on closed source for six, seven, eight years, it's not that easy to convince everyone that development now happens in the open, in public. So we have internal mailing lists, we have external mailing lists, and quite often we have to steer things from the internal to the external lists and make sure that everything which is not strictly customer-related happens in public — because every developer working on this project from the outside is also interested in seeing what's going on. Have some transparency here. And last but not least: if you want to undertake such an effort, have some good lawyers on your team. I know people don't really like lawyers, but in the end they are really, really helpful, and they have good ideas for how to move things forward. So we had weekly calls before going open source, discussing all kinds of challenges — license issues and so on — until we finally could solve all of them. Good. What did we learn from it? If you want to take home one slide, this is it. Back then, in 2007, people decided to fork Postgres and build their own product. Today, the overwhelming opinion in the company is that it was actually not a good idea, and it would have been much better to just keep merging every version with Postgres. Over time, Postgres evolved and got so many new features we want to have in Greenplum. Some of them we migrated back; some of them we can't really migrate back without major code changes. Having a constant merge with Postgres would have been better in the end. So think twice before you fork. And if you fork a product, make sure you know how you want to steer the project.
For us, the Apache Software Foundation was not really a good fit because of the non-existent cooperation between Postgres and Apache, and the Postgres community didn't really want to adopt Greenplum, so we had to come up with our own solution for this. Postgres itself, by the way, is basically run by a public association: the Postgres name is registered in Canada, not in the United States, and this association in Canada — a nonprofit — basically only holds the name and the domain name. Nothing more. You also have to change your development workflow, or at least we had to. Moving everything to GitHub is very cool. On the other hand, we also adopted the Postgres workflow, so people don't do merge commits. If you ever worked with GitHub, you know there's this one big merge button, and it merges everything you did in your own branch into the main branch. We don't want that. As in Postgres, we want only one commit per feature, with one clean commit message — not a pile of commit messages for every single commit you made along the way. Obviously, in the beginning it happened quite often that someone hit this merge button and we had to clean up afterwards. By now, people are used to it and do it the right way. Another thing we missed — and looking back, in retrospect, no one knows why — is that we did not provide a CLA, a Contributor License Agreement. Today, if you submit a pull request to Greenplum for the first time, there's a bot which asks you to please sign a CLA with Pivotal. Or, if you signed it already, you get a nice message that you can keep going. The CLA is not about handing over all your property to Pivotal; it just makes sure that Pivotal can use your code non-exclusively as well. You are still the owner of your code and keep all the rights to it, but Pivotal can also use the code you submit with a pull request to Greenplum, and maybe also contribute it back to Postgres or any other project.
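The "one clean commit per feature" rule described above can be reproduced with git's squash merge instead of GitHub's merge button. A minimal sketch — the branch, file, and commit names are made up for illustration:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.name=dev -c user.email=dev@example.com commit -q --allow-empty -m "initial"

# A feature branch with two messy work-in-progress commits.
git checkout -q -b feature
echo one  > f.txt && git add f.txt
git -c user.name=dev -c user.email=dev@example.com commit -q -m "wip 1"
echo two >> f.txt && git add f.txt
git -c user.name=dev -c user.email=dev@example.com commit -q -m "wip 2"

# Back on the main branch: squash the whole branch into ONE staged change,
# then record it as a single, cleanly described commit.
git checkout -q -
git merge --squash -q feature
git -c user.name=dev -c user.email=dev@example.com commit -q -m "Add feature X (one clean commit)"

git rev-list --count HEAD    # prints 2: the initial commit plus the feature
```

The history on the main branch now contains one commit for the whole feature, with one message, which is exactly what the Postgres-style workflow asks for.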
It's just a legal construct to make sure we can use it. Three other small things we missed, or barely missed, as well. We had the greenplum.org domain the whole time — just no one knew how to get at it. It was held somewhere in Google Domains, no one had access to it, and it took weeks to get it back. Literally a few days before releasing Greenplum as open source, we got the domain back; we had already started making alternative plans in case we didn't. We got a website up — a very simple one, but at least we had one in time. We had mailing lists up and running by the time we had the domain back; we also had to make alternative plans for the mailing lists in case we didn't have the domain. And even so, after creating quite a number of mailing lists, we found out we were still missing one or two. And then, again: talk to your lawyers. No one likes it, but it's really, really helpful. So, very early on, when we decided we wanted to go open source, we had a choice between PGConf Europe in Vienna and PGConf Silicon Valley two weeks later. We decided to go for Vienna, which gave us a fixed timeline — which is good and which is bad, because you're locked into this timeline: everything must be ready by then. And then, we had some nice swag. We had a nice booth in Vienna with some nice stuff — if you go out to our booth here in the hallway, we have these nice giveaways, and we've got t-shirts and everything to attract people and make Greenplum more known to the audience. Good. Any questions? Quite a number of people helped me create this talk and gave me input on it. And did you already book your hotel for the next Postgres conference in Europe? We are going to a nice city in Eastern Europe: Tallinn, in Estonia. It's worth a visit. Thank you.