Hello everyone. My name is Roman Shaposhnik and I work at Pivotal. I also happen to be a co-organizer of this dev room, which, by the way, was my sneaky plan to get the talk accepted, because I really think that what I'll be talking about today is quite interesting. Before I do that, let me introduce myself a little bit more. My involvement with big data goes all the way back to the original Hadoop team at Yahoo, which I joined in 2010, and since then I've managed to work at Cloudera and Pivotal, basically companies that do data, that do Hadoop.

And Hadoop was interesting, because when it first appeared everybody felt it would totally replace traditional data warehouse systems, and for good reason: Hadoop had a lot going for it. It was open source, it was scalable, it was developer friendly. It was basically a really good system, and for some time it really seemed like that was going to happen. But recently an interesting trend has started to appear: more and more enterprises are using Hadoop in addition to their data warehouse systems, not replacing them but augmenting them. At first glance this looks like a very strange architecture, because why would you have two different systems doing similar things? Why not replace them with just Hadoop and have the promised enterprise data lake? So today we'll be talking about why those architectures actually make sense and what drives them.

The first time I saw this type of architecture being built, I went, huh, what are you doing? This slide was presented publicly, so I can steal it instead of drawing my own. It was at an IBM conference, and the presenter was from Seagate, so not even a customer I was dealing with. He was explaining how Seagate deals with the Internet of Things: they manufacture a lot of hard drives, and they want to collect information from those drives. They had a whole bunch of Hadoop clusters on the periphery, aggregating all of the information, but then right in the middle they had a traditional enterprise data warehouse taking care of the important data points. And I was like, what? Why not Hadoop? Then we had a couple more customers like that at Pivotal, and the architecture that emerged across that group of customers is basically this one. I would love it if you challenged me on this during the Q&A, and maybe you can tell me how wrong it is, but this is what I'm actually seeing in the field.

Almost all of the customers I'll be talking about are building backend architectures to support data-driven applications. What I mean by that is that they have to optimize the relationship they have with their customers and the devices, or whatever it is they're managing. So there's typically a bunch of either web or IoT traffic coming into a data center, sometimes even multiple data centers; that's on the left. And on the right you have a bunch of users using the application from mobile and desktop and whatnot. So the question becomes: what do you build in the data center as a complete end-to-end architecture, as opposed to just a data science piece or a data management piece?
What we typically see is this. Some of it, again, is just me throwing names around: you can use different projects within the Apache ecosystem, but the names give you a flavor of what typically gets used. Typically, all of these events get represented as Kafka queues. The saying goes that whatever is on the outside of the data center must look like a Kafka queue inside the data center. Then you have some kind of ETL 2.0, and what I mean by that is that it's not ETL that takes data from one existing place, does something to it, and puts it into a different place; it's ETL that operates on data in flight. You basically have a bunch of ETL functions running on some kind of substrate. NiFi, from the presentation right before mine, is a good example of something like that; Pivotal has Spring Cloud Data Flow, which can do some of that too.

What gets done here is that you essentially split the traffic coming into your data center into multiple streams. First, some of that traffic goes straight into HDFS: you just dump it there, leave it there, and something will happen to it later on. Then some of that traffic goes into the in-memory data grid, and that's the thing that connects your application to the data that's in flight right now: whatever is being produced right now, you get it through that layer. But then, and this is the interesting part, some of that data goes into a more traditional MPP type of solution, and the good news is that you have an open source one in Greenplum today.

The three-way split is interesting because of what happens next: within all of these systems you have people, typically data scientists or data engineers, who maintain data models based on the raw data that gets into each system. And you only get as much usability out of the data as the interesting data models you can feed back into the application. These are the people maintaining those models, and yet a lot of the time, at least in the customers I see, even they push those data models back into the middle layer, the traditional MPP. So the question is: why?

Before we can answer that, let me repeat that what we're building here is not just a data science piece. We're not doing it just for, say, life sciences trying to analyze human DNA. We're building an end-to-end architecture for a modern application, the kind you get from Uber or Netflix or Airbnb. For all of these companies, the data they collect must manifest itself in the application; that's the end game for all of them. For Uber it's tracking and predicting and doing surge pricing; Netflix must recommend a new movie to you. All of them are building this end-to-end architecture. So can we just do it with Hadoop? Again, hopefully during the Q&A you'll try to convince me that maybe we can, but here's what I've seen so far talking to customers, because this is a pretty pragmatic presentation I'm trying to give.
The first answer is: yes, absolutely. But then the customer turns around and asks, what do you mean by Hadoop? What is Hadoop these days, really? If you're just talking about some kind of scale-out storage, a file system like HDFS, that's awesome; we can totally use HDFS. Maybe you're also talking about a scheduling framework, something like YARN; that's fine. But then there are all these other questions.

A lot of the customers building these types of applications right now are very traditional enterprises, so they do a lot of what's known as BI, business intelligence. And BI is typically done through tooling that expects a SQL database on the other end. The most traditional example of such tooling is Tableau, a very well-established UI for doing BI. And there's nothing in the Hadoop ecosystem that you can hook Tableau up to right away; you have to build additional bits and pieces for that.

The next point is also interesting, because a lot of these customers tell me: it's awesome that the whole big data community is talking about machine learning in Scala on Spark, and that's great for the five people who know how to do it. The rest of us, who are still stuck with SQL and stored procedures, want it given to us too; we want to contribute to the machine learning as much as the folks doing Scala and Java and whatnot. So I tend to call this democratized BI and machine learning. If you can open it up to people who have the old-school skill sets, the first thing you're doing is getting them into the game. You can't really put a barrier in front of them and say: you've got to learn Scala, you've got to learn Spark, to be productive. You have to invite them in, extend a hand, and say: yes, you can be productive, and I'll show you how, and maybe gradually you'll build out your skill set.

Once you do that, the next question becomes: now I have not just five people in my company who know how to do Spark and Scala, I have hundreds of people who are very comfortable doing BI; what happens if I open up the system to all of them to hammer on it at once? Can it actually handle the load? And the last one: if you're building an application, you need something you can hook up as a traditional in-memory scale-out layer, and a SQL database is typically a good option there. Again, you can do HBase or something like that, but then you're assembling bits and pieces in a very ad hoc manner instead of having a central piece of your architecture that can answer a lot of these questions, as opposed to doing just one thing, like many Hadoop ecosystem projects do.

To summarize what I'm trying to say: after working with Hadoop for what would be 10 years by now, I can really vouch for this. Hadoop is really great for the Hadoop ecosystem. It's really great for elastic storage capacity; HDFS is pretty well debugged and awesome at this point, with all the bells and whistles like high availability.
You can configure it to be pretty fault-tolerant. So if you want to just dump data someplace very quickly, remember that bottom arrow on my slide, where the data comes in and you need to store it for later processing: HDFS is great for that. It's great for landing data, it's great for discovery, it's great for trying to figure out what's inside your data. And that is deliberately a job for only a couple of people in the enterprise, and here's why: all of these enterprises are bound by regulations, and by internal rules about what data can be exposed and how. If all you get is raw streams of data coming in, you typically want to limit the number of people who can explore that data. You want that exploration to happen within a very small group, so that later on you can open it up, but open it up in a meaningful way. Maybe you'll mask some of the data fields, maybe you'll transform the data into something else, maybe you'll put particular ACLs in place to make it more accessible. But the initial exploration, the discovery that all of a sudden the log files streaming through your application contain personally identifiable information, cannot be overlooked. And if you want to do sophisticated machine learning, the Hadoop ecosystem is great.

Complementary to that today is the traditional MPP, which stands for massively parallel processing, if you were wondering. It's really great for schema on write, which means that once you know what the data looks like, what the data you're dealing with actually is, you want it to be available to as many people as possible. You create all these views and tables that a very traditional BI person would expect, and in doing so you make the data known. You get an API between you, who maintain the data infrastructure, and the people who will be analyzing the data and creating the data model, and it's a very well-understood API: a bunch of views, sometimes a bunch of tables. Transactionality sometimes comes into the picture too, because it's easier to do transactions on MPP, and to some customers that's actually meaningful. And the last one, for those of you who can't see it, I'll read the bottom line: democratized machine learning. You can all of a sudden use R, and you can all of a sudden use your traditional SQL-style tools, which I'll cover in a bit.

So what do you do all this with? There are maybe a few choices in open source; the one I'll be walking you through today is called Greenplum Database. Greenplum Database is a proud member of the Postgres family of relational databases. It has a long timeline, but there are a couple of points I want to call out. Postgres was created a long time ago, really long ago, in '86. Greenplum Database was created based on Postgres 8, around 2005. Interestingly enough, that was exactly the year Hadoop was created: Hadoop started around 2005 as a sub-project of Nutch, 2006 is when Yahoo got really interested in it, and the rest is history. But the point I want to make is that we open sourced Greenplum Database in 2015, and we even rebased it on Postgres 8.4.
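Coming back to the schema-on-write handoff for a moment: here is a minimal SQL sketch of the kind of "API" I mean, where the small exploration group publishes a masked view on top of raw landed data. The table, column, and role names are purely illustrative assumptions, not anything from a real customer.

    -- Raw landed data, visible only to the small exploration group.
    CREATE TABLE raw_clickstream (
        event_time  timestamp,
        user_email  text,       -- PII: must not be exposed as-is
        page_url    text,
        session_id  text
    );

    -- The "API" exposed to the wider BI audience: PII masked,
    -- only the fields the downstream data models actually need.
    CREATE VIEW bi_clickstream AS
    SELECT event_time,
           md5(user_email) AS user_key,   -- mask the identifier
           page_url,
           session_id
    FROM   raw_clickstream;

    -- Open it up to the hundreds of BI users, not the raw table.
    GRANT SELECT ON bi_clickstream TO bi_analysts;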
Now, that rebase means Greenplum is still not on Postgres 9, and we're working on that, but at least it's close enough to the kind of Postgres you would expect. Very quickly: Greenplum Database is an MPP project that has been in development for more than 10 years. If you want to participate, your best option is www.greenplum.org; all of the community aspects are on that website.

If you're wondering what Greenplum is and what it does, think of it as a sharded Postgres. You basically have a bunch of what we call segments, and these segments are essentially individual Postgres databases. You also have a master node that coordinates things like query planning and what data needs to go where. The mental model, a very simplified mental model, is a traditional Postgres database in which every table you create has a column that you designate for sharding. Depending on the hash of the value in that column, your data ends up in different segments on different hosts. You still get a view as though it were a database on a single system. It's a traditional Postgres database, so anything that works with Postgres works with Greenplum: you can take your traditional psql command, connect to the master, and query the database just like you would with Postgres. It's a shared-nothing architecture, so, like I said, the master just does coordination, and a lot of the time we use a high-speed interconnect for pipelining; the segments talk to each other through that interconnect. Greenplum is a software-only solution. If you've seen Greenplum in its hardware incarnation, you might have seen networking that was specifically optimized for it. That helps, but you don't have to have it: you can just launch Greenplum on Amazon or Azure or in your own data center, and that's just fine.

A couple of other points. Greenplum has a concept that is now known in Postgres as foreign data wrappers, FDWs. In Greenplum it's slightly different, because when Greenplum got it, Postgres didn't have FDWs yet. But Greenplum can connect to a lot of data sources: it can represent as an external table something that's on S3, or in files on your file system, or even in HDFS itself. And that is exactly how we connect it to the Hadoop ecosystem: the connection to HDFS is the Hadoop ecosystem connection.

Another interesting piece that's part of Greenplum is ORCA, a query optimizer that plans queries knowing it's doing so for an MPP type of system, and that optimizes for long-running queries. It's a pretty sophisticated piece of software, based on scientific research that has been going on for at least five years. It's a standalone component, so it can be plugged into any kind of database. Conceptually, at least, it's easier to plug into the Postgres family of databases, but you could plug it into any database, because it's implemented as a separate standalone service that does the query planning for you.
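Going back to the sharded-Postgres and external-table mental model for a second, here is a minimal sketch in Greenplum's SQL dialect. The DISTRIBUTED BY clause is standard Greenplum; the gphdfs URL, host, and table names are illustrative assumptions, and the exact external-table protocol and options depend on your Greenplum version.

    -- A sharded table: rows are hashed on user_id and spread
    -- across the segment databases.
    CREATE TABLE events (
        user_id    bigint,
        event_time timestamp,
        payload    text
    ) DISTRIBUTED BY (user_id);

    -- An external table over raw files landed in HDFS; scanning it
    -- pulls the data through the segments in parallel.
    CREATE EXTERNAL TABLE ext_raw_events (
        user_id    bigint,
        event_time timestamp,
        payload    text
    )
    LOCATION ('gphdfs://namenode:8020/landing/events/')
    FORMAT 'CSV';

    -- "Syncing" a curated data model into Greenplum proper,
    -- where BI users and MADlib can get at it.
    CREATE TABLE curated_events AS
    SELECT * FROM ext_raw_events WHERE payload IS NOT NULL;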
Speaking of democratized machine learning: Greenplum comes with a machine learning library called MADlib. It's an interesting one, because it does machine learning as though you were querying tables via traditional SQL. For example, if we want to do a linear regression and train our model, instead of writing any kind of code we just invoke one of the pre-canned functions that MADlib, as a library, gives us. The invocation looks like a select statement; that's the training of the model, and we're basically passing tables to it. The output we get is the trained model, and the way we then use that model is again through a similar select statement. So all we're doing is calling functions that the MADlib installed on our cluster gives us through the SQL interface.

The flow between MADlib and the rest of the system is actually pretty simple. It's kind of like MapReduce without the MapReduce part being exposed to you. You have a client, which, like I said, could be anything: Jupyter, or Apache Zeppelin if you're doing machine learning; psql works just fine, so you can run all of these examples in psql. The SQL client talks to the database server, and then the interesting stuff begins on the master. You end up with a stored procedure being called on the master, a whole bunch of the same stored procedures being called on the segments, and then aggregation happening if it needs to happen. So it's like fanning out and then aggregating. There is also a way for MADlib to iterate over the data and run algorithms that need to converge. If convergence needs to happen, it happens through the master; that's the only bottleneck you have, because the segments can't talk to each other directly in this flow. But convergence still works: if you're converging on a value that's reasonably small in how you represent it, it works pretty well, and that's actually how we do some of the graph algorithms within MADlib.

MADlib is interesting because it's not just Greenplum: we have people from the Postgres community using MADlib to do machine learning within a Postgres installation itself. Even if you're running a single-node Postgres with a whole bunch of data locked into it, and you want to run a simple linear regression, or some of the graph algorithms we're developing right now, you can do it with MADlib.

And just so you understand what's going on, the pieces you'd be dealing with are these. At the very bottom you have a C API that Postgres exposes to anything that happens to be a function. Then you have a bunch of low-level abstractions; that's just the mechanics of how MADlib itself is implemented. And then there's a whole set of traditional machine learning functions implemented on top of those abstractions. The idea is simple: the Postgres segment gives you the data in the form of tuples, think of it as an array of data you can read through, and these functions do all of the work.
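To make that "machine learning as select statements" idea concrete, here is a sketch of the linear-regression flow, following the canonical pattern from the MADlib documentation. The houses table and its columns come from that documentation's running example, not from this talk.

    -- Train: the source table is scanned in parallel on the segments,
    -- and the fitted coefficients land in an output table.
    SELECT madlib.linregr_train(
        'houses',                      -- source table
        'houses_linregr',              -- output table holding the model
        'price',                       -- dependent variable
        'ARRAY[1, tax, bath, size]'    -- independent variables
    );

    -- Predict: apply the stored coefficients row by row,
    -- again through an ordinary select statement.
    SELECT h.id,
           h.price,
           madlib.linregr_predict(m.coef,
                                  ARRAY[1, h.tax, h.bath, h.size]) AS predicted
    FROM   houses h, houses_linregr m;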
Back to the layering: if anything needs to happen between different functions running on different segments, it happens through those low-level abstractions. On top of all this, there's a whole set of libraries we've written for MADlib. Python is one of them: there's an integration between MADlib and Python, so you can do a lot of traditional machine learning while talking to a back end that is a cluster. Another example at this layer is R. A lot of the time we have people calling MADlib functions through the R interface, because we provide an integration where a MADlib function looks like an R function. Of course, at this level you don't want a lot of data being transferred back and forth; you're basically programming the cluster and telling the cluster what to do.

So that's MADlib. It's a pretty useful, standalone piece, and it's actually an Apache project now: it was transferred into the Apache Software Foundation and is incubating there. If you're interested, we'd be more than happy to welcome you as a contributor; there's quite a lot of interesting stuff to do. Greenplum itself is not an Apache project; it's under the Apache license, but it's a standalone project. Go to www.greenplum.org.

Then the only question remaining is how to put it all together, and since I don't have much time left, I'll just say that the answer is Apache Bigtop. As Olaf was pointing out, Apache Bigtop is essentially what Debian was to Linux: you had the GNU software and the Linux kernel, and Debian was the first distribution that put it all together, and then a whole bunch of secondary distributions got created based on Debian. Bigtop is trying to do that for big data. We have the Hadoop ecosystem, and not just the Hadoop ecosystem now; we also have things like Greenplum, which I've been talking about today. There's Bigtop, and then a whole bunch of distributions that use Bigtop to create the products they give to their customers.

So far, we've laid the groundwork in the Bigtop community to integrate Greenplum into the Hadoop ecosystem, to enable the kind of architecture I've been talking about today. The basic functionality, packaging, deployment, and Docker orchestration is there: you can get RPM and DEB packages for Greenplum from the Bigtop community, you can deploy them on your cluster using the Puppet code that's provided, and there are Docker containers for that stuff as well. There's basic integration with MADlib. Christian, who is sitting here, did a demo, I think at the last FOSDEM or maybe some other conference, where we demonstrated how you can use MADlib through Apache Zeppelin, which is a really nice tool similar to Jupyter from the Python community. It's very well known in the Apache big data ecosystem, and if you're doing machine learning and want notebooks for data scientists, it's a pretty useful tool. We're also interested in some of the Juju deployment work; rudimentary capabilities are there, but if you're interested, talk to me after this presentation. And the HDFS integration is there.
You can get data in and out of HDFS as external tables, but it's slow, and we're trying to optimize it now; again, talk to me after the presentation. Here's the stuff we're interested in doing next, and if you want additional items on this list, be our guest; it can be as much as you could possibly think of. I'm particularly interested in these two. Postgres 9 is one of the big deals we have to accomplish. I don't think we'll accomplish it the same way we did with Postgres 8, where Heikki, for those of you who know him, one of the core Postgres guys, just did a huge rebase of the entire code base onto Postgres 8. Postgres 9 is too big a chunk for us to bite off that way, so we'll probably backport features from Postgres 9 one feature at a time. And that's exactly where your feedback would be extremely useful, because we need to know which features to prioritize for backporting. For example, we know that binary JSON (JSONB) is super interesting, and the only reason we know is that people talked to us and told us so. The other item is where I think the really interesting integration between the Hadoop ecosystem and Greenplum will begin: we want to make Greenplum a full-fledged member of the Hadoop ecosystem, so that all of the tools in that ecosystem can benefit from Greenplum and vice versa.

So that's it; that's all I have. Let's try to build it together, if this sounds at all interesting to you. I'll leave you with this quote and open it up for questions.

[Audience question about handling unstructured data, partly inaudible.] I can't really, no. So that's what I'm saying: we basically have to rely on that ETL in the middle; remember the architecture I was showing you. We do handle it, but in a different way. The question is how we deal with unstructured data, and in the architectures I'm seeing, you deal with it in one of two places. One is the ETL layer: your ETL puts some structure on the data, and that's for when you know how to do it, when you know that certain fields can be extracted in flight right away. You extract those fields and put them directly into Greenplum Database in the form of tables. The other way is to keep your unstructured data at the HDFS level, where people build exploratory data models, models that are constantly being tweaked, and then export them and make them available to a bigger audience within the enterprise by syncing those data models up as tables at the Greenplum Database level, so that you can pick them up through MADlib and all these other tools. That's how we deal with unstructured data. And PXF, one of the integration features I was talking about, is again aimed at helping us deal with unstructured data as quickly as possible, so to speak.

[Audience question about failover.] Right, so there is support for that. Again, I obviously couldn't go into great detail about Greenplum's architecture, but the question is whether we support failover, and yes, we do. You can have two masters in different configurations, so you would basically have two nodes. It's similar to how HDFS does HA, if you know that. [Brief inaudible exchange with the audience.]
So again, like I'm saying, it's like HDFS, where you have two name nodes; it's the same approach. Yes. Okay. The next question is whether Greenplum is similar to CitusDB. So, Greenplum and Citus take slightly different approaches to how they deal with Postgres, and you get different design constraints as a result. Citus made the decision to be essentially a plugin for Postgres, which makes it super easy to instantiate a Citus cluster: you just enable the plugin on a bunch of Postgres nodes and you get a Citus cluster. But on the flip side, you're constrained by whatever Postgres gives you. For example, Greenplum invests a lot in optimizing the interconnect; Citus doesn't really have the ability to do those types of optimizations. So it is similar, but performance-wise we're typically faster on the benchmarks I've seen. That doesn't mean we're always faster, because it depends on the workload, but we do get more room to optimize, because we don't depend on Postgres as much as Citus does. That's the quickest way I can answer it; let's take the rest offline. We don't have time for another question, so please thank our speaker again. Thank you.