Good evening, everyone. Can you hear me? Okay. I appreciate that you stayed for the late presentation, and I do realize that I'm the last person standing between you and the reception, and the beer I guess, so I'll try to keep it short. I'm an engineer at Pivotal, where I focus on big data and distributed systems. I'm privileged to spend a good amount of time working on open source projects; I'm an Apache committer and a PMC member for the Apache Crunch project.

Today I'm going to talk about two interesting trends. One has happened over the last decade, the last ten years: NoSQL and big data. The other is the implementation of SQL interfaces on top of various NoSQL data systems, which seems to have become a big theme in the last year and a half. Here is a quote I found from Martin Fowler, from 2012, where he predicts that if SQL ever gets implemented on top of established NoSQL stores, this will at least generate plenty of arguments. I believe this talk is going to be one of those arguments.

A little bit of context: the big data and NoSQL big bang. This is a landscape of big data technologies from the last year. According to some sources, we have over 150 commercially supported NoSQL and big data systems out there. What explains this explosion, considering that for decades we have been happy, or at least some people have been happy, with relational databases? You can't tell just from the names: "NoSQL" and "big data" really don't provide any prescriptive definitions. But if you think of them more as a movement and try to understand the driving forces behind them, it helps to position them and to understand why they are so vibrant and what makes them tick.

In the first place, and I think there is some consensus about this, one of the main driving forces behind the boom of these technologies is the boom of the Internet itself. The explosion of the web, mobile technologies and the Internet of Things means that practically any device, even the smallest one, is now a source of information and generates information. This in turn creates three main challenges: how to handle the volume of data, the velocity of data, and the variety of all these data sources. The Internet is the cause of these challenges, but it is also part of the solution: in order to scale and address them, you are again going to use distributed systems. And using distributed systems means that some of the technologies and guarantees that relational databases presume, like ACID transactions and two-phase commit, are challenged and have to be addressed in a different way. It is known, for example, that two-phase commit can hang in certain failure cases in a distributed system. So a whole new class of approaches emerged around the trade-off between consistency and availability under network partitioning, which is the CAP theorem; and since two-phase commit is not really a reliable way to ensure consistency in a distributed environment, a whole class of consensus-based, quorum-based protocols like Paxos has emerged, and many of the NoSQL systems are actually built on those types of techniques.
Another driving force, not as popular but I think very important, is the so-called object-relational impedance mismatch. Those of you who have done some application development know that for some class of applications you just need to persist your application state, which very often is object-based, into a relational database, and for this you need some sort of ORM technology. This mismatch gave birth to technologies like MongoDB and the document-based data stores in general, which is actually a huge group of technologies driven by exactly this problem. Furthermore, there are different types of stores and demands, like graph databases or full-text search, where the traditional, strictly row-based relational model is not very appropriate. There are many other factors, but I think those three are powerful enough to explain why there is such a surge of technologies out there.

The last force I want to mention is the rise of cloud computing. The possibility to program the infrastructure, to automate it, to have infrastructure on demand, is a main driver for eliminating operational complexity and cost. There is also another side effect of this, a more architectural one. There is a sub-movement, I would say, which is the shift from integration databases to application databases. This is very popular in microservice-style applications. The idea is that instead of having a single data store holding the state for your whole distributed application, you have a dedicated store for each application and a well-defined protocol at the application level between the applications, so the database is not shared among them.

I hope those forces explain why there is such a multitude of technologies out there; I don't want to justify it or dive further. The point is that they exist, there are a lot of them, over 150 commercially supported ones as I said, and one of the interesting consequences is that almost any organization ends up with at least a few of them deployed in its data infrastructure, each dedicated to the use cases it is particularly good at. So an interesting question arises: how are they going to integrate those technologies? Integration is a very big deal, and usually the standard ETL technologies and systems try to cope with it, but that's not what I'm going to talk about today. I'm going to talk about how an organization can provide a single, holistic view over data that is spread across different data stores, which is useful for certain use cases, very often analytical or data science workloads that need to assemble training data sets.

Two main trends are emerging that aim, in my opinion, to converge on a more unified view of the data and data processing systems. The first is more functional: the unified programming model. As was mentioned today, there was a very nice presentation comparing the interfaces of Spark and Flink.
You may have noticed that they are very close, and they are not the only two that are similar. Apex, Apache Crunch, Cascading, Apache Beam: they are all inspired by one common paper, Google's FlumeJava paper from 2010 I think, as a style of API, and there is now a trend, at least, of Spark, Flink and Apex converging at a certain level under Apache Beam as a project. The slide shows an example snippet of what Apache Beam looks like as a notation; a hedged reconstruction of that kind of snippet follows this paragraph.

The second trend, and this is what I'm going to focus on today, is that apparently a lot of the NoSQL and big data vendors are starting to implement SQL interfaces, or some sort of SQL-like interfaces, for their backend data stores. Statistics from the last couple of years for Hadoop show that the majority of the jobs run on Hadoop nowadays are Hive jobs or some other SQL-on-Hadoop solution. For Spark, there is a report from last year stating that the most used production component is Spark SQL. The Google F1 paper actually contains a quote saying that any data system has to provide a SQL interface, and I find this particularly telling, because a lot of the big data technologies now in the open source space are influenced by the Google papers. So there is this shift, and it's an interesting movement and trend to observe.

I think it's worth trying to reason about what the main driver for this convergence is. There are a lot of tools out there within organizations that know how to talk SQL, so SQL seems like a pretty easy way to integrate with those existing tools. That's the legacy reason. The second reason, and I think the more important one, is the relational model that sits underneath most SQL engines, and this is the interesting bit we should talk about today. This slide may not look very flashy, but I think it's important: I would argue that practically any useful data system out there implements, in one form or another, set-algebra semantics. Operators like projection and filtering, and, if the system is more advanced, some sort of join or GROUP BY, will be present; in order to be useful, a system has to implement those operators either explicitly or implicitly. Once you acknowledge this, and that very often, in order to implement a pipeline or a query execution statement, you have to chain multiple such operators, you realize that this is practically the same relational algebra concept that is so common in the relational space, and that there are tools that are very good at optimizing such chains. Planners, and indeed relational expression optimizers, are a very desirable feature for many big data technologies, simply because they are very difficult to implement; at least they were until recently. Now there are at least a couple of open source technologies out there that provide some help, and that are worth considering and leveraging in order to provide this kind of relational expression and relational optimization within existing big data systems.
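For readers following the transcript without the slides, here is a minimal sketch of what that Beam notation looks like: a word-count pipeline in the Beam Java SDK. The file paths are placeholders; the point is that the same pipeline definition can be executed by the Spark, Flink or Apex runner.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    // The runner (Spark, Flink, Apex, Direct...) is chosen via pipeline options
    // at launch time; the pipeline itself is runner-agnostic.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("input.txt"))                           // read lines
     .apply(FlatMapElements.into(TypeDescriptors.strings())
         .via((String line) -> Arrays.asList(line.split("\\s+"))))    // split into words
     .apply(Count.perElement())                                       // word -> count
     .apply(MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply(TextIO.write().to("wordcounts"));                         // write output shards

    p.run().waitUntilFinish();
  }
}
```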
What I'm presenting here is, again, just a simple subset of the possible ways to integrate. Pivotal deals with a lot of customers, so I have some firsthand experience with big players in the field, with how they perceive the usefulness of their data and how they try to integrate it. The most common approach seems to be this one, which is the standard federated database approach: you pick one database that allows you to implement connectors to external databases or external data systems, the NoSQL ones, and then you get a kind of single view over your system through external tables, which are representations of the external NoSQL data systems. In this case I have experience with Apache HAWQ, yet another tool derived from Greenplum, which we discussed today and which was itself derived from Postgres. It's an MPP, shared-nothing solution which carries a very similar, very powerful cost-based optimizer inside and, more importantly, provides the PXF framework. PXF is a Java extension framework which allows you to implement plugins for external systems. I call this the one-to-N model, because the organization uses a single SQL connection, usually the Postgres JDBC or ODBC protocol, to talk to one MPP database, and via the extension mechanism is able to see portions of the NoSQL systems behind it.

The second approach is far more interesting in my opinion, because it gives more autonomy to the NoSQL systems themselves: implement a SQL adapter for each system in isolation. For this purpose there is a very powerful framework out there called Apache Calcite. As you can see, in this case each of the NoSQL systems gets its own SQL interface and its own optimization; the advantage is that you can tune the SQL optimizers and the relational algebra expression optimizers to the particular specifics of each NoSQL system. There is also an interesting JIRA ticket that popped up recently which explores the possibility of bridging those two approaches.

Just one slide about the first, federated database approach and how it looks. On the SQL side you create an external table that looks quite standard; the interesting bit is that you provide a LOCATION pointing at the NoSQL data system you want to wrap, and you have to implement three classes: the Fragmenter, the Accessor and the Resolver. The purpose of the Fragmenter, if the data in the NoSQL store can be partitioned into streams that can be processed in parallel, is precisely to establish that partitioning. The Accessor then breaks each stream into a collection of key-value rows, and for each of those rows the Resolver, the last component you implement, converts it into a list of columns matching the table definition. There are many more internals; you can pass statistics, for example, to help the optimizer adjust to the particular data store. This is a very powerful approach if you already have Hadoop and a HAWQ-like system in your infrastructure: you can just implement such plugins and wrap the back-end systems to get a holistic view over them.
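To make that three-class contract concrete, here is a rough skeleton, heavily hedged: it follows the HAWQ-era PXF Java API as best I can reconstruct it, so package names, constructors and method signatures are approximate, and HypotheticalStore stands in for whatever client library the external NoSQL system provides.

```java
import java.util.Iterator;
import java.util.List;

import org.apache.hawq.pxf.api.Fragment;
import org.apache.hawq.pxf.api.Fragmenter;
import org.apache.hawq.pxf.api.OneField;
import org.apache.hawq.pxf.api.OneRow;
import org.apache.hawq.pxf.api.ReadAccessor;
import org.apache.hawq.pxf.api.ReadResolver;
import org.apache.hawq.pxf.api.utilities.InputData;
import org.apache.hawq.pxf.api.utilities.Plugin;

// Matching DDL, approximately:
//   CREATE EXTERNAL TABLE books (id INT, name TEXT)
//   LOCATION ('pxf://host:51200/books?Fragmenter=StoreFragmenter'
//             '&Accessor=StoreAccessor&Resolver=StoreResolver')
//   FORMAT 'CUSTOM' (formatter='pxfwritable_import');

/** Splits the external data set into fragments the MPP segments scan in parallel. */
public class StoreFragmenter extends Fragmenter {
  public StoreFragmenter(InputData input) { super(input); }

  @Override public List<Fragment> getFragments() throws Exception {
    for (String part : HypotheticalStore.partitionsOf(inputData.getDataSource())) {
      // One fragment per native partition, placed on the hosts that hold it.
      fragments.add(new Fragment(part, HypotheticalStore.hostsFor(part), new byte[0]));
    }
    return fragments;
  }
}

/** Iterates one fragment and emits raw key-value rows. */
class StoreAccessor extends Plugin implements ReadAccessor {
  private Iterator<Object> rows;

  public StoreAccessor(InputData input) { super(input); }

  @Override public boolean openForRead() {
    rows = HypotheticalStore.scan(inputData.getDataSource()).iterator();
    return true;
  }

  @Override public OneRow readNextObject() {
    return rows.hasNext() ? new OneRow(null, rows.next()) : null; // null ends the scan
  }

  @Override public void closeForRead() { }
}

/** Converts each raw row into the list of columns declared in the external table. */
class StoreResolver extends Plugin implements ReadResolver {
  public StoreResolver(InputData input) { super(input); }

  @Override public List<OneField> getFields(OneRow row) {
    return HypotheticalStore.toColumns(row.getData()); // one OneField per column
  }
}
```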
The second, more direct approach is to implement a SQL interface around your NoSQL database and leverage a SQL optimizer for it. For this, the Apache Calcite framework provides a SQL query parser, a validator and a query optimizer, which I think is the most important bit; as a bonus you get a JDBC driver through which you can talk to the system. One very important design decision of Calcite is to stay out of the business of how data is stored and processed, which in turn makes it suitable for implementing wrappers around almost any existing data store out there. I think this was by design, and it's a very powerful decision: if you take a look at the various technologies that use Apache Calcite in one way or another, you'll see that most of the big players are already using it inside. What I have been working on is the Apache Geode adapter; Apache Geode is an in-memory data grid, yet another distributed key-value store, and I'm going to use it in some of the examples as a reference to illustrate how this looks.

Assuming you decide to implement a SQL adapter for your backend NoSQL data store using Apache Calcite, there are a couple of decisions you have to make, and they are very important: on one side, how much SQL completeness you are going to expose and how compliant with the SQL standard you want to be; on the other, how much you are going to leverage the native power of the NoSQL system you have.

The first thing to decide is how to convert the data types of the existing NoSQL system, let's say a key-value store or even some sort of graph representation, into the tabular format Calcite expects. Calcite's standard metadata model is a catalog; a schema, which is a collection of tables; a table, which is a collection of rows; and a row, which is just a list of columns with relational data types. This is an important decision: let's say your values are some sort of JSON documents or Java objects with a hierarchy; you have to flatten them into some tabular format, and you have to decide whether you are willing to spend computation and a lot of serialization to achieve this completely, or whether you can afford to expose only the top-level fields, or apply some smartness in between. It's up to you; it's a trade-off between how much of your model you expose and the performance you gain or lose.

The second, more important thing, and this is a general principle for any distributed data system, is to move the computation next to the data, not the other way around. In the context of a SQL query, that means you would like to run the execution of the query on the nodes where the data is stored, rather than moving the data to some central node where the processing happens and then shipping it back and forth. In the context of Apache Calcite you have two ways to achieve this. The first one, I call it the simple one, just allows you to implement a very simple table interface with the ability to push down predicates; predicates here are relational operators like filters and projections. That means if you have SELECT some fields FROM some table WHERE some condition, it really makes sense for those fields and the conditions in the WHERE clause to be pushed down to your NoSQL system, so that it pre-filters and pre-processes, and returns only the amount of data that the rest of the system actually needs.
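Before the walkthrough of the simple scenario, here is a minimal sketch of what that looks like against a hypothetical key-value store. The store client, region name and field names are invented for illustration, but SchemaFactory, AbstractSchema, AbstractTable and ScannableTable are the real Calcite entry points.

```java
import java.util.Collections;
import java.util.Map;

import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.Schema;
import org.apache.calcite.schema.SchemaFactory;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

/** Entry point referenced from the JSON model; "region" is a made-up operand. */
public class StoreSchemaFactory implements SchemaFactory {
  @Override public Schema create(SchemaPlus parent, String name, Map<String, Object> operand) {
    String region = (String) operand.get("region");
    return new AbstractSchema() {
      @Override protected Map<String, Table> getTableMap() {
        return Collections.singletonMap(region, (Table) new StoreTable(region));
      }
    };
  }
}

/** The mapping decision: expose only the top-level fields of the stored values. */
class StoreTable extends AbstractTable implements ScannableTable {
  private final String region;

  StoreTable(String region) { this.region = region; }

  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("ID", SqlTypeName.INTEGER)    // declared column types: Calcite
        .add("NAME", SqlTypeName.VARCHAR)  // validates and plans against these
        .build();
  }

  /** The simple contract: fetch everything; no computation moves to the data. */
  @Override public Enumerable<Object[]> scan(DataContext root) {
    return Linq4j.asEnumerable(HypotheticalStore.readAll(region)); // List<Object[]>
  }
}
```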
I'm going to hurry up. Here is an example of how the simple scenario looks. I'm not sure if it's visible, but this is the Apache Calcite JDBC adapter: when you connect over JDBC, via the Apache Calcite JDBC driver, to your backend system, you have to provide a model in JSON format. The only essential thing that model contains is the entry point, your implementation of the schema factory, with some operands that are relevant for your backend system. The role of this schema factory, usually a one-liner, is to create a schema based on the operands you have provided. The schema in turn depends on the mapping you decided to implement for your backend system and creates the list of tables, and the important thing is that you have to declare the column types within those tables. Then, when a query comes, for example over a table like BookOrder, the scan method is called, which usually means the adapter is going to extract the data set from your NoSQL database and, in the convert step, turn it into the types that comply with the table definitions.

The trick is that in this simple implementation you fetch all the data: there is no moving of computation to the data at all, everything goes to the central processor and gets processed there. The way to optimize this is that instead of ScannableTable you can implement FilterableTable or ProjectableFilterableTable, which allows you to push down the filters and the projections (that contract is sketched right after this paragraph). But that's everything you can do as an optimization: if you have join operators or GROUP BY operators, everything still happens in a central place, on the client side.

The second approach Calcite provides is to implement your own relational rules and relational operators, which allows you to provide an implementation much closer to the native NoSQL system, in this case Apache Geode. This will be very fast, but here is how it looks. Everything in blue is the standard Apache Calcite framework; this part is the base you have to implement in order to provide the adapter, and this part is code generated by Calcite. A SQL query comes in and is passed to the parser; the SQL is converted into a relational expression tree; the tree goes through the planner, which performs optimizations, and while performing those optimizations, if you have implemented your own rules or operators, the planner consults them. The optimized tree is passed to the enumerable component, which is responsible for converting the logical plan into a physical plan, and this is already the tricky part of the process: it produces a well-tailored implementation by choosing expression trees. Expression trees are a concept from LINQ, via Calcite's Linq4j module; I think this originally came from Microsoft. The interpreter then generates Java code optimized for this implementation, compiles it, and the JDBC query is executed. So that's the full, complete pipeline.

I'm going to skip these; those are the internals you would have to get acquainted with if you were going to implement your own rules and operators in Calcite. And this is the generic pattern: if you look at any of the Calcite adapter implementations out there, they more or less implement these same steps.
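Picking up the earlier sketch, here is a hedged illustration of that intermediate push-down option. With ProjectableFilterableTable, Calcite hands you the projected column indexes and a mutable list of filter conditions; anything your adapter leaves in the list is still evaluated by Calcite on the client, so a minimal implementation can safely push down only what it understands. HypotheticalStore and its translate/query helpers are again invented.

```java
import java.util.List;

import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.schema.ProjectableFilterableTable;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

class StoreFilterableTable extends AbstractTable implements ProjectableFilterableTable {

  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("ID", SqlTypeName.INTEGER)
        .add("NAME", SqlTypeName.VARCHAR)
        .build();
  }

  @Override public Enumerable<Object[]> scan(DataContext root, List<RexNode> filters,
      int[] projects) {
    // Translate the RexNode conditions we understand into the store's native
    // query language (OQL, in Geode's case) and REMOVE them from `filters`;
    // whatever remains in the list is re-evaluated by Calcite on the client.
    // `projects` holds the requested column indexes (null means all columns),
    // so only those fields need to be fetched and returned.
    String nativeQuery = HypotheticalStore.translate(filters, projects);
    return Linq4j.asEnumerable(HypotheticalStore.query(nativeQuery));
  }
}
```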
I'll finish with an example. Say you have this relational expression, built from the relational operators we saw: a join of two tables, a filter, and a projection of two of the fields. The optimizer's job is to transform the tree and push some of the operators closer to the data, the projections for instance, so as to reduce the amount of data that moves upwards. That is usually the logical part of the optimizer.

And this is a real example with the Apache Geode adapter I've worked on. If you run this query, this is the logical plan: you can see there is no optimization at all, scans of the tables, then the join is performed, then a filter on one of the fields, and a projection extracting two of the fields. If you use the simple scannable approach, which doesn't implement any rules, you already get some advantage, in that the planner reorders these operators in a way that executes more efficiently, but most of the computation still happens on the client side, where the JDBC connection runs the SQL query. If you move one step further and implement your own rules, something much more tailored to your NoSQL data system, in this case I have implemented GeodeProject and GeodeFilter, and there is GROUP BY as well, you can see that the plan actually uses those operators, and a large part of the work is now executed on the NoSQL system itself. You can still see that the join is not implemented there; I'm working on it. That means this query is currently converted into two sub-queries that run on the NoSQL system, and the join over the results is performed on the client side. But this is something you can progress on incrementally.

Finally, typical JDBC: how we are going to use this from a plain Java standpoint; a small sketch of what that looks like follows. I'm afraid I'm over time, so I'm sorry for this; I'll try to wrap up quickly.
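Since the clock ran out here, this is roughly what that last step would show: plain JDBC against the Calcite driver, with the JSON model inlined as a comment. The factory class, schema name and operand echo the earlier sketches and are assumptions for illustration, not the actual Geode adapter configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnNoSqlDemo {
  public static void main(String[] args) throws Exception {
    // The model.json referenced on the URL would look something like:
    // {
    //   "version": "1.0",
    //   "defaultSchema": "STORE",
    //   "schemas": [{
    //     "name": "STORE",
    //     "type": "custom",
    //     "factory": "StoreSchemaFactory",      // entry point from the earlier sketch
    //     "operand": { "region": "BookOrder" }  // made-up operand
    //   }]
    // }
    try (Connection conn = DriverManager.getConnection(
             "jdbc:calcite:model=src/main/resources/model.json");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT \"ID\", \"NAME\" FROM \"STORE\".\"BookOrder\" WHERE \"ID\" > 10")) {
      // From the application's point of view this is just another JDBC source;
      // Calcite plans the query and pushes down whatever the adapter supports.
      while (rs.next()) {
        System.out.println(rs.getInt(1) + " " + rs.getString(2));
      }
    }
  }
}
```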