Good morning, everybody. Is it time to start? I think it's 10:30. Okay, so good morning. My name is Matthias, I'm the CTO of 28msec, and today I'm going to show you how you can do more with MongoDB and JSONiq.

To give you a little introduction to what we're going to do: the world is full of data, and this data lives in a huge variety of data stores, data sources, and data formats. Those formats range from completely relational tables, to structured JSON documents, which are hierarchical and flexible, to XML, which is a semi-structured format, all the way down to completely unstructured text. For all of this data to be valuable, a lot of processing needs to happen to turn it into actionable information. For example, the data needs to be filtered, aggregated, correlated, and cleaned. People are doing this today, but they are either gluing and stitching together solutions around a technology that was developed in 1978, which is called SQL, or they are using all those great NoSQL data stores that are out there, but that mostly only offer very low-level primitives, very low-level operations on the data. At 28msec, we believe that it's time for NoSQL to take this to the next level, to what we call information processing.

At the heart of our solution, of our technology, is a language called JSONiq. Like SQL, JSONiq is a high-level, declarative query language that allows you to process data. However, it delivers a set of capabilities that let you do far more, with far more data types, in much less time, with much less code, and much more productively. In the remainder of this talk, I'm not going to use any slides anymore. I just want to show you a live demo and develop together with you a set of JSONiq queries that show the capabilities of the language, to give you an impression of what you can do with
this and how powerful it is. So let's see if the internet works.

Here's an in-browser development environment that we have at 28msec. The project that we see here is connected to a MongoDB database. This database contains a set of collections that you see here, and I hope the font size is big enough. It's not? You might want to come to the front a little bit. The two topmost collections here contain a subset of the Stack Overflow data set: answers and faq.

The data is stored in Mongo, so, as you can imagine, it's JSON data. You can take a look at it here; that's one of the answers that is stored. An answer contains a question ID, an answer ID, a last activity date, and also, nested, the information about the owner who posted that answer. The owner has, for example, a display name and a reputation, along with some other metadata. The other collection contains the actual questions that were answered. A question also has a question ID; that is the correlation between answers and questions. It has a score, an answer count, a title, tags in an array, and it also contains the owner information, nested here. Okay, so that's just a very brief introduction to the data that we are going to use for this demo. You don't need to know all of the details.

The first thing I want to do is analyze some of the questions that are in the MongoDB collection. For this, I'm going to go through all of the questions in the faq collection, select only the questions where the is_answered field has the value true, order them by their last edit date, because that's what I'm interested in, and return, for each of the questions, in this case only the question's title and the last edit date.
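As a sketch, the query just described might look something like this in JSONiq. The field spellings (is_answered, last_edit_date) are my assumptions based on the description, not a verbatim copy of the demo:

```jsoniq
(: filter, order, and project questions from the faq collection;
   field names are assumed :)
for $q in collection("faq")
where $q.is_answered eq true
order by $q.last_edit_date
return { "title": $q.title, "last_edit_date": $q.last_edit_date }
```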
So that's the simplest JSONiq query that you can have. What you see here is the title and the last edit date, and the last edit date contains a lot of null values, because the last edit field is not available in all of the questions.

A couple of interesting things about this query: you can extract fields from a JSON document, and you can also navigate deep down in the hierarchy of a JSON document. You can construct new JSON objects whose values might exist in your data or might be computed. And the most important thing here is that, compared to SQL, where the input to a query is a table and the output is a table, the input to a JSONiq query can be a JSON document and the output can be a JSON document. The input and the output can actually be much more than that, but we're going to get to this later.

The next thing: I don't like those null values being at the beginning of my result. So what I'm going to show you is that I can arbitrarily nest expressions wherever a value can occur. I'm going to say: if there is a last edit date, return that last edit date; otherwise, return zero, because I want those questions to be at the end. Now I can run the query, and yes, you see the null values are now at the end of the query result. Okay? So that's a very important observation: you can arbitrarily nest expressions here.

Any questions regarding this query? Yes, please.

[Audience question about the return type]

No, the return value can be a JSON object, it can be any atomic type that you want, it could be XML, and I'm going to get to that later. It can be newline-separated strings for CSV, or whatever you want. In this case, I'm just constructing JSON objects.
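The reordered variant, with a conditional nested in the place where a value may occur, might look roughly like this. It's a sketch with assumed field names, and it assumes the last edit date is stored as a numeric timestamp so that it can be compared with 0:

```jsoniq
(: questions with no last edit date sort as 0, i.e. to the end :)
for $q in collection("faq")
where $q.is_answered eq true
order by (if (exists($q.last_edit_date)) then $q.last_edit_date else 0) descending
return {
  "title": $q.title,
  "last_edit_date": (if (exists($q.last_edit_date)) then $q.last_edit_date else 0)
}
```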
Yes, please.

[Audience question about missing fields versus null]

Yes, that's a very good question. The language has a notion that distinguishes between the absence of a value, a value that is null, and whether the value can be converted to a Boolean. For example, you could use a function called exists here that tells you whether the value exists or not, so I could say where exists on this field, and I can also check whether the field is null with a function here. Does that make sense? So you can distinguish between all those three cases, and that's an important thing you need to do in JSONiq, because JSON distinguishes between the absence of a value and a value that is null.

Yes, please.

[Audience question about composing queries]

Absolutely, absolutely. I'm going to have examples of that later; it's a completely compositional language. As I said, the language is declarative, it's a functional language, and declarative here means that, like SQL, it's highly optimizable. Just by looking at the query, a query optimizer can optimize it according to the features that are available in the underlying data store.
For example, if the MongoDB database, or the collection, had an index on is_answered, then the execution of this query could leverage that index. The same if it had an ordered index on the last edit date: then you wouldn't need to manually order the results here. Another thing you can do with those queries is automatically parallelize the execution across multiple processes or clusters, because the execution of each iteration of the for loop, in this case, can happen independently, except for the ordering, of course, which you need to do at the end.

Okay, so those are the basic concepts of a JSONiq query, in the simplest JSONiq query you can have. Now I'm going to write a slightly more complicated query. I want to go through all of the answers in the answers collection, and I would like to group those answers by the owner's display name. So I can say group by name, where the name is the owner's display name. Here I can navigate across two levels in the hierarchy of the JSON document, and the display name is going to be a string.
So it's going to group them according to the equality of the strings. Then, within each group, for each of the owners, I would like to compute the average reputation over all the answers that the user has posted. I can use the floor and avg functions that are built into JSONiq, and I can get the reputation out of the answers. What this is doing is: after the grouping, answers is going to be a sequence, a group, and I can extract the reputation for all of the answers in it. That is going to be a sequence as well, and avg is going to compute the average over that sequence.

The next thing I would like to do, obviously, is order by the average reputation, descending, because I'm interested in the most reputed users. However, as we discussed earlier, the reputation field might not be available in all of the JSON documents. In that case, the avg and floor functions are going to return the empty sequence, and in the order by clause I can determine whether I want the empty sequence at the beginning or at the end. That's like deciding how you want to treat null in your order by. So I'm going to say empty least, and I'm going to return a JSON object that contains the name, of which there is always only one per group, and then the average reputation that I computed. So now you can see that the average reputation of this user is this number.

Okay, now, that by itself might not be sufficient; I want to get some more values out of this query in one run. I'm also interested in the top questions that this user answered. So I'm using the subsequence function, and I'm going to have a nested query here, where I go through all the answers, order them by their score, descending, and return the
question ID for all of the answers. I'm only interested in the top three, so I'm going to pass one and three as parameters to the subsequence function here.

Okay, so now if I run this query, I get the same information plus another field called questions, and those are the IDs of the questions that this user answered and got the highest scores for. Does that make sense? I think this answers your question: you can arbitrarily nest expressions in the return clause, or in the computation of the value of a JSON field.

Now, you might realize that the question ID maybe doesn't tell you a lot, so what you might want to do is actually get the titles of those questions out of the database. But first I'm going to make the query a little simpler. I'm going to introduce a function called top-answered-questions. It takes some answers as a parameter, and in the body of the function I'm just going to literally copy and paste what was in there. Okay, so now I took this expression and put it in the body of a function, and here I'm just going to invoke the top-answered-questions function, giving the answers as a parameter. I'm going to run it, and what we get is the exact same result as before. That's another interesting observation: since it's a functional language, we can take any arbitrary expression, put it in a function, and invoke it later.

So now I'm going to take those question IDs, and I want to get the titles of the questions. So far we only looked at the answers collection. The titles of the questions,
however, are stored in the faq collection. So we need to make a join between the answers and the faq collection. What I'm going to do here is iterate over all of the question IDs, and for each of them return, from the faq collection, the corresponding question whose question ID is equal to the question ID of the answer. This is an implicit loop over the collection faq: by itself it returns the entire faq collection for each of them, and the predicate is just a shortcut for another FLWOR expression with a where clause. The dollar-dollar in each iteration refers to the question that is currently being iterated over. I'm extracting its question ID and comparing it with the question ID coming from my top-answered-questions. Then I have a question here, and I extract the title. I'm going to run this, and, as you will see, if the internet is not broken... okay, that took some time. As you can see, in the questions field I now have the titles of the questions corresponding to the top answers. There are some duplicates in here, so what I could do is add another function that gives me only the distinct values, before I compute the subsequence, in order to avoid the duplicates.

Yes, please.

[Audience question about rows and columns]

What you get in the result here is already a sequence of JSON objects, and those could be considered as rows, right?
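Assembled from the pieces described so far, the grouped query with the helper function, duplicate elimination, and the join against faq might be sketched like this. The function name, field spellings, and predicate details are my reconstruction, not a verbatim copy of the demo:

```jsoniq
declare function local:top-answered-questions($answers) {
  (: top three distinct question IDs by score, joined against faq;
     $$ refers to the faq question currently being iterated over :)
  for $id in subsequence(
               distinct-values(
                 for $a in $answers
                 order by $a.score descending
                 return $a.question_id),
               1, 3)
  return collection("faq")[$$.question_id eq $id].title
};

for $a in collection("answers")
group by $name := $a.owner.display_name
order by avg($a.owner.reputation) descending empty least
return {
  "name": $name,
  "avg_reputation": floor(avg($a.owner.reputation)),
  "questions": [ local:top-answered-questions($a) ]
}
```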
But there's no notion of a column here. What you could do, though, is construct, for example, a CSV file here, and say that what you want is the name, concatenated using a comma with the average reputation, and so on. I can delete that and execute it, but maybe you want to keep this on the screen in case you have more questions about it.

What I wanted to say is that the title of the talk is doing more with MongoDB and JSONiq, and all the queries that you've seen so far you could potentially also write with MongoDB's query language, or, in this case particularly, with the aggregation framework. But here we just introduced a new concept that Mongo cannot do, which is a join between two collections. Now you might argue that a join cannot be done efficiently within Mongo, and that's a very valid point. But on the other hand, I think we need to realize that people today are already doing joins; they're just doing them in the host language. People need to get this information out of their Mongo database, so what every NoSQL developer does is write a 400-line Python program to get this kind of information out. And here you have a query that is 20 lines: it's declarative, it's very high level, and it's maintainable. You would usually argue that the fewer lines of code you have, the fewer bugs you have. So I think this has a lot of advantages.

Now the next thing you might argue is: how can this be executed efficiently?
Well, the answer is the same: today, each NoSQL developer has to come up with his own efficient join algorithm to make sure that the query is executed most efficiently. Our implementation of JSONiq has an optimizer that tries to choose the most efficient execution plan for that query. Again, if the Mongo database in this case had an index on question ID, then this join here could be turned into something that is called a hash join, or an index-based join, in order to guarantee that the query is executed most efficiently. Does that make sense?

So those are the most basic JSONiq queries you can come up with, and we have a lot more queries that are much bigger, much more complicated, and that get much more useful information out of the database. But what we really want to look at is that you can do even more with JSONiq than what I've shown you. At the beginning of the talk, I said that today's data lives in a huge variety of data sources and data formats. So what you really want is to write queries, or JSONiq programs, that process the data across those sources. In this next query, I'm doing a join between the answers collection that we have in Mongo and a SQL table that now contains the questions. What I'm doing here is going through all the answers and grouping them by their question ID.
I'm computing the maximum score within each group, I'm ordering by the maximum score, descending, and then I'm going to select only the ones that have a very high score. Then I want to correlate those answers with the titles from the SQL database.

JSONiq is a very extensible language, and we extended it with plenty of modules that provide functions that let you do a lot of things. In this case, we built a JDBC connector module that allows you to execute queries against your relational database. What I'm doing here is executing a SQL query, select star from faq, on a connection, and I specified the information for the connection up here. In this case, it's a MySQL database running on Amazon RDS. Executing this query, the result comes back, each row as a flat JSON object, and I'm doing a join between the question ID coming from the relational database and the ID of the question for this group. I can run it, and boom, here it is. What we did is join data between MongoDB and our relational database. Including the four lines of connection information that I have here and a module import for the JDBC module, this query has 19 lines of code. Now try to imagine how many lines of code it would take you to develop this kind of functionality in Python or Ruby or Java.

Okay, any other questions regarding this?

[Audience question about side effects]

Oh, yeah, that's a very good question. The language itself is purely functional and declarative, and a functional, declarative language usually doesn't have any side effects, so it doesn't make any updates to the world outside of the query. This is an important property of declarative languages, such that you can heavily optimize the execution of the query. Now, in this case, there's a bug that leads to the warning you see here.
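A rough shape of the federated query just described might look as follows. The module namespace, function names, and connection details are placeholders of my own; the actual 28msec JDBC module API may differ:

```jsoniq
import module namespace jdbc = "http://www.example.com/modules/jdbc";  (: assumed namespace :)

(: connection details are placeholders :)
let $conn := jdbc:connect({
  "url": "jdbc:mysql://demo.rds.amazonaws.com/stackoverflow",
  "user": "demo",
  "password": "***"
})
let $faq  := jdbc:execute($conn, "SELECT * FROM faq")  (: each row as a flat JSON object :)
for $a in collection("answers")
group by $qid := $a.question_id
let $max := max($a.score)
where $max ge 100
order by $max descending
return {
  "question_id": $qid,
  "max_score":   $max,
  "title":       $faq[$$.question_id eq $qid].title
}
```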
There are functions that are marked as having side effects, and in this case the connect function was marked, by mistake, as having a side effect. So what this IDE does is warn you before you execute the query, because if you just execute it multiple times, or the IDE executes it automatically, you might make changes to some other database. It's just an additional level of safety that the IDE gives you here.

Yes, please.

[Audience question about updates]

Those would be side effects. You would get the same warning; you can run it, and it would make those updates. But if you run it again, you'd better be sure that that's what you want to do. There was another question over here?

[Audience question about reusing queries]

Absolutely. There are several answers to this question. The first is that I already showed you the concept of a function, so you can start building your own modules containing functions for functionality that you have. I can take this entire query, just put it in a function, and make it available for others. The second answer would be that each of the queries, in our product at least, is exposed using a REST API. So you can invoke the resource that triggers the execution of the query, you get a result back, and then you could pipe this result into the execution of another query. But that would be another level of indirection.

Okay, so here's the last example that I'm going to show you. I have a query that, at the beginning, imports two modules.
I'm going to show you what they're used for: one is an HTTP module, and one is an archive module. What I'm going to do in this query is the following. I have a resource here, a zip file that the US trademark office publishes every day, and it contains all the information about the trademarks that were modified or added the previous day. The contents of this zip file are in the XML format; in this case, I think it's probably a 200-megabyte XML file. What we want to do is take this information, process the XML data in it, and extract some of the data, transforming it into JSON, because that's what our web application, in this case, can consume more easily.

So here is how it works. I have this URL here, and I'm binding the value to a variable. Then I'm using the HTTP module to retrieve the binary value behind this URL. Next, I'm using the extract-text function of the archive module to extract the zip file in order to get to the raw XML text. Then I have a function, parse-xml, that takes the XML text and transforms it into the data model that JSONiq uses. At the end, I'm using the XPath notation to walk through all the case-file elements within this XML document that I retrieved, and for each of the cases I'm constructing a new JSON object, the fields being the serial number of the trademark, cast as an integer, the name (the mark identification of the trademark), and the owner information. Since there might be multiple owners, I'm going to put them in an array here. And here you can see the result: the serial number, the name of the trademark, and the owners.

So, going back to what this query does: it takes some XML data from the web, parses it, and transforms each of the case-file elements into JSON.
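The structure of that query might be sketched as follows. The module namespaces, function names, URL, and the XPath steps into the USPTO schema are all assumptions of mine, meant only to illustrate the flow described above:

```jsoniq
import module namespace http    = "http://www.example.com/modules/http";     (: assumed :)
import module namespace archive = "http://www.example.com/modules/archive";  (: assumed :)

let $url := "http://example.gov/trademarks/daily.zip"   (: placeholder URL :)
let $zip := http:get($url)                    (: the binary value behind the URL :)
let $xml := parse-xml(archive:extract-text($zip))
for $case in $xml//case-file
return {
  "serial_number": $case/serial-number cast as integer,
  "name": string($case/mark-identification),
  "owners": [ for $o in $case//case-file-owner/party-name return string($o) ]
}
```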
Now imagine merging that with the query that we had in the other example. In the middle of the query, I could also connect to a relational database, in order to join the information that I'm retrieving from the web with the information that I retrieve from my relational database and the information that I retrieve from my Mongo database. I could join or correlate all of that and return it as the result of one query. Does that make sense?

Okay, so, as I said, that was the last example. To recap: I showed you a powerful and very productive query language for NoSQL. I showed you plenty of examples on top of MongoDB to begin with, and then I started to develop examples that join federated data sources, each storing data in various formats. I hope I could give you a feeling for JSONiq, how it works, and how productive it is to do exactly those things.

Now, JSONiq is an open specification. We remain contributors to the language, but Oracle and EMC also contributed. It has been implemented by several open source processors, and IBM recently announced support for JSONiq in their WebSphere line of products. We have a free book at our booth about the language; I have some copies here, and you can just come pick them up. It's a basic introduction to JSONiq, and it contains a lot of examples that you can copy and paste into the console that you saw here. There are even more examples, and obviously the product of my company, available at 28.io.

Are there any other questions? Yes, please.

[Audience question about standardization]

Absolutely, that's what we believe as well.
There's some controversy on this, and I guess the two of us and some other folks are going to have a panel in the afternoon discussing the standardization of query languages in the NoSQL space. I'm on your side here, but I have heard that there are people with a different opinion, and I hope we can discuss that today at the panel.

Yes, please.

[Audience question about distributed execution]

Distributed? Yeah, as I mentioned at the beginning, the language is declarative, which means it's highly optimizable. Specifically, you can parallelize the execution of a query. For example, what we do is: if your Mongo database is sharded, we leverage that fact, extract the shards, or more specifically the chunks out of those shards, run several processes of the same query, each executing only on a portion of the data, and aggregate the results later. That means you can highly scale the execution of those queries. In some of the benchmarks that we did, comparing it to Hadoop systems, for example, it shows that at least our implementation is in the same ballpark as Hadoop once you start to parallelize the execution of such queries.

I saw some other questions. Yes, please.

[Audience question about where the queries run]

Oh, that's a very good question. Maybe I didn't make this clear. All those queries were run on servers that we host in the Amazon cloud, and on those servers we only run the execution of the query.
We only have a query virtual machine there, and when you run this query, we connect to the other data sources. In this case, one of the data sources was a MongoDB database hosted by MongoHQ, in the Amazon cloud as well, and the other one was a MySQL database hosted by Amazon's RDS service. So they were completely federated. Now, obviously, you want to make sure that the latency between the query processor, the servers that you run the JSONiq queries on, and the data is low, in order to get very good performance.

The time is up, but I'm happy to take more questions, because there's a longer break now.

Yes, please.

[Audience question about Hadoop]

On Hadoop: theoretically, you could implement, for example, a module that allows you to retrieve data from the HDFS file system, and then you could process the data within HDFS and parallelize the execution over that data. Now, that doesn't have anything to do with MapReduce on HDFS. But what you could also do, and that's nothing that we do, is take the JSONiq language and translate it into a MapReduce job. Theoretically, that should be possible, and I don't think it's very hard. Then JSONiq would just be a syntax that allows you to create MapReduce jobs that are then executed by Hadoop, because Hadoop is a very good and scalable runtime.

Yes, please.

[Audience question about adding another data source]

That depends. We have a lot of connectors to data sources; a lot of the vendors here are exhibitors. We have a connector to Oracle SQL, to cloud stores, and so on, but those are really only lightweight connectors that use the primitives that those data stores provide. Our integration with MongoDB is much deeper: it also detects indexes, it pushes down projections so as not to retrieve all the information, all that kind of stuff. So it really depends. A lightweight connector you can build in one day. The MongoDB connector...

[Audience question about the effort involved]

That's a very good question.
There are several things: you need to take the BSON and transform it into our data model, and that's a couple of thousand lines of code. The deep integration is more, because it also targets the optimizer, which needs to detect the indexes in Mongo and then rewrite the query to leverage those indexes. So the deep integration is a bit more work, but a lightweight connector is relatively simple.

[Audience question about the implementation language]

No, the core query processor is entirely written in C++. We do have the JDBC connector, for example, which is obviously in Java, and we call the JDBC connector in this case using JNI. There are also several language bindings that you can use to extend the language with functionality. About 90% of what I showed you is open source. At the core is the Zorba processor, the Zorba NoSQL processor, which is licensed under the Apache 2 license. Some of the connectors that I've shown you have been developed by my company; they are not open source yet, and we are deciding whether we're going to do that. We're trying to push more and more into open source as we go.

Any other questions? Yes, please.

[Audience question about graph queries]

No, not yet. There's no syntax yet for traversing graphs. But if you look at the graph query languages that are out there, I think their syntax could be integrated very well into JSONiq, to give you that kind of functionality as well. So instead of navigating a JSON document, you could have a syntax that allows you to do a traversal in a graph. I think an extension to JSONiq with graph functionality is indeed possible. I mean the language specification; whether we are going to implement that or not, I don't know.

Any more questions? Yes, please.

All right, so I have some books here; feel free to pick them up.