Hello, good morning everybody. Welcome to the talk Harnessing Flexible Data in the Cloud. There are not too many people here; I guess we chose a title that is a little bit too general. So maybe let me start by giving a demo to show you what we actually mean by harnessing flexible data in the cloud.

This is a service that we have built for a client who is a sports news provider. His job is to aggregate all kinds of data about the major sports leagues and events in North America. After he has aggregated and stored this data, he provides syndicated feeds to his clients, which are, for example, news agencies. So let me give you an example of such a feed. Let's choose the sport baseball and the major baseball league. What you see here is a syndicated feed that shows, on the upper left, the most recent video or videos from the MLB YouTube channel. It shows a section with the latest news about Major League Baseball. In this row, it shows the news about club players. It also shows some statistics, like the batting average or the home runs of the players, some of the latest results, and an aggregated Twitter feed from some of the accounts that are related to Major League Baseball. At the top, what you can see is a full text search box with type-ahead functionality, so you can search for a player, for example. It takes you to the page of that player, and on this page you see all the information about the player: the latest news about this player and the scouting information.

So all of this is pretty complex. Going back to the presentation, let's take a look at where the data actually comes from and how it is represented. The YouTube channel, for example, you can query, and what you get back is JSON data about all of the channel's videos.
The news about the players, the statistics, and the latest results are represented in a format called SportsML. It's an open specification: an XML schema, and the data is in XML. The general news here is in another XML format, called the News Industry Text Format, which is also an open specification. And the Twitter data, as you might know, is represented in JSON; if you make a request to Twitter, what you get back is the result as JSON. The full text search box, obviously, is sent text, and what it gets back, for the type-ahead suggestions in the browser, is JSON.

So this is actually what we mean by flexible data: it's JSON, it's XML without a schema or with a schema (in this case, the SportsML schema), and it's text. And by the way, the XML data in this demo is actually stored in a database, in our own database, and there are approximately a million documents in there at the moment.

So now that we know what we mean by flexible data, we need to understand what we mean by harnessing this data. First of all, we want to store this data — lots of data. We also want to query and update it. We want to do full text search, like the type-ahead functionality that you have seen at the top of the page. We want to do complex queries, like data cleaning or transformations. And all of this, of course, needs to happen, as you would expect at a NoSQL conference, in a reliable fashion: you need to provide 24/7 availability, and it needs to be highly performant and scalable. Because, as you can imagine, after a sports event more people will visit, so you need a scalable infrastructure. Sometimes more data comes in, sometimes less, and we want to be able to handle all of that.

So let's look at what technologies you could choose to build such a service. For example, you could choose MongoDB to store the JSON data. You could use Lucene as a full text index in order to allow the full text search on your site.
In order to store the XML data, you could choose a database like MarkLogic or eXist. And you would probably run all of that on Amazon Web Services in order to scale your infrastructure with EC2, and provide videos or files on S3 or CloudFront. All of these are great NoSQL technologies; a couple of years ago, it would have been much harder to build such a service if there had been no MongoDB or no Lucene.

Having all these services, what you still need to do is write the actual logic. You need to develop the logic, and you can choose any of the languages up there — there are plenty more: Python, PHP, Java. So, for example, in Python you would select some data from your MongoDB collections, you would probably join it with some results that you get from Lucene, you would select some documents from your XML database, and you would provide all of that in the syndicated feed that we have seen. Now remember, this code is not trivial at all, because you have to program lots of joins and the search, and you have to do everything manually in the language of your choice. So that's actually a lot of code that you have to write here.

So what we did is we reimplemented the client's syndicated feed with the technology that 28msec provides. And that's something that we have seen in a lot of projects that we have done for clients — it's always exactly the same: by using our technology, you're about five times faster in development compared to what I call the manual approach that you've seen on the slides before.

So what I want to show you is something that allows you to really efficiently build such data-intensive web applications. And how we do this is, first of all, we need a processing language.
And in order to really deal with all this complexity and all of this data, we need a processing language that is capable of handling JSON, XML, text — everything. We call it the SQL of NoSQL. And what we also need is a data store: a scalable data store that works in the cloud and handles lots of data. Those two things together are actually called a database. So that's what we need. And of course, we want to run all of this on some cloud infrastructure: private cloud, hybrid cloud, or public cloud.

So let me start by showing you the processing language that we have developed. The language is called JSONiq. It's a functional language. It's declarative, which means you express what you want as a result, but you don't program the actual algorithm. It has scripting functionality, querying functionality, full text functionality, and update functionality. And it does all of that for the data formats that we've seen: for XML with a schema, XML without a schema, for JSON, and for text.

Let me give you an example of the language. Let's assume we have a database that contains information about all of the zip codes, the states, and the population in North America — a collection that contains approximately 30,000 JSON documents in this case. What we do is: for all of the JSON objects in this database, we get them out of the database and group them by their state. What you see here is that we extract the state field out of each JSON object. Then, for each state, we aggregate the number of zip codes — we count the number of zip codes in each state — and order the states by that count, descending. And then we construct a new JSON object that contains the name of the state and the count. So on the fly, we construct a new JSON object.
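As a point of comparison, doing that same grouping, counting, and sorting manually in Python over plain JSON-like dicts might look roughly like this. This is only a sketch: the field name `state` follows the example, and the sample data is made up.

```python
from collections import Counter

def zip_counts_by_state(zip_docs):
    """Manual equivalent of the grouping query: count zip-code
    documents per state and sort descending by count."""
    counts = Counter(doc["state"] for doc in zip_docs)
    # Construct a new JSON-like object per state, ordered by count.
    return [{"state": state, "count": n}
            for state, n in counts.most_common()]

docs = [{"_id": "77001", "state": "TX"},
        {"_id": "10001", "state": "NY"},
        {"_id": "77002", "state": "TX"}]
# zip_counts_by_state(docs) ->
#   [{'state': 'TX', 'count': 2}, {'state': 'NY', 'count': 1}]
```

And this toy version ignores everything the talk cares about at scale: with 30,000 or more documents in a real store, the grouping and sorting would also have to be pushed to the database or parallelized by hand.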
So that's five lines of code that allow you to do a really, really complex query: it does grouping, it does aggregation, it does sorting, and it constructs a JSON object. In the result you can see, for example, that Texas has about 1,600 zip codes and New York about 1,500. Now assume you had to write this kind of query in Python on top of, for example, MongoDB or any document store that is out there. You would have to develop the grouping functionality, the aggregation and counting, and the sorting, and all of that needs to work for huge amounts of data. And the object construction and returning the result is not that simple either.

So let me show you a more complex query that should give you an idea of what else you can do with the language. It's a lot of code, but I will walk you through it slowly. What we do here is we don't query our database; instead, we query Twitter. This is similar to the example that you have seen in the demo. We make an HTTP request to Twitter, and in this case we query all the tweets that contain the term NoSQL — so in place of this placeholder, we put NoSQL. The result comes back as JSON, so we parse the JSON document that we get back and bind it to the variable called search-results here. The document you get back from Twitter contains an array, called results, that holds all of the tweets. So we extract this array and iterate over all of its members. Then we extract the text of each of those tweets and bind it to the variable text. And next, we want to throw some full text magic at the text of all of those tweets: we tokenize each text — a full text tokenization — and we eliminate all of the stop words that are in this text.
We put all of the tokens that we extracted into lowercase and strip their diacritics. Then, similar to the previous query that you have seen, we group by the lowercase term, we aggregate — count — the number of tokens, and we order them descending. And then we construct another JSON object that contains the token and the frequency of that token across all the tweets that we retrieved from Twitter. This can be useful, for example, if you want to provide something like a tag cloud on your website about what is hot on Twitter. The result would look like this — I ran this query yesterday. I think it contains the term "hard" about ten times, "truth" seven times, "SQL" five times, and "revolution" three times.

And again, if you look at that query, it's 11 lines of code, and really complex stuff is going on here: there's a lot of full text happening — stop word elimination based on a language, stripping diacritics based on a language — and, the same as we've seen before, grouping, counting, sorting, and a lot of things that are specific to JSON. So in this language, you can work extremely efficiently with JSON data. Any questions about the example? Yes.

So within the language, you can provide modules which are implemented in host languages or external languages. That full text module is one that we provided, and it actually uses a lot of functionality from an open source project that is out there in order to provide the full text features. But the language also has a native full text extension, so everybody who implements this language can provide this functionality.

The language also features — and that's related to your question — plenty of other modules; it provides a rich module library. For example, we have a module that allows you to connect to any Mongo database that you have, retrieve results, and work with your MongoDB dataset in the language. There are other libraries that allow you to do complex data cleaning.
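For comparison again, a rough manual Python version of that tweet pipeline might look like the sketch below. It skips the HTTP request and the diacritic stripping, and uses a tiny hard-coded stop-word list standing in for the language-aware stop-word handling the talk describes; the input shape mirrors the `text` field of Twitter's result objects.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real implementation would use a
# language-aware list from a full text or NLP library.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "about"}

def token_frequencies(tweets):
    """Tokenize tweet texts, lowercase the tokens, drop stop words,
    and return {token, frequency} objects sorted by frequency."""
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"\w+", tweet["text"].lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return [{"token": t, "frequency": n}
            for t, n in counts.most_common()]

tweets = [{"text": "NoSQL is hard"},
          {"text": "the truth about NoSQL"}]
# token_frequencies(tweets)[0] -> {'token': 'nosql', 'frequency': 2}
```

Even in this stripped-down form, the grouping, counting, sorting, and object construction are all hand-written — the work that the query language does declaratively.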
It allows you to do, as we have seen, HTTP requests, so you can make GET or POST requests to any service that is out there on the web. You can use cryptographic functionality, like computing a hash or an HMAC. There are geo libraries. There are libraries that allow you to construct PDFs on the fly. There's a library that allows you to extract ZIP archives that you retrieve, so you can read a ZIP archive and get all the JSON or the XML out of it, or even the binary data. There are libraries that allow you to generate random numbers, web-related things like working with cookies, importing CSV data, the full text example that we have seen, and working with XML schemas — there's a function library for that — and so on. So there's a huge number of module libraries that you can use to work with your data.

Now, as I showed at the beginning, we've done a couple of experiments with real-world projects that are out there, and I want to show you two of them. What we did is we compared productivity, and we measure it in the lines of code that the application needs. The first one is an application called PubZone, which is a scientific publication forum. The application was originally developed in Java, and it had, without all of the UI components, approximately 8,100 lines of code. We did a feature-complete reimplementation in the language, in JSONiq, and we managed to do it in only 3,500 lines of code.

Then we took a piece of software that you probably all know: the AWS language bindings for services like S3, SQS, and, I think, the Simple Notification Service. The Java binding from Amazon contains approximately 14,000 lines of code. We also took the PHP language binding that they provide, which is already almost half of that, at 6,500 lines of code, and we reimplemented it, feature-complete, in JSONiq.
What we found is that we can do it in 2,500 lines of code. And the reason we can is that, as you probably know, the Amazon Web Services return XML or JSON, and you can send XML and JSON. Since we have those data types as first-class citizens in the language, we can work very efficiently with this data, and since we can make HTTP requests and all kinds of things, this allows us to do the same in a lot less code. As you can imagine, this difference in lines of code matters: fewer lines of code means fewer bugs and less maintenance, and in the end it's all about cost. So it allows us to develop those applications much, much cheaper.

So next, now that we have this — yes? So, the comparison at the beginning was development time, and it's almost linear in this case. I mean, of course, if you use something like Eclipse to generate a lot of that code, it's faster, but in the end it's almost linear, because the number of bugs you have to fix and everything else is still there and scales with the code. So I think it's quite comparable.

So now that we have this language to work with all this data, we also need to store it somewhere. And there's a great solution that you probably all know and have heard of at this conference: MongoDB. What we did is we put this language on top of MongoDB. MongoDB is a document store that can natively store JSON, so whatever JSON data we get, we store it natively in MongoDB. It's very fast, highly available, and scalable, and it allows you to do atomic updates on a single document that you put into this database — a very, very nice feature that MongoDB provides. What we did is we developed an extension for richer data types. Specifically, we added support to MongoDB for XML: we have our own binary XML format, which we put into Mongo.
And we also have support for a lot more data types that are very useful and that a lot of JSON solutions miss: datetimes, durations, all these kinds of useful things that you need when you deal with data on the web.

So let me show you what we do. In our solution, we have collections, and in your project you can have a huge number of collections. Those collections can contain JSON data, as you can see here, or they can contain XML data. We map each of them directly onto a MongoDB collection: we create a collection in MongoDB that contains either native JSON or our own binary format for XML. So it's a one-to-one mapping. In addition, we use the indexes that MongoDB provides in order to index the data that is in there: we index the JSON natively, and we also have a way to index the data in our own binary XML format. And given the collections that you declared in your project with JSONiq, you can leverage those indexes to achieve great performance when you do selections or aggregations. You're probably aware of how all of that works in MongoDB. So this is what we did: we took the language and put it on top of MongoDB as a data store.

Next, we need a scalable infrastructure to run this on, and we have chosen AWS to host our service: EC2, the Elastic Load Balancer — all of our infrastructure is running on AWS. For example, MongoDB is running on a couple of EC2 instances that are distributed across regions and availability zones, in replica sets. All of that allows us to guarantee the high availability and the scalability that we want.

So let me walk you through a request as we process it on our platform. Assume you have some HTTP client that makes a request and wants to retrieve some data: a website, say, or an RSS feed.
What happens is you make a request, and it goes into the Amazon infrastructure. ELB stands for Elastic Load Balancer, which is a product that AWS provides; it distributes the request to any of our EC2 instances that are running our platform. There is an arbitrary number of EC2 instances, which we can scale up and down depending on the number of requests that are coming in.

The next step is that, within our platform, something we call the request handler takes the request, analyzes it, and sees which code — which JSONiq query — it needs to execute. It goes into MongoDB, which we use as storage for our compiled queries, retrieves the query, and passes it to our JSONiq query processor. The query processor starts to execute the query; it has two components, the processor itself and the binding to MongoDB. While we are processing the query, we retrieve data from MongoDB and process it: we do the grouping, the aggregation, the full text — all of the things you've seen in the examples. And you might also want to update those documents, which you can do. At the end, after finishing, we write all of the updated data back to MongoDB, leveraging the per-document atomicity that MongoDB guarantees us. Then, in steps eight and nine, we return the result to the client.

And again, this is running on a really scalable infrastructure; it's all running on EC2. We have a number of MongoDB machines — any number that we need, any number of shards — and we can scale the processing in the platform independently of the database.

We actually do, yeah. Since the index is declared in the language itself, the query processor knows about an existing index, for example. So it can leverage that index during processing and already push the selection down to the MongoDB index. Any other questions? Okay, so let me conclude.
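The request-handling flow just described can be sketched very loosely in plain Python, with dicts standing in for the MongoDB collections and the compiled-query store. All names here are illustrative assumptions, not 28msec's actual API.

```python
# Stand-ins for MongoDB: one dict for compiled queries, one for data.
compiled_queries = {}
data_store = {"zips": [{"state": "TX"}, {"state": "TX"}, {"state": "NY"}]}

def register_query(path, fn):
    """Store a 'compiled query' under the request path it serves."""
    compiled_queries[path] = fn

def handle_request(path):
    """Request handler: look up the compiled query for the path,
    execute it against the data store, and return the result."""
    query = compiled_queries.get(path)
    if query is None:
        return {"status": 404}
    return {"status": 200, "body": query(data_store)}

# A trivial 'compiled query' counting documents in a collection.
register_query("/zip-stats", lambda db: {"total": len(db["zips"])})
# handle_request("/zip-stats") -> {'status': 200, 'body': {'total': 3}}
```

In the real platform, of course, the load balancer fans requests out across instances, the query store and data store are MongoDB collections, and updates are written back atomically per document; this sketch only shows the dispatch shape.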
What I've shown you, in very brief examples, is the language that we call JSONiq, which is a very, very effective language for processing flexible data: JSON, XML with or without a schema, and text. We have leveraged MongoDB as a data store and index and put our query processor on top of MongoDB, and those things together we call the database. I've also shown you a couple of examples of how cost-effective developing an application in our language is: we have consistently seen, across several projects, an up to five times more cost-effective way to develop such applications. And I've shown you how we put all of that on a scalable infrastructure on AWS.

Now, regarding the language, this conference has two more talks that go into more detail about it. The first talk, by Jonathan Robie from EMC, is at 11:00 and shows you more features of the language. Another talk, by Chris Hillery from the FLWOR Foundation, is at 11:45 and goes into more detail on the implementation of such a language. So if, for example, you have a data store out there and you want to put this language into your data store, he will give you a lot more details about how you can implement such a language efficiently.

So this is it. I think we still have time for some questions. Thanks for listening. We also have a booth at the conference, so if you have any more questions or want to see more demos — we have a huge number of demos of the language and the platform overall — just visit us and we will answer your questions there. Any more questions, other than the one over there? Yes, please.

So at the moment, what we're doing is we are hosting the service on Amazon, but we're also looking into other alternatives, like being able to host it in your private cloud. And yeah, we are just getting ready with the product, and we're working on the business side in order to answer those questions. Any more questions? Okay, thank you very much. Thank you.