It's probably not surprising to you that the amount of data grows almost exponentially over time, and with the Internet of Things and more and more devices connected to the internet, we can be pretty sure this trend will continue and that in a couple of years the amount of data generated on the internet will be larger still. What is maybe a little more interesting is the structure of the data. The black part of the chart, which grows linearly, is a rough estimate of the portion formed by structured data; the exponential part is mostly semi-structured or unstructured data. This has some implications, and I will return to it a little later.

This amount of data is really big, and the phenomenon is known as big data. It can mean something different to everyone, so to be sure we are on the same page: how big does data have to be before we call it big data? Do you have some definitions? Some more tips? I have a scarf, I have a half-full bottle of Kofola if you don't wear scarves, or a banana. Terabytes? Okay, terabytes. Somebody else? No? One megawatt. Okay.

So, some definitions. Probably the most popular and commonly used one is this: big data is anything that crashes your Excel. It's a kind of joke, but we can take something from it. You can buy a beefier machine, you can buy a recent version of Excel which supports more rows, but with this exponential growth, sooner or later scaling up won't be sufficient and you will have to scale out. And this scalability issue is not always just about the amount of data, but also about cost efficiency. For example, imagine you run an internet shop. Before Christmas you probably have to deal with a really huge number of orders, while during the summer, when everybody is on holiday, the number of orders will probably be significantly lower. There is no point in buying a really beefy machine two weeks before Christmas and then having it sit idle the rest of the year. So there are two main reasons: the sheer amount of data, and cost effectiveness.

There are also some other challenges of big data; here are a couple of them, which I will try to address later in this talk. If you have a huge amount of data, running analysis on top of it is challenging simply because of the volume. You also have to store the data somewhere, and as I mentioned on the first slide, quite a lot of it can be unstructured, so you need a solution which can store unstructured or unprocessed data. There can be other reasons too: the data may have some structure you are not yet able to process, and purely for performance reasons you store it as it comes and process it later. When you then run some analysis on top of it, you will probably get data with a better structure, on which you will later want to run queries, so you probably also need a solution which can store structured data and run queries on top of it. Quite often you need both: storage for unstructured data as well as for structured data. And the solution, as I mentioned, has to be scalable, and the scaling is usually done in the cloud.
But running an application in the cloud brings more challenges, because it completely changes the architecture of the application. In the cloud, everything is ephemeral, so the application has to deal with the fact that some piece of hardware will die, and it has to live with that. You can't rely on storing the data here and finding it still there two days later. So it also changes how we create applications. Probably the most widely used approach to address these challenges is data replication combined with running MapReduce on top of the data, and probably the most popular solution is Hadoop.

So far so good; I would say I have just recapped some facts which you probably know. Are there any Hadoop users here, anyone who runs it or has at least played with it? Are you happy with Hadoop? It seems everything is okay. So why is Hadoop not good enough? What is it that leaves you not completely satisfied with Hadoop? Well, you know, people today are more and more impatient and want their answers immediately. Nobody sends letters; everybody sends email, or some people even make phone calls and want to communicate in real time. For some people Hadoop is slow, because you store data in HDFS and, for example, once a day you run some analysis on top of it, and the analysis can take a pretty long time. So some people want to do it faster.

How can we speed the whole process up? One little but powerful idea, especially for iterative algorithms: don't store the data to HDFS or to some permanent storage during the computation; keep the data in memory all the time. And remember that you run it in a cloud, which has implications: you are replicating the data, you run some transformation on top of it, and if you store the result, it also gets replicated, and again that takes time. So stop and think a little: do I have to replicate every single change? The answer is probably no. Maybe I can run a bunch of operations and then replicate just the result set. It's easy to say and a little harder to do, but fortunately there are frameworks which already do that, and probably the most popular one is Apache Spark.

At the heart of Apache Spark is a concept called Resilient Distributed Datasets, or RDDs for short. An RDD is basically an immutable distributed collection of data. Immutable means that if you apply some change to it, it doesn't modify this RDD; it creates a new RDD. There are basically two kinds of operations you can run on an RDD. The first is a transformation: a transformation takes one or more RDDs and creates a new RDD from them; typical examples are the map and filter operations. The second kind is an action: an action, as the name suggests, takes some action on the RDD, for example counting how many items are in a given RDD, or taking the first element, and so on. And now the important thing: RDDs are evaluated lazily. When you apply a chain of transformations, nothing happens yet; it runs in milliseconds because Spark just logs the operations you want to do, and the RDD is evaluated only when you call an action on it. So there is really huge space for optimization. Imagine you run, say, five transformations, the last one a filter, and the action is taking the first element: you want to run some transformations but are interested only in the first element (see the sketch below).
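To make the transformation/action split concrete, here is a minimal sketch in Scala using the plain Spark API; the dataset and the numbers are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddLazinessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-laziness").setMaster("local[*]"))

    // An RDD: an immutable, distributed collection of data.
    val numbers = sc.parallelize(1 to 1000000)

    // Transformations: nothing is computed yet, Spark only records
    // the lineage of operations to run later.
    val squared = numbers.map(n => n.toLong * n)
    val filtered = squared.filter(_ % 7 == 0)

    // Action: this triggers the evaluation, and Spark can stop as soon
    // as the first matching element is found instead of materializing
    // every intermediate result.
    println(filtered.first())

    sc.stop()
  }
}
```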
In the, let's say, classical approach, you would run the first transformation on the whole bunch of data, save it somewhere, then the second, the third, and only then pick the first element. Obviously, if you know in advance what you want to do, there is pretty big space for optimization: you run the transformations, and once you find the first element which matches the last filter, you just return it and can stop. And that's not all. As for replication, you don't have to replicate the whole dataset; you just replicate this sequence of transformations, which is called the lineage. When some partition of the data is lost, you take the original data, or some RDD which was persisted earlier, and recompute the lost partition. So basically you don't replicate every single change; the only thing you replicate is the lineage, and when some crash happens, the data is recomputed and it's safe again. It turns out this saves quite a lot of time, and not only that: the RDDs are kept in memory if possible, which brings a really huge speedup. This is the original paper about RDDs; it's quite an interesting read, so I recommend taking a look at it.

This is a plot from the Spark web page comparing a machine learning classification algorithm running on top of Hadoop and on top of Spark, and as you can see, the speedup is two orders of magnitude, which is really huge. I'm not saying that every application will see a speedup of two orders of magnitude; papers report that the speedup is usually one order of magnitude, and for iterative algorithms it can reach two. But it's still a really nice speedup, so I think it's quite impressive and really worth considering whether your application can be moved to Spark and benefit from it.

So it sounds good, but can we do even better? I believe that at least for some applications, yes. Let's try to process the data as soon as it arrives. Maybe it's quite a trivial suggestion, but if I asked how many applications really do that, I believe the answer would be: not so many. It may sound trivial, and the implementation can be a little tricky, but again, fortunately, somebody already did it: Spark provides support for streaming. As the data comes in, it creates RDDs from the incoming data, puts them into micro-batches, and emits these micro-batches to the Spark workers, which do the processing. The user can configure how frequently these micro-batches should be created. This micro-batching has some advantages but also some disadvantages; usually it works pretty well.

Again, it's immutable. In Spark terminology the incoming stream is called a discretized stream, or DStream. So here you have an incoming stream of lines of text, split into micro-batches, and imagine you run a transformation on top of it that splits every line into single words, a flatMap. What it does is create a new DStream: it applies this transformation to every single batch, so you get a new stream of micro-batches which you can process. If you want to move an existing Spark application to Spark Streaming, the switch is pretty trivial: you change a couple of lines, switch the context to a streaming context, and switch from RDDs to DStreams. Internally these DStreams again contain RDDs, so there is not much work in switching to streaming and processing the data in real time as it arrives (a sketch of exactly this follows below).
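Here is a minimal Spark Streaming sketch in Scala of this word-splitting example; the socket source and the one-second batch interval are assumptions made for the sketch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // The batch interval: how often incoming data is grouped
    // into a new micro-batch.
    val ssc = new StreamingContext(conf, Seconds(1))

    // One possible input source; feed it with `nc -lk 9999`.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same API as for RDDs: this flatMap is applied to the RDDs
    // inside every micro-batch of the DStream.
    val words = lines.flatMap(_.split("\\s+"))
    words.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```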
If you are not happy with my... Okay, you probably have to scale your solution or add more brokers. But that's the advantage of running your solution in the cloud: if you run it, for example, on Amazon, you pay only for what you really use, so during the day you can scale up and create a really huge cluster, and during the night you can slim it down to one node. This is a real advantage of running in the cloud, but the application has to be prepared for it; it has to be scalable, and Spark definitely is. There can be other challenges, for example if you hard-code something here: it can happen that the micro-batch interval isn't enough and you would like to go down to, say, one millisecond, and that is one of the disadvantages of micro-batching. So here's my answer: maybe you can switch to a real-time stream processing framework. The three most well known are Apache Storm, Apache Flink, and Apache Samza, and all three are pretty similar. Originally I wanted to give an introduction to Apache Storm, but I found I don't have enough time for it. Fortunately you won't lose the information, because it's homework for you: once you get home, the first thing you should do is open the web page and go through a hello-world example of one of the three. All three are pretty similar, so it's fine to read, for example, a hello world for Apache Storm to get a feeling for how it works. It's not something super complicated, so it shouldn't be a big deal to understand, but it's pretty powerful.

Okay, can we do even better? At least in some cases I believe yes. What more can we do? Spark keeps the data in memory during the computation, but you usually don't just store the data, run some Spark analysis on top of it, and then finish and throw the data away. Usually you have a whole stack of applications: incoming data, some Spark processing, and the result of that analysis is just sent to another application which does something else, for example some further processing of the already prepared data. That application sends it on to yet another application, which can be, for example, a business process engine that takes automated actions based on the data, and so on. So there is usually a whole chain of applications, and you have to exchange the data between them somehow; usually you store it somewhere in Cassandra or some SQL database and so on. But let's try to keep it in memory all the time.

So how can you keep the data in memory during the processing of the whole application stack? The answer is caching. Here is a sketch of how the cloud works, and you can see that caching is the basic principle of how the cloud works, because the cloud is basically just caching. But to be more serious, the answer is an in-memory data grid, for example Infinispan, which is a pretty mature in-memory data grid solution. I won't go into the details, because first, I don't have time for it, and second, there was a dedicated presentation about Infinispan this morning; if you missed it, I really recommend watching it on YouTube, or if you just want a quick overview, go to infinispan.org and check the features. I'll just quickly mention that it's a NoSQL key-value data store, but you can also define a schema, and if you define a schema, you can run indexing and queries on top of Infinispan; a minimal sketch of the client API follows below.
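For a feel of the client side, here is a minimal Hot Rod put/get sketch in Scala. It assumes an Infinispan server running with default settings on localhost; the cache name and the values are invented:

```scala
import org.infinispan.client.hotrod.RemoteCacheManager
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder

object HotRodSketch {
  def main(args: Array[String]): Unit = {
    // Connect to a running Infinispan server over the Hot Rod protocol.
    val builder = new ConfigurationBuilder
    builder.addServer().host("localhost").port(11222)
    val manager = new RemoteCacheManager(builder.build())

    // A plain key-value cache; "temperatures" is a hypothetical name.
    val cache = manager.getCache[String, Double]("temperatures")
    cache.put("Prague", 21.5)
    println(cache.get("Prague"))

    manager.stop()
  }
}
```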
So to some extent this addresses the challenge I mentioned at the beginning of the presentation: you sometimes need a solution which can store NoSQL data but at the same time can store structured data and run queries on top of it, and Infinispan can be a pretty good solution for that. Of course it's scalable, elastic, and so on, and there is no single point of failure: if some node dies, the data is replicated and everything keeps working. One nice thing is that it supports transactions, so if you run, for example, a financial application and really need to be sure that the data arrives where it should, transactions are there. It has quite a lot of other nice features; I will talk about some of them a little later, but for the rest, please go to infinispan.org or check the presentation of my colleague Kierka Haluša from this morning.

So how can we connect it to Spark and the things I talked about before? There is a connector which connects Spark and Infinispan. What does that mean? It means that pretty simply, with just two lines of config, you can read and write data from and to Infinispan from a Spark cluster. You just define the Infinispan server address and the name of the cache you would like to process, and simply create RDDs or a stream from the Infinispan data. You can have different use cases: for example, data is coming from Kafka, is processed by Spark, and you just want to push it on to Infinispan, so of course you can also write the data. One nice thing I'd like to mention is that you can transform RDDs by using Infinispan queries. Imagine, for example, that you store some users in the cache, and now you are interested only in users who have the same name as I do: you run a query for users having name equal to Wojtek, and you filter the RDD just by running Infinispan queries on top of it, and everything happens inside Spark, which is pretty nice. A rough sketch of the connector usage follows below.
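A rough sketch of the connector usage in Scala. The property keys and the InfinispanRDD class are taken from the connector documentation of that era, but treat the exact names and the constructor signature as assumptions; User is a made-up class:

```scala
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.infinispan.spark.rdd.InfinispanRDD

case class User(name: String, age: Int)

object ConnectorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ispn-spark").setMaster("local[*]"))

    // The "two lines of config": the server address and the cache name.
    val config = new Properties
    config.put("infinispan.client.hotrod.server_list", "localhost:11222")
    config.put("infinispan.rdd.cacheName", "users")

    // Expose the remote cache as an RDD of (key, value) pairs.
    val users = new InfinispanRDD[String, User](sc, config)

    // A plain Spark filter; the connector also lets you push this down
    // to the server as an Infinispan query (filterByQuery in the docs).
    val wojteks = users.filter { case (_, u) => u.name == "Wojtek" }
    println(wojteks.count())

    sc.stop()
  }
}
```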
Pardon? The question is whether everything is kept in memory. The answer is yes. Actually, in Spark you can define storage levels, so you can also keep data on disk, but what is highly recommended is the MEMORY_ONLY storage level. You can run into trouble with this option, because Spark can run out of memory, but it's up to you: you can also store to disk or, if I remember correctly, use another strategy, MEMORY_AND_DISK, where Spark automatically tries to keep the data in memory and, when it can't, starts spilling it to disk. So you have two options: you can scale out and add more Spark workers, or, if that's too costly and you don't need the results right now, you can store the data to disk and process it later from there; it will be a little slower, but it will cost less money. It depends on your use case and how you evaluate your application, but basically you have both options; you don't have to compromise.

No, but there is a recommended setup: on every machine where you run a Spark worker, the recommended approach is to also run one node of the Infinispan cluster, because the connector is clever enough to load only the data stored in the Infinispan cache segments that live on that machine. So it's not that Infinispan and Spark share the same memory; the data is loaded over the wire, but over localhost, so it should be pretty fast. Running both on the same node is the recommended setup, not mandatory; if you run them on different machines, you will pay network latency, which is why the recommendation is the same machine.

The next question was whether the data lives only in Infinispan, or whether I double the data when I load it into Spark. Functionally, yes: these are two different processes which don't see each other's memory, so you have to load it. But the assumption is that not the whole dataset will be loaded into Spark; Infinispan plays the role of the more permanent storage, and you only load some part, say, the stream of micro-batches, into the Spark processes, so you don't have to duplicate the whole dataset kept in Infinispan.

Okay, a couple of other interesting Infinispan features which can be used for pushing the data onwards. Imagine that the Spark processing is done, the data has arrived, and you would like to push it to other applications down the stack. There is a pretty easy solution for that: Infinispan has client listeners, so whenever anything changes in an Infinispan cache, it sends you a notification, and you can take immediate action based on it. A slightly more advanced concept built on top of that is the continuous query. Imagine that your user runs a query on the data you got from your analysis and wants updates as new data comes in; without it, you would have to open, for example, a web socket and run the query again and again, but with a continuous query you just register the query once, and whenever data arrives which matches it, you are notified about the change and can immediately show it to the end user, to another application, or whatever. A minimal sketch of the client listener API follows below.
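A minimal sketch of a Hot Rod client listener in Scala; the annotations and event types are from the Infinispan client API, while the cache name and the printed messages are invented:

```scala
import org.infinispan.client.hotrod.RemoteCacheManager
import org.infinispan.client.hotrod.annotation.{ClientCacheEntryCreated, ClientCacheEntryModified, ClientListener}
import org.infinispan.client.hotrod.event.{ClientCacheEntryCreatedEvent, ClientCacheEntryModifiedEvent}

// The server calls back into this object whenever an entry is created
// or modified, so downstream code can react immediately instead of polling.
@ClientListener
class AverageListener {
  @ClientCacheEntryCreated
  def created(event: ClientCacheEntryCreatedEvent[String]): Unit =
    println(s"new average for ${event.getKey}")

  @ClientCacheEntryModified
  def modified(event: ClientCacheEntryModifiedEvent[String]): Unit =
    println(s"updated average for ${event.getKey}")
}

object ListenerSketch {
  def main(args: Array[String]): Unit = {
    val manager = new RemoteCacheManager() // assumes a server on localhost
    val cache = manager.getCache[String, Double]("averages")
    cache.addClientListener(new AverageListener)
    Thread.sleep(60000) // keep the JVM alive long enough to receive events
    manager.stop()
  }
}
```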
And another nice feature is the implementation of distributed streams: basically the Java streams API, but over distributed data. So if you have an application which only does, for example, some MapReduce, you can avoid using Spark and run it directly on Infinispan, and in that case you completely avoid doubling the data. Of course Spark provides much more functionality, because it's a dedicated tool; this is only for the simpler use cases where you can skip Spark entirely.

So what do we get for our big data? We try to keep the data in memory all the time, which should speed up both the processing and the exchange of data between applications; we process the data as it arrives; and the results are pushed to the users immediately as we have them, for example via a continuous query. So can we call it fast data? Well, I don't know, because DevOps Borat stopped tweeting several years ago, so I have no definition of fast data. Actually, I think it doesn't matter what you call it: if it makes you happy to call it fast data, let's call it fast data. What really matters is whether it makes your users happy, and if your users are happy, you will probably be happy too. I believe that applying these techniques will make your users happier, and that is what matters. I'm not saying it's a silver bullet which fits all use cases; these are just a couple of thoughts on how you can do better, and now it's up to you to think about your application and decide whether it can benefit. If yes, the tools are ready, so just try to use them.

I don't have much time left, so: a demo. It's a really trivial hello-world example, and I will run through the code very quickly, but you can download the presentation, go to my GitHub, and find the code there; it's well commented and really few lines of code, so you shouldn't have a problem understanding it. Imagine you have a network of sensors which measure temperature and send it to some gateway, which stores the measurements into Infinispan; then you want to compute the average temperature for every place and push the results to the user. Basically it does nothing useful, but it's a hello-world example, and like every hello-world example in the cloud, it unfortunately has several components. The first one is the Infinispan server, which stores all the incoming and outgoing data. Then there is one application which simulates the network of sensors: it randomly picks a capital city in Europe, randomly generates a temperature, and sends it to Infinispan (a rough sketch of this component follows below). Then there is Spark for processing the data and computing the temperature average for every place, and finally there is a client application. If I wanted to be really in the cloud, I should have an Infinispan cluster and a Spark cluster, but let's skip that for now.

Oh, sorry. So I'm starting the Infinispan server; my Spark server is already running. Where do I have it? It's here: you can see I have one master and one worker, everything runs on my localhost. Infinispan is running, so now I will start generating the temperatures; as you can see, it generates a random city and a random temperature. I will also start my client here; I'm passing some arguments to it saying that I'm interested in the temperature only in Prague and Vienna, for example, so that I don't get all the changes. Now it waits until it gets something, and here I will start the Spark streaming job, which will listen to Infinispan and process the data as it arrives. It prints how many items are in every batch of the stream.
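A rough, hypothetical sketch of what such a sensor simulator could look like; the city list, cache name, and timing are invented, and the real code is in the speaker's GitHub repository:

```scala
import java.util.concurrent.ThreadLocalRandom
import org.infinispan.client.hotrod.RemoteCacheManager

object SensorSimulator {
  private val cities = Seq("Prague", "Vienna", "Berlin", "Paris", "Madrid")

  def main(args: Array[String]): Unit = {
    val manager = new RemoteCacheManager() // assumes a server on localhost
    val cache = manager.getCache[String, Double]("measurements")
    while (true) {
      val rnd = ThreadLocalRandom.current()
      val city = cities(rnd.nextInt(cities.size))
      val temperature = rnd.nextDouble(-10.0, 35.0)
      // A unique key per measurement, so earlier readings are not overwritten.
      cache.put(s"$city-${System.nanoTime()}", temperature)
      println(s"$city $temperature")
      Thread.sleep(500)
    }
  }
}
```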
It generated around ten items. Let's take a look: here you can see that I get updates for Vienna and Prague as they come in; it keeps changing. Now let's take a quick look at the source code. Here is the client; as I mentioned, it's just a simple cache listener for Infinispan. The only thing I need to do is add the annotations for cache entry created and cache entry modified, and that's all; then I implement my logic, which only checks whether the updated city is among the watched cities and, if so, prints something to the output.

As for the Spark application, it's simple. It registers to Infinispan and creates an Infinispan DStream from it, then does some processing, like extracting the city and the temperature. As I mentioned, in Spark Streaming everything is done in micro-batches, so you can get several measurements for the same place in one micro-batch; so you group by the key and recompute the average with a little math. Obviously, since it's a stream of data, I need to keep the previous state somewhere: I keep the number of measurements done during the whole period and the sum of the temperatures, and from those I compute the average. Spark provides a nice feature for exactly this, called state: you can run map with state (a sketch follows below). So for each city I store this running sum, and for every measurement I update the sum, update the state, and from this update compute the average and return it into a new stream of averages. Later on I take this stream of averages and write it back to Infinispan, which stores the data and fires my listener in the other application, which prints it to standard output. It's that simple. Again, I'm sorry, I'm running out of time, so please check my GitHub and go through the code if you want more.

So, what to keep in mind: think about whether you can keep the data in memory all the time, whether you can process the data as it arrives, and, if possible, keep the data in memory during the processing of the whole application stack. Infinispan and Spark provide really nice features you can use for this. These are some thoughts you can think about, and if you have any questions... I prepared some, with answers, but if you have other questions, we have a couple of minutes. Thank you.

I would like to ask: what's the structure of the RDD when Infinispan stores the data? Does it hold just the keys into Infinispan, so that when the RDD is transformed, it transforms not the data but just the keys? No, it contains keys and values, so pairs of key and value for RDDs; for a stream it's the key, the value, and the type of the operation, that is, whether the entry was removed, updated, or created. And are the transformations stored in Spark? Actually, transformations are not stored anywhere; the data is processed only when you call some action on it, and the results then live in Spark memory, and if you want, you can store them back to Infinispan.
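Going back to the averaging logic for a moment, here is a minimal, self-contained sketch of Spark's mapWithState in Scala; the socket source and the line format are assumptions standing in for the Infinispan DStream used in the demo:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object RunningAverageSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("running-avg").setMaster("local[2]"), Seconds(1))
    ssc.checkpoint("/tmp/running-avg") // mapWithState requires a checkpoint dir

    // Hypothetical input: lines like "Prague,21.5". In the demo this
    // would come from an Infinispan DStream instead.
    val measurements = ssc.socketTextStream("localhost", 9999).flatMap { line =>
      line.split(",") match {
        case Array(city, temp) => Some((city, temp.toDouble))
        case _                 => None
      }
    }

    // State per city: (count, sum) across all micro-batches so far.
    val trackAverage = (city: String, temp: Option[Double], state: State[(Long, Double)]) => {
      val (count, sum) = state.getOption().getOrElse((0L, 0.0))
      val updated = (count + 1, sum + temp.getOrElse(0.0))
      state.update(updated)
      (city, updated._2 / updated._1) // the new running average
    }

    measurements.mapWithState(StateSpec.function(trackAverage)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```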
Thank you. I would like to ask: if I have multiple instances of Spark and multiple Infinispan nodes, how is it guaranteed that every single item is processed only once? Because I have replicas in Infinispan, so how is that guaranteed? It's guaranteed by the primary owner in Infinispan: the connector basically reads each entry only from its primary owner, and only if the primary owner is not available does it talk to the replicas; so basically it takes the data only from the primary owners. Thank you, Wojtek, for a great presentation. Okay, thank you; if you have any other questions, catch me here. Thank you.