Hello, I think it's time. My name is Sveta Smirnova, and you see another name on the slides: Alexander Rubin, Principal Architect at Percona. He is actually the person who did most of the work on these slides and on the tests; he measured the results shown here. Unfortunately, he could not come, so it is only me here.

Okay, so what will we be talking about today? I will be talking about Apache Spark: why and how to use it. I will focus on how to use it together with MySQL, and for the examples we use the Wikistats dataset, a freely downloadable database of Wikipedia page access statistics that is about 10 terabytes in size.

So what is Spark, compared to MySQL? You see this picture of how MySQL works. First your application sends a query; then MySQL has to parse it, validate it, optimize it, and send it to the storage engine. If it needs to save results, it has to write them to the file system and return them. One process does everything, so this is a lot of work. Spark, in contrast, does only the computing. It delegates storage to third parties: Amazon S3, local file systems, MySQL; it supports many, many backends. It only computes, but it does so in parallel, and this is why you need it.

So why do you need it? The most common use case is processing big data in a single query, which is not the best thing for MySQL to do, and I will explain why later. There is an old FAQ item about Spark: that you have to have enough memory for the full data set. This is not true. Spark processes data in memory, but mostly the way MySQL does with its caches. It also has direct access to data sources, including MySQL, and can access data in distributed file systems,
such as Hadoop HDFS, as well as local file systems. It has native Python and R integration, but the more important thing is that it is massively parallel. Spark can compute in multiple threads, and it can also distribute computation across multiple nodes. So you can scale on a single machine across multiple cores, and you can also scale across multiple machines. That is where the parallelism comes from.

So let's compare Spark with MySQL. Imagine you need to process a big query with MySQL. What will you need? You will have indexes; you will want to have indexes. You will also have partitioning, and for a single query MySQL will read data from only one partition, if you created your partitions wisely. Sharding is also possible; it is not built in, but you can implement sharding with MySQL. Spark, for its part, has no indexes, and I will describe why later. It has partitioning, but not partitioning in the MySQL sense: you can partition your data and perform the calculation on every partition in parallel. Instead of sharding it has a kind of map-reduce architecture: it sends processing requests to different nodes.

So why no indexes? What is the biggest issue with MySQL? One query is one core. MySQL can use multiple cores if you have many queries, but it cannot use multiple cores for a single query. If you have terabytes or petabytes of data in a table and only one query, you will use one CPU core no matter how many cores your machine has. Spark has no indexes; practically everything is a full table scan. But this full table scan scans parts of the table in multiple threads and in parallel.
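The single-query parallelism described above can be sketched in a few lines of plain Python. This is not Spark code; it is a toy stand-in with invented data, where thread workers play the role of Spark tasks, each full-scanning its own data split, with a combine step at the end:

```python
from concurrent.futures import ThreadPoolExecutor

# Fake "table": one million rows, pre-split into 4 partitions.
rows = list(range(1_000_000))
partitions = [rows[i::4] for i in range(4)]

def scan_partition(part):
    # Each worker full-scans only its own partition,
    # like one Spark task scanning one data split.
    return sum(1 for r in part if r % 7 == 0)

# Scan all partitions in parallel (workers stand in for Spark tasks).
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(scan_partition, partitions))

# Combine step: merge per-partition results into one answer.
total = sum(partial_counts)
print(total)  # 142858 multiples of 7 in 0..999999
```

The same shape scales from threads on one machine to tasks on many nodes; only the combine step has to gather results across workers.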
So if you have one machine with multiple cores, the data can be split into parts and each core scans its part in parallel. If you have multiple nodes, you can split the data across them differently, again perform the calculation on those nodes, and again each node can calculate in parallel. This can lead to higher latency, because after the scan the partial result sets have to be combined into a single result, and that latency depends on how much the nodes have to communicate, as in any cluster.

So why doesn't Spark use indexes? Well, if you work with indexes in MySQL, I think you can imagine what happens if you create an index over a petabyte of data, or update that index, or just read a terabyte-sized index. Not fast. A full scan in parallel can be much easier.

Okay, so the ETL pipeline. What is Extract, Transform, Load? With MySQL it works like this: you extract data from the source and transform it before loading, because the data must be valid and must be in a format MySQL understands, and only then can you insert it into MySQL. So you will spend a lot of time on the transform, and also on the load, because you will be loading terabytes of data, and only after that can you actually process it. With Spark you again extract the data, but you don't transform it before loading: you load the data as you get it.
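The schema-on-write versus schema-on-read contrast developed over the next few paragraphs can be shown with a tiny stdlib Python sketch (hypothetical rows, not the talk's code). Schema-on-write validates every row before storing it and the whole load fails on bad input; schema-on-read stores raw lines as-is and simply rejects rows that fail to parse at query time:

```python
raw_lines = ["en Main_Page 1000", "en Spark 42", "garbage-line", "de MySQL 7"]

# Schema-on-write (MySQL-style): validate and convert before loading;
# one bad row aborts the load.
def load_schema_on_write(lines):
    table = []
    for line in lines:
        lang, page, hits = line.split()      # raises ValueError on "garbage-line"
        table.append((lang, page, int(hits)))
    return table

# Schema-on-read (Spark-style): keep lines as-is, parse on read,
# and cheaply reject anything malformed.
def read_schema_on_read(lines):
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue                          # just reject the bad row
        yield (parts[0], parts[1], int(parts[2]))

try:
    load_schema_on_write(raw_lines)
except ValueError:
    print("schema-on-write: load failed")

good = list(read_schema_on_read(raw_lines))
print(len(good))  # 3 valid rows survive; nothing to roll back
```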
So you don't need to spend time there, and after returning results to clients you can transform, analyze, and visualize, all of it in parallel.

That is the key difference: MySQL is schema-on-write. That means that when you load data, it is validated, it may perform conversions, sometimes implicit ones, and the load will fail if something is wrong with your data. Also, modern MySQL storage engines such as InnoDB are transactional, so the load can be rolled back, which takes time; and if you end up with only part of the data loaded, it can be a nightmare. Spark is schema-on-read. There is no load step per se: you can just rsync the files to the nodes, or simply read the data where it is. It validates data on read and transforms data on read, and even if something goes wrong at that step, it is much cheaper to reject: you just reject the bad rows, you don't have to discard anything.

For our example we took the Wikistats database, which is available from this URL. It contains Wikipedia page access counts; the raw files add up to more than 10 terabytes. Here is how it can be loaded into MySQL: LOAD DATA INFILE ... INTO TABLE ... FIELDS TERMINATED BY ... It also includes a transformation, which slows things down a little. How long does it take? Per file, it takes about 50 seconds for InnoDB and about 11 seconds for MyISAM. So one hour of stats takes about a minute to load, one year takes about six days, and six years takes more than a month just to load.

Loading Wikistats into Spark is much easier: just copy the files to any storage you like, and then create an SQL structure over them, or create RDDs, or aggregate and transform the data and load it, for example, into MySQL, while processing in parallel. So you load into MySQL not all 10 terabytes, but only the data you need. Here is an example, though it is not our example.
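The load-time arithmetic above can be sanity-checked with a couple of lines, assuming (as the Wikistats dumps are structured) roughly one file per hour of statistics and the per-file InnoDB timing quoted in the talk:

```python
# Wikistats publishes roughly one dump file per hour of page views.
innodb_sec_per_file = 50                 # quoted InnoDB load time per hourly file

hours_per_year = 24 * 365
year_load_days = hours_per_year * innodb_sec_per_file / 3600 / 24
six_years_days = 6 * year_load_days

print(round(year_load_days, 1))          # about 5 days for one year of stats
print(round(six_years_days, 1))          # about a month for six years
```

This matches the rough figures on the slide: around a minute per hour of data, days for a year, and over a month for the whole data set.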
It is an example from Spark Summit about loading data into Spark without MySQL involved: scanning four and a half terabytes of data took 45 seconds. Impressive. Why? Because of pipelines. It is best to compare them as pipelines: MySQL is one pipeline, while Spark is many pipelines in parallel, so the throughput is much bigger; much more work gets done at the same time.

So how does Spark work here? On the left, in the green boxes, is how MySQL would process the query. First we run SELECT; for Spark this is just reading lines from a file, nothing more. Then we run WHERE; for Spark this is the mapping step, deciding which lines we are going to keep. Then GROUP BY, which is practically the data processing itself. And for OUTFILE, in the first row we write to MySQL, and in the second row we can write, for example, to Parquet, a columnar storage format; there are many backends we can save data to.

Here is an example of how to save results to MySQL. I think it is pretty easy. And here is what happens after you run this code: the SHOW PROCESSLIST output, which I think explains a lot. Spark splits all the work into multiple requests, and they run massively in parallel. So now, with Spark, MySQL can finally exploit multiple CPU cores, and here is the confirmation: multiple CPU cores in use, with pretty simple code.

Besides that, there is the PySpark shell, which allows you to monitor the Spark job.
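The green-box correspondence (SELECT is a file read, WHERE is a map/filter, GROUP BY is the aggregation, OUTFILE is the write to a backend) can be sketched in plain Python. This is a toy stand-in for the Spark stages, with invented data, not the code from the slides:

```python
from collections import defaultdict

# SELECT -> read lines from a file (here, an in-memory stand-in)
lines = [
    "en Main_Page 10",
    "en Spark 5",
    "de Main_Page 3",
    "en Spark 7",
]

# WHERE -> map/filter: parse each line and decide which ones we keep
records = (line.split() for line in lines)
wanted = (rec for rec in records if rec[0] == "en")

# GROUP BY page, SUM(hits) -> reduce: aggregate per key
totals = defaultdict(int)
for lang, page, hits in wanted:
    totals[page] += int(hits)

# OUTFILE -> write results to a backend (MySQL, Parquet, ...); here just print
print(dict(totals))  # {'Main_Page': 10, 'Spark': 12}
```

In real Spark each of these stages runs on many partitions at once, which is exactly why MySQL on the receiving end sees many parallel connections in SHOW PROCESSLIST.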
It has a nice graphical user interface, so you can see what is going on, and you can practically find out what happened with a particular job. This job here failed because the table was full, but you can read exactly what happened.

Okay, so we put all the data, all of Wikistats, in, and now let's try to use Spark for reads, for data analysis. For this comparison we want to find the 10 most frequently queried wiki pages in January 2008. With MySQL, receiving just these 10 rows took one hour and 22 minutes. With Spark, the last time we ran it, it took 20 minutes. Impressive.

Okay, so what else? There is also Apache Drill; you can also use Apache Drill. It is a slightly competing solution, but they can also be used together. It allows you to access data sources differently: it can query MongoDB with SQL, and it can query both MySQL and MongoDB and combine the results together. So it is another great tool for data analysis.

Just to recap: how are MySQL and Spark different? Here are a few points. MySQL searches the full data set, which may be pre-filtered but not aggregated. For Spark, the data set is whatever is already there: it can read any data set prepared by an application, straight from files.
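The "top 10 pages" query above boils down to a per-page sum followed by a top-N selection over the aggregated counts. A minimal stdlib sketch of that final step, with invented numbers rather than the real Wikistats rows:

```python
import heapq
from collections import Counter

# (page, hits) pairs as they might come out of the per-partition aggregation
pairs = [("Main_Page", 1000), ("Spark", 300), ("MySQL", 700), ("Spark", 250)]

# Reduce step: sum hits per page across all partitions.
totals = Counter()
for page, hits in pairs:
    totals[page] += hits

# Top-N step: a heap keeps this cheap even with millions of distinct pages.
top2 = heapq.nlargest(2, totals.items(), key=lambda kv: kv[1])
print(top2)  # [('Main_Page', 1000), ('MySQL', 700)]
```

The win for Spark in the benchmark comes from the reduce step running on all partitions in parallel, while MySQL has to produce the same sums on a single core.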
It can be filtered on only one site, for example if you shard your data geographically. MySQL does not support parallelism when you run a single query; Spark supports parallelism at every level, from multiple cores on a single machine to multiple nodes.

MySQL is based on indexes. Spark is not based on indexes, but because it does everything in parallel, that is not a drawback: it works well for big data.

As for column storage: MySQL doesn't really have it. Actually there is a columnar storage engine, for MariaDB I think, but in any case the data has to be converted into the columnar format, so it sits somewhere in between. With Spark, if you want column storage you can get it just by specifying a format such as Parquet, which you cannot do with MySQL.

Partitioning in MySQL is for a single query: partitions are used only to select the data you access, and if you need to process data from two partitions, you will process one partition and then the other. With Spark partitioning, you process data from both partitions at the same time and then combine the results.

So in my opinion, if you need to work with big data, if you need to do big analysis, combining these two together gets you the best result: the luxury of SQL, which is easy to use and which we prefer, plus massive parallelism, which makes such queries really efficient. Thank you.

Actually, to answer that: you can use almost the same dialect as MySQL. It is not exactly the MySQL dialect, but it is close.

Thank you, Sveta. And thank you, Alex, if he sees us on the stream.