Okay guys, I think we should kick off. It's not really a formal presentation, but we have some content to cover, and this time around we won't be doing any of the longer coffee breaks, so if you really want or need to leave in an hour or so, or whenever, just walk out — no problem at all, we all understand everyone has their own priorities and plans with the family. My name is Arseny Chernov and I'll be doing the first talk today. There's also Alina Melnikova from Refinitiv, who will be doing the second talk. Before we get started I just wanted to reiterate once again the importance of the RSVP, or at least giving us a heads-up if you're not coming. Today was a perfect day: we ordered exactly the right amount of pizza, everyone had a slice, and we'll probably increase the number of vegetarian pizzas next time. By the way, if you have any feedback on what's good or bad about how we organize things, please let us know, we're very open. Now, a big shout-out to engineers.sg, who will be recording the talk today, so the folks who dropped off or couldn't come at the last minute will be able to watch the recording. If you have an extra couple of minutes during your commute today or tomorrow, send a five-dollar voluntary donation to engineers.sg — they're a really great outfit, doing this completely voluntarily, coming around to the meetups, recording and uploading them. We also have an announcement from Refinitiv, because Alyona works for Refinitiv: next week is FinTech week in Singapore, and Refinitiv Labs will have an open day, so if you want to learn more about Refinitiv, contact Alyona — we'll show that slide again later. Please come over to their offices, they're pretty cool. Also, just as a reminder, there's a place where you can ask any kind of question about Apache Spark, on any platform, in any form: our Singapore Spark Slack channel. Please don't hesitate to connect with us there — there are always 15 to 20 people online; it's not very active because everyone sends direct messages for some reason, but just throw anything you want to know about Spark or supporting technologies into the general channel and we'll try to answer as quickly as possible. And as always, a very big thank-you to AWS for hosting us today — thank you, Gabe, really cool, thanks a lot. There was a fridge full of soft drinks at exactly the right temperature, which is the best meetup highlight so far. Okay, so what we're going to cover in the first talk is working effectively with Apache Spark on AWS — let me pull up a few slides; I'll try to go as light on slides as possible, it's more about sharing best practices with you. Just a quick show of hands: how many of you have heard of Databricks before? Whoa, that's really cool — okay, awesome, I can go home, my job is done. So the idea behind this talk is to show you patterns that might not be known to you: patterns of using AWS native services together with Databricks, and how we work closely together. Databricks' mission, as you know, is to help data teams solve the world's toughest problems by analyzing data and making predictions out of it.
We do that at Databricks by offering a managed service, the Unified Data Analytics Platform, and that platform allows multiple functions within an organization to collaborate, to create value out of the data quickly and then productionize it — whether it's a pipeline, a live streaming inference, or something else. It can all be done within Databricks, across the data engineering department, the data science department, the data analysts and the actual business users, so everyone can collaborate on it. We're a global company that just crossed the thousand-employee mark; we have 5,000 customers and 450 partners. My role at Databricks is closely related to partners, so if there's something you want to find out about the partner program, catch up with me later. The story of Databricks is very neatly tied to open source: the company was founded by the creators of Apache Spark, and apart from Apache Spark there are a few open-source projects you might have heard about. Quick show of hands: who has heard of Delta Lake? Yeah — there's more work for me to do here, okay, awesome. And how about MLflow, who has heard of MLflow? Okay, that's pretty cool. These are open-source projects incepted by the same team of founders who created Apache Spark, but during their tenure at Databricks, and Delta Lake, for example, is now in the Linux Foundation, so it's a completely open project. When we speak about Databricks on AWS — and we actually started on AWS many years ago — we're able to provision an environment where your data scientists, data engineers and data analysts all collaborate. That's the Unified Data Analytics Platform, and the environment is provisioned within your AWS account, within the best practices established in your company, so we're not taking any data or any compute out of your existing AWS deployments; everything is layered on top of your existing AWS environment. I'll be progressing into more and more technical topics, so please interrupt me at any point if you need to clarify something — let's keep it very informal. And for everyone who asks a question I have a small present; I have maybe a whole bag of presents — let me demonstrate, this is a real bag of presents. You can guess there are some t-shirts in it, because otherwise what would a full bag of presents be, right? And I think there are some power adapters. So please interrupt me, raise a hand, I'll repeat your question and try to answer it straight away if I can. The Unified Data Analytics Platform consists of three layers. The bottom one is the cloud service: we take away the management of all the EC2 instances that, as you can imagine, are still needed to run the compute. That is completely provisioned for you, so you don't need to worry about any kind of DevOps. The Spark clusters are deployed on demand and they're ephemeral in nature — you can of course leave a Spark cluster running for some time, you can disable the auto-termination and have an always-on cluster, but the idea behind Databricks is to provision as much ephemeral compute on demand, only when it's needed, as possible.
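This isn't something shown on the slides, but to make the "ephemeral, auto-terminating cluster" idea concrete, here is a minimal sketch of creating such a cluster through the Databricks Clusters REST API; the workspace URL, token, node type and runtime-version string are placeholders and assumptions, not values from the talk:

```python
# Minimal sketch: create an auto-terminating, autoscaling Databricks cluster
# via the Clusters REST API. Workspace URL, token, node type and runtime
# version key are illustrative placeholders.
import requests

WORKSPACE = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "ephemeral-etl-demo",
    "spark_version": "5.5.x-ml-scala2.11",   # a Databricks Runtime ML version key (assumed)
    "node_type_id": "i3.xlarge",              # NVMe-backed instance type
    "autoscale": {"min_workers": 1, "max_workers": 8},
    "autotermination_minutes": 30,            # cluster terminates itself when idle
}

resp = requests.post(
    f"{WORKSPACE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```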
To emphasize that: the Databricks Runtime, which is a combination of multiple packages and our enhancements to open-source Apache Spark, lets you dramatically reduce the time it takes to complete an ETL job or do a massive join. There are many tests that prove that, but you don't need to take my word for it — you can just try it out if you want to. So the idea behind the Databricks Runtime plus the Databricks cloud service is to provision ephemeral clusters that are basically Spark applications. We don't stand up an entire Hadoop environment to run Spark; it's a standalone Spark cluster with a driver and workers, and you can run whatever workload you'd naturally expect a regular Spark cluster to handle. On top of that there are multiple adjustments — for example, on AWS we have our own connector that is far faster and far more productive for working with Redshift — enhancements here and there that are part of the Databricks Runtime and that by themselves accelerate ETL jobs dramatically. The third layer on top of that is the Databricks Workspace. This is where the magic of collaboration happens between the data scientists, the data engineers and the data analysts who build the dashboards. It's a combination of our own notebook format, the ability to use MLflow in a hosted manner, the ability to run Jupyter notebooks, and so on — all collaboratively, similar to what Google Docs or the Office 365 collaborative platforms let you do for regular text, except you do it for your valuable code that works with the data. That's the very high level; I just want to make sure we're on the same page. Now, speaking about SageMaker — quick show of hands, who has worked with AWS SageMaker in any capacity before? Yay, we have a few people, that's cool. So AWS SageMaker — by the way, did you notice that slide effect? I spent a couple of minutes practicing it, I'm really proud of it — AWS SageMaker is a native service for an end-to-end machine-learning architecture: it lets you build, train and deploy models, and provision endpoints for RESTful, microservice-type consumption — inference from the model for downstream applications. On the line below — I'm not sure whether the folks at the back can see it — I'm typing `aws sagemaker help`, and it prints the help for the commands under the SageMaker tree in the AWS CLI. There are 78 subcommands related to SageMaker, each with, on average, a few sub-subcommands of its own, so it's a really big platform; the UI, by comparison, is much leaner, and I'll demonstrate it later. The part of SageMaker that is really finding its use is real-time inference. In the architecture you can see it's backed by the SageMaker hosting component, which runs a trained model — basically a container pulled from ECR, with a load balancer in front of that set of containers. SageMaker lets you send a RESTful request and receive the predicted value back; it's a very convenient way to build inference pipelines in a microservice, mesh-like fashion. Pretty awesome.
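Just to make "send a RESTful request, get a prediction back" concrete — this isn't from the demo itself — here is a minimal sketch of calling a SageMaker real-time endpoint with boto3; the endpoint name and payload schema are made up for illustration:

```python
# Minimal sketch of real-time inference against a SageMaker endpoint.
# The endpoint name and JSON payload shape are hypothetical placeholders.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="ap-southeast-1")

payload = {"features": [7.4, 0.7, 0.0, 1.9]}   # whatever schema the model was trained on

response = runtime.invoke_endpoint(
    EndpointName="my-demo-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
prediction = json.loads(response["Body"].read())
print(prediction)
```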
But it also has a lot of different components inside it, and the training component in particular relies heavily on ECR as well. You need to write a bunch of wrappers for training a model: it takes the data from somewhere, the data has to be in an Amazon format, it applies the model code you compose, trains the model, the model is stored in an Amazon format, and then you can push it into SageMaker hosting. That, very quickly, is the idea behind SageMaker. Now — look at this, I really enjoy this transition — SageMaker also offers an SDK for Spark. That SDK has been around since about 2017; the GitHub repo you see here gets a commit maybe every five weeks or so, and a lot of those are basically release-note changes. The idea behind the original SDK was to provide a way, from Spark, to train a model of a given type on SageMaker. I've highlighted the boxes here: to do, for example, k-means clustering, you take the data from the DataFrame in Apache Spark, serialize it out into an Amazon format using the SDK's serializers for that specific AWS format, it gets picked up by SageMaker training, trained, stored as a model, and served through hosting. It's not really the optimal path when you already have an existing Spark cluster, probably in production, that you want to share across multiple applications — but it's there, and it has its place. The pattern I highly recommend, the architecture pattern, is to use MLflow on Databricks — you can also use open-source MLflow; it's just that the hosted one is provisioned for you and you don't need to configure it — and in particular its tracking functionality, for the alternative reference architecture; I'll demonstrate it in a second. mlflow.sagemaker is the entry point for serving, for deploying, the model that was trained on Spark. You can use spark.ml, or spark.mllib if you prefer the older machine-learning component of Spark; you can take a model that was trained on Spark, using data that was prepared on Spark, and deploy it onto a RESTful SageMaker endpoint, avoiding any hassle. It really is quite simple code — we'll walk through it — and you don't need to know all those CLI subcommands to configure things. All you need are the nitty-gritty details of which IAM role has to be attached to your Spark cluster so it can upload the model into SageMaker; that's it, everything else SageMaker will take from MLflow, you don't need to do anything else. MLflow itself consists of four parts. At this point I'm talking about the leftmost one, Tracking. There's also Projects, which is a black box around any arbitrary Python code or Docker — not really the topic we want to discuss today. Models is the common-denominator format that lets us do this magic with SageMaker: it's an MLeap-inspired format for exporting from multiple arbitrary, proprietary formats — scikit-learn has its own format, TensorFlow has its own model format — and the MLflow Models component lets us export all of those into one common form.
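As an aside, a minimal sketch of that "common denominator" idea: log a model under a specific flavor, then load it back through the generic pyfunc interface. The model, data and paths here are purely illustrative, not the talk's actual demo:

```python
# Minimal sketch of the MLflow Models idea: a model logged with one flavor
# (scikit-learn here) can be loaded back through the generic pyfunc API.
# The data, features and run are illustrative placeholders.
import mlflow
import mlflow.sklearn
import mlflow.pyfunc
import pandas as pd
from sklearn.linear_model import ElasticNet

X = pd.DataFrame({"fixed_acidity": [7.4, 7.8], "alcohol": [9.4, 9.8]})
y = [5, 6]

with mlflow.start_run() as run:
    model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")   # sklearn flavor

# Any downstream consumer (SageMaker's serving container included) only needs pyfunc.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X))
```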
Yes, there was a question — MLflow actually has a lot of flavors for models, and it's pretty much the same idea for each: MLflow provides a kind of common denominator, together with tracking. To track a model you need to store it in some form, so there's a component called a flavor that picks the model up from your preferred format — say scikit-learn — and stores it in that composed, common-denominator form. There isn't that much going on in the model-format space itself; it's just the one common denominator, and most of the attention is in the tracking space. And we've just released the Model Registry — for those of you who didn't see the announcement, the Model Registry is awesome, it's pretty cool. The idea behind MLflow is to give you the full machine-learning lifecycle that, to date, only large organizations could afford: some sort of framework where data scientists submit a model, some other person with segregated duties tests the model, then someone else pushes it into production — multiple roles, multiple versions, everything segregated and checkpointed in the Model Registry. We just released this component; it didn't exist until maybe a few weeks ago. MLflow is also an open-source product: you can pip install mlflow and run it completely on your own Linux machine or on macOS. A really cool metric is how long it took MLflow to reach a certain number of contributors. In this chart, on the left-hand side, you can see about 80 contributors to MLflow within just the first 10 months; the yellow line is PyTorch and the red one is Apache Spark, and Apache Spark took about three years to reach the number of contributors MLflow already has. And an incredible number of downloads — 800,000 downloads a month from PyPI. That of course includes some CI automation, but who cares, it's still a big number, bigger than many frameworks. Yes, there's a question — about SageMaker and composing this as a microservice, correct? Okay, I think it was a conceptual question; let me repeat it. The question was: since we can train a model within the Spark environment and then push it into SageMaker, what happens to the data? You have to assume that the downstream microservice consuming the SageMaker endpoint will use the same schema as your training and validation data — that's pretty much the idea behind it. I'll show it in the code; we'll do a quick code walk and it will be evident. But wait a second — so Gabe gets... no, you're done, okay. You asked the first question, so you get a small prize; can you pass the prize back, if you don't mind? This is the coolest prize, the first power adapter. And someone else had a question here? No? Raise a hand if... okay, I'm probably daydreaming. So, the MLflow Model Registry — this is what I was saying: you get the ability to segregate duties, which matters especially in today's highly regulated environments. I'm not saying the government won't catch up with machine learning; it will. Just as we have the Monetary Authority of Singapore's Technology Risk Management Guidelines, there will be something like that for machine learning very soon — maybe, maybe not.
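Since the Model Registry was brand new at the time of this talk, here is a minimal sketch of registering a logged model and promoting it between stages, assuming MLflow 1.4 or later with a registry-enabled tracking server; the run ID, model name and stages are placeholders:

```python
# Minimal sketch of the MLflow Model Registry workflow (MLflow 1.4+ assumed).
# The run ID, model name and stage values are placeholders for illustration.
import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run-id-of-a-logged-model>"

# Register the model artifact from a tracked run as a named, versioned model.
result = mlflow.register_model(f"runs:/{run_id}/model", "wine-quality")

# A second person (segregated duty) can later promote that version.
client = MlflowClient()
client.transition_model_version_stage(
    name="wine-quality",
    version=result.version,
    stage="Staging",        # later: "Production"
)
```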
But in many large organizations, the segregation of duties and the sign-off on individual components of the machine-learning lifecycle are very stringent, so the MLflow Registry is a big answer to that. And this is the idea behind the MLflow model format: we take flavors and we prepare the receiving side. For SageMaker, for example — you can see SageMaker here at the bottom, alongside Docker; they're correlated — SageMaker relies on a Docker container, and we have a container that is part of the SageMaker preparation. Not every time you push a model, but once, you need to push a specific Docker image: the receiving side of our model format. Then the magic of Conda and environment isolation takes care of the rest, and your XGBoost or whatever it needs stands up in a completely isolated environment. So it's taking one flavor as the source and putting it into the receiving end as the target — that's the idea behind MLflow. And of course you can close the loop: after you run inference on the data, you can send the feedback into Delta Lake, receive it in a streaming fashion, and once you deviate from the accuracy threshold that was approved when you pushed the model through the Model Registry, you retrain the model — because the model changes the behaviour of the downstream application and of the users. When we get more recommendations about what food to order, we inevitably start ordering more of the recommended food, so the training data becomes biased relative to the original model deployment. It makes sense to retrain the model, or release a new version of the application, and close the loop on that. Cool. So the demo is as follows — oh, I cannot see my screen; ah, right, okay, now I can see my screen, magic, here we go. For those of you who, like myself, haven't logged into the AWS console lately — it can actually log me out, so let me do that. SageMaker, there we go — come on, 4G; by the way, I anticipated this but didn't set up the guest wireless, so apologies, 4G is also being used in this demo. You messaged the password to me? My bad — there was a password, we tried brute-forcing it for about 20 minutes, then I gave up; this is really cool. So, the SageMaker service here — this is what it looks like from the AWS console. You have the endpoints, which are the actively serving RESTful endpoints: the load balancer that talks to the Docker image running on top of ECR, and that Docker image can run whatever you want, as long as it's a web HTTP server of some kind — inside our container I think we run Flask, but I'm not sure exactly what we run. This one is "InService". A lot of its configuration you can do here — tags and whatnot — and you can update the endpoint, meaning pick up the configured model from a configured path and redeploy it; that's pretty much it. Compared with all those CLI subcommands it's quite lean, I'm telling you — the configuration here is quite lean. The models themselves are stored in an interim location on S3 — this is the model that gets put on top of the receiving container — and the Docker images themselves are part of the configuration of the endpoint.
So those are the moving parts in the reference architecture. Now, what does it look like when we push to SageMaker from Databricks? This is a notebook attached to a cluster — a Spark 2.4.3 cluster, as you can see, called demo-whatever-5.5. It runs Databricks Runtime 5.5 ML, which includes a lot of different packages — XGBoost and many others — that are also smoke-tested to work flawlessly together. To prove it's actually a Spark cluster, you can open the Spark UI, which is integrated, and you can look at Ganglia metrics for RAM — trust me, this is a Spark cluster. It's not auto-terminating because I told it not to, so it's up and running. The notebook is one way to use Spark clusters; you can also schedule a notebook, and you can still do a spark-submit job if you have a JAR composed of something, and use it the same way. The idea here is that we're working with the wine dataset, the original old-school hello-world dataset of machine learning. We can mix and match commands: I can switch into the shell context, so everything after the %sh magic runs on the driver of my Spark cluster; I wget the dataset from somewhere on the internet, I can do whatever I want, it's literally a shell there. What I do afterwards is put the downloaded dataset into a centrally available location, at the level we call DBFS. It's actually S3 — we've just created an indirection layer so you don't need to remember mount points and access keys; you can set it up almost like on an old-school Linux server where you mount NFS storage. You have a root, which is the root S3 bucket, and onto that root bucket you can mount additional S3 buckets, many of them, and traverse them all in one unified namespace. That namespace is called DBFS; very convenient. The dataset itself is a delimited CSV, so let's load it up. There are certain characteristics of wine, and we want to infer the wine quality — an arbitrary metric that we train the model on — and at the end we want to infer it through a RESTful call to SageMaker. What we usually do is exactly what we're doing in this demo, plus we add the import of mlflow — by the way, this is a Python demo, but MLflow is also available in a bunch of other languages. What I do here is define what the name of the SageMaker app I'll push will be — it's used later on — and where the training dataset for scikit-learn will live, further down here. Then I define train_model, with parameters alpha and l1_ratio — I define the training function. What you see below is the start of an MLflow run. This is a very important thing, because it encapsulates a certain set of hyperparameters, some activity that happens with your preferred classifier or regressor or whatever you want to use, and it results in a model outcome. So a run comprises one specific attempt to train a model and get some results, and runs in turn constitute an experiment. If I go into the runs sidebar here, this is what MLflow tracking is all about: it shows me, okay, there was this run, with these hyperparameters, and my R² or my root mean squared error was such-and-such.
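The training function itself isn't reproduced on the slide, but a minimal sketch of its shape, assuming the usual scikit-learn ElasticNet on the wine CSV, looks roughly like this; the dataset path, column names and hyperparameter values are placeholders:

```python
# Minimal sketch of a tracked training function, in the spirit of the demo.
# Dataset path, feature columns and hyperparameter values are placeholders.
import numpy as np
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def train_model(data_path, alpha, l1_ratio):
    data = pd.read_csv(data_path, sep=";")
    train, test = train_test_split(data, random_state=42)
    x_train, y_train = train.drop("quality", axis=1), train["quality"]
    x_test, y_test = test.drop("quality", axis=1), test["quality"]

    with mlflow.start_run():                      # one run = one training attempt
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(x_train, y_train)
        preds = model.predict(x_test)

        mlflow.log_param("alpha", alpha)          # hyperparameters show up in the runs sidebar
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", np.sqrt(mean_squared_error(y_test, preds)))
        mlflow.log_metric("r2", r2_score(y_test, preds))
        mlflow.sklearn.log_model(model, "model")  # stored with the run's artifacts

train_model("/dbfs/tmp/winequality-red.csv", alpha=0.5, l1_ratio=0.5)
```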
So this is where I can go and double-click into my particular experiment's best result and look at its hyperparameters. In this notebook I'm not doing anything with AutoML, but of course we have AutoML solutions too. I can see a plot of my results — for example here I'm looking at the MAE; I can switch to RMSE, so I want the lowest RMSE here, 0.72, and what were the hyperparameters — something like 0.09 and 0.25. Or I can use the parallel-coordinates plot and find out — it's just a combination of luck, as usual, right? That's machine learning. So once I've selected a particular run, my model inside that run is stored alongside the metrics you've seen in the visualization. The way I log my model is that I provide a flavor — in this case scikit-learn; I say, okay, scikit-learn, log the model — and then I can take it out of MLflow tracking and deploy it on SageMaker. I could also just apply it as a UDF in Spark SQL if I wanted — here's my DataFrame, apply this, infer, easy — but the point here is the RESTful endpoint. So what does it take to deploy to SageMaker? You provide where the base receiving image of the model format is: this is the particular ECR repository that hosts my base Docker image. You create that Docker image from your command line — or you can have a systems-engineering pipeline that creates it every now and then — literally, with MLflow on my local laptop, it's `mlflow sagemaker` and, I think, something like, yes, `build-and-push-container`. What it does is build that Docker image, the receiving end — I need Docker running on my local machine — and push it to whatever is in my ~/.aws config... right, the credentials: whatever is configured as my default profile for the AWS CLI, MLflow will push the image into the preferred region's ECR with the needed tags. And then it also hides the creation of that endpoint using that image — you don't need to worry about it, it's completely done for you by MLflow. So this is where you specify: here's the image, and here's the model. The model points into a particular run — as I mentioned, you need to point at the run with the best hyperparameters, either programmatically or, in this case, just manually: this is the best run, get it out. And then this is where the magic happens: mfs.deploy — boom, you're going. That took about eight minutes, because it was actually a replacement job: I had to destroy the previous endpoint. I could have created another endpoint and done A/B testing if I were in production, but I wasn't, so I reused the same endpoint, and replacing the images in the couple of containers behind the SageMaker load balancer takes a little bit of time. But once it's up, you can use boto3 to check the status of the SageMaker endpoint, and you can see that it's InService.
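Pulling those pieces together, a minimal sketch of that deployment step, assuming MLflow 1.x where `mlflow.sagemaker.deploy` is the entry point; the app name, ECR image URL, run ID, IAM role and region are placeholders:

```python
# Minimal sketch of deploying a tracked model to SageMaker with MLflow 1.x.
# App name, image URL, run ID, IAM role and region are placeholders.
import boto3
import mlflow.sagemaker as mfs

app_name = "wine-quality-demo"
run_id = "<best-run-id-from-the-tracking-ui>"
image_url = "<account>.dkr.ecr.ap-southeast-1.amazonaws.com/mlflow-pyfunc:latest"

mfs.deploy(
    app_name=app_name,
    model_uri=f"runs:/{run_id}/model",
    image_url=image_url,        # pushed earlier with `mlflow sagemaker build-and-push-container`
    execution_role_arn="arn:aws:iam::<account>:role/<sagemaker-role>",
    region_name="ap-southeast-1",
    mode="replace",             # replace the existing endpoint, as in the demo
)

# Same status check the demo does with boto3.
sm = boto3.client("sagemaker", region_name="ap-southeast-1")
print(sm.describe_endpoint(EndpointName=app_name)["EndpointStatus"])  # "InService" when ready
```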
Then you basically create a string that answers your question about the data to be inferred — the actual production data. You create that string — most of these serving endpoints take a pandas-style JSON format — and send it over. As you can see, it's just a requests import here: I send this string to the endpoint and I receive back a particular wine quality, 5.3 or whatever — I don't even remember, it could be a completely scrappy model, I really don't know — but this is just to illustrate the ABC of SageMaker deployment. What's cool about it is that you've prepped the data, done all your joins and transformations in a pipeline, and it results in a RESTful inference on SageMaker straight away, right? Cool. Any questions? Yes — okay, so the question is: if you do one-hot encoding, or some other prep of the actual inference data, vectorization across multiple dimensions, and you want to send the result as a real number into the endpoint — then the endpoint has to have been trained that way. If your downstream microservice processes raw data in its raw dimensions, it has to do those transformations itself — definitely, inevitably. And you get a prize, thank you — here we go, can you pass that along if you don't mind? Thank you. And up there, it could be any Spark code: you can do graph processing, you can do Spark SQL, you can switch and do something in the shell, as you saw, and move the data around. It could be an hour-long pipeline before you deploy, all in one notebook, and then you schedule it. Scheduling a notebook is just this: you create a job. You can use Airflow — there are Airflow operators for Databricks — or any other arbitrary scheduler; there's a REST API for scheduling. But you can simply create a job, and there you go. Any other questions? How do we apply this model to streaming data? With a RESTful endpoint, the streaming data would be coming in as HTTP requests, so we're assuming non-Kafka, non-message-queue type data; we're assuming a RESTful call into SageMaker. But you can of course also stay within Spark and connect real-time inference, using Spark Structured Streaming, to Kinesis or Kafka or somewhere else, and then apply the model on each micro-batch using a UDF, a user-defined function.
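As a minimal sketch of that second option — staying inside Spark and scoring a stream with the tracked model wrapped as a UDF — assuming Structured Streaming over Kafka and the same logged model; the topic, schema and run ID are placeholders:

```python
# Minimal sketch: score a Structured Streaming micro-batch with an MLflow model
# exposed as a Spark UDF. Kafka topic, schema and run ID are placeholders.
import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("fixed_acidity", DoubleType()),
    StructField("alcohol", DoubleType()),
])

# Wrap the logged model as a Spark UDF.
predict = mlflow.pyfunc.spark_udf(spark, "runs:/<best-run-id>/model")

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "wine-features")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("row"))
          .select("row.*"))

scored = stream.withColumn("predicted_quality",
                           predict(col("fixed_acidity"), col("alcohol")))

query = scored.writeStream.format("memory").queryName("scored_wine").start()
```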
Cool, you get a prize — now it's time for a t-shirt; oh no, there's one more power adapter, yeah, please pass that along. Okay, cool. So that was the biggest chunk of the presentation. I want to show you a few more things — they won't take much longer — and then we'll pass over to Alyona, who will share the Spark 3.0 and file-sources announcements. Now, how about Redshift? Redshift is awesome: it's a known, trusted data warehouse that's been out there with thousands of customers doing a lot of transformations. Inside Redshift there are statistics about the queries running under the hood, and they're easily exposed — you can download them, you can log them. One of the things we noticed at Databricks is how customers actually use it. And it's a cluster, right — Redshift is a cluster with a leader and compute nodes; I should remember to use those terms — it's not master/slave anymore, right? What did 2019 bring us? No more master/slave; leader and compute, or whatever. So there's a leader and compute nodes, DC2 instances, and you may have a standby cluster — a significant data warehousing solution, all in all. Now, the internals of a data warehousing solution like Redshift — and this isn't unique to Redshift — start to show, as adoption grows, that there's a lot of transformation work going on under the hood. Instead of being just the data warehouse where the long-lived data is exposed for downstream business-intelligence reporting, it becomes an engine that munges the data, which is not optimal because, (a), it adds latency to the downstream queries, since there's more processing going on, and, (b), a data warehouse engine was not necessarily designed to do transformations at that scale, and so on. What we noticed — and this comes from several customers, I don't want to go into the details — is the red portion: the transform type of queries. Again, this metric is based on the type of query — something like a load is considered a transform. The metrics show that up to about half of the CPU time per hour is spent on prepping the data. So if you're not scaling the cluster, you pay in latency; if you are scaling the cluster, you pay in cost — the usual compromise. Now, how about this: there are quite a number of these, but you take the query, the logic that does the transformation, and run it as an ETL job on Databricks, on, say, hundreds of spot instances. With Redshift you would usually use reserved instances, but it's still effectively an always-on instance just churning through money. You can use the Delta Lake acceleration with the NVMe SSD caches on i3-type spot instances, do really crazy-fast transformations, and then load the results into Redshift. This is actually net-net improving the latency, improving the cost profile, and overall serving customer satisfaction better, based on what we've observed with some customers. Yes? — "So it was faster and cheaper to spin up a spot instance, wait for that instance to come up, put some data onto its NVMe disks to do the work, and then get that into Redshift?" — Well, hold on, that's not quite what this chart shows; I'm probably confusing you. The earlier question was whether these queries are consecutive — the preparation and then the business-intelligence query as the outcome of the transformation. No, no: these are concurrent queries within the hour. It's a shared resource — the Redshift cluster carries out some transformations while also serving the downstream BI queries; I didn't make that clear. — "So instead of having Redshift do both responsibilities, you moved the ETL-type work to a parallel cluster of spot instances, right, combined the infrastructure, and it still resulted in net cheaper and net faster?" — Yes, correct. That was a brilliant phrase, I need to repeat it: the net result of splitting the workload — moving the transformations to a spot-instance-backed Databricks cluster, and leaving the BI queries on Redshift, letting it work to the best of its design principles — is cheaper and faster. That's exactly right; that's the net result. Awesome — I need to record that; next time I'll take a note, I really appreciate it. So, the way it looks: we have a connector developed for Redshift.
Actually, there are two connectors you need on the Databricks cluster. One comes from AWS and is essentially a flavor of the Postgres JDBC driver, used to run JDBC queries against Redshift; the other is a connector that loads and unloads the interim data through a staging area on S3. Loading the data — I already ran this — bulk data, and we're talking gigabytes of nightly uploads of transformations or something like that, is easily done in parallel using many tasks and workers: think of those hundreds of spot instances. They all write to the scalable S3 storage, each uploading a small chunk of the interim format; Redshift is then told about it and ingests that data back in. You can still run JDBC queries for lookups — here I'm looking at a DataFrame that was just sourced from a Redshift cluster, and I can carry on with regular DataFrame transformations, or train a model, or do anything I want. At the same time, once I'm done, I can stage the changes — using Spark SQL notation — into that temporary S3 location, at gigabyte scale or whatever, and then merge them into the existing Redshift table through that temp folder. That lets us avoid the single choke point of a JDBC connection — even though there are some ways to parallelize JDBC, it doesn't work very well when you have several nodes on Redshift, and with Spark it gets chaotic, so it's easier to just dump the data into S3 and load it into Redshift from there; it works perfectly. As for assessing the queries: this is the notebook that produced that breakdown of query types. I didn't run it now because my Redshift clusters are all pretty much stagnant, but if you look into the actual query types, they're basically in Redshift's system tables — you can look up, over the last 24 hours, what the types of queries were. They're not exposed in the Redshift UI, but you can make sense of how you use Redshift using that.
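To make that read/stage/write pattern concrete, here is a minimal sketch using the Databricks Redshift connector; the JDBC URL, IAM role, table names and S3 staging path are placeholders, and exact option names can differ between connector versions:

```python
# Minimal sketch of the Redshift <-> Spark pattern via an S3 staging area.
# JDBC URL, IAM role, table names and tempdir are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:redshift://my-cluster.xxxxxx.ap-southeast-1.redshift.amazonaws.com:5439/dev"
tempdir = "s3a://my-staging-bucket/redshift-temp/"
iam_role = "arn:aws:iam::<account>:role/<redshift-copy-unload-role>"

# Read a table (unloaded in parallel through the S3 staging area).
df = (spark.read.format("com.databricks.spark.redshift")
      .option("url", jdbc_url)
      .option("dbtable", "public.sales")
      .option("tempdir", tempdir)
      .option("aws_iam_role", iam_role)
      .load())

transformed = df.groupBy("region").count()     # do the heavy lifting on Spark

# Write the result back; each task stages a chunk to S3, then Redshift ingests it.
(transformed.write.format("com.databricks.spark.redshift")
 .option("url", jdbc_url)
 .option("dbtable", "public.sales_by_region")
 .option("tempdir", tempdir)
 .option("aws_iam_role", iam_role)
 .mode("overwrite")
 .save())
```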
The third and last point about optimal usage of AWS and Spark is about these three: Glue, Athena and Amazon S3. — Yes, okay, there was a question: what's the difference? Ah, okay, that's a good question — what is the difference between Snowflake and Databricks? Architecturally, the way we see it, Databricks is in your AWS account and it is a Spark cluster. We have a better runtime, which is a combination of our proprietary investments into query optimizers and whatnot, but it's a Spark cluster, so you use the Spark you really know. More importantly, it sits on top of AWS in your account, so you can use IAM roles integrated neatly with your preferred storage, you can use single sign-on, you can use the SCIM API for groups — it's a component of your infrastructure that you add into your AWS account, and that's the most important architectural difference. As for what's behind the scenes in Snowflake — they're starting to reveal things, but I don't know what's inside their implementation. The way we get our value — we're a private company, with this hero number of a 6.2-billion-dollar valuation, I don't know what that means but it's really cool — is that we're not thinking about competitors as much as we're solving customers' problems. I really don't know what's going on over there; you're asking the wrong guy. I can tell you what people in Southeast Asia, Australia and Japan are using Databricks and AWS for, but Snowflake — they've started to open the hood, but I don't know what's in there. More importantly, Databricks runs in your AWS account. PII — personally identifiable information — is a concern, and I don't have a ready-made solution pattern for that: you must adhere to your internal company governance policies and to your government's regulations; you really need to follow whatever the top-down regulations are. Speaking of Snowflake, we actually have a connector developed for Snowflake too — you can do DataFrame reads and writes to and from Snowflake — but it doesn't use anything close to the parallelism of the Redshift one; there's no interim staging area or anything like that. So, this is the last bit, and then I'll let Alyona take you through the Spark 3.0 update: Glue and Athena. Who knows what Glue is — not the shoe glue, AWS Glue? Okay, right. And Athena — who knows Athena? Yay, okay, good crowd. For those of you who didn't raise a hand: Glue is quite a big native AWS service, and one of its largest components is a serverless metastore, which basically lets you have — who knows Hive? Yay, okay — so Glue has a few components, and one of them is a serverless Hive, which lets you store the definitions of external tables and of data spread across multiple S3 buckets. Who knows Amazon S3? I mean, my mother doesn't know what that is, so I had to ask, right? So Glue stores the definitions of where things are, what schema the data has, metrics, statistics — it stores them like a regular Hive metastore, but serverless, so you can connect to it and query it, and you have all the different databases and tables within it. Athena is a web UI for querying the Glue catalog, and it lets you set up a workflow where people log in to a particular account using an assumed role and can execute queries only against the Glue tables they're allowed to see, and so on. Yes, Gabe? — Yes, there are two more components. One is that you can actually schedule Glue transformations, as they're called, and run them on Spark under the hood — essentially an open-source Spark with some tweaks here and there, but the Spark you all know. And there's also the Glue — how do you call it — the crawler, right? Or is it the spider? The crawler — which goes and picks up new data as it arrives, on a schedule or automatically, from different file sources. — "So that's, I think, the easiest way to get started, if you have your data on S3 and you just want to query that data and get a result?" — "Well, sorry — if you don't want to use Spark, Athena is the easiest way: the data's already there and you just want to write SQL." — Yeah. So the premise is that there's basically a combination of different things to factor in. One of the reasons I wanted to show this integration is that, first, it's super easy to get started with Glue on Databricks on AWS: Glue doesn't need any configuration — when you're in the AWS console, it just exists.
So all you need to do to start using Glue as the metastore for your Spark clusters is this. Here's one of my clusters — this one doesn't have an external metastore; it uses what you see here. The Workspace — the thing that has the notebooks, the jobs, the clusters, the whole experience that data scientists and data engineers use — comes with a metastore, which is basically a Hive metastore. So you can look at the data here, you can see some databases, some tables, and you select which cluster to look through in this experience. Right now, if I'm looking through the Glue-enabled cluster, it gives me one view of the data, because it looks into Glue; if I'm looking through the default metastore, it shows what comes with this workspace. And if you had an external metastore configured — say on MySQL or PostgreSQL somewhere on premises; come on, external metastores, everyone knows how to configure those in Spark — you'd have a third drop-down, and you'd need to use the Spark cluster that has that external metastore configured for both the driver and the workers, and you'd see a different picture again. So it's not quite a split-brain scenario, but you have to cater for the fact that multiple clusters can be viewing different things, or you default them all to one — it's up to your design decision how you connect the dots. To configure a cluster to talk to Glue — this one has the Glue configuration — it's just a Spark config: it says spark.databricks.hive.metastore.glueCatalog.enabled true. It's available from Databricks Runtime 5.x, if I'm not mistaken; you don't need to worry about much beyond providing this configuration. The next thing to make sure of is that your cluster has the IAM permissions, because you're still running EC2 instances that need to assume their instance profile — so the IAM role this cluster assumes, in the Databricks configuration, needs the required permissions for Glue catalog access. That's it: read, write, whatever you design. So when you're working from Databricks, you see — I'll select the Glue database here; you can see glue_demo, from Databricks. If I go into Glue itself — where is it, Glue, here — did it load over 4G? Yay — we have the same glue_demo database, or whatever it's called, so you get a consistent picture. And now you can query, say from Athena, the same S3 sources you would query through Spark SQL. So you go into Athena and you can query something that sits on S3, be it Parquet or Avro or Delta. Delta isn't the topic of today's talk, but it massively accelerates Parquet by adding transactional capabilities — a transaction log — on top of it, so you can build multi-hop pipelines and mix and match batch and streaming; it's a really cool thing, we're just running short of time so I'm not going to double-click on it. But the idea of — sorry, this tab, I have a lot of tabs — having Athena read a Delta table is actually quite beautifully realized: you can see I'm now reading a table called loan_stats_train via Athena, and it gives me a preview of that table.
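To recap the cluster-side setup just described — the Glue catalog switch plus a query through it — here's a minimal sketch; the database and table names are placeholders, and the config line itself goes into the cluster's Spark config rather than into code:

```python
# Minimal sketch: point a Databricks cluster's metastore at the Glue catalog
# and query a table registered there. Database and table names are placeholders.
from pyspark.sql import SparkSession

# In Databricks this goes into the cluster's Spark config UI rather than code:
#   spark.databricks.hive.metastore.glueCatalog.enabled true
# The cluster's instance profile must also allow Glue catalog access
# (glue:GetDatabase, glue:GetTable and related actions).

spark = SparkSession.builder.getOrCreate()

spark.sql("SHOW DATABASES").show()                     # databases now come from Glue
df = spark.sql("SELECT * FROM glue_demo.loans LIMIT 10")
df.show()
```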
And this is actually a Delta table, not a Parquet table. The idea behind a Delta table — for those of you who raised your hands when I asked — is that there's a transaction-log level of indirection over multiple Parquet files that are otherwise disaggregated. You can do compaction behind the scenes to avoid the small-files problem, you can do Z-ordering, which effectively creates an index over your Parquet — there are many things you can do — but the point is, it's not just Parquet. So if you just point Athena at the folder containing a Delta table, it won't really understand what the heck is going on: too many Parquet files, and it doesn't quite work. So you need to generate a manifest — this is the one thing that connects the dots between Glue, Athena and anything you do in Delta. It's a very simple approach: you use a command — here it runs against a particular path that holds the Delta table — to generate a small manifest file, which is then automatically updated whenever you update the Delta table, and that manifest points to the latest proper version, i.e. which Parquet files to read. Then in Athena you create an external table using that manifest, and it's all connected. I've got a couple of customers using that for their processing; it all comes naturally as a result.
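A minimal sketch of that manifest step, assuming Delta Lake on Databricks and an illustrative S3 path; the Athena DDL at the end is included only as a reference string, since it is executed in Athena rather than in Spark, and its column list is made up:

```python
# Minimal sketch: generate a symlink manifest for a Delta table so Athena
# (via the Glue catalog) can query it. The S3 path and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
delta_path = "s3a://my-bucket/delta/loan_stats"

# Generates _symlink_format_manifest/ under the table path; regenerate (or
# enable auto-update on the table) after the Delta table changes.
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE delta.`{delta_path}`")

# For reference only -- this DDL is run in Athena, not in Spark:
athena_ddl = f"""
CREATE EXTERNAL TABLE loan_stats_train_via_athena (id bigint, quality double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '{delta_path}/_symlink_format_manifest/'
"""
```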
So these are the three patterns. We've spoken about RESTful model serving using SageMaker; we've spoken about Redshift and how easy it is to use Spark as the spot-instance-backed transformer sitting alongside your BI queries, to take away the latency and reduce the cost; and we've spoken about Glue and Athena and how they come together and join forces into a joint solution. I have a few more t-shirts — questions about these three pieces? Yes — a basic question about your Databricks clusters: I've seen there's a minimum, one minimum worker, and a max — there are two settings, right? So do they auto-scale up and down? — Yes, by default you have autoscaling; you can see the toggle here when I create a new cluster, and you can disable it. It scales down, or up, fairly aggressively: as the amount of outstanding Spark tasks grows, it provisions more and more spot instances. — And would you prefer that for a streaming application? — Good question about streaming applications. It wouldn't be recommended to run all types of applications on one cluster. You'd probably have a small cluster that runs only the streaming types of jobs, with your partitioning definitions, however many workers hold the connections to Kafka and so on; it's preferable to have a cluster per workload type. So you'd have an ephemeral cluster, backed by spot instances, for a job that spins up and down; an always-on cluster for the streaming type of data processing; a massive shared cluster that you wake up at 6 a.m. — or 7, whatever time data scientists come to work, I've never thought about it — so whenever your data scientists come to work you wake that cluster up (it terminated overnight, it's ephemeral, right), and they can do their data science and data analytics, create the dashboards and whatnot, connect JDBC tools like Tableau or Amazon QuickSight — that's a third cluster; and then there'd be a departmental pipeline-consolidation cluster as a fourth. You're using the elasticity of compute, right? So no, that doesn't mean you need to join them all into one — there'll be a different type of cluster for your streaming applications. And for that one, disable autoscaling — it wouldn't make a lot of sense, because you can always scale up manually as you see the messages falling behind on latency or something; you can add workers if you need to by adjusting the cluster configuration, but automatic scaling up and down will introduce a bit more chaos, in my opinion. — "Okay, I asked because in streaming applications there's very little traffic at night, and then in the morning more data comes in, so if it has to be done manually it doesn't make sense." — Yeah, so we would look at creating a small cluster and then maybe rebalancing it after 6 a.m., something like that: you can run a small job, even inside Databricks, that makes a call to the API, provisions a new cluster and connects it to the same endpoint in streaming. We can talk about that — it's actually a good question, not really today's topic, not really streaming-related, and there are caveats about it; you're right. Yes, please? — I'll need to look into that one; I don't know it well, and I'm the kind of person who will just tell you "I don't know" if I don't know. But if it can do some kind of model export into, say, a common flavor, we can then pick it up and track it in MLflow — we'd need to look into the integration details, so ping me, let's connect about it. Okay, cool. — "Does it integrate with a solution like Spot...?" — Like what? Spot? Spot instances, okay. So we actually control that — here's everything you're asking about, on the screen: we control the mixing of spot instances into Spark clusters for you, you don't need to worry about it. You can say, for example, that you want to start with one on-demand instance, and the rest, based on availability and your price bid, will be spot. Because the spot price may go up and some of your workers would disappear — they'd be taken away from you — we will bid up, as you can see in this configuration, up to 100 percent of the on-demand price: we bid progressively, paying more and more for the spot instances, up to the on-demand level, so your cluster won't shrink and you'll still always have the maximum number of workers, like eight. We do this for you; you don't need any additional solutions or anything like that, and that's really powerful. We also default to the i3 instance type, because it has NVMe SSDs in it, but it also goes for up to an 80 percent discount off the on-demand price — it's one of the most discounted instance types, because hardly anyone other than Spark users touches it — so it's really cool, it's really cheap, and its availability is through the roof. So always look into using spot instances when possible.
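For reference, this is roughly how that spot/on-demand mix appears in a cluster spec for the Databricks Clusters API — the same dials as in the UI shown on screen; all values here are illustrative:

```python
# Minimal sketch of the spot-related knobs in a Databricks cluster spec
# (the aws_attributes section of the Clusters API payload sketched earlier).
# All values are illustrative placeholders.
aws_attributes = {
    "first_on_demand": 1,                      # keep the first node (driver) on-demand
    "availability": "SPOT_WITH_FALLBACK",      # use spot, fall back to on-demand if needed
    "spot_bid_price_percent": 100,             # bid up to 100% of the on-demand price
    "zone_id": "ap-southeast-1a",
    "instance_profile_arn": "arn:aws:iam::<account>:instance-profile/<profile>",
}

cluster_spec = {
    "cluster_name": "etl-spot-demo",
    "node_type_id": "i3.xlarge",               # NVMe-backed, heavily discounted on the spot market
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "aws_attributes": aws_attributes,          # spark_version etc. omitted; see the earlier sketch
}
```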
Yes, please? — In Redshift I think there's default compression? — Yeah, it's on by default, right. It actually counts: the data you see on a Redshift cluster already comes in as if it's compressed, under some assumption, so if you provision, say, a three-node Redshift cluster it gives you a few hundred gigabytes of capacity even though the instances might have smaller disk sizes; you don't need to worry about it, it comes with a compression assumption — yeah, LZO, yeah. Same with Parquet, by the way — Snappy and whatnot. Good questions, a lot of questions now. Because I won't remember them all and I need to pass the mic to Alyona, what I'll do is take this bag, and maybe it'll be fairest if you just come up to me and I'll give you a t-shirt, so Alyona can carry on. By the way, there are a lot of stickers — AWS, AWS-plus-Databricks and Databricks stickers, any combination works — on that table in the corner. Cool, guys, thanks for your attention, and let's tune in to the update on Spark 3.0. And yes — ah, these notebooks — oh, by the way, the wrap-up slide: if you go to databricks.com/aws there's a similar kind of link; there's a free training about Databricks on AWS that covers most of the topics we discussed today, and if it's not there, just ping me, it's very easy. Our documentation is through the roof — I love our documentation because it's on point, and we don't have any documentation other than the public one. You can try almost all of the functionality — MLflow and all the things we've seen today — on Databricks Community Edition, though not the SageMaker part, because you need the IAM roles for that. We also give away 14 days of subscription to Databricks. We charge by the machine-hour of our Databricks Runtime, so you don't pay us for the compute — it's still your account, you pay for the EC2, and we charge a few cents on top of that — and we give away two weeks of those few cents, so on average, just reach out to us and we'll help you figure these things out. — "Is there any minimum upfront commitment you require for Databricks?" — Minimum upfront? No, no, it's a pay-as-you-go platform. I mean, if you have a petabyte or whatever, if you have any particular enterprise discussions — Simon, I have a colleague of mine from Databricks, but he's not around... he was here; have a conversation with Simon, he knows all the things — I'm not a salesperson, he has all the details. — "A lot?" — Yeah, a lot of us come from Citi. Cool, let's talk. So, okay — Alyona, I'm very sorry, I'm taking away your time. There we go — you need to take this small mic.