Thanks for your attention. This is the second talk, and it's a catch-up update on the release of Delta Lake, the technology that was open sourced about a month ago at the Spark + AI Summit in San Francisco. We'll talk about the essence of Delta. Delta has been available on Databricks for almost two years; it's been battle-hardened and industrialized in production use cases across a lot of our customers. It was actually incepted as a technology at the request of Apple. Apple is known as a company that does not usually share its brand with startups, or even with large enterprises, but for us they made an exception and even went on stage in a keynote at last year's Summit, in early 2018, where they announced their use case and talked about how Delta helped them accelerate their data pipelines. They moved some of their queries from the order of tens of minutes (literally 25 to 42 minutes across the different queries) down to sub-10-second queries with Delta. And we made this technology available open source. It's on GitHub; you can find it via delta.io, the landing page with all the documentation. It's also fully available in our Community Edition, which has fully featured Delta functionality plus the additional perks like small-file compaction and optimization with Z-ordering and whatnot. This talk will use a small sample notebook. You can play around with it on Community Edition; it works. It's just a toy example, and if you need any additional support, reach out to me in #sg on the Apache Spark Slack, our channel for the Singapore Apache Spark Meetup. But yeah, go ahead and play around; there are lots of examples, and it's pretty cool technology.

Before we dive into the details of the notebook, a reminder of what the overall case at Apple was, and likewise for the thousands of our customers on Databricks that use Delta. The idea is that we've all been collecting a lot of data into something with a very sticky name: the data lake. As companies, we collected a lot of data over a number of years, and that data was enriched and transformed using Apache Spark. Everything was perfect, everything is great. But now it's not only about collecting data and transforming it; it's about making sensible, quick business outcomes out of it on demand, productionizing the results of your machine learning training into real-time streaming applications, making predictions that impact, day by day, the way your businesses work. And for that you need close collaboration between the data preparation and data engineering stages and the machine learning and data science that proceed from that data. When I say close collaboration: without really well-cleansed, hardened data, machine learning frameworks are not able to work, and none of the machine learning frameworks have the capacity to transform data the way Spark does. So the problem at the majority of the intersections comes down to human beings sitting on two different teams.
Some team in production, for example, or some team owning an application, changes the schema of a data type or makes some other breaking change to the application. That impacts the data, the pipelines downstream are consuming raw inputs, and you end up with the machine learning pipeline breaking outright, or you have to go talk to some other colleague, or they ask you to raise a JIRA; none of which helps companies innovate quickly. So there are two broad categories here: reliability and performance. What are the themes we observed across the data we captured? Maybe they're very close to your company, maybe you're not at that stage yet, but this is what we observed; feel free to give feedback on what we're missing.

Themes in reliability: when you have a job that was updating a dataset and it fails, you end up with a corrupted dataset, especially if you're using Parquet. Parquet is at the moment the de facto standard, a very highly adopted format for storing large amounts of data in a data lake. So if you write data to Parquet and the job has a problem, you end up cleaning up your Parquet manually, or you have to create stubs, scripts, some automation to revert the transaction. Parquet itself adds nothing to prevent this from happening. Another example I already mentioned is schema enforcement. Downstream from your dataset there could be multiple applications that refer to a particular column positionally, say as the second-to-last column. It's a very lame practice, but just imagine you do so. Then all of a sudden the upstream application changes the schema of your dataset by adding additional dimensions, two more columns. Your reference is lost, and your downstream application can no longer make sense of what happened. And I'm not even speaking about a sabotage event; sometimes developers just don't clearly communicate how they're changing the schema of the dataset. Again, Parquet has no capability to enforce a schema; it will just accept whatever lands in there. Nor is there any ability to coordinate concurrent workloads: no locking, no consistent view of who is writing and who is reading. Parquet just accepts everything, so there is no coordination of different workflows. If a read happens concurrently with a write, that's the more benign case, but you'll get a dirty read and won't actually see what you asked for until the Parquet is completely updated. If you write concurrently, however, there is no concurrency control at all, and no way to determine which writer wins the dataset. The result is basically luck, which is a very bad strategy, or corrupted data, which is an even worse one. So that's reliability. There are also improvements on performance; we're not going into those in great detail in this talk, but there are ways to address many performance problems with the Delta format as well.
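To make the schema problem concrete, here is a minimal sketch, not the notebook's actual code; the path and column names are made up for illustration, and `spark` is the session a notebook predefines:

```python
# Hypothetical illustration of silent schema drift in plain Parquet.
from pyspark.sql.functions import lit

path = "/tmp/events_parquet"  # made-up location

# Team A writes the original two-column dataset.
spark.range(100).withColumn("action", lit("open")) \
    .write.mode("overwrite").parquet(path)

# Team B appends files carrying an extra column -- Parquet accepts them silently.
spark.range(100).withColumn("action", lit("close")) \
    .withColumn("region", lit("sg")) \
    .write.mode("append").parquet(path)

# A reader now infers the schema from whichever files it samples first;
# anything downstream that referenced columns by position quietly breaks.
spark.read.parquet(path).printSchema()
```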
So, what Delta Lake is: it's the open sourcing of the technology that was GA'd on Azure Databricks and AWS Databricks a while ago. It's based on Parquet; what it adds is transactional awareness, and what I mean by that is literally a small additional folder in your Parquet table. For those of you that have seen a Parquet table, and based on the Q&A with the previous speaker I assume a lot of you have: in the folder that holds the split files of a Parquet table, you add another small subfolder called _delta_log. That is your transaction log. Delta then adds the APIs to consult that transaction log to understand what is currently happening in that particular Parquet-backed Delta table. It's a very, very simple implementation, really beautiful, and it's a drop-in replacement; it works exactly as well in Spark. It will go upstream into open source Spark at the earliest opportunity; it's currently in JIRA, and you can track it. A lot of our vendor ecosystem members, like Informatica and Talend, embrace Delta, and they're also working on integrating native readers and writers. But it's available today: a drop-in replacement for Parquet, very simple to use, and it basically allows you to add transactional awareness to your existing data lake. We now encourage calling this new formation of storage a Delta Lake. First, just because we can, because Delta is a nice word; and second, because we're not calling it Delta the transactional relational database. You don't have to deal with that; it's still a Parquet-backed big data lake. However, it adds very important capabilities that let you do new things, like a multi-hop pipeline. One example we'll see in this small notebook: instead of moving and copying the data for rinsing and cleansing, you can now do it using Spark streaming, streaming data from one table to another table almost like a replication log. Also, Delta on Databricks in particular enables you to read from MySQL, and soon from PostgreSQL, doing change data capture streaming into your data lake pretty much seamlessly. If you want more details about that, reach out to us and we'll go through it.

So, what it adds: the transaction log is the only additional element in the Parquet table. Literally, it's a folder of flat JSON files that contain the transactions, and that allows readers and writers in Spark to be aware of the transactions going on. You automatically get atomic transactions with optimistic concurrency control: a write can fail if there is a conflicting write, or wait for the other write to finish, as you prefer. And with this transactional awareness, you are able to safely unify batch and streaming workloads. While something is landing in your Delta Lake from upstream, you can run batch inference on that dataset immediately, or do reporting from it immediately. It's pretty cool: it's the Delta architecture. It used to be the Lambda architecture, with separate streaming, separate batch, and a lot of hands working between them; now it's the Delta architecture. Fine. And schema enforcement: since we are able to consult what's in the table before we actually do the write into Parquet, the readers and writers can enforce a schema, or you can accept a flexible schema change; you can also configure Delta to behave like the previous-generation, Parquet-only implementation.
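As a minimal sketch of what that looks like from the Spark API (the path is hypothetical, and `spark`, `display`, and `dbutils` are the objects a Databricks notebook predefines):

```python
# Delta as a drop-in replacement for Parquet: only the format string changes.
from pyspark.sql.functions import lit

delta_path = "/tmp/events_delta"  # made-up location

df = spark.range(100).withColumn("action", lit("open"))
df.write.format("delta").save(delta_path)              # instead of df.write.parquet(...)
events = spark.read.format("delta").load(delta_path)   # instead of spark.read.parquet(...)

# The only on-disk addition is the transaction log folder of flat JSON commits.
display(dbutils.fs.ls(delta_path + "/_delta_log"))

# Schema enforcement: a mismatched append fails loudly instead of silently
# corrupting the table, unlike the plain-Parquet example earlier.
try:
    df.withColumn("region", lit("sg")) \
      .write.format("delta").mode("append").save(delta_path)
except Exception as e:
    print("append rejected:", e)
```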
And the coolest thing is time travel. Going into the details of what the transaction log enables: we can commence writes into a separate Parquet file that is not yet a member of the table, and then, once the write is complete, instantly switch the version of the Delta table to include that newly created Parquet file. What that gives us is a kind of latest, redirect-on-write snapshot of the table, and you can always keep a certain number of versions behind it; those are just the Parquet files that are disconnected from the current head version of your Delta table. That enables you to travel back in time. Behind the scenes, you can clean up the no-longer-relevant Parquet files in the Delta table with the VACUUM command. You can even vacuum with zero hours of retention and just drop everything without keeping any backups; we encourage keeping at least six or seven days, just in case, but it's up to you, and you can clean up as needed. That's the travel-back-in-time functionality that comes as a proceed of having the transaction log, and it's pretty cool.
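A hedged sketch of both features. The VACUUM SQL shown here is the Databricks flavor (open source Delta also exposes a vacuum() method on its DeltaTable class), and the path is the made-up table from the previous sketch:

```python
# Time travel: read the table as of an earlier version or timestamp.
delta_path = "/tmp/events_delta"

v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
old = spark.read.format("delta") \
    .option("timestampAsOf", "2019-07-01 00:00:00").load(delta_path)

# Vacuum: physically delete files no longer referenced by recent versions.
# 168 hours is the seven-day retention we recommend; 0 would drop all history.
spark.sql("VACUUM delta.`{}` RETAIN 168 HOURS".format(delta_path))
```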
Again: delta.io and GitHub have a lot of additional resources; you can peek into how it's realized, everything in Scala, the usual way. So let me show you a few things, and I can answer questions as they come, or just interrupt me and we can dive into the details. First and foremost, this is Databricks and what it looks like. You can see at the top it says demo.cloud.databricks.com; imagine "demo" is your company's name. It's basically your individual Databricks workspace, and with it comes an isolated space. It's exposed to you through the web browser, but everything that happens is provisioned in your VPC, in your AWS account, or in a VNet if you're doing it on Azure. So we're talking Spark clusters in your environment; you just look at them through the web UI experience. When you go into Databricks, you have notebooks, you're able to schedule jobs; we'll go through the notebooks, but you need a cluster first. When you create a cluster (I actually already have mine created here), you specify the instances, and you can edit a few things; you can assign it to a particular pool. If the cluster autoscales, there's the ability for multiple clusters to source their instances from a pre-warmed pool of instances, so you don't have to wait the 60 to 90 seconds for an instance to come up: you literally grab an already-warm instance from the pool shared between clusters, and it works. So: pools, and different versions of Spark, are available. Our Community Edition will not give you all this detailed configuration; if you're going to tinker with it offline, it's available in the full-blown Databricks edition, and you can do a trial if you want to go for that. A Databricks Community Edition cluster will be about six gigabytes of RAM in total, very small, but fully functional. So once you have a cluster, let me run it; I can also shuffle some libraries onto it. You can see them defined here: a library is basically how you point at a particular Maven coordinate, a PyPI coordinate, or for R, CRAN packages and R libraries for user-defined functions, or your own JARs. If you run something internally, that avoids any DevOps, any Ansible exposure, any need to salt-and-pepper stuff on top of clusters.

When we speak about clusters, we're speaking about standalone Spark: a single, greedy Spark app whose executors occupy an entire instance. Even if you try to tweak the cores or RAM per executor, we ignore it; it's a greedy execution. Why? Because our founder is Matei Zaharia, and he does things right. He actually wrote the initial version of Spark, and he says that since this is ephemeral compute (switch on, terminate, switch on, terminate), we don't need all that configuration. We don't need to rely on YARN or Mesos or anything; we do our own scheduling, and the cluster is treated as Spark, for the purposes of Spark. No ZooKeeper deployments, nothing like that. Hope that explains it.

What we see here is a notebook, also available to you in Community Edition. The notebook is a polyglot. It's our own take on what we saw when Jupyter was emerging: we saw that it was cool, and we implemented our own version of notebooks. Right now you can import Jupyter notebooks (in private preview), connect our notebooks to GitHub, or import and export; but the way we've done our notebooks, check this out, is almost a Google Docs type of collaboration. I'm in this current cell here, and I can type "hello Singapore" and comment on the cell; now imagine I'm another person, just in another Chrome tab, and I can see myself editing and watch my own cursor move. So you basically have people around the world working concurrently to create a notebook, quickly. It's a polyglot notebook, so you can switch between Scala, Python, SQL and several other languages. Once you have the notebook created, you can schedule it and send it off for execution on a timer. That's the whole idea: you rapidly innovate, ideate, and then schedule.
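To illustrate the polyglot point, here are sketches of separate notebook cells (not one script); the table name and paths are made up. The magic command on a cell's first line overrides the notebook's default language:

```
%sql
SELECT action, COUNT(*) AS events FROM iot_events GROUP BY action

%scala
val df = spark.read.format("delta").load("/mnt/sg-shared/delta-tutorial/iot_events")

%sh
ls -la /dbfs/mnt/sg-shared/

%fs
ls /mnt/sg-shared/
```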
And it took me exactly that long to talk about our functionality: the cluster is now fully green and up, and the notebook is attached to the cluster. On the cluster I can take a look at the Spark UI and see what's going on in the usual, convenient Spark history, the graphs of the jobs; everything is quite nicely integrated. Also, since I told you clusters are ephemeral: we still keep all these details, so you can come back and review them without going into other folders or other systems. And you're able to see Ganglia stats integrated, so you don't need to leave the Databricks environment to see how much RAM and CPU you're using. Everything is enclosed in the integrated workspace.

So what are we talking about here? We're building a data pipeline with Delta, and the scenario is that we work at a company with a lot of IoT devices and we want to make sense of the streaming data: maybe enrich it, or do some batch operations on the streaming data as it comes along. We will simulate this; it will not be some external dataset. The way we'll simulate it is... let me, by the way, delete this comment first, because I think it's lame, which is actually one of the things I referred to about the Git integration: we have revision control embedded, but you're able to link it up to Git and do things smartly somewhere else if need be.

So every notebook comes with... did I clear it? I think I need to reattach it: clear state and results. I suspended a few things, so let me detach and reattach. Basically, the Spark context is established; there's one application, the cluster is one app. When you attach a notebook, you instantiate this particular notebook (a Python one here), so a SQL context is instantiated behind the scenes, and you can start issuing commands with PySpark straight away. So: run this setup. You'll notice it takes a while to do something; let's leave it, because it might just be normal, I won't pretend otherwise. One of the things happening behind the scenes is that I'm referring to a strange-looking path here: the Parquet path starts with /mnt/sg-shared, then delta-tutorial, and so on. What happens is that we have a level of indirection that allows you to connect multiple S3 buckets, Azure Blobs, or Google Cloud Storage buckets into one namespace, just for operational convenience, and it looks like you're working on one giant supercomputer with a lot of mount points. That's done with credentials, and you do it once: you mount a storage once, and our secret store behind the scenes keeps that configuration until you either unmount it or change the credentials on the underlying storage. This level of indirection, this namespace, we call the Databricks File System (DBFS). It's actually a very thin layer for pointing at multiple mounts. Its root is itself an S3 bucket behind the scenes, about 100 megabytes by default, and you can put stuff into it, upload directly into it; but the best practice is genuinely to mount the remote locations where you prefer to keep your data, and then authenticate by IAM role pass-through, by credentials, or by some other means that can differ system by system.
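As a sketch of how such a mount is created once; the bucket name and secret scope are placeholders, not the demo's actual configuration:

```python
# One-time mount of an S3 bucket into the DBFS namespace (names made up).
import urllib.parse

access_key = dbutils.secrets.get(scope="aws", key="access_key")    # hypothetical secret scope
secret_key = urllib.parse.quote(                                   # the key must be URL-encoded
    dbutils.secrets.get(scope="aws", key="secret_key"), safe="")

dbutils.fs.mount(
    source="s3a://{}:{}@my-company-bucket".format(access_key, secret_key),
    mount_point="/mnt/sg-shared")

# Every cluster in the workspace now sees the bucket as a local-looking path.
display(dbutils.fs.ls("/mnt/sg-shared"))
```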
Does it also act as a cache? It would cache... I'm still looking at this running command, and I'm very perplexed that it's still running. Databricks would cache Parquet, and Parquet only, if you use instances with local solid-state drives (the i3 type on AWS, and the equivalent type on Azure, I don't remember which), so that you do not pull the Parquet files over the network every time you refer to them. OK, that is something clearly not making me happy. I didn't actually plan to debug this in front of you, but let's take a look. The first thing to do is to understand why... this one we need to cancel. I think it's something on the browser side; it's very weird behavior. So, the cell: have you seen it actually running? What happens here is that I use this %fs ls command to point into the indirection, and I end up on some other bucket; I'm not in the Databricks File System, I'm actually within my AWS account, or somewhere else entirely. And I can see that I have a Parquet table, a usual Parquet table partitioned by date (and the date is year-month-day, straight away). I anticipated that, with my internet connection, this cell was not really running; it is actually running.

This is something you really never want to have happen in public at a meetup, right? And it always comes unexpectedly, when you're humbly wishing: hey, let it all be good, let it all be fine, it's 9 p.m. So let's try again; let's try better. I will blame nothing until I find out what that was. If I go back into my shard and authenticate: it is running. Let's just create a new cluster; why don't we do that. So I'm looking at it right now, as you can see. The cluster has actually started, right? The best way to find out is to see whether there was any particular... there was no Spark job. As you've seen, this notebook is the only one attached, and the notebook has only one command, which is in Spark SQL. It actually might be removing the stuff, but it's not a big thing to remove; I'll just maybe comment this out. [Audience] Just an idea: if you don't want to show the Spark UI, is there a way to check the driver log? Yes, you can check the driver log, definitely. We can go through the Spark UI, or actually there's a shortcut straight to the driver log, so you're able to go for it. I just don't want to scroll and read at this velocity, but there definitely is Spark behind the scenes, right? Am I isolating what could be done, just line by line? That was not planned; I did not anticipate this fun. So I think it's actually removing something. Variables... I assumed this had passed through because it ran before, right? You would assume it would already be there. OK, let's print it out. This whole session becomes very interesting. OK: that works, and that works. This was not supposed to be a line-by-line code walkthrough, guys, but since we're already here, and we had pizzas, and Traveloka gave us some Coca-Cola, why not. We'll just accelerate through something else. So, yeah, the variables are good. [Audience] You might want to sit down. No, it's fine, it's all right; it's about resilience, right? We'll proudly tell our managers: my back was hurting, but I still powered through and got it done. Way to go, man. Sorry. Ah, don't get me started.

OK, it is weird. Let's attach it to another cluster. Since it's ephemeral compute, let's just do this, and it will actually show you how fast it is. SGDemo... blah. Let's go with the default settings: spot instances will be mixed in, so it won't cost as much; it will be one on-demand node and the rest spots. Very, very default, autoscaling enabled. Boom, create cluster. It would have been better to have done this beforehand. This is provisioned in us-west; in your case, or our customers' cases, they deploy wherever they want. A particular region is tied to a particular Databricks account. You can schedule the cluster to be executed in a particular availability zone, and the locality principle will drive low latency; but for different regions, you have to deploy multiple Databricks workspaces. So let's go back to this notebook that has given up on me in this unprecedentedly uncomfortable situation. And believe me, I test before I come to meetups; believe me, please. I'm not asking for many things, but yeah, I actually was in the office going through this.
Let's attach it to the cluster that is now being powered up. Where was the "blah" one? Confirm; so we will lose all state. Cool. Meanwhile, speaking about jobs: these are the jobs that are submitted either by users or by our API, so you can see the runs of a job. Let's take a look at this test job and see when it was run. You're able to submit jobs, and Databricks will instantiate a cluster for you on demand, bring it up with the configs you need, and whatnot; and the job will keep a log of previous executions. Each run here contains the Spark UI, the logs from log4j, and the metrics from Ganglia. If I go into one of these runs, I can see its definition, what the input and output were, and so on. Well, this one had a very powerful input. The usual use cases: you either create a library that you run, like a JAR, or you concatenate notebooks, so you schedule one notebook and it triggers either parallel or consecutive execution of other notebooks in a cascade. That builds your ETL pipeline up, step by step by step.

So what about this guy? Yeah, it is up. It seems like some kind of a cache issue. Is there any web proxy or something? Are we behind NAT? Nothing? Clear? Yeah, good; I'm just asking, it's a bit wicked. So it gets a context, schedules it... that's not healthy. Oh, no file. Look at it: no file. Did I remove it? Wait, that was when I was tinkering with it and removing things. It's OK; that means it already doesn't exist, so it's fine. It should be OK. Come on, it's a community event. Where are we at? Yeah, because you define a path... I already removed it, so it should be fine. This should also fail, because those files no longer exist. Actually, you know what, it wouldn't hurt; they're already removed, we proved it. So we begin just like that, I guess. What we've done here is describe a dataset; we should have just skipped straight to this, but OK, never mind, we have another five or ten minutes to finish.

So we will generate the dataset straight out. The dataset, as you see, is a Spark range: random numbers in a range. It will create a few columns: depending on whether the number is divisible by two, whether it's odd or even, we'll have an "open" or "close" kind of sensor event; then some gibberish date and time that we make up; and a device ID, again a random number. Then we create a Parquet table; this is where the Parquet path comes in handy. We can see what's going on behind the scenes: it's a regular Spark task that writes out the Parquet into the shared path, the one we saw above. It wasn't super quick, but doable. The next cell uses display. It's something like .show in the Spark context, a beautified Databricks implementation for the notebook space. What it gives you is this wonderful UI that you can quickly traverse, and you can switch between the embedded visualization and tabular output; but it's similar to what .show does.
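Reconstructing that cell as a hedged sketch; the column names and exact expressions are approximations from the narration, not the notebook's literal code:

```python
# Generate ~50k fake IoT events: open/close by parity, a made-up date,
# and a pseudo-random device id.
from pyspark.sql.functions import expr

raw = (spark.range(0, 50000)
       .withColumn("action", expr("CASE WHEN id % 2 = 0 THEN 'Open' ELSE 'Close' END"))
       .withColumn("date", expr("date_add('2019-07-01', CAST(id % 28 AS INT))"))
       .withColumn("deviceId", expr("CAST(rand(5) * 100 AS INT)")))

parquet_path = "/mnt/sg-shared/delta-tutorial/iot_events"  # hypothetical mount
raw.write.partitionBy("date").mode("overwrite").parquet(parquet_path)

# display() is the notebook's richer cousin of DataFrame.show(): a sortable
# table plus built-in plotting over the same result.
display(spark.read.parquet(parquet_path).groupBy("action").count())
```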
So let's see what the count of rows is if we group by action; we're asking, hey, what's inside this gibberish dataset we created for the IoT use case? We have 50-50, because it's a uniform random split between the open and close events. By the way, you see there are some hints; it will recommend you do something based on the performance it observes. Next, I want to see how the transaction log helps me mix and match streaming and batch workloads in one Delta Lake. So what I'll do is instantiate a stream that will keep writing. It's an in-memory stream, so it doesn't read from anything: format("rate") is an option where you basically say, here's a streaming dataset, now do this within it. It doesn't read from Kafka or Kinesis or anywhere; it's an imaginary stream. Within it, I'll generate another set of gibberish that I want to append into the Parquet. So what happens if I write this stream out? I take the stream and write it out as Parquet into the defined table we already have, checkpointing and all, and I do that in append mode. Databricks has, again, quite a neat visualization here; for those of you who have worked with Spark streaming, it's usually a mess to find out what's going on, and here you see how many records per second, all of it logged so you can come back to it. And remember, this is all on a very small cluster, six gigs.

So let me look at my batch query: I'm asking to read from the Parquet table, get the action summary, and count it, similar to what we've done before. It instantiates, and we keep appending to the table, right? So this guy appends... come on... woo. Rubbish, right? It only shows "open"; it's not even able to show the closes, and it can't make any sense of my previously generated 50,000 counts. Because this is how it works: plain Parquet is not able to mediate between readers and writers and give a full, consistent comprehension of the Parquet's content. And this is where the benefit of Delta can help. Imagine we take the same dataset, the raw data that we generated, and we write it out (it's still a DataFrame, right?) using format("delta"). This is literally a drop-in replacement in the Spark API, just a better spelling of the word "parquet", because you have to spell q-u-e-t and sometimes you type q-e-u-t and whatnot; it's literally simpler to write. You can also do the in-place conversion: you can say, hey, here is an existing Parquet table, convert it into Delta, and it will basically clamp that transaction log folder on top and, similar to writing it out from scratch from the DataFrame, make that particular table a Delta table. So let's do the same thing: we establish a read from it and see the count of items inside. Let me cancel a couple of things, just to save a little bit of RAM for this gibberish. Yeah: this is exactly the same content, just with the drop-in replacement of the format. So now I do the write of the stream, with the generated, newly appended data, into the Delta table.
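A hedged sketch of those cells together. The rate source is real; the paths are placeholders, `raw` is the DataFrame from the earlier sketch, and the commented CONVERT TO DELTA line is the Databricks SQL flavor of the in-place conversion:

```python
from pyspark.sql.functions import expr

delta_path = "/mnt/sg-shared/delta-tutorial/iot_events_delta"  # hypothetical

# Rewrite the batch data as Delta -- literally the same call, new format...
raw.write.format("delta").partitionBy("date").mode("overwrite").save(delta_path)
# ...or convert the existing Parquet table in place:
# spark.sql("CONVERT TO DELTA parquet.`{}` PARTITIONED BY (date DATE)".format(parquet_path))

# An "imaginary" stream: the rate source invents rows, no Kafka or Kinesis needed.
stream = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
          .withColumn("action", expr("CASE WHEN value % 2 = 0 THEN 'Open' ELSE 'Close' END")))

(stream.select("action")
       .writeStream.format("delta")
       .option("checkpointLocation", "/mnt/sg-shared/delta-tutorial/_chk/iot")  # outside the table
       .outputMode("append")
       .start(delta_path))

# A concurrent batch reader now sees a consistent snapshot, version by version.
display(spark.read.format("delta").load(delta_path).groupBy("action").count())
```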
(While we wait, you've got to love the loading spinner. One of our major features is controllable rotation: it can go both counterclockwise and clockwise, at the client's desire. We debated for a long time whether that should be configured in JSON format or in YAML, couldn't agree, and therefore never implemented it.) So let's see: now we're doing the same count, but we are reading through the transaction log, and we are able to make sense, version by version by version, of the Delta table. While this is loading, I can show you what the table actually looks like. Yeah, you can see it actually came out, and it increments as we go. In fact, if I did this in Spark SQL rather than through the DataFrame API, and defined a temp table, the SQL API allows me to subscribe to a stream, so I can make it a streaming dashboard: a visualization of that 50-50 split that refreshes automatically, with no rerunning or clicking from me.

Now, the way the Delta table looks in comparison to Parquet. We're in the Python world here, not the DBFS world; that was under /mnt, so I need to copy this. Actually, you can also mix and match shell commands; did I tell you that? No? That's pretty cool. You can do %sh in this polyglot notebook, and everything you execute from there onwards runs on the driver, so I can do ls -la, and there is a mount point on the driver called /dbfs. So this is what the Parquet table looks like, and this is what the Delta table looks like: you see the folder here called _delta_log, right? Again, this is on the driver, so it's kind of ugly; I could switch back to our DBFS implementation and do the %fs, but we're already here, never mind. These are the JSON files, the transaction log files. Our whole open source commitment was about the format of these JSON files, and we've given away the DataFrame readers and writers that create them. So if I go and head one of them, it basically shows you what the transaction was: there was a commit, what the device was, where it came from. This is what our readers and writers consult to power the Delta use case. That's pretty cool.

Now, before we break, and I'll save you some time (don't worry, I think we're behind schedule by ten minutes already): in Delta there is this new notion that we can safely mix and match streams and batches, so we're able to create something called multi-hop pipelines. The easiest way to think about it: you've already seen the silver and the bronze in the paths we're using, right? It's the rinse-and-clean flow: you have raw, summary, presentation, the usual ETL flow, but you do it without discrete jobs; you can connect the stages using streams. You have Spark streaming between the datasets, and whenever a change lands in the source Delta table, you are able to run the exact ETL on the latest update to that table, put it into summary, put it into presentation, and then push something into an always-running inference service, or a live-updated dashboard, or something like that. So here I define a read from format("delta"); this is the trick. You read a stream from format("delta"): how cool is that, you read a stream from a table. So we kick this off, and there will be some ETL; as you see, I'm doing a group-by on action, and I partition with overwrite... and the stream stopped.
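For reference, a hedged sketch of what that multi-hop cell is meant to do once the checkpoint issue below is sorted out; the paths are placeholders:

```python
bronze = "/mnt/sg-shared/delta-tutorial/iot_events_delta"   # fed by the stream above
gold = "/mnt/sg-shared/delta-tutorial/iot_events_gold"

# Peek at one commit in the transaction log (file names are zero-padded versions).
print(dbutils.fs.head(bronze + "/_delta_log/00000000000000000000.json"))

# Stream *from* a table: whenever new data commits to bronze, the aggregate
# below is recomputed and committed to gold.
(spark.readStream.format("delta").load(bronze)
      .groupBy("action").count()
      .writeStream.format("delta")
      .outputMode("complete")                               # required for aggregations
      .option("checkpointLocation", "/mnt/sg-shared/delta-tutorial/_chk/gold")
      .start(gold))

# Dashboards or inference jobs just read the gold table -- the final hop.
display(spark.read.format("delta").load(gold))
```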
Wait... why? What? How? "Delta table doesn't exist"? Which Delta table? It reads from... does it exist? Yeah, I defined it. Delta, delta, delta... check. Yeah, it's a fully fledged Delta table: you can see the history, these are the commits we were doing. Is the stream up? Yeah, the stream is up. And this is the time-travel history, the snapshots we talked about. So if I go into the format, checkpoint location plus checkpoint, partition by, complete: looks good to me. What's going on? Start, delta gold path... what's that? Is it defined? Again: at the very, very beginning of our meetup today we had some issues with all these variables; a very annoying friend. Yeah, it exists. OK, query, checkpoint... ah, the checkpoint is dropped. OK. Ah, the checkpoint is not dropped. Ah, so we need to drop the checkpoint; that's easy, that should be doable. Never store your checkpoints inside your Delta data directory. I never actually looked at the details of that session; it would be pathetic if they ran into this at Spark + AI Summit. Where was that, what was the checkpoint name there? Ah, it was delta gold underscore checkpoint. Yeah, thank you, sharp-eyed; who was that? Yeah, man, good stuff, thank you. And where was I deleting it? Underscore checkpoint... yes, of course. So we need to drop it. What I do is rm the path; it's a folder, so I think it's a capital R for recursive, yeah. Keep your checkpoints separate. Wow, rolling, OK. With a few hiccups, we are sparring through; that's good stuff.

So what we end up doing is writing into the new Delta table, and this is instantiated. How are we doing on executors? Five tasks, quite good, quite lean; nothing much happening, it should be fast. And then we do the load from that table, right? So this, again, is a stream from a stream; pretty cool. And because there's a group-by thrown in, it also shows you the aggregation stats straight out, and then you do something like this: nice, straight out. So it's still going on with the appends from that IoT use case, and that's about it. Schema evolution: a few other cells in here talk about how you can merge schemas. On appends, you can actually allow Delta to be flexible about the schema. I don't want to use too much of your time, but basically you can have the request rejected if the schema is enforced, or actually appended if you want, by setting an option on the DataFrame writer; it's an option that only Delta has. ACID transactions, yes; and you can actually do in-place overwrites of the schema. Time travel, this is actually interesting; how about I answer a few questions instead, it's like 9:15, but basically, trust me, you can travel back in time, and there were some hiccups on the way that led us to this point.

So, in the interest of your time: this is open source now. It's been battle-tested for two years on Databricks and in Azure Databricks. It was incepted by the use case at Apple, which was a 300-terabyte-a-day dataset coming from all their network devices (it was the forensics for their networks), and it's an acceleration. We haven't even talked about the performance improvements Delta brings, like the compaction of the small files that get created when you stream into a table: even if you have ten-kilobyte Parquet files arriving, millions of them, behind the scenes you're able to compact them into one-gigabyte Parquet files. You can do data skipping, you can do Z-ordering behind the scenes. It's not possible to cover everything in this session.
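To wrap those last points into code, a hedged sketch; DESCRIBE HISTORY and OPTIMIZE/ZORDER are shown in their Databricks SQL form (OPTIMIZE was Databricks-only at the time of this talk), and all names are placeholders:

```python
from pyspark.sql.functions import lit

events = "/mnt/sg-shared/delta-tutorial/iot_events_delta"

# The lesson from the live hiccup: keep stream checkpoints *outside* the
# table path, and remove a stale one recursively (the True flag) if needed.
dbutils.fs.rm("/mnt/sg-shared/delta-tutorial/delta_gold_checkpoint", True)

# Schema evolution: opt in per write instead of having the append rejected.
extra = (spark.range(10)
         .withColumn("action", lit("Open"))
         .withColumn("country", lit("SG")))   # a brand-new column
extra.write.format("delta").mode("append") \
     .option("mergeSchema", "true").save(events)

# The commit history that powers time travel:
display(spark.sql("DESCRIBE HISTORY delta.`{}`".format(events)))

# Small-file compaction plus Z-order clustering for data skipping:
spark.sql("OPTIMIZE delta.`{}` ZORDER BY (deviceId)".format(events))
```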
So what do you guys think: was it worth open sourcing? Is it cool? Looks cool, yeah? All righty. If you need more info, download this notebook, ping me on Slack (the Apache Spark Slack, #sg), and there are a lot of tutorials out there, a couple of blog posts on how to time travel and how to do the compactions. Check it out; it works straight out on Community Edition, just don't bloat it with large datasets: it's six gigs, and you can run out of memory. So, cool. Thank you, guys, for your attention. What a day; it started at 6 a.m., arriving into Singapore on the red-eye.