So yeah, welcome, all of you. I'm very happy to see so many different faces, and I'm really excited to have the chance to talk about Apache Spark versus cloud native SQL engines here at EuroPython. I hope this talk will be as exciting for you as it was for me preparing it.

I guess a few of you may have stumbled over the talk title, "Apache Spark versus cloud native SQL engines" — we're comparing a single technology against a whole stack of technologies. To be honest, there's an alternative talk title which is more explicit and more appropriate, but which wouldn't be as suitable for a conference, and soon we will figure out what this more appropriate title actually is.

To start with, there will be a short preamble describing my motivation — why I came up with this talk in the first place — and a bit about my intention, what I want to achieve with it. Then we will dive into the comparison. Before we start comparing, we first have to understand the context in which we compare those two competitors. Once we've done that, I will briefly say a few words about Apache Spark and about those cloud native SQL engines, for all of you who may not be so familiar with them. Then we will do a deep dive on three specific distinctions that set Apache Spark apart from those cloud based computation engines, and derive implications from those distinctions, which will then guide us to a recommendation on when to choose which technology. That's the overall idea.

So let's start off with the motivation: why talk about this topic in the first place? From my personal perspective: I studied psychology and got in touch with programming via statistics. I started off using R, then gradually moved towards the Python world and the Python ecosystem, basically never looked back, and ended up mainly using Spark, but also SQL, in my day-to-day work. And I always had this confusion about the coexistence of so many different computation engines: you can use a DataFrame API, you can use SQL, there are other declarative approaches to translate business logic into data pipelines. Another motivating factor is that at my company we had to make a strategic business decision regarding a future analytics platform, and at its core we had to decide whether to use a single engine or multiple computation engines. So that was also a motivating factor for coming up with this talk.

What might be motivating reasons for you to be here? If you're using Spark or associated technologies in your day-to-day work, then it's a no-brainer. But for those who are newer to the topic, you will get a kind introduction to what analytical batch workloads are, and I hope that you can develop an intuition about the conceptual differences between the various computation engines. Hopefully, in the end, we will also understand what practical implications arise when choosing one technology over the other.

So what's my intention? I've touched on it a bit already.
I don't want to bash any other technology. As I said, I'm mainly a Spark user, even though I also use SQL on a daily basis, but my roots are more in the open-source Spark world. I want to outline trade-offs between different approaches — that's really important for me — and I want to put emphasis on a complementary view of those worlds. In the end it all boils down to choosing the right tool for the right problem. And hopefully, by the end, we will be able to fill out all those question marks: we have our two competitors there on the left, Spark and the cloud native SQL engines, and I picked six dimensions on which we will compare them. In the end there will be a pattern we can discover, which will then guide us to a recommendation.

Okay, so let's get to the actual content. First, as I said, we want to understand the context in which we compare those two technology choices, and then we can talk about Spark and our cloud native SQL engines.

So, what are analytical batch workloads? When talking about analytical workloads — well, there is no data in the first place, so in order to collect data we have to talk about transactional databases, about transactional workloads. Let's imagine we have a web shop, for example. There are different real-world entities such as customers, orders, shipments, and we want to capture those real-world entities in data, in a database, and we want to do so in a very consistent and valid way. Transactional databases provide the means to do so: we have concepts such as primary keys and foreign keys, we have not-null constraints and uniqueness constraints, we have ACID compliance — atomicity, consistency, isolation, and durability — and this functionality is really required in order to capture the outside, real world in data. So transactional databases really constitute the backbone of our daily business operations. They're focused on specific entities at a specific point in time, and they're really associated with all those CRUD operations: on individual rows, individual items, we create them, we query them, we update them, and we might also delete them.

In contrast, with analytical workloads we are not so much concerned with getting a very consistent and valid picture of the outside reality. Rather, we want to access already existing data — which has been provided, for example, by transactional workloads — and derive insights from historical data in order to drive future decision-making. Hence we ask different questions here. We might ask: what is the most frequent error state for all our printing machines, or all devices, over the last five years? That's a completely different story. The access pattern is also very different, because now we have aggregations across a huge number of rows, and hence the underlying data layout is different as well, because we're now accessing columns: we have a typical filter condition, we have group bys, and then we have those aggregations. So when comparing our competitors, we're talking about analytical workloads, not about transactional workloads.
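Just to make that access pattern concrete, here is a minimal PySpark sketch of the kind of analytical question I just mentioned — the most frequent error state per device over the last five years. All paths and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical event log with columns: device_id, error_state, event_time.
events = spark.read.parquet("/data/device_events")

most_frequent_errors = (
    events
    .filter(F.col("event_time") >= "2018-01-01")   # scan a large historical range
    .groupBy("device_id", "error_state")            # aggregate across many rows
    .count()
    .orderBy(F.desc("count"))
)
most_frequent_errors.show()
```

A filter, a group by, and an aggregation over huge numbers of rows — that's the typical shape of an analytical batch query, and it's why columnar layouts pay off here.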
Next: batch versus streaming. This one is more obvious. With batch processing, what typically happens is that as soon as a data point or an event occurs, we don't transfer and process it right away; rather, we wait until a certain threshold is reached — this could be a logical threshold, such as a session being complete or a sequence having finished — and only then do we start collecting and processing the data. In contrast, with streaming, as soon as an event occurs we transfer and process it. The trade-offs are fairly obvious: with streaming we have near real-time information, so very low latency, whereas with batch processing there's high latency, because we only invoke processing once in a while. However, batch processing has one — actually two — major benefits. The first one is here on the slide: it's highly efficient. We can apply multiple optimizations to transferring the data, storing the data, and processing the data, whereas with stream processing there's a large overhead from individually processing each and every event. The second benefit of batch processing, which unfortunately is not written on the slide, is about complexity. When you think about an unlimited, continuous stream of data, you may have late-arriving data, missing data, duplicated data, and you have to account for those difficulties. With batch this can occur as well, but it is less likely, and if you have a complete session of data it's easier to reason about and easier to implement pipelines for. So we've set the stage: we compare those two competitors in the frame of analytical batch workloads.

Next: what is Apache Spark, for those of you who might not be so familiar with it? In general, it's a computation engine — a distributed, fault-tolerant computation engine. That means it can run on your single machine, it can run in a cluster, you can scale out. It's fault tolerant, which means that if you run it on commodity hardware and a single worker fails, that's no issue by design; you can recover instead of rerunning the entire job. It has many, many connectors to all imaginable data sources: you can connect it to streaming sources, to transactional databases, to HDFS, to object stores — mainly everything you can think of. And one thing that's special about Spark is that it provides many different interfaces to work with it: you have an R interface, you can write plain SQL, there's Python with PySpark, and you can use Scala and Java.
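Just to illustrate those interfaces side by side, here is a small, purely illustrative sketch: the same piece of business logic expressed once as plain SQL and once via the DataFrame API, both running on the same Spark engine (table and column names are made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/data/orders")   # hypothetical input
orders.createOrReplaceTempView("orders")

# The same aggregation, once as plain SQL ...
via_sql = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# ... and once through the DataFrame API.
via_api = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))
```

Both end up in the same optimizer underneath, which is part of why Spark can offer so many front ends on top of a single engine.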
Now, what are these cloud native SQL engines? The name already describes them fairly well: they're cloud native, engines that only run in the cloud, and they're SQL native, meaning their main interface is SQL. There are some exceptions here. For example, we see Snowflake here on the slide. Snowflake nowadays also provides a DataFrame API, but that's less well known and fairly recent, so I just left Snowflake there on the slide.

So now we've had a brief introduction to the context, a brief introduction to Apache Spark, and a brief introduction to our cloud native SQL engines. Perhaps one thing I should mention, because it's also written on the slide: Snowflake, Redshift, and BigQuery are data warehouses, because they also come with a storage engine. However, if you look at benchmarks, you will often find Spark being compared with these data warehouse computation engines, so they usually compete in the same market. I think it makes sense to include them here, because with Snowflake you can process Iceberg data as an external table, and you can also process Parquet files — so they have every right to be on the cloud native SQL engine side.

So let's get ready and first focus on three major distinctions that I picked. The first one is SQL versus the DataFrame API — this is the part with the most depth, and we'll spend a lot of time on it. Then we have runtime flexibility, and we have vendor independence.

Let's focus first on SQL versus the DataFrame API, and of course we have to start with SQL, for a good reason: SQL is quite old, more than half a century old. It still astonishes me that nowadays we're using a language which has its roots in the seventies. It was Edgar Codd who first described the relational model for data. Nowadays we're so used to a two-dimensional representation of data — we think in Excel tables, or in database tables — but back in the day, data analysts had to spend a lot of time implementing things just to store and retrieve data. Nowadays it's all declarative; we mostly don't have to care about how data is stored, but back then that was not the case. However, there was one drawback with this original language for interacting with the data: it was quite complex. So four years later, Donald Chamberlin and Raymond Boyce gave birth to SQL as we know it today. It's almost fifty years ago that they described it for the first time.

There's a little fun fact to it: on the paper you can see SEQUEL written there — really SEQUEL, and not SQL — and we might wonder why some people say "sequel" while others say "S-Q-L". The point is that there was a British aircraft company — well, if Wikipedia is right — that held a trademark on the word SEQUEL, and hence it had to be renamed to SQL. That's why there are different variations of how to pronounce it, "sequel" or "S-Q-L".
I found this funny. But anyway, the really astonishing part is that SQL is half a century old, yet nowadays it's more popular than ever. That's also the reason why all those modern cloud native computation engines have picked up SQL as their primary interface for translating business requirements into data pipelines. And there must be a good reason why those major players chose SQL. A good reason is that SQL was designed as a human-readable, declarative language, and you can just use it that way. You have high-level capabilities, like the CRUD operations here, which makes it really — or at least fairly — easy to use, to understand, to onboard people with, even people who are not programmers at heart.

However, even though SQL excels in this regard, it has some drawbacks, and now I want to focus on those drawbacks — hopefully without bashing SQL too hard. The point is: SQL lacks the ability to create powerful abstractions. In common general-purpose programming languages we have constructs such as control flow structures — if/else, for loops, while loops — we have object-oriented features, and we can use inheritance, composition, polymorphism, and others. We have error handling — try/except/finally — and of course we can create our own custom types, including non-scalar types. But these things were not part of the SQL language in the first place. Interestingly, there was of course a requirement for such features, so database vendors such as Oracle developed their own dialects, their own kind of PL/SQL — PL stands for procedural language — which first included control flow structures and error handling, and Postgres did much the same with PL/pgSQL. And this was long before the actual ANSI SQL standard specified these functionalities. ANSI stands for American National Standards Institute, so that's the general, overall specification of the SQL language.

Me, being kind of new to this entire programming world, I was asking: why isn't there a wealthy ecosystem of libraries in the SQL space, similar to Python? Why isn't there a "pip install pandas" for SQL? It's just not there, and we might wonder why. While doing a bit of research on that, I came across a blog post from the EdgeDB folks, and this is a quote from their blog post; they say: SQL lacks orthogonality.
Yeah, that describes the language ability to compose more complex constructs from simple building blocks if I say this in my own words in my own intuition, that means if a language has a Set of constructions that you can use in a very consistent and predictable way well, then you can create powerful abstractions based on this core set of keywords and this doesn't seem to be the case with SQL and As a fun fact well the SQL 2016 specification now includes more than 360 keywords, so just imagine having Python with more than 100 keywords hard to remember Like C has 32 keywords and we built operating systems on top of this Python 311 now has 35 keywords And Python everything is an object you can pass around a function as an object a class is an object Even a module is an object everything is an object and you can just pass them around Think of SQL well stored procedures are already a bit awkward and how can you can like pass around stored procedures in a Store procedure doesn't work that way And maybe also to giving a bit more Hint on this if you if you type Google, this is my like own search history So this result is not really representative or ever if you type in SQL ecosystem You would expect like a large wealth of libraries and stuff But what you will find for me at least data bricks SQL well data bricks is not really well known for SQL They're rather known for spark so that they did some really good advertising here And also if you look like at the result at the bottom it says data management with SQL for ecologists I'm asking for the ecosystem. Why isn't there like like the first major result ecosystem SQL ecosystem? So I went further and asked chat GPT And what does the GPT say and it says yes finally there is a rich ecosystem Surrounding SQL and if you go through well, yes, they're database management systems They are clients and whatsoever but down at the bottom. I highlighted what SQL says help developers Interactive databases using brogramming languages constructs instead of writing raw SQL query So that's not me manipulating jet GPT. It's just GPT what it crawled from the web Yeah And now let's give a very concrete example here That's also part of my day-to-day job. We do have IoT data we have pipelines that we have to monitor and I have to track if there's data incoming every hour and To set up such a monitoring view. I have to first generate a date range or a base range that contains all relevant date units. I can't just read the raw data because if there's data missing I just won't see it in the dashboard. Hence, I have to create this date range base range in the beginning So how to do this with SQL? So that's the challenge here And that's like one thing or like I'm one concrete example where I stumbled across If you use Postgre SQL in this case, well, this one looks nice We do have a select then there is some function generate series And then this must be the start timestamp Then this is like the end timestamp and then we say the interval would be nice to have some keywords arguments here So I know what the parameter is about but that's a different story. So the post group We're just fine. I take this. So let's move on with MS SQL The first part is nice. It says declare start date and end date So this is fairly explicit and then we do have a select with date at Okay, we have some date at function here in the beginning. 
And maybe to give a bit more of a hint on this — this is my own search history, so the result is not really representative — if you type "SQL ecosystem" into Google, you would expect a large wealth of libraries and such. But what I found, at least, was Databricks SQL — and Databricks is not really known for SQL, they're rather known for Spark, so they did some really good advertising here. And if you look at the result at the bottom, it says "Data Management with SQL for Ecologists". I'm asking for the ecosystem — why isn't the first major result about the SQL ecosystem?

So I went further and asked ChatGPT. And what does ChatGPT say? It says: yes, finally, there is a rich ecosystem surrounding SQL. If you go through it — well, yes, there are database management systems, there are clients and whatnot, but down at the bottom I highlighted what it says: these help developers interact with databases using programming-language constructs instead of writing raw SQL queries. That's not me manipulating ChatGPT; it's just what ChatGPT crawled from the web.

Now let's look at a very concrete example, which is also part of my day-to-day job. We have IoT data, we have pipelines that we have to monitor, and I have to track whether data is coming in every hour. To set up such a monitoring view, I first have to generate a date range, a base range, that contains all relevant date units. I can't just read the raw data, because if data is missing, I just won't see it in the dashboard. Hence I have to create this date range, this base range, at the beginning. So how do you do this with SQL? That's the challenge, and it's one concrete example I stumbled across.

If you use PostgreSQL, this one actually looks nice: we have a SELECT, there's a function generate_series, then this must be the start timestamp, this is the end timestamp, and then we say the interval. It would be nice to have some keyword arguments here so I'd know what each parameter is about, but that's a different story. So Postgres: fine, I'll take this.

Let's move on to MS SQL. The first part is nice: it says DECLARE a start date and an end date, so this is fairly explicit. Then we have a SELECT with DATEADD — okay, there's some DATEADD function at the beginning — and some number minus one. Hmm. And then, within the FROM clause, there's a subquery, and it says SELECT ROW_NUMBER() OVER, ordered by some c.object_id, taken from a system table. So this is already a bit: what is going on here? It's not really straightforward; the level of abstraction here is somewhat strange.

Taking it further: Redshift. This looks very similar to Postgres, and that's not by accident but on purpose, because Redshift is based on the Postgres dialect. But the weird thing here is: we have GETDATE(), cast to a date. Why do I need to do such a thing? And then I subtract the series that I generated later, and then I cast it to a date again. Yeah, that's fishy, but I can understand it.

Let's take a look at BigQuery. Ah, that's spicy. I won't go into detail here, but we see a subquery, yet another subquery, some right-padding, a split — and this is the accepted answer on Stack Overflow for generating a date range with BigQuery. How awkward is that? I will skip Snowflake here; just to show it, there's even an example using recursive queries to generate the date range. It gets really funny. So what I'm trying to show is that we are using a tool for a problem it doesn't seem to be well suited for.

Okay, so we talked about SQL, about its birth fifty years ago, and about some deficiencies of the language. Now let's focus on the DataFrame API; this goes quicker. In general, we can say that DataFrame APIs are embedded in general-purpose languages such as Python or Scala. Being embedded in a general-purpose language, all those concepts we know from SQL — tables, projections, filters, transformations — are explicitly mapped onto object representations. So we don't have a table anymore; we call it a DataFrame, and then we have attributes on those DataFrames, such as column names and dtypes, and we have methods such as filter, join, and aggregate. The great thing about a DataFrame API being embedded in a general-purpose language is that we gain all of that language's functionality, and we can also participate in the rich ecosystem those languages provide. That's a major difference.
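Coming back to the date range example from before: with a DataFrame API you can wrap the whole thing in an ordinary function. Here's a minimal sketch of how that could look in PySpark — assuming daily granularity, with purely illustrative names:

```python
from datetime import date
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

def date_range(start: date, end: date) -> DataFrame:
    """One row per day between start and end (inclusive)."""
    return spark.range(1).select(
        F.explode(
            F.sequence(F.lit(start), F.lit(end), F.expr("interval 1 day"))
        ).alias("day")
    )

base = date_range(date(2024, 1, 1), date(2024, 12, 31))
```

Real, named parameters, and a reusable function I can import, test, and compose — that's the kind of thing I mean by building abstractions.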
So, summing up our first distinction: Spark offers both SQL and a DataFrame API, whereas the cloud native SQL engines mainly just have SQL. That's the first distinction.

The second distinction goes fairly quickly: where can I execute those engines? With Spark, you can run it on your local machine, you can run it in your CI/CD environment, you can run it on premises, you can run it on your own EC2 instances in the cloud, or you can use one of the many managed services for Spark. In contrast, the cloud native SQL engines, as the name suggests, only run in the cloud.

When it comes to vendor independence: Spark is open source, whereas those cloud native SQL engines are mainly governed by specific cloud providers, such as Azure Synapse or AWS Athena, and maybe also by enterprises such as Databricks — Databricks now also offers a SQL engine. So we have less of a vendor lock-in if we go for Spark. As I mentioned in the beginning, there's an alternative title for this talk, and it would be: "The run-anywhere, vendor-independent DataFrame API versus the cloud-only, proprietary SQL, for analytical batch workloads." I hope you agree that this title would be a bit less suitable for a conference — "Apache Spark" will probably attract a bit more attention — but behind "Apache Spark" it basically says: you can run it anywhere, it's vendor independent, and it has a DataFrame API.

Okay, so now we've talked about those distinctions; let's try to derive the implications that result from them. There are six dimensions I want to focus on: managing complexity, testing, debugging and inspection, performance, future-proofness, and ease of use. I will focus mainly on managing complexity and testing, and less on performance, future-proofness, and ease of use, because they don't have that much depth to them.

Let's start with managing complexity, and for this I will introduce something related to my employer, which is Heidelberger Druckmaschinen. We produce those large-scale printing machines. They are massive — you can see the little man here on the right; he stands right where the feeder is, where the paper gets sucked in. Then each of those printing units applies a single color to the sheet; in the middle there's what's called the perfecting unit, which just flips the paper; and then there are another four units which print another four colors on the paper. Finally there's some coating, a dryer, and powder, so that the sheets don't stick to each other, and at the end there's the delivery. These machines print 18,000 sheets an hour; they're massive in size, really loud, and fairly complex. When I first started working for Heidelberg, I thought: they're printing machines. But when I saw this, I was completely amazed at how huge they are — and this is just a standard model; there are machines with twenty of those units. And talking about complexity: this is just a schematic view of a single printing unit, and you can see all of those different cylinders applying and distributing the color. Fairly complex.

Anyway, we collect lots of IoT data from those machines: temperatures, electrical currents, movements, quality measurements, everything you can think of. But there's an issue with this IoT data. We've been collecting it for quite some time — that's a huge benefit; we have this data for more than ten or fifteen years by now — but it originally was developer logging, meant for the developers to understand their own modules, the software and the hardware that they developed. To give you an example of the data, first a quick description: we have a timestamp, we have some channel information — where the event occurred — then we have some string message, we have a message ID, and then we have a numeric param value.
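Roughly speaking, the raw rows have this shape — a hedged sketch of the schema with my own field names, not the real production schema:

```python
from pyspark.sql.types import (
    StructType, StructField, TimestampType, StringType, DoubleType
)

# Illustrative reconstruction of the raw developer-log schema.
raw_log_schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("channel", StringType()),      # where the event occurred
    StructField("message", StringType()),      # free-form string payload
    StructField("message_id", StringType()),
    StructField("param_value", DoubleType()),  # numeric value, if any
])
```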
In the best case, there's a unique message ID and a numeric value. No further parsing or filtering is required; the value is just there, and only a little logic is needed to get this data out of the raw logging. But that's only the case about ten percent of the time. In the next case, we still have a unique ID, say 1001, but the information isn't numeric anymore — the numeric value is just zero and doesn't contain any information. Instead, the information is given as a message string, and this message string is somewhat XML-like, so we already have to apply some regexes to get the information out of it. The next one gets even spicier: now it's not XML-like anymore, it's more JSON-like — but not really JSON, so you can't use off-the-shelf JSON decoders — and then you see all those different hex codes. Those hex codes are fairly complex to decode, so we actually have to fetch lots of information from several other databases to make this data point speak to humans at all. And the last example is kind of the worst: we don't have unique message IDs anymore; rather, we have to combine a message ID with the channel information, and then we just have some arbitrary string that we have to parse. That's quite challenging. And then, you can imagine, software versions increase, things change, things break, so we have to account for many different failure modes, which makes this issue of just extracting the data fully complex.

So the challenge in general is: we have raw IoT data in various forms, and we need some sort of standardization layer, which we call extractors; the result is qualified data that I can put my analytics logic on top of.

What I'm showing now is a simplified PySpark implementation that we came up with in order to solve this, and I'll quickly explain a few things. We use ABCs, abstract base classes, to enforce an interface. We use class attributes to provide additional meta information, such as: has this extractor been validated or classified, and for which software versions does it work. We allow some configurable behavior — for example, in the production environment we only want to retrieve numeric values, whereas when we do a prototype or an ad hoc analysis we might want to see the labels instead of the numeric values, to make them more readable. We enforce the implementation of a filter method: this defines which rows are relevant. We enforce the creation of a parse method: this makes explicit what logic is required to extract the information — usually some regexes and string operations. We also enforce a validate method; we force our developers to provide one, so they have to think about how to ensure that the extracted results are correct. And finally, we provide a simplified base extraction method.
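The slides aren't reproduced here, so this is my own heavily simplified reconstruction of what such a base class could look like — a sketch of the idea, not the actual production code:

```python
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame

class BaseExtractor(ABC):
    """Shared interface that every extractor has to implement."""

    # Class attributes carrying meta information (illustrative names).
    validated: bool = False
    software_versions: tuple = ()

    def __init__(self, numeric_only: bool = True):
        # Configurable behaviour: numeric values in production,
        # human-readable labels for ad hoc analyses.
        self.numeric_only = numeric_only

    @abstractmethod
    def filter(self, df: DataFrame) -> DataFrame:
        """Which raw rows are relevant for this extractor."""

    @abstractmethod
    def parse(self, df: DataFrame) -> DataFrame:
        """How to pull the actual value out of the raw message."""

    @abstractmethod
    def validate(self, df: DataFrame) -> DataFrame:
        """How to check that the extracted values are plausible."""

    def extract(self, df: DataFrame) -> DataFrame:
        # The shared, simplified extraction recipe.
        return self.validate(self.parse(self.filter(df)))
```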
To see an example of this: this is the totalizer, something specific to printing machines — how many sheets have been printed on this machine. We provide some meta information, as I mentioned before: has it been validated, has it been classified, which software versions does it work for, and so on. We provide a very simple filter condition, and a very simple parse — in this case we just cast to integer — and then there's some validation. In the end you can instantiate this extractor and use it on a given Spark DataFrame.

Why am I telling you this? I want to pick out a few things that help us manage complexity here. For one, we can define interfaces with Python, and the benefit of interfaces is that you can explicitly communicate intent and purpose to your developers, and also to the users who use those interfaces. We abstract away implementation details — the end user doesn't see anything beneath it and doesn't need to know about it — and we also decouple the user from the implementation: we don't create any direct dependencies, but rather focus on the interface itself. One really important thing is design by contract, because in our company lots of different developers made their own implementations, and they were all completely different — there was no standard API. Now, having a shared understanding, a shared interface, there's no more Wild West of extractors, similar to the Wild West of the raw IoT data we've just seen.

Next: reusability and DRY. We can use inheritance, we can share code, we can use functions, and we don't have to rewrite the same thing again and again. Also important, in my opinion, is separation of concerns: I separate the filter and the parse parts here. Think of SQL: it's always coupled — you have a single SQL statement containing both your filter condition and your parsing logic, and you can't test those independently. Here, we have a filter method that we can test individually, separately from the parse step. And yes, we can leverage meta information, as I mentioned, and we can enable composability. Think of me doing some ad hoc analysis: I just say I need these three extractors — the totalizer, the software version, and some information about the machine state — I put them into what we call an extractor stage, think of it as a pipeline, and then we can run them all at once.

These functionalities are really hard to achieve in plain SQL. Defining interfaces: I'm not aware of a way to define interfaces in SQL. Reusability and DRY: yes, you can use stored procedures, but they come with their own drawbacks, for example regarding testability. Separation of concerns: I have no idea how. Leveraging meta information: yes, you can use comments, but you can't really inspect them at runtime. And composability — like the example I just showed, composing different extractors into a pipeline — is hard to achieve.
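To tie it together, here is a rough, purely illustrative sketch of what a concrete extractor and a small extractor stage might look like on top of the base class sketched above — the message ID and helper names are hypothetical:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

class TotalizerExtractor(BaseExtractor):
    """Total number of printed sheets (illustrative message ID)."""

    validated = True
    software_versions = (">=2.0",)

    def filter(self, df: DataFrame) -> DataFrame:
        return df.where(F.col("message_id") == "1001")

    def parse(self, df: DataFrame) -> DataFrame:
        return df.withColumn("value", F.col("param_value").cast("int"))

    def validate(self, df: DataFrame) -> DataFrame:
        return df.where(F.col("value") >= 0)

def run_stage(df: DataFrame, extractors: list) -> list:
    # Composability: apply a whole list of extractors in one go.
    return [extractor.extract(df) for extractor in extractors]
```

Using it is then just `TotalizerExtractor().extract(raw_df)`, or several extractors at once via `run_stage`.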
And yes, there's dbt to the rescue. Who of you knows or uses dbt? Yeah, quite a lot. So what is dbt? Well, they say it themselves in the current documentation: they are the programming environment for SQL, giving you the ability to do things that aren't normally possible in SQL. They provide some examples, like using control structures such as if statements or for loops, and down at the bottom it says: abstract snippets of SQL into reusable macros, analogous to functions in most programming languages. The funny thing is that dbt itself is written in Python, using the Jinja templating language — also written in Python — to generate SQL templates, which are then passed on to our database engines. I can skip this example. Looking at it, dbt improves the situation for many things, but it's still not on par with an actual DataFrame API.

So, summing up managing complexity: since PySpark — Apache Spark — can be embedded in a general-purpose programming language, it provides a huge toolbox of abstractions, and one thing I really like is that you can dynamically generate the building blocks of your data pipeline, whereas SQL and its surrounding frameworks basically fall short in this regard. So, first implication: when it comes to managing complexity, I'd say we have a smiley for Apache Spark and more of a frowny for our cloud native SQL engines.

Now, when it comes to testing: typically, programming languages have testability as a first-class citizen, as a design principle. However, when we look at SQL, the FROM table is always hardcoded. It's not a parameter that you can pass to a function and swap out for your test data — it just doesn't work that way. It's really hard to test SQL. So if you have any good solutions for unit testing your SQL snippets, please approach me, because I'm still struggling with this. And this is my personal observation: there doesn't seem to be a general, vendor-independent, easy-to-use testing framework in the SQL space. Yes, you can do data tests, as dbt provides them — not null, no duplicates, and such things — and you can even extend that with Great Expectations. But real unit testing — where you have some input data and some expected output data, just like in classical software engineering — doesn't seem to exist in the SQL space.

So again I asked ChatGPT how the Python testing ecosystem compares to the SQL testing ecosystem, and just have a look at the highlighted part: it says the SQL testing ecosystem, on the other hand, is still developing. Wait a minute — half a century old, and the testing ecosystem is still developing? There's something strange about the language, by design, if we still don't have a proper testing ecosystem for SQL today. It really struck me. There is some extension to dbt to do unit testing, but supplying your test data in SQL is really hard. On the other hand, in Python there's pytest. Who knows or uses pytest? Yeah, quite a lot of people. And pytest is just so great: you can dynamically generate tests, you can use fixtures, mock dependencies, rerun failed tests, get code coverage — everything you can think of — and even more advanced features such as property-based testing or mutation testing are possible and out there; you can use the rich ecosystem. We can skip this one.
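Just to show the flavour of what I mean by unit testing with input data and expected output data, here's a minimal pytest sketch for a Spark transformation — the function under test is hypothetical, running against a local SparkSession:

```python
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def add_total(df):
    # The unit under test: an ordinary function from DataFrame to DataFrame.
    return df.withColumn("total", F.col("price") * F.col("quantity"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").getOrCreate()

def test_add_total(spark):
    input_df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    result = add_total(input_df).collect()
    assert result[0]["total"] == 6.0
```

The input data lives right next to the expected output, and the transformation is just a function you can call — that's exactly what's hard to replicate with a hardcoded FROM clause.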
Summing up: Spark gets a smiley here, and SQL gets a frowny. Yeah, I think I need to hurry a bit.

When it comes to debugging and inspection: with Spark, since it's embedded in a programming language, you can just use your favorite IDE and step through your code line by line. With the SQL engines — I did some research on this and was surprised to find that it's not really possible to debug SQL; it's really hard, really challenging. There's an example from Snowflake: someone asked how to debug a stored procedure, and the answer was to write some helper function that prints into a separate table. That's not the debugging I'm used to. The same if you look at — I think this is a BigQuery example — they have an ERROR statement, so you manually add it: yes, there's an error. But that's not debugging, even though they call it a debugging function. Strange, isn't it? When it comes to inspection, Spark offers a rich UI to understand each and every part of your driver and your executors. It's really detailed — you have to be a bit of an expert — but you can detect data skew, you can detect low CPU utilization; everything is possible. With Snowflake, it's less detailed but very intuitive, though you might miss an important part. So when it comes to debugging and inspection: Spark looks great here; the cloud native SQL engines are okay, but they could be better.

When it comes to performance, this is also an interesting part. Just this morning there was a talk on Substrait and Arrow-based computation engines. To put it differently: Databricks is the Spark company — the creators of Spark founded Databricks, and at their core they have the Spark engine — but they have now also developed a custom C++ engine called Photon, which takes over the heavy compute workloads from the JVM. If the Spark company develops their own custom computation engine which no longer uses the original Spark engine, that tells us there's something wrong with the original JVM engine within Spark. There have also been open-source initiatives such as Blaze, which keep Spark as the scheduler — to generate the DAG and create the physical execution plan — but then no longer use the JVM engine and instead use an Arrow-based computation engine. And there's now also the approach of translating a Spark plan into Substrait and then having native C++ engines do the execution, likewise in the open-source space — analogous to the Photon engine that Databricks developed in a proprietary way. So I assume there will be some improvement for the Spark computation engine in the future. For now, though, I'd say those highly sophisticated distributed engines of our cloud native SQL providers, such as Snowflake, have an edge and an advantage here.

Ease of use. Here there's a clear winner: the cloud native SQL engines. Using Spark requires at least proficiency in a programming language. It can be difficult to set up Apache Spark tests on your local machine and in CI in the first place. And this is an important point: Apache Spark really forces you to think about distributed execution and the lazy execution model; you have to think about how to parallelize your code, and about the different join strategies. With SQL, everything is abstracted away: you just have the declarative approach, and there are no imperative choices to make.
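To give a feeling for the kind of decisions Spark pushes onto you, here's a small illustrative sketch — paths, names, and numbers are all made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/data/events").repartition(200, "device_id")  # choose the parallelism
devices = spark.read.parquet("/data/devices")

joined = events.join(broadcast(devices), "device_id")  # explicitly pick a join strategy
joined.cache()                                          # reused below, so cache it once

joined.groupBy("device_id").count().explain()           # nothing has run yet: lazy execution
joined.groupBy("error_state").count().show()            # only now does the work happen
```

With a managed SQL engine you never see any of these knobs — which is exactly the trade-off.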
Also, as I said, setting up an Apache Spark cluster is still quite complicated, even though there are many managed cloud services. You still have to think about the driver and your executors: how much memory do I need, how many cores, what level of parallelism am I at? You have to think about caching — if you reuse data, you have to cache it, otherwise it will be loaded again and again, which is less performant. You also have to understand whether you have data skew or low CPU utilization, and know some tuning techniques. As Spark has grown as a framework, it has included more and more optimizations, such as Adaptive Query Execution, but Spark is still more difficult to use. The cloud native SQL engines, in contrast, are easy to spin up: you just pick a t-shirt size — I want an XL cluster — then you write your SQL, and the engine does everything for you in the background; no need to worry. Of course, you can also use SQL with Spark and just let it run, but often it's better to have a look at what the Spark engine is actually doing — that's my experience from the past.

When it comes to future-proofness: Apache Spark is vendor independent, and there are managed Spark runtimes on all major cloud providers. On AWS you can use AWS Glue, you can run Amazon EMR on EC2 or on EKS, now even serverless; nowadays Athena even integrates with Spark — you can pass your Spark code to Athena, how crazy is that? You can use Azure Synapse to run your Spark workloads, you have Google Dataproc, and of course there's Databricks, which has the Spark runtime, the Spark engine, at its core. Vendor independence matters, and it's really important, but to sum up, I think neither of these will go away. My expectation is that over the next five years all of these will remain.

So, wrapping up. We had this matrix in the beginning, and now let's put everything together. We see those frownies and smileys, and I guess you can detect a pattern here — that's also the way the talk was structured, chronologically. We see some benefits for Spark on certain aspects, and we see some more positive aspects for our cloud native SQL engines. And this is my personal recommendation, my personal opinion: if you have business-critical data pipelines with high complexity, then I would choose Spark over SQL, because I have more possibilities to manage complexity, I can build my own powerful abstractions on top of it, I have proper testing support — I can actually write unit tests — and I have proper debugging and inspection methods with Spark, which are really lacking in the SQL space. However, if you have already reached a sufficient level of analytical readiness, so there's not that much complexity left in your data, then please use SQL. I would never use Spark for writing my dashboard queries — no way. As long as you just have first-level filters and aggregations, happily use SQL. But please don't write those walls of SQL with ten nested CTEs and recursive functions; they're hard to understand and hard to reason about. Use a tool which is more appropriate for the issue at hand.

And basically, this is my closing word: complementary. Use Spark for the things which are more complex and require testing and managing complexity, and happily use SQL and the cloud native SQL engines for the things they are suitable for. That's it. Thank you.