My name is Thorsten. I'm one of the product managers at Snowflake, and this is a sponsored talk, so I'm by no means an FDB expert, but I do want to walk you through some of the highlights of Snowflake, which, as you know, is heavily using FDB. You might have heard this before during the day.

One of the foundations of Snowflake is what we call a multi-cluster, shared-data architecture. What it means is that we're very disciplined about separating compute from storage, and that gives us a number of interesting capabilities. The first one is that everybody working on the same Snowflake account is working on the same data set. They're all seeing the same transactionally consistent view of the data, no matter which team or which part of the company is making changes to the data set. In addition to that, if you have different teams or different departments in your company, you can define and create independent compute resources that power each team. They then get access to the shared data set, and they're independent of whatever other teams are doing over the same data.

This has a great benefit: you can do workload separation. For instance, if you look at this busy picture here, at the top you can see there is data loading going on; ETL workloads are happening. At the bottom you have, for instance, dashboarding: power users using, for instance, Tableau to do BI. Whatever happens on the data loading side gets its own dedicated compute resources (you see this little SQL cluster sitting there; that's the independent compute for data loading), and at the bottom you have different compute clusters powering the BI experience for the users working in Tableau. So you have this workload separation.

The other aspect is that you can also scale these compute resources independently of each other. For instance, if your data science team needs massive scale or compute capacity, you can have that for them in a dedicated compute cluster while other teams are working with much smaller clusters. And the last point I mention here is that you can also give the same team multiple clusters, and that essentially gives you virtually unlimited concurrency: unlimited numbers of users or queries running at the same time over the same data set.
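(The talk doesn't show code for this part, but here is a minimal sketch of what this workload separation might look like, using the snowflake-connector-python package; the account details and warehouse names are hypothetical placeholders, and multi-cluster warehouses assume an edition that supports them.)

```python
# A minimal sketch of workload separation, assuming the
# snowflake-connector-python package; account details and
# warehouse names below are hypothetical placeholders.
import snowflake.connector

cur = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
).cursor()

# Dedicated compute for data loading / ETL.
cur.execute("CREATE WAREHOUSE IF NOT EXISTS ETL_WH WAREHOUSE_SIZE = 'LARGE'")

# An independent warehouse for BI dashboards over the same shared data;
# MAX_CLUSTER_COUNT > 1 makes it multi-cluster, scaling out for concurrency.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS BI_WH
        WAREHOUSE_SIZE = 'SMALL'
        MAX_CLUSTER_COUNT = 4
""")
```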
So let's drill a little bit into how we're doing this. This is an architectural blueprint of the different layers that Snowflake is using. At the top you see what we call the cloud services. Essentially, they're first of all making sure that when a connection comes in, you're authorized and authenticated to work with Snowflake, and then, as the query goes through the different layers, we're compiling a query plan and taking care of transaction management. All of that is backed by metadata that is stored, for all intents and purposes, in FDB.

Once a query gets submitted for execution, it goes into the second layer that you see here. This is where we have our so-called virtual warehouses; those are the independent compute clusters that you saw on the previous slide. They're connecting to the third layer at the bottom, which is the data storage where Snowflake databases live. What's nice about this is the tiered storage architecture between level two and level three: we're using the local memory and the local disks on the VMs that power this middle layer, the virtual warehouses, to cache any data that is frequently used by your queries, and that saves a lot of round trips into the remote storage of the third layer.

All right, looking at the storage: this might be a term that you also heard earlier today. The way that Snowflake organizes its data internally is into so-called micro-partitions, and this is done automatically as the data is being loaded into Snowflake. We take the data and cut it into micro-partitions that are a couple of megabytes, maybe a few tens of megabytes, in size. As part of that, we transform the incoming data into a columnar representation, and we also build up metadata structures that allow us to reason about what's contained in which partition.

This gives us the ability to do pruning during query processing at various levels. First of all, because of the columnar representation, we can discard any columns that are not participating in the query; we don't need to touch those. In addition to that, the metadata gives us the ability to reason about which micro-partitions may actually contribute results to the query you have just submitted, and for those where we are sure they contribute nothing, we don't need to touch those partitions either. That obviously gives great performance benefits. These partitions are also the unit for our DML operations: when you run insert operations, we're not changing existing micro-partitions; we're typically just adding new micro-partitions at the end.

Now, one of the cool pieces here is that this is true for both structured data and semi-structured data that you may load into Snowflake. No matter what you're using as the incoming data format, we convert, or transform, it into this optimized storage based on micro-partitions. There's no schema and there are no hints required to do that, and as soon as you submit a query, it will automatically benefit from that optimized data structure, so you're getting the pruning and the filtering that we just talked about.
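(Again a minimal sketch, not shown in the talk, of what loading and querying semi-structured data might look like, reusing the `cur` cursor from the sketch above; the stage and table names are hypothetical placeholders.)

```python
# A single VARIANT column holds the raw JSON documents; Snowflake still
# shreds them into its columnar micro-partition format behind the scenes.
cur.execute("CREATE TABLE IF NOT EXISTS tweets (v VARIANT)")

# Load JSON files from an external stage over the S3 bucket; no schema
# definition or hints are required.
cur.execute("""
    COPY INTO tweets
    FROM @my_s3_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Queries over VARIANT paths benefit from pruning: only the referenced
# paths and the micro-partitions that can match are actually scanned.
cur.execute("""
    SELECT v:lang::string AS lang, COUNT(*) AS n
    FROM tweets
    GROUP BY 1
    ORDER BY n DESC
""")
print(cur.fetchall())
```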
Automation in general is a bigger topic for us: we'd like to make things as automatic as possible for our users. We talked about micro-partitions; there is no distribution key you have to define, we do that automatically for you. You don't have to worry about a high-availability configuration or failover between different virtual machines; all of that comes automatically out of the box. There's no vacuum, and you don't have to define specific statistics to make sure that your queries run fast. All of that comes out of the box, with virtually no knobs, and the time to see great performance for your workload is super small.

So let me jump over into a quick demo so that you get a brief look at Snowflake in action. What I would like to do in the demo is show you one particular aspect where Snowflake integrates into the Spark ecosystem. We have a Spark connector that you can deploy into your Spark application; it injects itself into the query plan generation process of Spark to give users a highly optimized experience when they're storing their data in Snowflake.

So what I've done here, essentially: I've taken about two gigabytes' worth of tweets from Twitter and stored them in S3, and you can see those files here. Let me jump over here; I have a browser window with some queries over that data in Databricks. You can see I'm populating a data frame with the data that sits in the S3 bucket over on the left, and this is the schema that we're getting: it's all JSON, with lots of properties in here. If we do a count, you can see there are about 3.2 million different tweets contained in the data set in the bucket, and if you squint you can see down there that the count operation takes about half a minute. I'm running this on a two-node Spark cluster, so two compute nodes and one master node, a pretty standard configuration; I'm just clicking through the cluster setup. And here's a slightly more involved query: over these tweets I'm grouping by the language property and then counting, to get the most popular languages across the tweets, and you can see English and Japanese are awfully popular here. Again, if you squint, you see that takes about half a minute to run.

So that's kind of interesting. Now let's take that same data set and bring it over into Snowflake, which I've done over here. You see a tweets table that lives in a Snowflake database. If I click on it, you can see that we're storing the JSON in a VARIANT column; this is our data type for semi-structured data like JSON and Parquet, the one I talked about, which automatically does the translation into the internal storage representation.

Now let's jump back over here, into a different browser window where I'm connected to a somewhat smaller cluster, just one compute node instead of two, and I'm connected to Snowflake. Let me run a couple of queries here against Snowflake. This first one just checks whether we can get a connection to my Snowflake warehouse; it should hopefully finish in a couple of seconds and spit out the current date. You can see today is December the 11th, and you can also see that here we're essentially telling the Spark read operation to run a current-date query against Snowflake.

So let's do something more interesting. Let's populate this data frame with all the rows in the tweets table that we just looked at over on the left. Let's run this and then do the same count operation on it, and you can see the JSON coming back for the first couple of rows. Here's the count operation: in about two seconds, the same 3.2 million JSON documents. Now here I'm running an actual Snowflake query to populate another data frame, and you can see in the syntax that I'm using some of the built-in Snowflake SQL constructs to work with semi-structured data in Snowflake.
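(The notebook cells aren't reproduced in the transcript, but a minimal PySpark sketch of what these connector reads might look like follows; the connection options are hypothetical placeholders, `spark` is assumed to be the active SparkSession as in Databricks, and the spark-snowflake connector is assumed to be on the classpath.)

```python
# Hypothetical connection options for the Snowflake Spark connector.
sf_options = {
    "sfURL": "my_account.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "DEMO_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "DEMO_WH",
}

# Populate a DataFrame with all rows of the tweets table.
tweets = (spark.read
          .format("net.snowflake.spark.snowflake")
          .options(**sf_options)
          .option("dbtable", "tweets")
          .load())
print(tweets.count())  # ~3.2M rows in the demo

# Or hand the connector a query that uses Snowflake's semi-structured
# SQL constructs (VARIANT path syntax) directly.
langs = (spark.read
         .format("net.snowflake.spark.snowflake")
         .options(**sf_options)
         .option("query",
                 "SELECT v:lang::string AS lang, v:text::string AS text "
                 "FROM tweets")
         .load())

# Relational transformations over the DataFrame are pushed down: the
# connector translates this group-by into a single Snowflake SQL query.
langs.groupBy("lang").count().orderBy("count", ascending=False).show()
```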
Let's send this off, and there you can see we're getting languages and some text back. Now let's run this query here, which again does the same grouping by language and counts the tweets per language, and you can see that same result coming back in about three and a half seconds. The nice thing, again, is that we're running a much smaller cluster, backed by a Snowflake warehouse, and you can see some of the performance benefits, which are due to the internal optimizations that Snowflake does behind the scenes. And note that you literally don't have to make any changes to your Spark application code. The connector injects itself transparently into the plan generation process: whatever transformations you do over your data frames, it looks at them, figures out which ones are relational in nature and backed by data that sits in Snowflake, automatically translates those into SQL queries, and runs them in Snowflake. We can take a quick look at this here: this is the SQL query that does the counting, and down here you see the group-by on language. So this is the SQL query that was automatically generated by the Spark connector when we submitted that last group-by statement in Spark.

So with that, let me jump back in here. Great software only exists because of great engineers, so, this obviously being a sponsored talk, I wanted to show this for a minute. We are hiring into our development offices: one local here in the Seattle area, one down in San Mateo, as well as one in Berlin, Germany. Things that we are working on are, obviously, performance; one of the most recent efforts in that space was materialized views, where we took a fairly fresh view on how to do materialized views for relational databases. Other areas of interest to us are the whole data science and machine learning space, and how we best support those workloads from a Snowflake perspective; modern enterprise data architectures, like how we integrate into application stacks such as Spark, which we spent some time on today; performance for our metadata layers and our cloud services, in particular FDB for metadata at scale; and also increasing our footprint by growing into additional cloud regions, both in AWS and in Azure. That's all I have for today. Thanks very much.