Okay, we are going to start now. Thanks for coming. I'm going to talk today about table formats, this new concept that is somehow revolutionary, a new revolution for the big data world, and that has a direct impact on people working in data analysis and machine learning. First, I forgot to put "open" table formats, because what makes sense for this conference, of course, is that they are open. I'm Ismael. I work for Microsoft. Don't worry, this is not a vendor or paid talk at all, so it will be transparent. I'm using a Mac, just to check. I'm using a Mac; this is the new Microsoft, so everything's good. Just a little story about myself: I used to work as a data engineer for three or four years. Then I switched to work as an open source engineer, mostly on Apache projects, all these Apache big data projects. In particular, I was working on Apache Avro and Apache Beam. If some of you have complained about Avro not being backwards compatible and hate the maintainers, well, it's not my fault, but I am complicit in this. I'm also a member of the Apache Software Foundation, which means that I care about open source and the way we do things at Apache. Now I work as a data advocate for Microsoft, because life changes and you get a family. Let's talk about the eternal problem of data engineering. Historically, we always have this idea: oh, we have so many data sources in our company that we want to integrate, and they are quite varied, from databases to Excel files to CSVs, and even now to these kinds of SaaS vendors who also control data. We can count GraphQL kinds of things too. We want to integrate all of this inside a centralized repository for data. Let's call this a data warehouse. That's what they called it in the past, and it keeps coming back all the time.
We have this separation between what we call the operational data, the data we use to operate our applications, the ones that run the business, and the analytical data, the data we want to analyze. We don't want to touch the operational side, because it is critical for the business, so we want to move all this data somehow into this data warehouse. Well, data warehouses were somehow an issue that was solved decades ago, because that's what they sold at the time. But this had some issues. In particular, it was too SQL-oriented, I would say, so it had some constraints, and the biggest constraint, of course, was that it was really vendor-centric. The performance of your jobs and everything depended also on the design decisions that the vendors made at the time. And when you wanted to scale, it was quite costly. So there was some pushback against those systems, let's say. But already at the time, these systems stored the data somehow; they had this kind of private file format for it, and it was rarely supported by other vendors. That's what differs from the topic of this talk. And of course, these formats were tied to the design of the data warehouse. They modeled things in a really particular way, and you rarely knew how this was designed. You couldn't access the data, see the raw data, and do things with it. And everything changed because of the big data revolution, and this is the infamous MapReduce paper. Here we care about two things, or rather Google cared at the time: one, how can we parallelize computations to make them faster, and two, how can we store data distributed across multiple machines? Both of these tasks, apart from the programming model that is also part of this paper, were shown in the context of examples that are not really SQL-able, let's say.
These were the early days of machine learning at scale, let's say, in the sense of distributed data. There were tasks like calculating counts and running Google's PageRank algorithm that were part of this, and that you cannot do easily with SQL. So this showed that SQL was not the only way to do things. And of course, Google presented their distributed file system in a paper, and the idea is the same: we want to store data and process it as fast as we can. This is somehow the origin of what we call cloud object stores, no? And the origin of the so-called data lake. Cloud object stores, and all credit here to Amazon, who made them popular with S3, are just really good. Let's be honest, they are not open in that sense, but they're really easy to use. They just have a key-value model, and they have durability and availability guarantees that are pretty amazing. S3 advertises eleven nines (99.999999999%) of durability, which is pretty impressive. They are the original serverless thing. I mean, who cares about servers when you use S3? Nobody. They are "cheap", and I do the quotes here, because most of you probably know that they are cheap to put data into; they are a little bit more expensive if you want to get the data out. And of course, they were rapidly supported by all the different open source projects, so they have massive support in the ecosystem, and they have nice little extras like multi-region replication and control for permissions. You can trigger events when files are written. There are a lot of nice things. So it was kind of natural that database applications, or applications oriented to data, were going to end up putting their data inside this sort of distributed file system.
With this ability to put data in this distributed file system, the mindset started to change, in the sense that now we can have copies of our data for cheap, and we can somehow think that we have infinite space for everything. So why don't we keep them as they are? Why don't we keep these copies and start to work more on this concept of immutability? The advantage is that with immutability we can reproduce things, because we can go back. This is the paper by Pat Helland; it's a really interesting read if you are into these things. Or, if you want a more familiar kind of presentation, this is Rich Hickey, who does all these Clojure presentations, and who also has a talk where he mentions the impact of this new way of thinking about data: just imagine we have infinite storage, and the consequences of that. But of course, when we have this data lake concept, we think we can put files inside the data lake and that will do it. The true story is that it's not so easy, as many of you have probably lived it. The first thing, when we have data in a format that is not structured or has no schema associated with it, is that we have to fix this. We have to give it a schema. We have to clean it. We have to normalize the different columns. This is inevitable. So there's a little bit of, I wouldn't say a lie, but it's not as free of structure as we would want. Of course, the advantage is that we can have other types of formats there: images, sound, video, whatever. And of course logs, which are also something we use a lot for analysis. So this idea of "let's put in our raw files and everything will work", well, it's not so easy, for two reasons. The first one is that the files we use for this kind of analysis or machine learning algorithms are sometimes not fit for distributed systems.
Take CSV, for example: it's a file format that has the advantage that it's row-oriented, so we can cut the file into pieces and maybe send different parts of the file to different machines, which is good. But it's not efficient, in the sense that it's not encoded in a proper way. It's textual, so it's super big. It does not have an associated schema, so you are always guessing: oh, this column looks like a date, but maybe not; and maybe this column is just an enum, but you don't know. And the other thing, well, CSV is pretty non-standard. Those of you who have had the painful task of implementing just a CSV parser will know that there are many, many exceptions. Other formats are even worse, like JSON or XML. The problem with those, apart from being verbose, is that you cannot cut them in the middle; you cannot split those formats. And if you think about the serialization formats of your language, like Java serialization or pickle in the case of Python, they don't pass the test either, because sometimes they are not consistent between versions of the language, which means that you can store a serialized object in the file system or in the cloud object store, and when you take it back, if the version changed, it won't work. So this brings us to the creation of data formats for this problem. Basically, the first task here was: okay, we need to define a schema for this data, to define which type corresponds to which column. We need to guarantee that we can split it, or partition it, as they also call it. We have to have a really well-defined specification for this. We should, if possible, compress the data, so we use less storage. And of course, we want it to be efficient. So that's what happened with this first generation of formats, let's say. And these formats, well, here I'm showing how they work in practice.
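To make the schema point concrete, here is a small Python sketch (the columns and the schema dict are invented for illustration): a CSV reader hands back strings only, so the real types have to come from somewhere else, which is exactly what a schema gives you.

```python
import csv
import io

# CSV carries no schema: every value comes back as a string,
# and the reader has to guess (or be told) the real types.
raw = "id,amount,when\n1,19.90,2023-05-01\n2,7.50,2023-05-02\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(type(rows[0]["amount"]))  # <class 'str'> — not a float

# A hypothetical explicit schema removes the guessing.
schema = {"id": int, "amount": float, "when": str}
typed = [{k: schema[k](v) for k, v in row.items()} for row in rows]
total = typed[0]["amount"] + typed[1]["amount"]
print(total)  # ≈ 27.4, now real arithmetic instead of string handling
```

Formats like Avro and Parquet bake exactly this kind of schema into the file itself, so every reader agrees on the types.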
As you can see, there is the representation of a table with the A, B, C columns and the data types that correspond to each column. If you serialize this in a row layout, you get something that makes immediate sense, and this is what Avro does, for example; this is what Avro is useful for. Then there are more recent formats like ORC or Parquet, which is probably the most popular one now, that store the data in columns. Well, it's more complex than this: they store row groups, and inside the row groups there are the columns. But what is interesting is that if you only care about one or two of these columns, you don't need to read all the others; you can jump with pointers inside the implementation, let's say offsets, to the specific parts that you care about. This makes reading faster, and what we want in all this distributed data world is to read as little as we can. That's our goal now; we don't want to read more, because it's too costly. And apart from this columnar representation, one thing these formats also store is statistics. By statistics I mean, for example, the min value or the max value of a column. This is pretty good, because I don't even need to go and read the data; I can just immediately get: oh, this is the minimum value. So that's also good for querying. So we had these formats, we had these files, but we still didn't have, let's say, a SQL-like representation. At the time a project called Hive appeared, which was the first one to bring this concept of: with these files, we can represent a table somehow. It's what I call a data warehouse for large datasets, and it's totally oriented to a SQL experience. This is the explicit appearance of the concept of a table format. And as you can see, what we are defining is just an external table: we set the compression we are using and the file format we're using, in this case Parquet.
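The row-group and statistics idea can be sketched in plain Python. This is a toy model of what Parquet-style readers do, with made-up data; real file footers carry much richer metadata than a single min/max pair.

```python
# Toy column store: each "row group" keeps its columns separately,
# plus min/max statistics, so a query can skip whole groups
# without touching the data at all.
row_groups = [
    {"cols": {"a": [1, 2, 3], "b": ["x", "y", "z"]},
     "stats": {"a": (1, 3)}},
    {"cols": {"a": [10, 11, 12], "b": ["p", "q", "r"]},
     "stats": {"a": (10, 12)}},
]

def scan_a_greater_than(threshold):
    out = []
    for rg in row_groups:
        lo, hi = rg["stats"]["a"]
        if hi <= threshold:
            continue  # whole row group pruned by statistics alone
        # Only column 'a' is read; column 'b' is never touched.
        out.extend(v for v in rg["cols"]["a"] if v > threshold)
    return out

print(scan_a_greater_than(5))  # [10, 11, 12] — first group never read
```

That is the whole trick: the min/max pair answers "can this group possibly match?" before any data is read, which is why statistics make queries faster.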
In the table definition we also set a location; in this case I'm using the location of one of these object stores, the Microsoft one. And that's how it is: we put files inside a directory, and we read them. So, what's a table format? Well, a table format is a way to present all the files that compose a dataset as if they were a single table. That's what we're trying to do. The way that Hive does it is with a directory, and as you can see here, the layout is just a directory: we list the directory and get the files. What's the issue with that? Well, this can be slow in some particular cases. If we, for example, want to filter things, like in this query, we have to read everything to know the results. So what people ended up doing was creating this concept of manual partitioning inside the directories. As you can see here, they put the specific date, and then the specific hours of data, and when you run the query, you reduce the quantity of data that you are going to read; you narrow it down a little bit, logically, let's say. But you have to be aware of this. That's the first issue, because you, as a data scientist or someone who is not the data engineer dealing with the infrastructure, cannot know these things in advance. But the cool thing, let's say, is that the Hive table format became the standard for everything. It started to be supported not only by Hive, but by Spark and all the other systems. And the issue is that it's not as good as intended. I couldn't find a better way to put it, because that's the truth: updates suck in this model. And one thing that you can immediately notice is: what happens if two people are writing into the same directory? You are basically in trouble.
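Back to that manual partitioning for a second: it can be illustrated in a few lines of Python (the paths are hypothetical). Because the partition value is encoded in the path, a filter on the partition column only has to keep the matching directories, never opening the other files.

```python
# Hive-style manual partitioning: partition values live in the path.
files = [
    "events/date=2023-05-01/hour=00/part-0000.parquet",
    "events/date=2023-05-01/hour=01/part-0001.parquet",
    "events/date=2023-05-02/hour=00/part-0002.parquet",
]

def prune(files, date):
    # The moral equivalent of WHERE date = '...':
    # keep only the files under the matching partition directory.
    return [f for f in files if f"/date={date}/" in f]

print(prune(files, "2023-05-02"))  # only part-0002 survives
```

The catch the talk points out: this only works if the person writing the query knows the partitioning scheme, which is exactly the leaky abstraction the newer formats try to hide.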
So concurrent writes were the main problem, because writing is not a safe operation here; it's a problem of isolation, let's say. Also, updates were not transactional, and, at least in the Hive world, apart from Hive itself you had to have a Postgres database on the side to coordinate the locking: who wrote, who didn't write, who was writing. Of course, you also had issues with the ACID properties when you used distributed file systems that were not consistent, which was the case of S3 until like two years ago. You were going to read, and things were supposed to be written, but you didn't see them yet. That was another problem. And of course, sometimes if you really needed to make a copy, you had to copy a lot of files to another site. This created a mess, because a lot of people started to create staging areas that made things even more complex: you now had to track that this data came from this stage and goes to that stage, and nothing is connected. So that was pretty bad. And reads are not so good either; they could be better, of course. One thing is that this listing operation on the directory is quite slow, especially if you have too many files; that's the real problem. Another thing that happened is that data became stale quite rapidly, because the updates were so complex and slow. And the last one, which I like: since you put everything in the same directory, all these cloud storage systems tend to hash the first part of the directory name to distribute the data internally. So if you always use the same prefix, sometimes your data is not really distributed, and it's probably better to have different prefixes at the beginning of the path.
So with all these issues, and there were some others, as I mentioned: when you used this Hive table model, you had to be aware of some things, and the statistics were not really kept up to date either. They were somehow, I won't say optional, but since everything around updates was complicated, the metadata was not updated. So we were on a quest for a better table format, one that solves many of these issues: we want ACID properties, we want proper updates and upserts, we want support for concurrent writers, all of it. But one that also keeps the same advantages that Hive had, that is, having an open specification and being supported by many of the systems in the big data world, and the data world in general. It would also be nice if we could improve some things, like tracking the evolution of the table: what happened with the schemas of the table, what's going on, when we created a new column, and when that happened. If we can also track changes to the tables in general, the writes and appends that happened to them, then as a consequence of this design we could go back to other versions of the table, this concept of time travel, or even rollback, let's say. And of course, we want to hide this complexity of leaky abstractions and partitioning and all these things, because as a user you don't care about these internal details; you just want to analyze the data, run your training algorithm, or extract some features. So we had the solution, finally, a single standard for this? But no, this is what happened: three different companies were dealing with this problem, and three different solutions came out of it. Databricks produced Delta Lake, which was afterwards donated to the Linux Foundation; that's why it's at this conference. There's Apache Iceberg, which was created at Netflix. And then there's Apache Hudi, which was created at Uber. All three are trying to solve the same problem, all three with a different approach, and all three with different properties and levels of support. So it's not yet a moment where you can say with 100% clarity who is winning this battle. We ended up in this kind of XKCD situation about standards: we were looking for one unified standard, we came from the unified standard that was Hive, and now we have extra standards. It's not exactly the XKCD case of proposing a new standard that replaces the others; it's three different standards that live at the same time. A pity. But how do they work? It's quite simple, in reality. The idea is that instead of tracking directories, what we're going to track is a file: a file that acts like a snapshot or commit, containing the list of the different files that compose the table, plus extra metadata, like what the schema was at this moment in time, what the current partitioning strategy is (that's for the case of Iceberg, at least), the min/max statistics I mentioned that can improve SQL queries, and other properties. And as you can see, well, if you are familiar with Git, and I hope so, because I won't explain it in more detail than this: you are creating snapshots that can be read by different people, while somebody else can be writing a new version, like it happens here in S3. So this is a kind of simplified view of it in practice. And this is how Delta does it. Where we had the table before, now we're going to have an extra directory where we put the metadata. This metadata is a JSON file that records the different operations that are part of the table. In this case, we have two Parquet files that compose version zero.
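This snapshot-and-commit mechanism can be sketched in a few lines of plain Python. This is a toy model, not the actual Delta or Iceberg layout: the versions and file names are made up, and real logs store richer JSON actions per commit. Replaying the log up to a version gives the table state at that version, which is also exactly how time travel works.

```python
# Toy transaction log: one commit per version, each a list of
# add/remove file actions (file names are hypothetical).
log = {
    0: [("add", "part-0000.parquet"), ("add", "part-0001.parquet")],
    1: [("remove", "part-0001.parquet"), ("add", "part-0002.parquet")],
}

def snapshot(version):
    """Replay the log up to `version` to get the live file set."""
    live = set()
    for v in sorted(log):
        if v > version:
            break
        for op, path in log[v]:
            if op == "add":
                live.add(path)
            else:
                live.discard(path)
    return sorted(live)

print(snapshot(0))  # ['part-0000.parquet', 'part-0001.parquet']
print(snapshot(1))  # ['part-0000.parquet', 'part-0002.parquet']
```

Note that nothing is ever overwritten: a new version only appends a new commit, which is what makes the old versions stay readable.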
When we create version one, maybe we remove one of those files. Of course, none of this happens manually; it is done by the different algorithms we are running, or the queries we are executing, when we write the next version of the data. The partitioning in Delta Lake is still explicit like this; in Iceberg it's a little bit hidden, so that's a difference. But let's say it's somehow straightforward. It's reminiscent of Git, if you have seen a little bit of its internals. And the problem of concurrency control was solved too. When users want to write, they take the latest version available at that moment, in this case the first version, and if both of them try to write at the same time, the first one who manages to write wins, and the other one has to retry, update to the new version, and fix the conflict by themselves. So how do you use this in practice? Well, if you are familiar with Spark, this is quite straightforward: the only thing that changes is the format that we use. Instead of using parquet as the format, we use delta, and that's most of it. If you are used to SQL kinds of systems like Hive, you just declare that the table uses Delta, and that's it. And a consequence of this is that we have more operations than we had before. We can do selects that depend on the version, and this can also be done with timestamps. So I can not only query version one or version two, but also query what the table looked like at a given date. And from the data science point of view, this is huge, because now you can go and play with the data as it was when you trained the model, for example, or when you were extracting the features. So that's interesting. Of course, one thing we may want, since they are quite similar, is to convert the directory of files that we had with Hive into a Delta or Iceberg version, and there's a way to do that. There are also ways to optimize the data.
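That first-writer-wins protocol is optimistic concurrency control, and it can be sketched as a tiny loop. This is only a sketch: real implementations rely on an atomic "put if absent" of the next commit file in the object store, and the retrying writer also has to check that its changes still make sense on top of the new version.

```python
# Optimistic concurrency: a commit succeeds only if the version
# the writer based its work on is still the latest one.
class TableLog:
    def __init__(self):
        self.latest = 0

    def try_commit(self, based_on):
        if based_on != self.latest:
            return False  # somebody else committed first — retry
        self.latest += 1
        return True

tbl = TableLog()
a = tbl.latest                 # writer A reads version 0
b = tbl.latest                 # writer B reads version 0 too
assert tbl.try_commit(a)       # A wins; table is now at version 1
assert not tbl.try_commit(b)   # B is rejected and must retry
assert tbl.try_commit(tbl.latest)  # B retries on top of version 1
print(tbl.latest)  # 2
```

No locks and no side database are needed, which is precisely what Hive's directory model was missing.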
When we talk about optimizing data in this case: as I mentioned, these files are partitioned, and these partitions or files can end up being of different sizes. What we want is to have homogeneous, well-balanced sizes. This is done by the implementation, not by the standard, but the standard helps to rewrite this. We can also have vacuum-like operations, like in databases, where we expire earlier data. Oops, there's a typo there. And finally, we have this log of all the operations we have done on the table, so we can see the different things that have happened: we added more data, we appended data, we merged data, we deleted data, and what the version was when these things happened. This is pretty good for following the story of the table. So we get this kind of traceability; I won't dare to call it lineage, but at least we have a way to audit the table changes. But of course, as I said, there is more. You get zero-copy clones: you have one table, and you want to create a copy just to run your experiment on the side. Well, that's way easier to do than before, and if it doesn't work, you can go back and roll back your changes, and nobody's going to complain about your mess. Of course, since we now have better statistics, the queries are going to be faster. And some of these systems, and Databricks in particular does this pretty well, cache some of the read files on the local disk, so they can read them faster; if they have faster disks, this is a good optimization. The table format ecosystem is filling in rapidly, with support from all the different projects, and even the commercial vendors are supporting it at some level. There are also new vendors who are really focused on this: there is Tabular, a company from the creators of Iceberg, supporting Iceberg, and there is Onehouse supporting Hudi.
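The optimize step mentioned at the start of this section can be sketched as a simple bin-packing pass over file sizes. Everything here is invented for illustration (file names, sizes in MB, the target size); real implementations also rewrite the transaction log so readers atomically switch to the compacted files.

```python
# Toy compaction ("optimize"): bin-pack many small files into
# fewer groups, each close to a target size; every group would
# then be rewritten as one well-sized file.
files = {"f1": 10, "f2": 20, "f3": 90, "f4": 15, "f5": 40}  # MB
TARGET = 100

def compact(files, target):
    groups, current, size = [], [], 0
    # Largest first, so big files don't overflow a half-full group.
    for name, sz in sorted(files.items(), key=lambda kv: -kv[1]):
        if size + sz > target and current:
            groups.append(current)
            current, size = [], 0
        current.append(name)
        size += sz
    if current:
        groups.append(current)
    return groups

groups = compact(files, TARGET)
print(groups)  # each group stays under the ~100 MB target
```

Vacuum is then the complementary operation: once the compacted files are committed, the old small files are no longer referenced by the latest version and can be physically deleted after a retention period.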
So this is quite exciting, the way this is changing. Well, just to recap a little bit: why should I care about this? If I'm a data scientist, why should I care? The first reason is that you can start working on scalable experiments from the beginning, not just taking a CSV and then saying, oh, I cannot do anything because this doesn't scale; that's important. Another is that you can also start to think about data versioning from the beginning, like, oh, I want to roll back or move forward across my versions. Of course, the most important one for me is that you can reproduce what your experiments do, and this is key. As I mentioned before, you also have zero-copy tables to create new tables, which is super useful, and you can roll back in case things don't work. I think the ones winning the most from this are the data scientists. And why should I care about this if I am an open source contributor, if I'm working on data tools or ML frameworks, let's say, more on the tooling side? Well, one thing is that there are possibilities to create new projects around this. For example, I can imagine: what if I want to move from Delta to Iceberg? That shouldn't be that difficult, because it's just metadata, but I haven't yet seen a project that does that. Also, if you already have support for Parquet or for other data formats, maybe you care about integrating this: maybe you want to support Delta because you saw all the advantages it has, because more and more companies are starting to move to it, and because of the performance improvements you can get from having this metadata at hand. And this, in the end, is what I think is the real rise of the lakehouse, as they call it, a strange term. The idea here is that we have, in the middle, a set of common data stored in cloud object storage.
And then you put any tool that you see fit on top, to process this data, to read it, to write it, with the flexibility of being on a distributed file system, so you can put whatever you want into it; you are not tied to the implementer, let's say. Unlike what happens with a data warehouse, no? If you are in BigQuery or Snowflake, well, you don't know where your data is or how you can take your data out in time. Here, at least, it's all open: all the specifications are there, and you can just use them, because the tools are also open. So for me, it's a big thing. So, what's next? The first question that comes up, of course, is which format will become the standard, and I'm pretty pessimistic about this today; I used to be more optimistic. Databricks, of course, is pushing Delta Lake a lot, and I would say that it's the most user-friendly format of the three right now; if you want to use one, it's what you can go straight with. Iceberg, in my opinion, is well advanced in many aspects, and it seems that, as a result of this commercial competition, Snowflake said just one or two weeks ago that they are going to support more of Iceberg. So we are probably not going to end up with a single standard, but with two, maybe. I have to be honest, I don't follow Hudi that closely, so I don't know its current status. Of course, as you can see, sometimes we make strange design decisions, like when the Hive guys decided to use a directory for this; well, maybe the same will happen with these formats. There are things that can still be improved. For example, the first version of Iceberg didn't support deletes. That was fixed in version two, which is the current one that everybody uses, so don't worry about that. And now they are adding support for some security features in version three.
And in the case of Delta, there are also new things coming with the different releases, like, I remember, column renames that were not supported. Of course, once we have this first step, the next step is to have Git-like semantics for our data, and there are tools and projects growing around this, Project Nessie and lakeFS, both open source. The idea is that now you want somewhat more advanced Git semantics, like a tool to create a branch from this version of the table. It's interesting, and definitely an area to watch closely. Of course, adoption is going fast. Most of the people who already had a data lake saw the advantage of this and are moving, if they have not done it yet. And the next step for me is more growth of the ecosystem from a more local point of view: support in more languages, in the kinds of tools that everybody uses every day. This is still ongoing. In case you want to go deeper into this, I already uploaded the slides; there are more detailed presentations of what I just mentioned at a high level. I think that's all for me. So, in case there are some questions... all right, I think this just has to act as a microphone. Okay, I'm kind of curious how you think about mitigating some of the initial concerns you might have about adding an extra dimension to the data that you're storing, right, this version history. Do you think that in all cases it makes perfect sense to keep a full transaction log of the entire history of the table? Or do you think there's a trade-off there that needs to be considered? I think that's a good question. I just want to repeat it, to see if I got it.
The question is whether it makes sense, for example, to have such complete traceability of the data, as I understood it; what is the trade-off of doing this or not doing it? I would say it all depends, as usual, it all depends. Of course, this will add more issues around how to manage the stuff, and I definitely think tools are going to appear to deal with this. On the other hand, it comes with this mindset of "storage is cheap, so I don't care", and since these systems have so-called copy-on-write semantics, we are just adding extra appends; that's why it's called Delta, by the way. So this is cheaper than before, so I think the tendency is still in that direction, and since you can also remove things when you run optimize or vacuum operations, I don't think you lose much. Is there any kind of native or planned support to integrate some of this with more tiered storage as well, where you can have older versions that you are maybe not as likely to go back to in cold storage, or something like that? That's a good question, and that's curious, because as far as I've seen, nobody is using the versioning capabilities of the object storage itself for this; but I suppose that's the next thing. I could build, like, a multi-billion-dollar company off of this. Yes, you can go ahead, yes. No, but maybe somebody's doing it; I'm just not aware of it at the moment. The other thing is that it definitely can make sense: since you have events on a new version of a file, you can trigger actions backwards, like, okay, I'm going to trigger the vacuum now, or whatever. That's interesting. Thank you. Okay, thank you. And don't hesitate to ask me.