Hello everyone. Thank you for joining this session. My name is Philippe Mété and I will be presenting the Holium protocol, which we have just open sourced. Before starting, we really want to thank the Linux Foundation for the organization of this Open Source Summit.

Before jumping in, a few words about who we are. My name is Philippe Mété and I will leave the floor later on to Thomas, Thomas Chatagnier. We're both engineers in computer science. We met a few years ago when we were consultants and developers in the blockchain ecosystem. We worked for a company in France called Blockchain Partner, which has since been acquired by KPMG. We've always loved pushing for innovation, so we always managed to dedicate some time to research and development and to participate in a few hackathons. For example, we managed to put into production some projects around zero-knowledge proofs. At the beginning of the year, we felt and understood the need for a new open source protocol in data transformation. That's when we left our positions and co-founded a company called Polyphene to develop and actively support the ecosystem around what we will talk about today, the Holium protocol.

This session will be organized around three parts. The first one is this presentation, which will last around 30 minutes. It will be followed by a live demo by Thomas, and then we will have a few minutes for questions and answers.

Let's start with the presentation. The presentation itself will be made of three main parts: the broad idea that we are at the confluence of two macro trends, which I will talk about, and then we will try to assemble, part by part, the actual design of the Holium protocol.

The first part is about the first macro trend, and for that I'd like to talk about Web3. This first long-term trend is about data. It's about what we do, what we use computers for, what we do with digital data, not at an individual scale, but really together, in fact, as a species. How do we use data to collaborate? Do we even collaborate? Today, digital data undeniably has the greatest potential to serve as a source of knowledge. How do we, at a high architectural level, create, share, and store data? To answer these kinds of questions, we can talk about Web3. And even if the internet is larger than the web, even if not all digital data has to go through the internet, it's at least a good starting point to address this topic.

So let's take a very general approach to what Web3 is. We later called Web1 an era of information, where users could read data from centralized servers. Then we've seen the rise of the Web2 era, with more platform-oriented networks where people could not only read but also write data on these platforms. It's the era of wikis, of WordPress blogs, of Facebook. And now we talk more and more about Web3, a Web3 ecosystem where applications allow users to not only read and write, but where, we say, the networks themselves are the support of execution. So it's a general evolution.

Let's see how we could define Web3 in a little more detail. We could ask one of the initiators of Web3, maybe Gavin Wood, one of the co-founders of Ethereum. What he says is that Web3 is an inclusive set of protocols to provide building blocks for application makers.
These building blocks take the place of traditional web technologies but present a whole new way of creating applications. So Web3 is a set of protocols to build applications. It's quite a broad definition, for sure. It's hard to define Web3 in terms of use cases, in the sense that we don't know all the use cases yet. At the age of Web1, we couldn't predict the destiny of hyperlinks, in fact. In a way, this underlines the fact that Web3 could have the potential to become the rightful successor, or at least something as big as the web and Web2.

Maybe it would be easier to draw a formal outline of what Web3 is if we focused on the shared technical characteristics of these networks. To start with, we can say that Web3 definitely started with Bitcoin and blockchains. And we can pick three intertwined characteristics that often make for Web3 protocols.

The first one is the existence of decentralized and open networks. Even if the web is essentially decentralized, Web2 was conceptually more about centralized platforms. Web3, by contrast, is made of decentralized and also open networks that people can join freely.

Another characteristic is the existence of consensus algorithms. The big novelty in Bitcoin was essentially proof of work and the fact that this kind of algorithm allows peers to achieve agreement. One of the most important innovations is that this kind of algorithm allows people to trust the network implicitly instead of having to trust each actor of the network explicitly.

Finally, a third important characteristic is that these networks are often built with an embedded economic incentive, which means they are not only pure peer-to-peer decentralized networks; these incentives make them autonomous in a way.

Of course, these three things lead to many other features like censorship resistance and modularity. But in the end, maybe we could say The Economist had it right in 2015 with its cover: Web3 networks are essentially trust machines. And if we come back to this illustration, then of course it is important that Web3 is not only about read and write, but also about execution. Most importantly, though, what changes is the nature, the quality, of what is stored. Data on these networks verifies open and transparent rules which are embedded in the networks themselves. So what changes is, first, the trustworthiness of data.

Okay, let's step back a little from the hype around a number of Web3 projects; it could be useful to try to extract some key primitives out of these projects. It's hard to do, and it's a choice of mine, but we can see that a lot of projects are about the transfer of value, like the original blockchain, the Bitcoin blockchain. We then have a number of projects about making a global computer, like Ethereum. There are interesting projects around identity, with the Ethereum Name Service for example, and the rise of self-sovereign identity. And one of the key primitives we mostly focus on today is storage, in particular with the whole IPFS stack. The IPFS stack is a very modular stack, and we will study it briefly, building it from the bottom layers to the highest layers, to see how it works.
The first part of the stack is libp2p, which is the core of IPFS. It's a modular networking stack that makes it easier to build robust peer-to-peer networks. Its functions include discovering nodes, connecting to nodes, describing data and transferring data, things like that. On top of that, you get Multiformats, which holds a collection of hash algorithms and self-describing formats, and that makes the whole system more interoperable and upgradeable. Then you get IPLD, for InterPlanetary Linked Data. It holds data models that help traverse and connect blocks of data through content identifiers. It's basically a middleware that unifies different block structures. Then, obviously, IPFS itself, which is the distributed storage and transmission protocol. And on top of that, the highest layer of the stack is Filecoin, the economic incentive layer, where miners can earn Filecoin by providing open hard-drive space to the network, while users spend Filecoin to store data on it. So it really is a modular stack. Conceptually, we can say it's a beautiful stack; it's one of the most beautiful ways of storing digital data we've ever had.

So the next question one may ask is: why would we use these networks? What kind of data would we store on them? And that is a valid question. We are living in an era where AWS S3 is very efficient for static storage, for example. Information sometimes needs to be stored in databases that are super efficient at what they do. So that's a valid question: on one side we have a beautiful object, but the question remains.

We don't want to play with the hype, but the fact is we definitely already live in a world of data-driven decisions, and it is undeniable that this trend will most certainly keep on increasing. Look at the exponential amount of data we manipulate, the market value of big data companies like Snowflake, the number of services we can create once datasets are publicly released, like Uber with Google Maps and Superhuman with the Gmail API. When you look simply at the continuing hype around artificial intelligence, then in the end, what better way is there to store and share data, collaborate, bring this world to life and turn this hype into a reality, than with Web3 protocols and the IPFS stack in particular?

So that's basically the first macro trend I wanted to talk about. These Web3 protocols are something quite close to our hearts. We may do it for ethical reasons, like our predecessors in the sixties, but in the end, you don't even have to do it for ethical reasons. The fact is we won't build the best AI, we won't unlock the best data-driven decisions, we won't be the best computing civilization, if we don't share the raw resource, if we don't share the data. Of course, competition is good sometimes, or most of the time, but it stops being good once it restrains innovation. Why would we rely on GAFAM to build the future of AI? Why would we delegate our right to innovation to locked-in platforms? Simply put, we think it's time to foster competition on what we do with data, and not on who owns data. Another way of phrasing it would be: let's delete the data rent by making its quest obsolete. In the end, it's not even about deleting the data rent in itself, which is quite a political goal; we have to do it to unlock innovation, to get better AI, which is way more consensual, obviously. So that was the first big macro trend.
Let's talk about the second one, which is at a more practical level: architectural patterns in the domain of data integration and their level of openness. Why focus on data integration? The first reason is that, in terms of collaboration, it intersects with a number of our interests, because the purpose of data integration is to combine data that resides in different heterogeneous sources to provide a unified, coherent view of it. Secondly, because in commercial domains, the explosion of the amount of data generated and the rise of service-oriented architectures made research in data integration techniques quite vital for business information. It is indeed one of the fields where most innovation happens, even if quite a few open problems remain unsolved.

What we can see in the evolution of common stacks, here on this diagram, is that components are getting more and more service-oriented, which is quite prone to interaction, to more collaboration. Since the release of Redshift in 2012, we've seen the rise of modern data warehouses and the emergence of an ecosystem of cloud-native adjacent technologies. In the years 2014-2016, the ETL pattern, for extract-transform-load, became extremely popular thanks to these cloud services, and now it's moving towards more ELT patterns, where transformation tools like dbt play quite a prominent role.

How can we explain the popularity of ELT patterns? The main reason definitely lies in the power of modern data warehouses, which got so performant and scalable that it has now become possible to run data transformations in the database rather than in external data processing layers. The typical data stack today could be synthesized like this: data sources, like many databases, PostgreSQL, and services like the AdWords and Facebook APIs; followed by integration services like Fivetran, Segment, or open-source solutions like Airbyte. All data ends up in your data warehouse, like Snowflake, BigQuery or Redshift, and on top of that you get a transformation tool, which is almost certainly dbt. You may have a reverse-ETL tool like Census or Hightouch, and in the end your business intelligence service, like Looker, Mode or Redash.

When we look at that, we easily understand the importance of dbt. In fact, dbt sits on top of your data warehouse and its only role is to take code, compile it to SQL and run it. That makes dbt essentially the T in ELT: it does not extract or load data, but it is perfectly suited to transform data that's already located in your warehouse.

This model has limits. It has built-in limits, and these limits are clearly identified. dbt only works with SQL. That's one of the strengths of current ELT stacks, that all components speak the same language, which is SQL, but it's also one of their limits. They have to rely heavily on models, and they can only transform structured data. So what about semi-structured data? What about complex data? dbt is unable to process them, and that's a clearly identified limit. The team behind dbt clearly acknowledges it, and they say that, quite frustratingly, there is no great solution today. Even Bob Muglia, the former CEO of Snowflake, acknowledges this limitation and publicly wishes for a solution to be introduced in the near future. What is missing is support for complex data.
Could you imagine if you could fully transform all types of data, like images and videos, together with any source of semi-structured data, in a data warehouse? He thinks this is going to come in the next two or three years. That was a year ago.

That is precisely the problem we want to solve with Holium. We want to handle the T in ELT for not only structured but also semi-structured and complex data. We think this is what we need to unlock not only business intelligence but a broader intelligence, and that it sits on top of data warehouses and data lakes; and conceptually, what wider data lake could there be than IPFS? This is where our two trends come together: first, being able to transform complex data to create performant, generic ELT flows, and then connecting it to one of the most beautiful universal data storage networks we've ever had, to build a future-proof data stack. That was the second macro trend.

Logically, the question now is: how do we do that? What kind of architecture can we design to do that? That brings us to the third and final part of this presentation, where we will try to assemble, part after part, this Holium protocol. We will progressively build the Holium protocol in three steps that will help us get a progressively larger picture of how it works. We're going to compare Holium to dbt on three aspects, the three you can see here. All these issues tie in with each other and make up the complete design of the solution, but the first one we can focus on is the execution environment of transformations.

For dbt, transformations are held in the database; they are based on SQL. In our case, we want transformations that can be written in many languages. Ideally, we don't even want to restrict the number of languages we support. We only need to unify their mode of execution, to make them interoperable and to make them individual, modular elements of pipelines. Thus, we simply use a containerization solution. Conceptually, it could have been Docker. But in 2021, we can take advantage of WebAssembly, whose first version was finalized by the W3C in late 2019. In terms of containerization, WebAssembly is lighter, more portable and more secure, and it offers almost native performance for compiled languages. In terms of pipeline platform, it offers quite interesting characteristics like replayability, scalability, reliability and security; it's quite a bit more secure than Docker. These are all essential pipeline design principles.

We've got this execution environment, and we also need to define what is executed in this environment. We need to standardize the interface between two steps of the pipeline, the role that was played by SQL models in dbt. For the sake of brevity, we will oversimplify: essentially, we can say that all transformations are atomic transformations that take one JSON input parameter and return one JSON object. There's a light wrapper, the I/O data translator, which is responsible for mapping the fields of all these JSON objects to actual variables in the original transformations. But simply put, it's as simple as that. In this example, we have a function, which could be written in Rust, that is responsible for performing a Euclidean division. We would hand over a JSON object with a dividend field and a divisor field, and the atomic transformation, if successful, would return another object with two other fields, for quotient and remainder.
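To make this concrete, here is a minimal sketch in Rust of what such an atomic transformation could look like. The struct names and the serde_json plumbing are our own illustration of the idea, not the actual Holium SDK interface:

```rust
// Minimal sketch of an atomic transformation: one JSON object in,
// one JSON object out. Illustrative only, not the Holium SDK.
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct Input {
    dividend: u64,
    divisor: u64,
}

#[derive(Serialize)]
struct Output {
    quotient: u64,
    remainder: u64,
}

// The transformation itself: a pure function performing Euclidean division.
fn euclidean_division(input: Input) -> Result<Output, String> {
    if input.divisor == 0 {
        return Err("division by zero".to_string());
    }
    Ok(Output {
        quotient: input.dividend / input.divisor,
        remainder: input.dividend % input.divisor,
    })
}

fn main() {
    // The I/O data translator role: map JSON fields to function variables.
    let raw = r#"{"dividend": 1024, "divisor": 462}"#;
    let input: Input = serde_json::from_str(raw).unwrap();
    let output = euclidean_division(input).unwrap();
    println!("{}", serde_json::to_string(&output).unwrap());
    // -> {"quotient":2,"remainder":100}
}
```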
So the first answer is that each step of the pipeline is an atomic transformation that runs inside a WebAssembly runtime. Now let's see how the chaining of these transformations can be defined. In dbt, Jinja variables play an important role in this mission. Simply put, the point is that, to enable chaining, we don't want to attach any keys to the objects used as input and output parameters of transformations, the data that in fact serves as the only interface between two transformations. So we consider all keys to be contextual information, to be only semantic information. For example, here on this slide, 462 is used as a divisor in this Euclidean division, but once stored at rest, it's only 462. It may be used as something other than a divisor in another transformation, and it will still be the same 462; it will still be the same value in substance. So context may seamlessly be attached or detached later on. This is why we define a deterministic way to transform JSON maps into arrays and vice versa by removing keys, and that's what we call Holium JSON. Furthermore, we don't really use JSON, we use a binary version of it called CBOR. That brings us to the format we mainly use in our protocol, which is Holium CBOR, a subset of CBOR.

If we come back to this diagram, then at the interface, at the boundaries of the execution environment, you don't have these full JSON objects, but only simplified Holium CBOR arrays that hold the most important part of the information. So you get these atomic transformations, which you could use to run, for example, the Euclidean algorithm to find the greatest common divisor. Here, dividend and divisor are contextual information, but all input and output fields, subfields and sub-subfields have precise index numbers. What it means is that creating pipelines basically means connecting all inputs to preceding outputs. That's basically what we do. Pipelines are as simple as these connections between Holium CBOR index numbers, and that's how we define pipelines.

Finally, let's study the case of storage, the storage of all this information, data and metadata. In dbt, all information is stored in a centralized data warehouse, but with Holium, we can see how it could be shared and stored at the lowest level. Essentially, we hold scalar and recursive data, scalars and arrays, and what we think is that any piece of data should be uniquely identifiable whatever the context; its ID should be enough to access the data. That's the reason why the Holium protocol uses what are called content identifiers, CIDs, and content addressing. CIDs are based on cryptographic hashes of the content: any difference, any small difference in the content will produce a different CID, and the same content will always produce the exact same CID. We use content addressing to identify content by what's in it rather than by its location. In this context, scalar values are stored as they are, and recursive values only link to other values through these cryptographic hashes. And that is true not only for the integers we've seen up until now, but also for other types of data like floats, text and byte strings, and also for transformation bytecode and pipelines. They all use these CIDs, they all use linked data, and they all can ultimately be safely stored and shared through the IPFS stack. So that's the last of our answers, and that's basically how the Holium protocol achieves its mission.
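Before moving on, here is a small Rust sketch, under our own assumptions, tying together two of the ideas above: it strips keys from a JSON map deterministically, the Holium JSON idea, and derives an identifier from a hash of the content. The real protocol works on Holium CBOR and produces proper multihash-based CIDs; plain SHA-256 over JSON bytes stands in for that here:

```rust
// Sketch of the two ideas above, not the actual Holium implementation:
// 1) deterministically turn a JSON map into a key-less array,
// 2) content-address the result by hashing its serialized bytes.
use serde_json::{json, Value};
use sha2::{Digest, Sha256};

// Drop keys recursively; serde_json's default Map is ordered by key,
// so the same object always yields the same array.
fn strip_keys(value: &Value) -> Value {
    match value {
        Value::Object(map) => Value::Array(map.values().map(strip_keys).collect()),
        Value::Array(items) => Value::Array(items.iter().map(strip_keys).collect()),
        scalar => scalar.clone(),
    }
}

fn content_id(value: &Value) -> String {
    // The real protocol hashes Holium CBOR bytes into a CID;
    // plain SHA-256 over the JSON bytes stands in for it here.
    let bytes = serde_json::to_vec(value).unwrap();
    hex::encode(Sha256::digest(&bytes))
}

fn main() {
    let with_context = json!({"dividend": 1024, "divisor": 462});
    let at_rest = strip_keys(&with_context); // [1024, 462]
    // Same content always produces the same identifier,
    // whatever role (divisor or not) the values play elsewhere.
    println!("{} -> {}", at_rest, content_id(&at_rest));
}
```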
That's also the end of this part of the presentation. The design and all the specs of this protocol have been open sourced, as well as a CLI to start using it. I'll now hand it over to Thomas for a live demo of this CLI. Thank you.

Hey everyone, Thomas here. Thank you Philippe for the presentation of the theory around the Holium protocol. I will now try to demonstrate the implementation of the protocol that we built, in the form of a CLI. It should help you create your data pipelines and make them run.

First things first, I will show you the subcommands at your disposal in the CLI. They are all quite tightly linked to the objects that Philippe already presented. We have here the pipeline command, for pipeline runs and also management. We have the transformation command, which you can use to add new transformations and manage transformations inside of a pipeline. The connection command to connect those transformations, and the data command to import data manually. We also have portation here, which I won't get into in detail now, but which automates the import and the export of data from the pipeline when you run it on your file system, just to help you around that.

As another command, we also have the init command here. That's the first thing you should use when actually setting up a Holium project on your machine. Myself, I already used it because I wanted to have some materials ready for the demo. So if I do ls here, I can see that I have that .holium folder. And if I explore it, I can see that I have three files: a .gitignore for the VCS, and two config files, which can be used, for example, to specify that you don't want to use data version control, DVC, or something like that. And two folders: the first one is cache, for optimization and performance, and objects, which will contain all your objects such as data, transformations, pipelines, portations or connections.

Now let's say I want to build a pipeline. The pipeline I want to build is a case that I found on Kaggle, which is a quite basic case of data cleaning and completion. The idea is that I have this dataset here, which is projects data. That projects data comes from the World Bank; I've shortened it to only 10 lines so it goes a bit faster with the import, etc. I want to remove the last column, which has no purpose; it's called column one and is empty here. And I want to add another column, which is the country name reformatted, because you can see here it's kind of strange: I have a single column with two iterations of the country name. I also want to add the country code, just to open up later possibilities, and maybe it's a bit clearer.

To do so, the first thing I want to do is actually import a new transformation, because I already have two of them that I can show you. What these transformations do: the first one will actually drop the column, as I showed you, and the other one will reformat the country name to have it in the right format. The transformation I want to add is here. It's coded in Rust. We have a Rust SDK to compile our code to proper Wasm bytecode. What it does is it takes CSV data, adds a new column here, which is called country code, and then, thanks to a dedicated crate, it will use the country name to generate the country code and push it into the new column to expand the CSV data. So what I can do is use the Rust SDK command line to compile it.
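To give a rough idea of what that third transformation does, here is a sketch in plain Rust. The actual demo code is compiled with the Holium Rust SDK and uses a dedicated crate for the name-to-code lookup, so the lookup function and column layout here are purely hypothetical:

```rust
// Illustrative sketch of the third transformation: take CSV data as
// rows of strings and append a "Country Code" column derived from
// the country name. This stand-in lookup is hypothetical; the real
// demo delegates it to a dedicated crate.
fn country_code(name: &str) -> String {
    match name {
        "France" => "FRA".to_string(),
        "Brazil" => "BRA".to_string(),
        _ => "UNK".to_string(), // unknown country
    }
}

// All transformations share the same interface: an array of arrays
// of strings in, an array of arrays of strings out.
fn add_country_code(mut rows: Vec<Vec<String>>) -> Vec<Vec<String>> {
    for (i, row) in rows.iter_mut().enumerate() {
        if i == 0 {
            row.push("Country Code".to_string()); // header row
        } else {
            let code = country_code(&row[0]); // country name column
            row.push(code);
        }
    }
    rows
}

fn main() {
    let input = vec![
        vec!["Country Name".to_string(), "Project".to_string()],
        vec!["France".to_string(), "P123".to_string()],
    ];
    let output = add_country_code(input);
    assert_eq!(output[1][2], "FRA");
}
```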
In my case, to keep the demo time short, I've already done it and it's prepared here. Let's actually go to a command line, and we can import our transformation and its metadata. The metadata is also produced when compiling the transformation, and we will be able to see, when I inspect it, why metadata is important. What the metadata does is specify, thanks to JSON Schema, so it's specified in JSON object format, all of the transformation's information: the name here, the entry point of my Wasm bytecode, and also the inputs, so CSV data in my case, which is an array of arrays of strings, and the output, which is of the same type. Thanks to that, anybody who receives the pipeline or checks out the pipeline can see all the information about the different processes inside of the transformations.

Now that I have my transformation, I'll be able to check my list of transformations. Sorry, small issue with my machine. I'll be able to check inside of the transformation list that I have the transformation here. And now I want to connect those two, right, because they are the last two of my pipeline. To do so, what I can do is use the connection subcommand. To connect different transformations, what I want to use is their hashes, but also their ports here. So what I'm saying here is that I want to connect my parent here, which creates a new column, country name, with the formatted country name inside of it. And I'm saying: okay, take all the columns that are outputted, the format is the same as this one, so take all the columns, headers and also data, and put everything inside the inputs here of my last transformation. I specified a subindex here, but I shouldn't really be obliged to do so, because if I had just put zero it would have worked; it was just to showcase how you can actually write down ports and how it can work. So let's create that connection. And here, if I inspect it, sorry, making sure I inspect with the correct hash, I'll be able to check that I have my child and the mapping. That's pretty cool.

Now we have our entire pipeline, which is ready. If I put in my data, it should work, right? So I need to import my data inside of my project. How do I do that? There are two ways. The first one would be to specify the source folder and add a portation, an automatic importer, to import my data, and also add a result folder, which would be used at the end of my pipeline to output CSV or JSON or whatever format I specified, inside of the result folder. The other way of adding data is to import it manually. I can do that with the data subcommand. If I use the data subcommand and import the CSV that I showed you earlier, I'll be able to get the hash of my data, so the hash of my CSV. And if I look at my data list, you might think: okay, we have the hash of the CSV, we have that. But we also have the hashes of all the underlying data, because we fragment it into granular pieces; that way we optimize storage space inside of the machine.

So now I have my data imported. That's the first way; we'll be able to use it later on, I'll show you how. But we also want to add portations, to have automatic importers and exporters. It works kind of like transformations, because we want people to be able to share and use different importers and exporters quite easily. In our case, I will add an importer which will take a CSV file and produce the data type that we can see inside of transformations.
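To picture what such an importer produces, here is a small sketch assuming the standard csv crate; the real importer is a shareable Holium object, so this only shows the gist of the conversion:

```rust
// Gist of a CSV importer: turn a CSV file into the common data type
// that all transformations consume, an array of arrays of strings.
// Uses the csv crate; the real Holium importer is a shareable object,
// so this only illustrates the conversion step.
use std::error::Error;

fn import_csv(path: &str) -> Result<Vec<Vec<String>>, Box<dyn Error>> {
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false) // keep the header row as plain data
        .from_path(path)?;
    let mut rows = Vec::new();
    for record in reader.records() {
        let record = record?;
        rows.push(record.iter().map(|field| field.to_string()).collect());
    }
    Ok(rows)
}

fn main() -> Result<(), Box<dyn Error>> {
    // "projects.csv" is a hypothetical path standing in for the demo file.
    let rows = import_csv("projects.csv")?;
    println!("imported {} rows", rows.len());
    Ok(())
}
```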
All transformations take the exact same input and output types, as I showed you earlier: an array of arrays of strings. So now that I've added my importer, what I can do is actually add the portation, so add the way of importing. What I'm saying here is: okay, take the data folder, use it as a source, pass it into my transformation, and you have to know that the data is typed CSV as an input. Let's add that. And if I inspect that portation, I'll be able to check that it's an importer, what the source is, and which transformation it is running on: here, a directory path, a transformation, and the type, import. That's nice.

Now let's add an exporter. For the exporter, what I want to export my data to is a CSV file. So I'll add my exporter and also add my results here. What I'm saying is... I used the wrong hash here. What I want to say is: use the exporter that I just added, and now put the results inside of, oh, my bad, wrong one, put the results inside of results here, which is already ready in my environment.

So now what I can do is just run the pipeline. To run the pipeline, again, I had two types of data, imported manually and automatically. If I want to add my manual data, I can use the hash here of my data, which is the hash of my CSV. So I do a holium pipeline run with my data, because it takes a CSV as input. Once I've run that, so I've run my three transformations, drop the column, reformat the country name, and add the country code, I can see that I have all my data here, which is ready. If I do a holium data list, you can see that my data has been added correctly and the metadata has grown.

Now I want that data to be readable, sorry, for human eyes. So I will run it again, but not with the manual data; with the actual importers, the automatic importers and automatic exporters. Now, if I check my results, I'll be able to find the CSV. Yep, here I have my CSV. And now you can see that inside of my CSV, I've dropped the column, which doesn't exist anymore, and I've added the reformatted country names and also their country codes.

So that's all for me. I've shown you a basic usage of our CLI on a classic ETL case of data cleaning and expanding. Well, I hope that it was clear for you. I hope to meet some of you later on, or discuss with some of you about that, and to get your feedback on what could be improved. I will give the mic back to Philippe so that he can talk to you about the next steps of our protocol. I'll see you around, everyone, and have fun at the Open Source Summit. Bye bye.

Thank you for this live demo. A few last words. From the beginning, we've designed this protocol thinking of it as an open source project, so we've just released everything on GitHub. You'll find all useful links, like the specifications, the documentation or the GitHub and Discord, online on holium.org. We are actively working on development parts like integration; we want to create bridges to other components of the ecosystem, both to develop and run pipelines. We are also actively building more extensive user interfaces, a graphical interface as well as a more data-as-code oriented interface. And one last part, which is quite important: in terms of protocol, it can always be improved. So anyone can, right now, open and track what we call Holium Improvement Proposals online. So again, thank you very much for your attention, and thank you again to the Linux Foundation for the organization of this great event. Thank you.