Thank you so much. So this is Bacalhau. This is the temporary logo. I love it too, but it looks like a dead fish, which is not good. You can't have a new company without some, what do you call it, credibly pithy branding, so we're using that as a placeholder for now. I actually also like this: the fish way to process data.

I wanted to add a caveat for this session. My session is going to be an extremely high-level walkthrough, drawn from my experience. A lot of you don't know me from Adam. Previously I led Kubernetes product for several years — I was the first non-founding PM for Kubernetes — and I started the Kubeflow project. I've been in ML and data science for a long time, so this is very much about pains I have seen in the world. However, it's going to be super high level; it will not answer many, or potentially any, technical questions. If you want those, you're going to have to stick around. A short way to describe this is: PM hand-waving from me now, and real stuff happening momentarily, okay? So just stay with me.

The first thing I want to do is set the stage for the world we're going into. It's an incredibly powerful world, and it has gotten so much attention. We in the Web3 space obviously know a lot about this, but it's really something everyone knows about. As Juan mentioned, we're adding a petabyte of storage deals a day — new data coming onto the network constantly. That is unheard of out in the world, for a platform to be adding this much data. And when you look at the analysts, they talk about how big data and data usage will affect every industry. I know this slide is a bit washed out — you can go watch it all online — but down the left-hand side is basically every domain you could possibly think of.
Within the next few years, they will all be affected by the accumulation and use of big data. To measure whether it actually matters, one proxy is spend per year. As a reminder, Juan put up a figure of about $369 billion in total cloud spend. Call it between $300 and $400 billion; roughly in five years, about a third of that will be on data alone. That's an enormous number and an enormous piece of the pie, and obviously we are participating in almost none of it right now. So there's an opportunity there.

And to show you that it really is early innings, you can follow where the big data investments are happening: about $67 billion in 2020 moving into big data platforms. This is public and private markets — not spend, but betting on companies to solve this problem. So there's a ton of attention here. And it's not our job to supplant those companies; in many ways it's our job to complement them, to enable them to go off and build great big businesses.

I promise this is the end of the stage-setting, but it's almost impossibly big to talk about some of these numbers. People estimate that about $3 trillion is wasted yearly on bad data, bad data processing, and things like that. And — I promise this wasn't coordinated — the number Juan circled in that pie chart, down in the lower left-hand side, is the same one I have here: users generate about 2.5 exabytes of data every day. So on and so forth.
And just to give you some inspiration that we're potentially saving the world here: Google is literally using big data to help solve fusion. Not a terrible thing to be spending our time on.

Juan talked about debuggability, monitoring, and so on. This is the level of success for organizations today. The numbers vary, but across almost all of these studies, 70% of projects are successful or less — meaning you can basically flip a coin on whether your project fails. It's that bad. So there's a lot we can do to improve this.

Okay, so that's the market. I hope I've convinced you that this is big and we can go after it and make a big difference. One thing I want to note first: there are many, many super smart data developers in this room right now. You should go talk to them. They work in academia, they've worked at big companies, and they'll have a really good sense of this. For those who are not big data developers, I'm going to walk through the pain they experience today, and I hope to impart to you our target market, at least at the start.

To give you a sense of it: our target is the roughly four million data developers today who are using big data in some way — developing big data pipelines, transforming large data sets, things like that. And they're growing extremely quickly: by various measures, about 10x from 2016 to today. So we can plug into them. And for better or worse — again referencing Juan's talk — they are almost entirely ignored by developer tooling today.
For example, take the standard for developer tooling today: setting breakpoints. You've got GDB, right? There is no GDB for a distributed system, for data processing, for any sort of pipeline. And that's a nightmare. How do you set a breakpoint in your data pipeline to know whether you're transforming something wrong? It's really hard today. Really hard. Fixing that is one goal among many others. If we did nothing else but take all the standard development tools people have today on their local machines and enable them to work in a distributed way, we would have already won — already done better than that $67 billion of investment has. That's a very modest goal, and we can go much, much further than that.

Sorry, this slide is a little washed out, but where do data developers spend their time? On the left-hand side is the standard flow for a data developer today — I'll read it out since it's hard to see: data loading, data cleaning, data visualization, model selection, model training and scoring, and deploying models. This framing is broadly from the ML space, but the concept of developing a model from data is not specific to ML: you output a set of things, create an artifact, and use that artifact in your code, whatever it might be. Roughly speaking, about 70% of the work is just getting your shit together — before you even begin to build artifacts. I'm not saying we can't go further, but if we just solve that 70%, I promise you're going to make a ton of people happy.

For those who don't know what a typical data pipeline looks like: you start with ingestion and processing. You move to engineering and splitting the data — this is the live set I want to train on or analyze — and then you have a holdout set that your training code never touches, which you use to test what you've trained.
You always want to keep those separate, because if you allow your test set to bleed into training, you can overfit and hit other issues like that. Then you finally train, or create your artifact from the result; then you serve that artifact; and ideally you loop the results back to the beginning. If you show this to just about any data developer today, they'll say: of course, this is exactly what I do. And it doesn't need to be on a distributed platform — this is also what they do locally, all over the world. This is our focus for now. That's not to say we don't want to do federated learning or distributed training or checkpointing or other ambitious things like that. But again, if we just solve this, we will make a huge difference in the world.

So what would data developers like? I've summarized it in three categories, and I'll keep plugging back into them as we go. It comes down to three things. First, familiar — they want to understand it already, ideally. Second, simplified — even from where they are today. And third, collaborative. I'll get into each of these in a second.

First, a little about data pipelines. I showed that pipeline a moment ago, and people often focus very narrowly on just building a model or an artifact for their end solution. But in truth it looks like this: many, many components wired together. Ingestion, transformation, engineering, validation, training; then doing all those steps again at scale; rolling it out; and ultimately monitoring and observing it in production. That is what really moving things to production means. And again, this is not new — it's what software development looks like today. Each of these steps is independent, and each is independently composable.
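The train/holdout separation above can be sketched in a few lines. This is just an illustrative stdlib sketch — the hash-based bucketing is my own choice, not anything a particular pipeline tool prescribes — but it shows why a deterministic split keeps the holdout set from bleeding into training across reruns:

```python
import hashlib

def split_rows(rows, holdout_fraction=0.2):
    """Deterministically split rows into train and holdout sets.

    Hashing each row (instead of shuffling) means the same row always
    lands in the same bucket, so the holdout set never leaks into
    training no matter how many times the pipeline reruns.
    """
    train, holdout = [], []
    for row in rows:
        bucket = hashlib.sha256(row.encode()).digest()[0] / 255.0
        (holdout if bucket < holdout_fraction else train).append(row)
    return train, holdout

rows = [f"sample-{i}" for i in range(1000)]
train, holdout = split_rows(rows)
assert set(train).isdisjoint(holdout)  # test data never touches training
```

A random shuffle would also split the data, but it produces a different holdout set on every run — exactly the leakage the talk warns about.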
Now, the challenge here is that every person doing this development experiences it in a slightly different way. This is the classic Microsoft Office thing, where people ask: why does Microsoft Office have so many features? I don't use 95% of it. Turns out everyone uses a different 5%, which is super annoying if you're a product developer, but it is the reality of the world.

What does this look like inside a big organization? Microsoft actually published a paper about three years ago surveying themselves internally. One of the most sophisticated machine learning and data development organizations in the world, and they found 159 different tools in use. 159! Can you imagine being an SRE at Microsoft and realizing, oh no, I have to support, whatever, ten-year-old CNTK? What the hell? But what can you do? Eleven people need it. Are you going to tell them to go F themselves? No. So this is super standard, but it really highlights the need to understand that people are going to use the tools they're going to use. We need to encapsulate those tools so they can keep working the way they're familiar with, while still giving them the opportunity to participate in this very public data platform. Make sense?

Okay, so that's what familiar means. Now, to up the level of difficulty: that was just the tools — we haven't even gotten to the platforms. By platforms I mean compute providers, of which there are a lot, and data platforms, of which there are also a lot. Those are useful as well, except there are too many choices, right? We should just get rid of all of them — that's clearly what users actually want.

So here's sed — this is the Wikipedia page for sed. sed was invented in 1974, so that's pretty good: 48 years old. I think we should bring data science back to the '70s.
We should make it as easy to use this 48-year-old technology on your brand-new platform as it is today. And I cannot tell you how many people use sed. It is so commonly used out there, just to process a CSV file or something like that. It is a wonderful tool. Let's not reinvent it, let's not try to throw it out — instead, let's meet folks where they are. Okay, so that's familiar.

Simplified. Now let me walk you through what the data scientist's workflow actually looks like. Here you have a very, very standard example — this is the tutorial in machine learning and data science: creating the housing-price data frame. Go to any tutorial and you'll see one of these "predict my house price" examples. In the Jupyter notebook there, it's about three lines of code, half of which is literally loading the data in, and you're done. Pretty simple. Now, to do that exact same thing over in Hadoop — also, whatever, 15-year-old technology — it looks like this. And this is still missing about half of it. That's how bad it is. So you're asking a data scientist who had something working pretty well locally to translate all of that into this mess for the exact same functionality. Not so good, because it's just super painful.

And to be clear, it's not just data developers facing this pain — SREs face it too. So here we go: a play in one act. Our data scientist has her local machine running perfectly; her data set is ready, her model works locally, and it converges. Presto, ready to go. So the first thing she does is go to the IT ops person to provision a cluster. This is something most folks actually face: you can't just get unlimited compute, and you don't have a Protocol Labs credit card. You have to go to your central IT staff.
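For reference, the "three lines in a notebook" version really is about this small. The snippet below is a hypothetical stand-in — toy inline data and plain least squares, no libraries — for the kind of local housing-price model that is trivial on a laptop and painful to port to Hadoop:

```python
# Toy data: price here is exactly 200 * sqft, so the fit is easy to check.
sqft = [800, 1000, 1200, 1500, 2000]
price = [160_000, 200_000, 240_000, 300_000, 400_000]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature, stdlib only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(sqft, price)
predicted = slope * 1100 + intercept  # price estimate for an 1100 sqft house
```

The point is not the model — it's that this fits on one slide, while the equivalent Hadoop job does not.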
The first thing she has to do is provision, and that by itself takes forever. The IT ops person says, okay, I have a hundred other things to do — maybe you filed a ticket, I'll get to it this afternoon, later this week, whatever. Finally, it's provisioned. And now the IT ops person says, great, glad you provisioned it — now can you do all this? Here's half a dozen things or more she has to do just to take code that runs perfectly well locally to production. Many of these exist because she works in an organization that requires ACLs and various things like that. Rewrite it in Java — a super common request that no data scientist ever wanted to hear, I promise you. Use out-of-date libraries that have passed global security review, because nothing that touches production data gets deployed without that. And so on. It's just a lot to ask, and she's thinking: I just want to run my simple job, why can't I do that? And the answer is: because those are the requirements.

So she does all that — it sucked — and fine, off she goes. She provisions it, it runs, and... success! It actually worked. Except they forgot to turn the cluster off, which also happens all the time. And presto, you've just blown through your entire monthly budget, because you forgot these were GPU machines that cost, whatever, $2,000 an hour. Super, super common situation — you see it all the time. And you see really pernicious behaviors around it: I'm going to secretly spin up on someone else's cluster; I'm going to squat on GPUs because we have a limited number and I'm not going to let other people have them; I'm going to keep idle processes running so it looks like they're working. It's really messed up, but it's what data scientists face today. So one thing everyone can agree on:
MapReduce sucks. We can do better.

And finally, I want to inspire you around collaborative. Literally the reason I joined Protocol Labs six months ago was that I wanted to try to keep our children, and our children's children, from living in a barren hellscape. A really positive goal, right? The problem is that collaboration around science today is really hard, and collaboration around data is even harder. Today you have open data sets all over the place — literally petabytes of very valuable data out there in the world, which is awesome. Here you can see The Cancer Genome Atlas. This is hosted on Amazon — well, technically it's not hosted on Amazon; it's on an FTP site, and you can click a button that provisions an S3 bucket and copies it there, which means you now start paying Amazon for it, which is also messed up. But suffice it to say there's at least a catalog of these things out there. So far so good.

Now, this is Landsat — Alex is going to talk about Landsat later. Landsat is already hosted on IPFS, which is awesome. It's a super popular satellite data set, donated by the governments of the world. So today, let's say three scientists come together and want to use it. The first data scientist says: I want to create a tiled version — a subset of the original, tiled to focus on the areas interesting to me. In this case she's a volcanologist; she wants to study volcanoes, so she grabs the tiles with her volcano in them. The second wants to scale it: same thing as before, just reduced pixel density. This is a super common requirement, because you often don't need that fidelity and the images are very, very large. And the third data scientist says: I want to do the same thing, but I want to grayscale it.
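The three derived versions are all simple, mechanical transforms. As a rough illustration — pure Python on nested lists, using the common Rec. 601 luminance weights; a real pipeline would of course use an imaging library — scaling and grayscaling look something like:

```python
def grayscale(img):
    """RGB image (rows of (r, g, b) tuples) -> single-channel luminance."""
    return [[int(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in img]

def downscale(img, factor):
    """Reduce pixel density by averaging factor x factor blocks."""
    h, w = len(img), len(img[0])
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) // factor ** 2
             for x in range(w // factor)]
            for y in range(h // factor)]

rgb = [[(100, 150, 200)] * 4 for _ in range(4)]  # 4x4 uniform test image
gray = grayscale(rgb)                             # 4x4, one channel
small = downscale(gray, 2)                        # 2x2: a quarter of the pixels
```

Each transform is trivial on its own; the hard part, as the next section shows, is sharing the derived results.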
Again, it's very common when building your artifacts to use lower-resolution versions, because you don't need the full resolution — you can achieve the same thing at one-tenth or one-hundredth of the cost by working on these smaller sets. So far, so good. Each data scientist has gone off and done her own thing.

Now a fourth data scientist comes along and says: I actually want all three of those. I want it scaled to the interesting elements, tiled because I don't need all that extra land and water, and grayscaled. But she can't touch any of their work. It all went off into private research; they never republished their methodology. And papers will often describe the method, but — as someone said to me last night — I hate reading papers, because the first thing I want to do is figure out how the hell they actually did the thing, and it's hard, because often it isn't published properly, and the code doesn't run anywhere but their machine. So: not so good.

With Bacalhau, however, we can change this. Exact same situation, except that in every case they republish the CID. Now it's out there, and I can see what happened. I can see lineage — from the original data set on down, and what they did to it. The fourth person comes along and says, great, I'll just grab all three of those and use that as my seed data set. There's a variety of ways we can achieve that — and not only that, it then becomes collaborative, and others can build on it. The next person who comes along can just reuse what they did and save the time. Makes sense? Unprecedented collaboration, because of the way we're operating here. So that's the scope.
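The lineage idea can be sketched concretely. Below, `fake_cid` is a hypothetical stand-in for a real IPFS CID (just a truncated SHA-256 — real CIDs use multihashes), but it shows the key property: every derived data set carries an identifier computed from its content, plus a pointer to the CID and transform that produced it:

```python
import hashlib

def fake_cid(data: bytes) -> str:
    # Illustrative stand-in for an IPFS CID: content-addressed, so
    # identical bytes always produce the same identifier.
    return "bafy-" + hashlib.sha256(data).hexdigest()[:16]

def derive(parent_cid: str, transform: str, result: bytes) -> dict:
    """Record lineage: which CID and which transform produced this one."""
    return {"cid": fake_cid(result), "parent": parent_cid, "transform": transform}

landsat = fake_cid(b"raw landsat scene")
tiled = derive(landsat, "tile", b"tiled subset")
scaled = derive(tiled["cid"], "scale", b"tiled+scaled")
gray = derive(scaled["cid"], "grayscale", b"tiled+scaled+gray")
# Walking the parent links recovers the full chain back to the source data.
```

The fourth scientist in the story only needs `gray["cid"]`: following `parent` pointers tells her exactly how it was produced from the original Landsat data.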
I hope I'm getting you to the inspiration: familiar, simplified, collaborative. Again, these are just my words — I'd love for us, as a community, to come together, figure out what our core tenets are, and move forward from there. So: can we improve big data with small changes for data developers? That's where Compute over Data on Filecoin — the Bacalhau project — comes in.

Vision. Again, these are my words; take them for what you will. There's lots of wordsmithing to do — it's too buzzwordy — but you get the idea: I think we can transform big data by giving developers simple, first-class distributed tools and unlocking a collaborative ecosystem. That, I think, is our mission. Again, it needs honing; I think we should probably have a whole conference just on how we talk about this thing. But setting that aside, this is what I'd like to do, and it's all the things I've mentioned already: we simplify, meeting people where they are with the tools they already know and love; we deliver performance improvements, because we can — I'll talk about that in depth in a moment; and then we launch this new collaborative science community, which folks will talk about later.

What does this look like? You take a 10-gigabyte CSV file and upload it to IPFS. From that, you get a CID. Then you execute from the command line — we have a downloadable binary; you can go to bacalhau.org right now and install it yourself. You submit your job: you name the CID, and you name the command. In this example I use sed, as I mentioned earlier, to process a large CSV and filter it down to just the rows within, whatever, 50 kilometers of Portugal. Pretty simple stuff — stuff data scientists do every single day. And then I fetch the results. Presto. I've added one new tool, but most of this is totally understandable to a data scientist no matter where they are. I didn't have to use Hadoop or HDFS.
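The job itself is just row filtering. As an illustrative sketch of the same logic in Python — Lisbon's coordinates are my assumed center point for "within 50 kilometers of Portugal"; the actual demo ran sed directly over the raw CSV — it looks something like:

```python
import csv, io, math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    rlat1, rlat2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = rlat2 - rlat1, math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

LISBON = (38.72, -9.14)  # assumed reference point, for illustration only

def filter_near(csv_text, center=LISBON, radius_km=50):
    """Keep only the rows whose lat/lon fall within radius_km of center."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows
            if haversine_km(float(r["lat"]), float(r["lon"]), *center) <= radius_km]

data = "id,lat,lon\n1,38.70,-9.15\n2,41.15,-8.61\n3,48.85,2.35\n"
near = [r["id"] for r in filter_near(data)]  # only the Lisbon-area row survives
```

The win is that this kind of filter runs where the data already lives — you ship the small command to the 10-gigabyte file, not the file to the command.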
I didn't have to rewrite this shit in Java. I didn't have to figure out concurrency or job orchestration. It just works. In addition to that, the nodes use temporary storage and mostly idle compute, and the results are automatically added back to the network. Privacy and related things we're going to have to tackle; right now we're focused on public data and performance, as Juan mentioned. Failures resolve automatically — there are retries, there's concurrency — and it's ideally quite cheap. And I haven't even gotten to the biggest thing: no egress. You didn't have to move that 10-gigabyte file at all. It was already there; it was like running locally. And it obviously gets much, much worse as the data size gets bigger. Egress is, I think, Amazon's most profitable line — but I'm not going to say bad things about anyone. Okay, one more.

So we go back to our play in one act. The data scientist comes along and says: here's my data — perfect. How do I engineer it? Presto: she submits. She adds her CID, she writes her sed command. It ran great locally, so she knows it's going to run great in the cloud. We're already checking bash syntax, which is convenient, because I cannot tell you the number of times I have personally made bash syntax mistakes. She runs it, off it goes, and after a while it's all done. And now our IT ops person knows how many cat videos are uploaded to YouTube every second — she has things to do too. Very important.

Now you might say: wait a second, what about these? Homomorphic encryption, selective execution, GPUs, enclaves, and so on — all good things. Nope, not yet. We're going to get there. We have the vision; we want to achieve all of these and enable great domain-specific things, exactly like Juan was saying. We need to enable businesses, organizations, and projects to do great things on our platform — but not yet.
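The retry behavior described here can be sketched generically. This is not Bacalhau's actual implementation — just a minimal illustration of what "failures resolve automatically" means: rerun a transiently failing job with exponential backoff instead of surfacing the error to the user:

```python
import time

def run_with_retries(job, max_attempts=3, base_delay=0.01):
    """Rerun a failed job with exponential backoff; re-raise only when
    every attempt has been exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky():
    """Simulates a job that fails twice (e.g. a node drops out), then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient node failure")
    return "done"

result = run_with_retries(flaky)  # succeeds on the third attempt
```

From the data scientist's point of view, the two failed attempts are invisible — she submits once and gets a result.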
Our goal, exactly as Juan said, is to achieve performance first. Let's make sure jobs run, run efficiently, resolve correctly, and recover from errors — all the blocking and tackling required for the system to even be valuable.

Our roadmap — and that tilde is doing a lot of work; these dates are approximate. Around May, we'd like to launch for public consumption: no incentives; 100 nodes; data smaller than 32 gigabytes, fitting into a single sector, ideally on a single machine; one CID only; public data only; deterministic only; CPU only; no incentive structure; no verification of results. So this is not for general use, but ideally anyone in the world will be able to consume it, use it, and engage. By October — again, tilde doing a lot of work here — approximately 1,000-plus nodes, and we're not going to stop there; we'd love to get to 10,000, 100,000, a million, and so on. Running 10,000 jobs; one petabyte of processing across many files; 99% job success rate; up to 49% malicious nodes supported; DAG execution — directed acyclic graphs, allowing multiple steps to connect together; a primitive reputation system, likely only at a reporting level, not yet injecting choice about whether I want to deploy to a reputable node or provider; and swappable subsystems — swappable verification, swappable execution, swappable reputation. Makes sense?

So you might ask about incentives: why would I choose to run this? Seriously — not yet, I promise. We're going to get there. By incentives I mean tokens, verification, all the things that are required; staking in particular depends on this. How we get there is TBD, but unless we have a well-functioning system, there's no point in figuring the rest out. So let's get to a high-functioning system first. It's not that we're ignoring this; it's just a little bit later. Promise.
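"DAG execution" just means steps declare their dependencies and the scheduler runs them in any valid order. A minimal sketch using Python's standard `graphlib` — the step names are taken from the pipeline earlier in this talk, not from Bacalhau's actual API:

```python
from graphlib import TopologicalSorter

# A pipeline expressed as a DAG: each step maps to the steps it depends on.
pipeline = {
    "ingest":    set(),
    "transform": {"ingest"},
    "validate":  {"transform"},
    "train":     {"validate"},
    "serve":     {"train"},
}

# static_order() yields the steps in an order that respects every dependency,
# and raises CycleError if the graph is not actually acyclic.
order = list(TopologicalSorter(pipeline).static_order())
```

With a real fan-out (say, tile/scale/grayscale all depending on the same ingest step), the independent branches could also run concurrently on different nodes.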
I cannot stress enough that we expect there will be many incentive structures, most of which will not be built by this project. I cannot stress that enough either. Trusted execution environments, GPUs, super-fast resolution times for subnets, scheduling, all those kinds of things — wonderful. We will support all of them, ideally via interfaces and loosely coupled systems.

And by that I mean this: extensibility. Luke and Kai will talk momentarily about the overall architecture, interfaces, and pluggability, and they'll go into this diagram as well. You can see it maps almost one-to-one onto the line items Juan showed earlier. These are core elements that we expect to have many implementations, most of which will not be written by us. It is our job to build clean interfaces and explain to people how to extend the system — for their own incentive structures and other things like that. We provide core primitives that work out of the box but that, ideally, you can swap out. Critically, from day zero, our system must run on these interfaces — there's no cheap-and-cheerful way of skipping interfaces at the start. Even at launch we expect to have that optionality, and, like I said, domain-specific customization over time.

Sounds great — when? Well, I already told you; nothing new here. Again, tilde: very approximate. It's software engineering, so what's the rule of thumb? Double the estimate and add two weeks, something like that. But it's not about the dates, or the idea. Now, the number one way to identify an absolute blowhard is that they put up a slide with a quote from Steve Jobs. So this is not a quote from Steve Jobs — it's me, right? I said it. But it's critically important, and it echoes exactly what Juan said: the disease is thinking the idea matters. I cannot stress this enough.
The idea does not matter at all. This is about execution. UX is the killer feature. UX at every phase is the killer feature — for the data developer, for the SRE, for the storage provider, for the eventual compute provider, the browser, everything. UX is the killer feature. We cannot move forward unless this works liquid-smooth.

But you say: I want it now. How do we move faster? It's all of you. We have some key skills and hires we're missing — right now it's basically three of us doing the coding, so, not so good. We're hiring very fast, obviously. We also have lots of partners in the room right now, or on the stream, and we would love to understand where you want to go and what we can do in the core project to support you. What interfaces, core primitives, and so on can we take off your plate and work on collaboratively, and then figure out where to go from there? That's things like storage, or plugging in schedulers. The time to suggest things to us is now. Even if it's just coming by and looking at the already-published interfaces and documentation, that's enough. But if you can collaborate with us — I know we've talked with folks about how we execute with Wasm, how we do this, how we do that — we would love to talk more. And with that, that is my overview. On time, which I'm very pleased about.