Give it up for the man. I welcome everyone to the second day of ElixirConf. I hope you had a great first day yesterday, and you're going to have a possibly even better second day.

Yesterday I was sitting there watching the talks, and suddenly I heard a car alarm, and I'm like, oh no, someone is stealing my car. Then I realized my car is in Poland. Doesn't make sense. So I went to Jim and I said, hey Jim, there are some weird noises coming from the room on the side. And Jim looked at me and said, all right, we are going to move the whole conference to the Dolphin, just for you. And I'm like, hey, I'm just giving feedback, you don't need to troll me, right? So later I'm there watching other talks, I check my phone, and I see people tweeting that the conference is moving to the Dolphin. And I'm like, wow, Jim takes his trolling really seriously. He's getting 550 people to pretend they are going to the Dolphin, only to troll me. That's commitment, that's dedication right there. So I woke up today, I check my email, and we have an email now saying: we are going to the Dolphin. I'm like, ha ha, very funny, Jim, you finally got me. So I got ready and I went to the Swan, and then I realized he was possibly serious the whole time.

So, joking aside, thank you, Jim, for organizing another great conference. I'm actually really glad it's here in the Dolphin, because I have a two-and-a-half-year-old son, and he doesn't speak anything yet, but he knows how to make dolphin sounds. That's priorities, right? You need to figure out what you want in life.

Coming back to yesterday's talks, I want to echo what Chris said in his keynote: I'm also really happy that we were able to get the Elixir and Phoenix conferences to happen together and not as separate things. We are starting to see great initiatives happening in the Elixir community, like Phoenix and Nerves, and I hope what I'm going to talk about today is going to be one of those initiatives, allowing people to do things that were not so easy to do with Elixir before. And I think we have a lot more to learn and share together than separately. So that's really, really nice. I'm also really glad that Chris gave the opening keynote yesterday, because when I'm giving the opening keynote, I have this hidden responsibility that I need to talk about the community and where we're going. He did that, so now I can talk about code. I'm only going to talk about code, okay?

So the talk today is about two, let's say, abstractions. They're called GenStage and Flow. And I hope those abstractions are going to give us a lot of interesting ways to do data processing that was possible before but not really straightforward. Before we talk about those tools in particular: nothing that I'm going to talk about today is in Elixir itself yet; it lives in a separate project. Everything is under an experimental namespace, and the only reason we put it in that namespace is so that when it eventually becomes part of Elixir, we don't have naming conflicts. At the moment of this talk, they are ready for us to use, try out, and give feedback on.

And before I talk about the tools, I want to talk a little bit about the goals that led us here, okay? So since the beginning of Elixir, I had this vision that I wanted the language to be really good for working with collections, collections of data, okay?
So, everything from simple things like lists and maps to data coming from, you know, a database or any other system. And the goal was pretty much this: I wanted the language to allow us to go from eager to lazy to concurrent and then to distributed. That has been the goal since the beginning, and if you've seen my other keynote presentations, even at the first ElixirConf, I was always hinting at those things.

So we are going to use one example. If you've done data processing, you'll know this is a very cliché example, but it's nice because it's really small and lets us exercise a bunch of properties that we want from data systems. The example is word counting, okay? So if we get something like "roses are red, violets are blue" — imagine this is coming from a file, or it's a string in your code — we want to count the words in this text in a way that returns a map. For example: the word "are" appears twice, and then "blue", "red", "roses" and "violets" each appear once.

So there's one way we can do that, the eager way — and it's called eager because at every step we get a full result back — which is using the Enum module, okay? And using the File module to read from a file. So imagine that I'm reading the poem from some source; it could be this poem or something larger, right? If I say File.read!, that's going to give us the whole text. And what I do next is: okay, I have the whole text, I need to split it into lines. So now I have two different lines, "roses are red" and "violets are blue". Then I'm going to use Enum.flat_map to work on this collection of lines, essentially transforming the collection of lines into a collection of words. So flat_map is going to go over each line, split the line into words, and put them back, kind of, into the original list. We do that for each line until we have all the words. And finally, now that I have a list of words, I can traverse this list of words with a reduce, building a map. For every word, I say: hey, does this word exist in the map? If the word does not exist in the map, I put it in the map with the value of one. If it does exist, I increment the value in the map by one. And that's how we can solve that problem using Enum, right?

That's the eager solution, because at every step we got the full result back: we loaded the whole file into memory, then we split the whole file into lines, then we got all the words, and so on. It's really simple — I think it's the simplest conceptual model we can think of — and it's very efficient for small collections. I'm going to show other models soon, but for small collections, none of them is really going to beat Enum, okay? The issue is that it's inefficient for large collections, with multiple passes if you're doing a bunch of Enum operations. So imagine that instead of reading something small like "roses are red, violets are blue", you're reading a really large file. If the file has 10 gigabytes, you're going to load 10 gigabytes into memory. And then when you call String.split, you're going to go through those 10 gigabytes and build a huge list with all the lines. And when you call Enum.flat_map, you're going to go over each line in that list and build a really big list with all the words in there, okay? And that can potentially take a lot of time.
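Putting that together, the eager pipeline looks something like this (a sketch; "poem.txt" is a stand-in path):

```elixir
# Eager word counting: every step materializes a full result in memory.
File.read!("poem.txt")                     # the whole text as one binary
|> String.split("\n", trim: true)          # one big list of lines
|> Enum.flat_map(&String.split/1)          # one big list of words
|> Enum.reduce(%{}, fn word, acc ->
  Map.update(acc, word, 1, &(&1 + 1))      # insert with 1, or increment by 1
end)
#=> %{"are" => 2, "blue" => 1, "red" => 1, ...}
```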
So I've been benchmarking all those examples, and I literally gave up on waiting for this one to finish for, I think, a two-gigabyte dataset, because it's just trying to build a very huge list of words. It doesn't make sense, okay? So we want to solve that.

And to solve that, we can start to consider laziness. The idea behind laziness is that instead of loading everything into memory, I want a way to express all the work that I want to do, and then do it part by part. How do we get laziness in Elixir? We use the Stream module, and we use streams generally. So, for example, instead of saying File.read!, I'm going to call File.stream!. And that's really interesting, because that's not going to do anything with the file yet. It's going to return a stream, and the stream knows how to open the file and how to get each line of that file without loading it all into memory. So what File.stream! with the line option gives us, when we eventually run this computation, is each line in the file. And then if I say something like Stream.flat_map with String.split, it's going to return another stream that will basically go to the file stream and say: hey, give me one line. It gets that line, splits it into words, and passes the words forward without loading the whole file. It's always going to be line by line, and then word by word. And when the words of the first line have all been processed, I go and get the second line, go word by word, then the third line, word by word, and so on.

So now that I have this stream that knows how to lazily get lines from the file and break them into words, we can call Enum.reduce to go over the whole thing in this piecemeal fashion, folding the data into the map, and we get the final result. The Enum.reduce here is the same as before. So when we are doing things lazily with the Stream module, what we do is always express a bunch of computations: I want to read the file line by line, I want to call Stream.flat_map — if you wanted, you could call Stream.filter to remove some words, and so on. You specify all those computations, and then when you actually want to run them, it goes item by item, never loading the whole thing into memory. So we have less memory usage, because we don't need to load everything into memory anymore; we are constraining how much memory we are actually using. And we pay a small cost in computation — this is a higher abstraction, so there is a very small computational cost to it. But it comes with a very important benefit, because now we can do things we couldn't do before. It allows us to actually work with large files, and we can even work with infinite collections. For example, if our data source is tweets from Twitter, or data from a database that keeps coming in all the time, we can express those computations with streams.
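The lazy version changes only the first two steps (again a sketch with a stand-in path):

```elixir
# Lazy word counting: File.stream! yields the file line by line, and
# Stream.flat_map only composes work — nothing runs until Enum.reduce asks.
File.stream!("poem.txt")
|> Stream.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, acc ->
  Map.update(acc, word, 1, &(&1 + 1))
end)
```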
So when we added streams to the language, we knew that, look, I have this way of expressing computations without actually running them. And we started to think: what if, when we want to run those computations, we used multiple cores for them, since we have the computation stored? What if, when we want to run the data processing, we just spread this computation that we have around, and then try to do everything concurrently? That's the next step, and that's what Flow is about.

So right now I'm going to show a very quick example of Flow. I'm not going to go into detail on what it's doing, just give an idea. First we start with the file stream, because we still want a way of reading the file line by line. So I start with the same File.stream!, and that returns a stream. But now what I do is call Flow.from_enumerable, and what that does for us is take the stream and put it in a separate process. That returns a flow instead of a stream. So now we're moving from streams to flows. And now, instead of Stream.flat_map with String.split, we are going to call Flow.flat_map with String.split. What that does is take that computation and spread it across a bunch of other Elixir processes, okay? And then we do things like Flow.partition, we have a very similar reduce operation to before, where we're putting things into a map, and then we put everything into a final map — and we get the same result. With the difference that now everything is using all of the cores in your machine as much as possible.

So for this particular problem: I told you that for the two-gigabyte file, I didn't wait for the eager solution with Enum to finish — it was just taking too long; after over 10 minutes it had not finished. With streams it was able to execute in 60 seconds, and with Flow it was done in 36 seconds on a machine with two cores. So it got almost twice as fast on a machine with two cores.

So what about Flow — what is happening when we start using Flow? The first thing we need to understand is that we give up ordering and process locality for concurrency. But that's what we want, right? Previously, with the stream, we were going line by line. Now what we are doing is sending lines to different processes that are executing things concurrently, so we no longer have a guarantee that things are going to be executed in order. And for this problem in particular, we do not care about order: no matter the order the words appear in the text, we always get the same word count. And Flow, similar to streams, keeps the laziness aspect, allowing us to work with bounded data, which is finite, as well as unbounded data, which is infinite.

However, it's not magic, okay? There is an overhead when you're using Flow, because now we're sending data between processes. So if you say, okay, I want to use Flow to add all of the numbers in a list, that's not going to be faster — there is an overhead from sending the numbers between processes; it's not going to be worth it. If you want to use Flow, you need a really good amount of data before you start to reap the benefits. Or you need to be either CPU or IO bound — you're contacting external services, or you're doing something that requires a lot of CPU — so that even with a small collection that takes a lot of time, you want to do everything concurrently, and so on.
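Here's the whole Flow pipeline for the word count (a sketch; at the time of this talk these modules live under the experimental namespace, so you would `alias Experimental.Flow` first):

```elixir
# Concurrent word counting with Flow.
File.stream!("poem.txt")
|> Flow.from_enumerable()                    # the stream becomes a producer stage
|> Flow.flat_map(&String.split/1)            # mapper stages split lines into words
|> Flow.partition()                          # route equal words to the same stage
|> Flow.reduce(fn -> %{} end, fn word, acc ->
  Map.update(acc, word, 1, &(&1 + 1))        # each reducer stage builds its own map
end)
|> Enum.into(%{})                            # merge the disjoint maps at the end
```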
But one thing that's very interesting about Flow is that the implementation is only about a thousand lines of code. It's a really small implementation, and we're going to see that we can do a lot of interesting things with it. Even more interesting is that we have more lines of documentation than we have lines of code. Which means it was not really a hard technical problem to solve. What is really interesting about Flow is that it introduces a new domain, a new way we need to think about data when we want to leverage concurrency, and possibly distribution in the future. And that's worth talking about.

So those are the topics for the talk; those are the two things I want to explore. First: how is Flow implemented, since we were able to get a bunch of interesting properties out of it in about a thousand lines of code? And second, given how much documentation we have written about it: what do we need to learn to really reason and think about flows, okay? And we kind of know — I had some spoilers — you know the answer to the first question. How is Flow implemented? It's implemented with GenStage, okay? So let's talk about those abstractions per se now.

So starting with GenStage: what is GenStage? Similar to GenServer and GenEvent, it is a new behaviour — it's a way we can model processes to follow one particular behaviour. And the responsibility, the goal behind GenStage, is that we can exchange data between stages transparently, so that as developers, when we are coding a stage, we are not really worried about how the data exchange takes place, okay? We just code the stage — I'm going to show some examples. And it does so with back pressure. We have three kinds of stages, which are producers, consumers, and producer-consumers.

So here's an example of a GenStage pipeline we could have. We start there at the producer, and the producer can send data to some other stage, which can send data to some other stage, and so on and so on. If I'm only producing data — I'm not receiving data from anywhere — I am a producer. If I'm receiving data from somewhere and sending it elsewhere, I'm a producer-consumer, until you get to the end of the pipeline, where you have a consumer, because it only receives data. And when we start to see those pipelines, we start to ask: hey, what happens if one stage there in the middle, for some reason, is slower than the others? We cannot allow the producer to continue producing data, because eventually it's going to overflow that stage. If one stage cannot be fast enough, it needs a way to say: hey, stop sending me data, I cannot process everything in time. And that's the back pressure mechanism. We want the stage to be able to say: okay, I've had enough, let me work with what I have right now, and later I'll tell you to send me more data.

And the way we do this back pressure mechanism — let's have just two stages here, side by side — is that we do it demand-driven. The first thing that happens is that the consumer subscribes to the producer, and at the moment that subscription happens, the producer cannot send data to the consumer. It cannot send data yet.
First, the consumer needs to ask — that's the demand. It needs to ask for an amount of data: okay, give me 10. And now the producer can send up to 10 events, okay? But it cannot exceed that demand. So the producer is eventually going to send 10 events, and what is really nice is that the consumer can ask for more data at any time it wants, and the producer can send more data at any time it wants as well, okay? And if we now go back to the pipeline with this demand-driven mechanism in mind, what happens with multiple stages is: first C asks B for 10 events, and then B asks A for 10 events, and then we have the data going from producers to consumers.

So, this whole demand-driven aspect: stages are Elixir processes — there's nothing more to it — and this whole business of sending demand and exchanging events is regular Elixir messages, messages that processes exchange between them. The demand-driven aspect is a message contract; it's only a message contract. And what is really cool about it, as we saw in this slide, is that the back pressure goes all the way to the producer. So imagine your producer is getting data from an external system, from RabbitMQ or Apache Kafka. Even if we have a pipeline, the demand goes all the way to the producer, so the producer knows exactly how much data it needs to get from the external system. It never gets more data than it is supposed to; you're never overflowing your system, okay? So: it is a message contract, and it pushes the whole back pressure to your system boundary. And GenStage is only one implementation of this contract, and the contract is documented. So if you say, I don't actually like GenStage and I want to have my own gen-thing that uses this message contract as well, that's going to work just fine.

But assuming you all like GenStage, let's see a GenStage example. It's a very simple example, but it's going to allow us to exercise a bunch of the things I just talked about. We are going to have two stages. The producer is a counter: it starts from an initial value, which can be zero, and every time someone asks the stage for events, it counts those events. So it's going to emit zero, one, two, three, four, five, and as it is asked for more data, it's going to be six, seven, eight, and so on. And the consumer is going to be a printer: it's going to sleep for a while, and then it's going to print all the events it has received so far.

So here's how we implement the producer. We define a module and — similar to implementing a GenServer; in fact, GenStages are implemented on top of GenServer, okay? — in the producer we say `use GenStage` instead of saying something like `use GenServer`. And we need to define two callbacks. The first one is the init callback, which sets the initial state — for example, if we want to start counting from zero, that counter will be zero — and it needs to return which kind of stage this is. Is this a producer? Is this a consumer? Is this a producer-consumer? In this case it is a producer, so we return: okay, on init, I am a producer, and this is my state.
And then we need to define the handle_demand callback. handle_demand is called every time one of our consumers asks for data, and the demand tells you exactly how much data you need to produce. So in this case, handle_demand receives the demand — how much data it was asked for — and it receives the state, which is the state of the stage. And then we say: okay, if the counter is zero and someone asks for five items, I need to emit zero, one, two, three, four, okay? That's what this code is doing. And then we return :noreply, the events we want to send now, and our new state, which is the counter plus the demand, okay?

So let's see how this works with a very quick simulation. Assume we start with the counter at zero — that's our initial state — and imagine that someone asks for 10 items. What handle_demand returns is :noreply with the events zero, one, two, three, up to nine, and it says that the new state is now 10. And then, with the state at 10, if the consumer asks for five events, what we return is :noreply with the events 10, 11, 12, 13, 14, and the new state of 15. And so on, and so on. And you can start to get an idea of how we can use this to get data from an external system: if someone asks for 10 items, I can go to the external system and say, hey, give me 10 items, because that's what I'm going to process internally now. Okay, and that's all handle_demand is.

The consumer is similar, but slightly different. We are going to define a module and also say `use GenStage`, except that now, in the init callback, instead of returning :producer as the first element, we say that this is a consumer. And this particular consumer implementation doesn't really care about its internal state, so I'm going to say the state does not matter — it can be whatever. And instead of implementing handle_demand, it implements handle_events. handle_events receives a list of events, and this list can be of any size — because, for example, let's say you ask the producer for 10 events, but the producer only has three. So you're not necessarily receiving lists of the same size you asked for; sometimes you get less, but it's always going to be a list of events. It also says which producer we are receiving from, and we get the state. And what we do in our handle_events implementation is sleep for a second, to simulate work — to show what happens if the consumer cannot be as fast as the producer, for example. Then we print all the events, and we return :noreply, no events to dispatch, and the state, which we don't care about.
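In code, those two stages look roughly like this (a sketch along the lines of the example in the GenStage announcement; at the time of the talk you would `alias Experimental.GenStage` first):

```elixir
defmodule Counter do
  use GenStage

  # The state is the next number to emit.
  def init(initial) do
    {:producer, initial}
  end

  # Called whenever a consumer asks for `demand` more events.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  def init(:ok) do
    {:consumer, :the_state_does_not_matter}
  end

  # Receives a list of events — possibly fewer than asked for, never more.
  def handle_events(events, _from, state) do
    Process.sleep(1000)       # simulate a slow consumer
    IO.inspect(events)
    {:noreply, [], state}     # consumers never emit events
  end
end
```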
So after we define those modules, we can wire everything together. I start the producer — GenStage.start_link with the initial state of zero, so I'm going to start counting from zero — and I start the printer, which should subscribe to the producer and start receiving things, okay? Now that I have started those two stages — they are two processes now, they are running — I can say: I want to subscribe the printer, which is the consumer, to the counter, which is the producer.

And at the moment we do that, we wait for a second, and we see zero, one, two, three, up to 499 printed — exactly 500 events. Then we wait another second — remember we put the process to sleep in our consumer — and we see 500 to 999, another 500 events, and so on, and so on, okay?

So you may be wondering now: where did this 500 come from? Why are we seeing batches of exactly 500 events? The reason is that when we subscribe — when we call sync_subscribe — we can pass two options. We can pass max_demand, which is the maximum amount of events that the consumer asks the producer for, and the default is 1000. And min_demand is the value that, when reached, makes the consumer ask for more items. So what happens is that we ask for 1000 events and start processing them, and when we get to 500, we say: okay, I need to ask for 500 more. And that's how it goes.

And the reason why it's important to have both max_demand and min_demand is that it allows the producer and the consumer to be working at the same time. Let's do another very quick simulation, reducing those numbers so we can understand a little better. Let's say we have a max_demand of 10 and a min_demand of five, okay — it's still half. What happens is that the consumer asks for 10 events, and the consumer receives those 10 events: in our counter case, we ask for 10, and the producer goes, okay, zero, one, two, three, up to nine, and sends that to the consumer. So the consumer starts processing them, but in a way that it only processes five of those 10 items it received. The reason the consumer does that is because it knows that once it has processed five of them, it needs to ask for more. So the consumer knows: okay, I got 10, but as soon as I get through five, I need to ask for more. So instead of processing the 10 at once, I process five out of 10, then I ask for five more, then I go and process the remaining five, and then ask for five again.

So what happens is that when the consumer has processed five after the initial request, it tells the producer: hey, give me five more. And the producer starts producing those five while the consumer is still working on its remaining five, and they will be exchanging data and demand all the time, in a way that both of them always have work to do. That's what we want: we don't want the producer to send the data and then sit waiting. So this is the ideal scenario; this is what we want.

Now let's see what would happen if we set the min_demand to zero instead of five. What would happen is: the consumer asks for 10 items, the consumer receives the 10 items, the consumer processes all 10 items, and then the consumer asks for 10 more — and until the producer gives it 10 more, it doesn't have anything to do. It's going to be waiting, wasting time. The consumer is waiting, and that's not what we want.
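The wiring itself is short (a sketch, using the smaller demand values from the simulation; the defaults are max_demand: 1000 and min_demand: 500):

```elixir
# Start the producer counting from 0, and the consumer.
{:ok, counter} = GenStage.start_link(Counter, 0)
{:ok, printer} = GenStage.start_link(Printer, :ok)

# Subscribe the consumer to the producer. With max_demand: 10 and
# min_demand: 5, the printer asks for 10 events and re-asks after
# processing batches of 5, so producer and consumer overlap their work.
GenStage.sync_subscribe(printer, to: counter, max_demand: 10, min_demand: 5)
```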
So that's why we have max_demand and min_demand. The default value for max_demand is 1,000, and the default value for min_demand is 500 — and that's why we saw batches of 500.

We're almost done with GenStage. There is one last thing I want to talk about, which is the idea of GenStage dispatchers. In those two examples we saw, with handle_demand and handle_events, all we had to do to make it work was subscribe the consumer to the producer, and then the data started to flow: we started to produce the data, we started to receive the data in the consumer. We didn't care about who was sending data to whom, or what was actually managing this demand, and things like that — we didn't have to worry about any of those details. Internally, the thing that actually takes care of those details is called a dispatcher, and the dispatcher is per producer, because the producer is the one that receives the demand and dispatches the events. So we have dispatchers, they are per producer, and they are the ones that effectively receive the demand and send the events to the consumers. And what is really nice about dispatchers is that they allow a producer to dispatch to multiple consumers at once, in different ways.

The default — the example we wrote was using it — is something called the demand dispatcher. What the demand dispatcher does is send the next events to the consumer with the biggest demand, the assumption being that if we have multiple consumers and one of them has a bigger demand than the others, it's because it has more free resources available than the others. So imagine that I am dispatching the events one, two, three, four, five. At the beginning, we send one to the first consumer, two to the second, three to the third, and four to the fourth. Imagine that after I send those four initial events, the second one was the fastest: it processed two really fast and asked for more. So when we get five, because the second consumer was the first one to ask and now has the biggest demand, we send five to it. So we're kind of doing load balancing toward whichever consumer supposedly has more resources available, okay? That's the default one.

But we can do a lot of interesting stuff. We have something like the broadcast dispatcher, which guarantees that every event that goes through the producer reaches all consumers. So now we have one, two, three, and one, two, three gets to every consumer. And then we have other interesting stuff, like the partition dispatcher. The partition dispatcher guarantees that we route an event to a particular consumer based on some information that is stored in the event. So in this case, imagine that I want to dispatch to a consumer based on the remainder of the division of the event by four. If we do this, then one, five, nine, 13 — all the events where the remainder is one — go to one process. Two, six, 10 go to the consumer where the remainder of the division is two. And so on, okay? So we can leverage a property of the event to guarantee that all events sharing that property are delivered to one particular consumer, which will process all of them. And that function there, like the remainder, is custom — we specify it to be whatever we want.
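The producer picks its dispatcher in init. A hedged sketch of the three options (the partition hash function returns the event together with the partition it should go to):

```elixir
defmodule SomeProducer do
  use GenStage

  def init(state) do
    # Default: load-balance toward the consumer with the biggest demand.
    {:producer, state, dispatcher: GenStage.DemandDispatcher}

    # Or deliver every event to every consumer:
    # {:producer, state, dispatcher: GenStage.BroadcastDispatcher}

    # Or route by a property of the event, e.g. its remainder by 4:
    # {:producer, state,
    #  dispatcher: {GenStage.PartitionDispatcher,
    #               partitions: 0..3,
    #               hash: fn event -> {event, rem(event, 4)} end}}
  end

  def handle_demand(_demand, state), do: {:noreply, [], state}
end
```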
So that's GenStage; those are the things I had to talk about. It's a new behaviour. It allows us to exchange data between stages with back pressure. We saw how the demand aspect works and how we can configure it, and we talked a little bit about dispatching and the different options we have.

And we've been working on this for a long time. We had previous abstractions in the past trying to solve this problem, but they were not quite right. When we implemented GenStage, we were like: you know, this feels right, this is what was needed. We already had a bunch of goals, a bunch of ideas, that we wanted to validate and have GenStage address, okay? So what are those goals? How did we validate GenStage?

One of the things we wanted GenStage to do is support generic producers. I want to be able to implement a GenStage producer that gets data from RabbitMQ, a database, Apache Kafka, the Twitter firehose or something like that — it can connect to any external source, okay, and get the data into your system with the whole demand-driven back-pressure aspect. And we wrote a couple of examples. We announced GenStage officially on the elixir-lang website, I think a month or a little more ago, and we have some of those examples there. We have a link to a talk where I implement a GenStage producer that reads data from Postgres, for example. And we already start to see examples coming from the community, library authors integrating GenStage to get data in. So that was one of the goals, and we were able to validate it.

Another goal that we explored in the article — you can check the article announcing GenStage for more information — is that we can use GenStage to replace GenEvent. GenEvent is something we have in Elixir where we can send events to a process, and that process guarantees that different handlers execute. It has a couple of limitations, and GenStage solves all of those limitations — remember the broadcast dispatcher? We can use a GenStage with a broadcast dispatcher to guarantee that we have a process that sends events to whoever is interested; we broadcast events to whoever cares. The announcement explores these other goals as well.

We also introduced something called the dynamic supervisor. It's part of the GenStage library, and it's a supervisor that was designed for supervising children dynamically. So if you don't have your tree specified up front and you want to add and remove children dynamically, the dynamic supervisor is really neat — and the dynamic supervisor is also a GenStage, which means it has all of these back-pressure mechanisms we've been talking about. So we had those goals, and we were able to implement them. And here's the article — there's a link at the bottom where you can read more.

If we have time, I can even... so, the first time I prepared this talk and rehearsed it, it took one hour and 20 minutes. I had to trim some stuff, and some of that stuff was the GenEvent examples. If we have time at the end, we can even do that; we'll see how it goes. But that's it about GenStage, and it answers our first question: how is Flow implemented?
Flow is implemented on top of GenStage. Remember when I said that when you use flat_map, it starts a bunch of processes? All of those processes that flat_map starts are GenStages — and the core is only about 80 lines of code. All the other lines of code we have in Flow are about setting up topologies, and we are going to see that next, which is exactly our next question: how are we going to reason about flows? If Flow is a new domain, we need to understand this domain in order to use this new tool. So that's the second part of the talk; let's talk about Flow.

Going back to our original problem: we have word counting, "roses are red, violets are blue", and we want to convert that into a map. When we saw our solution with Flow, every step that we had — Flow.from_enumerable, Flow.flat_map — all of them returned flows, because, similar to streams, flows are lazy. When you're expressing all those computations, it doesn't run them yet; we return something that is building the topology, and only when we are interested in the data does it start to execute. So it's lazy like streams, but concurrent. Only at the end, when we call something like Enum.into, do we actually have the data flowing through.

So what I want to do now is go over this piece of code — word counting with Flow — line by line, and explain exactly what each line is doing and how it affects our topology. Let's start from the beginning. The first line at the very top is File.stream!, and that is a stream that knows how to get entries from a file line by line. Then we call Flow.from_enumerable, and what that does is take the stream and put it into a stage; it makes it the producer stage. So now we have a producer, by calling Flow.from_enumerable — a process that knows how to emit things line by line. The next thing we want to do is split those lines into words, right? So we call Flow.flat_map with String.split. And when we do that, we now have a producer plus a bunch of other stages that subscribe to the producer, receive the lines, break those lines into words, and send those words forward, okay? And the moment we start spawning all those stages is when we start to leverage concurrency.

So let's see how it works. We have a producer, which is Flow.from_enumerable. By default — remember — it uses the demand dispatcher, which means it's going to send each line to whichever stage below has the biggest demand. So this is our topology so far: we have a producer, and it gets the first line and sends it to a stage. We have "roses are red", and we send it, for example, to stage one, and that stage breaks it into the words "roses", "are", "red", which it sends on as events. And then the line "violets are blue" can go to another stage, which does the same thing: it breaks the line into words and sends those words somewhere else — "violets", "are", "blue", right? So now we have a producer and a bunch of processes doing all the work, all working at the same time. And that's the beauty behind operations like flat_map, map and filter.
They can all work independently. They don't need to coordinate. They don't even care about what the previous line was, because all they care about is that event — that line right now — and breaking that line right now into words. That's why we say problems like breaking a line into words are embarrassingly parallel: you can just do it. We don't care about previous lines, we don't care about who else is doing the work, we just do the work. Okay, so this is what we have: Flow.from_enumerable, and then a flat_map.

And the next thing we did in the original example was to call Flow.partition. So what does Flow.partition do? The best way to explain it is to not call it and see what happens, okay? So we have Flow.flat_map, and let's call reduce directly, as we did with streams and Enum at the beginning. So we have Flow.flat_map, and then we call Flow.reduce — and there is a very important change. You could see that before, all the stages at the bottom had an arrow saying: when I break the line into words, I send those words forward. The moment I call something like Flow.reduce, that arrow becomes a cycle, because when you call reduce — what does reduce do? Reduce needs to build a state. For example, we want to build a map, right? So we cannot just split the line and send the words somewhere; we split the line into words and put those words into a map, okay? So now we're not sending the data anywhere — we are accumulating the data. That's what reduce does. We care about everything we have seen, because we need to build this map that has the count of all the words seen so far. So we are not sending the data anywhere at this point, okay?

So how does it work now? We have "roses are red". We send it to a stage, for example; that stage breaks it into words, and it builds a map now, because we have reduce there. And Flow.reduce runs per stage, which means that if I get "violets are blue" and send it to stage four, it breaks that into words and builds another map, okay? So now we have all those stages doing reduce, and they are counting the words on their own. But there is one big problem here, okay? The problem is that the word "are" appears in two different stages, which means that if we want to count all the words in the document, we would need to get the map — the state — of every stage and then merge them together, and merging the maps of all those stages needs to run in a single process, because we want to put everything together. So that's not good, because now the stages are doing the work, but they have information spread around, which means that if we actually want to know how many times the word "are" appears in the text, we need to go to every stage and ask it. That's not good.

So what we want is a guarantee that when any stage sees a word, they all send it to the same process. If any of those stages sees the word "red", it sends it to one particular process, and they all agree which process that's going to be. If they see the word "are", they all send it to that other particular stage, which is going to count the word "are". And the way we do that is with partitioning. That's why we need partition: partitioning guarantees that we use a property we know about the word to ensure it always goes to the same stage. So if a stage sees "are", it's guaranteed: hey, that goes there.
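The routing works by hashing the event — by default with Erlang's term hash, as comes up again in the Q&A — so the same word always maps to the same stage index. A quick illustration:

```elixir
# :erlang.phash2(term, range) deterministically maps a term to an
# integer in 0..range-1, so "are" always lands on the same one of
# four reducer stages, no matter which mapper stage hashes it.
:erlang.phash2("are", 4)
:erlang.phash2("are", 4)    # same value as above, every time
:erlang.phash2("roses", 4)  # stable too, possibly a different stage
```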
So let's start again. I have File.stream!, then Flow.from_enumerable, which puts the stream into a producer, and then we call flat_map. And remember that before, after flat_map, we called reduce, and that changed how flat_map works. Now we're going to do it differently: after flat_map, we're going to call partition. What partition does is create a whole bunch of other stages, okay? And the stages here in the middle all connect to the stages at the bottom, okay? They're all connected between them. And now, when we call reduce, what happens? This is really nice. We have the stages that I'm labeling with numbers — stage one, two, three, and four. They'll be doing flat_map: they'll be splitting the lines into words, and then they will route those words, okay, to particular reduce stages at the bottom, which will be counting them.

So let's run our simulation again. We have "roses are red". We send it to a flat_map stage, which breaks it into words. And then we say: oh, "roses" goes here, to stage A. And stage A now has "roses" with a count of one. Then we say "are" should go to stage three, which now has "are" with a count of one. And "red" could go to stage three as well, okay, which then has "red" with a count of one. And now, when "violets are blue" goes to another flat_map stage, it reaches the same conclusions. It says: look, I have "are" here, and I know that "are" needs to go to stage three. So we can see that happening. And now each reducing stage has its own map, and those maps are disjoint — the keys are disjoint, right? They don't have to share information anymore. If I want to know how many times the word "are" has appeared so far, I go to one particular stage, and that's it.

So at the end of the day, if you look at the topology we have right here: we have a producer, and we learned that the producer uses the demand dispatcher, because we don't care who splits lines into words — that's embarrassingly parallel; we can just split lines into words, we don't care who does it, okay? And those stages performing operations like flat_map, map and filter, which I said are embarrassingly parallel — those are our mapper stages, okay? And then we have a bunch of stages at the bottom — those are the reducer stages — and they are all connected through the partition dispatcher. So if you have ever heard about using map-reduce topologies for solving data processing problems, that's it: we have stages doing mapping, we have stages doing reducing, and we have coded that with Flow.

So now we have defined our topology, okay? What happens is that we have all those stages running reduce, counting the words, and at some point the text is going to finish, right? If I'm processing a 10-gigabyte file, at some point there are no more lines, okay? And at that point, all those stages at the bottom have their maps. And when we call Enum.into, we are saying: okay, now that you have all the maps, if we want a fully unified view, we can just merge them together — and without worrying about adding things up, without adding the "are"s from different stages together, because the keys are disjoint, okay?
So if you want a unified view at the end, we can do that. And if you don't want it, you can keep the state in those different reduce processes, and that's it. So reduce is going to collect all the data into maps. Every time we call reduce, this is important: reduce is aggregation. We need to see all the events that have happened. That's why it doesn't emit events — it's building its internal state; it's putting words into a map. So reduce collects all the data into maps, and when it's done, we can stream those maps to the original process, and Enum.into collects them. And that's how Flow works. You can use it for data processing, text processing, anything you want — you can use it today, and it might work just fine, okay?

But there's one big question that comes up here. I said at the beginning that Flow can be used for bounded data, which is finite data, like a 10-gigabyte file, but that it can also be used for infinite data — like, I'm connecting to a streaming API, or I'm connecting to RabbitMQ, and that data never finishes, okay? And there is a conflict with what I just said, because if reduce collects data until the data finishes, what happens when the data never finishes? Are you collecting data forever? Okay, we need a way to reason about that.

So, this is just a note: Flow also has the idea of windows and triggers. For example, if you're getting data from Twitter, counting all of the words ever said on Twitter — that's way too much, right? We don't want that; we sometimes want to break it up: I want to see what people are talking about in a particular hour, say. So we are windowing the data, okay? And Flow has support for that. So, if reduce runs until all the data is processed, what happens when the data is infinite? Windows and triggers allow us to solve exactly that problem. We can break our data into windows, and we can have triggers — time triggers, for example — that say: every five minutes, I want to get all the data collected so far, emit it here, and then start again. So there are a bunch of different ways to do this. And since we already wrote a thousand-odd lines of documentation, you can go read that — I'm not going to explore windows and triggers in this talk in particular, but we do have a lot of documentation, and if you have any feedback on the documentation, on using windows and triggers, we'd love to hear it, okay?
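As a hedged sketch of what that can look like (check the Flow docs for the exact options in your version; `tweets` here stands in for some infinite enumerable of text):

```elixir
# Count words per five-minute window instead of forever.
window = Flow.Window.periodic(5, :minute)

tweets
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition(window: window)            # reducers accumulate per window
|> Flow.reduce(fn -> %{} end, fn word, acc ->
  Map.update(acc, word, 1, &(&1 + 1))
end)
```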
So that's it; that's Flow. Let's go back to where we started, okay? We have this goal in Elixir that we are going to have tools that allow developers to go from eager to lazy to concurrent to distributed. And for this particular word counting problem, we started with: I want to read a file, load it all into memory, split it into lines, split each line into words using Enum.flat_map with String.split, and then reduce. Then we made it lazy, which allows us to work with large files, but without leveraging any concurrency. And then, with a couple of new concepts — like the whole demand aspect that we have with GenStage — we were able to write a flow.

So we learned some interesting concepts here. There's the whole GenStage concept of how data is exchanged; that's one of them. We learned about partitioning: every time we need to route data based on one particular property, we can use partitions for that. And we learned the difference between mapper stages, which are the ones doing flat_map, map and filter, and reducer stages, which are stages that need to aggregate data, okay. And one of the nice things about Flow is that if you call things in an order that's not valid — imagine you're trying it out and you're not sure where you should do the partitioning, so you mix up the mapping with the reducing — Flow is going to let you know about that, okay.

So we rewrote it in Flow. And as I said, in my benchmark — my computer with two cores, two gigabytes of data — eager didn't finish, stream took 60 seconds, and Flow took 36 seconds, okay. So we got a very good improvement from simply rewriting that with Flow. And what is really nice is that Flow provides, as we saw, the map and reduce operations; we saw partitions; and we can actually merge flows — if you have two flows of data and you want to treat them as one, you can do that. You can do joins of flows, similar to database inner joins and left outer joins: imagine that I have data coming from one side and data coming from the other side, and you have a key on which you can join those flows together, exactly like you do in a database. We have those operations there as well. As we saw, we can configure the batch size, because Flow is implemented on top of GenStage, which means we can configure how much data we send at every step of the topology we saw, okay. And, as I just mentioned, we have support for windowing, triggers, watermarks and so on.

And then there's this question, right? We have this path that we set up — eager, lazy, concurrent, distributed — and I talked about concurrent today. So what about distributed? And this is a very interesting question, because when you look at Flow, the API has feature parity with big data frameworks like Apache Spark. There are parts, like the whole data windowing aspect, where we do more than what Apache Spark does today. So that's very interesting, all right. However, on the other hand, Flow so far is only concurrent. There is no distribution and no execution guarantees. You're going to do everything on a single node, and if something goes wrong, by default you lose all the data. So imagine that you have something that needs to process for one hour: if that goes down, you're going to lose all the work, unless you add explicit checkpointing. We don't have built-in checkpointing, but you could add it, okay.

So, given that we still have this road to distributed ahead of us, does it even make sense to talk about concurrent? We have a place to go still, right? Is what we have so far actually useful? And as a matter of fact, it's very useful. There are a couple of papers — I have the references here at the bottom — doing studies, running analyses on a bunch of those big data processing systems. And one of them says: small inputs are common in practice; 40 to 80% of Cloudera customers' MapReduce jobs and 70% of jobs in a Facebook trace have less than one gigabyte of input. And then there's another paper that expanded on this — this paper was comparing different big data technologies — and their conclusion was: for between 40 and 80% of the jobs submitted to MapReduce systems, which is the range we just saw, you'd be better off just running them on a single machine. Okay, so for a huge majority of cases, that's exactly what you need. And a lot of the solutions out there only ship with the distributed mode.
They don't allow you to run on a single machine, or they make it harder than it should be, right? So when we talk about distributed, the single machine matters a lot. So you should all try it; you should experiment with Flow, GenStage and so on. And what is really exciting about this is that the gap between concurrent and distributed in Elixir is really, really small, because the abstraction we use for concurrency and the abstraction we use for distribution are the same: processes. And today you can take a GenStage producer, run it on one node, have a consumer on another node, and that's going to work. We could actually have a flow where the producer runs on some other node, and that's going to work as well, okay? So it's really straightforward to start exploring the more distributed aspects. And coincidentally, the next big thing we need to tackle on this whole distribution front is the durability concern that Chris actually talked about yesterday: in order to have more robust and productive distribution, we need a durable pub/sub. That's actually what Phoenix plans to tackle next. What a coincidence. So those are our plans, okay?

So that's what I had to share about GenStage and Flow. Before I go, I have a couple of thank-yous. First, library authors. We're really starting to see some library authors giving GenStage a try. If anyone here has a library, or is planning to work on a library that connects to an external data system, try out GenStage and try to get data from those external systems using GenStage. As I said, it can be anything, from getting data from Twitter over the HTTP API — you can chunk that and use GenStage for it, and there's already someone working on that — to any other system I mentioned: RabbitMQ, databases and so on. So, build your own producers, library authors. And if you don't want to depend on GenStage right now — if you don't want to impose that dependency on your users — there's a small tip: declare the GenStage behaviour and implement its callbacks instead of hard-coding the dependency, and you can still use all the GenStage features. As I said, in the GenStage announcement we did on our website, we have examples of doing that, and there's the talk I gave at the London meetup. Okay, so check that out.

I have to thank the inspirations. A lot of the work that we have here comes from existing papers and existing projects, and I want to mention three in particular. Akka Streams: the whole back-pressure contract comes from the Reactive Streams and Akka Streams initiative. A lot of the MapReduce API that we see in Flow came from Apache Spark. And the whole data windowing model — it's really interesting — comes from a relatively new project called Apache Beam. So if you go check the data windowing stuff, you're going to learn the concepts that I didn't explore here today — windows, triggers, watermarks — they all come from Apache Beam. So a huge thank you to them for all their work.

I want to thank the Elixir team as well. This is the work of a group; most of all, I'm the messenger. And I have to especially thank James Fish — he's not here today, but before GenStage there was something called GenRouter, and he's the one who did the prototype for that. And he spends a lot of time pointing out the bugs in my code.
Which is really, really helpful. I don't know how he does it. I send a commit, and then he pings me on IRC and says: look, that line is going to have a race condition when you run in a distributed setup and the position of the moon is above 30 degrees. So you have to fix that. And I'm like, thank you. And we iron all this stuff out. So he has been really helpful for everything OTP that we're doing; there is a lot of input from James. And a lot of thanks also to Eric and, apparently, my husband Chris, for the therapy sessions. A lot of the time, you know, I'm designing APIs or making design decisions, and I would call Eric and Chris and say: oh, I don't know what to do. I would explain the whole problem to them, they would listen, and they would give very helpful suggestions as well. And finally, I want to thank Plataformatec. Not only GenStage and Flow — all of Elixir was built and designed at Plataformatec. I'm really glad that this year we are starting to see more of our team at Plataformatec coming to ElixirConf and being an even more active part of the community. There are a couple of us around, so if you see any of us and you are interested in any kind of Elixir coaching, design review, custom development — or if you are interested in playing with GenStage and Flow and you want to validate projects, validate ideas — get in touch with us; we are really interested in exploring all of that together. So that's it. Thank you very much.

Thanks for that. This looks like a great addition to the language. And when you were talking about that back pressure, I imagined it would be really useful for an HTTP server, because you could use it to not accept more connections than the server can actually handle. So did you play with that?

Yeah, so the question is whether the whole back-pressure mechanism could also be used for HTTP servers. And it's true, but one of the things that is important to remember is that the demand-driven aspect is only one of the ways of doing back pressure. So, talking about back pressure between servers in particular, I'm not necessarily sure the GenStage way of doing back pressure is the best way to do it for a server, for example. It's something that I have not explored, but given that we have tools like Cowboy that already take care of that today with their own back-pressure mechanisms — even the sockets have their own back-pressure mechanisms — it's not a particular goal that we have. If someone wants to explore that, please go ahead.

Cool, thank you.

Are there ways in GenStage to ensure that all events produced are eventually going to be consumed by somebody?

Consumed by?

By anything, eventually. The app I'm thinking of, for example: we do accounting logging, and I'd like to consider GenStage as a way to move that out of the main workflow. I don't really care if it's quick, but I do care if events ever get dropped.

Okay. So imagine that you're processing data, and then something goes wrong. Because stages are processes, you can use everything that Elixir has in terms of process management to know that something went wrong: you can have a supervisor that maybe restarts it from the beginning, or you can monitor it and try to do something else. So you have the tools there; there's nothing built in.
So regarding checkpointing, for example: imagine that you're receiving data, and then you're like, okay, I've processed 100,000 of those, and I want to write somewhere that I processed up to this point, so that if I need to restart, I can continue from here. We don't have anything that does that by default, but you could add it. So we have the building blocks; there's nothing built in that just says, hey, I want checkpointing, the way we see some other tools doing, but it's something we'll eventually get to.

Thank you. You're welcome.

I have a question about how the partition works, because it takes no arguments. Can it take a function for arbitrary partitioning?

Oh, that's a very good question. I didn't explore partitioning a lot in the word counting case. In the word counting case, what we did is hash the word using the Erlang term hash — because it's an algorithm we have in Erlang, every time we see a word, it's guaranteed to go to one particular stage, because it always has the same hash. When I called Flow.partition, by default it hashes the whole event, but you can pass anything you want. So imagine that your events are maps: you can say, I want to partition hashing this key. But you can also pass a custom function and do anything you want. So imagine, for example, that the way you want to partition the data is that you have five categories in your system, and you want to guarantee that category X is always sent to one particular consumer. You can do that: you implement your own partitioning function, and that guarantees it always goes to the same place — there's a small sketch of this below.

Thank you. You're welcome.

Okay, that was — we only have time for that question. So please give a resounding round of applause for José. Thank you.
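A hedged sketch of those partitioning options (the :key and :hash options as documented in Flow; `flow` is some existing flow, and `category_to_index/1` is a hypothetical helper):

```elixir
# Hash only a field of the event instead of the whole event:
Flow.partition(flow, key: {:key, :category})

# Or take full control with a custom hash function that returns
# the event and the index of the partition it should go to:
Flow.partition(flow,
  stages: 5,
  hash: fn event -> {event, category_to_index(event.category)} end
)
```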