Hello and a warm welcome to everyone joining this session today. We have Michael Pilquist with us to share his experience of building functional streams for Scala. He'll be talking about FS2 chunks: the chunk data structure that powers the functional streams for Scala library. Michael will take us through the design of chunk, and in particular he'll be explaining the design constraints that guided its evolution. I'm really excited for this talk. Without any further ado, let's head over to Michael.

Thank you so much, I'm happy to be here. So yeah, I'm here today to talk about the FS2 library, and particularly the chunk data type. For the folks that don't know me, I'm Michael Pilquist. I've been with Comcast for about 17 years, where I've had a career that started with C and C++, went through a Java enterprise phase, and then for about the last decade I've been heavily involved in functional programming. I maintain a bunch of functional libraries in the open source ecosystem, in particular the FS2 library, as well as things like scodec and a couple of others. Normally when I give a talk on FS2, I get into the internals of the streaming approach or some of the novel ways we do concurrency. But what I really like about today's talk is that it shows there are still really interesting problems to solve with simple data types, data types that don't attempt to boil the ocean with their capabilities. So with FS2, we have this chunk data type, and chunk is a collection. If we look at the Scaladoc of the chunk data type, we find a definition that looks somewhat like this: a chunk is an immutable, strict, finite sequence of values that supports efficient index-based random access of elements. Okay. The problem is that in Scala we already have a data type that meets these constraints: the persistent vector data type.
Persistent vector is immutable; it's strict, it doesn't delay computation in any way; it's finite, it doesn't allow infinite sequences to be represented; and it supports effectively constant-time indexing. So maybe the first question to ask about the FS2 chunk type is why it exists at all. Why aren't we just using the vector type in the library? To understand that, we have to look at some of the APIs that get built with the FS2 stream data type. Here I have two example capability traits. Maybe there's a trait that represents reading bytes from a network socket in some effect F. This socket has a reads operation that returns a byte stream, a stream that evaluates values of a given effect. So if our effect type here was IO, then in order to generate that byte stream, it's going to invoke IO computations. And likewise you can imagine a files API to read all of the bytes from a file on the file system, again streaming those bytes back by evaluating some arbitrary effect. These are both very common signatures in the FS2 library, and all of the operations on our stream data type let us manipulate values of the output type of the stream. So here we have a stream of bytes: if we were to filter the stream, we'd be filtering out individual bytes, and if we were to map over the stream, we'd be transforming individual bytes. But internally, that would not be an efficient way to implement a stream data type: all of the machinery involved in the streaming approach adds up if we're manipulating individual values in the stream. So instead we move data through the stream as chunks, which you can imagine as densely packed arrays of bytes.
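The two capability traits he describes might look something like the following. This is an illustrative sketch, not the actual fs2 API: `Stream` here is a toy stand-in for `fs2.Stream`, and the trait and method names (`Socket`, `reads`, `Files`, `readAll`) are assumptions made for the example.

```scala
// Toy stand-in for fs2.Stream: just a list of chunks (here, Vectors).
final case class Stream[F[_], +O](chunks: List[Vector[O]])

// A capability for reading bytes from a network socket in some effect F.
trait Socket[F[_]] {
  def reads: Stream[F, Byte]
}

// A capability for reading all bytes of a file in some effect F.
trait Files[F[_]] {
  def readAll(path: String): Stream[F, Byte]
}

// A toy instance, just to show the shapes line up.
type Id[A] = A
val fakeSocket: Socket[Id] = new Socket[Id] {
  def reads: Stream[Id, Byte] = Stream(List(Vector[Byte](1, 2, 3)))
}
```

The point of the signatures is just that both APIs produce streams of individual bytes, even though data internally arrives in batches.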
Moving those around internally, while the API lets you ignore the fact that these chunks exist and operate on individual elements. Now, why not use vector instead of chunk to move around those densely packed runs of bytes? There are a bunch of reasons, and we're going to walk through them. But one of the key ones we can see right away, just by looking at these two sample APIs, is that if we tried to use vectors, we'd have to do a lot of copying. You can imagine a socket API that's interfacing with the operating system, maybe providing byte arrays directly, or maybe having directly allocated byte buffers that the network API is putting data into for us. We want to be efficient if we're using this for I/O. But before we even get into the details on copying, it's worth pointing out that the thing that first caused us to move away from vector, years and years ago, is that vector is not particularly space efficient. In this table, from left to right we have the size of the collection, and from top to bottom a bunch of different types. In Scala 2.12, the overhead of vector was pretty big: an empty vector was 56 bytes, up through a vector of 10,000 elements being about 47,000 bytes. In Scala 2.13, vector was rewritten thanks to Stefan Zeiger and a few other folks, and the 2.13 vector is significantly better. In particular, small vectors are much smaller: vectors with few elements take up much less heap space. But even with the improvements in the 2.13 vector, you can still see that if we wanted to actually store bytes in a vector, the overhead is still significant. That's because vector is not specialized on the Byte data type, so every byte we put into a vector gets boxed: we actually store object references to individually boxed bytes. Pretty inefficient.
Now, moving arrays around would be about the best we could do. But if we move arrays around, we have to deal with the mutability of arrays, and this is a functional programming conference. So chunk is our first reaction to that: chunk gives us a way to have the efficiency of arrays but be immutable. We can think of it as an immutable array. So we can extend the definition of chunk: a chunk is not only an immutable, strict, finite sequence that supports efficient lookup, but we also want it to be memory efficient at all sizes, for small chunks and large chunks alike, and we want to avoid copying as much as possible when interfacing with those I/O boundaries. So let's walk through those constraints and see how they manifest in Scala. We say a chunk is finite, so you can imagine a trait called Chunk that stores values of type A and has a size, which is an Int. We don't need to represent sizes as Long, because FS2 is a streaming library: if you have a collection larger than a signed Int can count, you can just use two chunks and move them through a stream together. We also want efficient random access, so we have an apply operation: you pass an index and you get back the value at that index. And right away, having added only the second method to our data type, we've made a trade-off. In this case, we're trading away the safety of making this method total: we could pass negative one, we could pass an index out of bounds. But rather than getting back an Option, or some safe value that prevents that, we throw an IndexOutOfBoundsException. And we do that specifically because of the need for efficient random access.
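A minimal sketch of that interface, with a throwaway instance to exercise the contract. This is illustrative, not the real `fs2.Chunk` definition:

```scala
// The core interface so far: a finite size plus index-based access that
// deliberately throws on bad indices instead of returning an Option.
trait Chunk[+A] {
  def size: Int
  def apply(i: Int): A
}

// A quick anonymous instance, just to exercise the contract.
val abc: Chunk[Char] = new Chunk[Char] {
  private val values = Array('a', 'b', 'c')
  def size: Int = values.length
  def apply(i: Int): Char = values(i) // out of bounds throws, by design
}
```

Note that `apply` is partial on purpose: hot loops index repeatedly, and wrapping every result in `Some` would reintroduce an allocation per element.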
If someone's writing an algorithm that traverses through different indices of a chunk, chances are they're going to be doing a lot of those lookups, pulling out values at different indices, and we don't want to pay the boxing cost of wrapping those values in Option. We also want to avoid copying and make sure we have memory-efficient implementations. So here are three data constructors for the Chunk trait we just created. There's an empty data constructor, which actually caches the empty chunk as a value. The type of that value is Chunk[Nothing], indicating that it's a chunk of any type; Chunk is covariant in its only type parameter. There's a singleton chunk, which just wraps a single value. The only overhead we pay for that singleton chunk, from a heap perspective, is the single object reference plus the size of the object we're pointing to. And finally we have an array constructor, which just lifts in a mutable array. And again, we've made another concession: the safest implementation of this array data constructor would do a defensive copy of the mutable array. So you could argue, is chunk really immutable even now, given that it has a data constructor that can reference a mutable array which, post-construction, someone could go and mutate, in essence breaking our immutability promise? And then we have a bunch of other, more specialized data constructors. Here are two examples you could come up with. There's an IndexedSeq data type in the Scala collections library, which is basically an immutable sequence with a fast indexing operator, so we can lift those right into chunk.
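Those data constructors might be sketched as follows. Again, this is an illustrative reconstruction rather than the fs2 source; in particular, note how `array` deliberately skips the defensive copy:

```scala
trait Chunk[+A] {
  def size: Int
  def apply(i: Int): A
}

object Chunk {
  // Cached empty instance; Chunk[Nothing] conforms to Chunk[A] for any A
  // because Chunk is covariant.
  private val empty_ : Chunk[Nothing] = new Chunk[Nothing] {
    def size: Int = 0
    def apply(i: Int) = throw new IndexOutOfBoundsException(i.toString)
  }
  def empty[A]: Chunk[A] = empty_

  // Heap overhead is just one object reference plus the wrapped value.
  def singleton[A](value: A): Chunk[A] = new Chunk[A] {
    def size: Int = 1
    def apply(i: Int): A =
      if (i == 0) value else throw new IndexOutOfBoundsException(i.toString)
  }

  // No defensive copy: mutating `values` afterwards shows through,
  // which is the deliberate concession described above.
  def array[A](values: Array[A]): Chunk[A] = new Chunk[A] {
    def size: Int = values.length
    def apply(i: Int): A = values(i)
  }

  // Lift any fast-indexing immutable sequence directly.
  def indexedSeq[A](values: IndexedSeq[A]): Chunk[A] = new Chunk[A] {
    def size: Int = values.length
    def apply(i: Int): A = values(i)
  }
}
```

The usage below shows both the happy path and the immutability concession: mutating the wrapped array is visible through the chunk.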
It's also convenient to have a way to interoperate with the java.nio ByteBuffer interface. And once again, we don't do defensive copies of those underlying buffers. Okay. We've avoided copies in our data constructors, but how about the operations on the data type itself? You can imagine some code like this: we have some huge chunk of bytes that we got from somewhere, it doesn't really matter where, and maybe we've got an algorithm that wants to take that huge chunk, process its first 10 elements, and delay the processing of the remainder. Maybe it's a binary parsing protocol or something. We can make that more efficient right off the bat by thinking about the operations that we're providing to our clients. Rather than having a take and a drop, maybe in this case it'd be better to offer a splitAt, just to have the knowledge, in the implementation of splitAt, that we need both pieces; we can avoid duplicate work if we know the client wants both a prefix and a suffix. So how can we make splitAt efficient? A naive implementation might pattern match on all the data constructors we've defined so far and handle array copying, byte buffer copying, and so on, generating duplicate arrays. We don't want to do that. So instead, we can introduce a new data constructor. In particular, the array slice data constructor holds a reference to an underlying array, but also holds a view, a subset view on that array, identified by an offset into the wrapped array and a count of values from that point forward. The apply operation we wrote before is just normal index-based lookup, adjusted for the offset and the size. And splitAt we can implement in a way that is zero-copy.
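Here's a sketch of that slice-backed constructor with a zero-copy splitAt, under the same caveat as before: `ArraySlice` here is an illustrative reconstruction, not the fs2 source.

```scala
trait Chunk[+A] {
  def size: Int
  def apply(i: Int): A
  def splitAt(n: Int): (Chunk[A], Chunk[A])
}

// A view into `values` starting at `offset` and spanning `size` elements.
final class ArraySlice[A](values: Array[A], offset: Int, val size: Int)
    extends Chunk[A] {

  def apply(i: Int): A =
    if (i < 0 || i >= size) throw new IndexOutOfBoundsException(i.toString)
    else values(offset + i) // index lookup adjusted for the view's offset

  // Zero-copy: both halves are just new views over the same array.
  def splitAt(n: Int): (Chunk[A], Chunk[A]) = {
    val m = n.max(0).min(size)
    (new ArraySlice(values, offset, m),
     new ArraySlice(values, offset + m, size - m))
  }
}
```

Splitting is just offset arithmetic: no elements are copied, which is exactly the memory-retention trade-off discussed next.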
splitAt can, in essence, just create two views of the underlying array and do some offset arithmetic. Now, we've given up a little bit of strictness in this case. When we said splitAt(10), the API made it look like we got a small prefix and a large suffix. But if we're relying on that small prefix, say we want to accumulate a lot of these small prefixes, we have to be careful, because we may really be referencing huge underlying arrays beneath them. So we're trading off again, against the initial constraints we put in place: we give up a little bit of strictness for the performance and efficiency of the split operation. Okay, let's add some combinators to our data type. Maybe the first combinator we might add is a foreach, a side-effecting operation: apply this function to every element in the chunk, just performing some side effect. Since we have a finite size and an efficient apply operation, this is really just a while loop: we start at index zero and rip through all of the elements of the chunk, calling our function f. In the same way, we can do a side-effecting foreachWithIndex, basically the same exact implementation, except we also pass the index into the function we're given. So far so good. But this is, again, a functional programming conference, so let's look at more functional combinators. How about map? Let's map a function f over our chunk, transforming each individual element. How might we go about implementing map, given our definitions so far? A first take at the implementation could look something like this. We create an array of our target type B. We make it the same size as our chunk, because map isn't going to alter the structure of the chunk at all, so we allocate an array of that size.
Then, with foreachWithIndex, we side-effect array mutations through all of the indices of our source, and we return an immutable chunk referencing the brand-new array we just created. But this isn't going to compile in Scala. In particular, we need this thing called a ClassTag, from the scala.reflect library, to be able to allocate an array of a polymorphic type B. We can't just create an array of an arbitrary type in Scala, as a consequence of running on the JVM, and of the JVM having primitive array types built in. If someone were to instantiate this map operation with B equal to Byte, for example, we'd want a primitive byte array created on the JVM, not an object array where the individual bytes get boxed. So we can't quite write this code the way it is here. One option to address this is to add that ClassTag constraint. So here we've added the constraint to the type parameter B on map, and I've changed the name to mapCompact. That ClassTag basically acts as a witness that we're able to create these primitive arrays: if B ends up being a primitive JVM type, we get a primitive array of the corresponding type; if B ends up being an object type, we just get a regular object array. This is fine, this works, there are no problems here. But it doesn't really scale. It works fine for this one operation, map, but do we really want these ClassTag constraints propagating through all of our APIs on chunk, as we continually run into operations that need to allocate new chunks? And the definition is interesting, but it's not as performant as it may look. We wanted these primitive byte arrays to be created on the heap, but we had a function argument, and that function argument maps, say, a Byte to a Byte, or an A to a Byte, and so forth.
The Function1 trait that Scala uses to implement function values is specialized on some primitives, but not all, and in particular it's not specialized on Byte. So even if we're mapping a chunk of bytes to another chunk of bytes, each of those bytes gets boxed when passed to the Function1, and then the resulting byte has to be unboxed. There's tons of boxing going on, even though we thought we were working with densely packed byte arrays. I mentioned that these ClassTag constraints propagate virally: anyone who wants to call mapCompact in a generic context, where they don't have a concrete known type, is going to need to pick up that same ClassTag constraint, and it continues to propagate until it reaches a concrete case. And furthermore, we end up with different implementations if we offer both mapCompact, an efficient version for the compact case, and a default implementation for the non-compact case. If we really offered both of those APIs, we'd be making the users of our library pick, in each case, which one to use, and it's just too much of a burden on everyone to pick constantly, all of the time. So mapCompact isn't really a great solution. Instead, we can take some of these facts and put them together and say: well, we're going to end up boxing here anyway. Like I mentioned, the bytes flowing through this Function1 are going to end up getting boxed as a result of Scala not specializing functions for Byte. So really, in this case, we don't necessarily need a primitive array underneath, and we can construct an Array[Any]. This will really be an array of object references on the heap, so it's not space efficient.
It's also unsound: we have an Array[Any] and we're trying to treat it as an array of type B, and that's not a safe thing to do in general. But in this case we can work around it; this is the same trick that's used internally in the Scala collections library in a bunch of similar cases. We still allocate these Array[Any]s, but as long as we never reference that Array[Any] as an Array[B], it's okay, it's safe. And in this particular case, we never expose the underlying array: we construct it locally, encapsulate it, and keep it private, so there's no way to access it. So even though the operation is unsound locally, overall it is a safe operation. Okay, so what about compacting, what about when we really want those densely packed byte arrays on the heap? Well, in those cases, we can just ask for it: we keep our operations perhaps inefficient from a space perspective, like we just saw, and add compacting as a separate operation. So here we've added two operations to chunk. toArray just copies all of the elements into a brand-new array, and it has a ClassTag constraint, because from an API perspective we're saying this is going to allocate a primitive array where possible. And compact just wraps a call to toArray back up in a chunk. In reality, in the library, the implementation of these operations is significantly more complex, because we have all sorts of special cases to avoid these copies when possible. In the case of underlying array slices, for example, we can implement toArray by doing a native array copy of just the slice of the array that we care about. But all of the same constraints and trade-offs that we just made still apply, even in the optimized versions of these operations.
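Putting those pieces together, here's a self-contained sketch of the untagged map plus explicit toArray/compact, along the lines just described. This is illustrative: the real fs2.Chunk versions are more elaborate, with copy-avoiding special cases.

```scala
import scala.reflect.ClassTag

trait Chunk[+A] {
  def size: Int
  def apply(i: Int): A

  // Finite size + O(1) apply makes this a simple while loop.
  def foreachWithIndex(f: (A, Int) => Unit): Unit = {
    var i = 0
    while (i < size) { f(apply(i), i); i += 1 }
  }

  // Untagged map: boxed storage, no ClassTag required.
  def map[B](f: A => B): Chunk[B] = {
    val out = new Array[Any](size) // object references; boxed elements
    foreachWithIndex((a, i) => out(i) = f(a))
    new Chunk[B] {
      def size: Int = out.length
      // Locally unsound cast, but safe overall: `out` never escapes.
      def apply(i: Int): B = out(i).asInstanceOf[B]
    }
  }

  // Explicit compaction: the ClassTag witness lets us allocate a
  // primitive array when A2 is a primitive type.
  def toArray[A2 >: A: ClassTag]: Array[A2] = {
    val out = new Array[A2](size)
    foreachWithIndex((a, i) => out(i) = a)
    out
  }

  def compact[A2 >: A: ClassTag]: Chunk[A2] = {
    val arr = toArray[A2]
    new Chunk[A2] {
      def size: Int = arr.length
      def apply(i: Int): A2 = arr(i)
    }
  }
}
```

So mapping stays generic and allocation-cheap in the type-class sense, and callers who want a densely packed primitive array ask for it explicitly via `toArray` or `compact`.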
All right, let's look at another one: the filter operation. It's just like map, but this time we don't know the size of the result, so we can't just construct an array of the same size as the starting chunk. Here we lean on the Scala collections library again and use a mutable ArrayBuilder. ArrayBuilder is very much like a Java ArrayList: it has some algorithm to resize periodically, by factors of two. We run into the same problem, though, with respect to ClassTags. So again, in this case, we create an ArrayBuilder of type Any, knowing that in the end that gives us an array of object references, and then we do the same unsound cast we did before. One thing that's kind of interesting here is that we can choose to give a size hint to the ArrayBuilder. The question is: if we have a huge chunk and someone calls filter on it, is it more likely they end up with a chunk that's near the size of the input chunk, or one that's nearly empty? Of course, there's no way to answer that question in a totally generic sense, but for a data type in a streaming library that works with bytes much of the time, we can use that context to make a better implementation. So here we choose to say: we think that most of the time, when folks use filter, they're going to keep most elements in, so we size-hint the underlying ArrayBuilder toward the size of the input collection. Of course, that will be the absolutely wrong decision sometimes, like if you filter all elements out, but in practice it ends up working out for our use cases. Okay, let's take a look at this next example. We've got a huge chunk of bytes again.
And then we've got a tiny little chunk of bytes, carriage return line feed, which is just a two-byte array, perhaps. And we have two examples of how those two chunks might be combined into a stream. In the discouraged case, we lift a single chunk into the stream API, and the chunk we lift is the result of concatenating the huge chunk with the little chunk. We want to discourage that because we don't want to do any copying: we don't want this ++ operation to have to do a whole big array copy, that huge array plus two more bytes on the end, just to stick those elements together. Rather, we want folks to use the encouraged example: there's no need to create a single chunk representing those two constituents, because with the stream API you can just lift both chunks and concatenate the streams, and that's actually constant-time in FS2. So the question is, from an API perspective, how can we encourage folks to do the latter? And the answer, really, is that when we write software we tend to be lazy about things: we want our APIs to be elegant and simple and not have a lot of cruft. So in this case, we discourage the first form by just not offering the API. We say: hey, if you want to concatenate two chunks, then we're going to make it a bit ugly. We make you put the chunks you want to concatenate into a list and call this concat function. This is just a speed bump; it's just as expressive as the previous example. But making it a little less accessible and a little uglier to call actually has a material impact on how often folks end up using it. All right, how would we implement concat, though? Here's one possible implementation of the concat operation. We take a sequence of chunks as an argument and sum the sizes of the constituent chunks.
We allocate an array of that total size, and then we just copy all the elements in, keeping track of offsets as we go. And in this case, since concat joins a bunch of individual chunks, we think it's a good idea to keep concat backed by a primitive array, so we pick up the ClassTag constraint again. Let's see how we might use this concat operation to implement something a bit more complicated in FS2: the unconsN operation. We're not going to get into all the details of this API; it's not particularly relevant. But I want to show the way the decision we just made in the concat operation impacts usage. unconsN is a stream combinator; it says: given some stream s, take n values from it and emit them as a chunk. So we output this chunk of n values, and we also output the remainder of the input stream. The Pull data type is a functional data type that lets us recursively build up stream computations. Typically it looks something like this: you have an internal recursive driver function, in this case I've called it go. go closes over some state, in this case a queue of accumulated chunks, the stream we're pulling from, and the number of remaining elements we're waiting on. It pulls on the source stream and says: give me the next chunk available in the source stream. If we reach the end of the source stream, then we reached it before coming up with our n elements, so we just emit whatever we've got, the concatenation of our accumulated chunks so far, and the remainder stream is empty. If instead we got a chunk of elements from the source stream, plus the remainder of the stream, then we add that head chunk to our accumulator queue and recurse.
And if we've reached the right number of elements, then we do some splitting and concatenating and finally emit the net result. Okay, so don't worry too much about that API. The important part is that in multiple cases we called concat, and as we saw, concat requires this ClassTag constraint. So we have some options on how we can address that and make that code compile. One option, again, is adding the ClassTag constraint to the generic O type parameter on the unconsN method. If we do that, then all methods that call unconsN are going to need to pick up that same constraint; we're back to the viral propagation case we talked about earlier. Another option is to just change the problem entirely. Rather than concatenating, which requires this ClassTag constraint, what if we emit our accumulator without concatenating, and just give the queue of chunks back directly? Sometimes it's difficult to see these types of options: we get so focused on the problem we're trying to solve that we don't realize we can just change the problem. The third option is to remove the ClassTag constraint from concat, just like we did with map. So let's look at that third option: it worked nicely for map, can we do it here? And of course we can; it's the same exact trick. We allocate an Array[Any], and we trade something away: we have a fully generic operation now, but we lose the ability to create those densely packed primitive arrays. And for an operation like concat, that sort of matters. If you're concatenating a bunch of byte arrays together, you kind of want a byte array in the end, not an object array. So here was our attempt at fixing this a long time ago. In essence, we taught chunks how to know what their element types were, by capturing some knowledge at data construction time.
Saying, like: oh, you put bytes in here. We can then query that later and say: hey, if someone wants to concat a bunch of chunks together, and we know that all of the chunks contain only bytes, then we can use an optimized version of concat that allocates a byte array. We can do that for each primitive type, and only if you put a non-primitive into a chunk do we fall back to the untagged approach. The question with this type of approach is how you do it efficiently: you don't want to do a linear scan over the chunk. So we had a bunch of tricks: like I said, we captured some witnesses at construction time, and there were some fallback cases for some of the other constructors. It roughly worked, until we started getting more adoption on Scala.js. In particular, the http4s library, a functional Scala web library, got ported to Scala.js, so you can run web servers on Node, write web clients, use browser-based APIs, and so on. And that library uses FS2 as its underlying mechanism for moving bytes around. As Scala.js adoption increased, we eventually ran into cases where the assumptions made in our implementation of this optimized concat no longer held. In particular, because JavaScript's approach to its numeric tower is different from the JVM's, we end up with facts like this: on Scala.js, the number 1 is an instance of Byte. So the whole scheme we came up with didn't, in fact, work. Option three is out: we can't do the untagged version and still get specialized, efficient behavior in those cases. So rather than falling back to option one or two, we sort of combined options: what if we created a new data constructor for chunk that's backed by a queue?
And so the idea is that the chunk queue data constructor just references an underlying sequence of constituent chunks, and maintains a running total, so there's a constant-time size operation as chunks are added and removed. We can support appending a chunk or prepending a chunk. We take care to make sure we don't put empty chunks into our constituent chunk queue, just to preserve some asymptotic complexity. And then indexing, our efficient index-based lookup, ends up being a walk of the constituent chunk queue, looking for the right offset. With this new data constructor, we can actually go back and add that ++ operation, because now that operation is no longer linear in the sizes of the input chunks. And we can use this new implementation in unconsN, and the trade-offs seem really good. Our accumulator now is just a Chunk[O], the same as our output type, and in each of the places throughout the code where before we were calling concat, now we're just emitting our accumulated chunk. When we want to add elements to it, we're just concatenating with ++, hiding all of that underlying queueing. But all is not well. There's one big problem here, and it's in the asymptotic runtime performance of that index-based lookup on queue-backed chunks. When the constituent chunk queue is small, we have effectively constant-time performance. If there are only two or three elements in that constituent chunk queue, then when you go to look up an index, it's quick to jump through them and say: what's the size of the first, what's the size of the second? Oh, your target index must be in the third. But as the constituent chunk queue's size approaches n, or, another way to say it, as the sizes of the individual constituent chunks approach one,
our lookup performance becomes linear, because each of those bounds checks only moves you over one element. And now let's go back to foreach, the fundamental building block we use in all of our other combinators. If apply can be linear in some cases, then foreach is quadratic, because foreach is already linear in and of itself: each time through that loop, we're potentially making a linear call. That's not good. Now, we can fix foreach, and foreachWithIndex, really easily: just don't index. We can specialize the definition of foreach on the queue constructor to say: there's no need to index in this case, just foreach over each of the elements of each constituent chunk in the queue. And that works great. But the other combinators that were implemented with apply still have this potential for nasty asymptotic performance. So how do we fix that? This was an idea from a maintainer of FS2, and the general idea is that we binary search an index table. Let me show you what I mean. Say our constituent chunk queue has five elements, with sizes 3, 10, 1, and so on. We build a lookup table of accumulated sizes: the accumulated size after the first element is 3, after the second element 13, then 14, and so on. We also create an array that references each of the elements in the queue. Then we binary search the accumulated-sizes array for the target index. In this case, if we binary search this array for 20, we start at 14, move over to 34, and now we know the target element is in this position of the queue, so we can go look it up there. The implementation, shown here, covers both the creation of this set of lookup tables and all of the index arithmetic needed to binary search it and then look up the element. Feel free to take a look at the slides. Go ahead.
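The lookup-table idea he describes might be sketched like this, using the example sizes (3, 10, 1). This is an illustrative reconstruction, not the FS2 source: `QueueChunk` is a hypothetical name, and, as he notes, it assumes no empty constituent chunks are stored.

```scala
import scala.collection.Searching.{Found, InsertionPoint}

// A queue-backed chunk with a precomputed accumulated-sizes table,
// binary searched to resolve an index to (constituent chunk, offset).
final class QueueChunk[A](chunks: Vector[Vector[A]]) {
  // accumulated(i) = total number of elements in chunks(0..i);
  // e.g. sizes 3, 10, 1 give accumulated sizes 3, 13, 14.
  private val accumulated: Vector[Int] =
    chunks.iterator.map(_.size).scanLeft(0)(_ + _).drop(1).toVector

  val size: Int = accumulated.lastOption.getOrElse(0)

  def apply(index: Int): A = {
    if (index < 0 || index >= size)
      throw new IndexOutOfBoundsException(index.toString)
    // Find the first chunk whose accumulated size exceeds `index`,
    // i.e. binary search for index + 1 in the strictly increasing table.
    val chunkIdx = accumulated.search(index + 1) match {
      case Found(i)          => i
      case InsertionPoint(i) => i
    }
    val before = if (chunkIdx == 0) 0 else accumulated(chunkIdx - 1)
    chunks(chunkIdx)(index - before)
  }
}
```

Building the table is linear in the number of constituent chunks, and each subsequent lookup is logarithmic, which matches the complexity figures given below.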
Yeah, just a time reminder: we're in the last few minutes of the session.

Yes, yes, thank you. So the net result of that implementation is that the first call to apply is linear in the number of constituent chunks, and subsequent calls to apply are logarithmic in the number of constituent chunks. With this implementation, even in the worst case, where you have a chunk queue of singleton elements, a foreach implementation that used index-based lookup would be n log n instead of quadratic. Okay. So in the parting moments here, I just wanted to say that we made a lot of compromises on our initial definition of what a chunk was, and those compromises are okay. We traded off all of those hard constraints we started with for good reasons, and I think that's something we have to remind ourselves of in functional programming: we've got to look for those balances and find the right approach. I don't want you to leave this talk thinking this design was the tidy series of logical decisions we just walked through. Rather, this was years of accreted knowledge, built from all sorts of use cases: over about a five-year period, there were 20 issues and pull requests just touching chunk. So that's really my talk for today. I'm happy to take questions, maybe in the hangout area. But in general, I hope that was an interesting walk through the design process we went through, even for a relatively simple data type.

So the message to us: design is an iterative process of successes and failures. Remember that. Thanks a lot for the wonderful time.