So we are going to go into the details of how to make a Haskell program more performant. How many of you think that Haskell must be slower than C? Okay, one. All right. We'll see that Haskell can do as well as C with some effort, and we'll see what kind of roadblocks there are.

Let me introduce myself. I'm Harendra. For the good part of my life I worked on C, and for the last few years I have been programming in Haskell. Currently I'm focusing on Streamly, a library that aims to unify several aspects of Haskell programming into a simple, uniform list-like API. In particular, it makes concurrent and non-concurrent programming indistinguishable. But today we'll talk about performance. Streamly is a high-performance library, and that is the context in which I started looking at how to make Haskell perform better and at what the roadblocks are. This talk is based on that and on one other library that I'll talk about shortly. There are a lot of slides, so I'll try to breeze through some of them; you can go and look at the details in the slides later.

In Haskell, performance can drop by 10x or 100x very easily: just one change, and it's 100x slower. In C, refactoring is a nightmare for correctness, but performance is more or less unaffected. In Haskell it's the opposite. Refactoring is very easy and correctness can be guaranteed, but performance will get affected, and you'll have to do some work to get back to where you were. So you cannot be confident unless you measure. That's what I have figured out from my experience: you have to measure everything to be sure it performs well. Best practices can get you into the ballpark, but squeezing out the last drop of performance is harder; it can be something of an art, and you need to know the specific details. But with some effort you can get to C.

So let's take a look at a case study. I wrote a library called unicode-transforms in response to an issue in the Haskell build tool stack. I think many Haskellers at some point in their life try to prove that Haskell can perform as well as C; I have seen many such attempts, but this one is not a toy program. This is a real-life, useful library. And I compared it against ICU (International Components for Unicode), which does Unicode normalization, to see whether it can perform as well as the best library available in C++ land.

So what's the problem we are trying to solve in this library? Unicode characters may have multiple forms: they can be composed or decomposed. On the first lines of the slide, the green characters are composed characters; for example, A with a ring above is 00C5, and O with a diaeresis is 00F6, and they decompose into the code points written in orange. So how do you compare strings? They could be in different forms. You have to bring each string into a normalized form before you can actually compare them and say these two strings are equal.
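To make this concrete, here is a minimal sketch of why comparison needs normalization, using the Data.Text.Normalize API that the released unicode-transforms library exposes (the module and function names are from that library's documentation; treat the details as an assumption rather than code shown in the talk):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.Normalize (NormalizationMode (NFD), normalize)

composed, decomposed :: Text
composed   = "\x00C5"          -- Å as a single, precomposed code point
decomposed = "\x0041\x030A"    -- A followed by COMBINING RING ABOVE

main :: IO ()
main = do
    print (composed == decomposed)                             -- False
    print (normalize NFD composed == normalize NFD decomposed) -- True
```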
So you have a sequence of characters, and when you decompose a composed character into multiple characters, you get a starter followed by combining characters. The combining characters need to be put in a certain order, because if the order is different, the result will be different and you can't compare deterministically. So you need the combining characters reordered into the standard form.

What do we need to do that? A couple of lookup maps. There is a decomposition map, which tells you, given a character, which characters it decomposes into. There is a combining class map, which says what order the combining characters should be in. And then there are the Hangul (Korean) characters; those number in the tens of thousands, not hundreds, so there is an algorithmic decomposition for them, as opposed to the table mappings used for the other characters.

There was an existing library, written by Antonio Nikishov. It was around 50 lines of code for normalization. It used maps for the character database lookups and Haskell lists for processing. This was idiomatic code: in just 50 lines you could do the whole normalization, very elegant code. And this is the performance of that code. The first bar is ICU and the second is the beautiful, elegant Haskell code: the Haskell code is close to 200 milliseconds where the ICU library takes less than 10 milliseconds, so about 20 times worse, and in the worst case maybe 70 or 80 times worse. For English, ICU takes close to 2.7 milliseconds.

So the first change I tried: this is a read-only database, right? Why not use GHC's pattern-match lookup instead of an IntMap? I looked at the generated code and saw that the GHC folks have done a really good job of making pattern matching efficient. The code looks like this: pattern matches on all the characters, automatically generated from the Unicode database. So let's see how it performs. On the left you see the improvement; it's negative, which means improvement compared to what it was before. On the right are the absolute values. But this is not much.
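Here is a hand-written sketch of what such generated pattern-match tables look like (the real tables are generated from the Unicode character database; the function names and the handful of entries here are illustrative only):

```haskell
-- Decomposition map as a pattern match: GHC compiles this into an
-- efficient lookup rather than a chain of comparisons.
decomposeChar :: Char -> [Char]
decomposeChar '\x00C5' = ['\x0041', '\x030A'] -- Å -> A + combining ring above
decomposeChar '\x00F6' = ['\x006F', '\x0308'] -- ö -> o + combining diaeresis
decomposeChar c        = [c]                  -- not decomposable

-- Combining class map: class 0 means the character is a starter.
combiningClass :: Char -> Int
combiningClass '\x030A' = 230
combiningClass '\x0308' = 230
combiningClass _        = 0
```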
Then I tried a two-stage decomposition lookup: instead of looking up the actual decomposition of a character directly, first look up whether it is decomposable at all. A lot of characters are not even decomposable, right? So you use a bitmap to check whether a character is decomposable, and only if it is do you go to the secondary lookup for the actual decomposition. That didn't have much impact; it had some improvement in two cases.

So let's try something more. Decomposition is recursive: if a character decomposes into two code points, those can decompose further, and the process keeps going; it terminates at some point. So instead of using iterate, zip, and tail, the more idiomatic code, just use a simple recursion. That gave some improvement.

Now, for reordering, the code was splitting the characters into groups and then sorting each group by combining class. So there is a starter, then a combining character, then another combining character; we split the list into multiple lists, one per group. Starters are characters with combining class 0, so whenever you find a combining class 0 character, you start a new group. And then these groups are sorted with the regular sort.

The optimized code uses custom sorting for the cases where the group size is one or two. In most cases you will find only one or two combining characters, so if you optimize that case, you have optimized quite a bit; the other cases are the slow path. You will rarely encounter more than three or four combining characters; as the count increases, those kinds of decompositions become rarer and rarer.

Then, use a bitmap for a quick combining/non-combining check, instead of fetching the combining class directly. A lot of characters are not combining, so in most cases you will find that the character is not combining and you don't even need the class lookup.

The next optimization: the original code had reorder . decompose. It's modular, two stages: you decompose, then you reorder. Instead, we made it monolithic. It's not as elegant as before, but it performs better. We decompose and reorder together, using a reorder buffer to do the reordering during decomposition itself. In the common case the buffer holds just one character, and it gets flushed as soon as the next character arrives, because the next character is not combining. We only need to sort the buffer when there is more than one combining character in it, and for two characters we use custom sorting: just swap them if needed.

Another problem was that list append was not performing well. So instead, manually deconstruct and reconstruct the list for the short-string common cases. That gave a 10% improvement.

Then the Hangul (Korean) characters, the Hangul Jamo. Unicode specifies an algorithmic way to normalize those, and several small changes made it much better. We use the algorithmic decomposition, but the Hangul case is more expensive than the regular fast path, so we NOINLINE it so that it doesn't impact the fast path. Then use quot and rem instead of div and mod; that gave some improvement. Then, instead of separate quot and rem, use quotRem: in one operation you get both the quotient and the remainder. Then, instead of chr, use unsafeChr, which skips the validity check on the code point, a check we already do elsewhere. Use strict values in list buffers instead of lazy lists: make them strict. Next, instead of lists, use tuples for short buffers; that performed even better. And another change was localizing recursion: instead of one big recursive loop, restrict the recursion to just the part that needs it; don't recurse over a big body of code.
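Two of these tricks are easy to show in miniature. Below is a sketch of the size-one/size-two sorting fast path and of the algorithmic Hangul decomposition using a single quotRem per step; the function names are illustrative, and the Hangul constants are the standard ones from the Unicode specification:

```haskell
import Data.Char (chr, ord)
import Data.List (sortOn)

-- Sort a group of (character, combining class) pairs: groups of one or
-- two entries are handled specially, longer groups take the rare slow path.
sortByCC :: [(Char, Int)] -> [(Char, Int)]
sortByCC []  = []
sortByCC [x] = [x]                       -- common case: nothing to do
sortByCC [x@(_, cx), y@(_, cy)]          -- two entries: swap if needed
    | cx <= cy  = [x, y]
    | otherwise = [y, x]
sortByCC xs  = sortOn snd xs             -- rarely taken general case

-- Algorithmic Hangul syllable decomposition; quotRem computes the
-- quotient and remainder in one operation.
decomposeHangul :: Char -> [Char]
decomposeHangul c =
    let sIndex  = ord c - 0xAC00             -- SBase
        (lv, t) = sIndex `quotRem` 28        -- TCount
        (l, v)  = lv `quotRem` 21            -- VCount
        lPart   = chr (0x1100 + l)           -- LBase
        vPart   = chr (0x1161 + v)           -- VBase
        tPart   = [chr (0x11A7 + t) | t > 0] -- TBase, only if trailing jamo
    in lPart : vPart : tPart
```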
So with all these optimizations, here is what we got. The first bar is ICU. The second is unicode-transforms after optimization. The third is utf8proc, another implementation of normalization, written in C. Haskell is doing better than that, but it is still not as good as the best implementation in C++ land, the ICU implementation. So can we do better?

We are still using plain Haskell Strings. So I did some small experiments to see how we perform in the limiting case. Say we map a +1 over each character of a String and of a Text, using the text library: the String version took 17 milliseconds and Text took 11 milliseconds. Slightly better, right? Then, with Text, if I use just stream and unstream, without even doing the map, it takes 4 milliseconds. That's the minimal operation you can do on Text. And the whole ICU normalization takes 2.7 milliseconds, so we have no chance, right? Just streaming and unstreaming takes much more than that. So I looked at what is going on in the unstream and stream code. I separated out some of the memory reallocation code and put a NOINLINE on it to move it out of the fast path, and that got us to 1.3 milliseconds for stream plus unstream. Now we have a chance: we can compete with ICU normalization using the text library.

So let's apply that: use Text with stream and unstream instead of Strings. Then there were some conditional branch readjustments: arranging the fast-path branches so that they are more efficient; even rearranging code can give you some improvement. Then one INLINE gave 16% more, NOINLINE on slow-path code gave some improvement, and unboxed strict fields gave some more.

With these small changes in place, we optimized the reorder buffer again: instead of a list, use a custom data type. We use a buffer type with constructors for empty, one character, and many characters, where the one-character case is unpacked. This is optimizing the fast path again; that's the common case, so use the most efficient data structure for it. And use a mutable reorder buffer instead of a list: we used a mutable array. That gave us about plus 5%, and at that point 5% was a really good improvement, because we had already optimized quite a bit.
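A sketch of what such a specialized reorder-buffer type can look like; the constructor names here are illustrative, not the exact ones from the library:

```haskell
-- The common one-character case gets its own constructor with an
-- unpacked, strict field, so the fast path stays cheap; the Many case
-- is the rarely taken slow path.
data ReBuf
    = Empty                     -- no pending combining characters
    | One {-# UNPACK #-} !Char  -- the overwhelmingly common case
    | Many [Char]               -- rare: several pending characters
```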
So where are we now? After this, what we got is pretty close to ICU. On the Korean benchmark we are actually doing much better than ICU, and in the other cases it is slightly worse, but quite in the ballpark, very close to C. But now let's use LLVM as the backend: instead of the GHC native code generator, GHC lets you use LLVM as the backend, and it generates better code in some cases. That gave a 10% boost just from using the LLVM backend, and we are pretty close to C now. And we can do even better; there are more algorithmic optimizations we could do. The ICU library also uses the quick-check properties — not QuickCheck the Haskell library; Unicode specifies quick-check properties with which you can quickly figure out whether a string needs to be normalized at all. So that library has that advantage as well, and even then we are pretty close, without using the quick-check properties in our algorithm. Code generation by GHC can possibly be improved too; I raised a couple of tickets when I was investigating this. The register allocation is not great in some cases and can perhaps be improved, and the memory allocation can also be improved a little bit in one case.

So what are the lessons, the learnings from this? We can write concise code in Haskell pretty quickly. It may not perform well, but it can be optimized to perform close to C. Most of the optimizations we did were not language features or language-related stuff; mostly we did fast-path optimization: use better data structures and better handling for the common cases.

The most common language-related optimization we did was inlining: figuring out where to put an INLINE. I usually say that half the time goes into writing the code and half into figuring out where to put the INLINEs. That was the most common optimization we did, other than some strictness here and there. And yes, we can do as well as C. If you think Haskell can't, this is an example, and it's a library which is not a toy: it's used by Pandoc, for example. I looked recently, and Pandoc is using this library now.

So let's take a look at the ground rules for optimization. I say that in Haskell you need benchmarks the way that in C you need tests. Haskell is pretty sensitive, and the one thing that in my experience is fragile is inlining. If you can figure out what to inline, then you're good; if GHC gets better at this, perhaps the problem will go away. The ground rules are: measure, analyze, optimize. Algorithmic optimization comes first: most of the time, when something is slow, it is because the logic is not the best possible. Get the easy gains first and optimize where it matters; that is the most important thing. Optimize the fast path: why would you optimize something that is not even on the fast path and happens only once in a while? That's what we did in the Unicode normalization library: we optimized the cases that occur most frequently.

Now debugging. From my experience, it helps to narrow down incrementally. If your code is not performing well and you made some incremental changes to it, start removing them one by one: remove this, remove this — where is the culprit? That helps quite a lot. You can use incremental addition or incremental elimination; both techniques help you figure out where the problem is. And then ratchet. Say I spend a day discovering an issue; during refactoring I may lose the fix — maybe the INLINE gets removed — and I would have to spend the same time all over again. So you ratchet: you never let your code perform worse than it did before; keep it at least as good.

The three things you need for optimization in Haskell are inlining, specialization, and strictifying your data structures where needed. If you take care of these three, you're good for more than 90% of cases, or I'll even say 99%. You won't need much more unless you are hungry for the last drop of performance, or you run into some corner case that needs some unusual kind of optimization.

Let's talk about inlining. What is inlining? Instead of making a function call, you expand the definition of the function at the call site. Very simple. Now, there is a big point of confusion when you're beginning to write Haskell code: the inlining machinery is distributed across two places, the call site and the definition site, and there are things you can do at both places that help or block inlining. For a function to be inlinable, its definition needs to be in the interface file, and there are several ways it can land in the interface file.
One is that GHC decides on its own: it's a short function, so it puts it there. The other way is that you put an INLINE pragma on it; then the compiler will definitely put it in the interface file, because it has to be inlined. If you put INLINABLE on it, that is also a directive to put it in the interface file. INLINE is INLINABLE plus "inline it at all call sites": a directive to put it in the interface file and, wherever it is used, inline it.

At the call site, the prerequisite is that the function's definition must be available in the interface file. So if you want to ensure a function will be inlined, make sure it is in the interface file: use INLINE or INLINABLE. If it is marked INLINE, it will be inlined everywhere, and it will be in the interface file as well. If it is marked INLINABLE, you can inline it at a particular call site using the inline function, which lives in ghc-prim, the primitive package. So you can decide at the call site, even if the function was not marked INLINE at the definition site: inline it here, but not there. With INLINE it will be inlined everywhere.

Now, when can inlining not occur? We need to understand this; otherwise we get confused about why something is not getting inlined. The function has to be fully applied: if it is partially applied, it cannot be inlined. If the function is passed as an argument to another function which is itself not inlined, then again it won't be inlined; that's one common reason. If the function is recursive, there is no point inlining it. And in mutually recursive definitions, GHC tries not to pick a function with an INLINE pragma as the loop breaker; in a mutually recursive group, one of the functions has to act as the loop breaker to cut the loop.
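Here is a minimal sketch of those definition-site and call-site controls together (the functions are toy examples):

```haskell
import GHC.Exts (inline)  -- the "magic" inline function from ghc-prim

step :: Int -> Int
step x = x * 2 + 1
{-# INLINABLE step #-}  -- unfolding goes into the .hi file; GHC decides
                        -- per call site whether to inline

hot :: Int -> Int
hot x = inline step x   -- force inlining at this particular call site

cold :: Int -> Int
cold = step             -- not applied to its argument here, so this
                        -- reference cannot be inlined at this site
```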
So what happens when an INLINE is missing? Look at this code; it is one of the benchmark codes in the Streamly package, and there was no INLINE on this function. I figured out that without the INLINE it was taking 50 milliseconds; with the INLINE, 500 microseconds. That's 100x faster. Just one missing INLINE can make your program perform that much slower. So what is happening here? Without marking the function inline, f, which is an argument to it, cannot be inlined, and therefore it cannot fuse with the mapM there. Stream fusion can get you very tight code that performs really well, but if fusion doesn't happen, you are left with a function call, which is very slow. That's what was happening in this case.

One point to note here: fusion is very sensitive to inlining, but CPS code, continuation-passing style, is pretty robust against it. In Streamly I have both versions: all combinators are available in CPS style as well as in direct, fusible style, and we use whichever makes sense at a particular point. CPS has been very robust, though it is slow compared to the fused code; in fact, fused code can perform much worse than CPS if an INLINE is missing. CPS is robust against inlining precisely because it is slow in itself; inlining doesn't make much difference to it.

This next one is counterintuitive for many people. The GHC manual says you will hardly ever need NOINLINE, but I use NOINLINE all the time, and it helps performance, because you push the slow-path code out of the way and let the fast-path code work better. Think of the instruction pipeline: when instructions are fetched, if a lot of rarely executed junk sits in the middle, it won't be as efficient. So with NOINLINE you say: put this in a separate function, mark it NOINLINE, it's out of the way. This case happens once in a while; move it aside and let my fast-path code be as efficient as possible.

Now, specialization is another technique. It's not required as often as inlining, but it is one of the important optimizations for polymorphic code. Inlining copies the function to the call site; specialization makes a copy of the function that is specialized to a particular type. For polymorphic code you can make multiple copies for different types, and each copy can be as efficient as if you had written monomorphic, not polymorphic, code. This is one example of specializing. Let's move on; I think we don't have much time.

You need INLINABLE for SPECIALIZE to work across modules. You mark the function INLINABLE at the definition site because, to specialize it in a different module, its definition has to be available in the interface file. INLINABLE does that, so you can put a SPECIALIZE pragma in the other module. This is confusing initially — what does INLINABLE have to do with specialization, one may think — but its meaning is overloaded. What it does is put the definition in the interface file. If you understand that, then you understand that it can be used for specialization or for inlining in a different module. So at the call site, once you have ensured the definition is in the interface file with INLINABLE or INLINE, you can use SPECIALIZE in a different module.

Another important optimization is call-pattern specialization, which I actually used in the unicode-transforms library. Think of it like this: a data structure has several constructors, and there is a recursive function which takes that data structure as an argument. That recursive function can be specialized for each constructor of the data structure: for each constructor you get a separate copy of the function, and each copy can be made more efficient for its constructor. You can force it by passing a special argument called SPEC: that forces the compiler to apply constructor specialization, and it is heavily used in vector, and in Streamly as well. For recursive functions over a type with different constructors, if you want a specialized copy per constructor, just use the SPEC argument.

So when can specialization not occur? If the function is not fully applied, it can't occur. If the function calls other functions which cannot themselves be specialized, then specializing that function itself won't help. And if the function uses polymorphic recursion, it cannot be specialized.
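Here is a sketch of both forms: cross-module specialization with INLINABLE plus a SPECIALIZE pragma, and the SPEC idiom for call-pattern specialization as used in vector's and Streamly's loops (all names here are toy examples):

```haskell
{-# LANGUAGE BangPatterns #-}
import GHC.Types (SPEC (..))

-- Cross-module SPECIALIZE: INLINABLE puts the unfolding in the .hi
-- file, so an importing module can make a monomorphic IO copy with
--   {-# SPECIALIZE sumM :: [IO Int] -> IO Int #-}
sumM :: Monad m => [m Int] -> m Int
sumM = fmap sum . sequence
{-# INLINABLE sumM #-}

-- Call-pattern specialization (SpecConstr): threading the SPEC
-- argument through a recursive loop asks GHC to specialize the loop
-- on the constructors of its arguments.
data Step s = Yield Int s | Stop

sumStream :: (s -> Step s) -> s -> Int
sumStream next = go SPEC 0
  where
    go !_ !acc s = case next s of
        Yield x s' -> go SPEC (acc + x) s'
        Stop       -> acc
```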
The third technique is strictifying, and I have written "buffers" on the slide because it is usually needed when you are buffering data; that's when it is required. When you're processing data, you use streaming. When you are buffering data, you want buffers that don't take up too much space: they should be fully reduced most of the time. You don't want to keep lazy expressions around that are ultimately going to be reduced anyway, because that puts pressure on the GC. So as a general rule: be lazy for construction and transformation, and be strict for reduction. When you have to reduce, make everything strict. When you are doing a foldr, for example, that's transformation, so lazy structures are fine; when you are constructing new structures, be lazy. Left folds are reduction, so use a strict accumulator; that helps a lot in certain cases. And use strict record fields. Unboxing is another technique: you can use the UNPACK pragma to have the constructor fields unboxed. So wherever you have a buffering data structure, something you are using like a buffer, strictify it, and unbox it if possible; that will help. Unboxing may not help in all cases: if you are using the value again and the compiler has to re-box it, it won't be as effective. But if you are just keeping it there as a buffer, unboxing will help.
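A tiny sketch of the strict-reduction rule, with a left fold whose accumulator is forced on every step (toy example):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Strict left fold: the bang on acc forces the sum at each step, so no
-- chain of lazy (+) thunks builds up during the reduction.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs
```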
All right, we have 10 minutes; let's come to measurement, to benchmarking. How do you do benchmarking? There are two tools, gauge and criterion. Criterion is the go-to tool; benchmarking in Haskell kind of started with criterion, and that's the gold standard for benchmarking. But there is a newer tool created by Vincent Hanquez, gauge. He removed a lot of dependencies from criterion, and it was moving fast, so I contributed quite a bit to gauge; it builds faster than criterion as well. We faced several benchmarking issues during Streamly development and made significant improvements to gauge to address those issues, and we also wrote the bench-show package to be able to analyze benchmarks at scale.

So what are the pitfalls in benchmarking? The benchmarking code needs to be optimized in exactly the same way you would optimize your regular code; otherwise your benchmarks are invalid. We saw how a missing INLINE caused a benchmark to run very slowly: your benchmarking code itself has to be fast. In one case, I found that the rnf implementation for a particular data structure was slow in itself. Benchmarking code uses the NFData instance to reduce the data structure deeply; that's how it works. So if the implementation of the reduction itself is slow, the code looks very slow, but actually the code is not slow: the measurement is slow, the NFData instance is slow.

The other critical problem was that multiple benchmarks can interfere with each other. In one case, I saw that a benchmark listed later in the list of benchmarks performed much slower: I was comparing multiple streaming libraries, and the library that happened to be benchmarked last came out the worst. The problem was sharing. So what we did was run each benchmark in a separate process, starting from scratch for each benchmark. That's what I introduced in gauge; gauge can run each benchmark in a separate process.

Another pitfall: you could be measuring nothing, even with nfIO. With nf, normal form, in the pure case, you use a function application; in the IO case, you don't use a function application. But in many cases it is the pure component of the IO action that takes all the time, and then nfIO doesn't force anything: the result comes out in nanoseconds where it should be milliseconds or microseconds. There was one improvement in gauge to address that as well; it's in criterion too. So we run each benchmark in isolation. Other things added to gauge: measuring all the getrusage statistics, which include RSS, page faults, and context switches, so you can gather all that information with gauge. And there is a quick mode that I added, so you can measure in less than a second instead of waiting a long time. If you're running hundreds of benchmarks, you don't want to wait minutes for them to finish, and the difference in the measurement is not that much.
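Here is a minimal gauge benchmark showing nf for pure code and nfIO for IO actions (gauge deliberately mirrors criterion's API, so the same code works with criterion by swapping the import; the benchmark bodies are toy examples):

```haskell
import Gauge.Main (bench, defaultMain, nf, nfIO)

main :: IO ()
main = defaultMain
    [ -- pure case: nf applies the function and reduces the result
      -- to normal form
      bench "map-inc" $ nf (map (+ 1)) [1 .. 100000 :: Int]
      -- IO case: nfIO runs the action and reduces its result
    , bench "read"    $ nfIO (readFile "/etc/hosts")
    ]
```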
All right, so let's move on to bench-show. In Haskell, benchmarking is serious business. I don't want to miss the case where an INLINE got dropped and some function became 100x slower. So I always want to measure all the operations, all the combinators in Streamly; there are hundreds of benchmarks, one per combinator. Now, with that many benchmarks, when you run them, how do you know which one regressed, and by how much? You would have to go through pages and pages of output: this one was this much, and now it is that much. bench-show solves that problem.

For stable comparison, it uses three statistical estimators: linear regression, median, and mean, and we try to take the best of the three, whichever gives the minimum difference. You may use whatever you want, but that's what I use by default with bench-show. Then it computes the percentage regression or improvement, because benchmarks differ in scale: one benchmark regressed by 45 milliseconds, another by, let's say, 45 microseconds, but both of them may have regressed by 5%, or 2%. From the absolute values, you can't figure out which one to focus on: the one that regressed by 50% may be the one I need to attend to, rather than the one that regressed by 45 milliseconds. So we compute the percentage difference and then sort by it, whichever regressed most, and then I have a list of what I need to attack, starting with the first one in the list: the benchmark that regressed most. And we can automatically report regressions on each commit using this. It uses a threshold as well: you can configure it to report only if a benchmark regressed by more than 5%, or you can say, fail if something regressed by more than 5%.

This is one of the outputs, percentage difference, and you can see it is sorted: the first one regressed by 596%, and it keeps going down from there. You can also see the results visually as graphs. Here you can see the same thing visually; the second graph shows the delta from the first one, how much it regressed, not the absolute comparison. And you can see the negatives as well, which are actually improvements in some of the benchmarks.

So how do you compare packages? With bench-show, you can group benchmarks arbitrarily into different groups and compare those groups with each other. You can say these benchmarks are for conduit, these benchmarks are for pipes, and then you can compare the two sets. They might be in the same file, in the same benchmark suite, but you can group them into different groups and then compare those groups.
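A sketch of driving bench-show from Haskell to produce the sorted percentage-difference report just described. The report function and the Config fields are as I recall them from the bench-show documentation; treat the exact names as an assumption:

```haskell
import BenchShow
    (Config (..), GroupStyle (PercentDiff), Presentation (Groups),
     defaultConfig, report)

-- Read gauge/criterion CSV results and print a textual report where
-- benchmark groups are compared by percentage difference.
main :: IO ()
main = report "results.csv" Nothing
    defaultConfig { presentation = Groups PercentDiff }
```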
This is one of the results of bench-show, comparing Streamly, which is the first one, with the streaming library, conduit, and pipes. The benchmarks are all in the same file; we grouped them — these are for conduit, these for streaming, these for pipes, these for Streamly — and then compared. It is very flexible; whatever you can imagine, it can do. I was talking about high performance: Streamly is focused on very high performance, and you can see its bars are so low you can't even see them.

Okay, so pure streaming. This is the time comparison, again using bench-show. I have a package called streaming-benchmarks which compares all the streaming libraries as well as pure streaming, that is, lists and vector. Streamly also has a pure module which can do the same things lists do. The first one is Streamly, the second is the standard Haskell lists, and the third is vector. You can see vector is not performing even as well as lists in these cases, perhaps because it does a lot of memory allocation. This is a million elements, going through filter, drop, et cetera; these are filtering operations. There are more in the streaming-benchmarks package: elimination operations, all kinds of operations across the different libraries. You can even compare bytestring and text to see where they stand; I didn't have a fair comparison, so I didn't include them yet.

And this is the space utilization for what we just saw in time. Vector is using 60 MB in many cases, whereas the pure stream and lists take a lower amount of space. But don't conclude that vector is bad: Streamly actually uses the same approach as vector, and there may be some improvements that can be made to vector. Also keep in mind that these are streaming operations; vector supports more operations, mutable array operations, where lists and streams may not work well.

All right, we are at the end. Here are some links you can go through, and you can send me an email, or there is my Twitter handle, so you can contact me through these means; my email id is on my GitHub account as well. I also need feedback, for example on the benchmarks: tell me what is wrong with them, whether I am measuring things fairly, or whether something just doesn't make sense. Any questions?

Q: By making all these changes to the source code, obviously it will not be idiomatic anymore once you start optimizing heavily. How far do you drift?

A: Yeah, I had a point about that which I breezed through. It was something like 50 lines of code doing all of the normalization initially. If I count all the lines in the file now, it became something like 500 lines, but that also includes some fundamental operations like streaming and unstreaming. If I remove that code — and I also haven't optimized composition; decomposition is very well optimized, and perhaps both could reuse the same code — it could be compacted quite a bit. Also, I didn't know about vector at the time; I could have used vector for the mutable operations, and then I might not have needed the stream/unstream code from text. With that, it could still be fairly idiomatic. But yes, to optimize the fast-path code you do need to drift from idiomatic style; you can't do everything the same way if you want high performance. Still, it is not worse than C: the ICU C++ library, even if you cut out just the part used for normalization — that library does a lot more things — is quite a bit longer than the Haskell code, much longer. This could be maybe 100 lines.

Q: Hi. My question was, how do you decide which functions to inline and which not?

A: Always measure. That's why I said benchmarking is serious business.

Q: So you measure the function, and if it is fast, you inline it? Is that how it works? I don't understand how measuring tells you which parts to inline.

A: In my case, I had comparisons: there are other streaming libraries which I'm comparing against, and there is the vector library. They are already optimized, so I have a baseline. If it can go that much faster, why is it not going that fast? And in many cases you can also see that the number is simply not acceptable, that it can actually go faster. Based on my understanding of the code, it can't be that slow — so why is it that slow? Based on that, I can sense that something is missing and I need to do better; it might perhaps be some inlining that's missing.

Q: That means you have to understand the code really well, and that probably works only for experienced folks.

A: Yes, you have to understand the code, but you also know where you want to get: if it is not as good as the baseline, it is not acceptable. And usually even the idiomatic code could be acceptable; why would you optimize if you don't need it? If you need it, then you go and look.

Q: Generally you would run a profiler, look at the profiling results, and then decide something?

A: Yes, there is the profiler as well.

Q: But you did not talk about profiling at all.

A: Yeah, I didn't talk about that; I don't use it very often. I have used it maybe once. I usually have a good understanding of the code, and during development I know the baseline as well: this is how the other libraries are performing, so we have to do at least better than that. So I didn't use the profiler very much, and I don't think I will use it often even in the future, because based on my understanding of the code I can figure out whether this is acceptable or not; I have a good idea of the ballpark. Thanks.

Q: In one of the slides you had this GHC option, -f-something to do with inlining: expose all unfoldings. What exactly does that do?

A: Yes, -fexpose-all-unfoldings, and -fspecialise-aggressively, all of these; there are actually two questions here. Good question. -fexpose-all-unfoldings is equivalent to marking everything INLINABLE.
So everything becomes available for inlining.

Q: What does that do to the compile time?

A: It will explode your interface files.

Q: Is it just that the .hi file sizes will increase, or will compile time also slow down?

A: I haven't measured it. And there was the other one, -fspecialise-aggressively. That does a similar blanket thing for specialization. There is no finer control; these are blanket options: everywhere, just specialize more aggressively. Normally the compiler chooses to specialize this function over that one — maybe this one I want to specialize and this one I don't. With this option, it specializes as much as it can.

Q: A follow-up question. Say you're writing an application, not a library, and your application depends on a hundred libraries, either directly or transitively as dependencies of your libraries' libraries, and you want some of the base libraries to be compiled with these flags enabled. When you're doing a stack build or a cabal build, are packages like vector, text, bytestring, which are used everywhere, already compiled with these flags turned on, or does it depend on your application's GHC flags?

A: Usually the authors of these libraries will expose the functions that might be required for inlining; it's at their discretion. But if they didn't do that, then you may need these options, for example -fexpose-all-unfoldings. In stack, for example, and even in cabal, you can say that all the dependencies should be compiled with this flag; in that case, all the functions in those dependencies become available to you for inlining or specialization (see the sketch below).

Q: So the flags you put on your application's executable are used for all the dependent libraries as well?

A: That depends on the build tool. You have a way to say that all dependencies will be compiled with the flags you specify; with stack I think you can control that. With cabal, actually, I think by default it passes the GHC options to all the dependencies, and that could be a problem as well.
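For reference, here is how applying such flags to every dependency can look with the two build tools; a sketch based on the documented stack.yaml and cabal.project syntax:

```yaml
# stack.yaml: "$everything" applies the options to all packages,
# dependencies included.
ghc-options:
  "$everything": -fexpose-all-unfoldings -fspecialise-aggressively
```

```
-- cabal.project: "package *" matches every package in the build plan.
package *
  ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
```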
Do we have time for one more question? Yeah. Those who want to go can go; we can continue.

Q: Can you move back to the specialization slide, the one which talks about what specialization is? There are two slides, definition site and call site... yes, this one: instead of calling a polymorphic version of a function, make a copy. So this specialization has nothing to do with type classes, right? It's for when you're writing a polymorphic function. You specialize the instances, right? But instances are by definition specialized.

A: Not really. The functions may be polymorphic in some variable which you want to specialize. So here is an example: this is a type class member function, and I specialize it to the IO monad instead of the polymorphic variable m. I can show examples offline.

Q: So this works for any monad m, and then you're saying, I want to specialize it for IO. But how exactly is it specialized for IO? I mean, if the compiler knows what code to generate when specializing it for IO, why does it even need this pragma? Why not just do it?

A: Okay, so let's take this one offline, if you have any other questions. All right. Thank you.