to detect a little bit of a pattern. Yesterday, one of our speakers was talking about quantum physics as part of his talk. And there seems to be something with Rubyists that transition into Elixir one way or the other: they seem to also pick up some physics along the way. This talk is going to be about string theory. I expect particle collisions of all sorts. And what can I say about our two speakers? Well, when it comes to James, he's currently building his very own Castle Grayskull. Yeah, that's worth a round of applause, I think. Yeah. And Nathan, well, he's not only a developer, he actually has a past in helping President Obama counter terrorists using hashes. Please welcome to the stage James and Nathan. This talk is not going to really have much to do with physics. As far as I know, string theory is something about knitting. I don't know. Basically, it's a pun. So what we are going to talk about is all of the different stringy types in Elixir, including IO lists and atoms and fun stuff like that. I've never met this guy before, so hopefully this is going to go well. I'm James Gray. I work for NoRedInk. We have Elixir in production, and we're hiring. And I'm Nathan Long. I work for Big Nerd Ranch, and we would love to help you with your Elixir projects. We're not actually going to do Q&A in this session. If you have questions, you can grab us afterwards or start a post on the Elixir Forum about it. Nathan and I have been friends for a long time, so we have these conversations with each other regularly. And I just want you to pay attention, because I'm pretty sure when we divided up this talk, somehow I ended up with the sections that look like naming things and cache invalidation. So I'm not really sure what's going on there. When I was on the Ruby Rogues podcast, we spent a lot of time trying to get our definitions right. It kind of became the running gag of our show. And here I am trying to get the definitions right so we can actually talk about this stuff.
But that's what we're going to do. So there are lots of different kinds of strings in Elixir. The first one is the single-quoted string, which is different from the double-quoted string. It produces a list of integers that represent the code points for that data. This is primarily for compatibility with Erlang, which will generally return strings in this format. So it's not something you'll typically use in straight-up Elixir code. Mostly you'll get these back from Erlang or you'll be sending these to Erlang. There's another kind of stringy-like thing, and that's the atom, with a colon and some characters after it. The reason this exists is that an atom is just a name in a table somewhere and a number that points to that name. So comparing atoms for equality is extremely fast, because you just have to compare the numbers and see if they're the same. As we'll get into a bit later, comparing strings in the same manner is much more complicated. Beware: atoms are not garbage collected in the BEAM. So if you do something dynamic, like convert every incoming parameter to an atom, you open yourself up to denial-of-service attacks. You can exhaust the atom table and the BEAM will crash. A way to work around this is to use the String.to_existing_atom converter, which will only do the conversion if the atom already exists in your program. Okay, now to the double-quoted stringy-type things, of which there are three. A bit string is just a sequence of ones and zeros between double-angle brackets. So anything between double-angle brackets is a bit string. Binaries are bit strings that happen to come in 8-bit, or byte, chunks. So if you're dealing with bytes, then it's a binary. If your binary's bytes happen to represent UTF-8 code points, then you can refer to it as a string. This is Elixir's double-quoted string, the normal string. Okay, so these are kind of like subsets of each other. Bit strings is everything.
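To make those distinctions concrete, here's a quick IEx-style sketch. These exact checks are mine, not the talk's slides, but every claim they assert comes straight from the definitions above:

```elixir
# A single-quoted string is a charlist: a list of code point integers.
'abc' = [97, 98, 99]

# String.to_existing_atom only succeeds for atoms already in the table,
# which guards against untrusted input exhausting the atom table.
:ok = String.to_existing_atom("ok")

# A double-quoted string is a UTF-8 binary, and every binary is a bitstring.
true = is_binary("hello")
true = is_bitstring("hello")

# A 3-bit bitstring is not a binary: it doesn't divide into whole bytes.
true = is_bitstring(<<1::3>>)
false = is_binary(<<1::3>>)

# And a binary only counts as a string if its bytes are valid UTF-8.
true = String.valid?("héllo")
false = String.valid?(<<0xFF>>)
```

Every line is a pattern match, so each one crashes if the claim is wrong.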
Binaries are when the length is evenly divisible by 8, and strings are UTF-8-encoded binaries. They're subsets of each other. Now I'm going to introduce one more data type here, and this one is a little bit trickier to get your head around. So I'm going to see if I can walk us into the definition. Let's say we wanted to print out three strings. One way we could do it, and the way we have to do it in a lot of languages, is to concatenate them all together. This would create a fourth string, and there would be a lot of memory copying, right? We've got to copy the first one in, the second one in, the third one in. We could do this with interpolation, but that's just a syntax trick. It's still the same thing under the hood: we're still building the fourth string, we're still doing some memory copying. Elixir lets us instead use lists. We could send the data to puts via a list. And in this case, we're not actually building a fourth string. We are allocating some small pointers for the list, right? But they in turn point directly at the existing binaries. So we didn't have to copy memory around and stuff like that. We saved a little bit by using these lists. And if you think about this, right, a linked list is a head that points at some data and a tail that typically points at another list. So we've actually got four lists here, right? One with "James" at the head, one with " and " at the head, one with "Nathan" at the head, and the empty list at the end. We can actually make it fewer if we use some kind of strange nesting and stop using proper lists. Now we're down to two lists, right, to represent the same data. We've cut it in half. And this format may seem a little strange, but it's actually very quick to append to, which is a problem that lists usually have, right? When we use the list concatenation operator, it's O(n): every time we call it, we have to copy the entire list to put something on the end.
But using this improper format, we can append very quickly, right? It's O(1). You end up with this strange nested structure, but Elixir's IO functions don't care that you give it this strange nested structure. So I've told you these are called IO lists, but if you go look up IO.puts in the documentation, it's going to tell you it takes a chardata argument. So you probably think I'm lying. And the answer is, it gets pretty complicated just to read this stuff out. If you read the Elixir Forum, you can actually go back and look at me trying to sort this out. I got lots of really helpful information from many members of the community, including José Valim and Robert Virding, one of the co-creators of Erlang. And then after they were done helping me, they hijacked my thread to debate the best ways to implement strings in a programming language. So it gets pretty confusing. But this is my best attempt at solidifying all the definitions of things here. The first three are those subsets of double-quoted strings that we discussed before. A charlist is your compatibility string, your single-quoted compatibility string. IO lists, IO data, and chardata are all quasi-related to each other. The main difference is that chardata has that UTF-8 code point assumption that IO lists do not. And the reason I introduced them to you as IO lists is that if you look at blog posts and other materials, that is typically the name that gets thrown around. The chardata name is not as widespread, in my opinion. So, we're going to talk about UTF-8 strings: Unicode. If you're anything like me, possibly your first introduction to Unicode was stuff not working in a web page. So before we talk about Unicode, let's talk about ASCII. If I run `man ascii` on my computer, I get this. And what this shows me is that ASCII is just a mapping. Here's a bunch of characters we want to be able to type, and we're going to assign each one of them a number. So here's our beloved alphabet.
And if you want to encode ASCII in a way that will let you store it on a disk, all you do is take the number each character is assigned and convert that to base two, binary. And you can store that on a disk, you can transmit that over a network, and it's pretty simple. What's interesting is that all of the ASCII characters will fit in a single byte, because there aren't that many of them. So you can see that the red zero is there to tell you: hey, look, this bit is empty. It's always going to be empty. But Unicode wants to let us type a lot more than those characters that we Americans were typing in the seventies. We want to be able to type accented letters. We want to be able to type Greek letters. And we want to be able to type this Han character that means to castrate a fowl. We want to be able to type more than words, also. We want to be able to type pictures, right? We want to be able to do a lot of things. We want to be able to have emoji for laughing and crying and being upside down and having dollars in our mouths. So Unicode lets us do all these things. Now, exactly what gets included in the Unicode standard is kind of a political thing. And some people have felt like they got, you know, short shrift. So, for example, there's a thing called Han unification, whereby people who type Japanese and Chinese and Korean have been asked to share some characters, ostensibly to save space in the Unicode standard. Even though Unicode did see fit to include playing cards and alchemy symbols and ancient Greek musical notation. And they're adding Linear B, which only scholars would ever type. So you can see why they're miffed. But in the end, what's important for us is that this is just a mapping. So we can include anything we want to. In theory, we can fit all human characters in here. We're just saying: look, we've got capital A, we've got lowercase lambda, we've got man in business suit levitating. And each of these is assigned a number. We have a mapping, number to character.
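You can poke at that mapping directly in Elixir. A quick illustrative check of my own, not a slide from the talk:

```elixir
# ?X gives the code point a character is assigned in the mapping.
65 = ?A
955 = ?λ

# For ASCII characters, the single stored byte IS the code point.
"A" = <<65>>

# And what goes on disk is just that number in base two.
"1000001" = Integer.to_string(?A, 2)
```

So for the ASCII range, "encoding" really is nothing more than writing the assigned number down in binary.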
And now, once we have this mapping, we have, you know, here's what it should generally look like; font designers, please go include this in your font if you really want to. So there are a lot of code points in Unicode. There are somewhere around a million. And obviously, they're not all going to fit in one byte. You can see that our beloved Han character here takes up a good bit more than one byte. Some code points are going to need more bytes than others. So how do we handle that? We could say: well, we'll just give all of them four bytes, so that we know how long each code point is, and there you go. That's not great, right? Because then the letter A is going to take up four bytes when it could have been one. It's not very efficient. So the going solution these days is called UTF-8, and it's an encoding for Unicode. You take your code point and you encode it into one of these four templates. There's a one-byte template, a two-byte template, all the way up to the four-byte template. You take the binary that you need and you slot it into one of these templates, the smallest one you can get it in. So let's see an example of how this works. Here's the clock character. That happens to be code point 9200. If you give any character after a question mark like this in IEx, it's going to tell you the code point for it. So the code point is 9200. That's the number that character is assigned. If I convert that to base two, those are the bits we need. That's the bare minimum we have to encode somehow. So the way we put that in UTF-8 is like this: we just divide it up, slot it into the non-header parts of the template, and pad with zeros on the left. And there you go, that's what we have. There are some really cool things that UTF-8 encoding enables, but first let's see if this is what Elixir actually does.
So if I use the `i` helper; the `i` helper in IEx is great, by the way, to inspect any piece of data you should use it. If I use the `i` helper on this string, it tells me that the raw representation, what's in the binary, is these three numbers: 226, 143, 176. If we map over those and convert them to base two, that kind of looks like UTF-8, doesn't it? It looks exactly like UTF-8, because that's what it is. That's how Elixir stores this stuff internally. So let's take a look at what UTF-8 is actually doing. The cool thing is that there are three kinds of bytes in UTF-8. There's a solo byte, like we saw with ASCII characters like A, whose high bit is always zero. And the reason is because we don't need all that space. UTF-8 is backwards compatible with ASCII: whatever something is in ASCII, it's going to be the same in UTF-8, which is handy for those of us who are American and want things to be the same as when we typed them in the seventies. So we have solo bytes. We have continuation bytes: any byte that starts with 10 says, I'm following after this other one. And then we have leading bytes. If a byte starts with 110, it says, I'm the first byte of a two-byte sequence. If it starts with 1110, it says, I'm the first byte of a three-byte sequence. So if you look at these two characters: the A is a solo byte, it starts with zero. And this roasted sweet potato character begins with a four-byte header. It says, hey, I'm four bytes long, and then you can see the three continuation bytes. So this is cool because it enables us to do things like string reversal correctly. Say you have a string that looks like this, a couple of single-byte characters followed by a three-byte character, and you want to reverse it. You would not do this. This would be terrible, right? If you do this, you've scrambled what was the last character, because its bytes got reversed. That would be a bad reversing algorithm.
Instead, you want to do this: you want to keep that three-byte character intact and put the solo one after it. But it gets even more complicated, because not only do we have multi-byte code points, we have multi-code-point graphemes. A grapheme is what a human being would generally consider a written character on the screen. Some characters can be written with multiple code points, like this e with an accent can be written as: here's an e, and here's an accent that goes on it. Those are two different code points. That's called combining diacritical marks. So here's an example: here's an e, and then here's a combining diacritical mark to go on the e. If you ask Elixir, give me the code points for this, you get five code points out, and hilariously the display of that puts the accent mark on the quotation mark, which is pretty broken. And if you ask for graphemes out of that, then you get what you would actually consider the characters, like a human being would consider the graphemes to be the characters. So you may wonder: if you can stick diacritical marks on something, how many can you stick on there? Can you stick as many as you want? And the answer is yes, you can. This is called Zalgo text, and it's horrible. You may see things like this on the internet. You may see questions on Stack Overflow where people ask how they can prevent users from doing this to their web page, and you may see snarky remarks where people do it to them. And the answer is: you cannot prevent users from doing this to your web page. Because there are actually languages that need this kind of thing. Like, these folks from Peru have a tonal language, and they want to be able to put multiple marks on letters in order to represent those tones. So deal with it. The best thing you can do, if you think it's unreasonable to have a giant column of those characters, is maybe use CSS to hide some of them.
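Here's a small sketch of that code point versus grapheme distinction using Elixir's String functions. This is my example, built from a decomposed e-plus-accent rather than the talk's slide:

```elixir
# "e" followed by U+0301, the combining acute accent: two code points.
decomposed = "e\u0301"

# Three bytes in the binary: one for the e, two for the combining mark.
3 = byte_size(decomposed)

# Two code points...
["e", "\u0301"] = String.codepoints(decomposed)

# ...but only one grapheme: what a human would call one character.
[g] = String.graphemes(decomposed)
true = g == decomposed
1 = String.length(decomposed)
```

So bytes, code points, and graphemes are three different counts for the same piece of data.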
But Unicode is there to support this kind of thing, not to prevent it. So all of this complication makes strings pretty difficult to deal with. Traversing strings, asking for the length of a string, becomes difficult. Because in order to find out how long a string is in terms of graphemes, you have to walk through that string, combine the bytes into code points and combine the code points into graphemes, and then say, okay, how many graphemes did I get? If you want to index into a string and say, give me from the second character to the third character, well, if you want from the second grapheme, you have to do all of those same operations. Length is ambiguous: you have to specify, do you want how many bytes are in this, or do you want how many graphemes are in this? String.length in Elixir is going to give you the grapheme count. But again, that's O(n). Reversal can be tricky, because when you reverse, again, you have to be thinking in graphemes. The Elixir example here does this correctly, but the Ruby example messes it up; look at what happened to the grapheme on the end. Equality is tricky because, hey, sometimes there's more than one way to write something. You can have an accented e written as a single character, a single code point, or you can have it written as an e followed by a combining-mark code point. And if you ask if those are equal, they're not equal; they don't have the same bytes in them. But if you want to know if they're equivalent, Elixir has a function for that, String.equivalent?, which kind of converts them into some canonical representation and then compares them. You can use that for strings with international kinds of characters and whatnot. Casing can be tricky, even though casing is implemented via basically a case statement: we say, hey, what's the uppercase version of this character, or whatever. A lot of languages do this wrong. Until the upcoming release, actually, Ruby is still getting this wrong right now.
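The equality and reversal points can be checked directly. A hedged IEx-style sketch of my own, not the talk's slides:

```elixir
precomposed = "\u00E9"        # é as a single code point
decomposed  = "e\u0301"       # e plus a combining accent

# Byte-for-byte they are different...
false = precomposed == decomposed

# ...but String.equivalent? normalizes before comparing.
true = String.equivalent?(precomposed, decomposed)

# Both are one grapheme long, and String.reverse keeps the
# multi-code-point grapheme intact when reversing.
1 = String.length(decomposed)
reversed = String.reverse("ab" <> decomposed)
true = reversed == decomposed <> "ba"
```

A byte-wise or even code-point-wise reverse would have split the accent off its e; the grapheme-aware one doesn't.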
And even Elixir, which has a really nice case statement in the form of a bunch of function heads to do this operation, doesn't get everything perfect, because human language is ridiculously complicated and can't be contained. This sigma character, when you downcase it according to the actual grammatical rules, has to be downcased differently depending on whether it's at the end of a word or not, and Elixir doesn't do that. Here's the downcase, all right. If you need to do it more correctly for Greek, then write your own function. If you get only one thing out of this talk, I hope it's that there's an emoji for a man in a business suit levitating. Is that not the most amazing thing ever? I don't know, it's pretty cool. Also, his section is funnier than mine, did you notice that? Pretty sure this division was not fair. We're going to revisit our friendship after this. Okay, so let's do cache invalidation. The BEAM has certain rules, right? Processes are isolated, data is immutable, and therefore if you send a message from one process to another, we do a full memory copy of the message, because we have to move it into the new memory space, right? Because of this, there exist some optimizations in the BEAM. The BEAM actually has a handful of different types of binaries, but here we're primarily concerned with two categories. If your binary is small, 64 bytes or under, it's allocated on your process's heap like all other data types, lists or whatever, and garbage collected off of that heap normally. If it's larger than that, it's actually stored in a space called the large binary space, and internally you just get a tiny reference to this external data. So you're probably wondering why; it seems like a strange implementation detail, and the answer is because it's awesome, right up until the point where it causes you horrible problems. So we probably want to be aware of this, so you can understand what it's doing for you and what it's doing to you. So the win is this.
If you are passing big binaries between processes, it's very fast, because you only have to copy these tiny references, and it's just super quick. And it makes sense that we would be passing large binaries between processes a lot. Think about HTTP responses: if you fetch some data off of a JSON API, it's a giant wall of data. If we do rendered sections of Phoenix templates, they're going to be big chunks of HTML and boilerplate and stuff. So we do this a lot, and it's good for this to be fast. But here's the problem. That data exists outside of processes, therefore it can't be garbage collected normally like the stuff in processes, so it has to be reference counted. Every time we allocate one of those tiny little references, the counter goes up. Every time one of those tiny little references is garbage collected, the counter goes down. When it hits zero, we can get rid of the thing that's stored externally. But what if your process doesn't hit garbage collection for a long period of time? It's possible for this to happen. Say you built some system that was pulling data off of the web and sending it through a process that just looked in that giant text for the three things you care about and found them. If you don't do much memory allocation in that sequence, you may never get to the point where you need to be garbage collected, and so those references stick around and keep that data alive in the large binary space. If this happens, you might begin to see things like memory leaks over time, and if you take it to extreme scenarios, it will actually crash the BEAM. So this can be a big problem. It has affected people for real. Heroku has this great write-up of them trying to find it in one of their products, Logplex, I think. It's like a long six-month search of them figuring out what's going on, trying to find ways around it, and things like that.
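You can watch this effect with :erlang.memory. This is my own sketch, not Heroku's code, and the exact numbers will vary from run to run:

```elixir
# :erlang.memory(:binary) reports bytes currently allocated for binaries,
# including the shared large-binary space outside any process heap.
before = :erlang.memory(:binary)

# Binaries larger than 64 bytes are reference-counted and stored off-heap;
# a megabyte binary definitely lands in the large binary space.
big = :binary.copy("a", 1_000_000)

after_alloc = :erlang.memory(:binary)
IO.puts("binary space grew by roughly #{after_alloc - before} bytes")

# While `big` is referenced anywhere, the reference count keeps that whole
# megabyte alive, no matter how tiny the reference on the process heap is.
1_000_000 = byte_size(big)
```

If a long-lived process holds such a reference and never collects, that megabyte never goes away; that's the leak scenario described above.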
Avdi Grimm, who you probably know from the Ruby community, came and checked out Elixir for a time, hit this issue, got really frustrated with both figuring it out and what to do about it, and we kind of ran him off. So it's something we have to stay aware of and know what's going on. So here's one way you might be able to find it. Erlang keeps track of the memory it allocates in various areas, and this particular code will ask it what the binary allocation currently is. This code is a little bit tricky. I can't seem to get it to work in IEx, for example, but it does work in the general case. Also, I believe Observer has the same information under the allocators tab, so you can choose to look it up there as well. If you do start seeing these problems, what you should do about it is find the right place and the right time to force garbage collection. And we're talking about a generational garbage collector here, so not all garbage collections are created equal: they don't all visit all the bits of allocated memory. So there's a bit of trickiness to getting this right and finding what works; there's a lot of information about garbage collection out there that you can use. One idea I've had is that it may be better to have your processes live for shorter periods of time: the best garbage collection is to exit. Then you're not holding any memory at all. So in that example I gave before, if you have a process that's analyzing incoming data over time, maybe it's better to spin up a process, analyze a particular chunk of data, have that process exit, and when the next chunk comes in, spin up a fresh process and analyze that data. Do it that way, and you won't have these long-lived processes with references to that data, and hopefully the large binary space will get cleaned up more often. So, James talked a little bit about IO lists, and just to kind of give you a quick refresher of what they are.
So if I do this puts with an explicit concatenation, we get out what you'd expect. I can also do the puts with a list of strings. I can put a code point in there, and we get the nice snowman. And I can give it this deeply nested list of strings, and that all works just fine; it all comes out just as if I had given it a single concatenated string. So when you're doing IO operations, like puts to standard out, file writing, writing to a socket, stuff that goes outside of your program, it's going to be able to use IO lists just as well as if you'd built the final string yourself. So this enables some cool things. It enables things like string reuse. If I want to build up an output that has users' names wrapped in li tags, then I can do this, where I've allocated these strings once and just keep putting references to them in the list. I'm not allocating those li tags into new strings over and over again; I'm just referencing them as needed. And then when I do the puts, it's going to come out just as if I had copied those strings. So that's cool. The benefits of string reuse: one is that we skip doing the work of concatenation, of continually allocating a new string and copying characters from one place to another. Also, the fact that we're not allocating those strings means we use less memory. And the fact that we're not allocating those means we have fewer things to garbage collect, so that's less garbage collection work. So that's all cool. IO lists are for IO. Like I mentioned, this is anything your process is doing to talk with the outside world; writing to a file and sending data over the network are probably the big ones. So let's talk about system calls. System calls are things that allow your program to do things that the operating system normally manages. So we can say to the operating system: please write this data to a file for me, and the operating system will take care of that for you, and your program doesn't have to know the details of how to get in touch with the disk and
what protocol it expects me to speak and all that kind of stuff; the operating system handles that for you. So there are actually different system calls that can be used to write data to a disk or a socket. One of them is write. It's very straightforward: it just writes the data. The way write works is, we say, please go to this address in memory for me, pull this many bytes out of it, and write it out, and the operating system will do that for you. Then there's writev; the v is for vector, so it's a list of things. We can say, please write this item and this item and this item, and we give it an address for each one. So here's a code sample where we're going to take advantage of these things. In the first line, I'm opening a file. I'm using the Erlang file function, and it's important that I'm passing raw. I'm not going to go into all the details of what that means, but I'm opening this file in raw mode. Then I'm going to allocate a couple of strings and make a list that has those strings in it; the foo string is repeated. Then I'm going to join those into one big fat string at the end and write it to the file. I'm using a DTrace script that I got from Evan Miller's blog post, Elixir RAM and the Template of Doom, which is a great blog post; you should read it. Using that DTrace script, I can see the system call being used here. It says: go to this address in memory and write nine bytes for me. The foobarfoo that you see there is just for us to see as humans, from the DTrace script; it's not actually part of the system call. Now watch what happens. See the line where I'm concatenating those strings together, the Enum.join? I'm going to comment that out, and watch what happens. All of a sudden, we're now using writev. And writev says: go to this first address and write three bytes, go to this other address and write three bytes, and go back to the first address again and write three more bytes. So this is actually really cool, because if you think about what's happening,
I'm never, ever building the completed output in my program. The ultimate contents of the file never exist in my program. The only place they ever exist is in the file I'm writing to. So I didn't have to allocate the full contents of that file. Imagine that was going to be a multi-gigabyte file where a lot of the stuff was repeated: I never had to allocate that, it just gets put into the file. So this is really cool. So, what kind of IO operation has a lot of repetitive strings? Hey, HTML has a lot of repetitive strings. When we're doing web responses, we'll have snippets that are repeated in the page, like an li tag that you have over and over, and we're going to have chunks that might be repeated not only within the page but across web requests. Every time somebody requests your web page, they may be getting the exact same footer as the last person who made a request. So it would be nice if we didn't have to repeat those. If you are using Phoenix and EEx, or if you're using Haml with the Calliope library, you're going to get to take advantage of this. So here's an example. In a template, you can see the chunks: you can see that in this page we've got some static pieces, like the "Listing users" text that's going to be the same on every request, and then we've got some dynamic things, like we're iterating through people's names. All of those static things are not changing, right? So it'd be nice if we didn't have to keep building those strings. Well, in fact, you don't have to. This is what happens when you use Phoenix. You have a template, like foo.html.eex. Phoenix is going to find that as it's booting up, and it's going to compile it into a function. It's going to use EEx to compile that to a function, and it's going to pass an option to EEx to say: by the way, as you're building up from this template, don't build it up by concatenating into a big string; build it up by building an IO list for me. Phoenix specifies that for it. And so this is going to be compiled to a
function head on your view, like MyView.render, matching "foo.html" and taking the assigns. And that function is going to be ready to return IO lists. All of those static pieces that were in your template are just going to be sitting there in a structure inside of that function, waiting to have the dynamic bits added in. So the IO list that's returned from that function is going to look something like this. You can see it has very much the same structure as the template, and the "Listing users" bit is just going to be the same string every single time this function returns; it's going to be a pointer to the same string. So this is really neat, because it enables a really simple caching strategy. You may have worked with web frameworks where you have a caching strategy where you have to say: this piece of the page can be cached if this model's updated_at hasn't changed, and by the way, this model depends on this other model, so if that one gets updated, make sure to update this one so that this piece of the page gets updated as it needs to be. This is a really simple strategy: everything that's static in your template is cached, everything that's dynamic in your template is not cached, the end. That's great. And the way you invalidate the cache is you change the template. So it's very, very simple to understand. It doesn't accomplish all the exact same goals, but it gets you pretty far. Just as proof, here's me DTracing Phoenix and watching it build its responses, and you can see writev: it's showing us that we are in fact using writev in writing this response to a socket. This is a really contrived web response that I did just for this. What I highlighted in red here was a chunk that I saw as being exactly the same as on the previous response, the whole doctype-and-opening-HTML chunk of the web page. It says: go back to the same place in memory you did last time a user requested this page and send them the same piece of data. And then the blue chunks, you can see
that they are the exact same multiple times within the same response. So what actually ends up happening here, kind of like the file example: we never actually build the full completed web page in the memory of the BEAM. The only place it actually gets built is in the socket buffer, where we are actually sending this back over TCP. And then of course TCP is going to send it out in packets and throw it away as it gets done with it, but that is not our problem. We never have to allocate that memory, and every time a user requests the page, we are going to point them back to the same header string as the previous user. So this is really efficient and cool. And this kind of stuff actually matters, in that with large web pages in a Rails app, you can run into situations where the rendering time is the response time. The fact that Phoenix renders so quickly and efficiently, that it is not having to continually copy strings to build up a giant string just so that it can do the write, saves us a lot of effort and it saves us a lot of performance. So the moral of this is basically: when you are doing IO, use IO lists. Now, there are several caveats to this. I am going to try to elaborate on this in a blog post, so make sure to check that out, because there is more than we can cover here. When you are passing IO lists across processes, the small, non-reference-counted binaries are going to get copied. And when you are writing to a file, you have to use raw mode in order to take advantage of writev. But these are details that are probably more than you really have to care about on a normal basis. What you should take away from this is: if you are doing IO, use IO lists. It is not going to hurt you, and it may help. There is no point in actually going through and joining those strings yourself. That is it, thanks for listening.