 alternative subtitle, Unicode, dates, and names. So really quickly about me: I'm co-founder of a startup called Apex, based out of Charlottesville, Virginia. We do analytics for digital marketers, so those agencies that have lots of data that you heard a previous speaker talk about, we help them manage all that data and try to make sense of it. You can read more about Apex at Apex.com. My personal website is the second link, and my Twitter account is JXXF. Oh, also one more thank you to the organizers, Ben, Karen, and Jonan. Thank you for inviting me, and thank you guys for having me. So this talk is fundamentally a talk about assumptions. We use assumptions all the time to understand and simplify the way the world works. Now, some assumptions are good, and they serve us well, so for most of the assumptions that we make, most of the time, there's no problem. Eating a balanced diet is good for your health; I don't think many people would dispute that. But sometimes you make assumptions, like I did on this trip, that don't turn out to be true, and then there are consequences. When our assumptions are wrong, we wind up with problems that we might not have foreseen to begin with. So for example, I'm often prone to believing that I know where the performance hot spot is without having to do any actual profiling, and so I'll just say, "That's definitely the problem, let's take that out," but usually I'm wrong. Sometimes people might think that every object you could define in Ruby is equal to itself, but that's not true, because Float::NAN is not equal to itself by definition. And other times we make assumptions predicated on social or cultural constructs that turn out not to be true, and they wind up reflected in the structure of our software.
So here's a database schema from a Massachusetts civil service that I worked for that specifies the gender of the people that are to be married in a marriage license, which doesn't work as soon as you allow same-sex marriage, and it's a very expensive process to reverse that. So we make assumptions about the way the world works, not just in terms of software, but social and cultural assumptions that are then encoded into software, about which we're also making a whole other set of assumptions. Sometimes our assumptions are even simpler. Here's a string that says "Noël", either in reference to the term for Christmas or to somebody's name. It reports itself as five characters long, but it looks like there are only four characters there. And if we reverse the string, we don't wind up with what we might expect from reversing the individual letters. We'll talk more about that later. Sometimes we make assumptions about the way dates and calendars work. We might think, very logically, that the date after a given day can be gotten by incrementing the day number by one, but that turns out not to be true. So as we'll see, I think this talk is really all about people and the assumptions that people make, what happens when they go wrong, and how we can do better. So I hope to enlighten you on that subject. So first let's talk about time. I don't think people would dispute that time is a difficult thing to get right in software. This is the on-disk size of tzdata, which encodes all of the information about how time zones work across the world: all of the offsets you have to know, and all the times in history that those offsets changed. That takes up almost a megabyte just in rules about how to organize time, right? Imagine writing one megabyte of Ruby source code just to deal with time zones.
Forget about encoding time in terms of hours and minutes and seconds and months and years and calendars and so on. And we're always adding rules too. Every time a political body changes when daylight saving time occurs, for example, another entry goes into tzdata, one for every time zone that's affected. We're always adding rules, and we can never reduce the rule set unless we want to be incorrect about some subset of dates. And, you know, I don't think it should be any surprise that time zones are tough, because we seem to ask a lot of questions about time zones. But why is it so hard? The answer is that we made it that way. We made a lot of presumptions about the way time works that sort of accreted over time, so to speak, and then led us into this big ball of mud. So first, to understand how to handle time in software, we have to ask the more philosophical question of what time is. And this is sort of a perennial philosophical debate. Isaac Newton came down on one side: he said that time is this object or concept that exists independently of whether or not humans exist. So it's not a mental construct. It's something that's real, and it's a property of the universe we live in. Other people thought that time was not such a property, that it was a mental construct we created to try to understand the way the world worked, and that it would not exist but for the fact that human minds perceive the world in a particular way. But for purposes of our discussion, we're going to say that the definition of time is like the definition of space. So space is what separates me from any of you. The reason we're not literally occupying the same place is because we have different spatial coordinates. Likewise, the reason that all events don't happen simultaneously is because they have different temporal coordinates, or at least that's the way we perceive it. So we're doing pretty good, right?
A few slides in and we've already resolved a major philosophical problem of the ancient philosophers. So hopefully we'll have an even better track record for the rest of this. So just pull out your cell phones or look at your watches, take a note of what time it is right now, and yell out what time you think it is. Okay, so most people said an hour and a minute. A few people said p.m. But I think I heard most people say an hour and a minute. So we can tell you what time it is by giving you a really small amount of information, or we could be really, really specific about the time. We could say it's two hours and 30 minutes and 12 seconds, or 2:30 and 12,000 milliseconds, or any level of granularity all the way down to the minimum possible unit of time that we care to think about. So how granular we want to be is an arbitrary choice. But that doesn't always work. So think about what 2:30 means. 2:30 really means that we have this continuum of time, and there's an interval of possible values that time could take where it would be correct to say that it's 2:30. So that is the moment 2:30: all the time that starts when 2:30 becomes true and ends when 2:30 is no longer true. And when we talk about a specific moment, we can narrow the interval that we're discussing by adding more specificity. So instead of 2:30, if you said 2:30 and 12 seconds, you'd get a smaller interval, but still a continuous one. So that's the moment two hours, 30 minutes, and 12 seconds. The problem, of course, is that moments all by themselves are ambiguous. Just today, there are two distinct moments for which the statement "it's 2:30" is true. So that's an ambiguous reference as to which moment we're talking about. And, of course, this repeats every day, all the time, so to speak. There are gonna be a lot of inadvertent time-related puns, I think, but bear with me on that.
Now, the reason that I can understand you when you tell me what time it is, and I'm not lost in a maze of ambiguity, is that humans have context. So I understand that when you say 2:30, you're talking about some window that includes this afternoon, today's date, this year, the Gregorian calendar, this time zone, and so on and so forth. I've assumed a lot of things that make it possible for me and you to have a conversation about what time it is and still understand each other. But if we're gonna tell this to computers, we can't just be more granular. We have to be more specific about what we mean. So when we say what time it is, we can't just say 2:30. We have to add all the pieces that remove ambiguity about what time it is. So we have to say whether we mean PM or AM; we have to say what day of the week it is, or we can say what the actual date is and that will tell us the day of the week; we have to say what year it is; and so on. Now, that gets us to the point where we can talk about what a time is and reference it with respect to other times, but we're still gonna have some other problems when we talk about times between two people that aren't necessarily in the same room. So when we compare times, one of the first problems we might encounter is that we may not be using the same calendar. We might not have the same words for months, or even the same number of months, or the same number of days per month. We may not even have the concept of months at all. There may be one big year, or a Mayan calendar cycle that has nothing to do with months but has everything to do with seasons, and so on. So the first problem is that we have to agree on what calendar we should be using before we can talk about other kinds of time, because otherwise we're not gonna have a good way to convert between different moments in time.
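The calendar problem described above can be sketched with Ruby's standard `Date` class, which labels the same underlying day differently depending on the calendar in use. This is only an illustration: the 2014 date is an arbitrary pick, and the changeover shown is `Date`'s default (`Date::ITALY`, October 1582), which is an assumption that will not match every country's history.

```ruby
require "date"

# One day on the timeline, labelled in two different calendars.
# The specific date is arbitrary and chosen only for illustration.
gregorian = Date.new(2014, 8, 12, Date::GREGORIAN)
julian    = gregorian.julian # same day, Julian calendar labels

puts gregorian.to_s               # Gregorian label
puts julian.to_s                  # Julian label for the very same day
puts gregorian.jd == julian.jd    # both share one Julian Day Number

# With Date's default changeover (Date::ITALY), the day after
# 4 October 1582 is 15 October 1582: "add one to the day number" fails.
last_julian_day = Date.new(1582, 10, 4)
puts last_julian_day.next_day.to_s
```

The two labels differ by 13 days for any date in this century, which is why both parties have to agree on a calendar before comparing dates.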
So we'll say, okay, everybody's gonna use the Gregorian calendar, and that's pretty much what happened starting around the 16th century, when much of Europe, and eventually North America and parts of South America, switched to that same calendar. And let's also, by the way, remember that when we say we're gonna switch to that calendar, we also have to remember the date that we switched calendars, because if we wanna go back and look up what old dates were, we're gonna have to remember that we used the old calendar before a certain date and the new calendar after that date. Okay, next problem. Your time isn't going to be the same as my time if we're not both standing in the same spot. So if I say 2:30 p.m. and we're in Portland, Oregon, that's not gonna be about the same position for the sun in the sky as at 2:30 p.m. in Portland, Australia. So, okay, let's fix that problem. Let's give everyone an offset for their local time so that we can all look at the sky and experience roughly the same daylight hours as each other depending on where we live, and that way, when I memorize my offset, I'll be able to convert to and from a time that you have. So this guy named Charles Dowd basically did this for the U.S. He was a seminary teacher who proposed time zones to a bunch of U.S. railway operators, and they liked the idea so much that it was eventually adopted in the U.S. in 1883. So all of our time zones in the U.S. basically hark back to cross-country railroads. So we said, great. Now we've got time zones and we've got the idea of offsets. So we know where you live when you are talking about a time: you've associated that time with a time zone, and that time zone has an offset. Next problem. That offset is not the same value for all possible values of time. So, for example, in the U.S., we have daylight saving time and standard time, and in Portland, Australia, they don't have that.
It's all one time offset no matter what time of the year it is. Okay, next problem. Your local time offset isn't the same as my local time offset at different times of the year, like we already saw, and it's going to be different for different years. So, for example, the U.S. Congress changed when daylight saving time starts a few years ago. So we have to remember when daylight saving time started for a given year. Okay, so we have a lot of exceptions to when those rules applied and when they didn't, when they started and when they ended, and so on. Next problem. Some of the times that you want to talk about don't actually correspond to a single moment in time in my time zone. So, for example, if you want to talk about 1:30 a.m. Pacific time on Sunday, November 2nd, 2014, that is not a time that identifies one moment, because that's when we go back an hour for daylight saving. There is no moment in time that uniquely corresponds to that local time. So, for example, if I try to say in Ruby, "give me the UTC time for this local time," it's not going to work, because there are two UTC moments in time that correspond to the same local time. Okay. All right. So we've got all that information we have to remember. Obviously, we have to store this somewhere, and we have to store it in a way that everybody understands. That's what tzdata does. So tzdata is a way of encapsulating all the rules and weirdness in human logic about when times started, when times ended, and so on. For every time zone we have, we have to remember a distinct list of historical offsets, and that's what tzdata does. So this is what a tzdata file looks like for Pacific time. They're all referenced based on the canonical city for that time zone. In this case, the canonical city is Los Angeles. There's a bunch of rules about when Pacific time started being observed or stopped being observed.
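The ambiguous 1:30 a.m. case mentioned above can be sketched in plain Ruby by writing the two candidate moments with explicit UTC offsets. This is an illustration rather than the code from the slide; it assumes Pacific Daylight Time is UTC-07:00 and Pacific Standard Time is UTC-08:00, and the variable names are made up for the example.

```ruby
# Two distinct instants that share one wall-clock reading on the
# night the clocks go back. Offsets are spelled out by hand
# (assumption: PDT = -07:00, PST = -08:00).
first_pass  = Time.new(2014, 11, 2, 1, 30, 0, "-07:00") # before the fallback
second_pass = Time.new(2014, 11, 2, 1, 30, 0, "-08:00") # after the fallback

wall_clock = "%Y-%m-%d %H:%M:%S"
puts first_pass.strftime(wall_clock)   # same local reading...
puts second_pass.strftime(wall_clock)  # ...as this one

# getutc returns a new Time and leaves the receiver untouched.
puts first_pass.getutc
puts second_pass.getutc
puts second_pass - first_pass          # the two instants are an hour apart
```

Without the offset, "1:30 a.m. on November 2nd" could mean either instant, which is why a conversion to UTC has no single right answer.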
Now, remember, you have to record every single offset change in the list of offsets we have, and Indiana has changed its mind a lot about which time zones apply to which parts of Indiana. So we have 10 different time zones that you need to memorize. You can't just say central time. You have to know which part of Indiana you're talking about. So in Ruby, the two big classes you probably care about, if you're just using Ruby and not Rails, are Time and Date. There's also DateTime, which is more of a shell that doesn't really do anything unless it's augmented. But the important thing to know here is that Time, as in the class, is aware of its offset. So an instance of Time is aware of its offset, but it is not aware of its time zone. So you must keep track of the time zone that you care about, or otherwise you run the risk of converting between two times in a way that does not make sense. And Date is sort of even worse. It's not really a completely specified time. It's just a day on the calendar. It's like a moment: it doesn't correspond to a unique enough point in time that we can pinpoint something that we care about. And when you convert to UTC, you get a new Time instance from your time zone. But again, because it's a Time instance, you also don't have time zone awareness. Another big problem in Ruby is that we don't have an idea of period or duration. There's no object that represents "two hours" that we can add to a specific time. There's no object that we can instantiate that represents three days and two hours. There are ways of making a new object that corresponds to an advance of three hours, or an advance of three days and 12 hours, but there's no object that represents that fact. One example of why that's a problem is that if we start doing additions in terms of seconds or other granular units, we'll wind up with the wrong answer whenever we cross an offset boundary.
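Both points above, that a `Time` knows its offset but not its zone, and that adding raw seconds goes wrong across an offset boundary, can be sketched with the standard library alone. This is a hedged reconstruction, not the slide itself: the offsets are written by hand (assuming -07:00 before the fallback and -08:00 after), and the November 2nd value is taken to be the second, post-fallback 1:30 a.m.

```ruby
# A Time built from an offset remembers the offset but has no zone.
t = Time.new(2014, 7, 1, 12, 0, 0, "-07:00")
puts t.utc_offset     # offset in seconds
puts t.zone.inspect   # no zone name is attached

# 1:30 a.m. on three consecutive days around the fallback
# (assumption: offsets supplied manually, since Time has no zone rules).
nov1 = Time.new(2014, 11, 1, 1, 30, 0, "-07:00")
nov2 = Time.new(2014, 11, 2, 1, 30, 0, "-08:00")
nov3 = Time.new(2014, 11, 3, 1, 30, 0, "-08:00")

puts nov2 - nov1      # this "day" lasted 25 hours
puts nov3 - nov2      # this one lasted the usual 24

# Adding a fixed 86,400 seconds to nov1 therefore misses nov2.
naive_next_day = nov1 + 86_400
puts nov2 - naive_next_day
```

The last line shows the size of the error: the naive result is a full hour short of the same wall-clock time on the next day.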
So here, for example, I pick three times, November 1st, November 2nd, and November 3rd, each at 1:30 a.m., which is right around the daylight saving time fallback for the fall. If I subtract the first two times, I'll see that they're 90,000 seconds apart. So I'll believe that a day is 90,000 seconds long, but a day is normally 86,400 seconds long. A good way of getting around that problem is that ActiveSupport augments Time with a method called advance, and advance pretty much does what you think it should do. In this case, we'll get rid of the ambiguous time error by being able to advance one day and get that next day after the daylight saving time change. But it's not always correct in terms of what you think it should do. So here, for example, we see that advance is not an associative operation. If we start with January 30th, 2014, and we say, hey, I want to go two months into the future from that date, we'll wind up with what we might think is the expected answer of March 30th. So two months from January 30th is achieved by incrementing the month counter by two. However, when we do that in two steps of one month each rather than all in one step, what happens is that we first go one month forward into February, which only has 28 days, and so the day counter is reduced to 28. So when we then add one more month, we're only going to the 28th day. So we wind up with March 28th versus March 30th. So times are really, really hard to get right. Make sure that you always store your data in UTC if possible. Avoid trying to roll your own solution of any kind, because that will only wind up in blood and tears. And be aware of the limitations of whatever library you happen to be using. And this is a special plea.
If anybody has ever used Joda-Time, which is the Java time library that's very popular and has a number of ports, I would love to see somebody finish the Ruby port, so come talk to me afterwards. I would love to hack on that together sometime. Okay, next up, Unicode. Why is Unicode hard? Again, because humans. So in the beginning, in 1963, IBM invented an encoding that worked on mainframes called EBCDIC ("eb-see-dik" is the quick pronunciation). You can represent strings in EBCDIC, as you can with any encoding, by using the iconv library. And here we see that we've turned "Cascadia" into EBCDIC-encoded bytes. People didn't like that, because EBCDIC had a really inefficient system of encoding how the bytes were stored. So everybody kind of switched to ASCII, which used only the lower 128 values of each byte. And people believed there was an inefficiency there too. So if we imagine how characters are encoded, imagine this grid that occupies one byte, and it's gonna have 16 rows and 16 columns, since 16 times 16 will give us the 256 values. If we fill in a value in each box here, we'll be able to have an encoding system that fits in one byte. So EBCDIC's encoding chart looks like that. Lots of wasted space. The red boxes are assigned values, and the empty boxes are values that are not assigned a character. They're undefined values if you try to encode them. ASCII's encoding chart looks like this. We use all of the lower 128 values and we don't use any of the upper 128 values. So there's still a lot of empty space there for each byte. We're only using 50% of the space. We could probably do something with all that empty space, but right now it's being wasted. Sometimes we use it for box-drawing characters. Have you ever played old DOS games or used old DOS applications? That's what they use: the upper 128 values hold the extra box-drawing characters.
So you can have lines and do sort of primitive graphics. Or you could use them for emoji. You could define your own encoding that had extra symbols in it. But probably a better use is if we took that extra space and we said, hey, what about those languages other than English that need to be represented in the world and need to exchange data with one another? What if we used that space for them? What if we supported more codes? So we had the idea of code pages, which basically map a special number to the encoding that you want to use. So I would tell you, hey, this string is encoded with code page 472. And then you would go look up what code page 472 meant, and now you would have a mapping between all the characters that are in code page 472 and all the bytes that are in that string. So the code page tells us what to put in that empty space, how we should render those characters. It turns out that most of the code pages kind of agreed about what 0 through 127 in each byte should be. So that was still A through Z, 1, 2, 3, 4, 5, 6, 7, and so on. And for some of those other languages, the attitude was: we don't really care about you. You're not gonna get code pages, because you don't use computers right now, and this is the 1980s, and we'll figure it out later. So here's a problem though. What encoding is this string of bytes? Well, you can't tell, because there's no way to determine the encoding of an arbitrary byte sequence. If you need to do that, you have to remember both the bytes that you care about and the encoding that you want to use. But the encoding identifier itself is also stored as a sequence of bytes. So we have to remember that, and now we have to make sure that we agree on what identifier is being used for that. So this is not a great state of affairs. We would really prefer that there was some way to encode stuff that didn't require us to negotiate content between each other.
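The point above, that a byte sequence does not tell you its own encoding, can be sketched in Ruby by reading one byte under three different single-byte encodings. This is a minimal illustration: the byte value 0xE9 and the three encodings (ISO-8859-1, Windows-1251, ISO-8859-7) are arbitrary choices for the example, not ones named in the talk.

```ruby
# One raw byte with no encoding information attached.
raw = [0xE9].pack("C")

# The same byte, interpreted under three code-page-style encodings
# and then converted to UTF-8 so the results can be compared.
as_latin1   = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_cyrillic = raw.dup.force_encoding("Windows-1251").encode("UTF-8")
as_greek    = raw.dup.force_encoding("ISO-8859-7").encode("UTF-8")

# Print each result with its Unicode code point.
[as_latin1, as_cyrillic, as_greek].each do |char|
  puts format("U+%04X %s", char.ord, char)
end
```

Each interpretation yields a different letter from the identical byte, so the bytes are only meaningful when the encoding travels with them.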
We would rather just have one system that managed all of that. So enter Unicode. If you imagine the English alphabet for starters, A, B, C, D, E, F, G, and so on, we can also think of other variations that are essentially A, B, C, D, but which are represented visually in a different way. But you can still identify that these characters are A, B, C, and D, and so on. Likewise, you can see that these are still A, B, C, and D. So we understand that different representations of the same letter are still the same letter, just shown differently. The word for that underlying letter is a character, and each particular drawing of it is a glyph. So a character is sort of the platonic ideal of what a letter is. When we say A, we don't mean the letter A written in a specific font or in someone's handwriting; we mean the generalized perfect ideal of what the letter A is. And each character that we might care about is assigned a specific value called a code point. So in Unicode, for example, the Latin small letter a corresponds to code point U+0061. Actually, the extra zeros at the beginning are superfluous, but by convention, there are at least four digits there. Or this Thai character can also get its own code point. Or this emoji can also get its own code point, and so on. So Unicode's encoding chart works by giving you what is, for practical purposes, an unlimited set of values to choose from. All you have to do is be willing to commit more bytes per character that you care about. So each character does not necessarily occupy a fixed number of bytes. You have to look at which code point it is before you can understand how many bytes that character should take up. And that specific variant of Unicode, where we get as much space as we need, is called UTF-8. So for all the characters that currently exist, that anyone has ever written down at any point in time, anywhere, we have a way of representing them.
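The variable-width idea described above can be checked directly in Ruby, which can report both a character's code point and the number of bytes its UTF-8 encoding occupies. This is a small sketch; the four sample characters are illustrative picks (only the Latin small letter a is named in the talk), written as escapes so the example does not depend on the file's own encoding.

```ruby
# Sample characters chosen for illustration, one per UTF-8 width.
samples = {
  "Latin small letter a"            => "\u0061",
  "Latin small letter e with acute" => "\u00E9",
  "Thai character ko kai"           => "\u0E01",
  "grinning face emoji"             => "\u{1F600}"
}

# ord gives the code point; bytesize gives the UTF-8 byte count.
samples.each do |name, char|
  puts format("U+%04X  %d byte(s)  %s", char.ord, char.bytesize, name)
end

byte_counts = samples.values.map(&:bytesize)
```

Higher code points need more bytes, which is the trade the talk describes: more range in exchange for more bytes per character.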
So, you know, are we done? Great work. Well, not quite, because Unicode in Ruby has a couple of gotchas. So remember before, I showed you the example here where this length is five, but we might expect it to be four. And if we reverse this string, we don't necessarily wind up with what we thought. So I've broken apart the characters here. And you can see now that the diaeresis, that's the two little dots shown to the right of the e, is actually what's called a combining diacritic. That means it combines with the character to its left to make one single character. So I've underlined in yellow the fact that those two bytes are used to represent one character. So when we ask for the length of that string, we actually get five and not four, because that combining diacritic counts as a character, even though as a glyph, they're represented as one unit. And that means we can have surprising results: if we reverse the string and then ask for the bytes, the bytes corresponding to the diaeresis are not reversed. They're kept in the same order. So in other words, it reverses based on character, not based on byte. So likewise, just because you think that your character should be atomic doesn't mean that your library will treat it like that. Make sure if you're performing Unicode string transformations that you do so with a healthy awareness of what's going on under the covers. And again, as before, be aware of what assumptions the library you're using is making about how Unicode encoding works. All right, last thing: names. Names are hard. Why are they hard? Because people are using them. Everyone should read this article by patio11, aka Patrick McKenzie, called "Falsehoods Programmers Believe About Names". I won't go into it too much here except to say that I believed 80% of the things on that list were true statements, and it turns out that none of them are.
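Before going further into names, the Noël string from the Unicode part of the talk (itself a name) can be put into runnable form. This is a sketch under assumptions: the string on the slide is taken to be an e followed by U+0308 COMBINING DIAERESIS, and `unicode_normalize` is assumed to be available, which requires Ruby 2.2 or later.

```ruby
# "e" plus a separate combining diaeresis (assumed form of the slide's string).
decomposed = "Noe\u0308l"
# The same visible text using the single precomposed character U+00EB.
composed   = "No\u00EBl"

puts decomposed.length     # counts code points, so the diacritic counts too
puts composed.length
puts decomposed.bytesize   # the combining mark alone takes two bytes

# reverse works per code point: the mark ends up attached to the "l".
reversed = decomposed.reverse
puts reversed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")

# Two ways to treat the visible character as one unit:
puts decomposed.unicode_normalize(:nfc) == composed  # compose first
clusters = decomposed.scan(/\X/)                     # or split by grapheme cluster
puts clusters.length
puts clusters.reverse.join == "le\u0308oN"
```

Either normalizing first or working in grapheme clusters gives the four-unit view a reader expects; which one fits depends on what the surrounding code assumes.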
So a fundamental problem with names is that names are a possibly empty set of strings that map to something: a person, a place, a location, a ballroom, etc. But almost no system that we invent for software models names that way. Many systems treat different parts of names specially, like you may have a field for middle name and a field for last name and so on and so forth. But this will always lead to disaster whenever you're trying to accommodate a global audience. So who here has heard of the Scunthorpe problem? One person, okay. So the Scunthorpe problem is named after a town in England called Scunthorpe, and its name contains an offensive word. But it also has about 100,000 people who live there. So every time they try to put their address into a filtered field, they're going to run afoul of that problem. Linda Callahan in 2005 tried to register a Yahoo email account but couldn't, because her name contains the string "Allah" and Yahoo was banning all accounts with that substring in them. This domain name was not possible to register for the first 10 years of the internet's life because InterNIC, the organization that was the registrar at the time, prohibited domains with profanity in them, and shitakemushrooms.com contains a profane word. Craig Cockburn (that's pronounced "Coburn") couldn't register cockburn@hotmail.com. Also, even after he was able to register an address, his email got caught by spam filters a lot, and see if you can guess why: because his title was "software specialist", which contains the string "cialis", which is highly associated with spam email messages. Google Plus banned people for having names that looked like they were fake, like me. They thought John Feminella was a fake name, but it turns out people have way weirder names than I do. So there's Dr. Loki Skylizard, a thoracic surgeon, who by the way was on Google Plus before I was, so explain that, Google.
Or maybe this University of Alabama football player, Ha Ha Clinton-Dix, or some other real, actual names of people that are all awesome but probably would not be allowed on Google Plus. So you can see there are a lot of really bad assumptions being made about names. I'd like to review a couple of them real quick for you. One of the worst assumptions that people make is that names don't change. How many times have you been on a website that didn't let you update the name that you were assigned, right? So what if you get married and change your last name, for example? Another assumption people make is that names only change at predefined times, that the number of those times is probably limited in some way, and that your name will never change unless you go through one of those predefined events. I will bet any set of such events does not include someone going into witness protection, and so that case would be missed by that sort of scheme. Another assumption people make is that users have a single canonical name that they go by, but that's not true, because what if someone has a nickname, or maybe you were called something different in college? Okay, fine, but maybe people have a single canonical name for financial and legal purposes? But what about credit reports that aren't in sync? If you have two credit agencies that each have a different idea of what your name is, one may have your middle name and another may not: different names. Another assumption is that names will be unique in the context of some system. There are 820,000 people with this Chinese name. There are about 46,000 people in the U.S. named John Smith. There are about 120 people named John Feminella. Another assumption is that names will contain capital letters. That's definitely not true. You only have to look at history and some famous literary people to understand that. e. e. cummings was a poet, and bell hooks was a prolific feminist.
Another assumption is that names don't contain numbers, but this is a real New Zealand child's name: Number 16 Bus Shelter, with the 16 in numerals. Okay, fine, but they probably won't contain non-alphanumeric characters? That's clearly not true either. Jay-Z has a hyphen in his legal last name. Okay, fine, but names will always be Unicode characters, surely? Not Prince's. Another assumption is that people have names to begin with, but that's not true. In the U.S., nobody is legally required to have a name. That's just a social convention. Your life will certainly be harder if you don't have one, because you'll have no way to identify yourself, but you're not legally required to have a name. So don't filter inputs just for the sake of trying to identify a specific real-world person, place, or thing. And I think you should probably just store names as an unrestricted single text field of virtually unlimited length, if you can manage that. Thanks very much for having me, and thanks again to the organizers for inviting me.