Please welcome Jair, who's going to tell us a little bit about the random module. Thank you. So, "Non Sequitur: An Exploration of Python's random Module." Non sequitur means "it does not follow" in Latin. And I chose it as the name of the talk because it sort of describes the behavior of random sequences, but also because my own interest in the topic was completely random. So it does not have much practical relevance for me, but still I think it's an interesting and beautiful topic and worth talking about. So, my name is Jair Trejo. I work for Pink Orbeez, a small development shop in Mexico City. And I want to talk to you today about randomness in computers and in the Python standard library. So, I like the English word "random" because, besides its basic meanings of unpredictability and impartiality, it also has a connotation of spontaneity or suddenness. In fact, it likely comes from Old French words that mean things like speed or violence or impulsiveness. The Spanish word is "azar." It comes from Arabic, and it refers to an old dice game. So even now we call games of chance "juegos de azar." So the mathematical term is very much related to the gambling meaning. It is no coincidence that the first explorations of probability, the mathematical theory that measures, analyzes, and up to a point predicts random outcomes, have their roots in trying to understand gambling and what goes into predicting the outcome of gambling. One of the first examples of probabilistic analysis comes from a series of letters between Isaac Newton and Samuel Pepys concerning some dice bets that Samuel Pepys was going to make. We think of the rolling of dice as a process with a random outcome. For a fair die, we expect that each of the six faces has an equal probability of coming up when we roll it. So before throwing it, we don't really know which face is going to come up.
And even when we do a series of rolls, the information about past outcomes of the dice does not give us any insight into what is going to come next. So, if I roll the dice and the number 4 comes up, is 4 a random number? Well, it certainly is a number chosen at random. But just by looking at it, we cannot know whether the process that produced it was actually random. So we can't really talk about the randomness of individual numbers, only about sequences of numbers. And sequences of random numbers have many applications in real-world situations. They are often used for reducing the size of a problem by sampling it at random points. This can be seen, for instance, in statistics, where you take a representative sample from a population, like when you pick people to call for an election poll; or in simulations, where, to approximate the probability of some event or property, we can randomly generate events and then statistically measure the probability that we're looking for. These applications require the sequence of random numbers to be uniformly distributed. This means that every number in a certain range needs to come up with roughly equal probability; otherwise, our results are going to exhibit those same biases. For instance, this sequence looks pretty random, reasonably uniform. So it could conceivably be used in simulations as a source of numbers between 0 and 9. In fact, if we take the average, we will see that it's reasonably close to 4.5, which is what we would expect for such a sequence. And every number comes up with roughly the same frequency. But random numbers also have important applications in cryptography. Many secure communication algorithms use random numbers for secret generation, so that only trusted parties will know the secret random numbers.
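The informal checks just described, an average near 4.5 and roughly equal digit frequencies, can be sketched in a few lines of Python. The 50-digit sequence here (the first digits of pi) is just an example input:

```python
from collections import Counter

# An example digit sequence: the first 50 digits of pi.
digits = [int(d) for d in
          "31415926535897932384626433832795028841971693993751"]

mean = sum(digits) / len(digits)   # for uniform digits, this should be near 4.5
freqs = Counter(digits)            # each of 0..9 should appear about 5 times here

print(mean)
print(sorted(freqs.items()))
```

These checks are only a rule of thumb; as the talk goes on to explain, passing them says nothing about unpredictability.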
Cryptographic signing schemes also use random numbers for generating signatures in a way that doesn't reveal information about the key, even if you have a lot of signed messages. And, for instance, in Django there is a long random secret key that you need for every website, which is used to sign sessions and cookies so that users or attackers cannot tamper with them. And there's been a recent scandal about the NSA building a backdoor into the random number generator used by RSA in many of their products. So apparently they can predict which random numbers the RSA products are picking, which has disastrous security implications. So, these cryptographic applications require more than just an unbiased sequence of numbers. They require the sequence of numbers to be actually unpredictable. An attacker that knows which random numbers I'm picking, or even that has some insight on where to look for the random numbers that I'm picking, has a way into all of my secrets. So, the sequence we were talking about just before looks unpredictable, maybe, but it really isn't at all. These are the first digits of pi, which of course is a completely fixed, predictable sequence. If an attacker knew that I was using digits of pi for generating random numbers, he would only have to compute pi and he would know all of my future picks. So, keeping in mind these two requirements of impartiality and unpredictability, what can we use for getting suitable sequences of random numbers? One option is to use a natural phenomenon that we know to be unpredictable when measured with sufficient accuracy. For instance, the website random.org uses atmospheric noise; it measures it and extracts random numbers from it. In the UK, there is a machine called ERNIE that uses transistor noise measurements to pick winners in a national lottery.
And we can also use radioactive isotope decay, which we know is by nature unpredictable, independently of the precision of our instruments or even the quality of our models. But what these cases have in common is that measuring these sorts of natural quantities to generate random numbers is often slow, expensive, and requires specialized equipment. So it might be useful to generate a large quantity of random numbers once and then compile them into a table of random numbers that we can draw from in the future. As a matter of fact, in 1955 the RAND Corporation published a table of a million random digits obtained from specialized hardware. This enormous book came to be widely used in simulations for engineering and science. But, of course, large numeric tables also have some disadvantages of their own. Especially with the computers from back then, it is very hard to store and efficiently access such a large table, which led researchers to look into techniques for random number generation on the fly. Of course, computers are deterministic artifacts: the future state of the machine is completely determined by the present state. So how can an algorithm actually generate random numbers? Well, it turns out that unless we incorporate input from outside devices, we can only generate pseudo-random numbers. That is, random number generators output numbers that look random when measured statistically, but they are not actually hard to predict if you have enough information about the state of the generator. In the 1940s, John von Neumann was doing simulation work that required a stream of random numbers. He came up with the idea of generating it by taking a number, squaring it, and then taking the middle digits to produce the next one.
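Von Neumann's middle-square method is simple enough to sketch in a few lines. This is a minimal illustrative version for four-digit numbers, not production code:

```python
def middle_square(seed, digits=4):
    """Yield pseudo-random numbers using von Neumann's middle-square method."""
    n = seed
    while True:
        # Square the current number and zero-pad to twice the digit width.
        squared = str(n * n).zfill(2 * digits)
        # The middle digits of the square become the next number.
        start = (len(squared) - digits) // 2
        n = int(squared[start:start + digits])
        yield n
```

Starting from 1234, for example, the first few values are 5227, 3215, 3362 (1234 squared is 01522756, whose middle four digits are 5227, and so on).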
The output of the generator looks reasonably random, but it is crucial to pick the right seed for it, because, for instance, if we get a zero somewhere in there, that means that from then on the sequence is only going to generate zeros. And it also has a tendency to fall into short loops, which, of course, there is no way to get out of. In fact, we can use different seeds and measure just how long the generator runs before starting to repeat numbers. And we can see that even for the best seeds, the sequences are not very long. But if we take one of those long sequences and check the average value and other statistical properties, we can see that they look reasonably random. So, is it possible to evaluate randomness more precisely? If we want to mathematically evaluate randomness, we need ways to formalize both its impartiality and its unpredictability aspects. One way to formally measure unpredictability is to look at the entropy of our output, which is a measure of the space of possibilities that it can take. It is important to note that this cannot be immediately told from looking at the numbers; you have to actually look into the process that generated them. For instance, if we see those numbers and I told you that I picked them at random, you might think that I picked them in the range from zero to a hundred. But if I told you that they are actually prime numbers, then you would see that the actual space from which I drew them was much smaller than we thought. It's similar to how my bank asks me to pick a password, but it can only be like eight characters long, and I cannot use repeating patterns or consecutive numbers. So in general, they are reducing and reducing the space of possible passwords that I can pick. Although it might be worth it if it stops people from using "password" as their password. As for impartiality, we can look at the statistical properties of our random sequence and see if they are consistent with probabilistic predictions.
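Measuring how long the generator runs before repeating can be done by tracking the values seen so far. A self-contained sketch, using a single middle-square step:

```python
def middle_square_step(n, digits=4):
    """One step of von Neumann's middle-square method."""
    squared = str(n * n).zfill(2 * digits)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])

def steps_before_repeat(seed, digits=4):
    """Count distinct values produced before the sequence revisits one."""
    seen = set()
    n = seed
    while n not in seen:
        seen.add(n)
        n = middle_square_step(n, digits)
    return len(seen)
```

For four-digit numbers the count can never exceed 10,000, and in practice it is far smaller, which is exactly the weakness described above.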
When checking the randomness of the digits of pi, we used a very informal test, sort of a rule of thumb. We checked that the average of the values was what was to be expected in a random sequence. And we strengthened this test by looking at the individual frequencies of the different digits and seeing that they were roughly the same. But we have no way to assess whether this is sufficiently right or disastrously wrong; we need something a little bit more quantitative. A much better evaluation is the chi-square test, which is used in statistics to see if a set of data conforms to a certain distribution. The general idea is to take the squares of the differences between observed and expected values, weighted by the expected value, and sum them. This gives us a sort of measure of how much our observed frequencies deviate from what we would expect probabilistically in a truly random sequence. With this measure, we go to a table like this that gives us the likelihood of observing different values of this quantity. If it is too big, then we conclude that the sequence deviates too much from what we would expect of a random sequence. But also, and differently from the usual application in statistics, if it is too low, then the sequence is also suspect: it is too uniform to be random. Other tests check the sequence for more complex patterns. For instance, in a random sequence, pairs of numbers need to be as uniformly distributed as the numbers themselves. Or we can also check the gaps between successive appearances of the same number, and whether the lengths of these gaps are consistent with what we would expect probabilistically. And there are a number of other patterns that we can use. As a matter of fact, there are standard batteries of tests that can be used to check random sequences. There is one by the American NIST, which groups a series of conventional mathematical tests into programs that can evaluate a list of random numbers.
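The statistic itself is simple to compute; a sketch for digit frequencies (the full test then compares this value against the chi-square distribution's critical values, the table mentioned above):

```python
from collections import Counter

def chi_square_statistic(digits, categories=10):
    """Sum of (observed - expected)^2 / expected over each category."""
    counts = Counter(digits)
    expected = len(digits) / categories
    return sum((counts.get(d, 0) - expected) ** 2 / expected
               for d in range(categories))
```

Note that perfectly uniform data yields exactly 0.0, which by the "too uniform" criterion above is itself suspicious for a supposedly random sequence.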
So we can see if our generators are random enough. On the other hand, there are some more exotic tests, like the Marsaglia Diehard battery of tests, which tests random numbers in some quirkier situations, like the spacing of birthdays in a random population, or placing circles in a plane and seeing which circles overlap. And many other random experiments for which we know what values to expect, so we can check those against the performance of our random, or supposedly random, sequence. Taking into account these tests of randomness, better generators have been devised. One very popular one is the linear congruential generator, which is a recurrence where we take an initial value and use this equation to produce subsequent values. Of course, this is going to eventually repeat itself, with a period no greater than m. But if we pick the right values for a, c, and m, we can get reasonably long sequences that exhibit very good statistical properties. The problem with this algorithm is that it's very easy to fall into this situation: even if the numbers look random when seen linearly, when you plot pairs of them, they sometimes exhibit this kind of behavior, like they all fall on the same straight lines. We can choose better values of a, c, and m to get rid of this behavior, but it always ends up happening in higher dimensions. So how can we get rid of even these deviations from truly random behavior? Well, the Mersenne Twister is an algorithm proposed by Makoto Matsumoto and Takuji Nishimura, which consists of a large linear feedback shift register. And it operates in a way that gives the sequence a very, very large period of 2 to the 19,937th power, minus one. It is also interesting that this generator has internal state and uses that internal state to produce the actual random numbers. So even if you know the random numbers themselves, you cannot immediately predict the next number in the sequence; you need a large sample of them.
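The linear congruential recurrence is x_{n+1} = (a * x_n + c) mod m. A sketch, using the multiplier and increment that glibc's rand() uses:

```python
def lcg(seed, a=1103515245, c=12345, m=2 ** 31):
    """Linear congruential generator: x_{n+1} = (a * x_n + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x
```

The period can never exceed m, and plotting consecutive pairs or triples of the output is exactly how the planes-in-higher-dimensions flaw described above shows up.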
And if you actually measure the statistical properties of the sequence that it produces, they are very, very close to randomness, and they don't exhibit those weird correlations in many dimensions, up to 623 dimensions. So it is a very good random number generator. These desirable characteristics have made it a very popular generator. It is built into many languages, and Python is one of them. The Python random module uses the Mersenne Twister as its underlying default random number generator. There's also the question of how to get random numbers that are cryptographically secure. These obviously cannot be obtained from an algorithm, because algorithms are deterministic, so they have to be gathered from system activity. Linux and some other Unix systems provide a source of random numbers in /dev/random that is fed by an entropy pool that derives randomness from various sources, like keyboard input, the timing of mouse movements, noise in sound or network interfaces, et cetera. So when users need random numbers, they can get true random numbers from this pool. Of course, getting random numbers out of the pool sort of drinks a bit of our entropy milkshake, so we need to replenish the pool with more entropy. Besides the regular sources for a consumer system, like the keyboard or the mouse, modern computer systems often incorporate some form of hardware random number generation. Intel chips from the Ivy Bridge family come with a dedicated random number generator in the hardware. So now we have finally arrived at the actual random module. The Python random module starts from this generator of numbers between 0 and 1, uniformly distributed, and provides a lot of other interesting distributions based on that. So, the way to use it is: there's a class in the module, Random, which can be seeded, and that provides a method, random, that is going to produce a sequence of numbers from 0 to 1.
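In code, that looks like this: instantiate random.Random, optionally seed it, and call random() for uniform floats in [0.0, 1.0):

```python
import random

rng = random.Random(42)                    # a seeded Mersenne Twister instance
values = [rng.random() for _ in range(3)]  # uniform floats in [0.0, 1.0)
print(values)
```

Seeding with the same value always reproduces the same sequence, which is what makes simulations repeatable.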
We can use a seed if we are going to need to repeat the same sequence, or use the same sequence several times; otherwise, we can just let Python seed it with a number obtained from the operating system's randomness source or from the time at the moment of the call. From real random numbers between 0 and 1, it is very easy to get real or integer random numbers up to a certain value: we just multiply the random real by the maximum value. If you need a specific range, you generate a random number up to the width of the range and then offset it by the start, so it is still very easily derived from the random real. And if you also need a certain step in the sequence of possible random numbers, you just generate an integer up to the number of steps, multiply by the step, and then offset it by the start, and you have your random integer. So we can generate random reals, and we can generate random integers. And if we need to include the whole interval of numbers, so we need a number between a and b, including both a and b, we can use this special function, randint, that just calls randrange with the appropriate arguments. We can also perform some random operations on a sequence. For instance, we might want to get a random element of the sequence, which is very easy: generate an integer in the range of indexes for the sequence and pick the element corresponding to that index. If we need a sample, we just repeat this process several times. If we want a sample without replacement, we need some form of tracking of which elements we have already picked. There are two ways to do it: you can track which elements you can still pick in a list and remove from it every time you pick one, or you can use a set to remember which elements you have already picked. Which is more efficient depends on the size of the population compared to the size of the sample that you want to get, and the Python random module actually computes this on the fly and uses the more efficient method.
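Those derivations can be sketched on top of random(). The helper names here are mine, for illustration; the module itself provides them as randrange and randint:

```python
import random

rng = random.Random()

def my_randrange(start, stop, step=1):
    """A random element of range(start, stop, step), built from rng.random()."""
    width = (stop - start + step - 1) // step   # how many values are possible
    # rng.random() < 1.0, so the scaled integer is always below width.
    return start + step * int(rng.random() * width)

def my_randint(a, b):
    """A random integer between a and b, with both endpoints included."""
    return my_randrange(a, b + 1)
```

This is the sketch version of the idea; the real randrange does extra argument checking and avoids some floating-point pitfalls.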
You also might want to shuffle a list. The algorithm used by the random module is the Fisher-Yates shuffle, which just goes through the list and exchanges every item with another one that is randomly picked. This, of course, shuffles the list in place, destroying the original sequence. So if we need a simpler way to do it that gives us a new list, we can just sort by a random key. This is not as efficient, but it's much simpler. Now, we might be interested in random real numbers that have a distribution other than the uniform one. How may we go about it? Well, let's consider the normal distribution. It is determined by two parameters, mu and sigma. In this plot, we can see that for each real number we know the probability of picking it with this normal distribution. But we can also plot the probability of a random variable with this distribution falling before every real number. This is called the cumulative distribution function for the variable, and we can see that it is always increasing. This means that to get a sequence of normally distributed random numbers, we can generate a uniform random number that we will use as a probability in this plot, and then check what x that probability corresponds to. And the result of that selection is going to be normally distributed. This does not only apply to the normal distribution, but to any distribution for which we know the cumulative distribution function. But it is not always obvious or easy to get the inverse cumulative distribution function just from looking at the distribution function, so there are many mathematical tricks that have been devised to ease these computations. For instance, for the normal distribution we use a sort of mathematical trick where we pick two random numbers, use them to generate a point in a circle, and the x and y coordinates of that point end up being normally distributed.
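That two-uniform-numbers trick is the Box-Muller transform, which is the approach behind the module's gauss function; a sketch:

```python
import math
import random

def box_muller(rng=random):
    """Two independent standard normal samples from two uniform samples."""
    u1 = rng.random()
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u1))   # radius; 1 - u1 avoids log(0)
    theta = 2.0 * math.pi * u2                 # uniformly random angle
    # The point (r*cos(theta), r*sin(theta)) has normally distributed coordinates.
    return r * math.cos(theta), r * math.sin(theta)
```

Each call consumes two uniform numbers and yields two independent normal deviates, which is why implementations often cache the second one for the next call.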
And from there we can get a number of interesting distributions that you might know from science and engineering, like the triangular distribution, the gamma and beta distributions, the Pareto distribution, or the Weibull distribution, which is very popular in engineering because it can be used to approximate the other ones. Another one of note is the von Mises distribution, which is sort of like the normal distribution but for angles on a circle, because when we have angles we can see that several angles may actually correspond to the same point on the circle. So the von Mises distribution is wrapped around the circle to account for the effect of these overlapping angles at every point. And finally, the random module creates a default instance of the Random class and provides its bound methods as module-level functions. So you can just import random, and if you don't care about the state of the generator you can just use the module functions. If you need separate generators, like for multi-threaded applications, or because you need two independent generators for different experiments, you can actually instantiate the class and seed the instances individually. You can also subclass the Random class to provide your own random number generator. The Python random module comes with the Wichmann-Hill generator for backwards compatibility reasons, and as an example of how to provide your own random number generator. There's also SystemRandom, which will get numbers from the system-provided random number generator on Unix systems. And there is even a library that will connect to the random.org server and use that as a source of random numbers, so if you need true random numbers you can use that. And since all of the other methods rely only on the generator of real numbers from 0 to 1, they will still work even if you change the source of the actual numbers. So, concluding: the definition of randomness is more a philosophical than a mathematical problem.
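A small sketch of those options side by side: the module-level functions, independently seeded instances, and SystemRandom for OS-provided entropy:

```python
import random

print(random.random())          # uses the module's shared default instance

exp_a = random.Random(1)        # independent generators, seeded separately,
exp_b = random.Random(2)        # e.g. for two reproducible experiments

secure = random.SystemRandom()  # draws from the OS entropy source; seeding
token = secure.randrange(10 ** 6)   # has no effect on it
```

SystemRandom supports the same interface (randrange, choice, shuffle, and so on) precisely because those methods are all built on top of the underlying source of numbers between 0 and 1.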
But we can use mathematical definitions that are useful for our purposes. If we need sequences that are deterministic but behave as if random, we can use pseudo-random number generation. But if we need numbers that are completely unpredictable, we need sources of entropy like input devices, noise measurements, or other external natural phenomena. And for most of our random number needs, Python provides more than adequate capabilities. Finally, I would like to talk about a book that inspired this talk. This is a very good book that takes a very short BASIC program and uses literary criticism techniques to analyze it word by word. It sounds far-fetched, but it's actually a very interesting book, and the chapter on randomness is what got me interested in this very beautiful topic. The Art of Computer Programming, Volume 2: half of this book is about random numbers. It is very theoretical, but it's also very fun; lots of really nice mathematics in there. And finally, if you want to read a little bit more, there is a series of really good articles about randomness in cryptography by Cloudflare that might help you understand why randomness is important in cryptography. In the second link there is a very good description of how statistical testing of random numbers works, and if you want to read more about the possible backdoor in RSA's random number generator, this Ars Technica article is also very good. So that will be it. Thank you very much.