Hi, everyone. You didn't have to say hi. I just assumed you were welcoming. Okay, so I have a lot of material here and a very short time to go over it, so pardon me if I'm moving extremely quickly. I'm Allison Parrish. I am a poet, computer programmer, and artist, and my talk is called Lossy Text Compression, for some reason. So I am 34 years old and I've been a computer programmer forever, but until recently I didn't understand JPEG compression. Then one day I was reading over the Wikipedia page and I realized: wait, I have somehow learned, through osmosis, enough about trigonometry and linear algebra that most of this makes sense without me really trying, just a byproduct of being 34 years old and having been in computer programming for a long time. And all I could think was, oh my god, that is so clever and really, really, really cool. A few hours after that I tweeted this: "everything is different now that I know a text is nothing but a waveform in high-dimensional semantic space I can see through time." So I want to show you some of the experiments that led me to this conclusion.

First I have to explain a little bit about JPEG compression, or at least what I think is the really interesting part of JPEG compression. This is a graph of a list of 25 values randomly generated with a random walk. To represent these values we need to keep track of 25 floating point numbers. The goal of data compression, of course, is to represent the same list of values as precisely as possible using fewer values. The tricky way to do this is to use something called the discrete cosine transform. I think there's going to be a better talk about this later today, so if my explanation doesn't make sense, you'll have a second chance.

The key insight of the discrete cosine transform is that any list of values can be represented instead as the sum of a series of cosine functions at different frequencies. The DCT for a given list of values has the same number of values as the original list. Here's a graph of the DCT vector for the original list of values that I just showed you. It shows the coefficient for the cosine function at each frequency. Stick with me here, people. Here's a graph of the actual cosine waveforms at their corresponding frequencies. If you were to sum the values of all the waveforms together, you'd get the original graph back. I have a little animation here that should be happening. Yeah, so the animation is successively adding together the values of the waveforms, starting with the lowest-frequency cosines. The red line is the original list of values, the light gray lines are the individual cosine components, and the black line is the approximation. As you can see, as you add in more components, the approximation gets more accurate.

The trick of using the DCT to compress data is that the lower-frequency components usually carry more information than the higher-frequency ones. So if you look at the approximation of the data, this is just with the first 10 coefficients, you can see it's pretty much the same thing. It's really close to the original values, but only using 10 numbers instead of 25. So as long as you're okay with being a little bit inaccurate, you can represent the original 25 values by just storing a few values from the start of the DCT. This is what is meant by lossy compression. JPEG compression uses DCTs too, except it uses a two-dimensional cosine function. This is a little animation that I stole from Wikipedia.
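To make that concrete, here is roughly what the random-walk experiment looks like in code. This is a minimal sketch using NumPy and SciPy's DCT; the 25-value random walk and the choice to keep ten coefficients mirror the slides, but the specific code is my reconstruction, not the original.

```python
# Sketch of 1-D DCT-based lossy compression of a 25-value random walk.
# Library choice (SciPy) and seed are assumptions, not from the talk.
import numpy as np
from scipy.fftpack import dct, idct

rng = np.random.default_rng(0)
values = np.cumsum(rng.standard_normal(25))   # 25-value random walk

coeffs = dct(values, norm='ortho')            # 25 DCT coefficients

truncated = np.zeros_like(coeffs)
truncated[:10] = coeffs[:10]                  # keep only the 10 lowest-frequency coefficients

approx = idct(truncated, norm='ortho')        # close to the original, stored as 10 numbers
print(np.round(values, 2))
print(np.round(approx, 2))
```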
If you perform a DCT on an 8x8 matrix of pixels, you can throw away most of the resulting vector and still be able to reconstruct a reasonable facsimile of the image. You can increase the compression without losing a lot of fidelity just by throwing away the higher-frequency coefficients. This is really cool.

So because I am a poet, when I finally understood how this part of JPEG compression works, I thought to myself: I will apply this procedure to text and see what strange things result. But the question arises, how do I convert a text into a vector? There are a number of ways, and the first thing that I tried was the simplest thing I could think of: just convert each letter to the number corresponding to its position in the alphabet. So here's the source text that I'm going to use for the rest of this talk. It's the first three verses of Genesis from the King James Version of the Bible. I normalized it to remove punctuation and convert all letters to lower case. The Python function there converts that text to a series of numbers between 0 and 26, with 0 representing a space. In other words, a series of numerical values, which I've graphed here. I performed the DCT on that series of values, and we discovered that the first three verses of Genesis can be represented as the sum of these cosine functions, which is pretty cool. This is actually just the first 30 cosine functions. This is all 192, if you were curious.

So let's write some poetry. Now that I have the text as a series of numbers, and I have the DCT of that series, I can compress the text in a lossy way and attempt to approximate the original with only a subset of the coefficients from the DCT. I'll use this function to convert the numbers from the inverted DCT back into text, reconstructing an approximation of the original. Here's what it looks like; it's a little small on this slide. I'm graphing the values from the source text in red, the values from the inverse DCT in black, and the difference between the two in light gray. The original text and the transcribed approximation are beneath the graph. This graph shows what it looks like when you're using all 192 coefficients. As you'd expect, it reproduces the text with 100% fidelity.

Now here's what it looks like with the first four coefficients missing, so this is about a 2% compression ratio. The approximated text is still recognizable, but it's pretty messed up. And here's 50% compression. The text is now pretty much unrecognizable. You can still track some of the similarities if you look closely enough. This is an excellent poem to read out loud, by the way, if you'll indulge me. What was the first chapter of the Bible is now the first chapter of the Necronomicon. By the time we're down to just the four lowest-frequency coefficients, all we have left is a long, high-pitched laugh. Here's a little animation that shows all of the intermediate steps, which took me a long time to do in NumPy, so I hope you enjoy it.

So that was a lot of fun, but I'm not done. There are more ways to convert a text into a vector. The previous example was about how words are written. Another experiment would be to compress not the bytes of the letters themselves, but the meaning of the underlying words. To do that, we need some way to represent the meaning of words with a vector. In natural language processing, the term for a word represented as a vector is a word embedding. I'm experimenting with word2vec word embeddings, which were created by some Google researchers.
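Here is roughly what that letters-as-numbers pipeline looks like in code. It's a hedged sketch: the 0-through-26 mapping and the use of SciPy's DCT follow the talk, but the rounding and clamping used to turn reconstructed values back into letters are my own assumptions.

```python
# Sketch of the letters-as-numbers experiment: text -> numbers -> DCT ->
# truncate -> inverse DCT -> text. Rounding/clamping details are assumed.
import numpy as np
from scipy.fftpack import dct, idct

def text_to_numbers(text):
    # space -> 0, 'a' -> 1 ... 'z' -> 26
    return np.array([0 if c == ' ' else ord(c) - ord('a') + 1 for c in text], dtype=float)

def numbers_to_text(values):
    chars = []
    for v in values:
        n = min(26, max(0, int(round(v))))   # clamp back into the 0..26 range
        chars.append(' ' if n == 0 else chr(ord('a') + n - 1))
    return ''.join(chars)

source = "in the beginning god created the heaven and the earth"
signal = text_to_numbers(source)

coeffs = dct(signal, norm='ortho')
coeffs[20:] = 0                              # keep only the 20 lowest-frequency coefficients

approx = idct(coeffs, norm='ortho')
print(numbers_to_text(approx))
```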
This is what a vector for a word looks like in word2vec. It has 300 dimensions; this is the word light. One of the interesting consequences of how word2vec vectors work is that words with similar meanings are close to each other, in terms of Euclidean distance, in the vector space. So this code shows that the vectors for two related words like dog and puppy are closer together than the vectors for two unrelated words like alpacas and mastodon. If you build a quad tree or a k-d tree, you can fairly efficiently find words that are similar to a given word. As you see here, the words for the vectors closest to the vector for light are light, sunlight, lights, glow, glare, lightening, and lighting. You can also use this technique to find the word closest to an arbitrary point in vector space, which is important for what I'm about to show you.

So here's how I'm going to lossily compress the meaning of the text. I'm going to create a list of the word2vec vectors for each word in the source text, still the same passage from Genesis, which gives me a 39 by 300 matrix. This function converts that matrix back into words by finding the words in the word2vec vector space closest to the given vectors. With the unmodified matrix, as you can see here, it just returns the original source text. So now, and this is the fun part, I'm going to find the DCT of that list of word2vec vectors. The DCT works just as well on data with 300 dimensions as it does on data with one or two dimensions. Hopefully you can see where I'm going with this. Then I can approximate the meaning of the original source text using only the first few coefficients of the word2vec DCT.

It turns out that with this particular source text, you can leave out about 80% of the coefficients before the approximated text differs from the source text. So here's an example at about 91% compression; only 27 of the 300 coefficients are present here. As you can see, the text is weird. It says: against the start, got out, numbers, the Godot, plat, the earth, plat, the earth was without form, plat void, plat darkness was perma-ban, the face of the improvement, plat, the spirit of God ported perma-ban, the face of the deceased. So it's still basically the same text, but the words have been strangely replaced with words that are related in meaning or distribution. If you only use one of the coefficients to reconstruct the text, compressing its meaning by 99%, the first verse of Genesis is just diary and measuring, diary, measuring, diary, measuring, which is actually a really poetic summary of the first chapter of the Bible. Here's another animation that shows the whole process. The graph is showing the Euclidean distance between the reconstructed vectors and the originals. This one took less time to make, because I had learned how to do it the first time.

So in the description of this talk, I promised to talk about the practical implications of all this. The truth is that I don't really know enough to suggest how it might be useful. But the fact that the meaning of the approximated vectors stays pretty close to the meaning of the originals suggests to me that maybe you could use this in low-memory environments where you don't really care about the original tokens of the text, but you still want to use word embeddings; you can reduce the dimensionality there.
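For what it's worth, here is roughly how the word-embedding version could be reconstructed with gensim and SciPy. This is a sketch under assumptions: it presumes the pretrained GoogleNews word2vec file, takes the DCT along the word-position axis (one reasonable reading of the talk), and maps reconstructed vectors back to words with gensim's nearest-neighbor lookup; the talk doesn't confirm these exact details.

```python
# Rough sketch of the word-embedding experiment, assuming the pretrained
# GoogleNews word2vec vectors loaded with gensim. The DCT axis and the
# nearest-neighbor mapping are my interpretation, not a confirmed reproduction.
import numpy as np
from scipy.fftpack import dct, idct
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

text = "in the beginning god created the heaven and the earth"
words = [w for w in text.split() if w in kv]      # skip words missing from the model
vectors = np.array([kv[w] for w in words])        # shape: (n_words, 300)

# DCT along the word-position axis: each of the 300 embedding dimensions
# becomes a short waveform over the course of the text.
coeffs = dct(vectors, norm='ortho', axis=0)

keep = 4
coeffs[keep:, :] = 0                              # drop the higher-frequency coefficients

approx = idct(coeffs, norm='ortho', axis=0)

# Map each reconstructed vector back to the nearest word in embedding space.
reconstructed = [kv.similar_by_vector(v, topn=1)[0][0] for v in approx]
print(' '.join(reconstructed))
```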
It also strikes me that if you had different word embeddings, and maybe a different way to perform this process, you could use this for actual text summarization: you could feed in a text, compress it, and the stuff that's left is basically what the text means. As a poet, my eyes have been opened to the possibilities of text as a waveform, so I'm interested in doing more experiments with that. What happens when you put a low-pass filter on a text, or add reverb or delay, or put autotune on it? What would that even mean in this context? That's all I have. Thank you.