Hey, my name is Kyle McDonald. I'm an artist working mostly with code. And I wanted to say thanks again to Irene and Lynn and everyone else who's organizing for having me here.

The first AI that left me speechless was a chatbot named MegaHAL. I read a transcript of a training session with its creator, Jason Hutchens, when I was in high school. They go back and forth in French for a moment: "Parlez-vous français?" ("Do you speak French?"). "In 1793, the French king was executed." "Ha ha ha, correct. Although 'executed' has multiple meanings. Very clever, again, like the other bots." "The revolution started on July 14." "It's 14 degrees Celsius here." "Another revolution was carried out by Lenin in Russia in the year 1917, while Lenin read a book." I was in awe.

It turns out MegaHAL was basically a sleight of hand: it would pick a word from the previous line and run a Markov chain forward and backward from that word.
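If you want to see how small a trick that is, here's a toy reconstruction in Python. To be clear, this is my own simplified sketch of the idea, not Hutchens' actual code; the real MegaHAL uses higher-order models and scores many candidate replies.

```python
import random
from collections import defaultdict

# Two Markov chains over the training text: one running forward, one backward.
forward = defaultdict(list)
backward = defaultdict(list)

def train(text):
    words = text.lower().split()
    for a, b in zip(words, words[1:]):
        forward[a].append(b)
        backward[b].append(a)

def walk(chain, word, max_len=10):
    out = []
    while word in chain and len(out) < max_len:
        word = random.choice(chain[word])
        out.append(word)
    return out

def reply(line):
    # pick a keyword from the user's line that we've seen in training
    known = [w for w in line.lower().split() if w in forward or w in backward]
    if not known:
        return "I don't know enough to reply yet."
    seed = random.choice(known)
    # grow a sentence backward and forward from that keyword
    head = walk(backward, seed)[::-1]
    tail = walk(forward, seed)
    return " ".join(head + [seed] + tail)

train("in 1793 the french king was executed")
print(reply("tell me about the french revolution"))
```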
But reading these transcripts had a really big effect on how I saw computers. I decided I'd study AI in college, and I started a degree in philosophy and computer science. Halfway through undergrad, I started doing research at an AI lab. I'd skipped most of my CS classes to sit in on art and music classes, and in lab meetings, where we were supposed to be talking about natural language search techniques, I spent most of the time proposing ideas about computational creativity or automating online activity and identity. After a few too many of these interjections, the director sat me down after a meeting and said, "Kyle, I think you might be an artist." It wasn't a compliment, but I took it to heart, and I went on to do an MFA.

Since then, I've found myself in a regular pattern: learn about a new tool or field of research, explore it conceptually and technically through small studies, and eventually craft a new artwork that integrates recurring themes from my practice. Previously I focused on tools like 3D scanning or face tracking; most recently I'm stuck on machine learning. But it's not my first time with this obsession. Outside of the AI lab, I was building my own side projects: programs that tried to understand and improvise around a rhythm you were clapping, or tried to finish drawings that you started. I explored everything from clustering to genetic programming, but my favorite tool was neural nets. At the time they were small and out of fashion, but in the last five to ten years they've returned stronger than ever, fueled by new research, new kinds of data in huge amounts, faster computers, new toolkits, and new communities. These last two years, I've been getting back into machine learning and AI and rediscovering a lot of what drew me to it in the first place, back in college and even in high school. So I'm still in that getting-familiar, making-small-studies phase of my art practice that comes before making any new artwork, but I wanted to share some of that process today. I'll cover three general topics: convolutional nets, recurrent nets, and dimensionality reduction. And I'm really grateful for the previous presentation, because I was worried that some of this would be a little too quick.

So on June 16, 2015, Reddit erupted in a furious debate over an image of a "puppyslug", posted anonymously with the title "Image generated by a convolutional neural network", which I'm pretty sure was leaked from someone at Google. Machine learning researchers and hobbyists were quick to argue over whether it could possibly be the work of a neural net as claimed, or if it was in fact some other algorithm, or even handcrafted somehow. And the deleted comments littering the thread only added to the mystery of the image's origins. About a week later, a post appeared on the Google Research blog titled "Inceptionism: Going Deeper into Neural Networks", with other similar images, basically confirming the origin of this puppyslug. Soon after, Google released some code called DeepDream that anyone could use to recreate these images. And journalists were quick to ascribe the psychedelic artwork to "Google's AI", as if there were one massive digital brain in an underground lair that Sergey is secretly feeding photos and a few dozen posts from Google+.

While DeepDream may have been the first moment a neural net captured the public's imagination so visibly, they've seen plenty of success in everyday life. Convolutional neural nets, or convnets, in particular have been used for reading checks since the 90s, powering recent search-by-image systems, auto-tagging images, and demonstrating exactly how homogeneous many research groups are.

About a year before DeepDream, I'd seen some other applications of deep learning, and I was surprised at how much had changed since I'd left neural nets behind in college. So my first step was to look for a toolkit written in some kind of language I was familiar with. I found CCV, a computer vision library that ships with a pre-trained convnet and a flexible license, and I made a wrapper for it in openFrameworks to try to recognize things in real time. I haven't done this display switch before, so let's see what happens... Here's an example. "Lighter." "Flute." It tends to think I'm a Band-Aid a lot of the time. Maybe "iPod". Good job, neural net. That'll do, that'll do.

I've done a lot of computer vision work before, but this was totally beyond everything I was familiar with. This wasn't color matching or feature tracking. Even when it got some things totally wrong, the uncanny labels were still inspiring. And it made me think back to the story of AI pioneer Marvin Minsky assigning an MIT first-year undergraduate a summer project in 1966: solve the problem of computer vision. I wondered what he would think about these techniques.

After watching Geoff Hinton's Coursera course on deep learning, I started dreaming of new applications in artwork, but building and training my own nets with CCV seemed a little daunting due to the lack of community and examples. Looking into other toolkits, I learned the big contenders were Caffe, Theano, and Torch, written in C++, Python, and Lua, respectively. And now there's TensorFlow, of course, which is one of the biggest. There were also plenty of less recognized toolkits and wrappers written in Python; even Caffe had a Python wrapper, so I decided to start practicing my Python. It's turned out to be a great investment thanks to the huge ecosystem, especially tools like scikit-learn.

So the first thing I tried tackling was a problem I'd tried before: smile detection. Smile detection is great for when you want to automatically inject emoticons into your chat, or take screenshots of things that make you laugh, or make some other kind of interactive installation. And you can get a decent estimate with really simple metrics, like the relative distance between the corners of the mouth.
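That baseline really is just a few lines. Here's a minimal sketch, assuming you already have 68-point facial landmarks from a detector like dlib; the landmark indices follow dlib's convention, and the threshold is a made-up starting point, not a tuned value:

```python
import numpy as np

def smile_score(landmarks):
    """Crude smile estimate from dlib-style 68-point landmarks:
    mouth width relative to overall face width."""
    left_mouth = np.asarray(landmarks[48])   # left mouth corner
    right_mouth = np.asarray(landmarks[54])  # right mouth corner
    left_jaw = np.asarray(landmarks[0])      # leftmost jaw point
    right_jaw = np.asarray(landmarks[16])    # rightmost jaw point
    mouth_width = np.linalg.norm(right_mouth - left_mouth)
    face_width = np.linalg.norm(right_jaw - left_jaw)
    return mouth_width / face_width

def is_smiling(landmarks, threshold=0.38):
    # threshold is illustrative; calibrate it on your own faces
    return smile_score(landmarks) > threshold
```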
But I was curious how a convnet would do in comparison. So I found a few thousand examples of smiling and neutral faces, took an example meant for handwritten digit recognition, and turned it into a smile detector. Rather than asking whether a small image is a handwritten zero, one, two, three, et cetera, I asked whether it was a smile (one) or not (zero). Besides changing the input images and the output size, I didn't make any other modifications.

Let's see what this actually looks like in practice. And I'm going to try something incredibly irresponsible and train a convolutional neural net on stage. So right now I'm starting up this tool, Keras, which is a wrapper with multiple backends: you can run Theano, which has been around for a while, or TensorFlow. The first thing I would normally do is download the data, but I did that in advance. Then I load all the images in and convert them to a slightly different format, so I have them in one really big matrix, or tensor. That can take a few seconds. Once they're in that format, I make sure they're right for the neural net; a lot of these techniques need 32-bit floating point, and if you give them something else, they get really mad at you. We can take a look... cool, the images look good. I'm going to save these to disk, and we'll move on to the next step: training.

So we load the data in using NumPy, start up Keras, and say we want two classes; like I said just a moment ago, instead of ten, we've got two. We shuffle all the data around so that we don't train on one part of the data and then try to validate on data we've never seen before. Then I start up the GPU on my computer, which could go terribly wrong, but it normally takes about ten seconds, and we're lucky this time. And then we actually define the neural net. This is the textual corollary of the graphs we were seeing in the last presentation, and working with TensorFlow looks pretty similar; this is just via Keras. Now, here we go. Crossing my fingers... all right, okay, we're off to a good start. It's taking about eight seconds per epoch, which means it takes about eight seconds to go through all the data once. And in our first round... ooh, that wasn't so good, we're at 73% accuracy. But it's going up; now we're at 83% accuracy. Our validation is about the same, even a little better. I'm only going to run it for five epochs, and then we'll save it and do a quick test. You can see this net has about 800,000 parameters that define it, and most of those are in the last layers, where there are these dense connections. Cool, it worked. We'll save that and take a look. That's the way you want things to look: accuracy going up, loss going down. Whoops... all right, we're coming back, here we go.
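For anyone who wants to retrace this at home, the whole pipeline looks roughly like the sketch below. This is from memory, written against the current tf.keras API rather than the 2016-era Keras I used on stage, with illustrative layer sizes and hypothetical file names, not the exact demo code:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.utils import to_categorical

# hypothetical files: 32x32 grayscale face crops and 0/1 smile labels
X = np.load('faces.npy').astype('float32') / 255.0  # these nets want float32
X = X.reshape(-1, 32, 32, 1)
y = to_categorical(np.load('labels.npy'), 2)        # two classes instead of ten

# shuffle so the validation split isn't one unrepresentative chunk
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),   # most of the parameters live here
    Dropout(0.5),
    Dense(2, activation='softmax'),  # smile (1) or neutral (0)
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=5, validation_split=0.1)
model.save('smile_model.h5')
```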
Last step: evaluation. All right, I'm going to load the model back in and start up the GPU again, which could also go terribly wrong if something wasn't unlocked correctly. And we're going to evaluate it on one image. Okay, it says this image is a little more toward neutral than smiling, which I'd say is about accurate. But let's try it on something that's a little more intense. I just made this little app that feeds my face in in real time; if anyone here uses Jupyter and Python, you'll realize this is ridiculous. All right, here we go. Hey! Yes! Oh man, I don't think my heart's ever been racing that fast during a presentation. Cool, thanks for sticking with that.

So let's talk about what's going on here in a little more detail. We got a lot of this from the last presentation, but a convnet is simultaneously learning basically two things: image patches that help discern categories, and which combinations of those patches make up a given category. So first the net detects things like edges and spots; then a combination, like an edge and a dark spot, that might make up an eyebrow and an eye; and finally the net recognizes that an eyebrow, an eye, and a nose make up a face, that sort of thing. Sometimes it's hard to disentangle where a net is detecting features and where it's detecting combinations, but this is still one way researchers conceptualize the otherwise murky inner workings of this technique.

The DeepDream images are based on reversing this process. So for example, say you start with a picture of a squirrel, like we saw before, and a network that's been trained to detect a thousand different categories of objects. DeepDream first runs the squirrel image through the net and identifies what's happening inside the network: are there edges, spots, eyes, et cetera? Once that activity is identified, DeepDream is the process of modifying the original image so that it amplifies that activity, just like what we were looking at a minute ago, taking something and making it look more like a frog or a bird or a plane. So if you have some vaguely eye-like shapes in the image, they start really looking like eyes. Or if you have a vaguely dog-like shape, it starts to turn into a dog. And that happens a lot, because many of the thousand categories researchers train on happen to be different dog breeds. Through this process you can also ask DeepDream to visualize a specific category: instead of looking at the inside of the net, you just look at the output. And this is how researchers discovered that, according to neural nets, a dumbbell is only a dumbbell when it's attached to an arm.

When Google released code implementing DeepDream, I started to explore it by applying it to a large number of images, from classics like Man Ray and Michelangelo to my personal collection of FBO glitches, testing different settings, or networks trained on different categories, or making animations out of a series of images.
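Under the hood, that "amplify the activity" step is just gradient ascent on the image itself. Here's a minimal sketch of the loop using PyTorch and a pre-trained GoogLeNet; the layer choice, step count, and step size are arbitrary, and the real DeepDream code adds tricks I'm leaving out, like processing the image at multiple scales ("octaves") and jittering it between steps:

```python
import torch
from torchvision import models

model = models.googlenet(pretrained=True).eval()
activations = {}
# record what an arbitrary mid-level layer "sees" in the image
model.inception4c.register_forward_hook(
    lambda module, inputs, output: activations.update(out=output))

def deep_dream(img, steps=20, lr=0.05):
    img = img.clone().requires_grad_(True)
    for _ in range(steps):
        model(img)
        loss = activations['out'].norm()  # how strongly does this layer fire?
        loss.backward()
        with torch.no_grad():
            img += lr * img.grad / img.grad.abs().mean()  # normalized ascent step
            img.grad.zero_()
    return img.detach()

# stand-in for a real, normalized photo tensor
dream = deep_dream(torch.rand(1, 3, 224, 224))
```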
And then about two months after the Inceptionism post, researchers from the University of Tübingen in Germany posted "A Neural Algorithm of Artistic Style". In this paper, they show how to imitate an artistic style when rendering a photo using a neural net. And it looks impossible, like the sort of thing that should require carefully trained humans who have undergone years of study and practice, but they're taking a single photo and recreating it in all these different styles automatically.

So my first experiments with that technique were to try something harder: reversing it. Maybe if we could remove the painterly filter and turn a painting into a photo, we'd discover what Vincent van Gogh was actually looking at. The results were mixed and uninspiring, so I posted a hoax to Twitter instead, based on another artist's work, and moved on to other experiments. But as with DeepDream, I learned the most by processing a huge combination of different images, mostly from the Western art history I was familiar with, plus a few carefully tuned portraits. And I slowly learned that style transfer really means something more like texture transfer. But there's still a lot to explore here, from animation to improvisation, inventing new styles, interpolating between styles. A few months ago some even more impressive results were published, and last month some researchers posted a new article saying they can do this in real time, whereas normally it takes about five minutes per frame.

One of my favorite things about techniques like DeepDream and style transfer is that they visually communicate an intuition for some very complicated topics. Not many people understand how a neural net works from front to back, including all the justifications for the code, architecture, and mathematical choices that went into the system, but everyone can look at a puppyslug or a fake van Gogh and have some intuition about what's happening behind the scenes. Techniques like DeepDream provide, basically, a human intuition for machine intelligence.

One of the other interesting things about neural nets is that it's easy to develop a feeling for manipulating them once you have an understanding of the basic concepts. A neural net is just addition and multiplication, weighted sums like we saw a minute ago, and the training process is about figuring out which weights will give you the answers you're expecting. One modification of this setup is to take the current internal state of the net and feed it as input to the next state of the net. This is called a recurrent neural net. While a convnet usually predicts a fixed-size output from a fixed-size input, like a thousand categories from a small image, a recurrent net is meant for modeling sequences of data of variable length, making one prediction per time step: things like pen movements in handwriting, a sequence of characters in text, notes in music, or temperatures from the weather.

And I think the DeepDream moment for recurrent nets is char-rnn, an example published by Andrej Karpathy in May last year. If you feed char-rnn a few megabytes of example text, it will start generating novel text in that style. And Andrej gives a number of compelling examples after feeding it the collected works of Shakespeare: "So drop upon your lordship's head, and your opinion shall be against your honour." Or feeding it Wikipedia: "Naturalism and decision for the majority of Arab countries' capitalide was grounded by the Irish language by John Clair." Or Linux kernel code, even with the arcane comments. And if you're familiar with other text-generation techniques, like Markov chains, you'll notice the recurrent net has a surprising ability to balance syntactic correctness with novel text. With Markov chains, you normally have a trade-off between copying your source material exactly and producing incorrect output. But recurrent nets find this middle ground by capturing a deeper structure in the data.
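The recurrence itself is tiny. Here's a sketch of one step of a vanilla recurrent net in NumPy, with made-up dimensions and untrained random weights; real char-rnn uses LSTM cells and learned weights, but the feed-the-state-back-in idea is the same:

```python
import numpy as np

# toy dimensions: 128-dimensional hidden state, 96 possible characters
hidden_size, vocab_size = 128, 96
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (the recurrence)
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def step(x, h):
    """One time step: mix the current input with the previous state,
    and predict a distribution over the next character."""
    h = np.tanh(Wxh @ x + Whh @ h)   # new state depends on the old state
    y = Why @ h
    p = np.exp(y) / np.exp(y).sum()  # softmax over characters
    return p, h

# sample a short sequence by feeding each prediction back in
h = np.zeros(hidden_size)
x = np.zeros(vocab_size); x[0] = 1.0  # arbitrary start character
for _ in range(20):
    p, h = step(x, h)
    idx = np.random.choice(vocab_size, p=p)
    x = np.zeros(vocab_size); x[idx] = 1.0
```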
So what happens if you combine a convnet and a recurrent net? Let's try this. If you combine a convnet with a recurrent net... oh, is it cropped at the top? I think it's cropped. If you combine a convnet with a recurrent net... even weirder, the font is not loaded on that screen. That's the weirdest thing I've seen in a while. Anyway: if you give it a bunch of examples of images with captions, you've now got an automatic caption-generating tool. "A man in a suit and a tie is holding a wine glass." Wow, we're back to the wine again. I would really be happy if it said "video feedback", but that's not it. Maybe... "cell phone", "talking on a cell phone", "man with glasses talking on a cell phone". I'm curious, what does it know about everyone here? Oh, I think I just crashed it. Yeah, that's a crash. Let's see, are we still... So many dangerous demos. Okay.

My first experiment with char-rnn was not image captioning, but feeding it a dictionary and generating new words. char-rnn captures some basic structure, like the part of speech following each word and numbers coming sequentially. Sometimes it even provides a correct inflection or a vaguely topical definition. And then, inspired by MegaHAL's requirement to repeat a word from the previous line, I added that constraint to the output of another net, trained on phrases from a 1917 book of thousands of useful phrases. And some of these were really lovely. I love how it finishes: "Flipping love, flipping with every memory of the acknowledgement, the moon, and memory of her bright as the grave, the moon of the stars of the soul."

Then I tried a more personal direction, feeding it some chat history. "Lauren says: so, did you see it anymore? Yeah, we only hear it out since that you were kind of because it could be last year. So, he was quickly saying that might be real time. Mmm, I guess it would ever want to trash supportive general," smiley face, "his once word mids, yeah," GitHub link, Twitter link. The lines are the right length. The smiley feels like it's in the right place. I write "yeah", Lauren writes "ya". The URLs aren't real, but they look exactly like the kind of links we would send each other. The output seems to occupy this liminal space between modes of awareness, kind of like between dreaming and waking attention. It has the look and feel of the thing without having the content or the meaning of the thing.

I even constructed a handcrafted, artisanal, small-data archive of pickup lines, and I got this: "How is the, you're beautiful, getting to hand a Bible at last year. Loomed you a love every time I was in your dad's. When your name, you manged to our stars with a heart because I'm milling in on the say, can'ts trace. You're so falling, have a half pile of friends and when havers sees with my most and rom." There wasn't quite enough text in the database to learn English, so it sounds kind of drunk, which I guess is appropriate. But I wonder: how long will it be until a machine is more seductive than a human?

Next I tried around 220,000 lines from a huge movie dialogue database, but the output was almost completely nonsensical, and I assumed I needed a bigger network. And sure enough, a few months later Google published some amazing results based on movie subtitles, and it was awesome. The last really big data set I've tried is 20,000 drug experiences from Erowid. And with such a huge collection of text, the output is much more grammatically correct, with proper spelling and punctuation. Conceptually, I'm kind of curious what happens when an algorithm passes that uncanny valley and becomes a perfect mimic of this kind of text.
And if humans are unable to distinguish a generated drug experience from a real one, is the machine now some kind of philosophical zombie, something that can describe an experience it could never have, but so perfectly that we can't tell the difference?

And it's not just about text in the sense of natural language: anything that can be encoded as a sequence is fair game. So another kind of text that's not natural language is an SVG file, which looks like this. This example is from Twemoji, the emoji library provided by Twitter. I collected all of them, about 875 files and two megabytes, and asked the net to predict names for some new emoji and give me some new SVG files. This one is titled "Clock Face 9". And here's a bigger collection to show you the variety in the output. It's really amazing to see how the big circles that go along with faces consistently find their way into the mix, while other barely recognizable shapes are scattered throughout. And the colors are really the most consistent part, since Twemoji uses a pretty restricted palette and char-rnn memorizes it exactly. If you like the way this looks, check out Smiling Face with Face by Allison Parrish or Glitch Logos by Darius Kazemi.

All right, finishing up here, I wanted to mention something that's a little separate from deep learning but has been a huge inspiration to me, because it's a tool I've always wanted that has only made sense to me recently. One way to think about abstraction is called dimensionality reduction. For example, if you have a 10 by 10 pixel grayscale image, that's 100 numbers, or 100 dimensions. But if each image contains a handwritten digit, maybe a more useful representation is a 10-dimensional one: how much it looks like each digit, strong in the first value if it's a zero, strong in the second value if it's a one, et cetera. Ten dimensions might be useful for describing the categories of the images, but it's hard to visualize a 10-dimensional space, as we saw with the 6- and 32-dimensional spaces earlier. So we can take this further and try to embed the 10-dimensional space, or an even higher-dimensional space, into just two or three dimensions, and then draw something that's a little easier to interpret, like a point cloud.

Different dimensionality reduction algorithms produce different kinds of abstractions or arrangements. One of my favorites is called t-SNE. It tries to embed similar data points close to each other, but doesn't worry too much about keeping points that are really different far from each other. This allows it to have structure at multiple scales. At the largest scale, you can see here that it's placed the different digits, these are all handwritten digits from a database called MNIST, in different clusters, roughly one cluster per digit. And then at a smaller scale, you can see patterns in things like how slanted the handwriting is, and gradients in stroke weight.
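Running t-SNE yourself is only a few lines with scikit-learn. Here's a minimal sketch on the small handwritten-digit set that ships with the library (8x8 images, so 64 dimensions down to 2):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embed 64-dimensional digit images into 2D, keeping similar digits nearby
digits = load_digits()  # 1797 small handwritten digits
xy = TSNE(n_components=2).fit_transform(digits.data)

plt.scatter(xy[:, 0], xy[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar(label='digit')
plt.show()
```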
Let me keep going. Unfortunately, some of the most interesting data doesn't have a numeric representation, but a lot of the time we can learn one from context. word2vec is an algorithm for converting words into a set of representative numbers. It looks at the contexts a word usually occurs in, and uses those contexts to determine which words should have similar representations. It's usually trained on hundreds of thousands or millions of unique words scattered throughout millions or hundreds of millions of articles, like news articles, and it returns 300 numbers for each word. None of these numbers has a clear interpretation on its own, like a pixel value or a category would, but you can do basic comparisons between them. You can look at distances: words that are similar have a smaller distance. In this case, Monday through Friday are similar to each other, but different from Saturday and Sunday. Or the months of the year are roughly divided into March through July and August through February. I don't totally get that one, but you can tell that consecutive months are more similar than distant months. And you can do analogies, too: the closest vector to Moscow minus Russia plus Japan is Tokyo. All these relationships point in the same direction, so even though each dimension is not clearly interpretable, the general direction and location of each vector encodes its meaning. If we look at a bunch of country-capital pairs in 2D like this, we can see this relationship pretty clearly.
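Those comparisons are easy to reproduce with the gensim library and a pre-trained model. A sketch, assuming the standard Google News vectors that gensim's downloader provides (a large download the first time):

```python
import gensim.downloader

# 300-dimensional word2vec vectors trained on Google News
model = gensim.downloader.load('word2vec-google-news-300')

# distances: similar words score higher
print(model.similarity('Monday', 'Tuesday'))   # relatively high
print(model.similarity('Monday', 'Saturday'))  # lower

# analogy: Moscow - Russia + Japan ~= Tokyo
print(model.most_similar(positive=['Moscow', 'Japan'],
                         negative=['Russia'], topn=1))
```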
So now that we know we can get a vector for every word, we can take any list of words and run it through t-SNE. Here are 750 words related to moods, and I used color to indicate a corresponding 3D embedding, instead of just the 2D embedding that was used for the positions; this helps visually group the clusters a little more. Some obvious variants are in here, things like "happy" and "glad", in green, right next to each other, or "eager" and "anxious". But other connections are kind of surprising: words that occupy the same general category but have opposite valence, like "doubtful" and "hopeful" above, end up right next to each other, because they're kind of interchangeable in the middle of a sentence; they just mean the exact opposite thing.

This got me interested in antonyms, and in comparing antonym relationships rather than the words themselves. Is the forward/backward pair more similar to happy/sad or to future/past? And maybe this changes between languages, too. Sometimes these relationships end up being easy to interpret: unintelligent/intelligent is right next to uninteresting/interesting and other un- antonyms. But most of the time it's very difficult to interpret what the antonyms in a cluster have in common. One lesson might be that antonyms capture a lot of different types of relationships, and there's no single relationship encoded in language as "opposite".

Instead of creating vectors for words, it's also common to make vectors for documents, using topic modeling. And we can run t-SNE on every page in a book, for example, to watch the thematic content of the book develop over time. Or we can run t-SNE on the internal state of a convnet after it processes an image: instead of feeding an image through a convnet and then feeding that through a recurrent net to generate a caption, we can use the internal state as a description in itself. So abstract concepts like green backgrounds, blue skies, circular objects, eye shapes, feathery textures, they all get grouped together.

Let me open this up. I worked with some researchers from Rhizomatiks, an interactive agency in Japan, exploring the Itsuo Sakane archive, which is about 800 hours of footage from media art history recorded over the last 30 years or so. And we used this technique to put similar images from the data set next to each other. We took one frame per minute of video, and you can see here that all the psychedelic minutes end up next to each other. Some of this kind of sci-fi-looking blue-against-black stuff ends up together. I like this area over here, where there's just a bunch of spherical things. And at the bottom left, this is where all the faces end up. Let me adjust this so you can see it a little better. And when one video has a bunch of frames that are exactly the same, they end up right next to each other, and you can even spot duplicates in the archive; these two videos turn out to be the same video. Cool.

But I'll finish up with one more demo. For me, I got into interactive art through experimental music, and a lot of this exploration with machine learning has been about finding my way back into making sounds. But I guess I just realized... this is a visualization conference, huh? But maybe if we're really quiet, here's what happens when you apply these techniques to spectrograms instead of images. It works. Yeah, maybe half that volume. Is it... it says no volume control. Okay, so there was a nice guitar somewhere... whoa, I was not expecting that. We can see all the snares end up in the same area. I'll try to move slowly. t-SNE decided, you know, all these cowbells are basically the same. So.
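If you want to try that audio map at home, the pipeline is the same as with images, just using a small spectrogram fingerprint per sound. A minimal sketch with librosa and scikit-learn; the folder of drum samples is hypothetical:

```python
import glob
import numpy as np
import librosa
from sklearn.manifold import TSNE

def fingerprint(path, n_mels=32, frames=32):
    """Summarize one sound as a fixed-size mel spectrogram patch."""
    y, sr = librosa.load(path, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel = librosa.power_to_db(mel)
    mel = librosa.util.fix_length(mel, size=frames, axis=1)  # pad or crop in time
    return mel.flatten()

paths = sorted(glob.glob('drum_samples/*.wav'))  # hypothetical folder
X = np.array([fingerprint(p) for p in paths])
xy = TSNE(n_components=2).fit_transform(X)  # similar sounds land near each other
```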