Today what I want to cover with you is not just how human beings are amazing for our ability to speak to each other. I want to go over our process of conversation, then a bit about how we speak to machines, and I want to help you understand how to make machines with voice interfaces stronger and better for people to use.

The first thing we have to talk about, if we talk about conversation, is the rules that we follow every day. The first of these rules is one that lets us live in cities, and it's one that we have to break in order to have a conversation. The rule is this: when you're walking down the street, anywhere in the world, whether you're here in Lisbon, in San Francisco, or in Beijing, the one thing you are sure to do is not make eye contact with another human being. As we walk, we notice that there's another person there and quickly look away, because if we make eye contact with another person, that's a form of recognition, and it invites conversation. We don't actually want to have conversations with every single person we see on the street. So the first thing we need to do to have a conversation is look at another human being. That's the first step, and it's called recognition: recognizing that we would like to converse with each other.

After this recognition, you have to have a greeting, and this greeting process is different from country to country; even within countries it can be different. The typical American greeting pattern, which is potentially longer than a German greeting pattern but shorter than a Persian greeting pattern, consists of somebody saying "hi," and the next person saying "how are you doing?"
They don't care how you're doing. We just say it because we feel like we have to; it's part of a ritual that we have for greeting. So the first person says "hi," and it doesn't matter whether you're going to get a coffee at a Starbucks or seeing relatives that you haven't seen in a long time: they do not care how you're doing in the initial greeting. You have to say "I'm good, how are you?" If you say anything other than that you're doing well or things are good, that's a problem, because it breaks the ritual we have for greeting. Even if things are going horribly in your life, you need to tell them that you're doing fine; later on, when the real conversation starts, that's when you can tell them how you actually feel. And you need to ask how they're doing, and they need to respond that they're also doing well. So this is the greeting, and it's completely meaningless, but it establishes that you're having a conversation with another person and that you've both decided to participate in it. Only after that do you actually get to have the initial inquiry, which is the question, for example in a coffee shop, of "what can I get for you?" So it takes a lot: making eye contact with the person, having an initial, not especially meaningful, set of conversation turns, and then finally you get to have your initial inquiry. These are essential parts of all conversation.

What I want to do now is have you listen in on a phone recording. This is from the Santa Barbara Corpus of Spoken American English, a series of recordings made available for researchers, where you can listen to conversations between people that have been recorded: over dinner, as a couple is falling asleep, and so on.
This one is a phone conversation between a young couple. One of them is in California, on the west coast of the United States; the other one is in Pennsylvania, in the eastern United States. I have to warn you that the conversation we're going to listen to, and we'll listen to several parts of it, is a very personal conversation. It might have parts that make you uncomfortable, and it is in general uncomfortable to listen to a couple's private conversation, but they volunteered to be recorded, and their voices have been changed slightly so they're not recognizable. So let's take a moment to listen to this conversation.

"That's my favorite girl in the world. Who's the girl that I love? The girl that I'll do anything for. I'll wash her feet with my mouth..."

Like, TMI. Did not need to know about the 10-mile run or the feet licking. It's not how every conversation goes, granted, but this is a pretty typical conversation. And what's interesting about it? Here's a transcript of it. The at signs: that's not Twitter, that's actually a symbol for laughter, so anytime there's an @ in there, somebody's laughing. You can see in this conversation that it follows the rules we just talked about. It kicks off with "hello," and then "hey, how are you doing?" This guy, instead of answering the "how are you doing" part of the conversation, which he gets asked a couple of times, is talking about licking her feet after a 10-mile run, and she keeps saying, like, "no, how are you doing?" She wants him to stay on track with the ritual and go through the first parts of the conversation. So even though he's like, "I want to wash your feet with my mouth after a 10-mile run," she, even here, is still like, "no, really, how are you?"
She's not ready to have the rest of the conversation until he answers that initial inquiry. We zoom all the way to the bottom, and then finally he goes, "I'm doing good," and then they can continue with the rest of the conversation. You can see over and over again in conversational transcripts that people do not let a conversation advance until the initial greeting component of the conversation happens.

Within a conversation, the different turns people take back and forth are part of a turn-taking model, and the turn-taking model can go a few different ways. What's nice, for anyone here who works in design, is that it's a flowchart, so it should feel really comfortable for most people here. During a conversation, you have somebody who's the current speaker; right now, that's me. As the current speaker, I can choose the next person who speaks, so I could literally say, "You know, Alan, what do you think of voice interfaces?" and I've selected the next speaker. Or the next person who wants to speak could self-select: somebody could pull a Kanye West, run up here, grab the microphone, and decide to be the next speaker, and then that person becomes the current speaker. Or I can keep talking, so it's just "speaker continues," and I maintain my status as the current speaker. Or...
I could just stop talking, in which case the conversation would end, and that would be really boring. So this is the turn-taking model, and as we listen to more of the conversation, you'll be able to see it. I want you to pay attention today to when people self-select to start speaking (it's a little more difficult in a conference setting, where people don't know each other), when somebody calls on another person to become the speaker, or when a conversation ends because nobody volunteers more information.

So those are the ways that we take turns, and the order in which we can take turns, but that doesn't really tell us the rules of conversation. These rules of conversation are termed Grice's maxims, and these maxims give us the principles of conversation. They're descriptive, not prescriptive: they describe patterns in conversation that have been observed over time; they don't tell us how to have one. These maxims apply in most cultures around the world, with Madagascar as a noted exception: there, one of the maxims does not apply.
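For the designers in the room, that turn-taking flowchart can be sketched as a tiny state machine. This is just an illustration: the event names and the function are my own labels, not standard conversation-analysis terminology.

```python
# Sketch of the turn-taking flowchart: at each turn, one of four things
# happens, and the conversation either continues with some speaker or ends.

def next_state(current_speaker, event, target=None):
    """Return (new_speaker, conversation_alive) after one transition."""
    if event == "select_next":    # current speaker picks the next speaker
        return target, True
    if event == "self_select":    # someone grabs the floor themselves
        return target, True
    if event == "continue":       # current speaker keeps talking
        return current_speaker, True
    if event == "silence":        # nobody volunteers: conversation ends
        return None, False
    raise ValueError(f"unknown event: {event}")

# Walk through a short exchange.
speaker, alive = "me", True
speaker, alive = next_state(speaker, "select_next", target="Alan")
speaker, alive = next_state(speaker, "self_select", target="Kanye")
speaker, alive = next_state(speaker, "continue")   # Kanye keeps the floor
speaker, alive = next_state(speaker, "silence")
print(alive)  # → False: nobody took the next turn, so the talk is over
```

The last transition is the boring ending described above: no one self-selects and no one is selected, so the conversation simply stops.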
So the first is that we say what is true. What's remarkable about that, once you really start thinking about it, is that whenever we have a conversation with each other, we believe what the other person is saying, because we assume there's truthfulness to everything the other person says. So when I tell you my workplace, you assume that I am telling you the truth about where I work, or what my last project was, or how many children I have. Imagine every conversation here, and how much more difficult it would be if you didn't believe anything other people said to you. We have a degree of trust in each other, and we assume that whatever we say to each other is true.

Now listen to this next clip: "I thought, if by some slight chance I would have it, and it really became a person someday, wouldn't they love to see the photograph of the EPT with the positive sign on it, knowing that that was, like, its first pregnancy test?" So she was saying: if I had a positive pregnancy test, I would love to save it to show my kid in the future. And he's like, no, I don't think that's a good idea. And I mean, you can see it: if your parents pulled out a pregnancy test when you were 15 years old and tried to show it to you, you'd be like, that's kind of gross. For people who aren't familiar with pregnancy tests, it's a thing that you pee on to find out whether or not you're pregnant. What happens in this conversation is that the first thing he says is something that's true: he's like, I don't think that's a good idea. Generally, we tell each other the truth when we're having a conversation.

The next of these is that we are as informative as required.
We don't say any more than we really need to say in a conversation, and we don't say any less. This is broken all the time on television: on TV, people always say a lot more or a lot less than is needed in a conversation, because it makes for an interesting story. But in real life, between human beings, we just say what we need to say to each other. So let's listen to an example of that, where the issue of the pregnancy test didn't come up because it wasn't relevant to what was happening: "I didn't mean to not tell you at all. It was just, we would hang up the phone, and then I thought, oh yeah, I forgot to mention that I'm kind of late, because I didn't think it was a huge deal, but then..."

So people are like, what is the huge deal? She has had multiple conversations with her boyfriend, and she totally forgot to tell him about this pregnancy test, and that she was late, because it wasn't really important to the conversations they were having. It wasn't necessary for her to tell him at the time; it wasn't relevant to what they were talking about. Then: "So, how are you, aside from that?" "Good, good. We had such a nice day. It's really beautiful here, like 70 degrees and sunny, and we walked around, we went to a park, and we went out for dinner with Joy."

So after she's like, "yeah, I didn't really tell you about that pregnancy test, it just didn't seem important at the time," he finally goes, "well, aside from this pregnancy test you never told me about, how's life?" And at that point they completely switch gears. They both agree that the pregnancy test is no longer relevant to the conversation, and now they're talking about the weather. This is something human beings are really good at: we talk about what is relevant at the time in a conversation. It could be that we start talking about the weather, and then somebody switches over to a new subject, say the triathlon that's going on outside over the course of the weekend. If somebody else then starts talking about the weather again, and it's not in the context of the triathlon, it's incredibly confusing. So we try to stay relevant to what's happening.

The last is to be perspicuous, and this means that we follow these cooperative principles of conversation as we speak with each other: we do everything we can to keep a conversation going, and we really think about whether what we say makes sense, is truthful, and is just as much as we need to say.

As people talk to each other, our thoughts drive it. For example, my thoughts are very focused on coffee right now. If I ask someone whether they would like to get a cup of coffee, what happens is that I send signals from my brain down to my vocal cords, my vocal cords create a sound wave, that sound wave is heard by your ear, and then about 30,000 nerve endings send that information up to your brain. And if I do a good enough job of describing getting a cup of coffee ("hey, let's walk down to the coffee shop and get a cup of coffee"), then what I can do is create a picture of
that same thing inside of your brain. That's what we're doing over and over again when we speak to each other, especially when we use action-oriented words: we're creating pictures in each other's brains.

But this doesn't tell us anything about what a machine sees. If I say "would you like to get a cup of coffee?", it looks like this. This is my own waveform for "would you like to get a cup of coffee," and you can tell I'm from the United States because "would" and "you" are combined into a single word, "wouldja," and then the rest is "like to get a cup of coffee." This is not actually what a computer makes sense of, though. What it does is take this waveform and turn it into a spectrogram, which has three dimensions instead of just two for understanding information. This is the exact same phrase, but as a spectrogram. The parts that are bright yellow show the intensity of speech, and if you look at enough spectrograms over time, or you take a course on them, you can actually learn to read spectrograms and understand what people have said just from looking at one.

A natural language system takes this spectrogram. We start with a human being, they say something, and the end of this is called the endpoint. And I have to tell you, when it comes to endpoints: come on, computer. What's happening all of the time is that any machine with a voice UI is really just waiting for you to stop talking. It has to wait until you're done talking so it can figure out what you said and make sense of it. That means that if you go and look for scholarly papers on voice interfaces,
there are a lot of great ones from researchers at Amazon, Microsoft, IBM, Google, and also at SoundHound, about how they're trying to reduce the amount of time it takes to detect the end of a stretch of speech and then start processing that language.

I have to say that, as a person who lives on the west coast of the US, my pauses between parts of speech are really long. Deborah Tannen, an esteemed linguist, found that when she had a bunch of friends over in grad school, some from the east coast of the US and some from the west coast, at the end of the dinner the people from the west coast felt like they didn't really have a great time, because they could never get a word in edgewise, and her friends from New York City felt like the people from the west coast never really took part in the conversation. That's because if you live on the west coast of the United States, you have pretty long pauses between sentences when you're speaking, whereas if you're on the east coast, you have a much shorter pause between sentences, which means you're not waiting as long for somebody to jump into the conversation. It's definitely something to be aware of if you're from a place where people speak quickly and you're talking to anyone from California: we're much slower, it takes us a long time to get started talking, and the pauses between our sentences are much longer than in a lot of other places in the United States. That's really nice for a machine system that takes one sentence at a time, and really frustrating for any system that wants to hear multiple sentences, because it means the system thinks I'm done, over and over again, before I'm really done.

A machine takes that spectrogram and extracts features from it. It's looking for phonemes.
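That endpointing behavior, waiting for a long enough run of silence before deciding the speaker is done, can be sketched with a toy energy-based detector. The frame values, threshold, and pause lengths here are all invented for illustration; real systems work on acoustic features, not a hand-made list of numbers.

```python
# Toy endpoint detector: treat speech as a list of per-frame energy values
# and report the frame at which the speaker is considered "done".

def find_endpoint(energies, silence_threshold=0.1, min_silence_frames=30):
    """Return the index where a long-enough run of quiet frames begins,
    i.e. where the endpoint is declared, or None if the speaker never
    pauses long enough."""
    quiet_run = 0
    for i, e in enumerate(energies):
        if e < silence_threshold:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                return i - min_silence_frames + 1  # start of the pause
        else:
            quiet_run = 0
    return None

# A speaker with a short pause between sentences, then a long final pause.
frames = [0.8] * 50 + [0.05] * 10 + [0.7] * 40 + [0.02] * 60

print(find_endpoint(frames))                        # → 100 (the long final pause)
print(find_endpoint(frames, min_silence_frames=5))  # → 50 (the brief mid-speech pause)
```

Note how shrinking `min_silence_frames` makes the detector fire on the brief pause between sentences instead of the real ending: exactly the "it thinks I'm done before I'm really done" problem a slow-speaking Californian runs into.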
It's looking for the tiniest parts of speech. What it does with these tiny parts is take apart all the sounds, figure out what the different parts of speech are, and then use those parts to get to natural language understanding. This happens in a couple of different ways: historically, it used something called a hidden Markov model (it's okay if you don't know what that means), and now a lot more systems are using neural networks to determine natural language.

While all of this is happening, a thing called a dialogue manager is keeping track of everything that the natural language system needs to know. If you ask a natural language system, "how long will it take me to get to the Lisbon marina?", what it needs to keep track of is: what is the time right now, what will the time be when you get there, where is the place you're going to, where is the place you're coming from, and which type of route are you going to take. Then it needs to convert all of this into actually generating natural language. Right now, most of the systems we use have a preset number of words. If you think about something like the Amazon Echo, it sounds really remarkable because it actually has a very limited number of responses. Something like Google Now is starting to create more natural language and doesn't just have pre-recorded responses, so it actually doesn't sound as good, but it's trying to generate natural language through speech synthesis. In some systems you'll have entire sentences recorded; in others, single words; and in another set of systems you'll have phonemes recorded, and those phonemes can be put back together to create new words. Once speech is synthesized, it comes back to a human being as a waveform that we hear.

So I want to show you a little bit of how a hidden Markov model works. Does anyone know what the word is up here?
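At its core, the hidden-Markov-model approach mentioned above scores how probable a sequence of heard sounds is under each candidate word, one sound at a time. Here's a toy sketch of that scoring idea; the phoneme labels and every probability are made up for illustration, nothing like a real acoustic model.

```python
# Toy HMM-style word scoring: each candidate word is a sequence of expected
# phonemes, and we score how likely the heard sounds are under each word.

# P(heard sound | expected phoneme): invented confusion probabilities.
CONFUSION = {
    ("t", "t"): 0.9,  ("d", "t"): 0.1,
    ("ah", "ah"): 0.8, ("ow", "ah"): 0.2,
    ("m", "m"): 0.95, ("ey", "ey"): 0.85,
    ("ow", "ow"): 0.9, ("p", "p"): 0.9,
}

WORDS = {
    "tomato": ["t", "ah", "m", "ey", "t", "ow"],
    "potato": ["p", "ah", "t", "ey", "t", "ow"],
}

def score(heard, expected):
    """Multiply, sound by sound, the probability of hearing each sound
    given the phoneme the word predicts at that position."""
    if len(heard) != len(expected):
        return 0.0
    p = 1.0
    for h, e in zip(heard, expected):
        p *= CONFUSION.get((h, e), 0.01)  # small floor for mismatches
    return p

heard = ["t", "ah", "m", "ey", "t", "ow"]
best = max(WORDS, key=lambda w: score(heard, WORDS[w]))
print(best)  # → tomato
```

A real recognizer also has transition probabilities between hidden states and uses dynamic programming (the Viterbi algorithm) to handle sounds being inserted, dropped, or stretched; this sketch keeps only the per-sound probability idea.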
Tomato. What it does, essentially, is statistical analysis. It says: all right, we just heard a "t" sound, and next we heard "ah," and then "mmm," and it goes through each step asking, what's the probability of each of these sounds happening, and in the end, when each of these sounds happened, what word was that?

What's crazy is that we started with hidden Markov models, but now we use neural networks with an enormous hidden layer. The thing I really want to emphasize with a neural network is that when I say "hidden layer," I mean it literally: people have no idea what's happening in the middle part here. It's completely hidden from human beings' ability to understand or see it. But a brief overview of how it's understood is this: if you take a neural network and input a bunch of data into it, for example about Paris, and then you subtract all of the data about France and add in data about Italy, the neural network can determine that what you're talking about is Rome. It really feels like magic, because you can't see inside of it; it's hidden, and you don't know how it happens. The same thing works with something like summer: take the concept of summer represented in data, take out sun, add snow, and you get winter. And, with a very American example, if you take away the concept of bat from baseball but add in racket, the neural network says the closest thing this data has is tennis.

So when we're thinking about how we interact with speech systems, not just hidden Markov models but neural networks that perceive information in a really different way than human beings do, what's really important to know is that our brains aren't like machines, and the things in machines that we call brains are not like human brains at all either. They're two completely different worlds. What's really important for us is not to make systems that make it easier for people to talk to machines, but to
create machines that really follow the cultural practices we already have in place.

So what I want to cover now is how we can create stronger voice UIs. These conversational interfaces can really be improved along what I call the four C's: cohesion, cadence, context, and being comprehensive.

First, cohesion. Cohesion is the glue that holds language together. When I say something like "so," that's a discourse marker, and those discourse markers are what make a conversation make sense. I want to show you a sentence: "My kid only eats rice. He can't survive on that." Who is the "he" here? It's my kid. What is the "that"? Rice. That's a mind-boggling thing for a computer to understand, but most people in here can get it pretty easily. Those substitute words are part of creating a cohesive conversation. If I said, over and over again, things like "my kid only eats rice; my kid can't survive on rice," I would sound a lot more like a machine than an actual human being. So it's important to make sure that the voice interfaces we design actually have cohesive language.

Here's an example of where cohesion didn't really come into play. On Google you can say, "how tall is the Burj Khalifa?", and it gives you an answer. Then I said to it, "where is it?", and what's neat is that Google can now understand pronouns. That's nice: it knows that when I say "where is it?", I'm talking about the Burj Khalifa. But then what it does, which is really, really frustrating, is change the query while you're looking at it: it changes it to "where is the Burj Khalifa?"
That is a really unhuman thing to do. The substitution is there for clarification, but the screen already says "Burj Khalifa" in a couple of other places, so it's not really necessary, and it actually breaks the cohesion of the conversation I'm having with that interface.

The next is cadence, and this is where it's really important to have a voice interface with an incredible amount of intelligence and a lot of recordings behind it. Cadence is rhythm, stress, and intonation. With something like "do you prefer New York or LA?", when I say that, my voice goes up on "New York" and down on "LA." What's interesting is what happens once you get to three parts: "New York, LA, or Seattle." If you're recording for a voice interface, that means that for every single word, you have to record it with different intonations and different pitches, on the way up and on the way down, for it to make sense. The technology is getting better at synthesizing different pitches and different emotions, but it's something that's really essential. Anytime you want to find out whether a voice UI is really great, ask it for something that comes back as a list or a long sentence, and you'll see that the stresses and intonations just aren't quite right.

The second to last, which is a form of cohesion if you're keeping track, is context. If I ask you the weather tomorrow, where do I want to know the weather for? Right here, where I am. Where are we right now? In
Lisbon. I didn't have to say, "hey, do you know the weather for Lisbon tomorrow?", and you'd probably look at me a little strangely if I asked you for the weather for Lisbon, because we're here right now, and we both implicitly understand the context. On a device, we have to set context. Here's a setup screen for the Amazon Echo, and you can see that I have a lot of Amazon Echoes and that mine is in Santa Cruz, California, in the United States. If you have an opportunity to get information in advance from your customer, that's really helpful. During the setup process, the Echo asks you to tell it where you are, and that's really nice, because at that point it can give me weather without me having to say, "what is the weather for Santa Cruz, California?" It's also important, as you add new features to a system, to prompt your user for additional information, like their work location, so you can give them commute times as well.

The last thing, which voice interfaces aren't really doing, and the most difficult of all of these, is to create something that's comprehensive. This is a small portion of the questions you can ask an Amazon Echo, and Google has its own list of all the different actions you can take on a Google product. As long as we need lists like this, people aren't going to use voice interfaces. If we think about conversation with another human being, it doesn't have a failure mode where it asks us to install a new application or tells us that we can't do something. Anytime we converse with a human being and we have a mismatch in understanding, or a person doesn't get what we're talking about, we can ask for clarification and we can learn from each other. But voice systems aren't universal, and they're not comprehensive, and until they reach that level of comprehensiveness, they'll always be thought of as expert systems that can just do one thing. So if you have a voice system on your TV, you're going to make assumptions about it that are relevant just
to your living room, and if you have a voice system in your kitchen, you'll make assumptions about it in your kitchen. So the real challenge is creating a voice system that can be as flexible and dynamic, and can learn, the way a human can, or maybe even better than a human being. If you want a conversational interface that really works, it will need to have cohesion, so it doesn't sound stilted, like a robot; it will need excellent cadence, the right rhythm, stress, and intonation; it will need to understand context; and it will have to be comprehensive. If we can do all of those things, we can go from having a voice interface that somebody has to talk to, to one that people love.

Thank you very much.