And I'm here to talk a little bit about situational design and the idea of how to shift from screen-based devices to voice-first. Let me introduce myself. My name is Sohan. I work at Amazon as an Alexa evangelist. My office is right across the road, basically, so I just walked over.

So Dan spoke a little bit about the history of design: how we went from punch cards to character mode, and then from character mode to graphical user interfaces, where for the first time we could actually see a file on our system, or use a piece of hardware whose movement translated onto the screen, the computer mouse. After that, in the 90s, we moved to the era of the web, where for the first time you could get information at your fingertips. You had to use this thing called a browser, and I definitely remember using early versions of Internet Explorer and Netscape Navigator, waiting five minutes for a static website to load. In the mid-2000s, how we use tech changed again. What you can see here is an evolution of how we interact with technology. We came to the age of smartphones, which is something we are all familiar with. If you look at the 15 or 20 years the mobile phone has been around, think of the changes we have seen in just that period. We've gone from basic, clunky phones with small screens and tiny keypads to the advanced smartphones we carry in our pockets today. And think of the interactions you have on your smartphone: you swipe right, you pull to refresh, you pinch to zoom. None of these existed before. As tech changes and evolves, so does how we interact with it.

So, as you can see, every ten years or so there is a shift in how we interact with technology. And as if on cue, sometime in the mid-2010s we finally reached a point where we could interact with machines via our voice. You might think: as humans, speaking and conversing come naturally to us, as spontaneously as a baby's first words, so why is this such a big deal? Why did it take so long to happen? Well, the simple answer is that it wasn't easy. I think we're now at the stage where we have the algorithms, the data, the machine learning, and the processing capabilities that make it possible for us to speak to machines and for machines to understand what we're trying to say. So we really believe that voice represents the next major disruption in computing.

It was with that belief that we launched the Alexa-based devices a few years ago. In India, Alexa launched last year, and we launched the Echo. Does anyone here own an Alexa-based device, by chance? A few of you? Awesome. As you can see, it's basically a device that lets you voice-control your world. Listening to music, switching smart home lights on, getting cricket scores: all of these things you can do just with your voice. And we have a vision for Alexa to be everywhere. So it's not just something you interact with at home to play music; you would interact with Alexa on the go, give voice commands to get something done while you're out, at home, and even at your workplace. That's really our vision for Alexa.

Now, I'm here to talk to you about how we can design for voice. That's the main topic of conversation. Before that, though, I need to explain one thing, which is what a skill is.
Essentially, a skill on Alexa is like an app on your phone. Typically, the first thing someone does when they buy a new phone is download apps; you download apps to enhance the capabilities of your phone. Similarly, we have skills on Alexa. Right now, we have 80,000-plus skills worldwide, and there is a skill store. So if you want to, say, book an Ola cab or get the cricket score on Alexa, you need to enable those skills on Alexa.

All right, so let's talk a little bit about how we have designed for different platforms. So far, we have pretty much only been designing for screen-based devices. One thing that stays common, whether you've designed for a desktop application, for the web, for a smartphone, or even for voice, is that if you understand your consumer, your user, and you understand the platform and the product they're using, you'll probably design a good experience. The problem is that so far we have probably not understood voice and its capabilities, and we are still thinking in a very screen-based way when designing for voice. But it is fundamentally very different. So I'm here to talk to you a little bit about what's similar and what's different when designing for voice.

All right, let's take a look at a couple of contrasts between designing for screen and for voice. The main contrast with graphical user interfaces is that seeing essentially expects uniformity, whereas listening expects variety. It's a subtle difference. If you were to open any app, say a particular app you use 50 times over the next 10 days, the app is going to look the same every time. And that's a good thing, because the way we learn to operate visual interfaces is by learning where things are and what they do. That's essentially how we learn to operate any user interface. If you don't believe me, here's an experiment: take out your phone, think of the app you use most often, and change its location. The next time you want to use that app, you'll catch yourself going to where it used to be, because you've taught yourself that that's where the app is going to be. That uniformity is how we learn user interfaces. When it comes to voice design, though, it's not quite the same. The human ear isn't built to listen to the same thing over and over again; hearing the same thing repeatedly is painful, it's annoying. So with listening and with voice interfaces, you really want variety in what you say, and we'll talk a little more about that later.

This is also a big contrast: with screen-based devices, there is a defined, happy path for doing something. If you want to buy a product on amazon.in, you go to amazon.in and you see the exact same interface: a search bar on top, a menu below that, some deals. There is a defined, happy path to buying a product on amazon.in. Similarly, if you want to tweet something out, there is a defined, happy path: you go to twitter.com, type out your tweet, and hit the tweet button for the tweet to go out. With voice, though, there is no such guidance. A user can say whatever they want, and it's really up to you to figure out what the user is trying to say and get it done.
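Coming back to the first of those contrasts, that listening expects variety: one simple way a skill can avoid sounding repetitive is to rotate among several equivalent phrasings of the same prompt. Here is a minimal Python sketch of that idea; the prompt texts and the welcome_prompt function are my own illustration, not from the talk or any real skill.

```python
import random

# Several phrasings of the same welcome prompt. Rotating among them keeps
# repeat sessions from sounding identical, which the ear quickly finds grating.
WELCOME_PROMPTS = [
    "Welcome back! Where would you like to go?",
    "Hi again. What destination are you thinking of?",
    "Good to hear from you. Where should we plan a trip to?",
]

def welcome_prompt() -> str:
    """Pick one of several equivalent phrasings at random."""
    return random.choice(WELCOME_PROMPTS)

if __name__ == "__main__":
    # Each launch may greet the user slightly differently.
    print(welcome_prompt())
```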
So those are big differences when you're fundamentally designing for something screen-based versus something voice-based. A couple of other small, subtle differences: graphical user interfaces are typically designed for writing and reading, for someone to read, whereas voice user interfaces are designed for speaking and listening. A lot of things that seem fine when you type them out don't sound quite right when you hear them spoken. A good example, I think: has anyone both read the Harry Potter books and seen the movies? I have. You'll find that the dialogue in the movies is slightly different from what you read in the books, and the reason is that it doesn't translate the same way. This happens with most movie adaptations, because dialogue doesn't come across the same way when you're reading it as when you're listening to it. So that's another big difference.

And one more small difference, but I really wanted to put this out there. It's not true in all cases, of course, but graphical user interfaces are largely individual. Even if you're at a table with a bunch of friends having lunch and there's a bet: someone says, hey, I think Virat Kohli has scored 35 one-day international centuries, and someone else says, no, I think that number is a lot higher. The interaction is still very individual: someone takes out their phone, quickly looks it up online, and says, okay, this is the number. With voice user interfaces, we've seen that most interactions are fairly communal, in the sense that the device typically sits in your living room, bedroom, or kitchen, and multiple people are listening to what it has to say, even if only one person is speaking to it. Of course there are exceptions, but largely we've found this to be the case.

So, keeping those contrasts in mind, we came up with four key design principles for voice. The key thing here is to think about how to go voice-first, and I think keeping these principles in mind emphasizes going voice-first while designing experiences for devices such as Alexa. So let's go. The first key design principle is being adaptable, which essentially means letting users speak in their own words. Like I said earlier, there is no defined happy path for doing something on a voice-based interface; on a screen-based interface there is. So you have to be adaptable, let users speak in their own words, and it's up to you to figure out what's happening.

So how do you do that? Let's go back a bit and do a little bit of a design talk. There's something called a user experience affordance, or simply an affordance. Does anyone know what an affordance is? Let me explain. An affordance is something that tells you how to use a particular thing. For example, the handle or plate you see on a door tells you how to open or close that door. It could even be a sign that says push or pull: that is the affordance that tells you how to use the door. There are different types of affordances. There are explicit affordances, like a big sign that says push outside a door, so you know you need to push the door to go through.
Then there are patterned affordances, which are basically things you see commonly. For instance, the Wi-Fi symbol: anywhere in the world you go, even if you don't know the language, if you see that dot with two or three arcs above it, you know there's Wi-Fi available. Similarly, you have the save icon, that old floppy disk, which is the universal symbol for saving. That's a patterned affordance, where a symbol has come to mean a particular thing. And of course you have negative affordances, like a grayed-out submit button on a form. You have a form with five text fields and the submit button is grayed out, and you automatically realize that something you've filled in is incorrect or you've missed a field. So that is an affordance.

When it came to design, especially in the early days of the web and even the early days of mobile, we essentially tried something called skeuomorphic design. And by we, I mean designers worldwide. Skeuomorphism is where you try to model real-world objects. For instance, if you remember the early days of both iPhone and Android, your volume control looked very much like this, resembling the volume knob of an amplifier. As you can see, there's a drop shadow and a bezel to give it that 3D effect, so it looks like it could be a real object. From there we moved to flat design. I'm guessing people remember that whole change, where iOS moved to flat design and Android did Material Design. By then there was enough familiarity with touch screens that you could move to something like a slider-based volume control. It doesn't really resemble anything in the real world, but people were familiar with what a touch screen does, and they knew that sliding their finger across that slider would increase or decrease the volume.

Now, in voice, what is your affordance? If I wanted to increase the volume, I'd say something like, turn up the volume. All good, right? You have an ideal user; the user said turn up the volume and everything's fine. But here's the thing: there is no defined happy path, because a user can say any of these things. A user can say volume up, or make it louder, or turn it down, or set the volume to six. A user can also say something without the words volume or loud and still mean the same thing. Like turn it up: it doesn't contain the word loud or volume, but it means the same thing. So what is your affordance?

In voice, you have something called utterances, which are the things a user says to your skill or to Alexa. For instance, let's take the example of a skill that helps you find the right college; let's call it College Coach. These are different ways people can say the same thing. They can say, Alexa, help me find the best colleges, or the best colleges in Bangalore, or colleges with a computer science program, or colleges with a computer science program in Bangalore. These are just different ways of asking for the same data; we've given different data points in each, but they are different utterances. Your affordance in this case is essentially something called an intent.
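To make that concrete, here is roughly what it might look like to declare several of those volume utterances as samples of a single intent. This is a minimal Python sketch loosely modeled on how voice platforms describe interaction models; the SetVolumeIntent name, the level slot, and the sample_count helper are illustrative, not any real skill's definition.

```python
# A hypothetical interaction model fragment: many different utterances all map
# to one intent, which acts as the "affordance" for changing the volume.
INTERACTION_MODEL = {
    "intents": [
        {
            "name": "SetVolumeIntent",
            "slots": [{"name": "level", "type": "NUMBER"}],
            "samples": [
                "turn up the volume",
                "volume up",
                "make it louder",
                "turn it down",
                "turn it up",
                "set the volume to {level}",
            ],
        }
    ]
}

def sample_count(model: dict, intent_name: str) -> int:
    """Count how many phrasings have been declared for a given intent."""
    for intent in model["intents"]:
        if intent["name"] == intent_name:
            return len(intent["samples"])
    return 0

print(sample_count(INTERACTION_MODEL, "SetVolumeIntent"))  # -> 6
```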
You'll find this word, intent, a lot in any conversational interface: Alexa, Google Home, chatbots, all of it. An intent is essentially the intention, the meaning, behind what the user said. Typically, you have machine learning models that figure out the intent of what the user said by matching it against utterances. What these programs are doing, and they're called natural language understanding, is converting human conversation, which is very unstructured, into something structured that a computer can understand. And they do so by breaking it down into an intent and something called a slot value. So in this case, anything to do with finding a college in any place or with any degree is matched to a particular intent.

Let's take another example. Say I was building a simple skill to give the weather. You tell me, what are the different ways someone can ask for the weather? Let's keep this interactive. How would you ask for the weather? Will it rain today, okay. Anything else? What's the temperature outside? Anything else? How's the weather? Yeah, that's probably the most obvious one. Anything else? Can someone think of a way to say this without the words temperature, weather, climate, hot, or cold? How's it outside? That's good. Should I carry an umbrella today, or do I need a coat today? These are all different ways of asking for the same thing, and it's up to us as voice designers to understand that. The way you do that is through this affordance called an intent, where you, as a designer or developer, say, hey, I have this intent called a get-weather intent, and any utterance to do with finding the weather is matched to that intent.

Besides intents, you also have this thing called a slot value, which is essentially any variable within an utterance. In this case, there are two variables: the type of degree someone might want from College Coach, and the city or location they want that degree in. This statement could easily be, Alexa, ask for colleges with an economics program in Delhi. So you have a type of degree and a location; those are slot values. Essentially, what the NLU does is break this human conversation down into what the intent is and what the slots are, and provide that to you programmatically. Then, as a designer or a developer, you're able to do something with that data, and you can be adaptable to different ways of saying the same thing.

Don't forget, there are also different ways of saying words like these. Suppose you want to capture the size of the college: maybe someone wants a large college, someone wants a small one. A user won't necessarily use your exact words. Instead of small, they might say tiny, or little, or modest, or small scale. Again, just different ways of saying the same thing; these are synonyms. So it's really up to us as the designer or developer of the skill to be adaptable to what the user is trying to say. When you're designing for voice, think of the different synonyms people can use when they're talking to your skill.
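As a rough illustration of what that structured output looks like, here is a small Python sketch: a hand-written stand-in for the NLU result, an intent plus slot values, and a synonym table that maps whatever size word the user chose onto a canonical value. All of the names and data here are hypothetical.

```python
from typing import Optional

# A toy stand-in for the NLU: unstructured speech goes in, a structured
# intent plus slot values comes out.
# "Alexa, ask College Coach for colleges with an economics program in Delhi"
# might come back looking something like this:
parsed_request = {
    "intent": "FindCollegeIntent",
    "slots": {"degree": "economics", "location": "Delhi"},
}

# Synonyms let users pick their own words for a slot value.
SIZE_SYNONYMS = {
    "small": {"small", "tiny", "little", "modest", "small scale"},
    "large": {"large", "big", "huge", "massive"},
}

def resolve_size(word: str) -> Optional[str]:
    """Map whatever word the user said onto the canonical slot value."""
    for canonical, synonyms in SIZE_SYNONYMS.items():
        if word.lower() in synonyms:
            return canonical
    return None

print(parsed_request["intent"], parsed_request["slots"])
print(resolve_size("modest"))  # -> "small"
```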
So that was the first principle. The second one is about being relatable, which means talking with your users, not at them. Let's take the example of a skill that helps you plan a holiday. You can't expect anyone to actually talk like this, but I'm going to have a go at it, so bear with me: no one is going to say something like, Alexa, open holiday planner and tell me the cheapest return flight from Pune to Goa on the 10th of January, returning on the 16th of January, and what offers are available. No one actually speaks like that, because it's not natural to us. As a developer or a designer, I might think this is the best way to do it, because it eases the burden on me: let my users figure it out, let them remember the exact syntax for talking to me. But that's not how we want to do it. You'd rather model the conversation on how it would go if I were to call my travel agent and book a holiday. That's what we really want to model the conversation on.

So it might look something like this, and I'll try to read it out. The first two turns are where I start the skill. I say, Alexa, start, let's call it Outdoor Guru, and the skill says, okay, welcome to Outdoor Guru, where would you like to go? I say Goa; it confirms Goa; I say yes. When do you want to leave? Next Friday. Until when? The following Tuesday. What would you like to do? I'll be fishing. There's a check at the end: did I get all this right? I say yes, and it says, okay, I have two ideas for you. Now, this is typically how we want to model a conversation between a user and an Alexa skill. And this is assuming everything goes well: you have the perfect user who answered every question just fine, and everyone's happy.

Very often, though, users will either over-answer or under-answer. Very often, when the skill asks, where would you like to go, the user will say something like, I'd like to go to Goa tomorrow. It would be really poor design to then ask the next question, when do you want to leave, because the user has already answered it. But you'd be surprised how many skills actually do this, because they follow a very linear path of asking questions, which isn't human. It isn't conversational; it's more like a computer. I have a fellow evangelist in the US, Paul, who often says that so far with interfaces we've forced humans to think like computers, because we've had to think in terms of drop-down menus and radio buttons, but now we're finally forcing computers to think like humans. So let's look at what happens if the user under-answers. The skill asks, where would you like to go, and the user says, I want to leave from Bangalore. Again, this happens often. If you then follow all your scripted steps, you're missing one piece of information, your tech probably fails, and your user gets a very poor response and thinks, I hate this. So it's really up to you to be relatable, to talk with your users and figure out what they're trying to say.
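One simple way to cope with over- and under-answering, sketched below in Python, is to keep whatever pieces of information the user has already supplied and only prompt for what is still missing, rather than marching through the questions in a fixed order. (This is the same idea the talk returns to later as a frame UI.) The slot names and prompt wording are my own illustration.

```python
# Required pieces of information for the hypothetical holiday-planner dialogue.
REQUIRED_SLOTS = ["destination", "departure_date", "return_date", "activity"]

PROMPTS = {
    "destination": "Where would you like to go?",
    "departure_date": "When do you want to leave?",
    "return_date": "Until when?",
    "activity": "What would you like to do there?",
}

def next_prompt(filled_slots: dict) -> str:
    """Return the prompt for the first missing slot, or a confirmation if done."""
    for slot in REQUIRED_SLOTS:
        if not filled_slots.get(slot):
            return PROMPTS[slot]
    return "Did I get all of that right?"

# The user over-answered the first question: "I'd like to go to Goa tomorrow."
filled = {"destination": "Goa", "departure_date": "tomorrow"}
print(next_prompt(filled))  # Skips straight to "Until when?" instead of re-asking.
```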
There are a few things happening in that Outdoor Guru conversation that I want to point out. As you can see, there's a confirmation of Goa, which we call a slot confirmation. A lot of the time, especially when your slot value is a proper noun, you might want to confirm with your user. Especially with place names: we're in Bangalore, and there's a neighbouring city called Mangalore, with an M. It's a nice coastal town, for those of you who haven't been, and the two can sound the same. This isn't point-and-click, where you know for sure what the user has chosen, so you might want to confirm with the user. You might also want to confirm all the information at the end. You've spent a good two minutes having this conversation with the user, so there's a good chance you want to confirm all of it. You can do something called an intent confirmation, where you say, hey, did I get all of this right, should I proceed? You're making your design a little more conversational, a little more voice-first.

All right, the third principle is being contextual, which is about interactions in context. Earlier I told you that if you open the same app 50 times over the next two days, you'll see the same interface, and that's a good thing; we don't want interfaces changing randomly every day, that would be a nightmare. With voice, though, and I think with most conversational agents, you really want to tailor your interactions over time. Here's an example. The first time I use a skill, let's say the travel planner skill, I'll probably hear something like: hey, welcome to Travel Planner, here's how I can help you, what would you like to do today? There's a good introduction and a few tips on how to use the skill. The fifth time I use the skill, I don't want to hear that same thing again, because I already know how to use it, and the human ear hates repetition. You also want to tailor how you respond. Maybe the user has given you some preferences; you don't want to ask the same question over and over again.

Some skills do this really well. For example, the Domino's skill in India: the first couple of times, it asks you a whole bunch of questions about the toppings and the type of pizza you want, but then it learns all of that, and after that it says, hey, last time you ordered this, is this what you want? And a lot of users just go ahead and say yes, because that's what they're using it for; their kids want the same pizza every Friday. There's also a great foodie skill that asks for the user's allergies the first time they use the skill, and then tailors every response by never suggesting a dish that contains that ingredient. The user says, hey, I'm allergic to peanuts, once, and you never suggest a dish that contains peanuts again. That's something you've implicitly learned, and it matters because you can't voice out 20 or 30 different options the way you can on a screen-based device. You maybe have the user's attention for three to five seconds, so you want to say, here are the top three things, and the user will probably try one of them. So really learn to tailor your responses over time.
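Here is a minimal sketch of that "be contextual" idea: the greeting shortens once the user knows the skill, and a remembered preference such as a peanut allergy quietly filters later suggestions. The data structures and wording are hypothetical, just to show the shape of the logic.

```python
# The greeting adapts to how often the user has launched the skill.
def greeting(launch_count: int) -> str:
    if launch_count == 1:
        return ("Welcome to Travel Planner! I can find flights, hotels and things to do. "
                "Try saying: plan a trip to Goa.")
    return "Welcome back. Where to this time?"

def suggest_dishes(dishes: list, allergies: set) -> list:
    """Drop anything containing a remembered allergen before reading options aloud."""
    return [d["name"] for d in dishes if not (set(d["ingredients"]) & allergies)]

user_profile = {"launch_count": 5, "allergies": {"peanuts"}}
menu = [
    {"name": "pad thai", "ingredients": ["noodles", "peanuts"]},
    {"name": "veg biryani", "ingredients": ["rice", "vegetables"]},
]

print(greeting(user_profile["launch_count"]))
print(suggest_dishes(menu, user_profile["allergies"]))  # -> ['veg biryani']
```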
Also, think about the role your skill plays. Typically we see it fall into one of three buckets. It could be a very informational, transactional skill, something like a weather skill, where the user says, hey, get me the weather in Bangalore. There's no point in saying, hi, I'm the weather skill, here's how I can help you, join me on facebook.com slash weather. You really don't need that. The user needs a piece of information; you give the user that piece of information. Then a lot of the time there's a concierge type of skill, where you're hand-holding the user through a certain process. The travel planner example I just showed you falls under this, where you're guiding the user through the ticket-booking process, and there you might want to be a little more conversational. It's okay to have a welcome note; it's okay to have confirmations, intent confirmations, and things like that throughout the process. And thirdly, we see a lot of entertainment skills: audio games, voice games, choose-your-own-adventure stories. There you really want to see how your voice design can augment the experience. A lot of people use voiceover artists, generated character voices, and sound effects to create an atmosphere around their skill. So think about the role your skill plays and how your voice design is affected by it.

And the last principle is about being available, and this one's quite interesting. It's basically that you need to rethink some of your menus and your top-level UI. Very often we're stuck in a phase where we start with flowcharting, saying, okay, this is how I'm going to structure my data because this is how it looks on the website and in my app, which is fine there. The way it works on a screen-based device is that you have a limited set of pixels on the screen, so a designer, a head of UX, or whoever decides these are the most important things and this is our hierarchy of data, and that's what gets built. So typically you follow what we call a graph UI, which is a very flowchart-like way of describing data.

A good example, I think, is how you get the IFSC code of your bank account, or the IBAN for the international folks here. Typically you'd go to your bank's website, log in, go to your account, then probably click on my account info, then my account statement, and then IFSC code. There's a clear hierarchy of steps to get the IFSC code for your bank account. No one is going to do that with voice. No one is going to say, hey Alexa, open my banking skill, open my account, open so-and-so, what is my IFSC code? They'll straight away say, Alexa, open my banking skill and tell me what my IFSC code is. So instead of a very deeply nested UI, you want a wider top-level UI, where IFSC code sits at the top level and a user can enter at any point.

Let's take a couple of examples, and hopefully the audio plays. Find a movie near me. You can browse movies that are coming soon, or I can give a recommendation. What would you like to do? Find A Wrinkle in Time. You are in the browse movie section. Say stop to return to the main menu. Did you notice what was happening there? This is a simple skill that helps people find movies near them. The user came to the main menu, where there were two options: search by movie title or find movies near me. He said find movies near me, and the skill asked, do you want to see coming soon or recommended? But the user already knew what he wanted to watch, A Wrinkle in Time, so he said the title, and the skill still didn't accept it.
And that's because people have thought in terms of flowcharts, in a very rigid way, when designing for voice. This would have worked on an app or a website, no doubt; there's a clear hierarchy of data. But on voice, the user knows what they want to watch. Let's take another example, which illustrates it a little better. You can place this order or change something. Which would you like? Change the location. I'm sorry, I didn't quite understand. You can place this order or change something. Which would you like? Change something. Okay, you can change the pickup time, change the location, change the pickup method, or change the order item. What would you like to do? Change the location. Yeah, I think it comes across a lot better in this one. The user came to the main menu, where the two options were place the order or change something. The user knows in his mind that he wants to change the location, so he says change location, and the skill says, sorry, that's an error, you can either place an order or change something. So he has to say change something and only then say change location. Because of that very nested way of structuring the interaction, he can't say change location directly from the main menu, which is a big drawback, of course. So really rethink how your menus are set up. If I could just go back: if you had a wide top-level UI, change location, change something, and so on would all sit at the top, so a user could just say, Alexa, open the coffee ordering skill, and then say change location. So really rethink some of these things.

We also talk about a frame UI, which is what you see here, and which is essentially how you handle over-answering and under-answering. You want certain pieces of data within that frame, the dotted line you see there. In the travel planner skill, for example, you need four pieces of information: the date you're flying out, the date you're returning, your source, and your destination. Until you have elicited those four pieces of information, you don't exit that frame. This exists because people tend to over-answer or under-answer. If you follow a flowchart way of representing the data, you'd march straight through the steps and either miss something or ask the same question repeatedly. Having that frame UI, as we call it, makes the interaction a lot more conversational and ensures you actually complete the conversation you're having with your user.

All right, so those were the four principles. I'm just going to skip past these. Just to recap: we spoke about being adaptable, which is letting users speak in their own words; being relatable, which is talking with your users and not at them; being contextual, which is interactions in context; and being available, which is rethinking your UI and how your data is structured. So we have all of this, and I have ten minutes left. How do you actually put this into practice? You might be thinking, okay, these are the principles, that's great, I'm working on a skill or some voice-based interaction, how do I take this back to my team, what do I do right now? So let's talk about how to approach this for voice. Again, I think flowcharts are a poor way to design voice interactions.
Not only does a flowchart fail to capture a lot of this, but if you're building a reasonably sized skill it becomes really large and ungainly, and it doesn't account for the different situations that might arise within your skill. So we recommend a turn-based approach, where this is your core interaction. Each turn has four components, and it always starts with something the user says. That's the utterance, what you see at the top. For instance, the user says something like, Alexa, open travel planner. The second component is the situation at that moment. The situation could be that this is the first time the user is using the skill, or that the user has used the skill 300 times, or that the user has linked an account with the skill; it could be a whole bunch of different things. Based on the utterance and the situation, you give a response, for example, hey, I've got five holidays planned for you, which one would you like? And that last part is the prompt, where you're prompting the user for the next action.

Let's take a look at how this works. You would typically have multiple turns making up one storyboard, one interaction. Let's go back to College Coach and look at the first-time storyboard: what happens the first time a user opens College Coach? The user says, launch College Coach, and I'm looking at this here. The situation is that it's the first time, so the skill says, hey, welcome to the skill, I can help you find colleges and you can rate those colleges, where would you like to go? Now the user says something like, I don't know. Already we've got a slightly different response from what we expected, but that's fine; that's what will happen on voice-based devices. The situation now is that you need more information: there's not much to go on with "I don't know", and this is the first time the user is using the skill. So you respond, okay, no problem, and prompt the user for the state they'd like the college to be in: hey, which state would you like this college in? The user says Karnataka. At this point you have enough information to give something to the user, so the situation, and I'm looking at the third turn there, is that you have enough info. So you say something like, Karnataka, nice, that's a great place to study, and recommend a college, say NITK. This is typically how you design the first-time interaction.
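As a rough sketch of how such a storyboard might translate into code, here is a small Python function where each turn takes the utterance and the situation and returns a response plus a prompt. The matching is deliberately naive (exact strings instead of real NLU), and the situation flags are my own invention, just mirroring the first-time storyboard described above.

```python
# Each turn: (utterance, situation) -> (response, prompt for the next action).
def college_coach_turn(utterance: str, situation: dict) -> tuple:
    if utterance == "launch college coach" and situation.get("first_time"):
        return ("Welcome to College Coach! I can help you find and rate colleges.",
                "Where would you like to go to college?")
    if utterance == "i don't know" and situation.get("needs_more_info"):
        return ("No problem.",
                "Which state would you like the college to be in?")
    if situation.get("has_enough_info"):
        return ("Karnataka, nice. That's a great place to study. How about NITK?",
                "Would you like to hear more about it?")
    return ("Sorry, I didn't catch that.", "What would you like to do?")

print(college_coach_turn("launch college coach", {"first_time": True}))
```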
Now, what happens if the user has liked the skill and come back a few times? Let's see what the fifth interaction might look like, the fifth-time storyboard. Again, the user says, launch College Coach. The situation now is that the user has used the skill five times, so you probably need to keep track of that, and your response could be something like, hey, a five-time streak, that's awesome, how was your visit to so-and-so college? That's the prompt. The user says something like, I'd rate it four out of five stars. At this point the situation is that the skill knows there's a college assigned to that user; they've visited and rated a particular college. So you say, okay, four out of five, glad you liked it. What degree are you interested in? The user says something like engineering. Again, at this point you have enough information to say, okay, that's great, and recommend a college.

So far, the user has done everything according to script; they haven't deviated too much. Let's look at what happens when the user says something we don't expect. We call this the disrupted storyboard. The user launches the skill, it's a five-time streak, and we say, hey, a streak, how was your visit? The user says something like, two out of five stars, didn't like it so much. At this point the situation is that a college has been assigned, but the user rated it two out of five, so we say, hey, that's okay, not a problem, let's keep looking, and the skill asks, okay, what size of college would you want? At this point the user chooses not to answer that question directly. They don't respond with large, medium, or small; they say something like, hey, what's my favourite college again? They've used the skill five times and rated each of those colleges. If you were designing with a flowchart, the traditional way, it would have failed right there. But with this approach, you know the situation: you have enough colleges with ratings for this user, and you know what their favourite is; they've rated four colleges already. So you say, hey, you rated NITK four out of five stars, do you want to learn more about it? And the user doesn't answer that either; they say, tell me a college with the best sports facilities. Again, the situation is that you have enough information to make that call: you already know your list of colleges and what each one is good at. So you say, okay, MSRIT probably has the best sports facilities. This is how even those disruptions in the conversation are handled and taken care of from a design point of view.

And this is how your design document would typically look when you present it to whoever is developing the skill. Everything is covered: the different utterances, the situation, and what the responses and prompts can be. Even if there are disruptors, or something comes in from somewhere else, you're still able to handle all of them because your design is strong.

All right, I just wanted to share this whole approach with you, and I think something Dan said is something we also often say, which is that when it comes to conversational interfaces, please start with people, not computers. I run a lot of dev workshops teaching people to build skills, and everyone is so excited, but the first thing they do is hit the database or start building their backend. Especially with conversational interfaces, you really want to start with this: what are the dialogues, what are the conversations you'll have? So really start with people, not computers. All of this is on this link, alexa.design slash situational design. It's a nice guide on how to transition from screen-based devices to voice-based devices. I'll leave it up for a second if you want to take a picture. All right.
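Before wrapping up, here is one more small Python sketch, this time of the disrupted storyboard above: when the user ignores the question they were asked, the skill answers from what it already knows instead of failing. The ratings and facilities data are purely illustrative.

```python
# What the skill has already learned about this user across sessions.
ratings = {"NITK": 4, "MSRIT": 3, "Manipal": 2}
facilities = {"MSRIT": "sports", "NITK": "research"}

def handle_disruption(utterance: str) -> str:
    if "favorite college" in utterance:
        best = max(ratings, key=ratings.get)
        return f"You rated {best} {ratings[best]} out of 5 stars. Want to hear more about it?"
    if "sports facilities" in utterance:
        best = next(c for c, f in facilities.items() if f == "sports")
        return f"{best} probably has the best sports facilities."
    # Otherwise fall back to re-prompting for the slot we were trying to fill.
    return "Okay. What size of college would you want?"

print(handle_disruption("what's my favorite college again"))
```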
I just want to end with this quote from Gartner, a company that does a lot of industry research. They say conversational platforms will drive the next big paradigm shift in how humans interact with the digital world. I think this is very powerful, and I really do think conversational interfaces will be the next big thing. We're moving away from screen-based devices, and I think the big advantage of conversational interfaces is how they lower the barrier to accessing technology. I think we've all been there, teaching an elderly person, a parent or grandparent, how to use a mobile app by saying click here, click here, click here. And this is what happens. But with conversational interfaces, you just need to know the language and you should be able to interact. Of course, I think we're a couple of years away from that vision, but that's definitely where we're headed.

All right, this is an image you've probably seen a lot. It's called the March of Progress, showing how we evolved from apes to mastering tools to working at a computer. We often see it on T-shirts or Instagram posts. We truly believe this is the next major phase, where you can talk into the ether, get responses back, and have your hands free to do other things. Anyway, I hope you enjoyed this talk. Thanks a lot, thank you. We have time, so I can take questions for two or three minutes. Okay, if anyone has any questions, yes.

Thanks, Sohan, it was a wonderful talk. Thank you. One question I had: in one of the slides you mentioned it's a communal thing, that there are many people who might be using it. I said largely; there are definitely exceptions, yeah. Yeah, for my example, let's say it's in a house and two or three people are using it. You also spoke about someone having a food allergy that another person might not have. When you keep searching, it will still negate all those other options. How do you tackle those kinds of edge cases? Yeah, so there's a tech-based answer for that. Alexa itself, and maybe other conversational agents too, has something called a voice ID, which can recognise different voices in the same household, and you can totally use that for personalisation. For instance, if I say, book me a cab to work, versus my wife saying that, it would book us cabs to different destinations. So that's one way you can tackle it. Of course, maybe I have an allergy but someone in my family doesn't, so you can always make different design decisions about how preferences get changed and so on. But from a tech point of view, that's largely solved with voice ID. Yeah, I'll take yours, and then I'll come here. Yes, sorry.

Beautiful talk, but I have a question. When I use an app, I can see colours, touch icons, maybe hear the press of the screen, so I use my senses, and from my point of view these kinds of things create a connection with users in many ways. First of all, do you agree with that or not? And if so, how can you create that kind of connection using voice? Yeah, that's a great question, actually, because communication is not just about talking. I can say something like, hey, where did you get that, it's nice, and you automatically know I'm talking about your t-shirt. So it's gesture-based, and like you said, it's about body language and so on.
We're definitely looking at a future with ambient computing, where all of that becomes input to your conversation. Right now, I don't think we're at that state from a technology point of view, but that is definitely the future. Just from an Amazon point of view, there's something called the Echo Look, which is sort of a mirror with a camera that tells you what to wear, and that's maybe the smallest first step towards achieving it. I think that's the future, and I definitely agree with what you're saying; I think that's where we'll end up eventually. Thank you.

Hi, this is Clement here. Overall, a good session; I got an understanding of how to design for voice. Two things. One is feedback on the Alexa app itself: if you have to go to a skill, it takes quite a number of clicks to get there. My recommendation is to have it on the home screen, like a tile you can tap to go to the skill. That's one piece of feedback you can share within your organisation. The second thing is about feedback and how you incorporate it into your skill. What I see is that Alexa sometimes makes the same kinds of mistakes when I'm using it, and it really doesn't pick things up even after three or four months. So is there any plan to incorporate user mistakes, for example, into the skill so that it becomes more useful for an individual?

Okay, so you're asking whether, as the designer of a skill, you should incorporate things people could potentially say wrong and handle for that? Oh yeah, absolutely. In fact, it's interesting: I chose College Coach as an example because a few people on my team built it, and it's on Twitch; if you go to twitch.tv slash Amazon Alexa, there are five episodes of them building College Coach. It's essentially a skill to help teenagers find a college. They spent a lot of time on the voice design, and they said something like, hey, teenagers are angsty, they'll use rude words and get frustrated easily. So they actually built a teenage-angst intent into the skill. If the skill threw an error and a teenager said a bad word, or said screw this or whatever, it would actually say something back and redirect them to the right path. Which I thought was really thoughtful voice design, because if you start with the code, you typically wouldn't do something like this. So you can absolutely do something like that, and that is good, thoughtful voice design.

My question was, do you actually collect feedback on user mistakes where Alexa is not able to respond correctly? Yes, exactly, we do. Unfortunately, that's limited to the app right now. For instance, if I say, Alexa, play Gully Boy songs, and it plays me some other song and doesn't do what I asked, I can go to the companion app, and there's a button that asks, did Alexa do what you wanted, yes or no? If you tap no, you can type out a description and send it, which helps improve the system. So there is a mechanism. Okay, thank you.

All right, I'll take one last question, maybe, because we're out of time. I'm sure a lot of hours have been spent building Alexa and the libraries and engines behind it. So is it available as a service? Can other customers or companies buy it, as a plug-and-play kind of thing? Oh, absolutely, you don't have to buy it, it's all available for free. So you can use Alexa, you can do things with it.
Not Alexa itself; can I just use the engine or the libraries behind it? Yeah, so two things. You can build a skill for Alexa, and you can also put Alexa into your own hardware, maybe a toy or something. Or, if you want to use the individual components, just the text-to-speech, or the speech-to-text, or just the natural language understanding, you can do that as well. And it's all free, or very, very cheap, basically. All right, I think we're out of time, but I'll be around if you have any questions. Hope you enjoyed the talk, thank you.