Hello everyone, my name is Juan Pinarieta and I'm going to talk about using AI to build the next generation of human-computer interactions. First, a little bit about me. I have several years of experience using AI tools to build new interaction models such as voice, hand tracking, and computer vision. I worked on the Echo Show, the first-generation multimodal Alexa device. I also worked at Facebook, in the AR/VR organization, building an assistant for several of their devices like Portal, Oculus, and later the Ray-Ban Stories, and I did some AI explorations with them as well. Today I'm working at Merlin Mind, an education company building the next generation of assistants for K-12 teachers, and we're trying to incorporate a lot of these learnings into that new experience. Finally, as you can tell from my picture, I'm a sci-fi and comics fan, and I love figuring out new ways to interact with computers, which comes in handy when developing new human-computer interactions. So the first thing we think about when building a new human-computer interaction model is the future as portrayed in Minority Report, where you use your hands and your voice to interact with a system. That's the ideal vision we all aim for and dream of. Another example is the movie Her, where the character interacted with an AI assistant purely via voice, and the conversation was perfect, emotional, and very natural. However, today the technology looks more like this: you have some remote controls, something on your head, and it's not as beautiful as what we've seen in those movies. And when you interact with voice, there are always these moments where the system doesn't get it right, so you have to repeat yourself a lot and make sure you talk in a precise manner so that the system can understand you.
So today I'm going to share some lessons, based on my experience, for using AI tools to build new human-computer interaction models. I'm not going to dive deep into the technology itself; I'm going to focus on how to build these experiences and, from a product management perspective, what insights and lessons you need to keep in mind. The first lesson is an obvious one: make sure you focus on solving a problem, not on a technology. Second, measure cognitive load to evaluate impact. Cognitive load is the amount of mental processing a human needs to do to take an action, and it's one of the most important metrics when deciding what type of interaction to pursue. The third is that consistency is king: making sure users have a consistent experience, built on known patterns, is critical. And the fourth is learning how to make trade-offs between coverage and deep experiences. We'll go through these one by one in this presentation. Okay, so the first one: solve a problem. Understand what problem you are solving for. Here's an example: when people started developing smart appliances, they built apps that you could use to control your lights. The existing experience was very simple: you touch a switch and the light turns on. However, the solution that was developed wasn't necessarily easier than the previous one, right? There wasn't really a problem with turning on the light when you're in your room; the app actually made it a little worse. You had to take out your phone, unlock it, find the app, authenticate, and then you could turn on the light. You also had to hope the Wi-Fi wasn't down so that the light actually turned on. So these initial smart appliances weren't solving much of a problem.
But to be fair, there was a problem. Some people really liked these smart devices for turning on their lights when they were away from home, where they didn't have physical access to the switch, using technologies like Bluetooth or network-connected smart devices to support that. Now, that is a problem. It may be a small problem, but it's still a problem. And when you start thinking broader, adding voice may solve a larger problem: maybe you can turn on your lights when your hands aren't available, because you're cooking or you have a baby in your arms. So you start focusing on a somewhat larger, more common problem. But even today, this still isn't the ideal solution for the most common case of turning on your lights, which is entering a room with the switch right next to you. So what a lot of companies are moving toward is figuring out how to turn on the lights the moment a person enters the room, without requiring that person to do anything. Here you could use presence detection with machine learning algorithms, not just motion sensors, because motion sensors can be triggered by anything; instead, detect when a human is actually in the room, via voice or via cameras. So there's a larger problem you can solve. My point is that you need to focus on what problem you are solving for the user, how big that problem is, and how technology helps you solve it. When you're using AI to build new human-computer interactions, first focus on the problem you're solving right now, and then think about what technology will solve that problem in the best way possible.
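To make the presence-detection idea concrete, here is a minimal sketch in Python. It assumes some upstream person detector (a camera or audio model, not shown here) emits a per-frame confidence score; the class name and thresholds are illustrative, not from any real product.

```python
class PresenceLight:
    """Toy presence-based light controller. Assumes an upstream person
    detector supplies a confidence score per frame. Hysteresis: require
    several consecutive frames of a changed state before switching, so a
    brief false positive (a pet, a shadow) doesn't flip the light."""

    def __init__(self, on_threshold=0.7, frames_to_switch=3):
        self.on_threshold = on_threshold      # detector score meaning "person"
        self.frames_to_switch = frames_to_switch
        self.streak = 0                        # consecutive frames disagreeing with light state
        self.light_on = False

    def update(self, person_score):
        """Feed one detector score; returns the (possibly updated) light state."""
        detected = person_score >= self.on_threshold
        if detected != self.light_on:
            self.streak += 1
            if self.streak >= self.frames_to_switch:
                self.light_on = detected
                self.streak = 0
        else:
            self.streak = 0
        return self.light_on
```

A single noisy frame does not toggle the light, which matters for trust: a light that flickers on false positives would push users straight back to the wall switch.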
Here's an example of one of the best cases of AI technology in everyday usage: voice search. Google, for me, has done one of the best jobs of adding voice to the search experience. Whether you're in a car, or you're on your phone in a hurry and can't type, or typing is just too hard, voice lets you search very fast in a very reliable way. This is an example of focusing on solving a problem and using the technology to do it. All right, so focusing on solving a problem, that's point number one. Point number two: measure cognitive load. You might ask, how do you measure the impact of the different problems you're addressing? Impact is not always about saving time or reducing physical steps; it's a lot about cognitive load. Let's take an example so you can see what I mean. A person wants to listen to music. Today they can either go to Alexa and ask it to play a specific song by a specific artist, or they can go to their laptop or mobile phone, find a song they want to play, and click on it. Obviously, going to your laptop or mobile phone takes a lot more steps, because you have to get your phone, open the app, authenticate, find the song, and play it. But there's a lot of hidden cognitive load in both of these processes. For the first one, the user needs to figure out: what song do I want? What mood am I in? What was the name of that artist? What was that song called? Otherwise the system may not recognize what you want to play. There's a lot of cognitive load in deciding on a specific song by a specific artist.
On the other hand, if you're on your laptop, there may be too many options; you don't know what's good and what's not, and you end up scrolling through thousands and thousands of songs before you actually pick one. This is the problem a lot of companies like Netflix and other media companies have: they offer so many options that you never know what to watch. So how do you improve this experience using AI? Lowering cognitive load is about building trust in the system by executing simple actions. In the music example, the simplest command you can teach your users is to just say "play music." If you educate your users to just say "play music" and trust that the system will play music they like in 90% of cases, you've lowered the load. How do you achieve this? First, you can build machine learning models to predict the next best song the user would like. You can use context like location and time, analyze mood from the voice, and use the history of what that person has listened to in the past. You can also enable something called fuzzy search, where the user can say, or even sing, little pieces of the song and you match it to existing songs. All of these use different matching machine learning algorithms to reduce the cognitive load on users, so they don't have to memorize which specific song they want. The second thing you can do is reduce the cost of failure recovery. Trust in the system is a lot about how easy it is to recover, how costly it is to take a second shot. In the music example, the user can easily say "next," or there can be a button in the UI that says "next song" or "I don't like this song." Alexa or the AI assistant can easily pick a different song, and the cost of failing is very low.
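The fuzzy-search idea above can be sketched with plain string similarity, assuming the voice request has already been transcribed to text (a real system would match audio, such as humming, against learned embeddings; the catalog titles here are just examples):

```python
import difflib

# Toy catalog; in a real system this would come from a music service.
CATALOG = [
    "Bohemian Rhapsody - Queen",
    "Billie Jean - Michael Jackson",
    "Shape of You - Ed Sheeran",
]

def fuzzy_match(query, catalog, cutoff=0.4):
    """Rank catalog entries by similarity to a partial, possibly
    misremembered query (e.g. a transcribed voice request)."""
    scored = [
        (difflib.SequenceMatcher(None, query.lower(), title.lower()).ratio(), title)
        for title in catalog
    ]
    scored.sort(reverse=True)  # best matches first
    return [title for score, title in scored if score >= cutoff]

print(fuzzy_match("bohemian rapsody", CATALOG)[0])  # → Bohemian Rhapsody - Queen
```

The user never has to recall the exact title; an approximate, even misspelled, request is enough, which is exactly the cognitive load this lesson is about removing.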
One example where this doesn't hold is video. If you pick a video, the cost of picking the wrong one is very high, because you're not going to be able to evaluate how good the video is within 30 seconds. So picking a video takes a lot more cognitive load, and it requires much better prediction algorithms. Developing UI and systems that help you recover easily is another way to build trust in the system and reduce cognitive load. The third thing you can do is help the user disambiguate between different options. If the system has low confidence about which option to pick, it can show the user, "Which one of these three options do you want?" That helps the user gain trust in the system and reduces cognitive load. As I said, it's all about helping the user understand the simplest command they can say, so they don't have to memorize or ask for a specific song, and giving them trust that the system will provide the right answer. We've applied this to music, but you can apply it to many other use cases, like opening apps or offering different things. All right. The third lesson: consistency is king. What do I mean by this? The first thing is: don't make people learn more than once. Let's take an example. How would you access your apps in an augmented reality system? If you had to design augmented reality smart glasses, in what way would you allow users to open applications or app stores? You could use voice and just say "open my app store," or you could use hand recognition with computer vision to open the apps. So now you have to select which gesture to use. If you look at the existing behavior of users on their mobile phones, most apps are opened by swiping up, right? That's how you access apps.
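Going back to the low-confidence disambiguation just described: a minimal sketch of that decision rule looks like the following (the threshold and the shape of the result are assumptions for illustration, not any product's actual API):

```python
def choose_song(scored_candidates, confidence_threshold=0.8):
    """If the top match is confident enough, play it directly;
    otherwise surface the top three options for the user to pick.
    scored_candidates: list of (score, song), sorted descending."""
    best_score, best_song = scored_candidates[0]
    if best_score >= confidence_threshold:
        return {"action": "play", "song": best_song}
    # Low confidence: don't guess; let the user choose among the top three.
    return {"action": "disambiguate",
            "options": [song for _, song in scored_candidates[:3]]}

# High confidence: play immediately.
print(choose_song([(0.93, "Song A"), (0.40, "Song B")]))
# → {'action': 'play', 'song': 'Song A'}

# Low confidence: offer choices instead of guessing.
print(choose_song([(0.55, "Song A"), (0.52, "Song B"), (0.50, "Song C")]))
# → {'action': 'disambiguate', 'options': ['Song A', 'Song B', 'Song C']}
```

The design choice is the point: a confident system acts, an unsure system asks, and both behaviors preserve the user's trust better than a confident wrong guess.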
If this is something people have already learned and are already using on their phones, it makes a lot of sense to replicate it on the smart glasses in the same way. So don't make people learn more than once; use what they already know. Here's one example. Keiichi Matsuda has built a lot of hyper-reality concept videos. In this case, he's not using hand recognition with a swipe up; if you notice, he's using the hand opening to access the apps. There could be many reasons why he made that choice. It may be that hand recognition for a swipe up isn't reliable enough, so users get frustrated because it doesn't work all the time. So in this case he chose another way. But ideally, if the reliability and performance of both gestures are the same, you should pick whatever users already know. Here's another example: how would you invoke a voice assistant with your hands in an AR world? In this case, there's no pattern or common usage for invoking an assistant with your hands; nobody has done that. You only use your voice. So if you want to allow the user to invoke the assistant in an environment where they don't want to say "Hey Siri" or "Hey Google," how would you do it? You use common mental models that help the user associate the gesture with the action you want to take. In this example, Keiichi Matsuda used opening your hand: when you open your hand, you access Google. And this can be associated with a magic lamp, which you hold in your hand and rub. It's a similar behavior, so it makes a lot of sense.
It's not something you do today, but it's easy to understand because it's a common mental model from popular stories, science fiction, and comic books. That's where a lot of the influence of comic books comes in handy when you have to teach new interaction models to users. So building on top of common mental models is very important. Lastly, use or try to create standardized patterns. When you want to build something that needs to be consistent across different experiences, try to use common, standardized patterns. For example, Google does this very well with their recommendations for what to do next, your next best action. On your phone, you get these little bubbles showing the recommendations the machine learning algorithm is giving you, and they also show up when you're interacting with Google Assistant. This helps the user quickly understand what they're for: if they've used them on the phone, they can quickly use them on the Google Assistant or the Home Assistant, and they don't need to process, "What does that bubble do, and what happens when I click on it?" You reduce the cognitive load of making that decision, and that's super helpful for the user. So consistency is key: develop common mental models, create common patterns, and always try to standardize them across experiences so that you reduce the user's cognitive load. Okay, so that's point number three. The last lesson I have is how to make trade-offs between coverage and deep experiences. What do I mean by that? A lot of times when you're building an experience, you need to understand what job you're fulfilling and what you are replacing. Let me give you an example of what I mean here.
For example, at my current company, our job is to help teachers untether from their laptop during class. A lot of teachers deliver their class from behind a laptop, and we want to build an assistant that helps them walk around the room, engage more with students, and use different interaction models to do this, including voice, a remote, and others. But what are we replacing today? If we want to help the teacher move away from the laptop, we need to understand all the actions they're doing today, giving presentations, setting up timers, asking questions to Google, writing notes, et cetera, and how they're doing them. Today, they're using their laptop with a mouse. Sometimes they have a clicker to manage the presentation. Some of them also have existing voice assistants that let them play music or set up timers with their voice. None of these are perfect solutions, and they're different types of devices the teacher has to interact with, which costs a lot of cognitive load at a time when teachers need to be focused on engaging with students. What I mean here is that if I want to build a solution for these teachers, I need to make sure I have coverage of all these experiences. If I don't cover all of them and the teacher has to rely on their laptop or existing devices for some or most of them, then I'm not going to fulfill the job and replace their current solutions. So my solution needs to help them across all these different activities, even if some integrations are a little light. For example, at Merlin Mind we're using both voice and a remote control to support all these activities, and we're allowing teachers to use multiple interaction models depending on what they want to do.
But we need to make sure we cover all the different aspects of classroom activities; otherwise, teachers are going to fall back to their existing solutions. So the key question is: how do I do that? If I have to build experiences for all of a teacher's activities, I need to spend a lot of resources, which I may not have. So how do I prioritize, and what trade-offs do I need to make? One trade-off is deciding what can be a light integration versus a deep integration. Let me give you an example. In our music example, when Alexa built music, they integrated directly with Spotify and Pandora, and each of these services uses ML models to personalize your results. As we saw before, users want to trust the system to get the best song, but that involves a large group of people and a lot of resources to develop. If this is not one of the main activities in my use case, in my case, a school, I can choose a light approach, for example, leveraging a browser and opening YouTube. I can develop a very limited experience to play music via voice, but do it in a light way. Those are the trade-offs I need to make: which of these activities need wide, deep coverage versus a light implementation, so that I can start testing my solution. In our case, these are some of the ideas we've had about how to solve teachers' problems. In some cases we've chosen deeper integrations; in others we're doing lighter integrations, relying on existing tools. But hopefully this lets teachers use one solution that covers most of their use cases. Okay. So here are some ideas about how to think about where to make these trade-offs. You can start by asking which activities can be covered 80% of the way with a light approach.
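As a sketch of what such a light approach could look like, here is a toy handler. It assumes a voice front end that hands us the transcribed utterance; the playlist names and URLs are placeholders, and the logging is just an in-memory list:

```python
# Hypothetical curated playlists; names and URLs are placeholders, not real.
CURATED_PLAYLISTS = {
    "focus music": "https://www.youtube.com/playlist?list=EXAMPLE_FOCUS",
    "morning songs": "https://www.youtube.com/playlist?list=EXAMPLE_MORNING",
}

# Requests we could not serve. Reviewing these tells us when a deeper
# integration (e.g. a full music-service API) is worth the investment.
unmet_requests = []

def handle_music_request(utterance):
    """Light integration: return a curated playlist URL if one matches,
    otherwise log the request so it can inform future prioritization."""
    key = utterance.lower().strip()
    if key in CURATED_PLAYLISTS:
        return CURATED_PLAYLISTS[key]  # the caller can open this in a browser
    unmet_requests.append(utterance)
    return None
```

A real version would sit behind intent classification and actually open the URL; the point is that coverage comes cheap, while the unmet-request log tells you where depth is actually needed.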
With music, for example, teachers may have 10 or 20 playlists they play in the classroom. They don't want to play specific songs, so we can offer curated playlists from YouTube and start with that, so they have something in place that fulfills the job they're doing. Another question to ask is: is it easy to educate my users on the limitations? Is it easy to explain, "Hey, we're only offering this limited set of use cases that will cover most of your classroom experiences," then get feedback and understand whether that's good enough? And the third point: how can you learn? If you choose a light integration to get coverage, how do you know when you need to invest in a deeper one? What metrics help you understand where to go deeper? Here, it can be understanding what users are asking for. If a lot of users are asking for specific artists, or for things from their personal Spotify, gather that data to prioritize a deeper integration when you have the resources. So that's point number four. As I said, these are the four lessons I've learned in my career using AI tools to build human-computer interactions. Always focus on solving a problem, not on developing a technology; that's critical. Measure impact in terms of cognitive load, which means building trust in the system through simple actions. Consistency is king: don't make users learn something new; use existing patterns and common mental models when you're building new interactions. And finally, know when to make trade-offs between coverage and deep integration, so that you can offer a whole experience to the user. I want to give you one more lesson.
My dad always told me: don't do good things that look bad. Lately we've seen a lot of privacy news, a lot of privacy issues at the big companies, because sometimes we haven't been transparent with users. We haven't been open, and we haven't stopped to think, "Is this something that will look bad to the user?" and tried to explain it. So make sure you always think about what data you are collecting. Does the user know you're collecting that data? How can you be more transparent, and how do you keep users in control when you're using AI to build new interaction models? Building trust in the system is the most important thing, and part of that trust is making sure users feel in control of their experience and of what data is being tracked. I want to close with this guy. I don't know how many of you know him; his name is Alan Kay. He was a leader at Xerox PARC, an R&D lab founded in the 1970s where they did a lot of experiments. They pioneered the mouse-driven, graphical operating systems we use today, which Windows and Apple's macOS are based on, and a lot of the things we use today came out of that laboratory. He said something early on that I always carry with me: the best way to predict the future is to invent it. So go and start experimenting yourself. That's it. Thank you very much. As I said, I work at Merlin Mind and we're hiring, so if you're interested or you just want to connect, please reach out to me on LinkedIn, or here's my email where you can contact me and find out more. Thank you very much. It's been a pleasure.