So, I am very excited to introduce our CSE Distinguished Lecture speaker, Professor Monica Lam. Professor Lam is the Faculty Director of the Stanford Open Virtual Assistant Laboratory, which is creating open virtual assistants. I think this is really important work. Virtual assistants like Alexa and Siri are increasingly ubiquitous in our homes and even our workplaces, and in many ways they are still in their infancy. I mean that in two senses of the word: there are a number of open technical problems spanning AI, machine learning, natural language processing, HCI, programming languages, and systems. And equally importantly, these systems have the ability to listen to us, so now is really the time to make critical progress on privacy, openness, and democratic control over this sort of infrastructure. I am very excited to have Professor Lam speaking here today; I can think of few people more qualified to work on this kind of big, multidisciplinary design problem. Professor Lam has published over 150 papers on a wide variety of topics including NLP, machine learning, architecture, HCI, programming languages, and compilers. You probably know some of her work on compilers: she is a co-author of The Dragon Book, the definitive text on the subject. I think all of that speaks for itself. So together with my co-host Joyce, I would like to welcome Professor Lam to Michigan CSE for her distinguished lecture on the Open Virtual Assistant Initiative.

Thank you very much for your kind words. Maybe I should share the screen. Well, first of all, thank you very much for inviting me. It is great to visit, even though this is virtual; one day soon we will get to see each other in person again. What I am going to talk about today is our work on the Open Virtual Assistant. Let me just make some adjustments here.

To start, I want to talk about how exciting this moment is for computer science researchers, and that is because computers are going to get a new interface. This is the first time since the era of workstations and personal computers, when we acquired the mouse and the bitmap display, that a new interface has arrived. And it is the human interface: the interface we use with each other, and now we get a chance to use it with computers. This is all made possible by natural language processing and machine learning, and it is really exciting. It changes how we interact with computers in three major ways.

The first is that when we ask computers for information, we no longer have to think in keywords. We can just ask the way we would ask a teacher: you say what you want to know, with all the qualifications you care about, in plain English. How wonderful would that be? The second is that all the programs we use are going to acquire another modality. It is not necessarily replacing anything, but now you can interact by voice, and before long younger people will simply think that is the normal way to interact. Interacting by speech is so much better than the menu-driven approach, because you are no longer choosing from a limited set of options.
You can just express yourself, and there is an ongoing dialogue, a back and forth between you and the computer. It is so much more fluid. And finally, what is really exciting is the third interface, the programming interface. In the past, we all had to be formally educated in programming languages to get computers to do what we want. Just imagine now that everybody can automate their tasks without having to hire a developer. I am not talking about very fancy code, but the things you do on a regular basis, which you may be able to automate yourself. This opens up the use of computers for a lot of less profitable applications, in a sense. I have talked to teachers, and they say, I wish I had software that does X, Y, and Z. Imagine that we can now just use natural language and have the computer perform tasks on our behalf. This will completely change what computers can do for humankind.

So this is all very exciting, and the voice interface is going to be embedded in one of the most interesting applications to show up in the last few years: the virtual assistant. Instead of visiting all the different websites, searching for things, looking up apps, you talk to one entity, the virtual assistant. Instead of remembering all your passwords, the assistant can keep your credentials. You have a uniform way to talk to all these different resources. And what happens is that the virtual assistant acquires a holistic understanding of what you do. Imagine the assistant can look at your calendar. It sees that the workday is coming to an end and you have not scheduled anything, and it might say: can I arrange a Zoom meeting, and can I order you salmon from Tamron, and this for your mom, so we can have a virtual dinner together? And you say okay, because the assistant knows this is the kind of thing you like to do on a Friday evening. Things will take on a very different feel when you do not have to think about search engines and purchasing in all these different places, because the assistant can do everything for you.

Now, we know there are a lot of really big, influential companies today, and you can imagine that the virtual assistant, when it matures, is really a combination of Google, Amazon, and Facebook all in one. It is that powerful. And what is concerning is that we are seeing an emerging duopoly. Alexa has 70% of the US market today, and Google Assistant takes up pretty much the rest; between the two of them, they are way over 90%. I am talking about the smart speaker assistants. These assistants do not just let you shop at Amazon or search on Google; they are creating a voice web. Alexa has 100,000 third-party skills and 60,000 compatible IoT devices. So you are now accessing the IoT and the web through the assistant, the way we normally access web pages through the browser. These assistants are open platforms, but open is not the same as non-proprietary; you cannot confuse the two, and it makes a huge difference. Today they can say, I am open to everybody, and tomorrow they can change their mind.
And because they have full control, using them is just like using an Apple device: Apple charges 30% on all purchases made on its mobile devices. Can you imagine Microsoft saying, if you use our operating system, we are going to charge companies 30% of everything spent through it? That is how much power these companies have. This is the voice web, and it becomes your interface to digital services. The assistant becomes an intermediary between you and all the businesses, with control over where you buy your groceries and everything else, because why would you really care where it comes from, right? The assistant can just serve it to you. So imagine Kroger or Walmart: today the platforms are open and you are allowed to sell through them; tomorrow they can say, I am going to take a cut, or I am just not going to let you in. That is the power they have, because it is their system.

It also has ramifications for society, because without open competition, innovation is going to suffer. And if they amass a lot of information about many, many people, they have a lot of power over how people think, how people communicate with each other, and what information is shared, as we have seen lately with Facebook, for example. In the meantime, voice is becoming very pervasive and very important. What does it mean that all voice technology is delivered through this couple of companies? What about the low-resource languages, the languages without as much purchasing power associated with them; are these companies going to support them? What about nonprofit causes; who is going to build those interfaces for them? A lot of issues come up when you have just a couple of companies controlling the voice interface.

So our goal with the Open Virtual Assistant Initiative is to advance voice technology and make it available in the open domain. This is still early stage, and we think this is absolutely the right time to get going. Beyond the technology itself, we really want to give users an alternative: an assistant that honors privacy. It is a little bit like Firefox. Today you have Firefox, so there is at least an alternative to Chrome. Only some of us use Firefox, but the fact that an alternative exists also keeps Chrome honest; there are things they would not have done if there were no alternative. So this is very important, and that is our overall goal.

All right. We just have a little problem: we do not have any real users. We know that natural language processing requires training samples. How are we going to get training samples if we do not have real users? This is the advantage of Google and Amazon. And on top of that, we do not have billions of dollars. Alexa has 10,000 employees, and many of them are annotating data: they take natural language sentences, annotate what each one means, and then use that to train the neural network. We do not have any of these things, and a lot of people were just writing us off.
It's like: you are not going to make it. I have been struggling with this problem; I have been working on this for five years, and in the first few years we really struggled. Then we realized some very important things about this annotation-based machine learning: annotations are just not good enough. Alexa and Google have been at it for years, and their NLP is still really limited. It is much better than before, but we still have a long way to go. The reason is that natural language sentences are not natural phenomena; they are sentences that we engineer. Every sentence I produce right now is something I engineer to describe what I am thinking, so it is not just a matter of understanding what everybody likes to talk about. You have to understand a new sentence, one that I have never put together before and nobody else has said before. So it is not just playing frequency games, asking what the most common requests are, because that will not serve everybody's needs. And language is compositional, which means there is an exponential number of combinations within a sentence, and when you put sentences together into a dialogue, there is an exponential number of paths. How many dialogues can you possibly annotate?

There is a very famous open benchmark called MultiWOZ; it is a multi-domain task-oriented dialogue dataset, where crowdworkers pretend to buy plane tickets, reserve restaurants, and so forth, all the things we cannot actually do today. MultiWOZ is an open dataset, and it has been annotated, but it has been re-annotated twice. Amazon spent money re-annotating it because the first annotations were deemed too incorrect, and then Google came along and said that was still not good enough, let's re-annotate it again. When we analyzed the errors, about 30% of the dialogues were still incorrectly annotated; the annotations lack consistency. And if you feed a neural network annotations with a 30% error rate, it is not going to do a very good job; you are confusing it.

So annotations have a lot of limitations. Maybe we can handle some of the most common requests this way, but seriously, can you put the entire web up by having Amazon and Google annotate every sentence? There are lots of things out there on the web, and there are also many, many languages. This reminds us a little of Wikipedia: you really needed a lot of people putting information up in order to create the largest knowledge base, and creating this largest voice interface is, I think, simply not doable by just a couple of companies.

So we decided we really need to take a different approach. We cannot make it by applying the standard paradigm, so we are changing it: we do not annotate, we mostly synthesize. What we realized is that, first of all, we are not trying to do natural language understanding between two humans. It is between a human and a computer, and the computer's capabilities are limited. We know what the computer can do: it can issue queries to knowledge bases, invoke APIs, and in the best case, run programs.
Everything the computer can do can be represented as a set of programs over these primitive operations. So what we can do now is look at what the computers can do and cover what the computers know, not what people are asking today or what the most popular questions are. We say: here is the database, these are the questions it can answer, and we synthesize accordingly. We synthesize primitive operations, and we synthesize compound sentences that put things together. And when we synthesize them, they are correct by construction. I do not have an annotation problem at all. Now, you can object that synthesized data is still artificial, so what we propose is to throw in a few-shot sample of real data to prevent overfitting. That is our overall strategy, because we simply cannot afford the annotation style, which, as I discussed, is also limited in itself.

The second idea is that we encapsulate our know-how in tools. We are not going to build every interface ourselves; we take everything we know and make it available as tools, so that other people can use them to build the interfaces. The tools we have today allow an individual to say: here is what the computer knows, this is the database, this is the schema of the database, and here is some representative data; now generate me a dialogue agent. Given the schema, our tools automatically synthesize tons and tons of perfectly annotated training data, and we use it to train a neural semantic parser: a network that takes natural language and translates it into a formal representation. We have created a language called ThingTalk for representing what virtual assistants can do. You can think of it as a DSL, a domain-specific language, for virtual assistants. It is an executable program: we have a compiler that takes it and executes it directly. There are no intermediate representations, because what we found is that it is best to let the neural semantic parser figure out the mapping itself. The advantage is that you can then go to other natural languages, and they all share the same representation. We are not trying to capture the structure of the natural language at all; we go backwards, starting from what the computer can do, and treat the whole problem as a translation from natural language to code. You know I have been doing work on compilers, and I look at this as nothing but a compiler: a compiler for the highest-level programming language, because we are translating natural language into lower-level code.

Okay, so that sounds fine, but the question is: how do you get good enough training data? This took us quite a number of years; as I said, we started working on it, and by the third year we figured out how to go about it. What we do is a combination. We have templates that we created to generate a large collection of basic sentences. The important thing is that these templates are domain independent: the templates know nothing about the domain; they capture the general constructs of the language. The developer supplies the domain information, the words for the domain. We throw those in and synthesize sentences. Many of them are very clunky.
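To make this concrete, here is a minimal sketch in Python of the kind of domain-independent template expansion I am describing. The two templates, the restaurant schema, and the "logical form" strings are illustrative inventions of this sketch, not the actual Genie templates or real ThingTalk syntax.

```python
# A minimal sketch of domain-independent template synthesis.
# Templates, schema, and logical-form syntax are all illustrative.

TEMPLATES = [
    # (natural-language template, logical-form template)
    ("show me a {table} with {field} equal to {value}",
     "@{table}() filter {field} == {value!r}"),
    ("i am looking for a {value} {table}",
     "@{table}() filter {field} == {value!r}"),
]

# Domain information supplied by the skill developer: a schema plus
# representative values for each field.
SCHEMA = {
    "table": "restaurant",
    "fields": {"cuisine": ["spanish", "japanese"],
               "location": ["palo alto"]},
}

def synthesize(schema):
    """Cross every template with every field/value pair. Each sentence
    comes paired with its logical form, so the 'annotation' is correct
    by construction; no human labeling is involved."""
    for nl_tpl, lf_tpl in TEMPLATES:
        for field, values in schema["fields"].items():
            for value in values:
                slots = {"table": schema["table"],
                         "field": field, "value": value}
                yield nl_tpl.format(**slots), lf_tpl.format(**slots)

for sentence, program in synthesize(SCHEMA):
    print(f"{sentence!r:55} => {program}")
```

Outputs like "i am looking for a palo alto restaurant" come out clunky, which is exactly what the next step addresses.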
We take those clunky sentences, and then we generate paraphrases, to smooth them into natural-sounding language for that particular domain, introducing domain-specific terminology sentence by sentence. Now, there is previous work on this, and it uses human paraphrasers, who rewrite the clunky sentence into a natural-sounding one. What we have done is take that to the next level: we automated it. We observed that there are a lot of very good pretrained language models out there. GPT-3, for example, can write you stories that you cannot even distinguish from human-written ones. But the problem with these pretrained language models is that they do not really know what they are talking about; they know the correlations between words. They can say this is equivalent to that, or this is similar to that; they know a lot about the relationships between words, but they are not grounded. They do not know what they are saying. So what we do is generate these awkward sentences, but grounded in semantics, because the semantics is given by ThingTalk; then we use the pretrained networks to find all the equivalent sentences we can get, and we add those to the training data. That is how we can generate a good dataset.

So we encapsulate all this in tools, and the third step in our strategy, to scale, is to enable reuse. It cannot be me building one domain at a time. Reuse is what we have always done in computer science: we standardize representations. We have been working on ThingTalk, and we want to propose it as a standard; if everybody annotates their semantics using ThingTalk, then we can share. On top of that, we open-source all the tools, the training data, and the models. If we created a skill for music, say for Spotify, you can take the same set of language constructs and apply it to another music skill; you do not even have to do anything, you just plug in your own database and API and off you go. Reuse is how computer science has taken care of software engineering productivity, and we do the same thing. Imagine if we did not have Java, and all the web server tools built on the Java libraries; we have so many useful libraries. We are going to do the same here: we need to standardize on the languages and tools.

And our goal is to empower the 20-million-plus voice interface developers out there. I picked that number because there are 20 million plus web developers today, doing website design and deployment, and what we see here is a need for a voice interface to just about everything online. Imagine 20 million plus developers: these are not people who are going to learn AI, and we do not want them to annotate; there would not even be enough people if you stuck with the annotation regime. So this is a very tool-based approach. Those are the high-level ideas; let's get into the technical details. Oh, before I do that, I want to show you a little preview of the results, to give you a sense of where we are now.
So what we did is take a very popular domain, restaurants, and we crowdsourced a bunch of commands. In particular, we were not looking just for the popular commands; we asked people to come up with more compound, more complex questions. For example: find me a Spanish restaurant that is open at 10pm in Palo Alto. That is a compound sentence that includes three different fields: the cuisine, the time, and the location. But as you can see, you link them up in one sentence and it reads very naturally. We took those questions and tested them on all the top popular assistants, and we actually beat them all in this experiment. What I show on the right here are some examples, and you can see that these are reasonable sentences, like: show me restaurants rated at least four stars with at least 100 reviews, because we all know that ratings based on very few reviews are not reliable. These are all examples that Genie can handle, and only a small number of these questions are actually answered by the commercial assistants.

I see a couple of hands up; maybe this is a good time to take some questions.

Hi, you might be covering this later, but I just want to make sure I understand what the neural model is doing. So the training data is natural language paired with ThingTalk code, and the model is learning how to transform natural language into ThingTalk code. Is that correct?

Yes. And then to answer the questions, you just execute the ThingTalk code; if the command is to open the garage door, executing the code will open your garage door.

I see. Okay, thanks.

There is another hand. I have a similar but not quite the same question. You talked about annotation-based learning algorithms not having considered the structure of natural language, but I did not quite understand how you incorporated the structure of natural language into your algorithm.

Well, this is what I am going to talk about next. If you still do not understand it at the end of this part, let me know.

Okay, thank you very much.

So that was just the preliminary; let's get into the details. To start, I want to talk about ThingTalk. What is ThingTalk? We said that voice interfaces are good for queries to knowledge bases, for dialogues with software, and for automation, and I am going to show you an example of each. For example, I say, I would like a Spanish restaurant in Palo Alto, and you see the query that this maps to in ThingTalk. As far as queries are concerned, ThingTalk is roughly equivalent to SQL, but with different semantics. Because we know we are going to synthesize a lot of questions, we want a language that is easy to compose; and because we know we are going to translate from natural language, we also want the target to be an easy target for neural networks. ThingTalk has been tuned for those two purposes.
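As a rough illustration of the mapping, here is a simplified Python stand-in for the kind of formal representation the parser produces for the compound restaurant question from the preview. This is not actual ThingTalk syntax; the point is only that one natural sentence composes several typed filters over one table, and a compiler then executes the program directly.

```python
# Illustrative only: a simplified stand-in for the parser's output,
# not real ThingTalk. One sentence composes three typed filters.

utterance = "find me a spanish restaurant that is open at 10pm in palo alto"

program = {
    "query": "restaurant",               # which table / API to invoke
    "filters": [                         # three fields, one sentence
        ("cuisine",  "==", "spanish"),
        ("location", "==", "palo alto"),
        ("open_at",  "==", "22:00"),
    ],
}
```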
And what about dialogues as interfaces? Instead of just mapping each question into a query, we also capture the result of the query, so that we can communicate it back to the user, and the state of one sentence carries over to future sentences. So when you say, please book a table for two, I know that you are booking the particular restaurant I just proposed. So now you can see that this is a representation that captures the semantics of an entire dialogue.

For automation, here is an example: when I leave the house, turn off the lights. The construct we have says: monitor the GPS, and when the location is not home, set the power of the light bulb to off. These are APIs, and we have an implementation that executes this code for you. As you can see, this is an event-driven program. From a computer science point of view, it is not the easiest kind of program in the world, but now we are saying it in English.

And what I want to point out here is that all our tools are based on grammars; they are not fixed to any particular ThingTalk. This is the grammar we arrived at, and we have tested and tuned it so that it is easy to synthesize for. But everybody else can add to ThingTalk to capture more information. For example, we have access control in natural language: we take the automation sentence and add the people who are allowed to perform these operations. We throw that in, and we get access control and remote requests for operations; we just change the grammar, and the tools automatically work with it. And this is what I mean when I say I want to encourage a lot of people to work together to build up the semantics of what the assistant can understand.
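To make the event-driven automation example concrete, here is a minimal Python sketch of what executing "when I leave the house, turn off the lights" involves. The two device functions are hypothetical stand-ins for real skills, and the polling loop is a simplification of the monitor construct.

```python
import time

def gps_location() -> str:
    """Hypothetical stand-in for a GPS skill; returns e.g. 'home'."""
    return "home"

def set_light_power(on: bool) -> None:
    """Hypothetical stand-in for a smart-bulb skill."""
    print("lights", "on" if on else "off")

def when_i_leave_turn_off_lights():
    """Monitor the GPS stream; the moment the location stops being
    'home', run the action once (edge-triggered, like the monitor
    construct described above)."""
    was_home = gps_location() == "home"
    while True:
        is_home = gps_location() == "home"
        if was_home and not is_home:
            set_light_power(False)
        was_home = is_home
        time.sleep(60)   # poll once a minute
```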
Now let's take a look at the training data; it is all about synthesizing the right training data. As I mentioned before, you start by having the developers provide the database schemas and representative data. For example, here is a schema from schema.org. Schema.org is a standard schema, and a lot of websites put meta-information on their pages using it. The reason they do this is to support search engines: using this information, search engines can help people locate a restaurant, find out its phone number, and so forth. What we are saying is that we can scale by building on standard representations, so that we can be a voice agent for all the websites out there that carry this metadata. If we connect these two, I think more and more people will put information into schema.org, because by creating one voice interface, you can serve all the websites in the same category. This is the power of standardization. So as you can see, here is a restaurant, with a cuisine field and so forth.

The next step: we notice that even though the field is called cuisine, with values like Spanish, there are many ways of talking about cuisine, in many different parts of speech. You can say, what is a restaurant that serves Japanese food, or use it as an adjective, what is a Japanese restaurant; I never mention the word cuisine, and I may not even mention the word food. There are many different ways of expressing the same field in different parts of speech. So what we do is go to the pretrained networks and ask them for paraphrases, and when we see that they express cuisine in different parts of speech, we record that: the verbs, the adjectives, and so forth.

Once we have the basics for each field, the next thing we do is put them into sentences using grammar templates. For example, we have a template of the form "which ⟨table⟩ ⟨verb⟩ ⟨value⟩ ⟨noun⟩": the table name is restaurant; the verb is pulled up by looking at the field and the part of speech, so "serves" is the word; the value is Chinese; and "food" is one of the words we learned by using the pretrained networks to figure out the primitives. So we have templates that turn the primitives into full sentences, and we have templates that give you compound sentences. We understand, for example, comparisons by type: greater than, slower than, heavier than; it is all based on types. The templates are based on parts of speech and types, not on the domain information, so you can substitute in the domain and generate a lot of sentences. For English, we have about 800 templates. You may say that is a big number, but you do it once and for all. Instead of asking for thousands and thousands of annotations, we came up with these templates, and we can now generate many, many basic sentences.

But they are not good enough, because these are very clunky sentences, not domain specific. So the next step is to use the pretrained networks to get paraphrases and turn them into more fluent sentences. I am going to expand on this, because it is not so obvious how you get good domain-specific paraphrases, but this is how we generate a lot more sentences. The next step after that is to generate dialogues. We have, again, domain-independent dialogue models, and we just generate thousands and thousands of dialogues: you give me a table of restaurants, I talk restaurants; you give me a table of basketball, I talk basketball. This is how we are going to get the scale. Finally, you want to have other languages. We rely on neural machine translation to generate sentences in the other languages, and we train a semantic parser in that native language. We do not translate back and forth at the edge; we have native semantic parsers. So that is the full tool chain that we have created.

As I mentioned, the domain-specific paraphrases are not so easy to get. The basic idea is that you can take a sentence and use a neural sequence-to-sequence model to generate an equivalent sentence, a paraphrase of the original. There are existing general-purpose paraphrasing datasets you can use, so you do that, and then you add a level of fine-tuning so you can generate paraphrases for what we need. But these are just paraphrases according to a neural model, and they may not be correct, and they may not be very good. For example, here are some sentences pulled from our experiment: "search some cafeteria that have greater star than three and do not have smoking." As you can see, these sentences are very clunky, because they are synthesized from words that we pulled out.
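Mechanically, the paraphrasing step can be done with an off-the-shelf pretrained seq2seq model. This sketch uses the Hugging Face transformers API; the model name is a placeholder for a network fine-tuned on a general-purpose paraphrase corpus, as just described, not a real published checkpoint.

```python
# Sketch of automatic paraphrase generation. The model name is a
# placeholder: in practice you fine-tune a pretrained seq2seq network
# (BART, T5, ...) on a general-purpose paraphrase corpus first.
from transformers import pipeline

paraphraser = pipeline("text2text-generation",
                       model="my-org/paraphraser")   # hypothetical model

clunky = "search some cafeteria that have greater star than three"
outputs = paraphraser(clunky, num_beams=5, num_return_sequences=5)
for out in outputs:
    print(out["generated_text"])

# Sampling with a higher temperature (do_sample=True, temperature=0.7)
# yields more varied candidates, but, as the examples below show, also
# noisier ones; that is why the output must be filtered.
```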
And when we send that sentence through the paraphraser, it produces "search for a restaurant that has more than three stars", which sounds good, "and does not spoke", which does not work. So that is not a good sentence. Here is another one: find restaurants close to my home. As you can imagine, the paraphraser can say "find restaurants near me", but that is not identical, because I may not be at home. And then, for example, search for people who are employed by Stanford. These are the variants you get by turning up the temperature: the greedy one looks good; at 0.3 it looks good; and then it starts doing something different: "find people at Stanford", which is not good; "currently employed", you can argue whether that is good or not. The point here is that when you generate this extra training data, you can actually get it wrong, and if you train a neural network with wrong data, you do not get a good semantic parser.

So what do we do? We use self-training. We start with the synthetic data and train a first version of the semantic parser. We take the synthetic data, send it through the paraphraser, and get noisy paraphrases. Then we take the paraphrases, send them through the semantic parser to get the logical form, and compare to see if they are correct: if the paraphrase gives the same logical form as the original, according to the semantic parser, we accept the sentence and throw it back in as training data. Then you generate the next version of the semantic parser, and you can repeat. We call this the auto-paraphraser. Note that the sentences we throw back in are sentences the parser already recognizes, but by adding them, we stretch the semantic parser to also accept more sentences close to the ones being accepted, and through this process the set of paraphrases it accepts grows.

In this experiment we used a BERT encoder with an LSTM decoder. We tried it on six domains, using schema.org databases; there are lots of websites in these different domains. Here are some statistics on the databases: roughly 18 properties per domain. We derive the annotations automatically; we synthesize about 300 sentences and paraphrase another 300, which gives us a lot more variety. And on top of that, and this is very important, the dev set and the test set are not generated and paraphrased. If you take generated sentences, paraphrase them, and evaluate on that, you can get very good results while having a very poor semantic parser. It is absolutely important that you crowdsource, that you get natural-sounding sentences directly from humans, and that is what we use for the dev set as well as the test set.

Let's take a look at the results. We tried all the different domains and ran some ablation studies. What they show is that if I just generate sentences using templates, it is not so good. Once we throw in the automatic field annotations, finding different parts of speech and different ways of saying the same thing, accuracy increases, because the template generation now has a lot more variety; we get close to 60% on average. Then there is the automatic paraphraser.
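Here is a minimal sketch of that filtering loop. The four callables are abstract stand-ins rather than the actual Genie interfaces.

```python
# Sketch of the auto-paraphraser's self-training loop: a paraphrase is
# kept only if the current parser maps it back to the same logical
# form as the sentence it rephrases. All callables are stand-ins.

def self_training_round(pairs, train, parse, paraphrase):
    """pairs: list of (sentence, logical_form) synthetic examples."""
    parser = train(pairs)                       # version-N parser
    accepted = []
    for sentence, logical_form in pairs:
        for candidate in paraphrase(sentence):  # noisy candidates
            # Keep the paraphrase only if it parses to the same program.
            if parse(parser, candidate) == logical_form:
                accepted.append((candidate, logical_form))
    # Retraining on the accepted paraphrases stretches the parser to
    # cover sentences near the ones it already recognizes; repeat.
    return train(pairs + accepted)
```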
If we do not do any filtering, the result is actually worse. But when we do the self-training with paraphrasing, it brings it all the way up: everything is above 60%. And what is actually interesting is to compare this with a manually annotated and paraphrased dataset: there is only a 6% gap. So this automation takes us a long way toward that performance, all without any human annotation.

Let me briefly talk about the other parts of the tool chain. When we go multilingual, we really do not want to redo everything we did for English for every single language. We spent a lot of time creating that set of about 800 templates; we do not know the other languages, and it is a lot of work. So we generate the training data using machine translation. But one thing that is very important to know is that when you port to a different language, you also have to port the terminology, the entities, of that language. In America, you may be talking about burgers, and Burger King, say, is actually pretty international, but in other countries you want to talk about their local dishes, their local restaurants, and so forth. So what we did is build a novel neural aligner that allows us to substitute the English entities with local entities, and that makes a huge difference; it is absolutely necessary for this internationalization.

Here are the results, on the same restaurant dataset as before. We show that across all ten languages, the results look pretty good, and these are all tested with human-translated test data, not data translated by the computer. For Genie, our experiment is few-shot training: we take the synthesized, translated, and aligned data, and we throw in a few manually translated sentences. We compare with the state of the art, and you can see a pretty good difference. With this approach, we are now talking about building a multilingual question-answering agent in one day. What we mainly need to do is hire professional translators to translate the dev set, the test set, and the few-shot training data. That means it is relatively cost-effective and time-effective.

I see a hand up; maybe I can take a question now.

Yes, I had one question regarding your previous slide. You said that the translators are going to translate the entities in English into entities in the other languages, right? For instance, I am from Iran, and I know that a burger is very different from joojeh kabab. So doesn't this hurt the correctness of the translation? It seems that some errors will be introduced.

Yeah, I can get into the details more. Let me show you this, and then it will probably be very clear. What we do is take the English sentence and, whoops, let me make sure the window is shared. Okay, now you can see it. So the whole idea is that we really need to substitute local entities.
As you mentioned, we make the translation first. Here is an example of an English sentence and its ThingTalk. What we actually want, in the end, is not to talk about burgers in Italy, and instead of this place called Woodland Pond, we want to refer to a local place like Venezia. So we really want to substitute the terms with local terms. The way we do this is that first we run the sentence through the translator, and, as you pointed out, the way "burger" and "Woodland Pond" get translated is literal; the translations no longer look like the original English words. So we have a neural aligner that tells us the word "burger" corresponds to "hamburger" in the output, and "Woodland Pond" now corresponds to a phrase that says something like "the lake by the woods". We have to straighten that out, because it does not even match the logical form we have; that literal phrase is not a place name that is understood in Italian. Once we identify those spans, we first substitute the original terms back, so that the Italian sentence refers to "burger" and "Woodland Pond" as parameters, as entities, and then we do the substitution and put in the right words, the local places we want. That is how we change the sentence.

I see, that makes sense. Thank you.

Thank you very much for the question. So that is how we got these results, and it is very important that we lower the cost of translating to other languages, especially low-resource languages.
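As a rough sketch of that localization step: the aligner output and the entity tables below are simplified, illustrative stand-ins for the neural aligner just described, and in the real system the logical form is updated to match the substituted entities.

```python
# Sketch of post-translation entity localization. All example spans
# and substitutes are illustrative.

def localize(translated: str, aligned_spans: dict, local_entities: dict) -> str:
    """
    translated:     the machine-translated sentence (e.g. Italian)
    aligned_spans:  English entity -> span the aligner found in the
                    translation, e.g. {"Woodland Pond": "lago nel bosco"}
    local_entities: English entity -> same-type local substitute,
                    e.g. {"Woodland Pond": "Venezia", "burger": "pizza"}
    """
    for english, span in aligned_spans.items():
        # Replace the literal (mis)translation with a local entity of
        # the same type, keeping the sentence and logical form in sync.
        translated = translated.replace(span, local_entities[english])
    return translated
```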
And for dialogues: again, we have domain-independent templates. We have English sentence patterns corresponding to what the user says and what the agent says, with the domain slots blanked out. We substitute in the domain information, so now we have many, many more sentences, and we run them through the paraphraser, which gives us even more natural-sounding sentences, and that is what we train on. We did an experiment on zero-shot transfer learning. What that means is that we take the standard benchmark and, to get the result for, say, the attraction domain, we withdraw all the training data on attraction and substitute it with synthetic training data, then train with the rest of the real training data for the other domains. Here are the experimental results. The first result is by Wu et al., using the TRADE model, and you can see that except for taxi, everything is below or around 20%. Our approach, still using TRADE, jumps it up by a substantial amount. When we change the neural model to SUMBT, which has better pretrained-network information, it goes up quite a bit more, and when you throw in a few-shot sample on top of that, it goes up again. It is actually interesting to compare this with training on the full dataset: we are within a really reasonable bound, even though we have not a single bit of real training data for these domains. With zero shot, we are at about 70% of the full-data result; with a 1% few-shot sample, we are at 80%. So we drastically reduce the cost of annotation. In summary, this is the tool chain we have created.

We talked about privacy earlier; let me explain how we manage to provide it. Because the assistant is open source, you can run it at home, so all your IoT information can be made available to your assistant at home, and it never leaves the house. The neural language translation, the semantic parsing, currently has to run in the cloud. So, for example, if I ask, what is my checking account balance, the translation is done in the cloud, but the resulting ThingTalk representation is executed locally. Nobody knows what your checking account balance is, except your own personal device. But not everybody wants to run their own local device. What we want to support is choice, exactly like email: like Hillary Clinton, if you want to run it at home, you can run it at home, while other people may want to use the cloud. What is important is that it is open source, and then you have a choice of vendors; having a choice of vendors creates competition. And the assistants are interoperable. We all know that it is because of sharing that we give up our data to Facebook, so we built sharing in as well: we allow the assistants to communicate with each other, using ThingTalk, remote ThingTalk, as the protocol. You can share all the kinds of things a virtual assistant can do with the people you choose, without giving them your passwords. I think it is just about time that we improve our sharing capabilities.
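A minimal sketch of that split, with hypothetical names and a made-up program string: only the utterance goes to the cloud, and the returned program executes locally, next to your credentials.

```python
# Sketch of the privacy architecture described above. The function
# names and the returned program are illustrative, not real ThingTalk
# or a real API.

def cloud_parse(utterance: str) -> str:
    """Stand-in for the cloud-hosted semantic parser: it sees only the
    sentence, never your data, and returns an executable program."""
    return '@bank.balance(account="checking")'      # illustrative

def local_execute(program: str) -> str:
    """Stand-in for the local runtime: it holds the credentials and
    talks to the bank directly, so the balance never leaves home."""
    return "your checking balance is ..."           # executed locally

answer = local_execute(cloud_parse("what is my checking account balance"))
```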
In the last few minutes, I want to talk about how we make open assistants real, and I really want to make this real. But this is hard. We write papers; they are not product-ready. We put the ideas out there, and typically what happens is that the large companies take all the good ideas and put them into their proprietary products, while a lot of the ideas on how to protect privacy never get implemented. We need to change that, and we are very fortunate to have convinced one foundation so far, the Alfred P. Sloan Foundation, to put money into this technology for consumers. They put up the first bit of money to support productization, so we were able to hire a few engineers, and we hope to raise more money in order to make this dream a reality.

We are putting a lot of effort into building Thingpedia. Thingpedia is the open encyclopedia of voice interfaces to IoT devices and web services. Alexa has a third-party repository, Google has its own, and what we are saying is that we need an open one. We have actually been building this for three years, but we now have better technology, and we also have engineers to make it real. What we are working toward is that by the second half of this year, we will have reasonably good skills for the top ten domains. One of the hardest parts is the IoT devices. The good news is that we are partnering with Home Assistant, an open-source community with over 100,000 users connected to more than 1,000 kinds of IoT devices. They have meta-interfaces for the specific IoT devices, so we provide the voice for the meta-commands, and through that we are connected to thousands of these devices. They have already adopted us: we have shipped with Home Assistant as its voice interface, but frankly, it was not very good. Now we are improving it with the resources and the new technology we have, and by the middle of 2021 we hope this is something people can use, with a starter community of developers.

We have been working on music; we partner with SmartNews for news, Yelp for restaurants, and Alpha Vantage for financial data. We are using the technology to do question answering over Wikidata, and then there are the table stakes: weather, jokes, timers, and alarms. We plan to put out, in the middle of this year, an open-source virtual assistant that works with these Thingpedia skills and devices. This is the first step, with all these things built by us. The next step is to attract skill contributors. The value proposition is that if you put your third-party skill on Alexa, all they do for you is intent classification; there is not a whole lot of AI behind any third-party skill they host, but we will give you better AI technology. And once you build a skill, you can put it on Alexa, put it on Google, and even put it on your own company's phone service and web services. We believe there is a reason for these companies to use this technology, because they do need voice interfaces. They may not yet believe that we can make it as an independent platform, but I believe this is useful to them nonetheless.

So in summary, we talked about three sets of open-source software: the tools, Thingpedia, and the assistant itself. What we see is that we are putting the first complete system together with some early adopters, with the help of Home Assistant. But we really want to create a community: academia, companies, open-source communities, and users. And we want to invite you to join us. There are a lot of very interesting research opportunities, and once we get a few real users, the kind of research you can do will take on a different level. So here is the information: go visit our website, join us, and collaborate.