Thank you all for coming. I am Giovanni, a student from Stanford, and today I'm going to present Almond, our effort to build an open, programmable virtual assistant. We are interested in virtual assistants because the linguistic web, the web accessible through natural language, is a great opportunity. It provides a uniform interface that will subsume websites, search engines, and mobile applications. It can see all our personal information. It can intermediate all our digital services and help us choose with personalized recommendations. But when we look at the reality, it's interesting to compare the linguistic web against the graphical web. In the graphical world we have a GUI, a graphical user interface; in the linguistic world we have a natural language user interface, or LUI. Browsers are replaced by virtual assistants, and websites and pages are mapped to skills and intents. But while the graphical web is decentralized and hosted by the owners of the content, the linguistic web is centralized and hosted by the platform, like Alexa or Google. And while the graphical web is an open platform, the linguistic web is proprietary; it's a walled garden.

Our vision is to build, instead, a better virtual assistant. A virtual assistant that is first and foremost free and open source. Open source is the default, especially in this community, but that is not enough. We want it to be better: smart, with cross-domain interoperability; knowledgeable, with information about all sorts of open-world APIs; and social, supporting interaction between humans while respecting privacy. I'm going to go through these three components in detail.

First, let me explain what I mean by smart, with cross-domain interoperability. Today's virtual assistants are like a browser: they let you navigate and interact with one site at a time. But humans, when they do all of this, need to do more. They need to share information, connect across sites, aggregate, and decide. Almond supports the human in all these tasks. Here's an example of what it means to share information, aggregate, and decide, in the context of Bob. Bob is an asthma patient, a condition that affects a lot of Americans. Bob can use his assistant to share his position, for example by letting his dad know if he's in the hospital. He can use his virtual assistant to automatically connect his inhaler with his GPS location and with his file storage service. He can use his virtual assistant to aggregate information, such as the air quality index and the fact that Bob is running, and even to decide when the doctor should be notified of a critical condition in Bob's readings.

Our key technical concepts start from the observation that virtual assistants have skill repositories that connect all the different services. But we are not just interested in collecting these repositories; we want these APIs and services to be interoperable. So we build a repository we call Thingpedia that collects the signatures of all the APIs, not just the inputs but also the outputs. This is a richer representation than existing repositories, and it supports write once, run anywhere. Given the APIs in this repository, we introduce ThingTalk. ThingTalk is a virtual assistant programming language that can, in very short statements of code, connect these APIs to execute the commands that the user actually wants.
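To make that concrete, here is a minimal sketch in Python of what it means to record full function signatures, with outputs as well as inputs, so that one API's results can flow into another API's inputs. This is not the real Thingpedia schema; the function names, parameter names, and types are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Parameter:
    name: str
    type: str               # e.g. "String", "URL", "Number"

@dataclass
class FunctionSignature:
    kind: str                # "query" returns results, "action" has side effects
    name: str
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)

# Hypothetical entries; the real Thingpedia identifiers and types may differ.
xkcd_get_comic = FunctionSignature(
    kind="query", name="com.xkcd.get_comic",
    outputs=[Parameter("title", "String"),
             Parameter("picture_url", "URL"),
             Parameter("alt_text", "String")])

facebook_post = FunctionSignature(
    kind="action", name="com.facebook.post",
    inputs=[Parameter("status", "String")])

def can_connect(producer, out_name, consumer, in_name) -> bool:
    """Check that an output of one function can feed an input of another."""
    out = {p.name: p.type for p in producer.outputs}.get(out_name)
    inp = {p.name: p.type for p in consumer.inputs}.get(in_name)
    return out is not None and out == inp

# Because outputs are declared too, a tool can verify this composition:
print(can_connect(xkcd_get_comic, "alt_text", facebook_post, "status"))  # True
```

The point of the sketch is only that declaring outputs alongside inputs is what makes composition, and therefore interoperability, checkable.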
But of course, the user is a user, not a programmer. They cannot write code; they want to speak natural language. So our goal is to build a neural network that can translate natural language into executable ThingTalk, which can then make use of all the web APIs we're collecting in Thingpedia.

Going deeper into this concept, Thingpedia is our encyclopedia of things. For each thing, device, or service, it collects both the API and how that API can be used, as small snippets of ThingTalk with the corresponding natural language. So you can note that the search function can be used to search tweets by hashtag, and then you would say "tweets with hashtag voice UI". Or you can say that the word "tweet" can be used to make a tweet on Twitter. Currently, Thingpedia has 104 devices and 249 functions.

These functions can now be combined using the ThingTalk programming language. This is the high-level programming language we designed, and it has one simple construct, called when-get-do, with filters. Here are some of the examples we can support with this simple construct, and it should be clear that ThingTalk is designed to express exactly what users want to do. For example, you can say: when my advisor is online on Slack, automatically set me to away, so I don't get notified.

Given this definition of the programming language, the next step is how we actually understand the user's input. We frame this as a translation problem: we take a sentence, say, "monitor XKCD, and if the alt text contains Linux, post it to Facebook", combine it with the fact that XKCD maps to a specific snippet of code in Thingpedia and that posting something on Facebook has its own corresponding snippet of code, and translate it directly into the executable code that will then monitor XKCD and post to Facebook.
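As a rough illustration of the target the translation produces, here is a minimal sketch in Python of a when-get-do program with a filter, using the XKCD example above. This is not the actual ThingTalk surface syntax; the function and parameter names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class Invocation:
    function: str                      # e.g. "@com.xkcd.get_comic"
    params: Dict[str, str] = field(default_factory=dict)
    filter: Optional[str] = None       # boolean predicate over the results

@dataclass
class Program:
    when: Optional[Invocation]         # stream part: fires when new results appear
    get: Optional[Invocation]          # query part: fetches extra data
    do: Invocation                     # action part: produces side effects

# "monitor xkcd, and if the alt text contains linux, post it to facebook"
program = Program(
    when=Invocation("@com.xkcd.get_comic", filter="alt_text contains 'linux'"),
    get=None,
    do=Invocation("@com.facebook.post", params={"status": "picture_url"}))

# The semantic parser is trained on pairs of (utterance, program) like this one:
training_pair = (
    "when xkcd posts a comic whose alt text contains linux, post it on facebook",
    program)
```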
So that is the idea. But the natural language input is not as simple as writing down code in English. Given a sentence that could be constructed out of the information in Thingpedia, like the one at the top, the user might issue all sorts of variations of the command. So how do we build our system to be knowledgeable enough to actually understand all these variations and map them, ultimately, to the same code? In particular, how do we translate all these tasks from natural language into code? The key intuition here is neural semantic parsing. This is a machine learning technique that takes natural language and translates it into an executable form, a formal representation. And like everything in machine learning, the hard part is acquiring data. Training data in this context is very expensive to acquire because it consists of pairs of sentence and program, so it's not enough to hire an annotator: you need an expert in the domain who can write down the program.

Luckily, we have a solution for this: Genie. The Genie parser generator is a tool that allows domain experts to program the training data. Domain experts don't need machine learning expertise, which is embedded in the tool instead; they can be experts in how their domain should be accessed in natural language, contribute that knowledge to Thingpedia, and immediately extend the virtual assistant with new capabilities. So how do we train without real data? How do we get Genie to produce a virtual assistant without real data? Well, for training we have acquired 1.3 million sentences that we synthesize out of templates provided by the domain experts, and I will go into detail on what these templates look like. Out of these, we crowdsource 24,000 sentences using Mechanical Turk. This costs about $8,000, which is about half of one quarter of a graduate student, so it's relatively cheap. And after data augmentation, our training set totals 3.6 million sentences.

The next question is evaluation. State-of-the-art research says you should evaluate on paraphrases. But as it happens, paraphrases are misleading: you can get 95% accuracy on paraphrases, then you deploy your system in the real world and get less than 20%. The reason is that while paraphrases are written by humans, and so are more real, they're still influenced by the synthesized data and the templates. They're too close to them, and not necessarily how people actually express commands. So instead, we evaluate on a small set of realistic user data. Realistic user data is the most expensive data to acquire, because it has to be annotated by hand, but we only need a small amount, because we only need it for evaluation. We use two methods. The first is a cheat-sheet method: we go on MTurk, show a worker a cheat sheet that lists all the APIs and services that Almond supports, and ask them to come up with a command, which we then annotate. The other method is to use If This Then That (IFTTT). This is a popular web service where people connect different services and devices and write a corresponding recipe, and we scrape that data. This is data we have no control over, with no relation to Almond, but for the commands that we support we use it as validation and test. We collect about 1,500 sentences for validation, which we use to refine the templates and get a better training set, and 360 sentences as a test set that gives us the final accuracy number.

So I keep mentioning templates. What are these templates? As I said, in Thingpedia we have information about both how an API is used in code and how it is used in natural language. Note in particular that the same API can be used in many different ways. The list-folder API for Dropbox can be used to, well, list a folder, "list my files", but also to "list the files that changed most recently". You can use it to be notified when the list of files changes, meaning a file was modified, or when there is a new file name, meaning a file was created. Note also that the same function, such as open, can be referred to both as a noun and as a verb: the noun would be "the download URL of a file", and "open that file" would be equivalent. Therefore, existing state-of-the-art techniques, which are based on associating a single natural language expression with every API, are not sufficient, because we want to understand both the noun and the verb.

Given these primitive templates provided by Thingpedia contributors, how do we combine them into full constructs? We have designed a language of construct templates, which defines all the ways primitives can be combined. Each construct template is both a grammar rule over natural language and a semantic function that combines the corresponding pieces of program to compute the resulting program. So a construct template can say that "when [when-phrase], [do verb phrase]" corresponds to a when-do program, and similarly that "[do verb phrase] when [when-phrase]" corresponds to the same program. As designers of the language, we have collected 146 construct templates.
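Here is a minimal sketch in Python of how construct templates might combine primitive templates into synthesized (sentence, code) pairs. The template strings, function names, and code fragments are illustrative assumptions, not the real Genie template language.

```python
# Primitive templates: natural-language fragments paired with code fragments,
# as contributed by Thingpedia developers (names are hypothetical).
when_primitives = [
    ("my advisor is online on slack",
     "monitor @com.slack.user_presence() filter status == 'online'"),
    ("xkcd posts a new comic",
     "monitor @com.xkcd.get_comic()"),
]
do_primitives = [
    ("set my status to away",
     "@com.slack.set_presence(status='away')"),
    ("post it on facebook",
     "@com.facebook.post(status=alt_text)"),
]

# Construct templates: a sentence pattern plus a semantic function that builds
# the combined program from the pieces.
construct_templates = [
    (lambda w, d: f"when {w}, {d}", lambda w, d: f"{w} => {d};"),
    (lambda w, d: f"{d} when {w}",  lambda w, d: f"{w} => {d};"),
]

def synthesize():
    """Exhaustively expand construct templates against the primitives."""
    for nl_tmpl, code_tmpl in construct_templates:
        for w_nl, w_code in when_primitives:
            for d_nl, d_code in do_primitives:
                yield nl_tmpl(w_nl, d_nl), code_tmpl(w_code, d_code)

for sentence, program in synthesize():
    print(sentence, "->", program)
```

With a realistic number of primitives and construct templates, this kind of expansion is what produces millions of synthesized sentences from a comparatively small amount of contributed knowledge.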
These construct templates are written once and for all for the subset of functionality you care about, whereas extensions to the capability of the system come from the primitives.

So, putting the pipeline together, we start with an API library and primitive templates, which is essentially Thingpedia. We combine these with construct templates, written by the authors of the programming language, and use them to synthesize our initial training set. A small amount of the synthesized data is then paraphrased by humans on a crowdsourcing platform, MTurk. Both the synthesized and the paraphrased data are then passed to an augmentation step, which injects parameter datasets, like song names, people names, words in the dictionary, and so on, to produce our final training data. Note in particular that we train with both paraphrases and synthesized data. The paraphrases are more natural and give robustness to the model, because they teach all the different ways the same command can be expressed. The synthesized data is much larger and gives compositionality to the model, because it shows the different ways APIs and natural language concepts can be combined, according to the rules of the construct templates and the programming language. Furthermore, we can take a small number of paraphrases and manually expand them into new construct templates. This lets us get more value out of the crowdsourced workers who put effort into writing interesting paraphrases.
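Going back to the augmentation step for a moment, here is a minimal sketch in Python of what injecting parameter datasets might look like. The placeholder convention, dataset names, and function names are assumptions for illustration, not Genie's actual mechanism.

```python
import random

# Parameter datasets; the real system uses much larger lists
# (song titles, contact names, dictionary words, ...).
parameter_datasets = {
    "SONG":   ["bohemian rhapsody", "hey jude", "take five"],
    "PERSON": ["alice", "bob"],
}

def augment(sentence: str, program: str, n: int = 4):
    """Yield up to n copies of the pair with each $SLOT placeholder filled in."""
    slots = [s for s in parameter_datasets if "$" + s in sentence]
    for _ in range(n):
        filled_sentence, filled_program = sentence, program
        for slot in slots:
            value = random.choice(parameter_datasets[slot])
            filled_sentence = filled_sentence.replace("$" + slot, value)
            filled_program = filled_program.replace("$" + slot, '"%s"' % value)
        yield filled_sentence, filled_program

# A synthesized or paraphrased pair with placeholders (hypothetical functions):
pair = ("play $SONG and tell $PERSON about it",
        "@com.spotify.play(song=$SONG) => @messaging.send(to=$PERSON)")
for s, p in augment(*pair):
    print(s, "->", p)
```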
And now the machine learning model. We use an off-the-shelf decaNLP model. This is a contextual question answering model from Salesforce, a bi-LSTM with self-attention. I will not go into the details, but you can look at our paper. Without wasting more time, the key result is that Genie improves the state of the art significantly. Even though we didn't have any realistic training data coming from real users, we still get 62% accuracy on real data. Furthermore, we performed a case study on a single skill, Spotify, that we put extra effort into, and there we can get 80% accuracy. We also found that you can extend the programming language by adding new construct templates and new features to the language and still get good accuracy with a small amount of effort. This shows that domain experts can contribute to Thingpedia and to the ThingTalk design and get a more powerful virtual assistant that grows incrementally.

And so we have applied this, and we have started the process of crowdsourcing Thingpedia. A preliminary research exploration started with 60 college students who wrote 57 skills as part of a homework assignment. They had two weeks and no prior knowledge of Almond, but they still succeeded and wrote 187 functions. Out of that, we started a developer program. It's very early, I want to stress, but we have 117 registered developers belonging to 90 teams, 17 teams have completed our tutorials, and in total they have contributed 55 skills to Thingpedia.

So why do I believe that Thingpedia will eventually be successful when I compare it to, say, the 80,000 skills in Alexa? Well, the interesting thing to consider is that the Alexa skill repository, the Alexa Skills Kit, only stores actions, straight-up dispatches. Google is a little smarter: it stores actions and the corresponding context, that is, where the action should be invocable. Bixby is a lot richer: it stores actions, it stores queries (or concepts, in their terminology), and the context.

But Thingpedia is designed to eventually subsume the representations of all the major virtual assistants. It has actions. It has queries. It has filters and computation, aggregation, sorting: anything you would find in a query language, in a SQL-like language, is in scope for ThingTalk. And it will have context, shown in square brackets because it's still work in progress, but we have a design for it. So the idea is that, given all these platforms with all these different representations, we can have a single representation in Thingpedia, such that if somebody contributes to Thingpedia, the service is immediately available on all the other platforms. This is a strong incentive to write for Thingpedia instead of writing for all the platforms separately. Furthermore, because we have our own natural language model, it will not only be available on the proprietary platforms, it will also be available on Almond, a free and open source assistant. And because we are open source, it will be available on websites, on apps, on phone services, on any kind of integration that can take messages and wants a chatbot replying. That's why we believe that open-world knowledge will eventually trump the walled garden: it will allow more people and more experts to contribute their knowledge.

And now I want to switch and talk about the third pillar of Almond: it is social. By social, I mean it helps people share. So how do people share nowadays? Well, if we're talking about sharing data, like photos or files, they essentially have two options. You either go to Facebook or Twitter or Instagram, or you use email and messaging. Perhaps you use secure messaging, like Signal or Matrix, but it's still a messaging system. What about IoT devices and services? Are you going to share your username and password directly? What if that service is behind Login with Google, like Nest? Are you going to share your Google password? That's not really acceptable. Our goal is to share with fine-grained control: define not only who, but when, where, and what can be shared, both in terms of data and in terms of access to IoT devices and web services. And of course, we propose to use the virtual assistant to share, because virtual assistants can serve requests on our behalf as owners, and they can also serve requests on behalf of other people who would like to access our data. This leads to a federated architecture where every user has their own virtual assistant, and the virtual assistants communicate over a secure, shared messaging layer to collaborate and execute the requests of their users. This architecture has no third party; there is no central owner of the data, like Facebook. It provides generality, because any service the assistant knows about can be controlled, and ultimately any service on the internet. It is usable, because everything is accessible through natural language, which people understand. And it gives privacy, because direct access to the web service is never granted. In support of this design of sharing through virtual assistants, we have designed a policy language called ThingTalk policies. You should notice that it is a straightforward extension of ThingTalk: it adds a source, defining who is asking to execute the command, and it adds filters on every primitive. You can immediately see it is expressive: it supports all sorts of commands and restrictions you might want to give your virtual assistant about what is or is not allowed.
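Here is a minimal sketch in Python of a policy that adds a source (who is asking) and a filter checked at execution time, in the spirit of the ThingTalk policies just described. The function names and the environment fields are illustrative assumptions, not the real policy syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Policy:
    source: str                        # who may issue the request, e.g. "dad"
    function: str                      # which function they may invoke
    condition: Callable[[Dict], bool]  # extra filter checked at execution time

# "Dad may be notified by the security camera, but only when I am not home."
policies = [
    Policy(source="dad",
           function="@security.camera.new_event",
           condition=lambda env: not env["owner_at_home"]),
]

def is_allowed(requester: str, function: str, env: Dict) -> bool:
    """Check a remote request against all stored policies."""
    return any(p.source == requester and p.function == function
               and p.condition(env)
               for p in policies)

print(is_allowed("dad", "@security.camera.new_event",
                 {"owner_at_home": False}))   # True: Alice is away
print(is_allowed("dad", "@security.camera.new_event",
                 {"owner_at_home": True}))    # False: Alice is home
```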
Let's go through an example in detail of how ThingTalk and ThingTalk policies support sharing. Say a dad wants to help his daughter monitor her house, and they have an IoT-enabled security camera. The dad might say: ask Alice to notify me when the security camera detects motion. He says that to his virtual assistant, which translates it into an executable representation. It gets sent over a protocol to Alice's virtual assistant, which performs a formal security check against the policy database and decides whether the program should be executed. In the common case, though, this is not a top-down ABAC system with system administrators; it is unlikely there is a policy in place ahead of time. So instead, Alice's assistant will ask Alice for permission. In particular, the program can be translated back into natural language to form a permission request that is shown to Alice. Note that this translation happens on Alice's assistant, based on exactly the executable representation, which makes it precise and secure: it's not affected by whatever dad wants to say. Now, should Alice agree to share the security camera with her dad? Well, maybe not unconditionally, but she can instead say: only share it when I'm not home. The original request, with the restriction Alice applied, is then stored as a new policy in the database. And with this new policy, the assistant is now allowed to execute the command and send the result back to dad's assistant over the protocol, and dad's assistant can display a notification showing that motion was detected on Alice's security camera.
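Here is a minimal sketch in Python of that flow: an incoming remote program is checked against stored policies, and if none matches, the owner is asked and any restriction she adds is saved as a new policy. All names are illustrative assumptions; this is not the real Almond implementation.

```python
# Stored policies: (requester, function, restriction) triples.
policies = []

def describe(requester: str, function: str) -> str:
    # In Almond, the permission request shown to the owner is generated from
    # the executable representation itself, on the owner's side; here we just
    # format a string.
    return f"{requester} wants to be notified by {function}"

def handle_remote_request(requester: str, function: str, ask_owner) -> bool:
    """Allow the request if a matching policy exists; otherwise ask the owner."""
    for who, fn, _ in policies:
        if who == requester and fn == function:
            return True
    answer = ask_owner(describe(requester, function))
    if answer is not None:            # owner approved, possibly with a restriction
        policies.append((requester, function, answer))
        return True
    return False

# Simulated owner who approves, but adds a restriction:
allowed = handle_remote_request(
    "dad", "@security.camera.new_event",
    ask_owner=lambda prompt: "only when I am not home")
print(allowed, policies)
```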
So we have built this whole infrastructure, but do we really need it? Well, we have conducted an experiment with 200 people from the general public. This is a survey we ran on Amazon MTurk, in which we showed them 20 scenarios about sharing data, access to accounts, credit cards, and so on. We start with role-based access control: would you share this asset with, say, your daughter or your kid, just based on who they are? You can see quite a spectrum, from very restricted, like the credit card, to very liberal, like Netflix. Then we ask: would you share if we gave you additional, fine-grained constraints, such as a $20 budget limit, or only PG-rated movies on Netflix? The interesting thing is that across the board, regardless of how restrictive or liberal the sharing was originally, there is an increase in people who are happy to share given fine-grained constraints. In fact, looking at all 20 scenarios in the survey, the number of cases (people times scenarios) where there is actual sharing doubles when we have fine-grained access control.

We then wondered: is our abstraction correct? Are the policies useful enough? We collected 220 use cases from MTurk, from workers who are not familiar with the system. We presented them with three examples of someone issuing a request and somebody coming back and saying: no, only under these constraints. The interesting thing is that 85% of the use cases can potentially be supported: 70% with existing APIs that are already out there, which we just need to add to Thingpedia, and 15% with new APIs or new capabilities in the web services. 6% of the use cases we collected are unenforceable. One example of this is someone willing to share their library card, but only if the person receiving it agrees to return the books on time. This is unenforceable because it's an entirely out-of-band, non-digital transaction. The remaining 9% of use cases are those that are actually not supported by our policy language. These primarily involve actions that require multiple constraints or aggregations over time, for example checking how many times a certain command gets invoked, because our policy language is applied instantaneously, at the time the command is executed.

And so with this, I have introduced our technology. Where do we stand, where are we going next, and how can you, hopefully, help? Our current status is that Almond is developed by the Stanford Open Virtual Assistant Lab, or Stanford OVAL. We have National Science Foundation funding for research, for the neural network, for the architecture. But we also want to start partnerships with industry, to bring Almond not just to academia, to academic conferences and papers, but to the real world, to real users, and make a real impact. And of course, Almond is 100% free software. Most of it is under the GPL, because I personally believe the GPL is the strongest protection, in particular for consumer software: you don't want it to be locked down. But for Thingpedia, we use a permissive license. So you can adopt Thingpedia, you can contribute to Thingpedia, and you don't even have to contribute an open source skill, although we strongly suggest you do. And of course the data and the models: all the data we have crowdsourced from MTurk, from paraphrasing, and the models we train, are in the public domain, so you can go and make the best use of them.

We have built prototypes of Almond. We have a web version, a scalable multi-user cloud version, available at almond.stanford.edu. We have an app on Android that is designed to keep as much as possible of the user's data, computation, and credentials stored locally on the phone. And we also have a desktop version, where desktop really means the Linux desktop, or GNOME, which is available on Flathub. Our developer program on Thingpedia is open, so go and get a developer account right now. You can create new skills in minutes if you follow our tutorials, and the APIs and the repositories are free to use. There is no limit; you can contribute and play with it as much as you want. You can also create custom neural network models with your own language and your own skills. These are limited because our infrastructure is small, but you can play with them.

We have identified four areas in which we believe the open source community is best suited to help. First of all, growing Thingpedia. As I said, we want to attract all the domain experts. If you have a service that you would like to see on Thingpedia, some web service that you really like to use, just put it on Thingpedia. It's easy. Second, voice. We need a voice interface to be a voice assistant, but it's not our area of research, so it's not an area where we have spent a lot of effort beyond using the pre-packaged software that is available out there. But there is an opportunity to do something better than the existing software, and in particular to reduce the reliance on proprietary cloud APIs with bad privacy. Third, internationalization.
Translation and the ability to have worldwide reach are a strong suit of free software, so I believe we can make a very strong impact there. And finally, cross-platform. We want Almond not just on Android but also on iOS, and also on all sorts of messaging platforms, whether it's Matrix, Slack, or WhatsApp. We want Almond to subsume and be available on Google Assistant and on Bixby. We have a prototype of Almond and Thingpedia skills running on Alexa, but we want to be on every platform, and we hope the community can help. To that end, we are hiring. We are in particular looking for an iOS developer, but we are also hiring staff researchers and staff engineers to help us. You can also help directly: try Almond on our website; go to our GitHub, play with the code, download it, build it; or sign up on our community forum, get involved, make yourself known, and meet the rest of the community. And if you're interested, please feel free to reach out to me, even during the conference. That would be amazing. Thank you.