Okay, so today we'll have fun with natural language queries, but before that, two announcements, or three. First, we are roughly at the middle of the semester, so it's time to stop, digest everything, and take an exam, which will happen in this room at the usual time in a week. Also, as you study with your friends, you'll want to write a proposal for what your final project should be. That will be assigned today; essentially, we'll post directions on what you should write up. We talked about it last time, so hopefully you've been thinking about it already. And one other thing I want to discuss are those darn hash values in homework five, which I recommended as a solution; they didn't work and probably should not have been recommended to begin with. But I want to make sure that everybody understands what was going on. There are actually two parsers that we have. Well, it's one parser that performs syntax-directed translation in two different ways. One of them performs the syntax-directed translation right on your machine, inside the Python program. This is the sort of syntax-directed translation that you will write in PA5, and this is also what the staff does when we debug our grammars and so on; it's what I used when I developed the solution for this homework. This is also the parser that used to be sitting on the App Engine, so it would perform the syntax-directed translation right there on the Google machine. Does anybody see a problem with that? I don't know how many of you peeked into the rparse.py file. Did you guys look into the file that does the remote parsing, ships the grammar to the other side, to Google, and gets the result back? Yes? So what does the remote parser look like? Did you notice how it does the syntax-directed translation? With this setup, essentially, each time you send your grammar to the App Engine, whenever you invoke the remote parser, it does something special with the syntax-directed translation.
Does anybody know what it does? So what's dangerous about running the syntax-directed translation on the Google App Engine? Okay, so who could execute anything? Exactly. So the answer is that you could execute anything; presumably you mean that in the actions of the grammar you could put arbitrary Python commands, right? So what would be some interesting commands to put into the grammar? Okay, and do what? Okay, you could do that, but we have a backup copy, so it's not such a big harm. Yeah, you could just go through the directory, find the solution grammar, ship it back to your client machine, and be done with it. And in fact, we were running the parser in this mode for a few assignments last year. We knew about this danger, of course, but we were hoping that not many students would realize it. One of them did, and was kind enough to tell us, and we realized, well, maybe we should not just assume that they will not realize it. This is, by the way, a security question, but it's the part of security that deals with programming languages, so talking about it is completely appropriate for CS164, and you should hear how we actually solved it. So how do we perform the syntax-directed translation now? Clearly not right in the middle of the directories with the solution grammar. So how do we prevent you from reading our solution? By the way, besides reading the solution grammar, you could also get the solution to the parser itself, so there is a lot of valuable material there. Well, in principle we could perhaps run it under a different user, but that is not what the Google App Engine allows us to do; it doesn't give you multiple users where some of them can read a file and some cannot. Perhaps we could hide the parser executable, but not the grammar, since that's really an input.
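To make the danger concrete, here is a hypothetical sketch of the exploit being discussed. The grammar syntax, the action format, and the file name `solution.grm` are all invented for illustration; the point is only that if semantic actions are evaluated on the server, any Python expression in an action runs with the server's file-system access.

```python
# Hypothetical illustration (made-up grammar syntax and file name):
# a grammar action that, instead of computing a semantic value,
# reads the staff's solution file and returns it as the "result".
malicious_grammar = r"""
E ::= E '+' T   { open('solution.grm').read() }
"""

# A toy stand-in for server-side action execution. eval'ing an
# untrusted action string is exactly what makes this unsafe:
def run_action(action_src, env):
    return eval(action_src, {}, env)   # arbitrary code runs here

# Harmless demonstration of the same mechanism:
print(run_action("n1 + n2", {"n1": 3, "n2": 4}))  # -> 7
```

With this design, the "translation" the student gets back is whatever the action evaluated to, which can be the contents of any file the server process can read.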
Yeah, I don't know if there is a way to sort of separate the file system in two different ways. Exactly, so this is what we do. The answer is that rather than executing the syntax-directed translation on the server, in the parser directly, as you will do when you write your own parser, we emit a chunk of code which we ship back to you, and you execute it on your machine. This is actually what's happening inside rparse.py; you don't know about it, but this is what's happening. Now, it's not completely safe either. Before I tell you what the problem is with these two parsers, the one that does everything locally and the one that ships the code to you, from the security standpoint, what is not great about shipping the code to you? Well, we could run anything on your machine, yes, but we don't really care about that; we already have access to all your files in the repository, so what more is there to see? So yes, we could do that, but that's not what I had in mind. There is something else, some other danger, related to the first one, in fact very similar to the first one. You could print the code that we send to you; in fact, it's trivial, you go to rparse.py and it's right there in some string which is evaluated. What would you get from that? Would you learn anything? So what do you think that code looks like? What happens during syntax-directed translation? You have a parse tree, and you have some actions that are performed at each node of the parse tree, okay? So what do you think the code looks like that we send to you? One version would send you recursive code: a traversal routine, a visitor of the tree, and a bunch of functions which are invoked as you walk the tree. That would reveal quite a bit. So what I actually do is walk the tree on the server and generate completely straight-line code.
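The two modes can be sketched as follows. This is a minimal reconstruction, with all names invented, not the course's actual parser: a parse tree node is either a leaf value or a pair of an action (kept as a source string) and its children, and the same tree can either be executed locally or flattened into straight-line code for the client.

```python
def sdt_local(node):
    """Local mode: execute each semantic action while walking the tree."""
    if not isinstance(node, tuple):
        return node
    src, children = node
    return eval(src)(*[sdt_local(c) for c in children])

def sdt_emit(tree):
    """Remote mode: walk the tree on the server but, instead of
    executing an action, print it as one line of straight-line code.
    The client exec()s the result; the traversal is flattened away,
    but the action bodies remain visible in the emitted code."""
    lines, n = [], [0]
    def go(node):
        n[0] += 1
        v = f"v{n[0]}"
        if not isinstance(node, tuple):
            lines.append(f"{v} = {node!r}")
        else:
            src, children = node
            args = ", ".join(go(c) for c in children)
            lines.append(f"{v} = ({src})({args})")
        return v
    return lines, go(tree)

# (1 + 2) * 4, with made-up actions:
tree = ("lambda a, b: a * b",
        [("lambda a, b: a + b", [1, 2]), 4])

print(sdt_local(tree))            # -> 12
lines, result = sdt_emit(tree)
env = {}
exec("\n".join(lines), env)       # what the client does with the shipment
print(env[result])                # -> 12
```

Note that the emitted lines no longer reveal the traversal or the grammar structure, but each action's body is spelled out verbatim, which is exactly the remaining leak discussed next.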
But the bodies of those functions, which I concatenate onto the output, you can still see. It's hard to see what the grammar is doing, because the traversal completely flattens the code, but the bodies of the functions are there. So if you looked into the code that we send to your machine, you would find there the difficult-to-write actions for the homework five solution. The escape characters, all this stuff was right there. You would not see how we walk the tree, but you would see the hard part of those actions. So that was sitting right on your machine, but luckily nobody noticed. So this sort of sandboxing, where we send you the code and it is never executed on our machine, saves us from you reading the parser and the grammar, but we still send you part of the code, which is valuable. So how could we make it even better? And perhaps, since this is being recorded, we'll have to make it better next time homework five is offered. How would you prevent that danger, the fact that you get code in which those semantic actions are all visible? Okay, yeah, so we could do that. You are essentially suggesting that we would write the grammar, and essentially the solution, in some alternative language. That's too much work. It would work, but is there something we could do without essentially writing a different solution? Oh, I see. So those procedures would actually be sitting on the server and you would execute them remotely. Yeah, but that involves a call back to the server where this code is executed, so that wouldn't quite protect us completely. So what else could we do? Okay, so if I understand, you are suggesting a three-tier infrastructure: here is your client, here is some middle server, and there is the parser server, and the parser server sends the code not to your client but to this demilitarized zone in the middle, where the code is executed. You only see the result of the execution.
You don't see the code, and we execute the code not on our server, where all the valuable files are, but in the middle, where there is nothing but a Python interpreter to run it. So that would be the right solution, and I think this is what we will probably do next time. This is the sort of sandboxing people do to protect their information, so it's a beautiful example to discuss in this class. Now, the way I was thinking about it: here is your client, a student's machine. You send your request to the server. There is this intermediate one that does nothing but forward it, and here is the real server on the App Engine, which contains the parser and our grammar, and it sends the result back to this one, where the code is executed, and from here you receive just the result. So this way you are getting just the result of the SDT, say seven; here we send the code with the actions, the code is executed here, and this way you only see the result. Now of course there is the danger, if I heard you correctly, that you completely take over the middle server with your actions and send things somewhere else, if that is possible. I'm hoping that Google prevents you from making arbitrary connections, or that is one thing that we could stop. And even with that, I don't know whether other exploits are possible, but at that point hopefully it's easier to just write the homework. Your means yours or mine? So what is the scenario that you have in mind? Oh yes, you could. So do reflection, essentially, in Python: read the code that is running and send it out. Yeah, we could perhaps run it in some version of Python without certain libraries, one that doesn't allow you to do that, so turn off some of the reflection. Yes, let me repeat so that people understand, because that is interesting.
So the actions that you send in this code, which are executed in the middle tier, could essentially contain a statement which says: give me the code that I'm running, in some intermediate form, some Python bytecode that you can reverse-engineer, and send that back. And so that ability, the ability to make that call, we would have to disable. Another approach, as you are saying, is to sanitize the input so that the code cannot contain certain functions. It's hard to sanitize it perfectly, but perhaps that's the right approach. So I would like you to hold that thought; we'll start talking about these issues later in the semester. I want to get to the natural language thing, and I have one other thing to discuss before we get there. Although this is in the context of Python and your homework, what you see happening is exactly the kind of exploit that people do with JavaScript in the browser: whenever you go to a website and click on a link, it downloads JavaScript into your browser, which runs it, next to the other tabs with information about your banking and passwords. If things are not set up right, that code can get there and send your credit card information somewhere else. So what you see here is just another instance of the same thing. Okay, so now here is how the syntax-directed translation looks in the so-called local SDT, in the parser that you will write. I simplified it drastically so that we get to the core of the problem. SymVal is a new object; it is the thing that you refer to as n1, n2, n3. It contains exactly one field, the field val, okay? And of course this SDT happens during the traversal of the tree, but this is essentially how it looks. There are really no two variables X and Y; there are some arrays or hash tables or whatever, but essentially you have two variables X and Y that keep those objects.
When I wrote the remote parser, I essentially went through this code and said: instead of executing this line locally, I put a print statement around it and printed this code into a file, which was then shipped to your client, and I printed it essentially the way you see on the right. Again, there are no X and Y variables; it's more complicated, which is why I didn't see that the behavior is different. So why are these two ways of running the SDT different? Exactly, so this is the whole idea. Essentially you can think of it this way: when the execution here comes to this point, this object is not going to be available anymore, because we are just going to lose the reference to it. Only the variable X held a reference to this object, so after this assignment it is going to be gone, and therefore Python is allowed to give this new object, allocated here, the same hash value. How that happens, whether they use the physical address of the object, doesn't matter. The key property is that the first object has disappeared; nobody can reference it, therefore Python can reuse the same hash value, and indeed this is what happened. This and this would produce the same hash value, say seven and seven, whereas here you were guaranteed to get different values, say 14 and 16, because both X and Y are still live and they point to those objects. Now if you look carefully at the second example, you will see that in this example they are actually forced to have different hash values. Why is that? I oversimplified a bit too much; I would need to make the code a little bit longer to see the reused hash value. Let's see, I don't know how to make it bigger, I guess you'll have to live with it. I can just create another object. They do have different values, but now a value is reused. So who understands what's going on? Essentially we had to add another assignment to X and another hash of X to see the same value. Why is that? Uh-huh, exactly.
So when you create the second object, the first object here is actually still reachable, because the call to the constructor happens before you overwrite X. At the time of that call, X still points to the object created here, and therefore Python is forced to give the new one a different hash value. It's only the third call which gets the same value. Excellent. So these are all PL questions. Now what are the lessons from this story? Lesson number one is test, test, test. The highlighter grammar for regular expressions was written before we had a parser that generates code and runs it on your machines, and it worked well because it ran in the SDT style on the left. Then we changed the parser. I assumed, oh, the semantics didn't change, let's just post the solution. The solution didn't work, and it was discovered by the paying customers. We did it the Microsoft way. Well, Microsoft no longer does it that way, but people claim they used to. So we should have tested it; that's lesson number one. Lesson number two, which Sean would tell you, is that only a professor would ever use a hash value to create a unique value. Why is that a bad idea, for a professor or otherwise? The hash value, can you elaborate? I'm not sure I followed. The whole point of a hash is that equal values hash to the same thing, so you should not try to use it to create unique values. No, I see, I see. So you're saying that there were sub-trees in the parse tree which were identical and therefore should have the same hash value, whereas I used the hash to somehow obtain a unique one. That's true. So essentially you're saying that if the tree is truly viewed as a functional object, as a mathematical entity, you look at one thing and the other thing, they are the same, and indeed they would be represented with the same values, and therefore they should have the same hash value. That's true, and it explains why in that case on the right it misbehaves, because the hash values do give you the same value.
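The hash behavior just described can be reproduced directly. This is a reconstruction, not the actual homework code: it relies on the fact that CPython's default object hash is derived from the object's id, and that freed memory (and hence an id) may be handed to a later allocation.

```python
class SymVal:
    """Stand-in for the parser's semantic-value objects (one field)."""
    def __init__(self, val=None):
        self.val = val

# Local SDT style: both objects stay referenced by live variables,
# so their ids, and hence their default hashes, must differ.
x = SymVal()
y = SymVal()
assert hash(x) != hash(y)

# Generated straight-line style: objects die at each reassignment.
x = SymVal()
h1 = hash(x)
x = SymVal()    # built while the old object is still referenced by x,
h2 = hash(x)    # so h2 is forced to differ from h1
x = SymVal()    # the first object's memory is now free to be reused,
h3 = hash(x)    # so h3 often equals h1 in CPython (not guaranteed!)

assert h1 != h2
print(h1 == h3)  # frequently True -- this reuse is what broke the trick
```

This also shows the three-call point from the lecture: the second allocation happens before X is overwritten, so it cannot reuse the first object's slot; only the third allocation can.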
What Sean would tell you is that you are relying on some part of the interface between the semantic actions, which you write, and what's going on in the parser, and that could change at any time; in fact, it changed as the parser went from this version to that version. And I was happy with this solution because, oh, it's only a toy program, what could ever go wrong, and of course it did go wrong. Sean, are there more lessons besides this? There probably are, okay. Anyway, let's get to what we want to discuss today: applications of natural language queries. Let me give you a result of an old research project that sort of started the whole movement. Instead of writing a SQL query, you could ask something like: who works on three projects? And the answer comes back, and you have a bunch of researchers who work on three projects. Which of them are project leaders? You actually ask the questions the way you see here, and you get the leaders; documents describing their projects come back, and potentially an estimate of which of them won't finish before '94. Are they led by this or that person? You can abbreviate the names, and it gives you a sort of informal answer rather than a formal one. And I don't know how old you are, but if you follow the trends of computer science over the last five, ten years, we see more and more movement toward these natural interactions. You see Wolfram Alpha, which you can use today; I think the questions you can ask are more interesting than who works on what project. You can ask about the gains and so on. It probably does much more now than when I last took the screenshot. And what we envision in the future is something along these lines. And Siri answers; it's a pretty deep dialogue, but this is what we expect if the trend continues. And so, something that we would really like to do: if you remember lecture one, we wrote a real Prolog program that took the hats problem and encoded it, using our intelligence, and then solved the problem.
And what Prolog did for us was nothing more than enumerate all the possible solutions and check which of them is correct. It would be really nice if we could actually give it an English description of the problem and the translation to Prolog or Datalog would happen automatically. This is a lot of what we are going to talk about today, although I don't think there are systems yet that do it; perhaps there is something. So as part of PA6, you will get a problem like that, similar to the hats or stamps problem, except a bit more involved. And by the time of PA6, you will have your own Prolog interpreter with your own Prolog parser. What we'll ask you is how far you can actually push the translation of this text to Prolog, rather than encoding it by hand. You will be free to take the text and change it a little so that it is somewhat stylized, so that the parser doesn't need to understand arbitrary English, just stylized English. But it would be nice if a huge chunk of the example were translated to Prolog automatically. Okay, so let's see. This is what a formal query language looks like. It's SQL; you have a relational database with one table and another table. This one stores employees and the department they are in; this one maps employees to phones; this one has the managers. You've seen this before. And then you can write a query that does what? It's a SQL query that selects certain fields, the employee field and the manager field, from these tables. And here is a predicate, a condition that needs to be met, which tells you which of the rows of the database to print. This should be pretty familiar. If it's not, that's okay, because we are not going to use SQL; we are going to write things in Prolog, in Datalog really, because all the queries we will write lie in the Datalog subset of Prolog. So here is a query in SQL. It prints those manufacturers that make the beer that Joe's pub sells.
So now I want you to take paper and pencil and write the query in Prolog that does the same thing. It will be much more concise, and it will, I think, make you appreciate writing queries with free variables. You can talk to your neighbors again, but I want to make sure that you understand how queries of this sort are written in Prolog, because later in the lecture we'll get to automatically translating natural language into those queries. You can assume, of course, that the tables, the predicates, look like this. So who's got a solution, in two lines, three lines? Okay, so let's hear it. What would be the first query, one line? Do you guys have a solution? We'll do something like that. Capital-P Price, right? Any variable that we are not using, an underscore, would also work. And that's it, right? And I created a little predicate, joeSells(B), so that the thing is encapsulated. Essentially, I took this and turned it into a separate predicate, so that the structure of the Datalog query corresponds to the SQL query, but that's it. So it seems much easier to write these queries; you understand what they do, and therefore we'll stick to them rather than SQL. So why would natural language interfaces be natural? Well, you don't have to learn a computer language; you just speak the way you are used to. What do you think the other advantages of natural language interfaces are? We'll get to the disadvantages too, don't worry. So can you think of why it would be beneficial? A certain tolerance to errors would be nice. The ability to say things differently, it would be nice to get that too, although that's harder. As you will see, very often in order to make these things work we need to restrict the language in which we speak, and so there is limited freedom in how you express yourself. Aha, I see. So the errors are less cryptic, or the diagnostics are better. All right, some other advantages of using natural language for database queries?
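The Datalog query being discussed is essentially a two-clause program along the lines of `joeSells(B) :- sells('Joe''s Bar', B, _).` and `answer(M) :- joeSells(B), beers(B, M).` To see concretely what join it performs, here is a Python sketch that evaluates it over toy tables. The table layouts, sells(bar, beer, price) and beers(name, manufacturer), follow the classic beers/bars teaching schema and may differ from the slide; the data is invented.

```python
# Assumed table layouts (classic beers/bars schema; data is made up):
sells = [
    ("Joe's Bar", "Bud", 2.50),
    ("Joe's Bar", "Anchor Steam", 4.25),
    ("Sue's Bar", "Heineken", 3.00),
]
beers = [
    ("Bud", "Anheuser-Busch"),
    ("Anchor Steam", "Anchor"),
    ("Heineken", "Heineken NV"),
]

# joeSells(B) :- sells('Joe''s Bar', B, _).
def joe_sells():
    return {beer for bar, beer, _price in sells if bar == "Joe's Bar"}

# answer(M) :- joeSells(B), beers(B, M).
def answer():
    return {manf for name, manf in beers if name in joe_sells()}

print(sorted(answer()))   # -> ['Anchor', 'Anheuser-Busch']
```

The free variable B in the Datalog clause corresponds to the implicit join on the beer name; the unused price column is the underscore.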
Okay, so it turns out that people did studies and realized that, well, for non-programmers this is clearly better than writing SQL, because they are non-programmers, all right? And some queries, with negation, are especially difficult to write when you are not a skilled programmer; even for programmers they were difficult. Another advantage is that you can very easily combine several queries and shorten them by referring from one query to the previous one, without spelling out all the details. So you ask whether there is a ship whose destination is unknown; yes. And then you just ask: what is it? Meaning, what is the ship whose destination is unknown? Carrying out the dialogue is easier that way. What would the disadvantages be, do you think? There are plenty, of course, all right? Okay, so ambiguities are a problem. Other problems? The really broad context might be missing, right? Yes, we cannot just boot a computer, start talking to it in natural language, and assume it knows we are in lecture 13 of CS164. That's right, that broad context is missing. But even a missing narrower context can mislead the whole system. I see, so you're saying just punctuation could change the meaning of things, right? Yeah, okay. Are there other disadvantages? Oh, okay, so perhaps the most interesting one: what if you have to deal with dialects, and people expressing themselves differently, and words having slightly different meanings, okay? Yeah? Excellent, so essentially the better the system, the more you expect, and the more you are surprised when all of a sudden it doesn't work. So let's look at a few of these. The linguistic limitations may not be obvious, okay? Clearly we are not going to support the entire language, only a subset, but what subset it is, the user is not going to read the manual and study the grammar to understand the limitations, of course, right? So imagine you have a system which can answer this query; read it carefully.
And so if the system can answer that, can you guess which countries those would be? I'm guessing Finland is probably one of them, right? You would expect that this second query can be answered as well. Now, is this the same query or a different query? We'll see more fun examples like this. So what does the first query mean? What does the second query mean? And why might the system understand the first one but not the second one, okay? Uh-huh, all right. Excellent, so in the first query there was a little elision, right? An omitted piece of text, and the parser automatically recognized and replicated it. What sort of word or phrase do we need to replicate to make it a clear query? How would you rephrase the question? Right, so what I'm looking for is how we paraphrase this question so that its meaning is really clear. And I'm saying that here the speaker omitted some information, right? What is the information that the speaker omitted and the parser was able to fill in? Exactly: "what are the capitals of". No, I think that's right, no? No? I would say that's the meaning of the second question. All right, well, if this is ambiguous, then perhaps one should never let people ask this sort of question, but let's see if we can reach an agreement. So in the first question, my understanding was that we are asking for countries which either border the Baltic or Sweden, no? Or, all right, all right, oh I see, okay. So, perfect, so this here is "bordering". Okay, that makes sense; maybe I didn't read it carefully enough, yes. So these countries here must border the Baltic and Sweden, and now we are asking, okay, this here is essentially translated into a set of countries, you perform a union with Sweden, and then you compute the capitals of those, right? Am I right? Okay, good; this is what you get with a foreign speaker, or with just not reading carefully, okay.
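The two readings just argued over can be made precise with sets. This is a toy sketch with invented, deliberately simplified geography (do not trust the facts); the point is only that "the capitals of the countries bordering the Baltic and Sweden" can denote either an intersection of border conditions or a union with Sweden itself.

```python
# Toy, simplified geography (invented for illustration):
borders = {
    "Finland": {"Baltic", "Sweden"},
    "Norway":  {"Sweden"},
    "Denmark": {"Baltic"},
}
capital = {"Finland": "Helsinki", "Norway": "Oslo",
           "Denmark": "Copenhagen", "Sweden": "Stockholm"}

# Reading 1: countries bordering BOTH the Baltic and Sweden,
# then their capitals.
q1 = {capital[c] for c, bs in borders.items()
      if {"Baltic", "Sweden"} <= bs}

# Reading 2: {countries bordering the Baltic} UNION {Sweden},
# then the capitals of that set.
q2 = ({capital[c] for c, bs in borders.items() if "Baltic" in bs}
      | {capital["Sweden"]})

print(sorted(q1))   # -> ['Helsinki']
print(sorted(q2))   # -> ['Copenhagen', 'Helsinki', 'Stockholm']
```

On this toy data the two readings give different answers, which is exactly why the system supporting one but not the other surprises users.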
So the key point of the slide is that you may expect that since we could answer the first question, the second question is also within scope. And now all of a sudden you are surprised, and it's hard to explain why that question may be beyond the grammatical understanding of the system. So the suggestion was: if we say "bordering", say, "Finland and Sweden", you want to change the meaning depending on whether we have a comma there or not, right? And without the comma, the meaning would be as in the second case, right? I'm guessing, well, if we put the comma there, what do you think the meaning should be? A union, right? And without the comma? So that would be one solution, and the system could decide that it will support these variations of English, but it could be that it supports one and not the other. And so the point here is that you could be surprised that it can answer one question and not the other. At least that's what we wish to be the answer on the slide. So how is this with formal query languages? Well, you could say that in a programming language it's pretty easy to guess what programs you can write: if it's syntactically valid and passes the type checker, it's a legal program. So it seems like that problem disappears, but even in programming with computer languages, it's not easy to master the knowledge of what a particular system can do. The difficulty doesn't usually come from the language itself; languages themselves are usually pretty small, maybe except for C++. But where does it arise? When is it difficult to find out whether something is implementable or not? In mastering the huge libraries that come with frameworks. Did you guys ever use Eclipse? Probably as users, but not as plug-in writers. These systems come with libraries that have tens of thousands of classes, hundreds of thousands of methods.
And figuring out whether a particular library, or a composition of these libraries, can do what you want to do, for example, open a Java file, parse it into an AST, process it, and give you a handle to that AST, is quite difficult. Searching for the three-line code that does it often takes four hours, so the productivity of finding those three lines of code is not really good. In that case, in understanding what functionality is hidden in libraries, it would be nice to be able to query in a natural way and look up a database of programs, or somehow compose, from the API of the library, a program that might be able to do it. Understanding the scope of the functionality of libraries is a pretty difficult problem. All right. So we talked about linguistic failure: understanding some grammatical phrases but not others. But sometimes it's difficult to distinguish: was it really the grammar, that I did not say it correctly, or was it the fact that the concept is not known? Your system may not understand the notion of a multi-city trip when you are trying to buy a ticket, and that may not be obvious to you. So you keep talking to Siri and trying to rephrase it so that the thing becomes expressed in grammatically less ambiguous terms, whereas the problem is that it just doesn't recognize "multi-city trip". So what would be a good solution to that? Clearly, eventually the progress will come to the point that these natural language systems are going to be used, and this sort of issue will have to be solved somehow; we'll have to give users some way out of this. What do you think might work? I don't think there is an answer to this question yet; it's one of those that will take maybe another decade or two to settle, but okay. So the diagnostic to the user could be: I don't understand "multi-city", right?
Okay, that would be pretty good, if the parser could do it, or the system behind the parser. Yeah, another way? Aha, I see, the system could give you an answer and tell you: I'm not really sure, right? This is what the IBM, what is it, Deep Blue, right, does. Watson, Watson, not Deep Blue; Deep Blue played chess. Right, Watson does gauge the probability of getting it right. Okay, so give the probability of the right answer, although in this case the system would probably just get stuck, or clearly give you an answer between two cities, ignoring the third one. What else could one do? Aha, okay, so that's actually pretty good. It could show you a bunch of similar inputs, perhaps similar token by token, word by word, that were presumably working correctly and accepted by users, and this way you would see that none of them talks about three cities, just two, and maybe you would get a clue about the scope of the functionality. Okay, that's excellent. Perhaps it could also, this is interesting, right, if the syntax is too complex, tell you: I cannot parse this because you have three subject sentences rather than one. Okay, so when do they actually work? We talked about it briefly. In a study, natural language interfaces worked really well compared to SQL when multiple tables had to be combined, so a complicated join, when there was negation, and when the query did not correspond to anything in training but was a new query. These were the cases where natural language did work better. Okay, so let's talk more about the technical challenges and how one can resolve them. But before we go there: this happened to be quite a popular topic last time the class was offered, and quite a few people jumped at NLP and wanted to implement projects such as adventure games with natural languages. So define your rooms, the objects in the rooms, and how the rules would evolve. They turned out to be a little more ambitious than they should have been, but it is a great topic.
I think that to make them successful, you need to really narrow down the form of the English that you handle, and we'll try to touch on that today. But it would be a great topic: how you effectively program a system with natural language. So let's see what the issues are that you may need to solve, and how you could solve them. Modifier attachment. Do you see any ambiguities in this sentence? Right, what would the ambiguities be? Oh, okay, I didn't even think of that. So you could list all employees and print their driver's licenses, essentially, so you could ask: what do you print? Or you could just list all employees who have a driver's license. But there is another ambiguity; does anybody else see it? Yes, please. Yeah, exactly, the driver's license could refer to the company. Now it's a sort of silly thing, and of course as a human you can see which one is meant, right? We mean that the driver's license refers to the employees rather than to the company. But it's enough to replace "driver's license" with, say, "export license", and all of a sudden such a license could be associated with both employees and companies, and the ambiguity cannot really be resolved without either asking the user or printing both answers and letting the user choose. So could we adopt a standard rule to use? When we parse English and there is an ambiguity of this kind, can we put down a rule that we'll always use to disambiguate these sorts of things? Okay, what would we do? Okay, so it would always go to the closest one, right? So in this case, we would associate it with the company, which is not what we want. So how would you have to rephrase the question to make the right association? Yeah, so you could say: list all employees with a driver's license in the company, right? That, yes, that would not work either.
Okay, so the proposal was, list all employees with a driver's license in the company. Okay, so how could we modify it? Exactly, you now need explicit references: list the employees in the company, and now you could say, who have a driver's license. Yeah, you're getting closer. So the interface that I would like is: say it informally, then see the top five choices from the system, and then pick one, right? It's still better than having to type it formally, okay? So this is a solution: do the rightmost association, exactly as you proposed. So in this case, the salary refers to the recruiter because that is the rightmost association. How would we rephrase this question if we wanted to associate the salary with the employee? If we swap these lines: list an employee whose salary is greater than 3,000 and who was hired by. But we could also just put an "and" here, and it should do the same trick, right? Now we are saying there are two conditions referring to the same person. Okay? But in some cases you just don't know. Is it the division making shoes or the employees making shoes? Again, we'll probably need to ultimately go to the user to disambiguate. This one is interesting. What two interpretations do you see in that sentence? Right, okay, and you could write it formally. Is it true that for every student there exists some course, potentially specific to that student, such that the student has taken the course, or is there one common course for all the students? Exactly. So which of these readings do we prefer, do you think? Is it number one or number two? You would go with number two. So who goes for number one, number two? Well, I think it's pretty clear, but some people would prefer this one. The rule that you could follow is: do a left-to-right reading of the quantifiers. And here, reading left to right gives you number one.
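The two quantifier readings of "every student has taken a course" can be checked mechanically over a toy relation. This is a sketch with invented data; the relation and names are illustrations, not from the lecture.

```python
# Toy "taken" relation: each student took some course,
# but no single course was taken by everybody.
TAKEN = {("ann", "cs61a"), ("bob", "cs164")}
STUDENTS = {"ann", "bob"}
COURSES = {"cs61a", "cs164"}

# Reading 1 (left-to-right, forall-exists):
# every student took *some* course, possibly a different one each.
reading1 = all(any((s, c) in TAKEN for c in COURSES) for s in STUDENTS)

# Reading 2 (exists-forall):
# there is one common course taken by every student.
reading2 = any(all((s, c) in TAKEN for s in STUDENTS) for c in COURSES)
```

On this data the readings disagree: reading 1 holds, reading 2 does not, which is exactly why the choice of quantifier scope matters.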
It's interesting to see whether you could parse this sentence and use something like a %dprec declaration to choose which of these two meanings we'll select, okay? Now, how about this sentence? Presumably we don't mean people living in two states at once, right? But it would parse as that. So you would need to rephrase it: who live in California or, exactly. But this is how people speak, right? No? All right, yeah. So you could say, or maybe you could take this and make a copy in here, right? This sort of elision and insertion is pretty common in spoken natural language. Doing it unambiguously is difficult. So we would need a rule. What could the rule be in this case? We could interpret it the way you do, and then say, well, there is nobody, and then the user would need to rephrase the question. I'm not sure there is a reliable disambiguation rule for this. It's not clear what Arizona refers to. Does it refer to "live", or does it refer to the entire phrase? I don't even know how to describe it, I guess. I would say the solution is just to interpret it the way it's written, even though "or" often means "and", and "and" often means "or". Yeah, that would probably be reasonable. This is one where it may be difficult even with domain knowledge to answer, because it could mean both things. What was the first half of what you said? Oh, I see, so the "and" really just means satisfy both keywords, and okay, if there are zero results, that's right. And they actually give you a nice warning saying that this is what they've done, right? Yeah, I think people could misinterpret the comma, so it seems like a potentially slippery slope. It's rather subtle. You have a comma? Aha, I see, that's a great point. If we are expecting these queries to be transcribed from spoken language, then good luck catching the comma in the speech, right? Yeah, my gut feeling is to avoid the comma.
Okay, this one: a city department could be a department in the city or a department taking care of the city. A research department probably does research. A research system probably does not do research; it's a system used in research, okay? So clearly, for the deeper discourse questions, you need to understand what we mean: a system that does research, another word for academia, or a system used in research. How would you deal with that? This one is pretty hard. You could perhaps look at the context and then somehow choose, right? Now we are really talking about AI rather than hoping to just parse the thing and answer it. But that could potentially work, depending on the probabilistic context, choosing one or the other. Other solutions? Yeah, I think what you're suggesting is what people apparently do: just create a configuration of these concepts and say, well, city department always means this, it means the department taking care of the city rather than something in the city. So a sort of domain database, if you will. That's what you probably wanted to suggest as well. So this is the solution: just do a configuration. Okay, this one is interesting, but not so hard to solve, I'd say. But we do need some extra information to make sense of the second question, right? What sort of knowledge would you need to solve that? Because I could ask, how about mail, and clearly then I'm trying to replace not this word but that word. Now, this might be difficult to do in parsing, right? Because now we are relying on the fact that these two are objects from the same concept, and that is what we rely on for the mapping. Yeah, okay. We do have that ambiguity, right? Is it the same manager or not? Well, okay. So I think here is how you would disambiguate that: the use of the articles "a" and "the" tells us whether we are referring to the same object or establishing a new scope.
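The configuration idea, fix the meaning of each compound once per domain instead of resolving it during parsing, can be sketched as a lookup table. All entries and names here are made-up illustrations, not the lecture's actual configuration.

```python
# Minimal "domain configuration" sketch: compound nouns whose meaning
# is pinned down once for the whole domain (hypothetical entries).
COMPOUND_MEANINGS = {
    ("city", "department"): "department_serving(city)",
    ("research", "department"): "department_doing(research)",
    ("research", "system"): "system_used_in(research)",
}

def interpret_compound(modifier, head):
    # fall back to a generic, underspecified reading when unconfigured
    return COMPOUND_MEANINGS.get(
        (modifier, head), f"{head}_related_to({modifier})"
    )
```

A parser can then consult the table and never face the ambiguity at parse time; only unconfigured compounds get the vague fallback.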
So I'll post a link to a really nice system that doesn't come from the 80s but from the last decade: a beautifully written set of rules for disambiguation from the University of Zurich, easy to read. I'll post the link so you can see what they do. And again, smallest, largest is another way of communicating easily. This one is a really cool example; it would seem useful for browsing rather than clicking. Okay, but let's get to the technical issue. The grammars that people use for natural language processing are typically not context-free grammars but semantic grammars, probabilistic grammars. We are not going to deal with those. Instead, we'll restrict our language to a subset that we can actually parse with context-free grammars, and we'll establish some disambiguation rules, okay? That's fine. So as an example, look at this grammar and create parse trees for these two sentences. You see a sentence is an NP VP, right? Noun phrase and verb phrase, okay? A verb phrase is a verb plus a prepositional phrase, okay? And here is our simple vocabulary. So take paper and pencil and grow parse trees over this. Let's do this one. What is "fruit"? It's an N. And "flies", is it a V? It is a V, but also an N. So you see we have an ambiguity here, right? This is the whole point: each of these sentences has two parse trees. It's ambiguous, but let's do the right parse here. So "flies" here would be, so what would this be? NP. What would "like" be? What do we do here? What is this? NP. This one here, PP. Okay, so where did I make a mistake? Okay, so this would be a verb. And we have a sentence, right? Okay, how about the other parse, which works for this one? This is a verb, so what comes here? Now, of course, these are the desired parse trees, but we could switch the sentences, and the parse trees would still be correct according to the grammar. But this is the nature of parsing natural languages.
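The ambiguity in "fruit flies like a banana" versus "time flies like an arrow" can be reproduced with a small exhaustive parser over a toy grammar. The grammar, lexicon, and function names below are a sketch reconstructed from the discussion, not the actual course parser.

```python
# Toy CFG in the spirit of the lecture: S -> NP VP, VP -> V NP | V PP, etc.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["N", "N"], ["N"]],
    "VP": [["V", "NP"], ["V", "PP"]],
    "PP": [["P", "NP"]],
}

LEXICON = {
    "fruit": {"N"}, "time": {"N"}, "flies": {"N", "V"},
    "like": {"V", "P"}, "a": {"Det"}, "an": {"Det"},
    "banana": {"N"}, "arrow": {"N"},
}

def parses(sym, words, i):
    # yield (tree, next_position) for every way `sym` derives a prefix of words[i:]
    if sym in GRAMMAR:
        for rhs in GRAMMAR[sym]:
            for kids, j in seq(rhs, words, i):
                yield (sym, *kids), j
    elif i < len(words) and sym in LEXICON.get(words[i], ()):
        yield (sym, words[i]), i + 1

def seq(symbols, words, i):
    # derive a sequence of grammar symbols, threading the position through
    if not symbols:
        yield (), i
        return
    for first, j in parses(symbols[0], words, i):
        for rest, k in seq(symbols[1:], words, j):
            yield (first, *rest), k

def all_parses(sentence):
    words = sentence.lower().split()
    return [t for t, j in parses("S", words, 0) if j == len(words)]
```

Both sentences come back with exactly two parse trees: one with the noun compound ("fruit flies" like a banana) and one with the bare-noun subject ("time" flies like an arrow), which is the swap discussed above.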
So what could we do to disambiguate here? Who can propose a rule for disambiguating these parses? The question I'm really asking: if we take this parse tree, which is correct and also desired, and I now say, time flies like an arrow, we get a parse tree that we don't want. How would we detect that? Where is the mismatch? You can say fruit flies, time flies. Yeah, I think we're not going to find a mismatch here, but okay, what concepts could we use to pick the right one from the parse trees? Right. So you see now why probabilistic grammars, based on observing what you have seen in your corpus of data, would work better: they have seen "fruit" and "flies" next to each other as fruit flies, but you haven't seen "time flies" as a noun phrase, only in this other context, right? So if you counted frequencies together with where in the parse tree the pair appears, you could reject one parse or the other, okay? Another choice that we could make without the corpus, perhaps? So what would you like to see in the corpus? Right, okay, so fruit flies is an accepted noun compound. Essentially the same heuristic. Okay, so you've seen the two trees. Now, how would you translate this to Prolog? It fits well for CS164 because we have a parser that, well, it's not a wimpy parser; it's slow, but it can accept an arbitrary context-free grammar and do powerful disambiguation. We have Prolog, so it would be a shame not to show how we can translate natural language queries to Prolog. So how would you phrase this query as a Prolog query, okay? Essentially, that's what we have. We probably want to have another clause. This is what you said: borders, country, Greece, the capital of that country, and we also have this country predicate here, which checks that what we are getting is a country. But essentially you got it right, okay?
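The corpus heuristic just discussed, prefer the parse in which the adjacent word pair has actually been seen in that tree position, can be sketched with a tiny table. The counts here are fabricated for illustration; a real system would estimate them from a corpus.

```python
# Fabricated corpus counts: how often a word pair was observed as a
# noun compound (N N) versus as a noun followed by a verb (N V).
PAIR_COUNTS = {
    ("fruit", "flies"): {"NN": 12, "NV": 0},
    ("time", "flies"):  {"NN": 0,  "NV": 9},
}

def preferred_role(w1, w2):
    # pick the structural role in which the pair was seen more often;
    # unseen pairs default to the noun-compound reading here
    counts = PAIR_COUNTS.get((w1, w2), {"NN": 0, "NV": 0})
    return "NN" if counts["NN"] >= counts["NV"] else "NV"
```

Scoring each candidate parse tree by looking up its pairs this way would keep the noun-compound parse for "fruit flies" and reject it for "time flies", exactly the behavior a probabilistic grammar gets from its corpus.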
So let's now take this grammar, look at it, and see what kinds of questions you can ask with it. Can somebody give me one question from the grammar? Which specimen emits radiation? I think it only generates 16 different questions, but here is one: which rock contains magnesium? So take your paper and pencil, take this grammar, and write a fragment of an SDT which will walk through these sentences. Here is the parse tree if you want to see it. Generate a Prolog query similar to the one about the capital and Greece. This should be fairly easy. You don't need to flesh out everything, but you should be able to capture the key ideas. Essentially, we just want to add actions that, when evaluated, will produce a Prolog query that goes to a database. The database presumably tells us things like: this rock emits radiation, this one contains magnesium, and so on. You probably want to start by just writing down what predicates, what facts, we have in the database, okay? So, do we have a fragment of the translation? Who's got the facts that we want to store in the database? Assume those are given; what would they be? What facts do we need to keep in the database? Think of them as the equivalent of a relation. Can people suggest facts? Okay, how about we say contains: if we want to say that a particular rock contains magnesium, right? Actually, this is not quite right, so let's assume we'll have a specific rock; we'll call it foo for now, and it contains magnesium. So this is one fact. What would be another fact? Maybe emits: if some bar emits light, would that be another good fact to have? Okay. So we'll assume we have some specimen called foo and we'll put it here; this tells us that it is foo, okay? Yes, all right, maybe that's enough. So can we write a rule for at least one part of the grammar? Which one would we start with? Would it make sense to look at the translation of this?
Okay, so how would we translate it? Presumably we'll do something with the N1 value and the N2 value; we want to somehow wrap them in a fragment of a query. What would that be? Can you speak up, yes. I think our time is up because it happened again; this looks like a race condition between the pen and PowerPoint, which is not great. But essentially, what we say is: take the verb, which comes from the V non-terminal, and use it as the name of the predicate; it is either contains or emits. Then parentheses, and the noun becomes the argument, and we need to do one other thing: we need to invent a variable, right? And we'll return both up the evaluation tree. We need to return that string: we had V and N, and the V value becomes the name of the predicate, then parentheses and the N value, okay? So this could be contains, and this could be magnesium, for example. And here we invent a variable, a fresh variable, the way you did in the highlighter, and this entire string, which is our query, will be passed up the evaluation. But we need to pass one more thing up. What would that be? We'll use the string at the higher level to compose the subquery, the clause, with other clauses, but some information is needed for that composition. Right, we need to return a pair of X and this string. Then at the upper level, we'll take this part, a comma, take the other part of the query, and this X, which is here, we'll use in here. Or maybe this one will come with X, this one will come with Y, and we'll do something like X equals Y to bind the variables from the two clauses. So that's essentially it. This is the key idea of the translation of these languages: translate pieces to queries and bind them with these fresh variables, and that's it. The question now is just how to restrict the language such that ambiguities can be resolved, and we'll give you some nice hints for that. And maybe, yes, very nice, but probably all the annotations are gone.
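The translation scheme just described, each action builds a Prolog clause as a string, invents a fresh variable, and passes the pair (variable, clause) up the evaluation tree, can be sketched in Python. The function names and the way the actions are factored are assumptions for illustration; only the general shape (fresh variables, clause strings, pairs passed upward) comes from the lecture.

```python
import itertools

_fresh = itertools.count()

def fresh_var():
    # invent a fresh Prolog variable, X0, X1, ... (the highlighter trick)
    return f"X{next(_fresh)}"

def vp_action(verb, noun):
    # Action for VP -> V N: build "verb(Var, noun)" and return the
    # pair (Var, clause) so the caller can conjoin this clause with
    # others through the shared variable.
    var = fresh_var()
    return var, f"{verb}({var}, {noun})"

def question_action(kind, vp_pair):
    # Action for the whole question: add the type check on the same
    # variable, e.g. "which rock contains magnesium" becomes
    # "rock(X0), contains(X0, magnesium)" against facts such as
    # contains(foo, magnesium) or emits(bar, light).
    var, clause = vp_pair
    return f"{kind}({var}), {clause}"

query = question_action("rock", vp_action("contains", "magnesium"))
```

Running this once produces the query `rock(X0), contains(X0, magnesium)`; when two subqueries each bring their own variable, the upper level would emit an extra `X = Y` clause to bind them, as described above.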
Thank you. Thank you too.