So let's start by looking at the calendar to see where we are. You just submitted the homework, where you built the foundations for PA4 to PA6. And we are now here, starting to look at DSLs, domain-specific languages, or small languages, in other words. Why are we looking at them now? Because here, as you study for the midterm, you'll have to come up with an idea for the final project: a small problem that can be solved with a small language, in a small amount of time, in a small team of two people. It overlaps with midterm preparation because, presumably, while you are studying for the midterm, this is a good opportunity to brainstorm about how what you've learned can actually be used to build a small language, which you will then do at the end of the semester. So after the last lecture, we have the second midterm, and this is the time that you will have to work on the project. So we are here. PA4 is assigned today; in it you will take a front-end parser given by us, which parses the grammar description and generates an AST. In the homework, the grammar did not have an AST; it had a really simple string description. So in PA4 you'll essentially build a real parser that reads the grammar. And what I want to do today is get you ready for thinking about the final projects, because you will not only have to come up with a problem to solve, you will also have to pick an implementation strategy for how to build a small language. And there are many. So I want to go over some of them today, so that you have all the tools ready to start thinking about the final project. So where are we? Small language implementation is what we'll talk about today. We are in the middle of what should probably be 10 to 12 lectures that talk about small languages. We talked about regular expressions last time because they are the small language that has been best studied so far.
There are many different translations, and they illustrate, on a small scale, many cool principles. So we'll finish that today. And then we'll look at the implementation strategies, mostly talking about how you can take your small language and embed it into a host language, rather than building a standalone language, because how to do that we'll cover a little bit later in the semester. On Thursday we'll talk about the problems that you can solve. I will give you some ideas; I don't know if you'll come up with ideas on your own, but we'll brainstorm here about what problems you could solve in your final project. So let's look back at regular expressions, because there is an important segment that I still want to finish, and we'll start by answering the second challenge question that we didn't get to last time. You can read it, but let me paraphrase what it asks. Usually, in the front end of a compiler or interpreter, you have a lexer, which takes the sequence of characters and partitions it into lexemes, or tokens. So it takes F-O-R and turns it into the keyword for, and it takes the digits that comprise an integer literal and turns them into a number. It creates the sequence of tokens and passes them to the parser, which then does the parsing. And usually the lexer can be completely independent from the parser, in that you could run the lexer first, create a file with those tokens, and then, once the lexer finishes, call the parser. But that's not quite the case with JavaScript. Other languages as well, but JavaScript gives us an opportunity to look at it in a small scope. So does anybody have an answer to that question? Uh-huh. Okay, so you're very close. It is about regular expressions and forward slashes. It could differentiate, I believe, between a comment and a regular expression, right? So to refresh your thinking: something like /abc/ in JavaScript is a regular expression, okay?
And what we are saying is that when the lexer comes to this, it may not be able to differentiate between this and a comment, right? So imagine the lexer comes here and now it sees these two forward slashes. So I thought that the regular expression had to have a non-empty body, which would differentiate between them. But if that's not the case, and I don't know for sure, then I don't know, actually, how the lexer would be able to distinguish it at all. So I suspect that a regular expression needs to have something between the two forward slashes, unless somebody has ideas for how to do that. Now, the slash-star comments are usually not ambiguous, because when you encounter those two characters they cannot mean anything in JavaScript but the beginning of a comment. They cannot be the name of a variable, they cannot be an operator. So slash-star is good because, looking at these two characters, you can distinguish it from anything else. Now, these two are interesting, and I don't actually know what the lexer does. I suspect that empty regular expression literals like this one are not allowed, because if they were allowed, then I have no idea how the lexer would ever distinguish that. Right. Okay, so now we are asking how the closing character will be distinguished from something like this, which is a legal, well, maybe this one is not quite legal, but let me fix it; now this is a legal regular expression, right? Forward slashes with dot star in between. And you're asking how the lexer will differentiate between this and this. So how would it differentiate? Remember, the role of the lexer is to take the entire input file, the stream of characters, and chop it into logical units, which are the lexemes. So that should be a hint for how the lexer will differentiate. It is true that star forward slash could appear in multiple contexts: it can appear as the closing piece of a comment, and it could be at the end of a regular expression.
So what would the lexer do in this case? Right. What the lexer will do is use the context. The way it works is that it makes its way through the characters, and then it comes here and spots slash star and says: oh, this is the beginning of a comment, and now I will scan everything up to the closing star slash. So it is this initial context which tells us: grab everything that is here, and this is one logical piece, a comment. Similarly here: when it comes to this point, it says, oh, I'm starting a regular expression, let me read everything up to the end of the regular expression. So the beginning tells us what sort of end to look for. But still, assuming that these regular expressions must be non-empty (I don't know what to do with the empty ones, so I suspect they must be non-empty), what would be the lexical ambiguity? Right, so we can finish that offline, but it's a good comment that sometimes the parser needs to know. The parser always knows where it is: is it inside an expression, inside a statement? The question is whether the lexer needs to know that as well, to correctly chop the pieces. So here is the answer. It is this fragment of the input. Can it be lexed in multiple ways? What are the two possible ways of tokenizing it, or lexing it? I'll insert some spaces here so that we can write both of these ways of lexing it. It could be either a division or a regex. So this is one interpretation: ID, division, another ID, another division, and another ID. What would be another one? One that is not legal if you take the big-picture view, but another one here is: ID, and then the entire thing from here to here, including the g, is a regex literal. The g at the end means that you want to match globally; it's a sort of qualifier. So this is the correct one, the correct way to lex it. This one is wrong. Now, the lexer doesn't know which one to use without getting some context, because if it just reads the characters left to right, both of them look plausible.
If it uses a heuristic such as finding the longest way of tokenizing, it would perhaps opt for this one, which would not be the correct heuristic in this case. So how will the parser tell it what to do? Presumably, the lexer is here and it has identified a variable, an ID. Now it needs to decide what to do at this point. Can the parser give it context to distinguish between the two cases? Exactly. The parser knows the context. In Earley parsing, it comes in the form of knowing which edges are predicted, that is, what you can expect on the input. And in this case, because you have seen an ID, and nowhere in the grammar can an ID be followed by a regular expression literal, you know that this is the right choice: the ID can only be followed by the division operator. Because nowhere in the grammar is there the choice of an ID followed by a regular expression. And this is the information that the parser could give to the lexer, and the lexer would then tokenize accordingly. Yes, in a sort of pipeline fashion: the parser needs the next token, it calls the lexer and says, well, give me the next token, but I am in this context, and therefore, if you have an ambiguity of this sort, say between the division operator and a regex, do this rather than that. So you would not tokenize everything up front, and you would pass some extra information. Well, I believe not in JavaScript, but I can definitely imagine a language where, even after the parser gives the lexer the best context it has at that point, something could still be tokenized one way or another. And then you do your parsing using an ambiguous grammar, and you decide later, once both of those trees are built, which of the parse trees is correct (oh, this one, because something is left-associative), and only then do you commit to a particular lexing decision. So you can definitely define such a grammar, and it would be very easy.
You could say: we'll do all tokenization in the grammar, in the parser. Essentially, we'll treat each character as its own token, and the bundling of characters into identifiers, everything, in fact, would be part of the grammar and part of this ambiguity resolution. It would be a bit inefficient, because you are processing characters not with regular expressions but with the more expensive parser, but it would be a simple way to do it. So in general, it is better to define the language in such a way that the lexical grammar, which is how characters are built into tokens, is independent from the parsing context. But sometimes it's hard to predict: once you commit to a choice, and your infrastructure handles it all right because your lexer is coupled to the parser, it's hard to predict what trouble people who want to implement the language differently will get into. Okay, so this is the answer to the second challenge question, and this is a real story, because the interpreter for JavaScript was written in such a way that the parser always tokenized in context, and it was hard to change that later. But there is more fun with something as simple as regular expressions. Remember, we looked at two flavors of regular expressions, essentially two different ways to implement them, and we gave them different names depending on the implementation. Some of them were called regexes, and some were called regular expressions. And regexes differ from regular expressions in that regexes come with various extensions, right? You may remember that we saw a test for whether a number is prime, which referred to what was matched previously in the regular expression using the backslash-one construct, a backreference. So the thing that distinguishes regexes from regular expressions is how they are implemented: regexes were implemented with backtracking, right?
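The backreference primality test mentioned above is a well-known piece of regex folklore (usually attributed to Perl circles), and it can be written directly in JavaScript. A string of n ones fails to match this pattern exactly when n is prime:

```javascript
// A string of n '1's matches iff n is 0, 1, or composite:
//   ^1?$         matches zero or one '1'
//   ^(11+?)\1+$  captures a group of k >= 2 ones, then requires the
//                group to repeat at least once more, i.e. n = k * m
//                with m >= 2, which is a nontrivial factorization.
const notPrime = /^1?$|^(11+?)\1+$/;
const isPrime = n => !notPrime.test("1".repeat(n));

console.log(isPrime(7));  // true
console.log(isPrime(9));  // false  (9 = 3 * 3, so the backreference matches)
```

The backreference `\1` is exactly the feature that takes this beyond what an automaton can recognize: the engine must remember the captured substring and backtrack over candidate factor sizes.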
One simple way to see it is that we can take a regex, parse it to obtain an AST, and then translate it into a little Prolog program, which does the backtracking for us. Regular expressions were implemented by translating them into an NFA, a non-deterministic finite automaton, which we know how to translate into a DFA; then you are just scanning the input, reading each character exactly once and making a transition using a table that describes the state machine and its transitions, really efficiently. And we looked at this input, which does not match this pattern, and the regex took forever to discover that it does not match, because the input does not end with x. And it took forever because there are essentially two nested loops: the two starred subexpressions act as nested loops, and the engine had to try all possible combinations in backtracking before it could conclude that no, this string doesn't match, because it doesn't end with x. Whereas the regular expression based on automata just read every character once, and at the end it could conclude: there is no way I can reach the final state of the automaton. So that's what we covered before. Now, there is something interesting to be said about these two approaches. These two implementations have the same semantics when you look at the problem of: here is a string, here is a regular expression, tell me whether the regular expression matches the string, the entire string from beginning to end. In this case, both of them behave the same, because even if you have something like a star nested inside a plus, which is two loops, and you don't know how many times you need to iterate here and how many times to iterate there, any combination of iterations is enough to prove that the regular expression, sorry, the regex, matches the string. There are many different ways to match it, but any one of them is enough to show that they match.
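The blow-up described here is easy to reproduce. The exact pattern from the lecture board isn't shown, so this is a representative sketch of the classic shape: nested quantifiers plus an input that lacks the final character force a backtracking engine to try exponentially many ways of splitting the run of a's before it can report failure.

```javascript
// (a+)+ can partition a run of n a's in exponentially many ways.
// When the final x is missing, a backtracking engine tries them all,
// so each extra 'a' roughly doubles the work before it answers false.
const re = /^(a+)+x$/;

console.log(re.test("aaax"));  // true, found quickly
console.log(re.test("aa"));    // false, still fast at this tiny size
// re.test("a".repeat(40))     // don't: this hangs a backtracking engine
```

An automaton-based matcher answers the same question in time linear in the input length, which is exactly the contrast the lecture is drawing.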
So a regular expression and a regex behave the same way on the problem of checking whether the entire string matches the regular expression. But there is an important problem where they actually differ. Can anybody think of where their semantics might be different? It is a hard question, but it comes back to hurt us quite often, and very often we spend hours trying to debug regexes because of this small difference. It happens to be the difference between what programmers often expect and what regexes give them. So I want to go over it. So, that's true: with regexes you could find all possible matches, although people generally define regexes in such a way that you cannot enumerate those matches. If you compile them yourself to Prolog, then of course you could do it. But in general you are just interested in a match, in the substring it matches, and that's it. But the discrepancy arises roughly around having many possible matches, around the questions related to that. So this is exactly where they differ: when you are not asking for a Boolean answer, does it match or not, but you're asking for a substring, then the definitions of the two diverge. And here is an example where this happens. Imagine you write yourself a simple configuration language which describes how to compile your project. You define a variable, files to compile, and you assign a value to it, which is the list of files you want to compile. And you want to use regular expressions to parse this configuration file. The regular expression for this is simple: this part here matches the identifier, this is the equals sign, and this matches the rest of the line. Remember that the dot does not match a newline, does not match backslash-n, so this regular expression is going to stop at the end of the line. Exactly as you want.
So now you say: great, I'm very happy with my configuration language; it does the job I want. But now I have so many files to compile that you want to support something like this: you want to have your files on two lines, and you want the list to continue on the next line. What you do is use a backslash to escape the newline. There is one newline character here, and you put a backslash in front of it, to signify that this newline does not start the definition of the next variable; it is escaped. In other words, the list continues on the next line. So how would you extend this regular expression to support the new semantics? What would seem to be the simplest way to extend what we have? This has been debugged and has been working well for a year; now we want to allow this optional second line. So we want to write the regex for this. Exactly: I take the existing regular expression, which is from here to here (it's copied from over here), and now I say I want to match a backslash, a newline, and then anything on the next line, which is exactly the same as here. And the whole thing is optional, which is fine. Does anybody see the trouble with that? Oh, okay, so that's a problem: it only supports a single continuation line. We could extend it to support an arbitrary number of lines, but that would just be a little uglier and would not illustrate our point. So that's not the problem; imagine we just want two lines, right? So does anybody see the problem? Because if you do, it's great: during the course of your life, chances are this will cost you a few painful hours, and I want to make sure it doesn't happen to you. So what's the problem? Needless to say, this extension doesn't work. So after we match this, right, this newline here is going to match this one character here. So that's good, right, so far so good.
And now the cursor sort of moves to the next line in the file, and you are going to match everything greedily. But remember, the dot does not match a newline, so it will only go up to here, up to the end of the second line. So that's good. The dot does not match a newline unless you pass a special flag to the regular expression compiler. Okay. It matches just the first line. That's correct. What exactly does it match on the first line? It matches this. Where does this match? Okay, so that's a good question: is this backslash matched by this regular expression or not? Which part of the regular expression does it match? This one here? Okay, so I'll let you think about it, and let's go over the questions we have to consider. The original question, whether a given regex matches a given string, is a simple question with a Boolean answer. Nothing really to think about, okay? In real life, however, the problem is different: you are often trying to match a substring. In fact, tokenization is just a series of these calls, right? You have consumed part of the input, and now you're asking: give me the next substring here that has some property; you consume it, and then you do it again and again. So that's really the substring problem that we often face in practice. Now, it may seem silly. These are just regular expressions, after all, how could somebody screw it up? But there are language design issues already here. The problem is that there are many potential matching substrings. So if you ask, give me a substring that matches this regular expression, and there are many of them, you somehow need to decide which one you want. That is a language design question. And there are really two questions, right? Where does the substring you want start, and where does it end? There is no other question. There is no question of color, weight, or boldness. Where it starts, where it ends.
And it's easy to agree that the substring you want from this huge string is the one with the leftmost match, not the rightmost match, but the first instance that you find. But there is still the question of how far it extends. If there are multiple matches... so here is your string, and now this is one match, but maybe this is another match here. Which one do you want? Do you prefer the shorter one, the longer one? If you define your regular expression library, you'd better tell your users what a substring match will do, which thing it will return. So which one would you choose? Okay, so now you are saying it really depends: maybe you will need to pass a flag and say, I want greedy or non-greedy. But what does that mean? What you said is true: regular expression libraries come with two flavors of the star operator. One is greedy, and we'll see soon what that means; it matches as much as it can. And one is non-greedy, and presumably it matches the smallest amount that it can. The second half of what you said is not quite true, though: it does not follow that using a greedy operator, which matches as much as it can, leads to the overall longest match. And that's where the problem is here. So let's see how people answer the question of which of the many matches to take. The declarative approach, which is usually used for regular expressions, is to say: enumerate all possible matches, conceptually rather than really finding all of them, and return the longest one. The operational approach, which is what regexes have taken, says: for each of these operations, star and or, define what it does. When the backtracking interpreter comes to these operators, you need to tell it what it should do. And typically, you have the greedy semantics for the star.
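Both flavors of the star exist in JavaScript regexes: `*` is greedy and `*?` is non-greedy (lazy). A quick sketch of the difference:

```javascript
const s = "<a><b>";

// Greedy: .* grabs as much as it can while the final > still matches,
// so it runs past the first > all the way to the last one.
console.log(s.match(/<(.*)>/)[1]);   // "a><b"

// Non-greedy: .*? grabs as little as it can.
console.log(s.match(/<(.*?)>/)[1]);  // "a"
```

Note that neither flavor is defined in terms of the overall longest (or shortest) match of the whole pattern; each quantifier is defined locally, which is exactly the operational semantics being described here.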
And that means that when you come to an expression E star, you will try to match this E against the string as many times as possible, such that the rest of the regular expression can still match. For example, if you write a-star a and you have the string aaa, how many times is the star going to match? It's going to match twice, because you are still leaving one a for the final a to match. So this is an example of greedy. And similarly, you need to make a choice for E-or-E: again, if the leftmost alternative can match, including the rest of the regex, it is the one chosen. So let's now see why the regex that we wrote is broken. Are you ready to identify the bug? Well, will the backslash be matched by the dot? It definitely will be, because now we are here, so clearly this regular expression here is going to match everything greedily until it can no longer match. What does it not match? It doesn't match the newline, but it does match everything else, including the backslash. And then, of course, this one here: does it match this character here? No, it doesn't. So your parsing will stop here, and you only consume the first line, including the backslash. So that's the problem. Now, to double-check that the declarative approach of looking for the longest match is the right thing, let's try to interpret it under the regular expression semantics. Would it match both lines? Yes? No? Right? Exactly. Conceptually, it enumerates all possible matches for the entire regular expression, meaning from here to here, compares them, and returns the longest one. And that would somehow, magically, internally mean: oh, the dot star only needs to match up to here, because otherwise I can't get the longer match that includes the backslash. And this is, again, the choice made by an oracle who, without really trying all possible choices, says: oh, it's better for me not to take that backslash, because then I can match more. So how do we implement the oracle without backtracking?
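Reconstructing the configuration-file example in JavaScript (the exact regex from the board isn't shown, so the names and details here are my reconstruction): the greedy `.*` swallows the backslash, the optional continuation group then quietly matches empty, and the match stops at the end of the first line.

```javascript
// FILES=a.c b.c \
// c.c d.c
const input = "FILES=a.c b.c \\\nc.c d.c";

// Broken extension: greedy .* eats the backslash on the first line,
// so the optional group (?:\\\n(.*))? cannot match the continuation
// and silently matches empty instead.
const broken = /(\w+)=(.*)(?:\\\n(.*))?/;
console.log(input.match(broken)[0]);  // 'FILES=a.c b.c \' (first line only)

// One fix: forbid the backslash inside the value part, so the
// continuation group gets a chance to consume \ followed by newline.
const fixed = /(\w+)=([^\\\n]*)(?:\\\n(.*))?/;
const m = input.match(fixed);
console.log(m[2]);  // "a.c b.c "
console.log(m[3]);  // "c.c d.c"
```

Under the declarative longest-match semantics, the original `broken` pattern would match both lines; under backtracking greedy semantics it succeeds early with the short match, which is the bug being discussed.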
Do you remember what gives us this power? Remember how we wrote this non-deterministic finite automaton, which for the relevant part of the string would look like this: you have a dot here, which matches anything, then you have a transition on backslash, then a transition, optionally, on a newline, and then more anything. So this is how our NFA would look. Now, why is it an NFA? Why is this automaton non-deterministic? Well, one reason is that from this state to that state we have an epsilon transition, which can be taken without consuming anything on the input, and so you could get into the final state without reading the backslash, the newline, or any characters here. But there is another, somewhat hidden, reason why it's a non-deterministic automaton. What is it? Can anybody find another reason for an oracle to have to make a choice? The dot and the backslash really are a non-deterministic choice, because the dot means anything, including the backslash. And so the oracle here needs to say: do I read the backslash here, or do I read the backslash and continue there? Now, how did we implement the oracle in Prolog? We did it by essentially backtracking and trying all the choices. But in this automaton we don't need to do it that way. How do we do it? Exactly. We don't need an oracle: after this backslash here, which is one character in the input, we could be here or we could be here. And some of these states could die off, but if one of them survives and makes it to the final state, we have consumed the string. And this is how we can explore multiple choices at once without enumerating exponentially many of them. And this is why we can not only match the string fast, but also find the longest match very easily, because you can always remember the longest string that some of these oracles was able to match by just keeping it in a variable. So this example is realistic.
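This "many oracles at once" idea is exactly the set-of-states simulation of an NFA. Here is a minimal sketch with a hand-built NFA for a*b; the representation (plain objects with character transitions and epsilon transitions) is made up for illustration, not any library's API.

```javascript
// Follow epsilon transitions until the set of states stops growing.
function epsilonClosure(states, eps) {
  const seen = new Set(states);
  const stack = [...states];
  while (stack.length > 0) {
    for (const t of eps[stack.pop()] || []) {
      if (!seen.has(t)) { seen.add(t); stack.push(t); }
    }
  }
  return seen;
}

// Simulate the NFA by carrying the SET of reachable states forward:
// every input character is read exactly once, with no backtracking.
function accepts(nfa, input) {
  let current = epsilonClosure([nfa.start], nfa.eps);
  for (const ch of input) {
    const next = new Set();
    for (const s of current) {
      for (const t of (nfa.trans[s] || {})[ch] || []) next.add(t);
    }
    current = epsilonClosure([...next], nfa.eps);
  }
  return current.has(nfa.accept);
}

// NFA for a*b: state 0 loops on 'a' and moves to state 1 on 'b'.
const aStarB = { start: 0, accept: 1, eps: {}, trans: { 0: { a: [0], b: [1] } } };
console.log(accepts(aStarB, "aaab"));  // true
console.log(accepts(aStarB, "aaa"));   // false
```

To find the longest match, you would additionally record, after each character, whether the accept state is in the current set, exactly the "keep it in some variable" idea from the lecture.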
It sort of puts together everything to learn about regular expressions: how they are compiled, how they potentially differ, the role of non-determinism. Other questions before I try to wrap it up? So let me wrap it up, and maybe some questions will arise. The example where I spent a lot of time was not this particular configuration file, but this one is beautifully simple to explain. When I read the JavaScript book where the semantics of regexes was explained, which said, oh, the greedy thing is really strange, be careful about it, it can lead to unpredictable situations, I wrote in the margin of the book: teach it to the students, they could also be surprised. And needless to say, two days later, I did spend three hours trying to understand a bug like this. It was in a more complicated context, so it was harder to find, but it is nasty. And it is this greedy semantics of the star: you do not declaratively say, give me the longest string; instead, you describe what each of these stars does, and then it's difficult to understand their global collaboration. If you have multiple stars, the first one does everything, and then the second one, which could be sort of an outer loop around it, does what? Which of the two gets to match the maximum? The inner one or the outer one, whichever is the first one that we encounter? I think the inner one is first. So you see how easily you get into trouble. It forces you to think about backtracking when you want to understand regexes. It would be much nicer to always have the declarative meaning, but, well, it's not like that. So why did it go wrong? It is tempting to blame Perl, but I'm not sure that's really the reason. It might be that the creators of the regular expression libraries that came with Perl did not know about NFAs and how they can match efficiently.
But chances are that they knew about it but couldn't do much, because Ken Thompson's algorithm was patented by AT&T and they couldn't implement it. I don't know; it's hard to say which of these happened. Either way, backtracking became standard, and greedy semantics came with it, and with that came the sort of non-compositional semantics of regular expressions: you think you have a regular expression that matches the part of the string that you want, and now you want to extend it, which is what we did, but the extension doesn't quite work. I would say that this language is not compositional, and it leads to surprises. It could be that the trade-off came from wanting more power, because with regular expressions built on automata you cannot get backreferencing. But I suspect it was actually the other way around: you had backtracking, and then you realized, look what we can do on top of backtracking. And so the real question to ask is whether you should ever consider adding backreferencing to regular expressions, because if you want to do matching of that sort, you should probably use grammars or something else that leads to fewer surprises. And then you have a backtracking engine which is really powerful; in fact, apparently regexes with backtracking are Turing-complete, so you can compute any function with them. And that's a true mismatch, right? You just want to do simple string matching, but you have this really powerful computer underneath. And then you add features on top of it, and many people waste their time on it. Okay, so what we covered here: you can take a regex, view it as a language, parse it into an AST, and translate it into an NFA. These are recognizers; they just tell things apart. A recognizer based on grammars can tell you whether a string has balanced parentheses, but one based on NFAs cannot, because you have a finite number of states and there can be an unbounded number of parentheses.
NFAs, DFAs, and regular expressions are all equally powerful, but with backreferences regexes are more powerful. We also saw that the plus is just sugar for a concatenation and a star, and we talked about compositionality. Okay. So, any questions about regexes and regular expressions? If not, let's move on to how to implement small languages in your final project. Small languages, also known as domain-specific languages, really are so varied that there are many different implementation strategies for them, and one that is more interesting and becoming more and more popular is building them as internal small languages. Internal means that they are embedded in some host language, and so the host language is sort of like an interpreter for your DSL, right? And I differentiate between three levels of embedding, but where to draw the lines is extremely fuzzy, and I am unwilling to argue about what is exactly right; they are more of a guide so that you can orient yourself in the space. The three levels are: a library, a framework, which is a parameterized library, and something that deserves to be called a language. We briefly went over this in lecture two, I believe. So let's talk about some examples, so that you can start thinking about which of them you may want to use for your final project. First, a DSL as a library. Often we don't think of it as a language; only language theoreticians do, because it might be useful to do so. We think of it as a library, even though it defines its own abstractions and operations on them. Can you think of some libraries that define concepts which you then use in your program written on top of the library, concepts so useful that you really can start thinking of yourself as using a DSL? If you think of a language as coming with strings and operations on strings, integers and operations on integers, objects and operations on objects, then a library often comes with blahs and operations on blahs. So what would be some blahs?
So this is my example here, because we talked about this before: the network socket. It is a concept, and it's a beautiful abstraction. It really is an illusion of something that doesn't exist, because if you look at the hardware, at the wires, at the internet, there is nowhere a notion of a socket. If you look at the internet and at the packets, you don't see any wires connecting two sockets. It's a completely software abstraction, extremely useful, but still just an illusion, and therefore so great; that's why it has survived such a long time. And it comes with operations: create a socket, connect it to an address, write some buffer to it, and close it. In fact, there are rules on how you may use these operations: you cannot write before you connect it to an address, and you cannot write after you close it. So you could think of it as having a state machine which describes which operations are legal: you can go through a sequence of operations in the order prescribed by the state machine. You can think of the state machine as accepting the legal sequences of operations. What would be other examples of libraries with their own objects? So, components in a GUI: presumably you cannot combine just any of them; maybe there are bigger components which accept smaller ones; there is perhaps a canvas on which you put other components. Exactly, a language. In fact, the entire manual for it will be about how to use the language: which compositions are legal and which are not. All right, some others? What is that name? RSpec. RSpec, okay. We'll talk about one Ruby library today that is so nice I would phrase it as a language, but of course it's implemented as a library. So these are libraries. Now, we saw RFig in the first lecture. It's a DSL again, implemented completely as a library in Ruby, but it is so tailored to its task that I would say it goes really far into designing a language.
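The state-machine view of the socket API can be made literal. Here is a hypothetical sketch (the class and state names are made up; real socket APIs differ), in which an illegal order of operations is a runtime error:

```javascript
// States: "created" --connect--> "connected" --close--> "closed".
// Any operation outside this order throws. This is exactly the kind
// of runtime check that starts to make a library feel like a language
// whose legal "programs" are sequences of operations.
class Socket {
  constructor() { this.state = "created"; }
  connect(address) {
    if (this.state !== "created") throw new Error(`cannot connect while ${this.state}`);
    this.address = address;
    this.state = "connected";
  }
  write(buffer) {
    if (this.state !== "connected") throw new Error(`cannot write while ${this.state}`);
    // ... send buffer over the wire ...
  }
  close() {
    if (this.state !== "connected") throw new Error(`cannot close while ${this.state}`);
    this.state = "closed";
  }
}

const s = new Socket();
s.connect("example.com:80");  // legal: created -> connected
s.write("hello");             // legal while connected
s.close();                    // connected -> closed
// s.write("again")           // would throw: cannot write while closed
```

The state machine accepts exactly the legal operation sequences, such as connect, write*, close, which is the point made above.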
Now, what it does is this: imagine you want to implement a slide with animation like this one. I press page down, and these three pieces of text change; this one comes here and disappears, replaced with this one. How do you do that? Well, you could do it by creating a sequence of figures in PowerPoint. But instead of creating them separately, you create a language in which it is very easy to turn a piece of text off and replace it with something else. And here is a fragment of the language. It's a library, because all you do is call a library function, and the agreement is that the first argument is a title and then the other arguments follow. If you want to put one thing on top of another, such as this text with the color red, there is a special function that composes them, and so on and so on. So a library can go quite far. Now, what about the DSL as a framework? Again, the boundary with a library is very fuzzy, but let's think of a framework as parameterized with client code, because you don't want to just call it; you want to really change the behavior of the framework. Okay, so consider a random testing framework, which generates random inputs. Clearly, when you are testing something like a compiler, not every random input is a meaningful test, so you want to parameterize that library with some information about the distribution of legal inputs. This would be a perfect example of a framework. The browser itself is a framework: it has a library for the DOM and for layout, but you parameterize it with functions that the browser calls when you, say, click on a DOM node. So a browser, I would say, is also a framework. jQuery would be another such example, and you've seen it before. What is interesting about it is that it does give you the illusion of being more than a library: this jQuery('a') essentially reads like, for all nodes N such that N is an anchor node, do something.
So even though this is just a procedure call, it does have the declarative feel of, say, Prolog, because the syntax is abused a little bit more. But it is a framework, because after all you're just passing a function, your client code, into the framework, into jQuery, which will invoke it when appropriate. In this case, these functions will be invoked when you hover with your mouse over a node of the DOM. So how about the DSL as a language? Well, it's really hard to say where the boundary starts, but I would say let's call it a language when two things happen. First, the abstractions contain runtime or compile-time checks that spot errors. If you write an incorrect DSL program on top of the library, it would be really nice if that mistake was caught rather than the program crashing, and even better if the mistake is caught at compile time. And now you see that even sockets have these runtime checks, because when you write into a closed socket, you get an error. So clearly I'm contradicting my own rules, because even a library like the socket is already a language, since it does those checks. The second thing, and this one is more interesting perhaps because you may not have seen it as much, is that you use the syntax of the host language, whatever that syntax allows, to give the impression that you are really working in a different language. In jQuery perhaps it's not so pronounced, but you can already see how you can make a call, .hover, and on the result of that call you can make a .css call. This is call chaining: you are making a call on the result of the previous call. And now it doesn't look like a sequence of calls, even though it really is; you can start thinking of it as, well, this is an object, and on that object I set the hover property and I set the CSS property. It's easy to think about it that way.
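The chaining illusion can be sketched in a few lines of Python. The class and method names here are invented for illustration, not jQuery's real implementation; the key trick is simply that each method loops over all wrapped nodes and then returns the wrapper itself, so `.hover(...).css(...)` reads like property-setting on one object.

```python
# Minimal fluent-interface sketch (hypothetical names, not real jQuery):
# each method mutates every wrapped node and returns the wrapper itself,
# which is what makes .hover(...).css(...) chain.
class Query:
    def __init__(self, nodes):
        self.nodes = nodes

    def hover(self, handler):
        for n in self.nodes:
            n["hover"] = handler
        return self                      # returning self enables chaining

    def css(self, prop, value):
        for n in self.nodes:
            n.setdefault("css", {})[prop] = value
        return self

nodes = [{}, {}]                         # stand-ins for DOM nodes
Query(nodes).hover(lambda event: None).css("color", "red")
assert nodes[0]["css"]["color"] == "red"
```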
It's the illusion it gives you, but neither of these is true, because this really is not a single object. It corresponds to a set of objects, and it definitely does not quite return the object itself; it iterates over a tree, sets the hover properties of all the nodes to something, and then sets their CSS property to something else. But when you write the code, you get the illusion that jQuery('a') is just one object that you are manipulating. So let's look at Rake. Rake is again implemented as a Ruby library, but you can think of it as a language because it abuses the syntax really nicely and checks for correct usage of the language. Jim Weirich is the author of Rake, and the picture shows you how happy you are when you design a language like Rake. So keep that in mind: you should be like that when you are done with your final project. It's essentially a language like make or Ant, which presumably you've used in some classes. It's a tool for compiling your project that keeps track of dependencies: if I change this file, I need to recompile it, and as a result that file changes, and as a result I need to link, and maybe run some tests, and so on and so on. And it has a prettier syntax than make, perhaps, because of how nicely it abuses what Ruby allows you to do. There is an article about it written by Martin Fowler, very well written; it's your required reading, but let's go over the key features here. So here is an example Rakefile, and what it does, in the interpretation of the Rake language, is say: I have a task that I call codegen, and here I put the commands that actually do the code generation. You can call shell scripts from here, copy files, print them, whatever you want; the content is not important. Here is another task. This one is called compile, but I'm saying a little bit more here. What do you think I'm saying in this declaration of the compile task?
Well, either it depends on the codegen task, or the codegen task depends on the compile task. So which way do you think the dependence goes? It's hard to say, right? And why is it hard to say? Because you need domain knowledge of what compilation does and what code generation does. Your guess? Okay, I see. Your point is actually extremely good. You're saying that if each rule had to state who depends on me, that would be really hard to do; it's much easier to say what I depend on. So you're speculating that this really means that compile depends on codegen, because that is easier to state, and indeed that is true. We are saying: whenever we generate files through code generation, we had better compile them as well. So the arrow is a little bit misleading, because if there is a dependence from codegen to compile, I would expect the arrow to go in the other direction. But the arrows go in this direction for a reason. And here is another task, data load, which depends on code generation. And here is a test task which depends on compile and data load. So whenever this or this has changed (and this is a list of prerequisites), rerun the tests and make sure everything is still okay. That's essentially the semantics of that Rakefile. You can match it to make, but it also happens to be a legal Ruby file. This is a Ruby program which the interpreter can read and execute, and that might be quite surprising. So how is this a legal Ruby file? For that, we need to understand a few things. The crucial one is: what does a Ruby procedure call look like? Did anybody here write any programs in Ruby? Okay, so how do we call foo with arguments, say, 1 and 2? Okay, and in Ruby we may omit the parentheses, so this would also be our call to foo with arguments 1 and 2. And there is an extra bonus we can use when calling Ruby functions, right? What is it? You can pass it a block, right?
You can pass it a block, which for the purposes of this discussion you can think of as a lambda. There is a difference between them, but let's assume that there isn't, okay? The lambda becomes sort of an implicit argument of the function, and the function can get the lambda from the inside and do something with it. So this is a function call, all right? Now let's look at this Rakefile and do some autopsy on this DSL. What do you think is going on? Now that you know what a procedure call looks like, can you start spotting the constructs of the language and how they are implemented in Ruby? Essentially, I'm asking you to be a Ruby parser: you are going through it character by character, and you need to make sense of this sequence of characters knowing nothing more than the grammar, since that's what a parser knows. And you know a little bit of the grammar: you know what function calls look like. The rest we will infer. So the first observation: task is clearly a keyword in the DSL, right? We want to think of task as a keyword in our DSL, but it's implemented as what? As a procedure, and the whole thing will actually be a call, right? This task is a procedure or a function, and the entire thing here is a call. Okay, so what about the rest? Can somebody spot some other constructs here? Okay, let's go with the block. So here is our block, or lambda; they are not quite the same thing, but if there are no arguments, they behave the same. This is one of the arguments we pass into task. So what exactly is going on with data_load and codegen? What is the colon thing? It's a symbol. It's sort of like a string; if we replaced it with a string, an immutable string, it wouldn't really matter, okay? So both of these are symbols. It's a data type that Ruby has and Python doesn't. So what about that arrow? What is going on with the arrow?
We wanted it to go in the opposite direction, because we know that data load depends on codegen, so I would like to see the flow from codegen to data load; but that's not how it is. Why isn't it like that? It would really make me mad if I had to write it like that, so why were the DSL designer's hands tied? Okay, let's go through the hypotheses. Hypothesis number one is that this is just some variable or operator that we define, something new introduced into the language that serves as a nice connective between these two symbols. And that would explain why they could not use the arrow in the opposite direction: maybe that token was taken. But if that were the case, I would just define something like a <== operator, long enough to disambiguate it from anything else. So it's not a variable, okay? Does anybody here speak Ruby and know what these arrows are? It is like a map. In fact, the whole thing here, look, is a dictionary literal. So if I write a => b, can somebody rewrite it into Python? What would that be in Python? We would write a colon b, as in {a: b}, right? So this is a dictionary, this is a key, and as you know, this is a value. This is pretty clever, right? The whole thing here creates a dictionary with a dictionary literal, meaning it takes that content and turns it into a small dictionary which maps a single key to one value, or, as is the case here, to a list of values. And so when you call task, the call is now viewed as a definition of a new task. What is the name of the task? It is the single key in the dictionary which you pass as the first argument. What is the work that the task will do? It's the lambda that you pass into task, right? And so now you can see how you could implement this task procedure very easily.
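A hedged sketch, in Python rather than Ruby, of how such a `task` procedure might be implemented (all names invented, not Rake's actual code): the first argument is either a bare name or a one-entry dictionary mapping the name to its prerequisites, and the action is a passed-in lambda.

```python
# Hypothetical mini-Rake: the "keyword" task is just a function whose first
# argument is either a name or a one-entry dict {name: [prerequisites]}.
TASKS = {}

def task(spec, action=lambda: None):
    if isinstance(spec, str):          # task("codegen", ...)
        name, deps = spec, []
    else:                              # task({"compile": ["codegen"]}, ...)
        (name, deps), = spec.items()
    TASKS[name] = (deps, action)

log = []
task("codegen", lambda: log.append("codegen"))
task({"compile": ["codegen"]}, lambda: log.append("compile"))

def run(name):
    deps, action = TASKS[name]
    for d in deps:                     # run prerequisites first
        run(d)
    action()

run("compile")
assert log == ["codegen", "compile"]   # codegen ran before compile
```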
You get a dictionary which describes the name and the dependencies, and you get the function which describes what to do when the dependence is triggered and you need to rerun the data load. And now we see why the arrow cannot go in the opposite direction: it is because it is the dictionary syntax. We could perhaps reverse keys and values, but sometimes we depend on multiple things, and a list like that cannot be a key in a dictionary, so as a result they were stuck with the reversed arrow, okay? So this is pretty neat. Now, there are really two kinds of tasks. There is a file task, which is like a rule in make: it says, if this file changes, then regenerate that file. And now the block, the lambda, can take an argument which you can use in writing your rules to select the prerequisites, which presumably is this, and you call your actions that create the target file. Then there are plain tasks whose dependencies are not between files but really between tasks, as we've seen before; these are run, for example, whenever you call clean, and all this build maintenance is done. Again, nothing special here, except that you don't need to phrase it in terms of files as in make, right? In make the problem was that whenever you wanted tasks that don't really work on files, that are just sort of maintenance or cleanup, you often had to invent fake files to trigger those dependencies. Not so here. And there is something even nicer: you can define the tasks separately, so here you say what needs to be done for the second task and what needs to be done for the first task, and orthogonally you define the dependencies.
So you can define the tasks separately and then decide on the dependencies later, in a separate file. And you can also do what you can do in make: define a rule for going from any .c file to the corresponding .o file. The way you do it is that you have a special object in the library called a FileList, which matches a bunch of files, say all of those that end with .xml, and you iterate over them and perform this task. And what does it do? Nothing more than register a new task, a file task, which creates a dependence between the source and the new target. So each of these files is now going to trigger this command, and you do it all in a simple language, rather than having to introduce, as was done in make, special syntax that was not quite flexible and was hard to understand. Here you really understand what this file is doing: you just invoke this, creating a new dependence. So, to conclude, I want to talk about languages that are not internal, that are not completely built out of the host language, but that mix your own language and the host language, so a little bit of parsing of your language will be involved. The first way to do it is to put your DSL inside a general-purpose language. What would be an example of a DSL sitting inside a general-purpose language and being parsed before you execute it?
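The rule idea can be sketched like this in Python (a hypothetical registry, not real Rake): one loop over a matched file list registers a whole family of file tasks, each mapping a source file to its target.

```python
import fnmatch

# Hypothetical sketch of Rake's rule idea: glob over the sources and
# register one file task per match, deriving each .o target from its .c source.
FILE_TASKS = {}

def file_task(target, source, action):
    FILE_TASKS[target] = (source, action)

sources = ["main.c", "util.c", "README.md"]
for src in fnmatch.filter(sources, "*.c"):
    obj = src[:-2] + ".o"                          # main.c -> main.o
    file_task(obj, src, lambda s=src: f"cc -c {s}")  # action: compile the source

assert set(FILE_TASKS) == {"main.o", "util.o"}
```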
For example, regular expressions in JavaScript: you have some string, you call match on it, and there is your regular expression literal. You could say that this is your DSL, which needs to be parsed and converted into some bytecode, or automata, or a backtracking matcher, before you can execute it. So this is your DSL, and it's hidden in JavaScript. Even in Python, where regular expressions are ordinary strings, some parsing happens during execution before you can invoke the regular expression. Can you think of other examples where your DSL is mixed with your code, and you need to parse it, essentially compile it, before you can execute it? Oh, okay, that's actually a great example: the C preprocessor. Indeed, you can think of the C preprocessor as a special language that you insert into the file; it's processed through a relatively simple compilation, but you have seen the tricks that people use to abuse the hell out of that power, okay? That's great. So, Prolog: in CS164 it was also compiled, translated; essentially you got some sort of AST. But I would not call that embedding it into a language. Perhaps you could have a 164 language with snippets of Prolog inside, which would be parsed just like regular expressions, and then you would call a Prolog interpreter on that subset; then it would be truly nested in the way we are discussing now. But the way we used it, you could only write a Prolog program, not mix it with anything else. It would be easy, though, to turn it into something like what we do with regular expressions, and it is actually often quite convenient to be able to just call a piece of Prolog. Excellent examples. Is there more? You have syntax-rules in Scheme, essentially those macros, right?
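Python makes this visible: the regex DSL arrives as an ordinary string, and `re.compile` is the moment the embedded language is parsed and translated before it can run; a malformed regex "program" is rejected at exactly that point.

```python
import re

# The embedded regex DSL is just a string; re.compile is where the little
# language is parsed and translated into a form the matching engine can run.
pattern = re.compile(r"\d+")
assert pattern.search("item 42").group() == "42"

# A malformed program in the embedded language fails at its "compile time":
failed = False
try:
    re.compile("(unclosed")
except re.error:
    failed = True
assert failed
```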
Yeah, they're sort of part of Scheme, so it's hard to say where the boundary is. But that's right, that's a good example. Still more examples of embedded DSLs that you actually parse? Oh, great, I see: jQuery has a selector language, essentially the CSS selector language, right? And that needs to be parsed and handled by some CSS interpreter. Excellent, okay, I want to write it down because this is great: jQuery has a selector language. These are great examples, okay. Now, this is one I was hoping somebody would bring up: there is LINQ, L-I-N-Q, in C#. C# is Microsoft, but it's actually really nice, because the people who have worked on it in recent years are people like you, who graduated with a good knowledge of PL, and they add features really tastefully. LINQ is essentially a syntax for writing SQL queries and other queries. And the Rx language that we will implement in the last project was designed and implemented by Erik Meijer, who is the author of LINQ. What it does is essentially let you write your SQL-style queries more naturally, rather than in SQL, okay? And these are translated into low-level library calls that would look nasty if you had to write them by hand. So that's a great example, too. More? Okay, so this is great: SQL aside, templating is an example, right? You can combine strings that will be mixed with the results of your computation, and you can use it to generate web pages, all right. So, templating, okay. Fantastic. Now let's turn it the other way around: when is it the case that you have a general-purpose language embedded inside your DSL?
We have seen that one, too. Oh, so yes, it is true that you can simulate a Turing machine on top of regular expressions with backtracking. That's true, but what I have in mind is somehow mixing, say, the 164 language, Lua, or Python into some other language. So really think of the DSL as the host language and the general-purpose language as just something that describes a little bit of work that you do. You write a Lua program and you have snippets of TeX, LaTeX code? No: you write a TeX program with snippets of Lua. A TeX program with snippets of Lua. So, oh, I see: essentially you are calling Lua, or 164, from a document language. At the outer level, the main level, you have a document, which invokes Lua when it needs to do something, say figure out the size of a table. Okay, all right, more? Okay, shader languages. So the shader describes a computation, presumably in some really narrow DSL, but you can invoke C++ functions at some point, which presumably define how to compute a triangle or something. I'm guessing that this is what happens in shader languages; I don't know, actually. Yes, I would say that's fine: if you think of HTML as the DSL, then JavaScript is embedded in it. So that's good, too. HTML: well, people will debate whether that's a DSL, but let's put it here; it illustrates the concept. All right, there must be more. So, Elisp is the Emacs version of Lisp. You might be right, but I think it's too subtle, and I might confuse people, so let's leave it out so that people don't get confused. But you may be right. Okay, so you can embed Perl in regexes; perhaps you can control how matching happens that way, I'm guessing. Okay. So there is one thing that we have seen, one of the projects in CS164 that you will build, where we have such event handlers embedded in a DSL. Is that what you were going to say? The syntax-directed translation, the grammar description that you will start playing with in PA4, is essentially what?
We write the grammar like this: a non-terminal, and I don't remember the exact syntax, I think we are using an arrow, so N goes to A B, and then here we place arbitrary Python code which says something like: N1 gets the value of N2 plus N3 (I think we write .value here). So this here is arbitrary Python code which you can write into these actions, and when you build the parse tree and then evaluate it, you are walking over a part of the parse tree which contains whatever the subtrees are here, and when you propagate the values up, this rule is executed. It's arbitrary Python code, with the simple agreement that the way you refer to the values propagated up from the children is n2.value, n3.value, n4.value. So when you execute this piece of code, you just need to set up the Python environment in such a way that the code can find the variables n2, n3, and n4, and by doing that you can execute arbitrary Python code and control how the evaluation of the parse tree happens. That's the example you will see, and it's quite convenient: the grammar stays very compact, the rules are simple, and if you need a lot of Python code in an action, you just call a Python function, and that's essentially it. So external DSLs are those that are not embedded in a language but are completely separate, like the Google calculator: it has its own syntax, you translate it with a parser, it has its own interpreter, or you generate code, essentially compiling it, and it interacts with the external world on its own. So there isn't that much to talk about. If you want to build a real language this way, there is a lot of work, because you need to build your own integers, floating points, strings, operations on them, everything from scratch. That's why internal DSLs are more popular: you can reuse what the host language gives you. But in some cases, like the Google calculator, the external DSL is the right choice. So with that, the
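A minimal sketch of executing such a semantic action (invented names, assuming the convention described above): the parser binds `n1` for the parent and `n2`, `n3`, ... for the children in an environment, and then runs the grammar writer's arbitrary Python snippet in it.

```python
# Sketch of syntax-directed translation: the action is an arbitrary Python
# snippet executed in an environment where n2, n3, ... are the children.
class Node:
    def __init__(self, value=None):
        self.value = value

def run_action(action_code, children):
    env = {"n1": Node()}                   # n1 is the parent node
    for i, child in enumerate(children, start=2):
        env["n%d" % i] = child             # bind n2, n3, ... to the children
    exec(action_code, {}, env)             # the action is expected to set n1.value
    return env["n1"].value

# Rule N -> A B with action "n1.value = n2.value + n3.value":
assert run_action("n1.value = n2.value + n3.value", [Node(2), Node(3)]) == 5
```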
reading is the chapter, the article on Rake. And for the Thursday lecture, bring ideas on what you would like to work on for the final project. We can discuss them here to see whether they are appropriate, whether they are not too big; we can try to find a small subset of an idea so that it is doable in two weeks. And we'll talk about problems in general that are solvable by small languages. Thank you.