Well, welcome. It doesn't seem like the end of the semester is here unless you look at the weather outside; in that case, it actually does. So we'll talk today about your final project, or I hope you will talk about your final project, because I want to run this mostly as a brainstorming session to get your creative juices flowing. You can think of the final project as your own PA, designed by you. It should take about as much time as one, but we'll start now so that by the time the final project arrives, you're already prepared. The goal is to convince yourself that what you learn here, you can actually apply. Clearly, you can do a project that we hand to you on a silver platter with the handout mostly readable, but real life of course doesn't look like that. So you want to convince yourself that when you encounter a problem that requires a linguistic solution, a library or a small language, you can actually think of it the right way and solve it the right way in the right amount of time. That's instead of the final exam, of course. So during the usually scheduled time for the final exam, which comes at a brutal hour this year, I think 7 or 7:30, whatever, too late on a Friday of all things, we'll get together somewhere in Soda and have a three-hour demo session where you'll talk to each other about your projects, comment, write reviews on paper, and have pizza. After that, maybe beer if you have an ID. So where are we? This slide is not quite readable, but you do see the week granularity. We are off the screen now, but coming up, you will need to submit your proposal for the project, and I want to get you ready for that. You'll do that in parallel with the midterm preparation; as you study for the midterm, you can interrupt and think about your project idea. You'll do that as soon as your PA4 is submitted. Then you'll get feedback, probably in 10 days.
There are quite a few groups, so it will take some time, but we'll give you high-level feedback that should steer you in the right direction. You won't worry too much about design and implementation at that point, just the scope and the suitability of the idea. Then with that feedback, you are going to work out a proposal, essentially an implementation plan, so that when you start working on the project, you actually have a plan. You'll have a handout, your own handout that you will follow, and you have, essentially, something more than three weeks, minus some preparation for the final exam. One thing to keep in mind is that there will be two lectures with class presentations. These will be really short presentations. Don't think, wow, great, they're really short. Short presentations are really hard to give, precisely because you have just three minutes and quite a bit of information to convey. So the last two lectures before the exam are your presentations, and I think that's it for the timeline unless you have questions. The question is: what is the difference between the class presentations and the demo? The demo is the time when you have everything implemented and running and you are able to show the product on your laptops on some example programs. You will also have a poster on which you explain everything. That won't be the case for the classroom presentation. At that point, you will essentially just present the design of your language: this is the problem it solves, this is how I approached the design of the language and the implementation, here is how my programs will look, that sort of thing. You will stand here and we will rotate quickly, with perhaps three minutes per group. Good question. No more questions? Okay, so that's the demo session, and the presentation is you standing here.
And also at that point we'll announce the winners of the two contests: the parser contest and, sort of, the best demo of PA9. In the course you will build a browser, and on top of it you'll build some interesting little program, and that will be announced as well at that point. The winners get yellow jerseys, like in the Tour de France. So at this point I can announce the winners of the first phase of the parser contest. There is still a lot of work on the parser to be done, but this is the result of homework four. Here is our solution; it runs in 10 seconds for some unspecified input. There are quite a few people faster than us; I'll show you just the first three. So Aaron is there, twice as fast as our parser, Rohin is second, and Bayhee. Are they here? Can the three of you stand up? Okay, perfect, thank you. We'll see how you do in the final project; there is much more to be done. The winner last year was five times as fast as our parser, and the efficiency actually comes from running on large grammars, I think, because large grammars have many non-terminals and so many options. An expression can be a plus or a minus and perhaps 20 other alternatives, and in the current parser, for each of these alternatives you have an edge. So the speedup came partly from representing all these 20 edges for the 20 alternatives as a single edge. That of course didn't show up in this contest, since we are not doing any reasonable grammars yet; these are all toy grammars so far. But congratulations and thank you for the great job. So the final project proposal, which is something you submit just before the midterm, is about finding a problem you can solve with a little bit of what we are learning here. Your customers could be programmers like yourself, or other programmers who sort of just write a lot of code without thinking too much about it. That's not you, I hope. But also end users, web designers, you name it.
I'll show you an example of a project from the last instance of 164 that is neither of these. You want to document how the problem is solved today. Ideally you find an example which shows that the way we do it today is impossible, that it cannot go on like this: I need to create a library, I need to create a language, I need to change how things are done. You'll find an example and illustrate where the problem lies. Perhaps there is too much plumbing that should be hidden in an abstraction. Perhaps there are bugs that people make that you would like to eliminate by not letting them specify some parts of the program, again by hiding them. Once you have that, you'll show how your small language will fix the problem. So you design the language. You show us a small example of how the language would be used: you list your constructs in a small first hello world, and then a bigger example illustrating the usage of your little language. And then you outline the implementation. Is it an internal DSL embedded in a host language, in Python or in 164 or in something else? Or is it a hybrid, the way regular expressions are embedded in a host language like JavaScript? Is it compiled, meaning you take your program and generate other code which is then compiled or interpreted by another tool? Or is it interpreted? All this will be spelled out. It's due Sunday, March 3rd. You could have a group of three on the final project, especially if you are considering something bigger; ideally you can justify why you need three people. Okay, so the hardest part, of course, is finding the problem that you want to solve. These are some fragments from an interesting article on how to do research. We are not doing research here, but we are pretty close: we are finding a problem that ideally you could get other people excited about.
So whether you are entrepreneurs, researchers, or just engineers who want to do something new, the hardest part is finding a problem worth solving. And interestingly, in the education we often offer, everything is handed to you and you learn how to solve the problems, but not so much how to figure out whether a problem is worth solving, whether it's worth solving right or a hack is enough. So I want to work with you today and brainstorm what might be a good set of problems to attack in the final project. A few things we'll go over first. Most of the projects in the class are extensions of the nine weekly assignments, so extensions of the browser. You could do some more with Prolog, some more with the parser, some more with the browser. So I want to clarify for you what you will know by the end of the semester, since we will not be done with it by the time you're writing the proposal. Then a few examples of, you could say, influential DSLs, a few examples from projects that we have seen in the past in 164, and then, time permitting, we will brainstorm what other ideas we could work on. So we are currently here, looking at how to build a parser that recognizes the inputs given a nicely described grammar. Then here you will do syntax-directed translation: now you'll have a parser that can actually read in a program and translate it into some other code, or read in a program, build an AST, and interpret it. The parser that you will build in this step is what we used for Prolog in the two instances of 164; this was the thing sitting on the server. And here in this step, we are going to catch your last bugs in the parser and write various translators, including the Google calculator. So you'll have enough practice to actually write your parsers and little compilers for whatever you need. Now in this block here, we are going to build a browser by first building an HTML parser.
So: parsing HTML into a DOM, which is sort of an AST of the document, and building a layout engine that will position the boxes on the screen and render them. Now you can have your little web pages. In here, we are going to build a tiny jQuery, a simple prototype of what jQuery does, so that you understand how scripting works and how you can lift the low-level operations in scripting to something declarative like jQuery. And finally, in the last step, which I'll show you on the next slide, we are going to build a sort of dataflow, reactive language to make programming with callbacks and events somewhat more sensible. An interesting project could be done on extending the dataflow language RX further. So what is that language? Here is an example of doing something in JavaScript. We saw it in lecture one, but it makes sense to go over it again. Imagine that in the browser I have a box, a yellow box; this is not yellow, but pretend it is yellow. And now I have a mouse which I move, and the box is always going to follow the mouse with some delay: it will be in the position of the screen where the mouse was half a second before. That's what we want to implement. This part of the program, which is essentially HTML, does nothing more than draw the box on the screen, and it gives the box a name. This is the name that we are going to refer to later in the program. And the program is here, and what it does, it says: whenever the mouse moves, right, you are adding an event listener, so you're just registering a callback, an interrupt. And whenever the mouse moves, you do what? This function here is invoked, and what does this function do? It doesn't really do anything itself; it says invoke this function, actually this one here, 500 milliseconds later. And what will you do in that function? In that function, we will set the top and the left coordinates of the box to what? To the values that the mouse had when the callback was registered.
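To make that structure concrete, here is a runnable sketch of the callback code just described. Since we are not in a browser here, the document and the box are stubbed out with fakes, and all the names are my reconstruction, not the slide's actual code. The important part is the shape: a listener that schedules a second callback 500 milliseconds later.

```javascript
// A stand-in for the browser: a fake box with a style, and a fake document
// that just records mousemove listeners (all names here are assumed).
const box = { style: { top: null, left: null } };
const listeners = [];
const fakeDocument = {
  addEventListener: (type, fn) => listeners.push(fn),
};

// The shape of the slide's program: whenever the mouse moves, schedule a
// second callback 500 ms later. That inner function is a closure: it captures
// e.pageX and e.pageY as they were at the moment the mouse moved.
fakeDocument.addEventListener("mousemove", (e) => {
  setTimeout(() => {
    box.style.left = e.pageX + "px"; // the position from 500 ms ago
    box.style.top = e.pageY + "px";
  }, 500);
});

// Simulate one mouse move; the box follows half a second later.
listeners.forEach((fn) => fn({ pageX: 10, pageY: 20 }));
```

Even in this tiny form you can see the problem: the logic of "follow with a delay" is smeared across two nested callbacks.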
So, 500 milliseconds before. Here is the first interrupt, invoked whenever the mouse moves. Here is the second one, invoked 500 milliseconds later. It's really a closure: you can see how these two values actually come from here, the positions where the mouse was when it moved. So this is not really readable, and in fact this callback hell is one of the reasons why JavaScript programs are hard to write, because JavaScript, in its basic form, is quite low level. Here is how we'll essentially rewrite it. It will look something like this. We are not going to use visual notation; we'll use textual notation, but the programming model will be exactly that: there will be a source of events, like the mouse, that you connect through a pipe to this div element. You will use some addressing to find it on the page, but these events are going to be delayed, both the X and Y coordinates, by 500 milliseconds each. And now there is no callback. Well, deep inside the implementation there is, but not at the level of the program you are writing. It all looks like values nicely flowing from here to here, and the programs will be much more compact and much more readable. And therefore, hopefully, you can write better, more complex functionality, because you don't get tangled up in the callback hell. This will also allow us to implement a little bit of push functionality: whenever something changes on the server, it can send us the response and we will display it automatically on the screen. But this will be very little, so adding somewhat more push interaction, the server sending stuff to you, would be a good project to continue with. Okay, so I want to show you two DSLs, both again connected to Berkeley in some way. The first one is memoize. It was done by Bill McCloskey, who is now one of the people working on Mozilla's Firefox engine, on the JavaScript component.
And essentially what he said was: make in principle is good, but its syntax is just too painful. We've seen in the previous lecture Rake, a nice Ruby DSL purely embedded in Ruby for doing dependence-based computation, which means you change a file and Rake will determine what needs to be recompiled, okay? So Bill found an even more elegant way of running your compilation scripts in such a way that only what is needed is recompiled. Here is how you can use it. What you see here is a fragment of a shell script, right? It's a shell script which essentially describes how you compile your project. And so this is just a sequence of commands which compile one file into an object file, another into an object file, and here you link them into the resulting product. Except each of these lines is invoked not directly in the shell, but is passed into this memoize.py command, which invokes it in a special way. And the same happens with all the other files. So you see we're actually running three instances of memoize. And the whole idea is that the first time you run this compilation, all three gcc commands are invoked, of course. The second time you run it, if only this file was changed, only this compilation will happen. And of course, the linking needs to happen as well, because this would be a new file resulting from the compilation of file2.c, okay? So start thinking how you would implement it. What goes into this memoize.py file? Here is another way you can run it: you run it directly from a Python script. You import the memoize function from the memoize module, and now your run command does what? It runs memoize and just checks the status. And here you have essentially your compilation commands; they compile something and eventually do some sort of linking. This one is in OCaml, another language, which doesn't matter.
And again, the same thing happens: this is executed each time, but whether this command is actually run or not depends on whether the source file has been changed. So let's see, how would you implement this? The goal, again, is to execute your script again and again, but suppress those compilations, those steps, that don't need to happen. So we could look at this shell script and see what should probably happen inside memoize.py. It should try to figure out what files the command wants to read, right? And maybe find information about whether each has changed since last time or not, or look at the content of the file and check whether it has changed by computing a hash. Well, okay, so that's essentially what happens. What do we need to do now? Number one, how do we find out which files these are? Are we assuming that memoize.py is actually parsing that command line or not? It would be nice if it did not have to parse it, okay? And that's indeed the key trick here: memoize, the first time, is going to run that gcc command, but it will run it in a sort of tracing mode, where the operating system tells you which files were opened and closed, and opened for reading versus writing, during the execution of that particular command. Maybe I can even show you a fragment of it. Here I'm running ls under the strace command. When you run it, it gives you a bunch of information, which you can filter down to the opening of files. Maybe I can find something. There is an open, opening some PTY here. We probably could find a real file here inside. Right, it's opening /etc/passwd. I have no idea why. Oh, I see, I'm doing ls, so it makes sense, okay? So it's loading this. And this is the information which memoize uses internally to figure out which files were read and which files were written. So why is that useful? What sort of information are we building from it? Okay, so exactly. We'll take this command.
We'll run it in tracing mode, and we'll discover that whenever we run it, we are reading file one, file two, file three. Remember that the compilation doesn't read only file1.c; it reads a bunch of header files and whatever else, which, when modified, should also cause this compilation to rerun. And you discover that there are these dependencies on the file called, say, file1.o in this case. So perhaps four files are opened for reading during the process and one is written, and we build this dependence graph, right? Okay. That's what we do the first time. And maybe we remember when these files were read and written, or maybe we compute their MD5 hashes, so we know what their content was at the time. So imagine none of these files that we read has been changed when we run that script again. At some point, we rerun this. None of these files here in the input have been changed. How do we discover that we don't need to rerun that command? Right, so we look at the timestamp or the MD5 hash to see whether they have changed. But before we do that, we need to find out which files that command actually wants to open. Aha, okay. So you really need to compute a hash of the command. You hash this thing here, because you do not want to build a parser for the command line. Command lines come in a million different formats, and it would be too difficult to parse one and figure out, oh, you're reading this file. Plus, header files are read implicitly rather than being specified on the command line, so you cannot rely on it. And you don't want to run the command again just to discover what it would read and somehow kill it in the middle.
So you really need to look at this command, give it a unique ID, and assume, and of course there are some assumptions here, that if you run this command again in the future, it will do the same thing, which means essentially follow this dependence graph: read this, read this, read this, read this, and output that. Under that assumption, you don't have to understand this command at all. You just look at its hash. The hash leads you to the dependence graph, which tells you: check if any of these have changed; if not, you can skip. Isn't this beautiful? It's so dumb in the implementation. Okay, it's so clever because the implementation can be dumb. It can just essentially hash the command line, run the trace, build the dependence graph, and then next time just check whether these input files have changed. What it needs to preserve between invocations is some sort of mapping between this string, which it doesn't understand but knows how to identify, and the files. Right, so memoize has some database where that is stored. And it perhaps doesn't even need the full dependence graph, maybe only the files that it read. I'm not sure whether you would actually gain some benefit from knowing what you have written. Maybe, I don't know, perhaps not. So you only remember, maybe, what you read last time. This is quite great, actually. If you search on the web now, and I tried to find the original link for memoize, you cannot find it, but people who have heard about it have re-implemented it, because it is so small. So there are companies using this for small internal builds, because you can configure it to your needs. It's easier to tweak than make, because it's perhaps 300 lines of code. So here is essentially what we discussed. Here is the crucial call where you run strace with the command coming in here, and here is some other information that comes in, which I don't remember what it is.
Okay, Protovis is a different DSL. It was done by a Berkeley student who went on to become a professor at Stanford, and he works on data visualization. The sorts of things they want to do are charts like these, a pretty wide spectrum. So it's not that you have a tiny range of graphs you want to do; it's all of that. It looks beautiful if you look at their gallery. So here is the first one. In many cases they did not actually invent the visualization themselves; they take them from various publications by people whose contribution is novelty in how to visualize things. And so this is a nice visualization of mergesort. I'll let you stare at it to see whether you can understand what the meaning of these lines actually is. So who understands the meaning of the angle of a line? Yes, please. Uh-huh, right. So the smaller the number, the more to the left it's leaning, right? And the number that is the median happens to point straight up, okay? And the array is sorted when you see what you see in the bottom line: none of the lines cross. If they cross, then you have two values that are not sorted yet. And initially everything is unsorted; then as you start merging, so you merge in this step, presumably this and that, these two arrays now become sorted. And I assume you merge this, and you maybe merge these three, I don't know. So this is beautiful. Now how would you write a program that visualizes this in a few lines of code? And now, perhaps it's not the best DSL, but it is compact. So let's see if we can understand what they do. Let's first just focus on this part. This one is easy. It's all in JavaScript, of course; there is a small JavaScript library. Well, maybe not so small. What they do in here is they go to the Protovis library and create a panel. This is sort of the space on which you visualize, the top element. And you set its width and height and margin and bottom; bottom is presumably how low it sits.
Okay, and you also supply the data, the data that needs to be visualized. And how do you do that? You actually call a special merge sort routine that gets the original data. And that merge sort doesn't return only the result; it returns all the intermediate steps, so that you can then visualize them. So that's what you want to visualize. Essentially all the data you see here are fed into this panel, right? So here is just the setup. And now we are going to specify how the whole thing will look. So let's look at this and try to understand the DSL. It uses the same call chaining you see here: you do a .width call, and on the result of that call you do another call, height, and margin, and bottom. You should by now be familiar with this. So now, to this panel created here, we are going to add some stuff, okay? Let's see if we can understand what happens there. Does anybody venture to guess what they are doing in these calls? That's fine. I don't claim I understand everything; you would need to go into the manual to understand it. But the point is that a well-designed language should be self-explanatory, so let's see how close they got. This is an identity function. It essentially says: the data that you have stored here, don't do anything with it. So clearly this is a way for you to somehow take the data and massage it, right? Presumably this array here would be some sort of multi-dimensional thing, one per row, and depending on which row you are in, you could change it a little bit, okay? But not here; the data stays unchanged. So this essentially says: keep the data. Now, what do we think happens here? This is the tricky part, maybe. So I think the scaling, let's see. Let's understand this part first; I think it will be easier. So let's try to understand the wedge. I can tell you that the wedge is a special wedge.
It has angle zero, which means that rather than being such a wedge, the angle here is zero, which means it is a line. So the wedges that are created here are actually the lines that you are drawing, the graphs, okay? Here we are adding one line. It has some length, okay? So this is 30. Bottom is right here. Angle is zero, so it's not a wedge but a line. And the start angle presumably is this part here. And that start angle is somehow computed from what? From the data point, right? So if the data point is small, you lean it more to the left, in a linear fashion. So there is a library for scaling, and here is some sort of a range: you're scaling it between this number and that number. And then you set its color. So this somehow happens inside the library on each data point in this two-dimensional matrix. You don't actually refer to the data point, but you can specify how it will be scaled. You could think of this as a prescription for how each data point will be visualized, and in our case it will be a line leaning depending on the value of the thing. So what I think they are doing here is saying, at this point, that there will be several panels: one, another one, another one, another one, and so on. And then you can see that this PV, which is part of this library, has something called index, some internal value, and these panels are going to be stacked like this, on top of each other. So you can now judge how useful the DSL is, or how easy it is to learn, because you need to know not only these functions but also how you stack them. You need to know that there is this variable index. So there is some learning curve that you need to go through to use it. But the effects are great compared to having to write everything by hand. So what changes would you make to this library in order to make it easier to use? Any suggestions for how to simplify it?
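While you think about that, here is a toy version of the call-chaining style Protovis uses. All the class and method names here are made up for illustration; this is not the Protovis API. The trick is simply that every setter returns `this`, which is what makes chains like `.width().height()` work, and the data function can default to identity:

```javascript
// A miniature fluent "panel" in the Protovis call-chaining style.
// Every setter stores a value and returns the object itself, so calls chain.
class Panel {
  constructor() {
    this.props = { data: (d) => d }; // data function defaults to identity
    this.children = [];
  }
  width(w)   { this.props.width = w;   return this; }
  height(h)  { this.props.height = h;  return this; }
  data(fn)   { this.props.data = fn;   return this; }
  add(child) { this.children.push(child); return this; }
}

// The chaining reads almost like a declarative spec of the picture.
const panel = new Panel()
  .width(400)
  .height(300)
  .add(new Panel().width(50));
```

The design choice is that configuration and composition happen in one expression, at the cost of the reader needing to know which methods exist and what they return.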
Okay, so we could perhaps say that if we are not providing this, then it's the identity function by default, right? You could do that, okay? Perhaps. Right, but I don't think this is the hardest part to master here. I would say that perhaps the stacking of those panels could be done more easily, right? Essentially they are operationally saying: here are those panels, stack them up like that. Uh-huh, right, so this is an excellent point. Essentially what they are saying is: for each row create a panel, and within each panel do something at every point, right? And this is definitely not obvious here. It somehow happens by saying, well, this scale refers to the data point somehow implicitly, and here you are saying, somehow scale the data row by row in some way, right? Essentially they are somehow laying out those boxes, those panels into which they then put the graphs, and it's not obvious. Perhaps if you know the pattern it's fine, but it's definitely not that easy to learn. There is another example which is somewhat even harder to understand, but the effects are great. So what would be interesting, as one project, is to pick a subset of Protovis, or of the functionality they support, and design a better language. Clearly their hands were tied, in the sense that they were not trying to build an external language with its own parser; they said everything needs to be expressible in JavaScript. And so this is how they embedded it into JavaScript. It needs to be, in other words, a library. And perhaps, even sticking within JavaScript, one can do better than this. So that would be an interesting example. A related idea I would like to suggest is a language for visualization of algorithms, right? If you noticed, with the Earley parser and a few other algorithms we looked at, it would be nice if it were extremely trivial for the instructor to add commands for visualizing the algorithm in some animated fashion.
The reason we don't do it is that drawing it by hand takes so much work, and writing code that does it is even more tedious. But if somebody could invent a library, say for Python or whatever, in which I write the pseudocode in one way, and it both executes the program and visualizes it in such a way that I can easily configure how things look, that would be fantastic. Many generations of students would be thankful to you. Okay, so that's an example of how a language would help in CS164: visualizing algorithms. And we have already used DSLs in several ways: in grammars, graph visualization, you could say regexes. Another example where they would have helped: you had to do the desugaring in a painful way, so a DSL for tree rewriting would be nice. Debuggers for grammars and environment visualizers for coroutines would be really handy. If I could easily extend the interpreter so that, in the process of interpreting, it actually creates environments in a visual way, that would be great. I'm not sure there is a solution for that. Now, this may be a good point to look back at homework one, where we asked you to identify the pain points in building the Greasemonkey visualization. The reason for that exercise was to reflect on how you program and where things go wrong, but also to suggest ideas for how things could be fixed. So can we go through some of these now and see whether the solutions would be something from 164? Maybe the solution is not within the scope of the course. Do people want to suggest what the pain points were for them? Okay, so I don't know if you heard it; this is great. So: understanding the regexes as they are used in JavaScript. The regexes always come with some calls, like match and next and give-me-all-matches, that sort of thing. So clearly one solution is to just design that regex API in JavaScript better, in a more understandable way. And so you could essentially teach people how to do API design better.
But there will always be APIs that are not easy to understand even when they are designed well. Something like Eclipse, if you know it, is a fantastically extensible IDE, but it comes with perhaps 10,000 classes and hundreds of thousands of methods. And if you want to do something, it literally takes four hours of reading the manual and reading code to write four lines of code. Those four lines of code do a lot: they read a Java file, parse it, build an AST for Java, annotate it. So there is a lot of good stuff hidden behind those calls, but it takes sort of one line per hour of programmer productivity. And what is really bad is that you cannot spend those four hours of searching in a row, because it's so mentally draining that you sort of spread it over several days. My own experience. So this API, Eclipse, is pretty good. It was designed by people who had built such IDEs before, and so it has good design patterns in it; that's why it sort of caught on rather than died. But even then, these APIs will be really hard to learn. So if you could solve that problem, you could be a really rich man. So what could you do for just these regexes? Any votes? Just better regex literals that are more consistent. But maybe let's stick to the problem of: you have a library, and you know that it does what you want, you just don't know how. Is there a way we could learn about the library faster? Maybe you wrote code that is almost right, but not quite, and we need to tweak it to make it work. What could you do? And now open your mind, right? There is a lot of stuff at our disposal. We have enormous computing power, even on your laptop, and our compilers and IDEs don't use it. You don't have your laptop just number-crunching to help you program; it's sitting idle most of the time. Then there is crowdsourcing: there are people on the web who can look at your problem and help you. It's important to get the answer in a few seconds, of course. And then there is search, Google search.
So what are the choices? Let's work through one here. The output is essentially, if I understand the suggestion, the desired output, right? Presumably you knew what the program should output, so you could write a simple test case where you say: on this input I want that output. That would be easy for you to write. And you're saying: somehow fill in the program and massage it until it just works. (I would like to talk with you afterwards about what existing work there is in this space, because we do research in this domain and perhaps I'm missing something.) You could do it by brute force: here are six calls, maybe 60 or 600, just combine them in all possible ways, creating all programs up to a certain size. If the size is small, you could probably do it in a minute while you're brewing coffee. To me, that sounds like a great solution. Essentially we are saying: you know what input you have, and perhaps you know what output you want, so you know their types. Now you find all functions that accept a type Foo; maybe those functions output Bar, so you look for functions that accept Bar and create Baz. This is a tool we built, and it works in many cases. Where it breaks down: in JavaScript or Python, which are dynamically typed languages, you don't have static type annotations, so you know nothing about the function; as far as you can tell at compile time, any object would go in. Even for statically typed languages like Java, which we looked at, there are these extremely abused types like strings, which people use to represent integers, dates, social security numbers, and almost any function takes a string. So string is not discriminative enough to tell you how you can connect those functions.
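The type-directed enumeration described above can be sketched in a few lines. The function catalog and type names here are made up, and a real tool would handle multi-argument functions and vastly larger search spaces.

```python
# Sketch of type-directed API search: enumerate chains of unary
# functions whose types connect an input type to a desired output
# type. The catalog and type names are invented for illustration.

CATALOG = {
    "parse":   ("str",  "ast"),   # name: (argument type, return type)
    "resolve": ("ast",  "ast"),
    "emit":    ("ast",  "code"),
    "upper":   ("str",  "str"),
}

def chains(src, dst, max_len):
    """Yield sequences of function names mapping type src to type dst."""
    if max_len == 0:
        return
    for name, (arg, ret) in CATALOG.items():
        if arg != src:
            continue
        if ret == dst:
            yield [name]
        for rest in chains(ret, dst, max_len - 1):
            yield [name] + rest

print(list(chains("str", "code", 3)))
# → [['parse', 'resolve', 'emit'], ['parse', 'emit'], ['upper', 'parse', 'emit']]
```

Notice how the string-typed `upper` already pollutes the results; with hundreds of string-taking functions, this is exactly the "string is not discriminative enough" problem from above.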
So then you need to do analysis to understand what kind of strings they accept, what can be inside them, and that's where things become so complicated that you cannot quite make it work. But the idea is good; you probably just need some other ingredient to make it work. Uh-huh, uh-huh. So that would be great. I think the notion of "related" is hard. It would be much easier if every function you ever write could not stand alone without the inputs and outputs on which you ran it. Imagine that you cannot upload anything to Google Code without a large set of input-output cases. Once you have those, you know quite a bit about the function without running it, because you know how it maps inputs to outputs, and now you can search for the desired function much better. So perhaps the methodology should change. I want to switch to other pain points from homework one. Understanding APIs: APIs are essentially crazy, poorly defined languages that do a lot in some inconsistent way. Helping people understand them is a big problem and will remain so for many years. I'd like to move on to some other thing. Oh, web scraping. Okay, so you're referring to the problem of: you see a web page, you know that somewhere in it there is the data you want, and you would like to specify what to extract. Or maybe what problem are you referring to? Writing a regular expression that matches just the right thing. Okay, so you would like a way of maybe learning regular expressions that work correctly and handle all the corner cases you mentioned, such as multiple authors, artists with the same name, zero albums, anything of that kind. So perhaps you would write a regular expression for the common case and then have a tool explore the other web pages and refine the regular expression for the corner cases. Or perhaps you would just highlight what it should match, and you wouldn't need to write the regular expression yourself at all.
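A toy version of searching by input-output examples rather than by name might look like this; the corpus is just a handful of Python built-ins standing in for a real function database with uploaded test cases.

```python
# Sketch of searching for a function by input-output examples: run
# every candidate in a corpus on the examples and keep the ones that
# agree. The corpus here is invented for illustration.

CORPUS = [str.strip, str.upper, str.title, len]

def find_by_examples(examples):
    """Return corpus functions consistent with all (input, output) pairs."""
    matches = []
    for f in CORPUS:
        try:
            if all(f(i) == o for i, o in examples):
                matches.append(f)
        except Exception:
            pass  # a candidate may not even accept the input type
    return matches

hits = find_by_examples([("  hi  ", "hi"), (" ok", "ok")])
print([f.__name__ for f in hits])   # → ['strip']
```

The point of requiring input-output cases on upload, as suggested above, is precisely that a search like this becomes possible without having to run or even understand the candidate functions' source.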
Okay, so you annotate the learning inputs and the regular expression comes out, right? These sorts of desires for programming by demonstration, which is essentially machine learning whose output is programs, would be great, and we'll see more and more of it. The problem is that we need to identify domains where this is doable; regular expressions seem to be one of those, and then we can build the tools. Meaning: parse the HTML into a DOM and then use XPath queries that walk the tree, right? Mm-hmm, uh-huh. I see, so the problem was that the webpage used tables which, when parsed into a tree, didn't look like a table, or weren't regular; maybe the columns were not aligned or something. Perhaps somebody generated the tables incorrectly, and so you would like a way of normalizing the information into a form that is more canonical and then easier to extract. Some sort of tree matching: you would have the tree that you expect, and the tool could find the oddities and transform those trees. Is that what you mean? Yeah, I think so. Uh-huh. I see, so essentially a refactoring tool that takes your HTML pages and refactors them. Well, this would be great and maybe not so hard, except for one challenge: most of these pages are generated from templates. They may have PHP code that dynamically prints those pages, so you would need to go into that PHP code and change it so that, when it executes, it generates a sensible, modern HTML page. You need to go to the meta level and fix the code generator. That's harder, but it is a cool problem: understand a program and see whether it prints HTML correctly. Other pain points from homework one? What else took a lot of time? Okay, I'll let you think about it and let's look at some more ideas. Another thing you could consider is the MediaWiki language.
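A crude sketch of learning a regex from annotated examples: enumerate candidates from a small hand-written pool and keep the first one that matches every positive example and no negative one. Real programming-by-demonstration tools search a much richer pattern space; this pool is invented.

```python
import re

# Sketch of refining a regex from examples. The candidate pool is
# hand-written and invented for illustration; a real tool would
# synthesize patterns rather than enumerate a fixed list.

CANDIDATES = [r"\d+", r"[A-Za-z]+", r"\d{4}", r"[A-Za-z ]+"]

def learn(positives, negatives):
    """Return the first candidate consistent with all examples."""
    for pat in CANDIDATES:
        rx = re.compile(pat + r"$")   # anchor so the whole string must match
        if all(rx.match(p) for p in positives) \
                and not any(rx.match(n) for n in negatives):
            return pat
    return None

print(learn(["1999", "2010"], ["99", "abcd"]))  # → \d{4}
```

Adding a new corner-case page then just means adding its strings to the example sets and re-running the learner, instead of hand-editing the regex.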
It has a template language which allows you to combine text with various sources of information, and these templates are just really hard to write. So you could look at that, but I don't want to spend too much time brainstorming here. I want to show you this. Here is a project from the last instance of the course, called Grainline, and it's quite unusual, but perhaps it will stretch how far you think about your project. It's for a pattern maker. Who is a pattern maker? A pattern maker is essentially a tailor. So believe it or not, they built a beautiful language for making it easier to draw the patterns that you cut out from fabric: 2D patterns that you then fold into 3D, into something like a shirt. Why is that a problem? Because when you want to make a shirt for somebody, you get a recipe, and the recipe doesn't prescribe particular sizes for these patterns. The recipe tells you: take this distance, take that distance, then put these distances into the pattern and scale the pattern up according to your measurements. Here is an example: if you take a measurement like this, you need to add a particular distance, and here is roughly the language that people follow. You take your measurements, you follow these instructions, and you draw out the pattern; then you can cut it out, cut out the fabric, and stitch it. It's quite painful. And indeed, their main competitor was pencil and paper, because there was no more mechanical way to do it. So what they built was essentially a declarative programming language that looks very similar to those paper instructions, like "put down point A, go to B"; you plug in the measurements, the constraint solver recalculates the size of the pattern, and it draws something that you can just print, or easily transfer onto the fabric.
And they wanted a grammar that is approachable, so that you can easily use it even if you are not a computer scientist. Here are some examples of the language. You specify the origin first; then you say B is 10 steps to the right, using the x coordinate, because x and y are what people remember, so there is no confusion; C would be 20 units to the top. This is the program you write, and this is what the system draws. Then you can say: draw a line through A and B. L2 goes from A to a point that is five units to the right of C. Then you draw a curve from B to C that is parallel to L1 at B. And then you say D is the intersection point: it lies on the curve we just drew and on L2. You can even define a function if you want. After you've drawn everything, you specify the pattern, and this is the final product which the tool outputs. So this was perhaps the most unusual project, but clearly useful, because there was a customer who needed it: a friend of the guy who built it actually makes shirts and struggles with these drawings on a regular basis. The best thing for your final project would be to find somebody who is a customer of this kind and define a language for them. So what other things might you want to look at? Low-risk projects would be taking something that you have already built in 164, some of those PAs, and extending it further: add more features to Prolog, or more features to the layout engine, essentially making it more of a full-fledged browser. Another class of projects is extending some existing DSL. jQuery, for example, has an architecture for plugins; you can extend it by writing your own functionality. In the process you learn how jQuery is implemented and what features people add to it, and you effectively figure out from the inside how a modern DSL is designed.
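The geometric core of such a language can be sketched with a few helpers; the names `right`, `up`, and `intersect` are my own for illustration, not Grainline's actual syntax.

```python
# Sketch of the geometric core of a Grainline-style pattern language:
# named points derived from measurements, plus line intersection.
# Helper names and the measurement are invented for illustration.

def right(p, d): return (p[0] + d, p[1])
def up(p, d):    return (p[0], p[1] + d)

def intersect(p1, p2, p3, p4):
    """Intersection of line p1-p2 with line p3-p4 (assumed not parallel)."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / denom
    return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

waist = 10                           # a measurement, plugged in up front
A = (0, 0)                           # the origin
B = right(A, waist)                  # B is `waist` units to the right of A
C = up(A, 20)                        # C is 20 units above A
D = intersect(B, C, (5, 0), (5, 1))  # where line B-C crosses the line x = 5
print(D)                             # → (5.0, 10.0)
```

Changing the measurement rescales every derived point automatically, which is exactly the declarative payoff over pencil and paper.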
And you could rethink an entire DSL: do jQuery or Protovis better. There are already languages on the web that take jQuery and its popularity and say, here is how you should really do it. You could grow the 164 language. There are various features that other languages have that 164 doesn't; for example, it's not so easy to embed a language in it, because the syntax is not as flexible as in Ruby. Adding that would be fun. A lot of good stuff also comes from applying 164 technology to making bug finding easier: finding memory leaks, or generating interface programs, for example. There was one really cool project in a previous year along these lines. Often, when you want to connect C libraries to Python, you need to write the interfaces that transfer Python data structures to C, and that's a lot of work. So what those students did was write a tool that reads the include files for the C libraries and generates those interfaces automatically. In most cases the result is correct; in some cases you need to tweak it a little, but essentially it is a generator of those connectors between high-level languages like Python and C. Another possibility is to take 164 and compile it into a faster language, so that you don't need to stick with a slow interpreter. You probably cannot do it really well for the entire 164 language, but for some common patterns in it you could do a really good job. You could translate it to a language that is still essentially dynamic, like Python, but has a much faster interpreter. So you go from 164 to, say, Python, not through interpretation but through compilation, and you could probably get a factor of 10, maybe 50, speedup that way.
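A toy version of the header-reading idea: regex-scan simple C prototypes and emit Python `ctypes` binding code. Real headers need a real parser; this sketch handles only the simplest `int f(int, int);` style, and the library name is made up.

```python
import re

# Toy sketch of the interface-generator idea: scan simple C
# prototypes and emit Python ctypes bindings as source text.
# Only handles unadorned `ret name(args);` prototypes.

C_TO_CTYPES = {"int": "ctypes.c_int", "double": "ctypes.c_double"}

PROTO = re.compile(r"(\w+)\s+(\w+)\s*\(([^)]*)\)\s*;")

def bindings(header_text, libname="libfoo.so"):   # libname is hypothetical
    out = ["import ctypes", f"lib = ctypes.CDLL({libname!r})"]
    for ret, name, args in PROTO.findall(header_text):
        argtypes = [C_TO_CTYPES[a.strip()] for a in args.split(",") if a.strip()]
        out.append(f"lib.{name}.restype = {C_TO_CTYPES[ret]}")
        out.append(f"lib.{name}.argtypes = [{', '.join(argtypes)}]")
    return "\n".join(out)

print(bindings("int add(int, int);\ndouble scale(double);"))
```

Even this crude version shows the shape of the project: the generator does the boring, mechanical part, and the "tweak it a little" cases are where the interesting work lies.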
And then, as you write programs in 164, you run into bugs, so you could write debuggers, both for people who program in 164 and perhaps also for people who write interpreters for 164. Think of all the bugs you had to fix: what would be a good way to make it easier for future students to find and fix those bugs, by visualizing the environments, but maybe through other means as well? Coroutines, for example, are one case where you had quite a few corner cases to figure out. Can anybody think of a tool that would make it easier to debug those interpreters? We could do visualization of the environment; in that case we would just need to give you functions that you call at the suitable points. But what about giving you assertions that you could check on the environment? You run your program, you create your environment, and it seems okay, but perhaps there are broken links in it, and other bad things that you only discover when you run the program on the right input. The broken thing in the environment may already be there; your environment may always be broken, but you never notice until you find the right test. So if we gave you tools that walk the environment and check for inconsistencies, that might be useful. Could we also do it by having everybody in the class publish their environments on our server, and then somehow cross-checking them against each other? Perhaps we give you a program and a set of points at which you print out your environment, and all the environments from all the students are automatically compared at those points. If yours differs, you realize: oh, I'm doing something wrong. This is an example of how the server, the abundance of cycles, and the fact that there are many of you could be used for better debugging.
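An environment consistency checker of this kind might be sketched as follows, assuming interpreter environments are dicts whose `__parent__` entry links to the enclosing scope; that representation is an assumption, not necessarily the one your interpreter uses.

```python
# Sketch of an environment consistency checker. Assumes environments
# are dicts linked by a "__parent__" entry; this representation is
# invented for illustration.

def check_env(env):
    """Return a list of problems: cycles or non-dict parent links."""
    problems, seen = [], set()
    while env is not None:
        if id(env) in seen:
            problems.append("cycle in parent chain")
            break
        seen.add(id(env))
        parent = env.get("__parent__")
        if parent is not None and not isinstance(parent, dict):
            problems.append(f"broken parent link: {parent!r}")
            break
        env = parent
    return problems

glob = {"x": 1, "__parent__": None}
local = {"y": 2, "__parent__": glob}
print(check_env(local))           # → []

glob["__parent__"] = local        # introduce a cycle
print(check_env(local))           # → ['cycle in parent chain']
```

The key point from the discussion is the last one: the cycle exists before any lookup goes wrong, so a walker like this finds the latent bug before the "right test" ever runs.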
And it models the sort of programming we will see more and more in the future, where the internet puts all the programmers together. Now, the environments may not be directly comparable: they could be represented quite differently in your programs and therefore printed differently. We could perhaps normalize the printing, though not the representation; but with enough students you will probably find an environment that looks like yours, or mostly like yours except at one point, and that point might be your bug. So this way you could perhaps debug faster. Any other ideas on how we could help debug the first three projects? Perhaps we could do what we at one point did manually: give you the sequence of steps that the solution needs to go through, for Prolog. Perhaps we could let you query our reference solution on the server, and you could compare whether you generate the same trace as our solution does. This would be useful not only in the context of the class, because you may want to debug your programs the same way in the future. Imagine you have a large, complex program that is working correctly. You may want to record all the traces that it generates at suitable program points. Now, when a bug shows up, or even before it shows up, you nightly run the program, into which you have in the meantime made changes, and compare it against those stored traces. If things suddenly differ, you have a suspicion that there is a bug. Again, this is an example of the kind of testing that will likely happen in the future, because we will have the cycles to do it, and we don't do it yet. So this is all open as your project in the class. That's all I have, so you can officially stop now; but if you want to come down and suggest your ideas for a project, we can talk more and see whether they are suitable. Thank you.
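The record-and-compare idea can be sketched as a tiny golden-trace harness; the `Tracer` API here is invented for illustration.

```python
import json

# Sketch of golden-trace regression testing: record values at chosen
# program points on a known-good run, then compare later runs against
# the stored trace. The Tracer API is invented for illustration.

class Tracer:
    def __init__(self):
        self.trace = []

    def point(self, label, value):
        """Record a value at a named program point."""
        self.trace.append((label, value))

    def save(self):
        """Serialize the trace, e.g. for nightly storage."""
        return json.dumps(self.trace)

    def diff(self, golden_json):
        """Return (golden, new) pairs wherever the runs disagree."""
        golden = [tuple(e) for e in json.loads(golden_json)]
        return [(g, n) for g, n in zip(golden, self.trace) if g != n]

def run(tracer, bug=False):
    total = 0
    for i in range(3):
        total += i * (2 if bug else 1)   # `bug` simulates a later regression
        tracer.point("loop", total)

good = Tracer(); run(good)
golden = good.save()                     # stored from the known-good run

later = Tracer(); run(later, bug=True)   # tonight's changed build
print(later.diff(golden))                # mismatches point at the bug
```

The labels tell you which program point diverged first, which is usually far more useful than a failing end-to-end assertion.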