Good Wednesday afternoon to everyone. I think we're actually doing really well in terms of content and where we're at, so I'm super happy. I'll take questions on any class stuff for the next three minutes, and then we'll move forward.

Yes? So you're checking for the double hash, right? Let's see, project three. In project three, a double hash is only to come at the end of the input, after a line. So after you've read a whole line of a rule, if the next thing is a single hash, you've read a hash that doesn't correspond to a token, so that's going to be an error anyway. And you specifically don't have to code in any error cases: you can assume that all the input we give you is going to be valid according to our grammar.

I would say testing first. There's a whole testing methodology, actually a software development methodology, called test-driven development, where you write the test cases first and then you write the code that satisfies that test case. When you're done, you write a new test case. That way, if later you decide to refactor or change your code, you'll know if anything breaks, because you already have a nice test suite. So I definitely think that's a good way to go. Write a super simple test case that you want to pass, add that to the test suite so that the script will run and test everything, and then do just the development you need to get that to work. Because here's the big thing that people do: if you go on a monster coding spree and never stop and ask, (a) does this compile, and (b) does this do what I think it does, then you have to fix every single bug in every single line of code that you wrote. But if you write 10 lines of code and check that those 10 lines are doing exactly what you think they should be doing, then
move on to the next 10 lines of code, you've split it up, and now you're building on a good base. That's how I code: basically function-wise. I write a function, then write some tests to make sure that function does what it's supposed to do, then I write another function, and I start piecing the functions together.

Any other questions? Project three? Midterm next Friday, correct. Yes, so the way the midterm review will work: we will post a sample midterm, which will basically be last year's midterm, and then in the recitation sections they'll go over that midterm. So that will be prep for the midterm. I highly recommend attending a recitation section; it's gonna be good.

Yes, definitely: first and follow sets and predictive recursive descent parsers will be on there. Usually I cover right up until the very last day of class before the midterm. In this case we're not doing a review, so maybe, say, Monday. I don't want to jam too much in there. But definitely everything we've covered in syntax analysis. We're going to finish syntax analysis today and start moving on to semantics, so maybe some stuff could be pulled from semantics. Any questions?

Okay. So we went through an example of actually creating a predictive recursive descent parser, and those of you that have checked out homework four will see that there's more practice on this: here's a grammar, do the first sets, do the follow sets, and write the whole predictive recursive descent parser. That's the goal there, so that you get to practice with that.

Okay, so for a real-world example of this: email addresses. I think in maybe the first class, or maybe when we talked about regular expressions, somebody, I think from this side of the room, because that's the only way I know you guys, based on where you sit, brought up the suggestion of using regular expressions to
validate emails. So why would you ever want to validate an email? Yeah, but why validate it? Why not just try sending? So you know you're sending to the right address. That's more of a user-assistance kind of feature, right? If the user maybe mistyped, or is just typing garbage, and we know we can't possibly send an email to that address, then why bother sending it? You should tell the user in the GUI, hey, this is an invalid email address. That's one good reason. Any other reasons? Verifying who it's from? Yeah, it could go either way, right? I actually have some emails that I definitely don't want to go to spam, so I have special filters. Or maybe you're trying to figure out what emails are being sent to a certain address. Password validation? So how is that useful? Like, if you're not allowed to have certain things in it? Yeah, that could actually be a good idea. I think an email address would be a pretty bad password, because email addresses are publicly available, so maybe you want to disallow that.

Cool. So what makes up an email address? A string of text, an at sign, and a domain name, right? So that's the key question. We want to understand how to parse it, either to say whether it's valid or to get something out of it: maybe you want the username, or maybe you want to understand what domain this email address lives on. Say we're doing analytics over all of our users and trying to say how many are using Gmail accounts versus Hotmail accounts; we want to parse the email address to figure that out. It seems incredibly easy, right? Given the knowledge you have, what would the regular expression to parse this email address be? I mean, what would that kind of look
like? Let's talk about that. So let's write an email regex. An alphanumeric string, so we'll use actual regex syntax: this is one character group that represents the characters lowercase a through z, capital A through Z, and, let's say, zero through nine. So alphanumeric, then plus, which is one or more. Followed by what? Yeah, the at sign. And what's the domain name? Is that it, that's the only thing? Yeah, dots. And domain names are three? Oh, this doesn't mean three. This represents the characters a or b or c or d all the way through z, capital A or capital B all the way through Z, zero through nine, or dot; that whole or-expression, one or more times. What about requiring a domain like dot com, would that work? It used to, yes, but now you can buy your own top-level domain, so you have all kinds of stuff. Yeah, because it could just be the dot, right? So this doesn't really capture it, but it all depends on how we want to do it.

Let's make it even more complicated. When you get an email, it usually tells you the person's name, right? How does it actually know that? The header information? Kind of, yes: there is a To header, and the To header has the email address. But that email address can be in one of two forms: you can have a name, followed by an opening angle bracket, then the email, followed by a closing bracket.

So let's think about this regular expression that we've just written. Another important thing to remember is that email predates the web. The web, www, HTML, all that stuff, is from about the '91 time frame; the internet and email existed before the web. So what an address means is: I want to send an electronic mail message to the user cse340 at the host example.com. That's basically what email addresses mean. So if your email is cse340student@gmail.com, you're essentially telling the world, I am user cse340student and my email
is located on gmail.com. It used to be, if you'll allow a history lesson, that to send an email you couldn't just name a specific domain. You'd have to say foo, and then bar, pound, baz, pound, email: you actually had to spell out the hops of servers to get your message from your computer to the other person's computer. Is that crazy to think about? Kids these days, it's so easy.

All right. So it turns out that, according to the email spec, double quote, cse 340 with a space, double quote, is valid in an email address. And maybe there's some poor user out there who has to use this because they have an email address with a space in it. I think that'd be a terrible idea, but as a business, if you're deciding who to send emails to, you're going to lose these customers because you can't properly recognize their email address. You can also do crazy stuff: you can include slashes, you can include equal signs, you can include an ampersand, and inside double quotes you can even include the at symbol. And you can use a backslash character to include a double quote inside the username portion, so this would be the user capital A, b, c, double quote, at symbol, example.com, at example dot... And then when we add the user's name, this would mean a person with the first name test and a second name of example, space, at hello, with an email address of test@example.com. These are all valid email addresses according to the spec, the RFC; there's a document out there that describes this exactly. Part of the problem is that we didn't invent this a year ago, 10 years ago, or 20 years ago. This was invented like 30 or 40 years ago, and so it has to stay backwards compatible with all of that.

So does our regular expression match any of these email addresses? No. Do we want it to? Yes.
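The mismatch between the board regex and the spec can be checked directly. The following is a minimal sketch, assuming the pattern on the board was roughly "alphanumerics, an at sign, then alphanumerics or dots"; the name `naive_is_valid` and the sample addresses are illustrative choices, not anything taken from a real validator.

```python
import re

# Assumed reconstruction of the regex written in class:
# one or more alphanumerics, an at sign, one or more alphanumerics or dots.
NAIVE_EMAIL = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z0-9.]+")


def naive_is_valid(address: str) -> bool:
    """Return True when the whole string matches the naive pattern."""
    return NAIVE_EMAIL.fullmatch(address) is not None


# Illustrative inputs (hypothetical, chosen to mirror the cases discussed).
samples = [
    "cse340@example.com",               # plain address
    '"cse 340"@example.com',            # quoted username containing a space
    '"abc@def"@example.com',            # at sign inside the quoted username
    "test example <test@example.com>",  # display name plus angle brackets
    "a@.",                              # domain that is only a dot
]

for sample in samples:
    print(f"{sample!r}: {naive_is_valid(sample)}")
```

Under these assumptions the pattern accepts only the plain address among the spec-style forms, and it also accepts `a@.`, which is the "it could just be the dot" problem raised in class.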
There are two ways to do that. One way is to say no: you just check if there's an at sign and that's it, then you try sending the email, and if it fails, it fails. That's definitely one way to do it. A more robust solution, if you don't want to waste a bunch of emails, and it's better for your users, is to be able to validate these properly.

So there's a company called Mailgun, and they provide email services as an API, so you can use their services for sending out emails. One of the things they have is a really cool tool to validate email addresses, based on their experience sending out and parsing tons and tons of emails. So how do you think they implemented their parser? Was it a regular expression engine? Was it more complex than that? What do you think they use? Almost, yes. I think it's without the predictive part, but they use a recursive descent parser. They defined a context-free grammar for the email language, and they developed a parser to parse it. It doesn't have a predictive component. You can go check it out and see what exactly an email address is.

So what I did is I went through and ripped out the context-free grammar that they created for these email addresses from their Python code, I think it was, in their tool, and then I tweaked it a little bit to make it predictive. I wanted another example of first sets, follow sets, and recursive descent parsers, so we're going to go over the grammar, and I'm going to leave the rest as an exercise for you to practice on.

They have tokens like quoted strings: a quoted string is a string that's inside double quotes, so you can define that using a regular expression. An atom is something that does not have quotes around it, and a dot-atom is an atom with the dot character in it, so any number of characters with a dot inside. And whitespace. And then they have the rules.
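Before the rules, the token definitions just described can be sketched as regular expressions. This is a simplified sketch under stated assumptions: the character set in `ATEXT` follows the atom characters listed in RFC 5322, the dot-atom pattern requires at least one dot (matching the description above rather than the RFC, where a single atom also counts), and the names `ATEXT`, `classify`, and the token labels are hypothetical. It is not Mailgun's actual code.

```python
import re

# Characters allowed in an unquoted atom (assumed from RFC 5322 "atext").
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

ATOM = re.compile(rf"{ATEXT}+")
# An atom "with the dot character in it": atoms joined by single dots.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)+")
# Double-quoted text; a backslash escapes the next character.
QUOTED_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')
WHITESPACE = re.compile(r"[ \t]+")


def classify(lexeme: str) -> str:
    """Label a single lexeme with the first token type that matches it fully."""
    if QUOTED_STRING.fullmatch(lexeme):
        return "QUOTED_STRING"
    if DOT_ATOM.fullmatch(lexeme):
        return "DOT_ATOM"
    if ATOM.fullmatch(lexeme):
        return "ATOM"
    if WHITESPACE.fullmatch(lexeme):
        return "WHITESPACE"
    return "UNKNOWN"


for lexeme in ['"cse 340"', "example.com", "cse340", " ", "a b"]:
    print(f"{lexeme!r}: {classify(lexeme)}")
```

The point of the sketch is that each token is regular, so a lexer can recognize it with a regex; it is only the way tokens combine into addresses that needs a context-free grammar and a parser.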
Okay, an address is either a name address from the RFC, or a name address of what they call lax, which doesn't actually conform to the RFC but which a lot of email servers still produce and accept, or it's just an address. A name address is a display name followed by an angle address, where an angle address, as we saw, is that bracketed form, or it's just an angle address. An RFC display name is a word followed by a display name RFC list, or whitespace, a word, and a display name RFC list. And this is the entire context-free grammar. It's kind of cool, right? This is emails: you implement this parser and now you can parse emails. Oh, and then I made it simpler, so it doesn't have all this stuff. We're not going to go through this, but you can calculate first sets for this, you can calculate follow sets, and then you can finally get all of the parts, display names and everything. I think it'll be a really good check for you, to make sure you're on the same page with first and follow sets and predictive recursive descent parsers, and also to show you that this isn't arbitrary stuff that I'm teaching you: people actually do this. Granted, you don't technically make money off an open-source library, but it's from a company that hopefully is making money.

Okay, so to sum up. We've never talked about a formal algorithm for how to actually calculate the rules for predictive recursive descent parsers; we kind of intuitioned our way through it and talked about general rules. At a high level, for every non-terminal in our grammar we want to create a function named parse plus that non-terminal. Super simple: that's literally the task of writing a parser, being able to write this function for every non-terminal. Then, for every production rule A → α, where α is a sequence of terminals and non-terminals just like before, you have in your code basically: if getToken() is in FIRST(α), then choose the production
rule A → α, right? And this assumes that you've already proved that, yes, I can write a predictive recursive descent parser for this grammar. Then, for every terminal and non-terminal a in α, in order: if a is a non-terminal, you call parse_a in that function; if it's a terminal, you check that getToken() is equal to a, to make sure that getToken() returns the correct token. And then we have to take care of the case where ε is in FIRST(α): there we check whether getToken() is in FOLLOW(A), and if it is, then we know that this is the production rule to use. If getToken() is not in FIRST(A) and ε is not in FIRST(A) either, then it's a syntax error. Otherwise, if ε is in FIRST(A) and getToken() is in neither FIRST(A) nor FOLLOW(A), then there's also a syntax error. So it's just a more succinct way of describing exactly how you go about creating these parsing methods.

Any questions on these? Parse trees? Yeah, you make the parse trees; we'll see that in project five. Once you have the parser written correctly, it's really not hard. It depends on exactly how you want to do it, but every parse function could return a node. So parse_A returns an A node, and when you call, let's say, parse_A inside your parse_S, you have your S node and you make S's left child be that A; then you see a little b, and so you create a node for that. It's actually really straightforward, because the structure of your function calls is the structure of the tree, so you just have to create a tree from that. You'll get a lot of practice on that in project five, so watch out.

Cool. Okay, semantics. What does semantics mean? Yes: generally, semantics means, what does it mean? Up until now we've really just talked about, in some sense, basic mechanics. We talked about lexers and tokens, right, which say: what does this thing
look like? What do numbers look like, what do words look like, what do identifiers look like? Then we built that up and said, okay, what's a valid sequence of tokens? But is every sequence of tokens actually valid? Say I have a C program and I have some int foo, and then I say foo is equal to the string bar. Is this valid syntactically? Yeah, assuming this is inside a function or whatever. Is it valid semantically? Well, what does it mean to assign a string to an integer? Or, to get even weirder, let's say foo and bar are functions that do whatever, and then later on I have some variable is equal to foo plus bar. Is it valid syntactically? Yeah. Does it make any sense? What does it mean to add two functions together? I mean, you could think of a programming language where maybe this does make sense: maybe you're chaining those function calls, so you're saying first call foo and then call bar, or call foo, pass the output of foo to bar, and return the output of bar. You could come up with crazy things that this does. But in a normal C-like language this is nonsensical; it doesn't mean anything. It'd be like trying to call 10 as a function.

So semantics is all about this problem: what do things actually mean? We talked about syntax analysis for turning a sequence of tokens into a parse tree; semantics is really about what that parse tree means.

So how can we define language semantics? I want you to think back to when you were first learning your first programming language, be it Java or C or C++ or whatever your first language was. How did you learn? By learning the syntax for hello world, maybe writing a super simple program. But where did you learn that from? Like, from a blog post? I'm just curious. Like, this
is my example, here it is, don't worry about this weird setup, the public static void main stuff, just know that it works. Right, so you started with an example. Was that in a class? Yeah. What other ways did people learn? So in a class you'd be taught the syntax a little bit, but also the semantics of what things mean. It's actually a lot, if you think about when they show you that first program. Not only is there a lot of garbage, but there are double quotes all over the place, there are semicolons, there are curly braces. These aren't even things normal people use on a keyboard, right? But they mean special things to the compiler, and so they mean special things to us, the programmers. What are some other ways? Tutorials, yes, going through some tutorials. Having your IDE yell at you, so just typing random stuff and hoping that it works. Yeah, books. Seeing something that's already made, and then hopefully having something that tells you what things do.

So think back on that. What do you want from language semantics? Why do you even care about the language semantics? So it makes sense to me when I write it? To the end user of the application? Yeah, but actually you have to get past the compiler first before you ever get it into the hands of a user. Compiler or interpreter, right? So in some sense you're trying to get past the computer. What are some good things? Have you read good programming books and bad programming books? Or maybe I should ask it this way: have you used resources that are good and resources that are bad for learning a programming language, and what are some big differences there? If I told you, here's how to learn some new programming language, let's say the language
is one you've never seen, like Lisp: what would you want from that document? Examples, okay, cool. What else? Definitions of what things do, yeah. So you want it clear. I mean, do you want the book to just ramble, or tell you contradictory things, or have no structure? No, because you as the programmer need to be able to write the program. You should know what's valid as far as lexing and syntax, and we've seen there's a formal way to define those things. But in addition to that, we have: what does the program actually do? For that we really want a clear description of what things do, even simple things like what it means when you have the plus operator in between two variables. We need something that is clearly defined, so that we the programmers can write a program and the compiler will actually create a program that does what we think it should do.

So we need things to be precise. We don't want any ambiguity, like, oh yeah, it just adds numbers. If you saw in a C book that plus just adds two numbers, that doesn't tell you anything about overflow, about what happens once you get outside of 32-bit integers. A very large number plus a very large number in C is usually a negative number, because it's going to wrap around. If you're not aware of that, you're going to think the program is broken or not doing what it should.

We also want it to be predictable. That gets at usability a little bit: once we think we know what something should do, we don't want it to change when we run it. And it's important to think about the difference between a language and a compiler. The C language is a standard, say C99, and then you have many different compilers that implement that standard: Visual Studio's C compiler, GCC, and there's
Clang now, right? Different compilers. So it should be the case that if I write a semantically valid program and compile and run it with this compiler, and then I take that program and compile it with a different compiler, it should still do the same thing.

It should also be complete. Why do we want completeness? Yeah: otherwise you can be predictable and precise about the things that you describe, but for anything outside of that you have no idea what's going to happen. That's actually a big problem in C, because some things are undefined by the language spec. It is complete in the sense that it says, hey, if you ever do this, who knows what happens. The problem is that you as a programmer come to rely on the specific behavior of one compiler, and then you take another compiler and things break.

So how do you specify language semantics? What are some ways we can do that? You're a language designer, and you've come up with the best language. I was going to say it's going to make you really rich; I don't know that language designers are necessarily super rich, but as a boon to humanity you are releasing this awesome programming language, and you want to describe to me how to write a program in it. So how do you actually specify the semantics of the language, given these criteria: you want it to be precise, predictable, and complete? A list of, like, variable types? So, not specifically what you have to describe, but how can you specify it? Some rules, right, a bunch of rules. And written in what? Probably English, I would say, is the first start. You write this huge specification document in English that specifies the language and the exact semantics of everything, and you give that to me, and hopefully by reading that document I can know exactly what to do. So what are the pros and cons of that? It may not be complete, yeah. Hopefully, let's say you created it, you're gonna make sure
that it's at least kind of complete. It takes a lot of rules, yeah. It's going to be long, really long. Have you read any of the language specifications? There's the ECMAScript (JavaScript) specification, there's the C99 specification. How long are they? Pretty long. Are they easy to read? It's kind of like reading a legal document: it has to define all its terms and be very, very precise about what it means, and English, as we know, is very ambiguous.

Cool, so you can do that. What are other ways? Yeah, besides the English spec. Any Ruby people? How is a Ruby program defined? Or originally, I guess; I think things have changed since, but when Ruby first came out, how was Ruby defined? I think Python too, when it first came out. So you create a language, and you have an idea of how it works, but how do you actually implement it? When you come up with the language, do you just design it all in your head? Part of it, yes, but at that point it's just in your head. How do you actually get it to be useful to other people? Documentation? Okay, so if I just give you a big document that says here's my new awesome Adam language, you'd be able to go write programs and make something with it? Give examples? But if I give you an example, what do you do with that example? You need a compiler. You need a compiler or interpreter that you can give somebody and say, here's my compiler or interpreter, so that they can play with it. You couldn't just give somebody all the docs and say, here's my awesome language, it's the best language.

Actually, this is how early Python and Ruby did it. Instead of a whole English document about what a Ruby program is and what a Ruby program does, they said: here's a Ruby interpreter, and whatever that accepts is a Ruby program. That's a reference implementation: here's the implementation of this language, here's a compiler, and whatever that compiler does
is good, and that is a Ruby program. So whenever you have questions on semantics, write an example and see what it does. Whatever it does, that's what a Ruby program should do.

Okay, the third one is really crazy: you can actually define it formally, with math. You can specify exactly what the semantics are for every operation of your program. I'll show you an example; basically, you define an abstract machine, and then you define how the state of that machine changes through each execution step in the program. Then you can model basically everything within that basic idea.

Cool, so, English specification. The C99 language spec is 538 pages long. It's the C language spec that was defined in 1999. I think there's an updated one; is it C11, or 14, or is that C++14? Is there a new C standard? C11, okay. So C dates back to like the 70s, and every once in a while they get together and say, okay, what new features do we want in C? I think maybe C99 added bools; I don't know if that was one of them, but I think they added that at that point, like, this is handy. Otherwise your language just stays stuck with whatever features you had when you originally started the language. And it contains super interesting things like this: An identifier can denote an object; a function; a tag or a member of a structure, union, or enumeration; a typedef name; a label name; a macro name; or a macro parameter. The same identifier can denote different entities at different points in the program. A member of an enumeration is called an enumeration constant. Macro names and macro parameters are not considered further here, because prior to the semantic phase of program translation any occurrences of macro names in the source file are replaced by the preprocessing token sequences that
constitute their macro definitions. This is just about what an identifier is. And this gets to preciseness: if you have any questions about what an identifier is, it sure as heck better be in this paragraph, or else you're never going to know. But it's saying a lot of different things, and we can assume there are also long and complicated descriptions of what an object is, what a function is, a tag, a member of a structure, union, or enumeration, a typedef name, all these kinds of things, plus everything about macros.

So what are some of the pros and cons of English specifications? Suppose it didn't say anything here about what to do for macro names and macro parameters. If the specification doesn't say, but the compiler needs to do something in that case, then the compiler writers have to just make up what they think is right. So that's where ambiguity comes in, and it's really bad, because you the programmer may not know exactly what the spec is telling you, the compiler writer may not know exactly what the spec says, and multiple compiler writers may all disagree about what the spec should say and do. It's a horrible situation.

What else? Yeah, it's like 500 pages. Starting with me: I've never read this entire specification. The K&R book that I recommend, that's what I use for my C, and it's much shorter, which is much more reasonable. But this is going to be so long and so complete that nobody's actually going to go through it except the compiler writers, because they have to; this is their specification.

So it can be ambiguous, it can be incorrect, or it can even be ignored. This happens a lot on the web, where the spec says HTML is this, and a browser says, yeah, that's cool, but I'm going to do a blink tag. Then they're told, no, that's not in the spec, and they say, well, we did it anyway and people are using it. So eventually, after five or ten years, they incorporate these new things into
the specification. That's honestly why HTML is so crazy: you have companies doing random stuff. And exactly, there are cases that the spec doesn't mention, and if the spec doesn't mention anything, then how do you know what should happen? But an English spec is very good in that you can have multiple implementations of the same language: assuming the specification is complete and precise, you can actually have multiple implementations.

Okay, so, reference implementation. Up until 2011, for Ruby, the interpreter written by Matz, the guy who created Ruby, was the reference implementation. Whatever this thing runs is a Ruby program; whatever it doesn't run is not a Ruby program. That's it. So what are some of the benefits here? Yeah, it's easy: just run it. That's way better than 500 pages. What's another benefit? Yeah, you want to explain that a little bit more? Yes, so that's a pro, I guess, if you're the compiler writer. The idea is, let's say it turns out that in Ruby, for whatever reason, there was a bug in the interpreter and four plus four is nine. But that's what the official reference implementation does, so people say, okay, that's the behavior of Ruby, and they start writing programs that rely on the fact that four plus four equals nine. Then you can't ever change it and fix that bug, because it's part of the language, and so it stays there forever. So yes, that is one of the huge problems with this as well, one of the bad sides: bugs become accepted parts of the language, because hey, that's what happens.

What else? What are some other pros or cons, either side? Yeah, it can't be ignored, and there's nothing that it doesn't specify, because it specifies everything: if a program works, then the way it ran is how it works, and if the implementation doesn't accept it, then it doesn't accept it, right?
There's little ambiguity, in the sense that if you ever have a question and you can write a program that answers that question, you can test it. What are some other cons? Yeah: how do you study and try to understand the language? Without any other documentation, you're at the mercy of basically generating test cases and seeing what happens. Yeah, trial and error.

Let's see what other things we have here. So yes, the benefit is that it's precisely specified on a given input: if you give me an input, I know exactly what the semantics of that Ruby program are, because I have an implementation I can run. Bugs, we covered bugs. What about portability? What if I wrote my awesome Ruby interpreter in a crazy Lisp dialect that only runs on Lisp machines, which don't even really exist anymore, so it doesn't work on yours? Or, a more realistic example, I'm writing a C compiler in C, but I'm using a bunch of Linux features, so now nobody on Windows can run it, and they can never find out what a program means.

Okay, so, formal specification. There are multiple different ways to do this, but you specify the semantics of the program formally. That has a lot of benefits, because all parts of the language have an exact definition. There's no English ambiguity about what things do or don't do; it's very well specified. And it has a side benefit: you can actually prove properties about the language and about programs written in the language. You can say things like, it's not possible in this language to create a program that does X, because I have a formally defined language. I'm not saying it's easy, but we can do it. It can be kind of difficult to understand, though. I use this example: this is from one of the professors at UC Santa Barbara, Ben Hardekopf, he and his group. I'm trying to zoom in, so, like, triple tap or do some weird Apple thing. No, all
my gestures are failing. Serious zoom... see, they don't work. Okay.

Okay, so here we have, this is all of their semantics for JavaScript, the JavaScript language. So here we have the state: we're transitioning this abstract machine from one state to another, and depending on what type of rule it is, this is how we actually move. I think this one is like the Boolean, and this is how you move the states. You could probably spend, like, a month deconstructing all the information that's in here, but it should map to what the specification is.

But what's one of the problems with this? Yes, it is super weird. I mean, this takes a semester-long grad class to kind of understand what's going on. I'm not putting it up here so you understand it; I'm putting it up here so you understand kind of what it looks like. So what would be some pros and cons of this? Yes, well, I mean, it is documentation, but you need documentation to understand it. Yes, or a degree in interpretation and program semantics. Yes, yeah, so it's hard to understand for an average developer, right? At least with English, that first English sentence was very long, but at least you understood the words in that sentence, right? Here you have to learn basically a whole new language in order to even understand what they're doing.

Also, this is for a real programming language that exists. So how do we know that they actually did this correctly, and that there's not a bug in their formalism? We don't know. We have to hope, right? Yeah, you know, if there's any problem here, there's nothing to execute. How do you execute this thing? You can write a program that executes this thing, right? But if you're just giving me this, it's like, hmm, I will use a different language.

And actually, what's funny is that the approaches now are kind of merging, right? Because specifications and semantics go beyond, like I said, programming languages, right? They also
cover things like HTML and important file formats and those kinds of things. And so what's happening now more often is you have English, not formal, but English specifications, and in addition to that you also have a reference implementation. So they usually say, hey, here's this thing, and here's C code that's exactly how this thing should work, and so they give you both. That way people can use the reference implementation as a starting point and use it in their own projects, but they also have the English one to fall back on, so that it defines all the cases. Cool. All right, so we'll stop on the crazy math and we'll get more into semantics on right