Alright folks, very sad day. This is our second-to-last class together. I know, you're all super sad. I think I'll be seeing some of you again soon. Okay, so very quickly, homework: you probably should have gotten the emails and seen on the website that homework four was released. If you haven't, go look at it and get started now. These are challenges that exercise the things we talk about in class, and sometimes they need time. You need time to think about these things, and you need time to get stuck. So you can log in with your username and password that are visible on the course submission website. Apparently all my emails to everyone went into the ether. I have no idea what happened to those. I sent out 120 emails and they went somewhere. So, once you're on this lovely server, let's see, is there anything interesting? You can check out... oh no, it's too big. The scoreboard is too big for my massive font size. You can run the score command to see there are currently nine levels, just ordered alphabetically. You can see that at least one student, maybe two... I can't tell what this is, we have to decrease the font size. There we go — I decreased the font size with command-minus. So you can see, yeah, two people have already solved all the challenges, so everything is doable. Another good trick is to use the scoreboard to see which levels are easier than others, and that can give you some direction on where to start and what to do. So, say you want to break a level. The levels are under the challenges directory. So here's the list of challenges. You'll see in each of these that the directory is owned by root and the group is just-execute-me. If we ls this directory, we will see an executable there called just-execute-me, which is setuid and setgid. So this means when you run this executable, it's going to run as the group just-execute-me.
And the goal is to call the wrapper in /usr/local/bin — this little wrapper application will add you to the group for that level. So when you do that, you'll know that you've actually beaten the level. Oh, this test file — it's not supposed to be here, but whatever. You can't see it anyway; it's a mystery file that I magically put in there. You can't cat that file either. Exactly, you can't. I can. I actually don't know what's in there, so I don't want to cat it. Maybe it's a password or something; better not to do that. So this is the idea. Every one of these directories — tidy-up, stolen-data, secure-this-house, search-brought-me, read-secret, just-execute-me, groups, basic-overflows — these all come with source code, and it's going to be fun to play around with them. It's due next Friday... no, not this Friday. It's due next Wednesday. I decided to give you guys extra time since it was late coming out, so it'll actually be due after the final. But I'm not going to go crazy: there's not going to be any lateness on this assignment. We'll take a snapshot of the scoreboard at that moment in time, and that's everyone's score on this assignment. Any other questions? You still have to turn in a README, right? Yes, you want to turn in a README because, like I said before, these are little puzzles, and I want to make sure that you actually solved the puzzle. The README will describe how you broke each of the levels. It should be fun; this is a fun assignment. Oh yeah, we have these people on the server. That's good. All right. What are we doing? Oh, teaching. Okay. For the last week, we are going to get a crash course in web security, mainly because it's one of my favorite topics. I guess technically the first thing I ever got paid to do professionally was write websites, so it has a special place in my heart. So: how the web evolved.
So this is a snapshot of the very first web browser. The web as we know it was created by Tim Berners-Lee, who was a research scientist working at CERN. Do you know what CERN is or does? What's their big thing? Yes, the LHC, the Large Hadron Collider — making particles hit each other, whatever physicists do nowadays. So they're an organization that manages and runs these huge projects, projects that cost billions and billions of dollars to run. And they had this interesting problem — this is around the '92, '93 timeframe — where researchers were continually joining and leaving projects and everything was in a state of flux. Nobody really knew: who do I go to to talk to the person in charge of X, Y, and Z? Nobody knew. So Tim Berners-Lee had this vision: wouldn't it be great if we had this directory that would be updatable — you could update it, you could have a home page, we could have a list of people, and you could link to other pages of other people. So he started writing... does anybody recognize this operating system? Anybody use this operating system? What is it? It's not the Mac. No, it's not MS-DOS. Come on, MS-DOS doesn't have that. I'll give you a hint: the name is on the picture. There are a lot of words on here. Somebody was close with Mac. What did Steve Jobs do when he was ousted from Apple? NeXT, yeah. He started a competing company called NeXT, and you can see it right up here — this is the logo of the NeXT operating system. So this was the NeXT OS they were running. Tim Berners-Lee had a NeXT system, and that's where he coded the very first web client and web server. So this is a picture of WorldWideWeb, which was a hypermedia browser slash editor. It's actually kind of interesting:
He already had this idea in his head that he wanted you to be able to edit sites as well as just browse them, which we've kind of lost over time. And this has the distinction of being the very first website — the first WWW website. A fun fact: I've played around with this WorldWideWeb browser, and it actually still works. You have to run NeXT in a virtual machine, but you can still use it to browse the web, which is ridiculous. And this is Sir Tim Berners-Lee — he was knighted, which is pretty cool. He had this first idea in — I guess I have the timeframe a little bit wrong — in '89, and he has this great book. If you're interested in the history of the web, I highly, highly recommend it: Weaving the Web. It talks about the genesis of the web. And it's really crazy to think that he created this essentially really small thing just for CERN in like 1990, and by the end of the '90s, literally a decade later, you had the dot-com boom and the dot-com crash and all that stuff. It's really insane to think about — imagine something you design today being used by millions and millions of people all over the world in less than a decade, which is pretty nuts. So the key idea is that Tim Berners-Lee didn't invent all of these things. He borrowed this concept of hypertext, which right now is a very natural concept: you have some text, it links somewhere, you click on it and it takes you somewhere else, right? But this was a pretty revolutionary idea. It goes all the way back to the '60s and '70s, and there had been some attempts at systems to do this, but it's very difficult to actually do.
He also had, basically, the internet — he didn't invent the internet, he didn't invent TCP/IP. And this is one of those things that's important to remember: when people talk about the internet, they're not talking about the web, right? The web is just another protocol that runs on top of the TCP/IP stack, but it now has the lion's share of almost everything there. So the idea grew into: how can you have universal access to a large universe of documents? You can have one document, and there can be links on that page that point you to other documents, and there are links on those pages too, creating this web, right? That's where the "web" part comes from. The problems that he had to solve, and that have to be solved, are: how do you name a resource? If I say, oh, you should go check out this web page — how do I actually tell you the name of that? And then, when you're on a web page, how do you know how to visit the other web pages that page is linking to? Naming things seems like a problem that is not a problem, but if you want everybody to be able to name things and name documents in this universe — think about the web as a universe — it becomes a difficult problem. Then: how do you request and serve a resource? Once you know who has the resource, how can you actually make the request to get it? And how do you create hypertext — something that says, if you want more information about this thing, go look at this other document? It's really crazy to think that these three things form the core nexus of the web as we know it today. How to name a resource was solved with URIs — originally URLs — Uniform Resource Identifiers. This is how you name things, how you find things.
HTTP was created as how you request and serve things, and HTML, the hypertext markup language, is how you actually create hypertext to view and display things. Questions on these? So this is actually one of the most challenging things about web security and web technology in general. If you want to be, say, an embedded systems person, you just write C code that runs on an embedded operating system, and that's kind of your life. Or if you want to develop back-end Java applications at some company, you're just going to write Java — you write interfaces, you write code. If you want to be a web developer, you have to master a huge list of technologies, the three core of which are URIs, HTTP, and HTML. So to even get started on the web, you need to learn three different technologies, and the list grows as you get more and more advanced. But they're really interrelated. A URI — and I'll use URI and URL interchangeably — is what we think of as the http://google.com thing. That is a URI. Once your browser has a URI, it knows how to make an HTTP request to that server, google.com in this case, to ask for that document. And what does the server reply with? An HTTP reply containing HTML. And that HTML document, if it has links to other documents, will contain further URIs, which our browser can turn into HTTP requests, which give us HTML responses. It's a nice life cycle where these three technologies continually interact. So you can think about a URI as metadata for how to find a resource. It answers this question of how to name something. Like, say I tell you: get document x, y, z on google.com.
I can just tell you that and you may have to translate it, but the URI is a nice standard that specifies how to do this. It answers: which server has it? Who do I talk to? (And "server" will then get translated to an IP address.) Who has this document that I care about, and how do I ask for it? With URIs, we usually think of HTTP. Has anybody seen any other types of URI that don't have http at the front? FTP — there are FTP links. What else? Anybody used a mailto link? If you click on one, it automatically pops up your mail client. That's a mailto URI. So this is the "how do I ask" part: do I talk HTTP to the server? Do I talk FTP? How do I actually talk to the server? Then: how can the server locate the resource? How does the server know how to find what I'm talking about? And again, as with everything when we talk about networks, all of this is defined in RFCs, so you can go look up the standards to know how to parse a URI into its constituent parts. The basic idea is we have the scheme, a colon, the authority, a slash, a path, a question mark, a query, a hash (pound) symbol, and then the fragment. The scheme is the protocol to use: HTTP; HTTPS, that's another one we didn't talk about; FTP; mailto. The authority is then the entity — who do I talk to. It's usually a server name... (is that somebody's alarm? or a carbon monoxide leak?) ...usually you'll see host names like google.com. After the host name you can also give a port. There's a default port for each scheme: HTTP is 80, FTP is 21 — I want to say, is that right? — and HTTPS is 443.
So there are default ports, but if you want to change that, you can add a colon after the host name with the port you want to talk on. You can even specify usernames and passwords and all kinds of stuff. The path is usually what we think of in a file system: a hierarchical structure separated by slashes. The query is used to pass non-hierarchical data. The super interesting thing is the fragment: if you see a link with a fragment in the URI, that fragment actually does not get sent to the server. The server has nothing to do with it. It's for your browser, when it displays the page, to take you to what they call a subsection or sub-resource. But the key thing is that the path and the query — basically everything after the authority — literally mean nothing to you, because you don't really know how to interpret that resource. The only thing that knows how to interpret it is the authority. google.com is the only one who knows what resource you're talking about when you say /foo/bar/12 and then a query string of whatever you want. You can change it, obviously — you control this URI — but fundamentally what it means is up to the server. So, some examples. This one has a scheme of foo, an authority of example.com on port 8042, a path of /over/there, a query of test=bar, and then it has a fragment. You can have an FTP URI; you can have a mailto. Now, how would you parse this one? What's the scheme here? And what's the authority? What was that — example.com? Are we sure? Example.com/test/example on port one? That makes sense, but let's say it's example.com — then what would the path be here? And what's the difference between a path and a route? Typically you think of a route on the server side — how to route the request. A path is what is specified in the URI.
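The scheme / authority / path / query / fragment breakdown above can be checked with Python's standard library URI parser. This is a minimal sketch using the foo://example.com:8042 example from the slides; the fragment value "frag" is made up, since the slide's actual fragment wasn't stated.

```python
# Splitting a URI into the parts described above:
# scheme :// authority / path ? query # fragment
from urllib.parse import urlparse

u = urlparse("foo://example.com:8042/over/there?test=bar#frag")
print(u.scheme)    # foo
print(u.hostname)  # example.com
print(u.port)      # 8042
print(u.path)      # /over/there
print(u.query)     # test=bar
print(u.fragment)  # frag
```

Note that the parser splits the authority further into host and port, falling back to the scheme's default port only at a higher layer — exactly the "colon after the host name" rule described next.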
So you can have a path that doesn't actually correspond to a route, because that depends, again, on how the server interprets it. But fundamentally, a well-designed web application will generate URIs whose paths correspond to routes in the application itself. So back to this question. We have a problem here, right? How do we parse this path? Is everything, including the question mark and what follows it, part of the path? Or is the path test/example.com1.html, and everything after the question mark is the query part? This is why we need to escape things. And again, this gets back to the issue that it's all about parsing — a lot of security problems are all about parsing. Here the question is: how do I parse this string into its constituent URI parts? I have some ambiguity, because this URI includes some reserved characters, so how I interpret it may not be how whoever created this URI intended. So we need some kind of encoding. All of these characters are special characters according to the URI spec. Just like when you have a string in C or Java contained in double quotes — how do you include the double-quote character in your string? You put a backslash before it. And then you have another problem: how do you do a backslash? You put another backslash before it, right? Similarly with URIs, but instead of backslashes, they use what's called percent encoding. The spec says you have to percent-encode anything that's not a letter, a digit, a dash, a dot, an underscore, or a tilde. And you percent-encode it by using the percent symbol followed by the hexadecimal representation of the byte, which means that an ampersand would be %26. And of course we have the same problem: how do you do the percent symbol itself? That's easy — it's %25.
A space character in ASCII has hexadecimal representation 20, and so on and so forth. So we've actually fixed this example, and depending on how we fixed it, that changes how we parse it. This one says: okay, the path is test/example%3a1.html — %3a is actually the colon symbol — and then the rest is the query. So this is a completely valid URI now. Everyone will know how to parse it. Whether example.com knows what to do with it — who knows. But it's at least a valid URI that we can parse. Questions on URIs? On to HTTP. The idea is, HTTP is essentially a protocol for how a web client can request some resource from a web server. It's based on TCP. So what guarantees does that give us when we're using HTTP? What was that? All the data gets there. Right — all the data gets there. When the server sees a request, it knows that it has all the bytes that the client actually sent. What else? A secure connection? Define "secure connection". Via the three-way handshake, the client knows that it's talking to the server, and the server knows it's talking to the client. I wouldn't call it secure necessarily, but I'd say there's an established connection. So we know the client is not spoofing a request — well, even then, you can maybe spoof a request, but we know we established a three-way handshake with an IP address, so we have that socket pair of source IP, source port, destination IP, destination port. And yes, all the bytes will be in order: the server knows that all the bytes the client sent will arrive in order, and there won't be anything out of order or messed up. So version 1.0 was standardized in May of 1996.
And it has had several revisions since then — you can see there's a revision in 1999. And version 2.0 is basically still under discussion and is not yet standardized. That's actually going to be a very different style of protocol — it's all binary based, there are multiple streams, and different things can happen — but it's still under discussion. So, as we've seen, HTTP runs over TCP. The server must first listen for incoming TCP connections on whatever port it's listening on, like the default port 80. It waits, waits, waits; the client opens a TCP connection to the server, doing the three-way handshake and everything we've talked about already, then sends the HTTP request to the server. The server reads the request, which will be in the HTTP request format, and sends an HTTP response back, possibly containing HTML content. It can also send back whatever it wants — a PDF, an image, whatever the server wants to do. Graphically, the way I think about it: you have the client, you have the server, an HTTP request goes to the server, and an HTTP response comes back to the client. In reality, when you actually start developing web applications, things get much more complicated. There may be a proxy, there may be a firewall between the client and the server — like at ASU, you go through one; actually, I don't know for certain that incoming traffic is checked. That would be a good experiment to run. Anyways — you may be talking through a firewall, and the server, or any other node in between, could be running a proxy which can cache. Oftentimes servers run with proxies in front of them so that responses can be cached, and that way not every request has to reach the server.
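The listen / connect / request / respond sequence above can be sketched end to end with Python's standard library, running both sides on localhost. The page body and handler are made up for the demo.

```python
# One full HTTP round trip: server listens on TCP, client connects,
# sends a GET request, and reads back the response.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from http.client import HTTPConnection

class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<h1>hello</h1>"
        self.send_response(200)                            # status line
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                             # response body
    def log_message(self, *args):                          # keep the demo quiet
        pass

# Server side: listen for incoming TCP connections (port 0 = pick a free port).
server = HTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: open a TCP connection, send the HTTP request, read the reply.
conn = HTTPConnection("127.0.0.1", server.server_address[1])
conn.request("GET", "/")
resp = conn.getresponse()
data = resp.read()
print(resp.status, data)  # 200 b'<h1>hello</h1>'
server.shutdown()
```

In a real deployment the proxies, firewalls, and caches described above all sit between those two calls, but the request/response shape stays the same.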
It gets more complicated: your browser often has a cache, so it will sometimes not request resources if it doesn't think it needs to. And so your request is actually going through the firewall, through the proxy, then to the server and all the way back, making the situation very, very complicated. And this doesn't even get into how the server makes the response — maybe the server talks to a web application that talks to a memcached machine and SQL servers, and all of this stuff creates an HTML response for you. So, when you break it down, an HTTP request is pretty simple. You have what they call a method — what you want to happen to this object. You have the resource that you want, which is derived from the URI. And you have the version of the protocol. Why is including a version of the protocol useful? So the version tells you how to parse the request? Yes — different requests require different things. Plus, you're designing a web protocol that you want to be used by millions and billions of people. Do you include a protocol version number in the requests? Say the protocol turns out really screwed up and broken because you didn't test it well enough, and you make updates — only half of those million people are actually going to update; the other half won't bother. So you're still going to have to serve those customers when they make requests, and for that you need versioning of the protocol. So it's multiple things, right? It's for servers, so they know how to parse the request. But also, if there's an older server and you've upgraded your browser and it's making HTTP 2.0 requests and the server only understands HTTP 1.0 requests, that's an issue. You would rather have it say "hey, I don't understand this version number you're using" than have it kind of work but weirdly break for other reasons. You'd rather just reject any request you don't understand, and putting the version in the protocol is a way of doing that.
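The three pieces just named — method, resource, and protocol version — sit together on the first line of every request, which is why even a server that doesn't understand your version can still find the version number. A minimal sketch (the header values are hypothetical):

```python
# The request format: start line (method, resource, version), then
# headers, each line ended by CRLF, then a blank line ending the headers.
request = (
    "GET / HTTP/1.1\r\n"         # method, resource, protocol version
    "Host: www.google.com\r\n"   # which site on this server we want
    "User-Agent: curl/7.68.0\r\n"
    "Accept-Encoding: gzip\r\n"
    "\r\n"                       # blank line: end of headers, no body for GET
)
# On the wire this is just bytes over the TCP connection:
wire = request.encode("ascii")
print(wire.split(b"\r\n")[0])  # b'GET / HTTP/1.1'
```

Splitting on CRLF recovers the start line no matter what version follows, which is the backwards-compatibility property discussed above.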
And we saw this even with IP — IP puts the protocol version in there. Then there's information about the client — the client actually sends information about itself — and an optional body, so the client can send data to the server in its request. The syntax is pretty easy: a start line, followed by headers, followed by the body, everything separated by CRLF. This is important because, as we saw, TCP is stream oriented — all the server knows is that it's got some bytes. Say the client is sending 50 bytes: is this all of the request? Is there going to be more to this request? So you have to have a way to know — either you specify the size in advance, and say "I'm going to send you 500 bytes" and then the other side waits for 500 bytes, or you do it like this, where every header is separated by a CRLF — carriage return, line feed, which is "\r\n". So, the method. Anybody who's done web stuff: what are the common HTTP methods in an HTTP request? What was that one — PATCH? PUT, DELETE? And what are the differences between them, or do you just know that there are different ones? So the difference is what the server does. In reality the server can do whatever it wants — it can serve them all the same, it can do completely random things depending on the method, and you can even write custom methods that aren't defined in the spec. GET is basically "hey, just give me whatever entity is referred to" — thinking about this in terms of documents, just give me that document. POST is basically asking the server to do some processing with the data that you're sending. PUT is usually storing something. And HEAD is identical to a GET but says "don't return me a body", which is kind of weird. But the core difference is that a GET request should be idempotent. Do we
remember what that means — idempotent? You probably learned about it in operating systems or something: doing something once or more than once is the exact same thing, right? So you should be able to make one GET request or 100 GET requests and the state of the server should remain the same. Fundamentally, it means a GET request should not change the state of the server, whereas a POST request can change the state of the server. So when you sign into a web application, you should be making a POST request, because you're changing state — you're now signed in. When you log out, that should also be a POST request. And this actually got, I believe it was Google, in trouble around 2006 or 2007. They were creating, I think, a plug-in or something that would let you browse the web faster, and what it would do is look at what page you were on and try to pre-fetch links that you were likely to click on — it would follow the GET links on that page. And of course not every website is very strict about following these rules, so this caused all kinds of havoc: it would log people out of the application they were using while they were using it, and they didn't understand why, and I think it would even do things like post comments on social media sites, because sites are not very strict about this. But if you're building stuff, you should be thinking about it like this. There are other methods — OPTIONS, DELETE, TRACE, CONNECT — and you can define arbitrary ones. So if we look at a simple example: the start line starts out with the method, so it says it's a GET. And something that's interesting when we talk about the version number: the version number is actually the very last part of the start line. So it's GET, a slash, and then HTTP/1.1. What does this mean when you upgrade to a new version of HTTP — what can you change and what can't you change? Say you're tasked
with coming up with HTTP 2.0: what can you change in this request? You can change the keywords you're using, but you can't change the fields. The fields are separated by a space: the method, then the location. Say that again? Right — you can change what those values are for your new protocol; you could use GETTER instead of GET. But you can't change the fact that there are three fields there, because the parser needs to know where to look. You can add after — probably after, right. So the idea is the syntax here has to remain exactly the same, because it needs to be backwards compatible. Any server that has ever spoken HTTP needs to be able to parse an HTTP 2.0 or 3.0 or 5.0 request far enough to get to the version number and say: oh, this is a 5.0 request, I don't know how to deal with this. Anything after that line you can change however you want. It's kind of interesting to think about: because they put the version number here, any future HTTP request has to start with something that has the same syntax — it doesn't necessarily have to correspond to a method and then a resource, but it has to parse the same way. Interesting to look at. But anyways: the first field is the method. The second field here is the resource — by default, slash is the resource that's requested; otherwise this will include the path and the query of the URI. Then we have some headers: the User-Agent is curl, and there's an Accept-Encoding header that tells the server what types of encoding we accept, because the server can actually encode the response — it can bzip2 or gzip it. And the other interesting thing: why do we need a Host field? Why do we need a Host header? How did we get to this stage? This request was just sent — walk me back, take me back in time to before this request was sent. What happened? We
established the TCP connection with the host. And how did we know how to establish a TCP connection to that host? The DNS lookup, yes. And one more step back — how did we know what host to do the DNS lookup on? It was typed into the browser address line: the URI. So the user typed in the URI. What would the URI be in this case — can you get it from this request? www.google.com — and then slash, because that's what we said here: everything on that second line is the path and the query. Cool. So the user — in this case using curl, but it could be a web browser; curl is just a command-line browser — types www.google.com/ and hits enter. Then the client does a DNS lookup for www.google.com. It gets back an IP address, and then it connects to what port on that server? This is all from the client, the browser, right — 80. So it connects to the IP address on port 80. So why does it send this Host header? Isn't the client already talking to www.google.com? It just made a DNS request and got back an IP address. Well — we looked at TCP — from the TCP connection's perspective, it's talking to an IP address on a port. From the client's perspective, it's talking to that IP address on that port, and nobody in the middle cares that it's "talking to www.google.com". All it knows is: I want to talk to that IP address on port 80. So why do we tell google.com that we're talking to google.com? Seems dumb. Actually, the first version, HTTP 1.0, did not have this. But here's something you'll run into frequently in your career: honestly, you probably won't be starting projects from scratch — you're going to be going into code bases that have already been developed for a year or 5 or 10 or 15 years, and you'll find things that make you go, why does this monstrosity do this crazy thing? That's my
own personal experience. So, something that looks silly on the surface usually has good reasons for it when you dig in. What is the reason here? We just thought about it from the client's perspective. From the server's perspective, what happens? It receives a TCP connection on port 80 from a remote client, and then it gets this request and parses it. So why would a server need to know that you meant www.google.com? Can a server host multiple websites? Can a server — can an IP address — host multiple websites? Would you want that? Do you want a one-to-one mapping between IP addresses and websites? We're already running out of IP addresses, right — we only have 2^32 IP addresses. And how many websites, how many domains, are there? I wonder — I don't know if there are more, but anyways. The question is: would you want to need an IP address for every single website that you're running? Google is actually a great example of this, because you can use any of their front-end IP addresses to talk to any of their back-end web services. You can do a DNS request for mail.google.com and manually set that IP address as the IP address of google.com, and it will work 100% correctly, because Google is parsing this Host field and knows which service you want to talk to. Or think about running many, many small websites: rather than having a different IP address for all those websites, just host them on one server and tell your web server, hey, if you see small website A, serve this content; if you see small website B, serve this other content. So it's actually an incredibly handy and useful feature that was probably required for the growth of the web —
Without this, starting a new website would have been a massive problem. So, a modern request in a browser looks something like this; they're a bit more complicated. We have GET, the same thing as before, plus headers like Accept-Encoding and User-Agent. The User-Agent field is a whole mess of its own; if you look at these, you start going a little crazy, because it's actually Chrome but it claims to be Mozilla/5.0 on a Macintosh. And notice your User-Agent is leaking information: the server now knows not only that I was running a Mac, but that it was an Intel Mac on OS X 10.10.1, the specific version number, plus the version of Apple WebKit and the specific version of Chrome I was running. All of this information, including your IP address, is leaked to Google, and not just Google but every single website you ever visit. So that's fun, and when Tim Berners-Lee was creating this stuff, they were definitely not thinking about this kind of privacy angle. So: the server reads this request, knows exactly which site you're talking about from the Host field, and knows how to handle it, so it responds with a protocol version number, a status code, a reason phrase for that status code, headers, and a body. The nice thing is this protocol is pretty symmetric: headers followed by CRLF, though most of the time it's the response that has a body. It's the same overall structure, and again the question with the body is: how do you know when you've reached the end of it? Either the server specifies a Content-Length header, which tells the client how much body to expect, or, on the flip side, everything until the server closes the connection is the body, and that way you don't have to specify the length of things in advance. Now the response codes; this is where you get the classic 404. The response code is a three-digit code, and the most significant digit tells you something
about it. If it's a 1xx, it's informational, which you will basically never see. A 2xx means success: I received your request, I understood it, and I accepted it. The 300-level codes are redirects, the server's way of saying "I understood your request, but that resource is somewhere else," maybe somewhere else on this server, maybe somewhere else entirely. A 4xx means the client messed up: either the request itself was malformed, or, like a 404, "I have no idea what resource you're talking about, go away." A 5xx means the server messed up: you sent a good request but the server blew up, maybe it ran out of memory and can't allocate any more to handle your request. The normal one you'll see is 200, whose short reason phrase is just "OK". Interesting 300-level ones: a 301 is Moved Permanently, which means the client can cache that response and knows never to ask the server for that resource again, always going straight to the redirect target. Then 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 500 Internal Server Error, all kinds of fun stuff. So, the modern exchange: we saw the request, and the response we actually get back from Google is much more complicated. Interestingly, the response tells you its status right at the very start: it says HTTP/1.1, then 200 as the status code, meaning OK. Everything after that is headers: things to control caches, both any proxy caches along the way and your browser's cache, and a Content-Type telling the client how to interpret the bytes coming back in the body; text/html means "this is HTML, you should parse this as HTML." And then we have the actual content of the page. Since we talked about authentication earlier, I thought it would be interesting to look at HTTP itself.
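Before moving on to authentication, the response framing just described can be sketched as a toy parser: status line, then headers, then a body delimited either by Content-Length or by connection close. This is a deliberate simplification (no header folding, no chunked encoding), not a robust implementation:

```python
def parse_response(raw: bytes):
    # The head ends at the first blank line (CRLF CRLF); the rest is body bytes.
    head, _, rest = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol version, three-digit code, reason phrase.
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # If the server gave a Content-Length, that many bytes are the body;
    # otherwise the body is everything until the connection closes.
    if "content-length" in headers:
        body = rest[: int(headers["content-length"])]
    else:
        body = rest
    return int(code), reason, headers, body

def status_class(code: int) -> str:
    # The most significant digit gives the category of the response.
    return {1: "informational", 2: "success", 3: "redirect",
            4: "client error", 5: "server error"}[code // 100]
```

Feeding it the Google response from the slides would yield code 200, reason "OK", a headers dict including content-type, and the HTML body.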
HTTP actually has built-in authentication, based on a challenge-response scheme. But first, going back for a second: this is our response, and we're using TCP for all of this, so what does that mean about the security of the data we just got back from the server? Given that the client talks to the server over TCP on port 80, what does that mean about the security of this response the client has received? It's not encrypted. It's not encrypted, so what does that mean practically? Everything is transmitted in plain text. Every single node on all those hops from the server to the client can see all this content; anybody on the same network as us doing some ARP man-in-the-middle nonsense could see that traffic. So it's not private; all of this is in the open. What else do we know, what else follows from using TCP? It can be spoofed; somebody could have injected fake content. We actually don't know for a fact that what google.com sent is what our browser received. This is why we looked at and studied UDP and TCP security: now, when we look at a protocol like HTTP and say, OK, this is built on TCP port 80, we can say, ah, all those attacks we looked at for network security apply to the web. You essentially can't trust anything you get back; anyone could have seen it and anyone could have changed it. You have to consider your threat model to judge how realistic those things are, but fundamentally that is what it means. So, on to authentication. Authentication is a natural thing to put into a protocol: how can a client authenticate itself to the server, so that the server can enforce access control over resources and say, OK, some documents are only for some users? And it's a simple
challenge-response scheme. Basically, the server sends a 401 response to say "you're not authorized to view this," so the client then has to authenticate. The server's 401 reply carries a WWW-Authenticate header that tells the client to use Basic authentication with, here, a realm of "reserve docs". The browser then usually asks the user for credentials: "hey, this website says you're not authorized, give me a username and password." The client then retries the request with an Authorization header saying it's using Basic authorization, carrying a base64-encoded username and password. Is this secure? Well, can the server check that the person is authorized to view this file? Sure: it has its usernames and passwords, it authenticates the user by the username and password sent in this header, and then it checks its access control list: is this person authorized? Yes, the username and password are right here. And can an eavesdropper recover the username and password? How does the header start? It says right there that it's Basic: a base64-encoded username and password is what the client sends. So what does this mean, given what we just said about TCP security? Literally anyone on the path can steal your username and password, and it's especially bad on an open Wi-Fi network: if you're using a Wi-Fi network with no password, literally anybody sniffing traffic on that network sees your request, which includes your username and password for this website. This is clearly bad. And it's not really the base64 encoding that makes it bad; yes, you can just decode it and now you know their password, but you don't even have to: you can simply steal this header and replay it in your own request. You don't even need to know that it's base64-encoded.
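To make the "anyone can recover it" point concrete: the Basic auth header value is plain base64, reversible by anyone who captures it. The credentials here are obviously made up for illustration:

```python
import base64

# What the client puts in the Authorization header: base64, not encryption.
user, password = "alice", "hunter2"   # hypothetical credentials
header_value = "Basic " + base64.b64encode(f"{user}:{password}".encode()).decode()
# header_value is now "Basic YWxpY2U6aHVudGVyMg=="

# Anyone who sniffs the header reverses it in one line:
recovered = base64.b64decode(header_value.split(" ", 1)[1]).decode()
assert recovered == "alice:hunter2"
```

And as noted above, an attacker doesn't even need the decode step: replaying the captured header verbatim in their own request works just as well.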
If somebody else is on the same network, could they play ARP-spoofing games? That would depend on the specific Wi-Fi equipment: a switch will basically let you send ARP packets and replies to other hosts, but when I tried this on wireless networks, most access points wouldn't forward those messages to other clients. And with a password-protected Wi-Fi network, it matters which kind: WEP is completely broken and can be cracked, but the other schemes are secure in the sense that you can't just listen to other people's traffic, even if you know the network password. So you should never use Basic auth, and it's a shame, because it's built into browsers. If you've ever wondered why every website reinvents its own stupid username-and-password form, it's because the things built into HTTP suck. Just about everything has moved away from plain HTTP; a lot of sites now sit behind SSL certificates with HTTPS, but HTTPS still uses HTTP underneath the hood; the only difference is that you create an SSL connection to the server and then talk HTTP over it. So you could use Basic auth over HTTPS, it would work, and you'd have slightly better guarantees, but you still have this weirdness in what you're sending, and the other problem that the website can't control what the UI for the username-and-password prompt looks like. So it fails on UX too; it fails in a lot of ways. (And note a VPN doesn't rescue you either: a VPN only encrypts and hides your traffic from you to the VPN endpoint, but from there the request to the server travels unencrypted, so you're basically shifting your trust from your local network to that remote network.) So HTTP/1.1 defined Digest authentication; they said, OK, Basic is completely broken, maybe we should not do this. Basically, the server sends a nonce as a challenge, a random value, and then the client
sends a request with a hash over the username, the password, the nonce value, the HTTP method, and the requested URL. What this means is that once you've taken that hash of the username, password, nonce, method, and URL, somebody else can't reuse the header value, because the server would send them a different nonce and their hash would come out completely different. The problem here is that the web server has to have access to your cleartext password: the server, on its side, has to be able to redo this computation and compare that things match. That's part of why almost nobody uses this either, and it means, as we've discussed, that if somebody breaks into your server they can see those passwords. Cool. OK, so you can actually play with HTTP traffic yourself. As we've talked about, you can use TCP sniffers and look at it on the wire or at the server. Browsers are actually really good now at analyzing HTTP requests and responses: if you enable the developer tools in your browser, there's a Network tab where you can see exactly what HTTP requests your browser is making and what the server is responding with. You can use proxies on either the client or the server side to see what requests are getting sent. There are really cool tools here: Firefox has extensions that let you alter and mess with the headers being sent, and probably the best tool if you want to delve into this further is Burp proxy. Burp is a professional web pen-testing tool with a free version, so you can just download it and use it; the free version is actually what I use. You set your web browser to use Burp as its proxy, so Burp sees all the requests you're sending, and you can edit requests in there, fuzz things a little bit, and it has some crawling capabilities. It's really cool.
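Going back to the digest scheme for a second: the description above, a hash over username, password, nonce, method, and URL, matches the MD5 construction from RFC 2617, which this sketch follows. All the field values here are made up, and real Digest auth has more moving parts (qop, cnonce) that are omitted:

```python
import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def digest_response(username, realm, password, method, uri, nonce):
    # RFC 2617 MD5 digest: binding the hash to the server's nonce (and to the
    # method and URI) means a captured header can't simply be replayed,
    # because the next challenge uses a different nonce.
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")
```

Note the downside from the lecture is visible in the code: to verify this value, the server must be able to compute HA1, which requires storing the cleartext password (or the HA1 value, which is password-equivalent).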
It's nice to be able to see exactly what's going on. OK: crash course in HTML. We're doing all the basic tech today so we can talk about vulnerabilities on Thursday. HTML has a super long history; it started out as part of the web back in the '90s, and there have been a lot of revisions since. Something super interesting that I didn't realize until I studied this history is that XML actually came out of HTML: it was an output of trying to standardize HTML, because the original HTML is not very standard or nice. They then tried to shoehorn HTML back into XML to create the weird XHTML hybrid around 2000, but that completely failed; basically nobody used it, so they abandoned the effort and said, fine, we'll keep HTML its own non-XML thing and live with that. Now we have HTML5, which has reached the point of being a living standard that's constantly added to and updated over time, and you can go check out these standards to see exactly what HTML is. The basic idea is to mark up documents with tags, which try to add meaning to raw text. You've all had experience writing raw text; how did that feel? I know this is a callback to 13 weeks ago, but writing policies in an ASCII text file: terrible, life doesn't look nice. Editors can make life easier (and also more difficult, I'd argue, if you're trying to do something fancy). The point is, with just raw text, maybe you wanted to bold some part of your report, or format it nicely, or fully justify the text; you don't have that control. So the markup language, the M in HTML, means marking up your document with tags that add meaning. You have a start tag of foo, followed by some text, and then an end tag: square... no, these aren't square, these are angle brackets, slash foo. You can also have a self-closing tag, which is just syntactic sugar for a start tag and close tag with
nothing in between. This is all identical to XML, except for the last kind, where you can have tags with no end tag at all: image tags, and a few other types, have no corresponding close tag. Cool. Tags are hierarchical. Here's a basic HTML page: you have the html tags, saying "this is an HTML document"; you can have a head tag with a title in there, whatever you want, and changing that changes the title on your browser tab or window; and then you can have a body, whose content here has paragraphs with some example text. It's up to your browser to interpret all of this and display it correctly to the user. It's hierarchical in the sense that you can think of it as a tree: head and body are both children of html, head has a child title, title has a child which is just the text "example", and so on and so forth. But tags alone aren't really expressive enough. I may want an image tag, but how do I tell it what file to use as the image? Or if I include a hyperlink, how do I tell it where it actually goes? So you can add attributes to tags, and attributes are actually really annoying, because there are several different syntaxes. There's the tag foo with a bare attribute bar; there's the tag foo with attribute bar having the value baz; and there's the same again with the value enclosed in single quotes instead of double quotes, both syntactically valid. Multiple attributes are separated by spaces, and here's an example using all three forms. I highly encourage you to try this: on any webpage, right-click and view source, and you'll see tons of this kind of stuff. The key to HTML, the thing that puts the HT, the hypertext, in HTML, is the anchor tag, which is used to create a hyperlink: the href attribute provides the URI to go to, and inside the anchor tag is
the text that shows up as the link. So this would be an anchor tag with an href attribute of http://google.com and the text "example", and it shows up just like this: the nice clickable link we all know and love. Here's a standard HTML5 page; what's interesting in it? It has this doctype, which just tells the browser it's HTML5; we won't go into all of that. Your browser is responsible for rendering this; here it's Chrome doing it, but the cool thing is there are browsers of all different stripes and shapes. This one, I believe, is Lynx, a command-line web browser you can actually use to browse the web, and there's another called w3m that I still use occasionally; I used to use it for documentation, because you can run it from inside Emacs, so I can have my code here and the documentation for the function I'm interested in there, and never leave Emacs. Now, just like with URIs, we have a similar problem. What are the special characters here? Let's go back to this example: what's special to the parsing of HTML? Angle brackets, slashes, double quotes, single quotes, all kinds of stuff, so we need some way to deal with that. The answer is character references, and it's actually really annoying. This is a different encoding from URI percent-encoding entirely, though the idea is fairly similar: you start with an ampersand and end with a semicolon, and in between you can either give a predefined name from the list of names, or give the decimal Unicode code point using ampersand-hash, or give the hexadecimal character code using ampersand-hash-x followed by the hex digits. This, by the way, is the root of a significant number of vulnerabilities on the web. And just as in URIs, where percent-encoding needs a way to encode the percent sign, here you need some way to encode the escape character.
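Python's standard library implements exactly these encodings, which makes them easy to poke at: `escape` emits entities for the markup-significant characters, and `unescape` accepts named, decimal, and hexadecimal forms alike:

```python
import html

# escape() turns markup-significant characters into entity references...
assert html.escape("<foo> & 'bar'") == "&lt;foo&gt; &amp; &#x27;bar&#x27;"

# ...and unescape() reverses every form of character reference:
assert html.unescape("&lt;") == "<"      # predefined name
assert html.unescape("&#60;") == "<"     # decimal Unicode code point
assert html.unescape("&#x3C;") == "<"    # hexadecimal code point
```

Forgetting to apply `escape` to attacker-controlled text before embedding it in a page is precisely the "root of vulnerabilities" alluded to above.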
So an ampersand itself is written as &amp; (and, as with the other reference forms, there are several ways to write it; the e in my name is done the same way). There are all kinds of these: the less-than symbol, the left angle bracket, is &lt;. And from what we just talked about, you can see why this has to be encoded: how else would the parser know whether you're starting a new HTML tag or just using a less-than character? If you write angle bracket, foo, close angle bracket, do you mean that literal text, or are you trying to create a new tag in the HTML structure called foo? Let's do one more thing and then I think we'll be good: forms. We've all used forms. Basically, a form has an action; it looks like the beautiful forms we use every day, with inputs inside it named, say, student, class, grade, whatever. Depending on the parameters of the form, submitting it makes either a GET or a POST request to whatever is listed in the action. Here, because there's no method attribute, it uses a GET request, so it puts the values into the URI as key-value pairs: student equals whatever was typed, class equals cse591, grade equals A plus, all in there. In the other example this would be a POST request, and the difference with a POST request is that the data does not get sent as part of the URI; it gets sent as the body of the request. So it's important to understand that the inputs to a web application are all the links on the page, because links automatically create GET requests, plus any forms present on the page, which can make GET or POST requests. Cool. When we come back, we'll learn about web applications and how they can be attacked.
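The GET-versus-POST distinction above comes down to where the urlencoded key-value pairs travel: in the URI's query string, or in the request body. A sketch with the standard library (the field values and the /submit path are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical form fields like the student/class/grade example above.
fields = {"student": "avery", "class": "cse591", "grade": "A+"}

# Both methods use the same key=value&key=value encoding; note the "+"
# in the grade gets percent-encoded as %2B.
query = urlencode(fields)

# GET: the data rides in the URI itself, visible in logs and history...
get_request_line = f"GET /submit?{query} HTTP/1.1"

# ...POST: the same encoded string, but carried as the request body instead,
# with Content-Type: application/x-www-form-urlencoded.
post_body = query.encode("ascii")
```

Either way, from a security point of view these form fields, like the links on the page, are inputs to the web application, which is where Thursday's discussion picks up.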