 All right, arm and body are in good. Feeling good? Feeling good? Good? And the system, if you can manipulate your cranks, your bills, get them to do whatever you want to do? No? Or are you ready to type it up and try enough stuff that it actually will work? That's good too. Those are both good skills. I missed you guys. I'm glad to be back. Oh, the mic is on. It feels a little bit like the monkeys and typewriters? Yeah, yeah, yeah. Millions of you; I would not call you monkeys. I would say students typing on typewriters. Some of you will solve levels. Some of you do a simulation of that: you just randomly try things. How do you build the levels? No, I hope that was, you know, painful. This stuff is not easy. How many of you can appreciate what I said, that theory is easy but practice is much, much harder, when you're actually doing these things? So this is why we do it, even though it's so painful: I only really, truly understand something when I can actually do it on a real program, rather than just saying, ah yes, format strings are very easy, it's %s and it's %hn. And then you try it, and it doesn't work, and then you have to figure out why. Will we need this for the rest of the semester? For a while, we'll see. It'll be up for a bit. You'll have other challenges to work on, don't worry. There's more stuff coming. That doesn't worry me at all. Good, good. That's awesome. Cool.
Okay, so the next section. With help from Yan and from other apps, we're going to close out application insecurity. Another app, yes. So now we get to move on to web. We've studied really low-level network security and how network attacks happen. We've looked at applications and how applications and binaries work. 
Now you could think of it as moving up one level, into the web. The web is really a whole other protocol on top of all the network stack that we talked about, plus all of these programs, like the browser and the web servers; all of these are applications, binaries, written in C or C++. So everything we learned in the two previous parts is all coming together.
So this is a picture of the very first website and the very first web browser. On the very first screen? On the very first screen. Yeah, it was chiseled in a rock tablet. So Tim Berners-Lee is the inventor and the father of the web. He was a researcher working at CERN. Do you know what CERN does? Collides stuff together to see what happens. I don't know, I'm not a physicist, but cool physics stuff with colliding high-energy particles together. And because it's this really big research lab, he realized that there are a lot of researchers coming and going at CERN and nobody really knows where anybody is, so it would be great to have a nice directory of where people were. And you could use the concept of links, which had been around for a while: maybe grab a document that had links to other documents that would show you who people were and what they were working on.
And so he created this on, does anybody recognize this operating system? NeXT. What is that? Yeah, from Steve Jobs, after Apple. He got kicked out of Apple and decided to start a new company called NeXT, N-E-X-T. They created big hardware plus operating systems. And then eventually, I think it was Apple who basically begged Steve Jobs to come back, so they bought NeXT to get Steve Jobs. Actually a lot of the things that you see in Mac OS X today come from the NeXT operating system. So it had a graphical user interface when, let's see, what is this? This is version 1.0 in like 1990, 1991, so early 90s. 
So the internet, as we know, since we looked at the history, had been around for a while, right? And what were people using the internet for? Bulletin boards? Late 80s, early 90s, bulletin boards. What was that? Yellow pages. Yellow pages on the internet? IRC, maybe? IRC, possibly, I actually don't know. Definitely FTP, transferring files, accessing things. Email was the thing before the web. But the internet was really confined to academic institutions and maybe some cutting-edge technology companies. It was still in this realm of not very popular; not many people knew what the internet was.
And so this was the very first web page. This is the World Wide Web, and it's pretty interesting. There are links to what's out there. And it's pretty crazy when you think about it: this piece of software that Tim Berners-Lee created at CERN ended up becoming this insane monstrosity that we all use everywhere. And as you'll see, it really spurred the development of the internet and the web and made it become this huge phenomenon.
So this is not so different from what we think of as a web page, right? We have text on the page. What's this blue thing with the underline here? What does it mean, though? Maybe you want to... That'd be perfect, yeah. It's kind of like a pointer, right? A pointer points to some memory location that has the actual value you want. A link is some text that points to a document that hopefully has more information about whatever this link is.
And so he wrote a proposal at CERN to say, hey, I want some time to create this thing. There had already been some of these ideas of hypertext, but they'd kind of been locked up in academic prototypes. That's an interesting research area if you want to look at it. He wrote a book. I highly recommend this book. 
If you're interested in the web, Weaving the Web describes what his thought process was and everything that was going on in the creation of the web. So the idea was a nice one. He had this idea of hypertext, these links to other documents, along with the connection of the internet, which meant that instead of your computer just talking to the machines on your local network, an intranet, your computer could now talk to nodes across the internet in other places geographically; that's the whole idea of the internet. And we had the TCP/IP stack, so we had reliable ways of communicating. The idea grew from, hey, I want something to manage CERN's documents, to this universal, worldwide system of documents where anybody could create a node and have documents on it. And you could have a link either to documents on that same server or to documents on somebody else's page, on a completely different server in a different part of the world. And this is where we get the idea of the web.
But there are three key problems, three questions you need to answer, about how to do this. So what are they? If you were Berners-Lee, teleported to, let's say, the 90s, what are the questions you need to answer? What's the format? Which format? Whatever this page is. Yeah, so there's a question. Okay, you have a document, just text. How do you know what parts of that document are text and what parts are hypertext, links to other documents? There are actually a lot of different ways you could do that. There could be an index that says, okay, characters 0 through 5 are a link to this and characters 10 through 20 are a link to that, with the text itself kept somewhere completely separate. So you have the question of how you communicate links: given a document, how do you do the hypertext part? Is that it? Are we done? 
How secure is it? How secure is that? Absolutely not. That's not at the center of the question. I don't know how they'd have figured it out. But we'll see. The web is like many technologies, like IP and TCP, where you first develop it and see if it works, and then you go, oh, security, right. What else? How do you transmit it? So how do you get that data from that remote system? The protocols we have are TCP and IP. Do they say how to fetch hypertext from a remote server? No, so you need to develop some protocol to do that. Third key component. And it may be weird to think about because we're so used to this whole idea of the web: where do you start? How do you know where to go? Think about the entire space of all computers that are on the internet. How do I know which one has the information that I want? And then how do I ask for it? And then when I get that response back, how do I know which parts are text and which parts are hypertext? So these are the three key questions. And as we'll see, this is important because the three core web technologies all come from answering these questions.
So how do we name a resource? How can you know that my web page is on some server with an IP address that you don't even know, and that your specific class is in a certain location on that server? How do we even name that? And this has to be a universal thing, so that when I say a name it isn't just from my perspective. This was actually a thing with early email addresses, which had the exclamation point character in them to spell out a route. 
So you'd have a user name and then a chain: the next server, bang, the next server, bang, the next server, bang, and finally where the person was, and that was the route your email had to take to get from you to the other person. That's not a global name, because it's going to be different for everyone. So we need global, unique names for a resource. Second, how do we request and serve a resource? Once I know who has it, how do I ask for it? And the third one is, how do I interpret that document so I can get new links? As you'll see, naming is all defined by URIs, Uniform Resource Identifiers, which again is one of these things that's so core: we see links, we see http:// all the time. HTTP, the Hypertext Transfer Protocol, is the answer to the question of how we request and serve. And HTML, the Hypertext Markup Language, is how you describe a document that contains hypertext. And they're super cool, because they form this nice kind of loop. I mean, this is actually the question, right? How do you start? You start with HTML? Where does HTML come from? Oh, in terms of creating? I think not necessarily creating but using, let's say. You need a URI; you need not only a domain, you need to know where to go, right? 
You need some kind of URL, or URI. I'm going to use them interchangeably because they're essentially the same thing; you can think of URI as the more general term for URL, but for our purposes they're basically the same. So you have a URI, which tells your client (your browser, though it doesn't necessarily have to be a browser, so I'm going to try to use the word client) how to make an HTTP request to that server to access that resource. Then after you make that HTTP request you get your response back, and that is HTML. And what's on the HTML page? More links, more URIs, which our client then knows how to fetch. It's just this beautiful, beautiful cycle. So these are three super interrelated technologies, and this is why we're going to spend the majority of the time looking at these three.
So we're going to start with the beginning of the circle, URIs. I guess you can think of this as a spiral: if you've ever found yourself three or four levels deep in Wikipedia reading about some crazy French game that you'd never heard of before, you'll probably understand that it maybe is a spiral. But technology-wise it's a big old circle. And this is still true nowadays; this is not just from the beginning of the web. Even when you throw in things like JavaScript and all these fancy things, it all boils down to these three technologies.
So URIs are essentially metadata: how do I know who to talk to, and how do I ask them for a resource? How do I know they can give me a page about cooking instead of a page about programming? They're a server, they have a bunch of resources. So a URI needs to answer: which server has it, who do I talk to? This could be a domain name, which we know is translated by DNS to an IP address. So, who has it? (Hi, John.) How do I ask? So the question is how do I ask for that resource, and this is actually a nice thing that they added here: rather than make URIs specific
to HTTP, this is the universal part: you can have HTTP, HTTPS, FTP, email links, all kinds of different types of links. And how can the server locate the resource? How can the server know what resource I actually want? Those are the main questions here. And again, this goes all the way back to homework one (it's hard to remember when you're working on homework three), when you were writing an HTTP client or server: all this stuff is defined in publicly available documents that you can go read to see exactly what it means to be a URI and what all the parts mean.
So the basic syntax: you have some scheme, so this would be HTTP or HTTPS; a colon; then some authority, which is who has it; some path; a question mark and a query part; a hash and then a fragment. It's pretty straightforward. This is every URI you've ever seen. You can even have telephone numbers: I believe it's tel, colon, and then the number, and if you put a link with that on your phone, when you click on it it'll open up the phone app and ask if you want to call this number. Is that what the telnet one is? 
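A minimal sketch of the five-part syntax just listed, using Python's standard `urllib.parse.urlsplit`. The helper name `uri_parts` and both example URIs are made up for illustration; they are not from the lecture slides.

```python
from urllib.parse import urlsplit

def uri_parts(uri: str) -> dict:
    """Split a URI into the five generic components: scheme, authority,
    path, query, and fragment."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # urllib calls the authority "netloc"
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }

# Hypothetical URI chosen only so that every component is present.
print(uri_parts("https://www.example.com:8443/docs/page.html?lang=en#intro"))

# A tel: URI has a scheme and a path but no authority at all.
print(uri_parts("tel:+15550100"))
```

The same split applies to any scheme, which is the "universal" part: the parser does not need to know what the scheme means in order to find the pieces.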
On the telnet question: no, that's something different. I understand the confusion there. Telnet is like SSH but unencrypted, which is why it's on a nearby port; I think telnet is 23 and SSH is 22, something like that.
So the scheme is the protocol. The authority basically means who can interpret this. And this is actually an important point, because even though we're used to a link that looks like http://google.com/search, which to us as humans means search on Google, what we think it means doesn't matter. The /search, whatever the path and query are, is up to the authority, whoever that server is, to understand. So links don't need to be human readable; when they are, it's for marketing purposes, and it's easy to send them to people, and all that nice stuff.
And of course the authority has its own syntax, where you can have a username and a host, and then usually a colon and a port, and I think in some schemes you can even put a password in there. Every scheme has a default port. Why does every scheme need a default port? Because, as we learned when we covered IP and TCP, the server has to listen on a certain port, and only one application can listen on a given port on a server. HTTP is port 80 by default. A server can change that and host a web server on any port, but then the authority has to include this different port so you know which port to go to.
The path is usually a hierarchical structure separated by slashes, just like a directory listing. The query is super interesting: it's used to pass, as we'll see, non-hierarchical data. And the fragment: what is this? We've seen URLs which have stuff after the hash. What is it, and why? If you open one in a browser, usually it will take you down to some part of the page, not necessarily the top of the page. The idea goes back to documents and resources: the URI links to a document, and the fragment is a sub-object of that
document. So this is why you can say, this document, but specifically this part. And the super interesting thing about this is that the fragment does not get sent to the server, because the server doesn't care about the fragment at all. The server only cares about the path and the query, and then it's up to the client, the user agent, to figure out which sub-resource that is.
Questions on URI syntax? So how do you include a path with a question mark in it? What do you want to know? Can you have a question mark in your path? Because then that looks like where the query is starting. Yeah, but what if I want to make a web page whose path really has a question mark in it? It has to be encoded. Yeah, so we need something to be encoded. We've seen this before in a lot of different protocols. You can see that even just in this, colons, question marks, the hash, even slashes can all change the meaning, and so we need some type of encoding, which we'll see in a second.
So, going over some examples. What's the scheme of this one? The authority? It's not just example.com. What is the authority? example.com:8042, right, so the whole thing is the authority. So you have the scheme, and then //example.com:8042; the slash slash itself is kind of not super important. Then we have the path, /over/there, the query is test equals bar, and the fragment is nose. I'm not going to go through all of these because that would be crazy. Let's go through the last example. So the scheme is clear, and the authority comes after the slash slash: example.com. And then what's the path? Or what is the authority? Is the authority example.com slash test slash example colon 1, which would be port 1? What's the problem? 
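A rough sketch of what a client derives from a URI before it talks to anyone: the host, the port (falling back to the scheme's default), and the part it actually sends, which is the path plus the query and never the fragment. The function name `connection_plan` is made up, the `http` scheme on the example is an assumption (the lecture only names the authority, path, query, and fragment), and port 443 for HTTPS is standard but was not stated in the talk.

```python
from urllib.parse import urlsplit

# Default ports by scheme. 80 for HTTP is from the lecture; 443 for HTTPS
# is the standard value, added here as an assumption.
DEFAULT_PORTS = {"http": 80, "https": 443}

def connection_plan(uri: str) -> tuple:
    """Return (host, port, request_target) for an http or https URI."""
    parts = urlsplit(uri)
    # Use the explicit port if the authority carries one, else the default.
    # An unknown scheme raises KeyError; this sketch does not handle that.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    # The server sees the path and the query...
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # ...but parts.fragment is deliberately ignored: it stays on the client.
    return parts.hostname, port, target

print(connection_plan("http://example.com:8042/over/there?test=bar#nose"))
print(connection_plan("https://example.com/#top"))
```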
Exactly, exactly: there's a problem parsing that last example. You'd probably have to look very closely at the rules to see what the precedence was. For those of you who took 340, think about exactly how this parsing would work and what each of the parts would be. Part of the problem is: is slash atom part of the query, or is the question mark part of the path? And what about this colon here? So that's how we get to the idea of an encoding mechanism, to make this very clear. And of course, instead of using an encoding mechanism that already exists (what are some that we've looked at? In, say, a Python string you'd write backslash x20, the hex value), of course you have to come up with a new one. So there's a list of reserved characters; these are characters that mean something in the URI itself. And, I don't know if this part is Tim Berners-Lee's doing. I should say he's a UK citizen and he was knighted, so he's now Sir Tim Berners-Lee, which is pretty cool. So if you invent something that becomes as important as the web, maybe you'll get knighted. I don't know, that could be in your future.
So in percent-encoding, for example, you have to encode anything that's not alphabetic, a digit, a dash, a dot, an underscore, or a tilde. Did I make all this up? Where do I get this from? The RFC. Yeah, this is all from the RFC, it's all in there. You encode each byte that's not in that set as a percent sign followed by its hexadecimal value. So it's pretty simple. The ampersand is one of the reserved characters, so we would need to write %26, because the ASCII value of the ampersand character in hexadecimal is 26. And similarly to a lot of escaping schemes, where if backslash is our escape character we encode a backslash as backslash backslash, here the percent sign is now a special character because it's part of how we encode things. So what would this be? Percent percent? 
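The rule as stated (keep letters, digits, dash, dot, underscore, and tilde; turn every other byte into a percent sign plus two hex digits) is small enough to write out by hand. This is only a sketch of that rule; the function name `percent_encode` is made up, and treating the text as UTF-8 bytes is an assumption. The standard library's `urllib.parse.quote` does the same job and is used at the end as a cross-check.

```python
import string
from urllib.parse import quote, unquote

# The characters that never need encoding, per the rule described above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str) -> str:
    """Percent-encode every byte of text that is not an unreserved character."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: encode the text as UTF-8
        ch = chr(byte)
        out.append(ch if ch in UNRESERVED else f"%{byte:02X}")
    return "".join(out)

for sample in ["&", "/", " ", "%", "a-b_c.d~e"]:
    print(repr(sample), "->", percent_encode(sample))

# Cross-check against the standard library; safe="" stops quote() from
# leaving "/" unescaped, which is its default.
assert percent_encode("50% off & more") == quote("50% off & more", safe="")
# Decoding reverses it.
assert unquote(percent_encode("50% off & more")) == "50% off & more"
```

Because the percent sign is itself encoded, decoding is unambiguous: a literal percent in the data can never be mistaken for the start of an escape.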
The percent sign itself becomes %25. Space is %20, which is probably familiar because you've seen that before, and so on and so forth. So now if I give you this example again, how would it parse? What's the path here? Slash test slash example, all the way up to colon one period html, and then finally what's after the question mark is the query. It's incumbent on the web server, the server that's reading this, to then parse that out into the individual parts. But it's the job of example.com to do that; we don't do that. Everything after the authority is essentially opaque to us. Cool. Questions on URI percent-encoding? Sweet.
All right, another thing that we need to talk about is absolute versus relative URIs. So the idea is, when we have links, we can specify everything exactly: here is the scheme, the authority, the path, and the query; here's how you get your resource. Sometimes, though, we want to specify a link relative to something, maybe the current directory that we're in. There are several different ways of writing relative links. When you're looking at HTML source you can see these differences, and it's always a very good idea, especially when you're learning the web, when you're browsing, to right click and say view source or inspect element, or sometimes it's in some drop-down menu somewhere. So, the slash slash here means use the same scheme: if the link is on a page that was loaded over http://, this link will use HTTP; if it's on an HTTPS page it will use HTTPS. But it still specifies the host and the path. Why is that useful? 
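A short sketch of how the slash-slash form resolves, using the standard library's `urllib.parse.urljoin`. The wrapper name `resolve`, the page URIs, and the link itself are all made up for illustration; the point is only that the same link text picks up whichever scheme the containing page was loaded with.

```python
from urllib.parse import urljoin

def resolve(page_uri: str, link: str) -> str:
    """Resolve a link found on page_uri into an absolute URI."""
    return urljoin(page_uri, link)

# Hypothetical scheme-relative link: host and path given, scheme omitted.
link = "//example.com/test/help.html"

# On a page loaded over HTTP, the link inherits http.
print(resolve("http://www.example.org/index.html", link))

# The identical link on an HTTPS page inherits https instead.
print(resolve("https://www.example.org/index.html", link))
```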
The scheme-relative form is useful because if the page we're on is using HTTPS, then all the links will continue to use HTTPS, which we'll see is important when you're using images or bringing other content into your page. But you may worry that not everyone can always use HTTPS; with a page loaded over HTTP, your page will still not break and everything will still work. The reverse can get you errors: if you're including things on your page over an unencrypted connection, over HTTP, that will cause warnings, because then the whole thing is not encrypted.
We can also do /test/help.html. What would this be relative to? example.com. So it still uses the same scheme and the same authority, just with a new path, /test/help.html. And we can even use directory-traversal-type things, so we can say go up two levels. The context is always important here: in order to resolve a relative link you need to know what page you're on. And this is actually all we need about URIs. I'm sure somebody could spend a long time talking about them, but they're pretty straightforward.
Which brings us to HTTP, the Hypertext Transfer Protocol. What is it used for? Transfer of hypertext. So after we have a URI that has a scheme of HTTP, we know what authority to talk to; we know exactly what host and what port to make a TCP connection to. Assuming that there's a server listening there, we assume that it speaks the HTTP protocol, and we make a request over TCP. And you actually already know a lot about how this works, so this should be review. Does everybody want to read the RFC again? No? Yeah, but now we're doing it from the other perspective, not just the server.
So the history is pretty interesting. I should update this; I think version 2 is probably close to being standardized. But version 1.0 was defined in May of 1996, which is pretty interesting: if you think about it, Tim Berners-Lee created the
first version in like 1990-91, and by 1996 they had standardized the first version of HTTP. Fairly soon afterwards they did version 1.1, in 1999. And the new version, version 2, is based on a Google proposal called SPDY, which has a lot of interesting features. I believe it uses encryption by default. It's a binary protocol, not a text-based protocol like the one we'll see, and it has channels and multiplexing, making multiple requests over one connection. And it's supposed to be a lot faster too, at least in some aspects.
So the overview. We could actually talk about this even without HTTP in particular, because we've written servers. The server must be listening for incoming TCP connections on that port, otherwise you're not going to have a good time. I haven't figured out what happened on my website; I still don't know why, but apparently Apache crashed, and some people were saying that the website was down, and I'm like, that is weird, why would the website crash? If someone was doing something, they should tell me how they made that happen, because it shouldn't happen. But anyways.
So you need a server listening for TCP connections, and the client, when it wants to connect, opens a TCP connection to the server. And now at this point, what should your mind be thinking about? What are the packets that are going to be sent between the client and the server? A handshake, right. What's the handshake? The SYN, the SYN-ACK, the ACK. That's part of it, but now we're operating at a higher level; we're taking that stuff for granted. But you do know exactly how that works, so you can explain everything here. Once the TCP handshake is done and the connection is established, the client sends its HTTP request in the format that we'll talk about. The server then reads that request and parses it, and if it can understand that request and it knows the path and the query that the client is talking about, it will generate an HTTP response
potentially, or probably, containing the HTML content that the client was asking for. Cool. So it's client, HTTP request; server, HTTP response. But really it can be a lot more complicated. It's interesting: if you read the spec, I think even the HTTP 1.1 spec that you read has references to firewalls and proxies, and there are a lot of options that can be set to try to control the behavior of these middleboxes. So the proxy has some kind of cache, and the client has a cache. And this is actually probably more similar to ASU's network, which we've had problems with during some CTFs blocking our traffic: the client makes a connection through a firewall, which goes through a proxy, which ends up going to the server, and the data comes back the same way. So you can have things like: when the client asks for a resource, it looks in its local cache, and it can serve up the HTTP response and the HTML content from there. If it doesn't have that, maybe the proxy has already cached the content, so the proxy sends it back. So there are a lot of interesting ways for this to happen. This is more realistic, but we don't need to think about it at this level all the time; the more basic scenario is fine at an abstract level. But this is part of what makes the web to me so interesting and also very difficult, and I think it's good to talk about now: there are so many different technologies in play here. Not only do you have IP and TCP, and the URIs and HTTP we just talked about, now you're adding caches, firewalls, and proxies into the mix, and any one of these can mess things up along the way. So the web is a big ball of different technologies all working together, which is cool and crazy.
So, HTTP requests. These are things that you've parsed. A request has a method; we'll talk about what exactly a method is. It has a resource, which is derived from the URI; again, that's the link between URIs and HTTP requests. The protocol version: why is it important to put the protocol
version in a request? So the server knows how to parse it, or whether it can. If it gets an HTTP 2.0 request but it doesn't support that, it'll have to just drop the connection or tell the client to go away or something. Or the reverse: if the client sends an HTTP 1.0 request, the server, if it supports that, can drop down and speak that older version of the protocol. We even saw this in IP: every IP packet has a version number in it, which is how we're able to upgrade from IPv4 to IPv6, because that information is in each packet. Then there's information about the client, which is optional; this is something the user agent can send in order to tell the server a little bit about itself. And an optional body, which is how we'll upload content. And that's it, so it's pretty simple. What about the headers? So this is a high-level overview of the things that are in the request, not the syntax, which we'll go over exactly.
So overall, when we start reading the spec and breaking it down, this is really what an HTTP request is: a start line, followed by headers, followed by a body. It's just syntax that we can parse. Every line is terminated by CRLF, and some of you hopefully learned while doing the homework that bare newlines aren't sufficient, although most servers accept them anyways; there's a lot of best-effort parsing going on. But according to the spec it has to be CRLF. So there's a start line, a CRLF, and then you have a list of headers, as many headers as you want. And the question is, how does the server know that the headers are ending and the body is beginning? You separate the headers from the body with essentially a blank line: there'll be a bare CRLF, and then you have the body. Questions? I'm trying to go through this quickly so we can build up some base knowledge here, and then we'll start attacking stuff
which will be fun. So, methods. Why do we have these different concepts of GETs and POSTs and HEADs and DELETEs and PUTs? To put stuff on the server, or get it again? Yeah, it's exactly that: they're different semantic meanings for what the user agent wants to do with this resource. The URI itself doesn't define any of that. GET, POST; create, update, delete. That'd be a good challenge, or a good question: here's a bunch of methods, which one's not standard? Well, that'd be hard. Anyways, it is extensible, right? It is extensible, yeah. So you can put in anything, and it's up to whatever the server supports. So you can write a server that has custom methods that do random things that are non-standard.
So GET is basically give me what you've got. POST means I'm actually going to send some data to you. PUT means you should store whatever I'm going to give you under this URI, the path and the query parameters. And HEAD is pretty interesting: it's identical to a GET request except the server doesn't send the body. Why would you include this if you were writing the HTTP spec? To make sure that the page actually exists, maybe; you could be testing that the page exists before you fetch the whole thing. Yeah. So, testing what the server supports. And thinking back to that diagram of proxies and everything, we can see if anybody's adding any additional headers or anything else to either our requests or our responses. It's also used for debugging, and when you're debugging you maybe don't want the whole page back, because bandwidth was expensive then, so you just want part of it.
The older ones: OPTIONS, DELETE, TRACE. TRACE was an interesting one: basically, give me back everything that I sent you, as the body of the response. So this is another debugging thing, where you would send a method of TRACE, blah blah blah, all the headers, and then what you get back is a response with what the server saw as
your request as the body. So this is another case where we're trying to debug from the server's perspective, because when you have these client and proxy architectures, you don't know if there's something else in between that's messing with your content, so maybe the server is seeing something different from what you're sending. This was a way to check that. I believe it's not really used anymore; it actually enables vulnerabilities, so it's kind of been eliminated. CONNECT is used with proxies: you're telling a proxy that you want to connect to something else. And you can extend the set with more. An important point is that a GET is not supposed to change the state of the server; actually, we'll get to this in a second.
So, a simple example. Here's a GET /; this was made using curl to google.com. We have the start line, which is the method, a space, the path with the query (so those parts of the URI), a space, and then HTTP/1.1, the version that we're speaking. We have User-Agent, which is where the client tells the server what it is. Host: www.google.com. Accept is another header, which says what kinds of things the client accepts in the response.
Why do we need this Host: www.google.com? At first it seems counterintuitive: does the server at google.com not know it's google.com? We have our URI with google.com, we made a DNS lookup, and we connected to google.com. And in fact, back in the days when the web was first invented, they didn't have this Host header. The problem then is that you can only run one website, for one domain, on one server. The idea is that you want one server to be able to host multiple websites or multiple domains; this is called virtual hosts, depending on your web server. Obviously the server knows its IP address, but this way it can give different content depending on which host the URI is for. So yep, all the important
things. Modern requests, though, look very complicated. I pulled this from Safari, a request it made to Google. There's Accept-Encoding, which I believe tells the server that it will accept deflate or gzip content back, and it will transparently handle that. It also says it prefers text/html or application/xhtml+xml; these are all MIME types of the different data types that it supports. It tells it something super weird, that it's Mozilla/5.0. I actually can't remember off the top of my head why the user agent string says that, and then later it clarifies what exactly it is: this is a Macintosh, Intel Mac OS X 10.10.1, which is kind of crazy. I don't know if you've ever looked at this, but you're sending this to every single website you visit, right? So every single website knows exactly that I'm on a Mac and I'm running Safari. I even stopped, because I thought surely that user agent can't be that long. It's actually showing that it's AppleWebKit; WebKit is the core browsing and HTML engine. I don't know why it says Chrome and Safari here. I don't know; user agents are a whole crazy complicated mess. So this is the request. The client makes a request and the server needs to decide how to respond. So what are some of the ways? Put yourself in the server's place: if you're an HTTP server, what are some responses, how might you want to respond? So I go up to you and say, "I would like resource X, please." 200 OK, yeah. So you may be able to say, "OK, yes, here is X." You may need to say, "I have no idea what you're talking about, I've never heard of an X." What else? What?
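The request layout described above (a start line, then headers, then an empty line) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_request` is a hypothetical helper, the `curl/7.0` user agent value is made up, and nothing is sent over the network.

```python
CRLF = "\r\n"

def build_request(method, path, host, headers=None, version="HTTP/1.1"):
    """Format a request as text: start line, headers, then an empty line."""
    # Start line: method, a space, the path (with any query), a space, the version.
    lines = [f"{method} {path} {version}", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Every line ends with CRLF; one extra CRLF marks the end of the headers.
    return CRLF.join(lines) + CRLF + CRLF

# Hypothetical header values, loosely modeled on the curl example in the lecture.
request = build_request("GET", "/", "www.google.com",
                        {"User-Agent": "curl/7.0", "Accept": "*/*"})
print(request)
```

The Host header is passed separately from the other headers here because, as discussed above, it is what lets one server answer for several domains.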
Yeah, you may say, "I've never heard of that X, and I exploded." So what else? "You're not allowed." Yeah: sorry, you can't access X. What else? Yeah, I may tell you that actually X is over there, right? Somebody else actually has X. And I may want to tell you that just for right now X is at some other resource, or I may want to tell you, don't ever ask me again for X, it's somewhere completely different. So there's actually this rich vocabulary that a server needs to have to communicate properly to a client about the status of the request. There's even a way for a server to say "I'm still working on it, I'm still working on it," and then actually reply. The response, and this is also something that you've already created, is very similar to a request. You're going to say the protocol version again (remember, this is important), a status code, and a short reason for that status code, followed by headers, followed by the body. So again the syntax is very similar: your status line, headers, body, everything separated by CRLF and an empty line. The overall structure is almost exactly the same. The difference is the response codes: they're three-digit codes, separated into categories based on the first digit. So you have the 100s, which are "OK, yes, I got your request, I'm continuing to work on it." 200s say it was accepted and understood. 300s are all the redirects, to say go somewhere else. 400 means you messed up: I couldn't parse your request, or what you told me was garbage, or you're not authorized. The fault's on you. 500 is "I messed up": the server blew up, or I don't know, something happened, sorry. Some of these are interesting; when you look through them they're kind of fun. So 200 is pretty much the most standard. Accepted, No Content. 301 Moved Permanently, so this means literally never ask for this resource again, and you can cache this
result, like always go somewhere else. 307 is a temporary redirect. 400 means you made a bad request. 401 means you're not authorized. 403 means you're forbidden. I don't know what the difference is there; maybe you could be authorized but, I don't know. 404 is not found. 500 is internal server error. 501 means it's not implemented; I think that would be if you made a weird method that the server didn't understand, like you put FOOBAR as the method: "I don't know what FOOBAR this resource means." Bad gateway, service unavailable, all kinds of stuff. So yeah, it's kind of funny that 404 has entered the common knowledge; many people see this when they browse, so they actually know what that means. All right, so, example of a response. We have our example request from before, this is from the curl, and then we have the corresponding response back. We can see it actually includes a lot more information than was sent in the request. We first have the status line, which is HTTP/1.1 200 OK, so we know this is a good response. Then every line is separated by CRLF. We have a Date; we have when this expires; we have different things for Cache-Control, of who can cache it and when they can cache it. And this is a lot of what the RFC, as you probably saw when you looked through it, is dealing with: what exactly the semantics are for every single one of these headers. We have Set-Cookie, which we'll talk about in a bit; X-XSS-Protection; X-Frame-Options. This Alternate-Protocol one is interesting, because this is what the server is using to tell the client that it actually supports SPDY, or it should be HTTP/2.0, so on the next request it can try it if it wants to. The Transfer-Encoding: chunked, you guys had to deal with that, right? It wasn't tested? I should have added more tests in there. Yeah, that's a problem. How you transfer it and when you transfer it, those are all interesting things. And then the empty line, and then finally you have the body, which is going to be the HTML, as we finally get there. So one of the
interesting things about the web in general is: how do you add new things to this protocol? Like, how did HTTP go from HTTP/1.0 to HTTP/1.1? Part of it is, as we said, they had to add the Host header, right? And yes, to officially update the spec you have to go through the whole process, but how would you prove that it actually works? Or what if you're writing a browser or a web server where you want to add some new features? There's actually a nice aspect of the web here. All of those headers that start with X, those are experimental. Yeah, the X stands for experimental, and it means that if you don't know what it is, just ignore it. And actually that's a common policy, a common way of writing web servers and web browsers: if you don't really understand what it means, just ignore it. If you don't support it, it's fine, but you leave it be, you don't mess with it. And this is how the web evolves. So you're free to change your browser to send a different header, and then once you prove that it's a good idea, it gets adopted into standardization. So the web is actually this kind of living, breathing ecosystem of a protocol. Like, somebody, probably in the Netscape era, thought it was awesome to create an HTML blink tag, and it would blink and flash. Did you ever see that? Maybe not, because it's been deprecated, and actually it's been removed. But it got added at some point, and it got added to a spec because people used it. And especially with the web you have this interesting mix between, as we'll see, web developers and website operators, servers, and clients. Because if you wanted to add some crazy cool tags, you need a browser to support that, but browsers aren't going to support that until websites need it. You have this vicious circle. So if you can prove that it works and that it'll be useful, you can convince people to adopt it as part of the standards. Anyways, OK, cool, we talked about all this. Yeah, so the content
type is interesting; actually, this is a good point to look at it. This tells the user agent, what am I looking at? What did you give me? Even though it's called the Hypertext Transfer Protocol, it doesn't mean you're always getting HTML pages back; it could be PDFs or whatever. So that header tells the browser what the server just sent, which is important. And then the actual document. OK, cool. So now we're going to dig a little bit into how HTTP does authentication. So why would we need authentication? We have this awesome open document-sharing system where we're linking the universe of documents; why would we need authentication? Yeah, so we want to control access. It's kind of the whole idea of why we need security in the first place: some documents we don't want to be available to everyone. Confidentiality: we want to restrict access to certain documents. HTTP actually has this built into the protocol, which is why we're looking at it here. Basically the server sends a 401 that says you are not authorized to view this content, and includes a challenge as well as some scheme for how you should respond to that challenge. Then the client, if they want to access it, must try again with a header that says Authorization, with whatever that response is. So we'll look at HTTP Basic authentication. This is another thing where they clearly knew that they needed some kind of security in terms of restricting access to documents; we'll look at how they actually did it. The server will say, hey, you need to authenticate with Basic under the realm of "reserved documents" (there can be different realms depending on where you're going). The client retries the access, including a base64-encoded username and password. So how secure is base64 encoding?
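The Basic scheme just described can be sketched as follows. This is a minimal illustration, not a real client: `basic_auth_header` and `decode_basic_auth` are hypothetical helper names, and the credentials are the well-known sample pair from the Basic authentication RFC. It shows that the header is built by joining username and password with a colon and base64-encoding the result, and that reversing it needs no key at all.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header a client sends for the Basic scheme."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization: Basic {token}"

def decode_basic_auth(header):
    """Recover the credentials from a Basic header; no secret is required."""
    token = header.split(" ", 2)[2]
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password

header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header))
```

Anyone who can observe the header can run the second function, which is the point of the question above.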
I mean, it's not secure; it's not cryptographically secure. It's just an encoding. So all you have to do is take that string and put it into any base64 decoder and it will show exactly what it is. And if we're talking about using HTTP, this data is going to be sent from the client to the server over what protocols? HTTP over what? TCP. TCP and IP. Is there any encryption? No. Which means, and this is why we studied the network attacks, anyone on the path from the client to the server can get the username and password. And not only that: anyone on your local network who can do an ARP poisoning attack can force that traffic to come through them and man-in-the-middle the traffic. So do you know, did they just assume that no one would ever be trying to snoop on each other, or did they assume that there was going to be security all around, like there is today? That is a good question. I would believe that they weren't really thinking about security, in the sense that they knew they needed some kind of access control mechanism and they said, hey, we should include that in HTTP itself. There wasn't at this point a lot of security thinking about anything, really, so most things were just "can it work?", and then it works, and then on to the next thing. At that point it may not have even been common knowledge, all these other attacks that we know of on a local network. So it was probably like, oh yeah, you're trusted, the server is trusted, you trust your ISP, what could possibly go wrong? So try to crack this username and password; you can do that. HTTP/1.1 adds an additional mechanism, digest authentication. You hit a resource, the browser pops up a username and password dialog; it's one of these mechanisms. We didn't go over any crypto, so these are just high-level concepts. The server sends some nonce; the client creates a hash of the username, the password, the nonce value, the HTTP method, and the requested URL; and then the server can rerun
that calculation to check that those are the same. So in this case you're not leaking your username and password to everyone, but the problem now is the web server has to have access to the clear text of your password in order to redo this same computation. So these reasons, plus the fact that the interface is really terrible (when you go to a website that uses this HTTP authorization, the browser is frozen until you fill in this username and password box), all of those mean very few sites actually use this. Which means that every website, as we'll see, has to implement its own username and password authentication and authorization mechanisms. All right, cool. So you can and you should be sniffing and playing with and looking at HTTP traffic. Modern browsers are really awesome at this. You can open up any modern browser and there will be some developer tab or developer tools, which will have a sweet toolbar that will show you the text of the page, and if you make a request with that toolbar open, it will usually show you the network traffic of every single HTTP request your browser needs to make to render that page. So that's a cool way to monitor even what you're doing. We can use tcpdump or Wireshark to listen to all the HTTP traffic that we're making, to look at what web requests we're making. Proxies: proxies are super useful. We can create a proxy and tell our browser, hey, send all my traffic through this local proxy, and that way I can look at every single request that's getting sent out and all the responses. And not only that, I can actually monitor, modify, and change it. So this is actually one of the best tools, and the tool I always use when I'm doing web vulnerability analysis is a tool called Burp Proxy, which I think, yes, OK, good. So there are extensions you can use to modify your requests. I use Burp Proxy, and whenever I'm doing a CTF or something I do all of my requests through Burp Proxy, and that way I have a history of everything, every request I've made, and then I
can sometimes modify that request. So sometimes you want to fuzz something and see the result in a browser; you can modify the request. Burp has a free version, so you should download it and use it. Even in the free version, the thing I use the most is, I think it's called Repeater. You can get a raw request from the proxy, and then you can edit and change various fields: what happens if I change this parameter to this? Hit go, it makes the request and shows you the response in another frame. You can keep changing, trying things, going backwards and forwards through the requests that you made. It has things like, it can URL-decode things, it can URL-encode things with the percent encoding, it can percent-encode as you type. It's a nice tool. It's a little bit scary, so maybe I'll do a demo at some point, but it is a professional, great tool that's used by real pen testers, and you don't even need the paid version. I think it's probably $300, $400, so that's not cheap. It definitely does some cool stuff, but there's nothing in it you desperately need. Are we going to do it? Yes, we are. All right, so finally, after everything, we get to the H in HTTP; we get now to the Hypertext Markup Language. So how many people are familiar with HTML? Good. It's always good when you talk to somebody and they say they can program: they can do HTML. It's a language. It's a language. So the idea is we want some format. Tim Berners-Lee, Sir Tim Berners-Lee, created this; you wanted this format to be able to create documents that are portable. You don't want this reliance on something like a PDF, which is reliant on Adobe Reader to read and render properly. And it was based on SGML, so he didn't actually invent this syntax 100%. So the history here: HTML 2.0 was in '95, 3.2 was in '97, 4.01 was in '99, so you see a lot of progress, a lot of things going on. And then XML 1.0 was announced. So actually, we had hands for HTML; what about XML? Have we ever used XML?
Yeah, so XML actually came from HTML, which was super interesting. Just like URIs were this idea of how to generalize, XML said, hey, HTML is really useful but there are some things that are not quite standard; let's fix that and generalize it, so you can describe documents, and describe a schema that defines what XML is valid for whatever your purposes are. And the web wholeheartedly rejected that. There was literally no progress made in HTML, I mean no official progress, from like 2000 to 2014. There was the whole thing of rewriting your web page so that it would be XHTML compliant, and it ended up being super dumb; it was too much work and there was too much momentum behind HTML itself. So that's why we finally got HTML5, which said, you know what, it's fine, not everything needs to be XML. The cool thing is that right now the HTML spec is what's called a living document, so it's continually being evolved, and you can go check out the latest spec and the latest on what's going on. So you can write HTML. Now the problem is, think about this from a browser perspective, right? You're the user agent, the browser, you're rendering HTML content for a user. Say we go to some site that was written in 1996, so the highest version of HTML it could use was HTML 2.0. What are you going to do? Are you going to tell your user, "Sorry, that's an old website you're trying to visit, I'm not going to render that"? What do you do? You do your best to try to render it, right? So this is part of the reason why browsers are some of the most complex software you have on your devices, and this is why there are so many vulnerabilities. We've seen that with the binary analysis you've been looking at: you have an application here, a C++ application, that not only has to render and display and have a GUI and parse this HTML language, but it has to support every possible version, because you don't
know what kind of garbage you're going to get from that server. And if you refuse to render content, because, as we'll see, let's say the page forgets an ending tag, and you say, "Oh, too bad, that page sucks, I'm not going to render it," your users are going to use a different browser, because your tool sucks if it can't get to their websites. Nobody blames the website; they blame Chrome or Firefox or Internet Explorer. So you have to not only support all of the standards but all of the, literally, look at some old web pages, it's literal garbage; it makes almost no sense, but it still kind of works. So you still have to be backwards compatible with websites written in 1995 or 1994. I mean, it's crazy; web browsers, like, why do they even work? So, the basic idea, rather than the idea I proposed earlier, is that you mark up the page with tags, which try to add semantic meaning to the raw text. A start tag is going to start with a less-than symbol, then the name of the tag, in this case foo, and then a closing greater-than symbol. Pretty easy, right? That's followed by some text, whatever your text is, and then it ends with an end tag, which is the same thing as a start tag but with a slash: you have a less-than symbol, a slash, foo, and then a greater-than symbol. It's pretty simple, and that's it; this is what everything is based off of. You can also have a self-closing tag, which is actually syntactic sugar that corresponds to a start tag and an end tag right after each other. So you can have something like this, which is a less-than symbol, bar, space, slash, greater-than symbol; bar not as in the bar symbol, but b-a-r, just as the element name. And you can have void tags, tags that have no ending tag. So for something like an image, an img tag, you have a less-than symbol, img, and then a greater-than symbol. And this is the big difference: for people who believe in XML, tags have to be matched by an end tag. HTML is not
that strict, and people hated that. Things like an image don't really make sense to have a closing tag, because semantically, what does the text inside of an image tag mean? That's kind of a separate point. But the idea is the tags are hierarchical, so you can nest tags within tags and tags and tags, all the way down. You can't break the nesting, though: each element's start tag has to be followed by its closing tag before the enclosing element closes, so you can't have overlapping nesting. That doesn't make sense, and it makes more sense when you think of them as hierarchical, so they form a tree structure. For instance, here we can have an HTML document with a start html and end html tag, a start head tag and end head tag, a start title tag and an end title tag with the text "Example" in it, a body tag, and then inside that a p tag, which stands for paragraph, that says "I am the example text." The cool thing is this forms a tree structure, if you kind of tilt it a little bit and think about it, where the root is html; html has two children, head and body; head has a child, title, and title has a child of the text "Example"; and body has a child of a p tag with the text "I am the example text." So it looks roughly like this if you look at it as a tree structure. But this is not enough. How do we know what image we want to put in, right? We have an img tag, but we need more, so we need to add attributes to these tags. We don't just want to have the tags themselves. How do we specify links? We can create a tag called an anchor, which gives links, but we need to provide the URI to go fetch when you click this link. So this is done with attributes inside of tags. The idea is that attributes live inside the start tag, in between the less-than symbol and the greater-than symbol, and the tricky thing is there are four different types of syntax here. Yeah. So you can have an attribute bar in a tag foo, just separated by a space, on its own, nothing else. So this means
that the tag foo has the attribute bar. You can have foo and then inside there you can have bar equals baz, which means the tag foo, or the element foo, has an attribute bar which has the value baz. And then you can single-quote the attribute value baz, or you can double-quote it. So these are four different things. The first one is different because it only says that the element foo has an attribute bar, so you can think of it as binary: it's either there or it's not, with no explicit value. The other ways give it a value. And then you separate multiple attributes with spaces, and you can get very complicated things here to add semantic meaning to the tags. Which now gets us to finally complete the loop and say, how do we do hyperlinks? Why do we need hyperlinks? So we can get URIs to other documents that we can make HTTP requests to, to get new HTML pages, which contain new hyperlinks. So it's an anchor tag, a lowercase a, a for anchor. Oh, also, HTML is not case sensitive, so you can use uppercase or lowercase or a combination of both. Don't be a monster; use lowercase. The href attribute sounds like a really weird thing, but "hypertext reference" is where href comes from; it's used to provide the URI, and the text inside that anchor tag is the text of the hyperlink, so that's what gets underlined and put in blue. So it's all starting to come together now. We can have an anchor tag with an href of http://google.com and text of "Example," and that will look like a classic link that we're all super happy about. So this is the basic structure of an HTML5 page. We first have this thing with a less-than symbol, an exclamation point, and then doctype html. I can't remember if that doctype is a special element, or if the exclamation point means anything. It's not an HTML comment, that's something different, but yes, this is definitely necessary to tell the browser it's an HTML5 web page. I'm trying to think of, technically, if that's an
attribute or if that's an element called doctype or, yeah, exactly, I think it's... but it's definitely necessary to say it's an HTML5 page. Then you need a meta tag in the head that tells the browser what character set you're using: is it UTF-8 characters, is it ASCII, is it all these other types of characters? So you can think of this as a way for the server to tell the browser, the client, this is how to parse what I'm sending you. Then some title, a body, and an anchor with an href. So actually everything is needed here except for this anchor tag in the body: you need the doctype, you need html, head, meta, title, and a body to be a compliant HTML5 page, and you can easily find this stuff online on how to do this. All right, then we get to the browser. So "user agent" is the main term; it's responsible for parsing and interpreting the HTML and displaying it to the user. Why is it important to have this distinction between browsers and user agents? The user agent is acting on behalf of the user; browsing has a very specific connotation. The user agent could be curl, like I had, where I'm looking for just output in my terminal. We could be using links or another type of non-GUI web browser. I sometimes use w3m in Emacs to look up documentation in a window in Emacs, so it formats it differently. A REST API client, yeah, that could be a different one. But we're very familiar with these. So this is an example of that example page in Chrome, and then also in the links web browser, so you can actually use this and mess with that. Now the question is, what are the special characters in HTML? Less-than symbols, greater-than symbols, right? Slashes; spaces; sometimes double quotes, depending on if you're inside an attribute or not; single quote, double quote; as we'll see, ampersands; also equal signs, because our attributes are foo equals bar. Again, now we have this problem: how do we include these? If I want to create, let's say, text that says "5 is less than 10," some math equation, how does the browser know that the less-than symbol is text that I
want to display and not the start of a new tag? Or how do I put in text that says "here's how you create a basic HTML page," right, with the start tag of html in it? We need some kind of encoding. Should we use percent encoding? Yeah, we could; that would make sense. But no, we don't. I'd have to look at it historically, but I think it's because HTML is based on SGML, which probably already had this encoding, is my guess now that I think about it more. Because yeah, having these two different encodings and all these other things causes a ton of problems, as we'll find out. You'll actually find a lot of names for this: it's called entity reference or entity encoding; in the HTML5 spec you'll see character references. The idea is the ampersand is now going to be our special character, and there are three different types of syntax here, which is crazy. You have a named character reference: you have an ampersand, some predefined name, and a semicolon. This is how all the entities are encoded, the things that you want to be shown as text. You can have a decimal character reference: you have an ampersand, a hash symbol, and then the decimal (so not hex, but the base-10) Unicode code point of what you want, and a semicolon. And you can have hexadecimal: you have ampersand, hash, x, and then a hexadecimal Unicode code point, and a semicolon. It's a cause of a lot of different vulnerabilities, as we'll find out, so it's really key to understand this encoding. Some examples: obviously the ampersand needs to be encoded if we want to use it, right? We can encode it three different ways. Ampersand, a-m-p, semicolon encodes an ampersand. We can do ampersand, hash, 3 8, semicolon, which is the decimal code point of ampersand, or we can do ampersand, hash, x, 2 6, semicolon. And we can even put zeroes in front of the hex encoding, all kinds of crazy stuff. To create the e with the accent in my last name, it's like "eacute," or any of these three forms. So we talked about why we
need to encode less-than symbols, right? It's actually incredibly important; otherwise the browser may think we're trying to start a new tag. We don't want to start a tag, we want to show text to the user, so it has to be encoded. You'll see ampersand, l-t, semicolon in a lot of pages when they're doing this. OK, we got so close; we have a little more to do here. When we get back, we're going to talk about how to give input to the web application, which is going to be how we get our input in.
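The entity encoding discussed above can be tried out with Python's standard `html` module. This is a small sketch rather than anything a browser actually runs; the sample string and the variable names are made up for illustration. It escapes the special characters in some text, then shows that the named, decimal, and hexadecimal references for the ampersand all decode to the same character.

```python
import html

# Text we want displayed literally, not interpreted as markup.
text = "5 < 10 & 10 > 5"
escaped = html.escape(text)
print(escaped)

# Named, decimal, hexadecimal, and zero-padded hexadecimal references
# for the ampersand all decode to the same character.
references = ["&amp;", "&#38;", "&#x26;", "&#x0026;"]
decoded = [html.unescape(ref) for ref in references]
print(decoded)

# A named reference for an accented character.
e_acute = html.unescape("&eacute;")
print(e_acute)
```

Round-tripping with `html.unescape(escaped)` gives back the original text, which is the property a browser relies on when it shows `&lt;` to the user as a less-than symbol.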