Alright folks, hello. It is Wednesday. Okay, perfect. Alright, so we will discuss the future of the class on Monday — we'll talk about projects, kind of the rest of the class. Today we're going to talk about web security; we're going to introduce everyone to the web. Any questions about the homework? You will get your grades — we'll say tomorrow at the latest, so as not to commit to anything I don't know we can hit. We'll email you directly with your grades, so it will be very clear. We're going to email you individually, not everyone at once. Any questions? No, I haven't. It will be your cumulative grade, up to this point. The midterms are being graded; that will probably be sometime next week. We actually calculated it, didn't we? There are eight problems for — 142 students? 143. That's like 1,100 questions to grade, in all of your amazing handwriting. That's good. Any other questions? Okay. So, the web. Anybody use the web today? Yeah. Anybody not use the web today? I'm actually curious. I don't think it's possible. First question: what's the difference between the web and the internet? Are they interchangeable? Are they the same thing? No, especially not for you, because you're computer scientists — you're not a journalist or a media person writing an article about it. So, what's the difference? Let's start with the easy one. What's the internet? Does someone want to define the internet? Yeah, in the back. A network of networks, right? What protocols? TCP. UDP. IP. IPv6. Yeah, that's about it. The internet is kind of all those interconnections. But HTTP lives at the top layer, right? The internet doesn't care about HTTP; all it cares about is TCP, UDP, and IP. So, then what's the web? It's a higher-level layer. What level? Application level. Application level? Okay. So, what makes up the web?
How do I know if something is the web or not? What's that? A set of protocols. What set of protocols? HTTP. HTTP? SMTP? SMTP is email; it has nothing to do with the web. FTP? FTP is its own protocol; it has nothing to do with the web. These actually both predated the web. REST? So, higher-level kinds of protocols. HTTPS? Okay, I'll take it — HTTPS is basically HTTP over TLS. What do you get back when you make an HTTP request? HTML — that's the second one. And the other one is URLs. These are the three core technologies of the web: HTTP, HTML, and URLs. That's what we're going to talk about. So, the web was initially created by Sir Tim Berners-Lee in 19... we'll get the exact dates in a second. Tim Berners-Lee actually has a great story. There's a fantastic book called Weaving the Web, which is him recounting the tale of how he created the web. Tim Berners-Lee was a research scientist at CERN. Has anybody heard of CERN? What do they do? Science — thank you, correct answer. They smash particles together, right? The LHC, the Large Hadron Collider, right? They're probably trying to end the Earth as we know it by creating a black hole. What they say they're doing is studying science, studying particles, by smashing particles together, right? So, CERN is a huge organization, even back in the 90s. And you think about ASU being a huge organization — and we're not trying to collide atoms into each other, right? That requires a lot of different scientists with a lot of different specialties. And, well, I guess I should call him Sir Tim, because he's a knight now. But back in the day, he was just regular Tim, and I don't think he ever expected any of this. So, he realized, wow, it's really difficult to even know who's working on what in the organization, right?
There was no real centralized phone book, and here you have, basically, academics who were there and gone, visiting for short periods of time — this constant churn. He was like, man, it would be great to have some kind of centralized location so that we can know where everybody is, where their office is, what their phone number is. And he was also really interested in this hypermedia idea, which was the idea that you could view a document and there would be a way to access further information directly from that document — in our case, in the form of links. So he submitted this proposal to CERN to build, basically, this hypermedia system to find people, and to see what everyone was working on. So this is the first, basically, graphical version of the World Wide Web. A super interesting thing is that he developed this on a NeXT computer — and you know what NeXT was? Yeah, this is Steve Jobs' company after he got pushed out of Apple. They developed these devices, whole operating systems and computers, and this is where he developed the first hypermedia browser. And another super interesting thing here is that slash editor at the end there. In his view, he actually started the web with more of a wiki-style view: you could not only browse HTML content, but you could also edit it right in your browser. So that's why you have this idea of a browser slash editor. And actually, I will tell you, I'm still trying to get this to work — I've actually gotten the NeXT operating system to run in... I can't remember which hypervisor I was using, probably VMware. It actually does work, and you can browse the web using this old browser, because the basic protocols still work. This is widely recognized as the very first web page. And it's actually insane to think about how close this is to a modern web page today, right? You go to a web page and it can look exactly like this: information about the web.
This is Sir Tim. If you create something as important as the internet — and, I guess, you're a British citizen — you may be knighted, too. So maybe the rest of you can look forward to that. He first wrote this proposal in 1989, then finished the first website at the end of 1990. So when you think about that, that's how many years ago? 27 years? And you think about how popular the web is now. Even think about how popular the web was in 1998 — by 1998, 1999, we already had the first dot-com bubble. That's literally nine years after he first created the internet... oh, sorry, look at me — after he first created the web, his very first version, in 1990. That's an insane rate of adoption. This is the book, highly recommended. So really, he combined all these technologies: hypertext and the internet — TCP/IP. He built this on top of TCP. And the idea really grew into universal access to a large universe of documents. He envisioned the web as all these documents out there, and you want to be able to access one, which should give you links to other documents, right? So you can browse around information in this way. There are three central questions when you try to design and imagine a universe of documents. First: how do you name things? How do I know what document to get? That's a central question, right? Second: how do I request a document, and how does the other party — let's call it a server, or whatever — respond to my request for a document? And the third one: how do we actually create hypertext? How do you create a document that links to other documents? And so this gives rise to the three central technologies we talked about. URIs and URLs define how I identify a specific resource on the web. HTTP defines how, once I know about a resource, I make a request for it. And finally, HTML defines what a hypertext, hypermedia document is. Any questions on this?
These are the three key technologies — without these, you don't really understand how the web works. There's a lot more, but these are the first three. And they operate in this beautiful loop; this is why I really like this. You have the URI, and the URI tells your browser, or other user agent, how to fetch content — the URI defines how to make an HTTP request. Then when you make a request, you get a response. That response will be in HTML. That HTML has links on the page, which are URIs, which point you to other documents. So you have this beautiful circle of web life. So why is it called the web? You should try to visualize it, right? One document has links to other things, and each of those documents has links to other things, right? Why is it not a tree? Because the links can go back. The links can go back, right? You can have bi-directional links, pointing all over the place, right? Yeah? Why isn't it called a graph? A what? A graph — it has links, it's a graph. Well, "web" sounds cooler. And then you can have a "crawler." I think that may be the real answer, because there's no good answer: Tim was like, web sounds so much cooler than graph. If I ever meet him, I'll ask him. So, starting with the first technology. URIs are the center of everything, right? If you don't know how to name something, how can you ever ask for a document? You've seen URIs before. They are essentially metadata about how to reach or find a specific resource. So a URI answers: which server has this document? How do I ask that server for the document? And how can the server locate that resource? Meaning, now that I know which server to talk to, of the thousands of documents that this server has, how does that server know to give me this specific one? There are, as I'm sure you're aware, RFCs, right? There's an RFC that describes exactly what a URI is.
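That loop — request a document, get HTML back, follow its links to the next URIs — can be sketched with the standard library. This is a minimal sketch, not how any real browser works, and the page content here is made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag -- the URIs that continue the loop."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# A made-up HTML response body; a real one would come from an HTTP request.
page = '<html><body><a href="/people.html">People</a> <a href="http://example.com/">Out</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # the URIs a user agent could fetch next
```

Each extracted link is itself a URI (possibly relative), which the user agent would turn into the next HTTP request — closing the circle.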
We'll talk about that, but that's more related to the third question: how does the server locate the resource? So, syntax. It's pretty easy. The U in URI stands for uniform — and this scheme works for more than just HTTP requests. That's probably why some of you were thinking about FTP: you can give somebody a link, a URI, with FTP in it. So that's the first part of a URI: the scheme, which is one of a set of standard schemes — HTTP, HTTPS, FTP, whatever. The authority defines who the server is — who I should talk to for this information. In between those two is a colon and two slashes. Then there's a slash and the path part of the URI. Then we can have a question mark with a query part, and after that there can be a hash with a fragment part. So, breaking all these down. The scheme is the protocol, like we said. The authority is the entity. An interesting thing is that the authority is the one that controls how to interpret everything else — this actually goes back to your question. Everything else is essentially opaque: what the path, the query, and the fragment mean depends on the authority (the fragment also depends on your user agent — we'll talk about that in a second). That path, those query parameters — as an agent, as somebody trying to request a resource, they're opaque to you; you don't care. Oftentimes they're human-readable text so that developers can understand them, so you can understand them, right? But there's nothing that says they have to be that way. The authority is usually a server name, usually in this form: username, at-sign, host. So you can actually specify what username at what host — this is why, if you've ever gotten a link with FTP credentials already in it, this is how they're passed. And then a colon, and then a port.
So this is how you specify a non-standard port — like google.com, colon, some random port number — that's how you specify the port there. And this is a TCP port we're talking about, right? Okay, the path is usually exactly what we think of as a path, but it does not have to be — that's the important thing. It is usually a hierarchical path with slashes, just like a file system path. The query is used to pass what we call non-hierarchical data — data that doesn't fit the folder paradigm of the path. And again, these are all conventions. And the fragment is actually super interesting: the fragment is used to identify a subsection or sub-resource of the resource. We'll talk about exactly where that comes into play in a bit. Any questions on URIs? Yes — the two slashes? HTTP, colon, slash slash, then the authority, google.com... I can't remember, that's a good question. Alright, we'll look at that in a second. It's part of the authority syntax — I can't remember if it's required; maybe I need to change the syntax on the slide, or maybe there's additional sub-syntax from the RFC that I'm not putting in there. I've definitely read a lot of RFCs, so I'll have to read a little more about that. So we can then parse these URIs. We can say that, okay, this URI is for scheme foo. The authority — well, with the slash slash — is example.com colon 8042. The path is /over/there, the query is test=bar, and the fragment is nose. You can have FTP — say, a link to an RFC on an FTP server. Same thing: ftp, colon, slash slash, the authority, and then the path. This one has no query and no fragment. You can also have things like email. This is how you define an email address in a URI: here the scheme is mailto, and the rest is the mailbox at asu.edu. So here the scheme is actually the one defining what the authority means.
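Python's standard urllib.parse module parses URIs along exactly these lines. A quick check using the foo:// example from above:

```python
from urllib.parse import urlsplit

# The example URI broken down above: scheme, authority, path, query, fragment
parts = urlsplit("foo://example.com:8042/over/there?test=bar#nose")
print(parts.scheme)    # foo
print(parts.netloc)    # example.com:8042  (the authority)
print(parts.path)      # /over/there
print(parts.query)     # test=bar
print(parts.fragment)  # nose
print(parts.port)      # 8042 -- the TCP port, pulled out of the authority
```

Note that the parser assigns no meaning to the path or query; it just splits on the reserved characters, which is all a user agent can do — interpretation is up to the authority.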
Yeah, so there you have the slash slash before the authority — that's a good question. Another fun one is this URI. Why does it look weird? Is it valid? What's the scheme? What's the authority — example.com, or something else? The path? Why isn't this part the scheme? Why isn't the hostname example.com, with the port being everything after the colon, and the path being the rest? I mean, we know the answer, but we have to parse the URI first before we even know whether the port number is valid or not. The RFC — what about the RFC? It tells you what characters can appear where. Yeah, and that's part of the problem, right? We can already see that there are special characters in a URI: colon, slash, question mark, and hash are all characters that affect the parsing of the URI. So just like in bash — or in your favorite programming language — when you have a string in double quotes and you want to include a literal double quote character, what do you have to do? Escape it. There has to be some kind of escaping scheme. Same thing with URIs: there is a set of reserved characters that are supposed to always be encoded, and the way it's done is with what they call percent encoding. Now, I will definitely agree with any one of you who says the fact that we have like 10 different encoding systems on the web is insane. You are absolutely correct — it's 100% insane. But we're kind of stuck with all these parts, so you need to learn them, and hopefully we'll do better in the future when we design new things. So, the idea behind percent encoding: it's stated in the RFC that we need to use percent encoding for anything that's not a letter, a digit, a dash, a dot, an underscore, or a tilde. Those are the characters we can use freely.
Everything else should be percent encoded. So what is percent encoding? You use the percent sign, then the hexadecimal representation of the character. So if you wanted to URL encode lowercase a, what would that be? Percent, 6, 1. I've looked at these things so long that I've memorized it. So it's the percent sign followed by two hexadecimal digits. This means an ampersand gets turned into %26. And the percent sign itself — can you use a percent sign? What if you wanted to send a literal percent sign, or access a resource named "percent"? Yeah — the percent character is a special character now, because we're using it to define the encoding. Just like with double quotes: if you escape double quotes with backslashes, what then do you have to escape? Backslash, right? You then need to escape backslashes with an extra backslash, and if you're passing all that to bash, you get these insane, horrible quadruple backslashes — gives me nightmares. So any time you want a percent sign that's not part of percent encoding, it has to be encoded too. Space gets encoded as %20, and so on and so forth. So, back to our previous example: if we want to fix it to parse the way we think it should, which characters do we need to change? The colon here. What else? It depends on what we want to do — maybe the slash, or maybe the question mark that's really part of the path. It depends on what we mean with this URI. So if we fix it the normal way, we percent encode the colon, which is %3A — and the hexadecimal digits A through F can be either uppercase or lowercase; case doesn't matter there — and the slash we encode as %2F. We've looked at enough shellcode to make that distinction.
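You can check all of these encodings with urllib.parse, which implements this percent-encoding scheme (passing safe="" so that even the slash, which quote leaves alone by default, gets encoded):

```python
from urllib.parse import quote, unquote

# Unreserved characters (letters, digits, - . _ ~) pass through untouched
print(quote("a", safe=""))    # a
# Reserved and other characters become % plus two hex digits
print(quote("&", safe=""))    # %26
print(quote(":", safe=""))    # %3A
print(quote("/", safe=""))    # %2F
print(quote(" ", safe=""))    # %20
print(quote("%", safe=""))    # %25 -- the percent sign must itself be encoded
# Decoding accepts upper- or lowercase hex digits
print(unquote("%3a%2F"))      # :/
```

The %25 case is the same double-escaping story as backslashes in shell strings: once a character becomes the escape marker, it needs escaping too.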
So, for you to make a request based on this URI, what do you need to know? We need to know the authority, so we know what server to talk to. We need to know the scheme, to know how to talk to it. And we need to know everything else — everything we'll need to send, as we'll see. So if I gave you this URI, it uniquely identifies one resource, one document, right? It has a scheme — it's not ambiguous. It has an authority. There's a path. There's a query. Everything you need is all there. And so if I give this to you, or give it to anybody, I know we're all going to go to the same place and ask for the same resource. Do you think we'll get the same response? No. No, definitely not necessarily. But we'll at least be able to talk to the same machine, the same authority. So this is what we think of as an absolute URI: this URI will take us all to the exact same location. But you've probably seen, in the browser, when you hover over a link you're about to click, it's not always in this form. What are some other forms? Yeah, just ../, ../. Right. So URIs can either be absolute, or they can specify a location relative to the current document. Just like when you're on a Linux machine, you're in a certain directory: if you do cd .., where that takes you depends on where you are — it depends on your context. Same thing here: if you see a URI like ../foo, it will take you to a different URI depending on where you see that link, what the context is. By itself, a relative URI makes no sense, right? It only makes sense in the context of a document. And relative URIs have a syntax of their own. So here, this one means: use the same scheme that you used to access this page, but use that scheme to access example.com's demo.html.
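These resolution rules can be tried out with urllib.parse.urljoin, which implements the RFC 3986 reference-resolution algorithm. The base page here is hypothetical — it stands in for "the document this link appeared on":

```python
from urllib.parse import urljoin

base = "http://example.com/test/index.html"  # hypothetical enclosing page

# Scheme-relative (//host/...): keep the scheme, replace authority and path
print(urljoin(base, "//other.example.com/demo.html"))  # http://other.example.com/demo.html

# Path-absolute (/...): keep scheme and authority, replace the path
print(urljoin(base, "/test/help.html"))    # http://example.com/test/help.html

# Relative (../...): resolved against the directory of the current path
print(urljoin(base, "../people.html"))     # http://example.com/people.html
```

Change base and the same relative references resolve to entirely different absolute URIs — context is everything.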
So if you're running this from a page that you got via an HTTP response, it will make a different request than from an HTTPS response, because it will use the scheme of the enclosing page. Then /test/help.html is relative to the current scheme and authority — those remain the same. And the ../ ones we've seen are relative to the current scheme, authority, and path. In all these cases, context is incredibly important. Given the context, you should be able to tell: if I said, hey, on this page you saw a link to ../people.html, where is that going to take you? What HTTP request is that going to generate? The same goes for these other lines, right? Yeah? I believe so — I think technically, by the standards, it would be web-server dependent, so they could probably choose how they want to handle that. But let's try it now. We go to google.com/////index.html and see if it works. Nope. Okay. The second technology: Hypertext Transfer Protocol. If you look at the name, it's clear these were all developed in tandem, right? It literally contains the name of HTML: hypertext. You have URIs, which tell you how to make HTTP requests, and you get back HTML in the HTTP response. So this should be very easy for you. Why? You already wrote a web server, and you went through the whole RFC to do so, right? Okay. So, broken down, HTTP is the protocol for how a web client requests a resource from a web server — and that's all it is. So what do we think of as web clients? Browsers. What else? Web crawlers. What else? Netcat — you could even use netcat; you'd have to do all the protocol yourself, but you could definitely use it. And think of programs: any program can make web requests, whether it's a botnet or an automated script, right?
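Doing the protocol yourself, netcat-style, really is just writing the right bytes to a TCP socket. A minimal sketch, assuming HTTP/1.1 with a Host header; build_request and fetch are illustrative helpers of my own naming, and the actual network call is left commented out:

```python
import socket

def build_request(method: str, path: str, host: str) -> bytes:
    """Assemble a minimal HTTP/1.1 request: request line, headers, blank line."""
    return (
        f"{method} {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    ).encode("ascii")

def fetch(host: str, raw_request: bytes, port: int = 80) -> bytes:
    """Send the raw bytes over TCP and read the whole response."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(raw_request)
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return b"".join(chunks)

req = build_request("GET", "/", "example.com")
print(req)
# response = fetch("example.com", req)  # requires network access
```

This is exactly what you would type into netcat by hand — the whole protocol, at this level, is just lines of ASCII separated by CRLF.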
So there's the requests library in Python, which is freaking the best thing ever for making HTTP requests — that, in essence, is a web client. You also have GUI-less browsers: you can run links or lynx to browse the web from a terminal. I used to use Emacs — you can set up Emacs with W3M so you get web browsing inside Emacs, which is actually super useful for documentation, because you can keep the docs right there in your Emacs session and never go anywhere else. It's super awesome. Okay, so HTTP runs on top of TCP, and it uses port 80 by default — that's the other super important thing. So when you see a URI, if there's no port specified in it, what port do you assume? It depends on the scheme: if the scheme is HTTP, you assume 80. Quick note on versions — think about how early this all happened, right? The first web page was in 1990, and we had already standardized the first version, HTTP 1.0, in May 1996, then version 1.1 in 1999. And version 2.0 is actually based on a newer proposal called SPDY, which very much changes the paradigm we're going to learn and takes everything to binary: instead of ASCII text being transmitted back and forth, as we'll see, it's actually a binary protocol, which may be more efficient. There are all these other components, but since it's still not used very widely, we're going to leave it for now — there's an RFC, or proposed RFC, you can read to learn how it works. HTTP overview: we basically have only two things we care about here, the client and the server. The server — you wrote a server — listens for incoming connections. Incoming what type of connections? TCP connections. And you actually know exactly how that packet got all the way from the client to the server. The client opens the TCP connection to the server, on either the port given in the URI or the default, 80. Then the client sends a request to the server; the server reads that request and sends a response back. It's actually super
easy, at a very high level. So we have a client running some web client software. The other thing about the term web client: the HTTP language for this is "user agent" — it's acting on behalf of some user to access a resource, so you'll see "user agent" or "UA." Client is also how I think about it, because I come at this from the angle of server-side web code, thinking about code that runs on both the client and the server — I think that's how the client-server naming came about. So that's how I'll present it, but keep in mind "user agent" is also a term. So: the client makes an HTTP request; the server has to parse that request and, based on it, tries to return an HTTP response. But oftentimes the real web is more complicated. We can think about it in this super abstract model, but as you get closer — when you start wanting to do research on the web, or really understand the web — you need to understand all these different modifications, and even the early drafts of the HTTP 1.0 spec point some of these out. There could be a firewall between the client and the server. There could be a proxy between the client and the server that is maybe caching responses or doing something like that. And the client itself has a cache, so it will not always fetch a page when you think it's fetching a page — it may load it from its local cache. So your request may go through your organization's firewall and proxy, then maybe hit the server, and so on and so forth. You actually have this pretty complicated workflow of multiple parties interacting just to serve one HTTP request. So, the request: what does the request need to convey? What type of request is it? Yes — the method. What else? The payload. Yes. So we need all that information. What other information does it need? Does it need the scheme from the URI? Does it need a scheme?
No — by making a valid HTTP request to the server, you've already demonstrated the scheme; that's all the evidence you need. What about the authority? Yeah — but, as the server... Resolve the IP address? Right: for the request to even get to us, if the authority was a DNS name, the client needs to have already resolved that name to an IP address. But sometimes the server will be managing multiple hosts — yes, hold on to that thought. So we have the method, and we have the resource: basically everything in the URI after the authority. The protocol version is actually a super important one — even IP has this, a protocol version number in the header. Why is this important? Maybe different versions have to be parsed differently? Exactly — I need to know the protocol version to know how to interpret the request. Is it valid HTTP 1.0? Is it 1.1? This also lets you increment the version number, and if everything up to the protocol number parses the same, you can do cool stuff. Client information: usually the client will offer up some information in the headers about itself — who is making this request. And what does the server always know from the request?
Exactly: the source IP address and the source port, because it's receiving a TCP connection, so it always has that information. And sometimes there's a body — the payload. So, the syntax: we have a start line, followed by headers, followed by the body. Each line is separated by a CRLF, which is the control characters carriage return and line feed. This is super important — I know a lot of you didn't follow this in your servers, and at some point I'm going to start taking points off for it, but this year was not that year. Headers are separated from the body by an empty line — this is how you get around not knowing in advance how many headers you're going to send. So, the method: the method part of the request line is the method that the client wants applied to the resource. GET says, give me this thing. POST says, hey, I'm actually going to give you some data in the body. PUT is one that's not as well known; the REST style of APIs actually did a lot to bring attention to these other HTTP methods. And HEAD is very cool — it's actually useful for debugging. It says: respond to me as if I had sent a GET request, but don't include a body. This is useful for testing different things — like if you're trying to see whether a resource has changed and the body would be huge, you can issue a HEAD request first, and the headers may be enough to tell you whether it changed or not. There are all kinds of methods, though; I'm going to briefly skim over these: OPTIONS, DELETE, TRACE, CONNECT, all kinds of crazy stuff. And the web server can actually define arbitrary methods — it can extend and support any arbitrary method it wants. So, for example, this would be an HTTP request. We first have the first line, which is the method — exactly — followed by white space, followed by the path, including the query, yes, and then the protocol version number. Then we have our headers. Headers are usually colon
separated fields — key: value pairs. So here we have User-Agent, and the user agent is curl. Then we have this Host header, which tells the server who we asked for — the authority. Accept tells the server what kinds of things we can accept back, what file formats. The Host header we'll talk about in a second. This is a super simple command-line curl request; modern requests are much more complicated. They generally all have the same first line, but they'll also say what kinds of encodings they accept, so the server can, for instance, gzip the content before sending it over. And the user agents are much more complicated and more detailed. So, a response — yeah? How do we check authentication, whether a client is allowed to access that resource? Hold that thought; lots to get to there. So, a response. An HTTP response: when you're responding, you also respond with the protocol version number — here's an HTTP 1.1 response; again, this makes sure you don't have protocol mismatches — a status code, which says exactly what this response means, a short reason phrase for that status code, some headers, and then a body. The syntax is the same idea as for the request — headers, then body — except the status line is slightly different from the request line. These status codes are really important: this is where "404 Not Found" comes from; that's a status code in HTTP. Status codes are three-digit codes, and the first digit defines broadly what category it is. The 100s are all informational: this means something like, hey, I got your request and I'm continuing to process it. I've literally almost never seen these — actually, that would be a good hacking challenge or something, hiding information in these 100 codes; I honestly don't know how browsers would respond to all of them. 200 is the one you want to see. A 200 is like: yeah, that was awesome, that was successful, I got your request, I understood it, I accept
your request. The 300s are the way for the server to say that the thing you were looking for is somewhere else — broadly, 300 means: hey, you need to do more work for me to handle this request. For instance, maybe the thing you're looking for moved somewhere else; maybe some other authority is handling that resource. 400 means you screwed up: the client's request could not be fulfilled, or there's an error in the request. Like 404 means: you messed up, man, I don't know what that resource is, I don't know what document you're talking about. 401 — what do you know — authorization. Yeah, 401 is: hey, actually, you're not authorized to view this. And that brings us, very quickly, to a very easy form of authentication. Based on what we've seen so far, what can the web server authenticate you on? Your IP address — that's about it. Could you authenticate on something else, like the User-Agent? You could, but why would you not want to? Yeah — it comes from the user, right? The User-Agent header is set by the client, and we know from the stuff we did on binaries that user input is the devil and should never be trusted. All of this stuff is untrustworthy. But we also know how difficult it is for somebody to spoof their IP address in a TCP connection: it's pretty hard, because they'd have to complete the three-way handshake. Not impossible, but a lot more difficult. Okay, so 500 means I messed up — the server is saying: something happened, I screwed up; your request was fine, but I blew up. A 500 error usually means there's some error on the server side and it can't handle the request. Anybody see — so the question is: browsers are set up so that if they receive one of these without data, without a body, they'll handle it; but can the server still respond with whatever data it wants, even for one of these errors? Yes, fundamentally, yes. The body can still be whatever the
server decides. The user agent will decide, based on how it's programmed, what to do next depending on this response. And so there's this complex relationship between servers and clients, where as a server you have to ask what client behavior you can rely on when you send a given code back. So yeah, you can actually send data back — this is why, if you get a 404 error, sometimes you just see your browser's message and other times you see the website's custom 404 page. That depends on whether they're sending you content back or not, which depends on how the web server is configured. Same with 500 errors. So the status line will be something like "200 OK" — and there are other types of 200s and 300s. You have 301, which means the thing you're looking for can permanently be found over at this other place, and you can cache that forever, and caches along the way can cache that redirect too. 307 is a temporary redirect, which says: this time the thing you're looking for is over there, but maybe not always. 400 means you made a bad request — if you send something really messed up, something for the server to barf on, like typing gibberish in place of the method, it would probably give you a 400 error. Hopefully. 401 is Unauthorized, 403 is Forbidden, 404 is Not Found, 500 means there's an internal server error, and there are other errors in the 500 series. So in this example we're using curl to access the resource / — there's always at least the slash, the root; there's nothing else in the path here — on the host www.google.com. And the response you get back is maybe something insane like this, but it still follows that exact format. We still get the protocol sent back — it's an HTTP/1.1 response — we've got a 200 status code, and we got "OK", so everything is all good. Then from there we have all kinds of things: we have the date, we have Expires — when this thing expires. All of these are cache controls, and you can see from this how important caching is to the architecture of the web.
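The response format just described — a status line, then headers, then a blank line, then the body — is simple enough to parse by hand. Here's a minimal sketch, not a real HTTP implementation; the 404 response bytes are made up for illustration:

```python
# Minimal sketch: split a raw HTTP response into status line, headers, body.
# Illustrative only -- real clients should use a proper HTTP library.

def parse_response(raw: bytes):
    head, _, body = raw.partition(b"\r\n\r\n")      # blank line ends the headers
    lines = head.decode("ascii").split("\r\n")
    version, code, reason = lines[0].split(" ", 2)  # e.g. "HTTP/1.1 404 Not Found"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body

raw = (b"HTTP/1.1 404 Not Found\r\n"
       b"Content-Type: text/html\r\n"
       b"\r\n"
       b"<h1>Custom 404 page</h1>")

version, code, reason, headers, body = parse_response(raw)
print(code, reason)   # 404 Not Found
print(body)           # the server attached a custom error page anyway
```

Note how the 4xx status and the body are independent: the code says the request failed, but the server is still free to send any body it wants — which is exactly the custom-404-page behavior described above.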
All of these headers are standardized — the standards define all the headers you'd need. Oh, the other cool thing here: any header that starts with "X-" means what? Experimental — an extension. So this is a feature that's not currently in the standard but that people want to test out: you take "X", then a dash, then whatever comes after that. Old browsers, if they see something they don't understand, just ignore it, so you can start sending headers like "X-Happy-Birthday" — I think there's some company that put a link to apply for jobs at that company in their HTTP headers. Your browser, if it sees a header it doesn't know, just ignores it. (There's a hand in the back.) So the other critical thing here is that we set the Content-Type header. Like I said, there's another standard called MIME — and what does that stand for? Multipurpose Internet Mail Extensions. Wow, I have never gotten that answer before — which makes sense: it comes from email, from attaching files and telling someone what kind of files you're sending. So here we're saying it's an HTML response that I'm sending back. And even this is actually a tricky thing — the character set I used to send this back to you: is it ASCII, is it Unicode, is it another one? Because there are lots of others. What type of encoding? The important thing is that what the browser gets back here is just bytes. It's literally just getting a bunch of bytes back, so it needs to know how to interpret those bytes. So this is the payload here — we know it's an HTML page because it told us that in the Content-Type. Okay, cool. So let's go back to our previous example: why do we need to tell the server the host? Yeah, you've got it exactly right — in case that server is managing multiple hosts. In HTTP/1.0, there was no Host header required of clients.
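That last point — the browser only ever receives bytes, and the declared charset tells it how to turn them into text — is easy to demonstrate. A small sketch (the header string is made up for illustration):

```python
# The same bytes decode to different text depending on the declared charset.
payload = "café".encode("utf-8")      # what actually goes over the wire

print(payload.decode("utf-8"))        # café   -- right interpretation
print(payload.decode("latin-1"))      # cafÃ©  -- wrong interpretation, mojibake

# A Content-Type header carries both the MIME type and the charset;
# splitting one apart is a couple of lines:
ctype = "text/html; charset=utf-8"
mime, _, params = ctype.partition(";")
charset = params.split("=", 1)[1].strip()
print(mime, charset)                  # text/html utf-8
```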
It seems counterintuitive — and I wasn't making fun of that answer on purpose — because when you think about it, the authority is already specified in the URI. I know I'm a web server for google.com; when you come to me, I should know you're coming for google.com, so why would you need to tell me what you're coming for? But then you would need a one-to-one mapping between every single hostname — every DNS name — and an IP address. If you wanted to run multiple websites, you'd have to run them on different IP addresses, so that the server would know, based on which IP address was hit, which host you wanted. And they said that's actually kind of crazy: you may want the case where you have one server that handles tens or hundreds of hosts. That's why in HTTP/1.1 every client sends the Host field — so the server getting a request knows which host you mean. The authority gets resolved to an IP address and the request goes there: so the authority is google.com, the request goes to the servers that host google.com, and it's this Host header that actually says which site on that server I want. Yes? So the client clicks on a link, and the link will be something like http://www.google.com/, and that link generates this request. I do a DNS request — and you can actually get back a lot of different IPs for google.com, and it will depend on where on Earth you're making the request from: Google will send you a different DNS answer to give you servers that are geographically close to you, so you'll get different IP addresses of different servers. Then you pick one of those servers, connect to it, send this request, and say: hey, I'm looking for www.google.com. Fun fact about Google: you can talk to any of their servers and change the hostname.
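The virtual-hosting idea — one server, one IP, many sites, dispatched on the Host header — can be sketched in a few lines. The hostnames and page contents here are made up for illustration:

```python
# Sketch: one server process serving several hosts from a single IP address.
# The site is chosen by the Host header, not by the destination IP.

SITES = {
    "www.example.com":  "<h1>Example home page</h1>",
    "mail.example.com": "<h1>Example webmail</h1>",
}

def handle(request: str) -> str:
    lines = request.split("\r\n")
    headers = {}
    for line in lines[1:]:
        if not line:
            break                           # blank line: end of headers
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    body = SITES.get(headers.get("host", ""))
    if body is None:
        return "HTTP/1.1 404 Not Found\r\n\r\n"
    return "HTTP/1.1 200 OK\r\n\r\n" + body

print(handle("GET / HTTP/1.1\r\nHost: mail.example.com\r\n\r\n"))
```

This is exactly why HTTP/1.1 made Host mandatory: without the header, `handle()` would have no way to tell which of the two sites the client wants.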
Their servers are just front ends, so you can change the hostname to access any Google service you want — which is super interesting if you're on a network that only allows you to go to www.google.com but doesn't allow mail.google.com or docs.google.com: sometimes it will still let you check your email. Okay, any other questions on responses? We're making good progress. Okay: HTTP authentication. Fundamentally, everything we've seen of HTTP up until now is a request-and-response mechanism. In order to give this page back, all the server knows is this information: the client's IP address, the client's port, and what's in the request. So you can think of web servers as essentially having amnesia. Every time they get a request they go: "oh hey, new user, awesome! What are you looking for? Oh, the home page — great, there it is." Then you click the Google search button and hit enter: "oh hey, wow, new user! Cool, this is a search." Every single time — which would be insane. So how would you actually authenticate anybody, or give some of you access to some content some of the time? Well, there actually is authentication built into HTTP — it's called, surprisingly, HTTP authentication. The idea is a really simple challenge-response mechanism. The challenge: the server sends a 401 and says, hey, you're not authorized to access this page, and there's a "realm" that it can specify. It's super annoying, because your browser will pop up that browser-level box asking for a username and password — I think our SharePoint site does that — and it looks like a phishing attack. The client then needs to send an Authorization header that includes its credentials. With basic authentication, the client basically puts in a username and password, and that information is base64 encoded. So what does
base64 encoding mean? What is it? Well, what would base16 encoding be? Hex. So what would base64 be? Yeah — you have 64 characters to encode with. And what's the difference between an encoding and a hash? Encodings are two-way: you can decode an encoding. With a good hash function, you should fundamentally never be able to go back. But an encoding is just a reversible transformation — just like percent encoding: if I gave you something and said "this is percent encoded," you could decode it. Same thing here. So fundamentally, the browser is sending your authentication credentials in the clear, just base64 encoded. After all that I just said, would you be able to crack the username and password? Yes — you just pass the thing through a base64 decoder; it's the simplest thing in the world. And if we're talking about plain HTTP, how does this Authorization header travel back to the server? In plaintext — so every single router along the way can see your username and password. If you're on unencrypted wi-fi, anybody can see that username and password, and if anybody on your local network is performing an ARP man-in-the-middle attack, they will also see this password. Question: do people actually do this — intercept data as it crosses the internet? How would you do that? Could you set things up so HTTP requests get routed through you? Well, you'd have to be on the path — a switch or router somewhere in the network, like an ISP. Who does that at scale? The NSA — and the Snowden documents revealed that they had this capability at a far greater level than anybody thought. Even Google was shocked: there was the ability to monitor their inter-data-center communications. Before that, they had assumed they were pretty good by encrypting everything coming in,
but once traffic was inside Google's network, they assumed it was fine and didn't use encryption on the communications in between data centers. As soon as those documents came out, they completely changed their architecture: now all the internal communication is encrypted too. Of course, that assumes people can't access the encryption keys, so you have to guard those as well — we won't touch that issue, but it's interesting to think about technically: what would you need to do that? So basic authentication is purely terrible — on the face of it, it's awful. HTTP/1.1 digest authentication is a little bit better, because it defines cryptographic digests — hashes — that can be used: the server sends a nonce to the client, and the client sends back a hash of the username, the password, the nonce, the HTTP method, and the requested URL. The downside, the problem, is that the web server needs to validate all this information, so it needs to actually have your username and password in the clear — which means if you break into the web server, you now have all these people's usernames and passwords. So if you've ever wondered, "I can't believe I use so many websites and they all make me log in differently, with different workflows — shouldn't this be built into the HTTP protocol?" — the answer is: it is. What's a nonce? A nonce is a random, one-time value. The idea is: without a nonce, if I was able to sniff this information, I could say "hey, I want to log into your website," replay the blob I sniffed from you, and the server would think I'm you. The nonce is a random value that says: this time, prove you know all this information and include this random value, so the server knows each attempt is different. Then that replaying doesn't work. It makes things slightly better: an eavesdropper can no longer pull your username and password off the wire.
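Both schemes can be sketched in a few lines — the first shows why Basic auth credentials are trivially recoverable, the second is a simplified version of the digest computation (modeled on the RFC 2617 scheme; the usernames, realm, and password here are made up):

```python
import base64
import hashlib
import secrets

# 1) Basic auth: the Authorization header is base64, not encryption --
#    anyone who sees the header can decode the credentials.
header = "Basic " + base64.b64encode(b"alice:hunter2").decode()
recovered = base64.b64decode(header.split(" ", 1)[1])
print(recovered)                              # b'alice:hunter2'

# 2) Digest-style auth (simplified sketch of the RFC 2617 computation):
#    the password never crosses the wire, only a hash bound to a nonce.
def digest_response(user, realm, password, method, uri, nonce):
    md5 = lambda s: hashlib.md5(s.encode()).hexdigest()
    ha1 = md5(f"{user}:{realm}:{password}")   # server must know the password
    ha2 = md5(f"{method}:{uri}")
    return md5(f"{ha1}:{nonce}:{ha2}")

nonce = secrets.token_hex(8)                  # fresh random value per challenge
resp = digest_response("alice", "site", "hunter2", "GET", "/", nonce)
# Replaying resp later fails: the server will have issued a different nonce,
# so the expected hash no longer matches.
```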
So, just like we debug binaries, we need to actually look at HTTP traffic to understand what's going on. There are HTTP traffic sniffers: you can use tcpdump to collect traffic; servers can do a lot of logging; and browsers — browsers are actually really awesome now. Depending on what kind of pen testing or web stuff I'm trying to break, usually I'll just use a browser — especially modern browsers with their developer tools and JavaScript debugging; they're really good nowadays, very cool things. Then on the client side there are client-side proxies. I highly recommend Burp proxy — it's the best proxy on the market, and they have a free version you can use. It's a proxy that you point your browser at so it can intercept all your traffic: you can see and view all the HTTP requests you're making and the HTTP responses coming back. It's super useful just as a proxy when you're testing something, and it has other cool modules, like repeating requests and all kinds of other attacks. It's the best one. So, is Wireshark a proxy?
Wireshark is a sniffer. Wireshark just listens to everything that's going by — it's essentially a graphical tcpdump. Burp, on the other hand, is a proxy: you have to modify your browser. In your browser settings you say, hey, I'm using this proxy, it's running on localhost on port 8080. Then every time your browser goes to connect to a website, instead of connecting to the website directly, it first connects to the proxy on localhost:8080 and says, hey, I'm trying to connect to this place, and Burp will go make the request on your behalf. So it's essentially a man in the middle of your own traffic, so that you can see everything that happens. This is also how people break iOS or Android apps: they can see what requests are being sent from the app to its servers, and then they can try to tamper with those requests to make them do something bad — which is something app developers never seem to realize; they don't think people can see all these calls. Okay: HTML. We're getting through the trinity today. Hypertext Markup Language — it's actually fairly simple. Has anybody ever right-clicked on a web page and clicked "view source"? Okay, there should be more hands — you're all computer scientists. You don't have to raise your hands now; just internally, if you haven't done that, don't feel bad — just go do it, so you can say you've done it. The idea is that it's a very simple markup language for moving documents from one system to another. Unlike something like PDF, which was a proprietary data format, HTML is openly specified, and that way multiple browsers can be written for it. Hypertext Markup Language is originally based on SGML, which is kind of the precursor to HTML. HTML 2.0 was proposed in 1995, 3.2 in '97, 4.01 in '99 — and then they went in a crazy direction. After 4.0 they created XML. I actually thought XML came before HTML, but it's the other way around: from HTML, because it was so
popular and successful, they said: oh, we should extend this to capture not only hypertext markup but arbitrary documents. That's where you got XML, that's where you got schemas — that's actually part of what I was doing at Microsoft: I had to deal with huge schema documents and XML documents, and a lot of our transformers were in XSLT, too. Anyway, they created XML and thought, oh, this is amazing, it's going to take over the world. Then they looked back at HTML — which, because of the way its tags are structured, takes a lot of shortcuts that XML doesn't allow — and they tried to shoehorn HTML back into XML, creating XHTML in 2000. It completely failed. This is around the time I started programming, and I honestly never understood the difference between the two; it only really makes sense from a historical and political perspective. XHTML is essentially dead. HTML5 is the new standard, proposed at the W3C — the World Wide Web Consortium, that's where the name comes from — which is basically the group responsible for leading the development of the web. That's what Tim does now, I'm pretty sure; I don't know who pays them these days. The other cool thing is that HTML is now a living document. And this gives you a really interesting property: once servers start sending documents with a certain version of the markup language, browsers have to decide whether they're going to read and understand it or not. If you've ever wondered why there are so many bugs in browsers, it's because a browser today has to support not only HTML5 and 5.1 but all of the garbage that came before. Think about it: how many of you use multiple browsers? Yeah, me too — I use Safari now because it's supposed to be more energy efficient on macOS, which matters for power, but I used Chrome a lot before that, and I used Firefox a lot
before Chrome came out. So if you were using a browser and went to some web page and the browser said, "this is an HTML 2.0 page, I can't display it," what would you do? Get rid of it — go to a new browser. Nobody has time for that nonsense. So browsers need to support all of this. And it says something else, which I was going to talk about later: browsers not only have to support all these different standards, they also have to support the terrible HTML that developers write against those standards. It's the same thing: if you went to a website and there was no image where one should be, and you saw a little warning on the console saying the image tag was malformed, you'd say: this browser is garbage. Exactly. And because of this, browsers are incredibly permissive about the HTML they accept — which actually leads to a lot of security bugs, because in the weird corner cases of parsing, where browsers treat things differently, weird things happen. And finally, there's a chicken-and-egg problem: with HTML, say you want to create some new feature. How do you do it? Do you lobby the browser developers to support it? Why would they — there are ten other things they could do, and no website actually uses it yet. And then how do you get website owners to use the new feature, when no browser supports it? That's why you need experimental features, so people can implement things and test them out, both on websites and in browsers, and that's why you need consortiums like the W3C to bring all these people together to talk about these things. It's a super interesting problem. Okay. The basic idea of HTML: you have raw text that you want to mark up with tags — you basically want to add meaning to the raw text. Not necessarily semantic meaning in a deep sense, but more meaning than just raw text. Yes — it's
essentially metadata: data about the data. A start tag looks something like this: a left angle bracket, some text — the name of the tag — and then a closing right angle bracket. So this is a tag with the name foo, and it's a start tag. After that can come any kind of arbitrary text, and also other start tags. Finally, it's closed with an end tag. An end tag is a left angle bracket, a forward slash (I can never remember which is which, but this one is the forward slash), then the name of the tag — foo — and then a right angle bracket. That's it; HTML is a lot of this. There are also self-closing tags, which are leftovers from XHTML: you start a tag and put a slash before the closing bracket, and that's equivalent to a start tag with no content followed by the end tag. And there are void tags, which are interesting. Essentially, all start tags must be matched by an end tag — the pair delimits a section, telling you which part of the HTML document the tag applies to — but some tags have no end tag, like an image. That kind of makes sense: an image doesn't say "the image goes from here to here"; it just says "this is where I want the image." The HTML5 standard defines which tags are which. Tags are basically all containers, so you can think of the document as a tree — which is how we think of tag structure. The tags are hierarchical. First you have html: the html tag is the root of the tree, and there's nothing outside of it. Next, the first child of the html tag is the head tag; this is the part of the HTML document that's like the headers in an HTTP response — it gives you metadata about the page. One of the things in there, for instance, could be a title tag: now title is a child of head, which is a child of html, and title has text — "Example" — as its child. And head can have a sibling, so now
html has two children, head and body, and the order there is important — the order they appear in the document. body has a child p, which stands for paragraph, and that has text inside it. Questions? So when you turn it sideways, you can pretty closely draw the tree from the markup. And if you go to a web page in your browser and do "inspect element" or something similar, it will actually show you the whole HTML tree in the developer console — it's going to be much more complicated than this. Good question — in Firefox there's a little add-on (the icon is a little bug) that, when you use it, will outline on the page every box of every tag. Is order actually important there? I believe the answer is yes — and again, there's the question of what the standard says is okay versus what browsers will still parse correctly. This, by the way, is a minimal standard HTML page: semantically valid with the least number of elements you need — I think the standard says you need these things, in this order. So that would be the tree-based view of this. Now, tags by themselves are not very expressive: all we have is title, head, html, p. That doesn't let us specify more metadata about the tags — so attributes are another incredibly important part. Every tag can have attributes. Attributes essentially live inside the tag: between the name of the tag and the closing right angle bracket there can be a series of attributes. And — also super confusing — there are four different types of syntax here. The first defines an element, a start tag named foo, with the attribute bar: foo is the tag name, bar is the attribute. The next way: you can specify values for attribute names, so this defines a tag named foo with the attribute bar, which has the value baz. You can also put the value in double quotes — exact same thing — or in single quotes (sorry, the font made those look like double quotes), and in double quotes too.
So those are the four different types of syntax, and there are different escaping mechanisms: inside single-quoted values you need to escape single quotes, and with double quotes it's the same thing for double quotes. Multiple attributes are separated by spaces, so you can have multiple attributes on one tag, and each attribute can use a different syntax. So now you know what this looks like. Okay, alright — we'll stop here.
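To tie the tag and attribute discussion together, here's what Python's standard-library HTML parser reports for a minimal page like the one above. The page itself is made up for illustration, and it uses the attribute syntaxes just described: bare (`hidden`), unquoted value (`id=intro`), double-quoted, and single-quoted:

```python
from html.parser import HTMLParser

# Feed a minimal page through the stdlib parser and record every start tag
# with its attributes. A bare attribute comes back with the value None.
page = ('<html><head><title>Example</title></head>'
        '<body><p id=intro class="text" data-x=\'1\' hidden>Hello</p>'
        '</body></html>')

class Collector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

c = Collector()
c.feed(page)
print([t for t, _ in c.tags])   # ['html', 'head', 'title', 'body', 'p']
print(c.tags[-1][1])            # all four attribute syntaxes on the p tag
```

The nesting of these start tags is exactly the tree drawn above — html at the root, head and body as its children — and real browsers build the same kind of tree (the DOM) before rendering.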