All right, let's get started. So the second-to-last lecture in the class is going to be about remote procedure call. We're going to review some of the history of remote procedure calls and their applications, especially in remote file systems and ultimately in web services. So we'll introduce RPC and then dive into some examples: NFS, the network file system that's built on RPC, and then the Andrew file system, which is a second-generation file system that influenced a lot of what came after NFS. And then we'll move in the second half of the class to talk about RPC on the web, basically SOAP, which has become a critical part of the distributed functionality on the worldwide web now. OK, so we've talked quite a bit about coordination between different machines, between clients and servers, as part of operations we've already covered, like the network itself with TCP. TCP is a distributed protocol that involves a client and a server exchanging messages to establish the connection, do flow control, and check for lost packets. And it's a complicated protocol, if you recall; we spent about a lecture and a half discussing all of the aspects of getting that protocol to work reliably. Both client and server have to be doing exactly the right thing for this to work. We talked about two-phase commit in databases. That's another message exchange format that guarantees that the state of the client and the server, as far as some transaction goes, remains consistent and is robust across machine failures. GET and POST we haven't talked about as much, but they are two important message-passing protocols that are part of the web, the old web. And last time we at least brought up the messaging that's part of SQL Server, really just because it was a massive vulnerability that led to the Slammer worm. So that's the traditional way of doing distributed computing: just exchange messages, having agreed on a protocol, and make things work somehow. And making things work entails that both client and server maintain some view of their state. So at both ends of the connection, you have state machines that can be quite complex. With TCP, compliance with the protocol has to be very rigid. Luckily, there are plenty of TCP implementations that people have been able to test against, and there's an elaborately agreed-upon protocol because it was an Internet standard; people know exactly what TCP is supposed to do. But it's more challenging with proprietary systems like the protocol in SQL Server, or even in two-phase commit, where the implementation is generally proprietary. And that leads you into difficulties of version control: different versions of those protocols might be implemented slightly differently. And features tend to accumulate. People want to add features to a service, and that leads to potential inconsistencies between different versions of the same, ostensibly the same, protocol executing on different hosts. And of course, you have to worry about error recovery, which means almost any pair of states might arise if there's a failure in the middle of one of these operations, and you have to figure out how to recover from that state and get things back to a consistent state. You have to worry about protecting the data that's going across the connection, data integrity as well as privacy. And at some point, you also need to check for message integrity.
So last time we talked about buffer overruns, which are caused by parts of a message being too long, not complying with an assumed length limit. And also last time we talked about denial-of-service attacks, which are deliberate attempts to lock up resources on a server. Whenever you have a complex stateful protocol, you can cause denial of service by executing part of the protocol that puts the server into some known state, typically with some additional data in its memory, and then just leaving it there. The more complex the protocol, the harder it often is for the server to back out of that state, and the more ways there usually are to get the server into that state. So stateful protocols carry more risk as far as denial of service goes. And finally, you have to make all this work when, in general, the client and the server are not running the same operating system or written in the same language. So that makes life pretty difficult for these systems. And so there's an imperative to design something a bit higher level that handles some of this in a generic way, but still allows you to do application-specific things. And that's the basis of remote procedure call. The idea is conceptually very simple: it just looks like executing a normal function call on the client, and it translates automatically into an equivalent call on the server. So in this case, this would be a handle to some resource on a remote server, and this would be a request to read from a specific location. By wrapping a very complicated protocol inside the RPC, we're able to hide a lot of the functionality from the previous slide under the hood. So very likely the RPC has much of the complexity of the message passing we saw earlier. There isn't one RPC standard; there are several. They may or may not, for instance, actually execute some sort of transaction with multiple messages to make sure that if the procedure call fails, then both ends finish in an error state indicating the call didn't execute. That's not that common, but it does exist, and it's typically part of RPCs that support database systems. More general RPCs don't implement this, for performance reasons: there's a very big overhead in delaying commits, and it limits the amount of concurrency. So usually they're not transactional, but they often implement some of the other features here. They manage the state at the two ends: you use a particular version of a particular RPC standard, so the two ends of the connection can communicate consistently. The RPC does some error recovery for you, at least signaling if the RPC itself failed, and it can sometimes provide data protection as a sub-layer of marshalling the arguments. And there's typically type checking on messages. All right, so RPC is a clean way of implementing remote protocols through a simple abstraction that takes care of a lot of the details of particular messages going back and forth between the client and the server, and allows the programmer to just focus on coding up the effect of the call, in exactly the same way as if it were a local execution of a function. Here's roughly what that looks like from the client's side.
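To make that concrete, here's a minimal sketch using Python's built-in xmlrpc library. The host, port, and read_block method are made up for illustration; the point is simply that the remote call reads exactly like a local function call.

```python
# Minimal RPC client sketch using Python's standard xmlrpc library.
# The host, port, and read_block method are hypothetical; the point is
# that the remote call is written exactly like a local function call.
import xmlrpc.client

# The proxy object stands in for the remote server (the client "stub").
server = xmlrpc.client.ServerProxy("http://fileserver.example.com:8000")

# This looks like an ordinary function call, but under the hood the
# arguments are marshalled, sent over the network, executed remotely,
# and the result is unmarshalled and returned here.
data = server.read_block("/exports/home/alice/notes.txt", 0, 4096)
```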
All right, so the execution, at a very high level, is based on two pieces of code that exist at either end of the connection: the stubs. The primary function of the client stub is to marshal, that is, collect together, the arguments to the function call, and then unmarshal the return values. The server-side stub does the opposite: it unmarshals the arguments to the function call, and then marshals the return values. Marshalling sounds like something out of a Western, so what does it mean? Mostly it means serializing the objects, and there may be some canonicalization of objects as well. If there are data structures, potentially with cycles, it has to somehow put them in a particular form, probably a tree form, with references. And if there are references to other objects in the address space, there's a complicated set of choices to make. The simplest choice is to copy everything, but in practice that's not always practical. Depending on what the RPC is for, a reference may instead get translated into a reference to the object on the client machine that the server can understand and potentially access; so sometimes you end up with proxy objects that effectively exist on both sides of the connection. And that's actually one of the key decisions made as part of designing a remote procedure call system. All right, and then the RPC more or less abstracts away the details of the operating system, but there are various things you do have to take into account, such as the threading model, whether you want a blocking operation at both ends, or whether it should be asynchronous, and so on. Those often have implications for coding it up on different platforms. All right, so the compiler typically generates stubs from an interface definition language (IDL). The IDL mostly specifies the types of the parameters and the type of the return value, and from it the compiler outputs the stub code for client and server. The IDL itself is an abstract language; the stub code will be in C or Java or some appropriate source language. Okay, so RPC is usually initiated by the client, so the stub code is what the client actually executes. That stub code takes the real arguments to the call and bundles them up somehow. In Java this is easier to think about because you've probably all seen Java serialization, which turns objects into byte streams; serialization happens as part of the bundling. Then the serialized version of the arguments is packaged up with some headers and handed to TCP, where the packet encapsulation happens, and the packets go out over the network. On the other side, a receiving packet handler extracts the marshalled arguments and passes them to the server stub, which takes the serialized data and turns it back into objects in memory. And then finally it actually calls the piece of server code that executes that remote procedure call on the server. This again is like a normal function call; it's as though you short-circuited the call from the client to the server. Of course, the server could be written in a different programming language, which is part of the beauty, but if it were the same programming language, it's basically equivalent to the client directly calling the server with those same arguments. So then everything unrolls backwards: the server returns its result in its native representation, it gets serialized and handed to the server's packet handler, which packs it into network packets and sends them back; those get forwarded on to the client stub code, which deserializes them into objects in the address space of the client and then basically executes a return. So this really just looks like a regular function call to the client, and the call into the server code looks like a regular function call to that piece of code. Below is a rough sketch of what the marshalling step itself might look like.
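As a rough illustration of marshalling, here's what a hand-written stub pair might do with JSON as the serialization format. Real RPC systems generate this code from the IDL and handle far more cases (cycles, references, binary types); the method name and argument layout below are invented, but the shape is the same.

```python
import json

# Client-stub side: marshal a call into a flat, serialized message.
def marshal_call(method: str, args: list) -> bytes:
    # Canonicalize into a simple tree (no cycles, no raw references),
    # then serialize to text that can cross languages and machines.
    message = {"method": method, "args": args}
    return json.dumps(message).encode("utf-8")

# Server-stub side: unmarshal the message and dispatch to real code.
def unmarshal_and_dispatch(payload: bytes, handlers: dict):
    message = json.loads(payload.decode("utf-8"))
    handler = handlers[message["method"]]   # look up the server function
    return handler(*message["args"])        # the actual "remote" call

# Example: the server registers its implementation under a method name.
handlers = {"read_block": lambda path, offset, n: b"...file data..."[:n]}
wire = marshal_call("read_block", ["/exports/notes.txt", 0, 4096])
result = unmarshal_and_dispatch(wire, handlers)
```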
All right, everything clear so far? All right, so some important details. One question in remote procedure call is which machine you send the call to. That normally involves a naming service that resolves a high-level name of the service. Maybe the name is something like "address-service:lookup-address", a symbolic name that gets resolved down to a particular machine and port number, because the service is globally known under that name. That endpoint is resolved in one of two ways. If you're using a static lookup process, you'd have to know at compile time which machine and port the lookup service lives on, and you'd bake those non-symbolic addresses into the code. Or you perform the lookup at runtime, which for pretty obvious reasons is the much better solution; that's called dynamic binding. To do that, you have to have a name service online. The cases where you'd do static compilation might be some kind of embedded device that may not always have access to the network; that's normally going to correspond to services that are known at compile time. All right, so dynamic binding also allows for access control, to check whether the user is really permitted to use the service, and it also allows you to do some failover for reliability. So that's the simple story. Very often, though, in larger-scale, more complex RPC systems especially, people go a bit further: as well as including a name service to look up service names, there's an object registry, which generally provides a bit more. It may provide host and port, but it may also provide the name of an object in a distributed-object setting, an actual object in the sense of a C++ object or Java object on the remote machine that you would like to access. And it will also often provide client stub code. The reason for having non-standard client stub code is to allow the client to be smart about how it manages, for instance, references to objects that are more complex. Sometimes the right thing is to just pass a reference to the original object; sometimes a particular call might not need the whole object, only a tiny piece, a single field. The client stub code can just look up that piece of information when the stub is called and pass it on to the server. So it basically allows you to do the most sensible thing for that particular call. And it also allows you to deal with potential version problems by making sure that the client's stub code is always current and always matches what's on the server. Registries are pretty common in complex, Internet-of-Things-type environments, where you have a lot of different potential clients, some of them very thin, some of them very old and running legacy systems, and you can't assume they're all running some particular current version of the client code. Here's a sketch of what dynamic binding might look like.
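As a sketch of dynamic binding, here's a toy name service. The service names and the registry itself are invented for illustration; a real system would run the registry as a well-known network service (in the spirit of the ONC RPC portmapper) rather than a local dictionary.

```python
# Toy name service for dynamic binding. In a real RPC system this
# registry would itself be a well-known network service; here it's a
# local dict just to show the lookup step.
import xmlrpc.client

REGISTRY = {
    # symbolic service name -> (host, port); entries are hypothetical
    "address-service": ("addr1.example.com", 8000),
    "file-service":    ("files.example.com", 9000),
}

def bind(service_name: str) -> xmlrpc.client.ServerProxy:
    """Resolve a symbolic service name to an endpoint at runtime."""
    host, port = REGISTRY[service_name]   # the dynamic lookup
    return xmlrpc.client.ServerProxy(f"http://{host}:{port}")

# The client names the service, not the machine; if the service moves
# or fails over, only the registry entry changes, not the client code.
addr = bind("address-service")
```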
Okay, so let's see. RPCs as we've described them so far are oriented towards communication across the network boundary, but they have also evolved into a convenient technique for communicating transparently with other processes, either on the same system or across the network. We've already seen that you can communicate in shared memory using things like semaphores and monitors and so on. You can communicate through the file system, as Hadoop does. You can communicate through pipes, which are an inter-process abstraction that looks a lot like a file. And you can do a true remote procedure call even to access a process that's on the same machine. The RPC abstraction works perfectly well there, and you have to address a lot of the same issues, because a different process on the same machine is still running in a different address space, a different set of virtual addresses. So in order to do an RPC from one address space to the other, you've got to do essentially the same thing. And indeed, the other address space on the same machine may very well have been built with a different compiler or language. So this allows you, for instance, to call a remote method implemented in C, running on the same machine, from Java. And in fact, once you can do that, you typically have the ability, pretty easily, to extend it to remote machines as well, and it looks exactly the same from the client code's point of view. An example of this is the X Window System, which is the windowing system under Linux; if you're running GNOME or KDE or whatever, those are actually built on top of X. And as you've probably noticed, you can run X pretty much transparently, just by setting the DISPLAY variable, either on the same machine or somewhere else. X executes a lot of RPCs to move data around, and you won't see the indirection that happens when it has to go through the network. So it's an elegant solution, and RPCs have generalized from true remote access to local access through the same mechanism. Okay, so some examples of RPC systems. Probably the most common one is Open Network Computing (ONC) RPC, which used to be Sun RPC before they opened it up. That's the workhorse of remote file access, of NFS, in Linux. Windows implements it as well, so it can access Linux resources, and many other Unixes use it too. So this is the first-generation, really widespread RPC. And then there was a big consortium effort to make a more open standard, because originally this was proprietary and first-generation and they wanted to fix a bunch of things. So there was a big consortium under the label of the Distributed Computing Environment, and they went through, as big consortia tend to do, many, many generations of cramming features into a standard called DCE, of which RPC was a critical part. So this was pretty highly evolved and complex, and it didn't really manage to take off. There were some implementations of it: Digital Equipment had one, HP had one.
There was starting to be some convergence on it, but then, if I recall, Sun around that time brought out a brand-new version of NFS which was completely incompatible but had some of the same fixes that were in DCE. That propagated a lot more quickly, and DCE more or less became an orphan; the companies that were supporting it were also starting to move out of the Unix business. But anyway, DCE was picked up by Microsoft, where it got renamed and reworked into MSRPC, which is the foundation of a lot of their distributed file system and distributed protocols. All right, and Java has its own remote method invocation standard as well. Okay, so RPCs are used in an interesting way in microkernels, because, as we've said, they're a natural way to implement access to a different process. Microkernel operating systems, if you recall, break up the key pieces of an operating system into separate processes and have them use a sort of universal inter-process communication mechanism to talk to each other, and that's usually implemented with an RPC. So RPC also became a kind of backbone of microkernel systems. We talked about that some earlier in the course, so I won't go over that stuff again. Okay, so let's examine some weaknesses of RPC, and some ways of working at a little higher level. First of all, the implementation of RPC, as I said, does vary from one standard to another. Not all of them implement fault handling the same way, and in fact, generally they're fairly conservative: they err on the side of performance, so they guarantee not very much as far as fault tolerance goes. That's a deliberate choice to make them more usable; they defer fault tolerance and fault handling to the application author at a higher level. That's not true of all of them; as I mentioned, there are database-oriented RPCs which invest a lot more effort in doing this. Now, with a normal procedure call on one machine, if there's a problem, either the callee finishes execution or the caller receives an error; there's really only one process running, so there isn't a consistency problem. But with RPC, you actually have two different processes running, and so you have many options if there's a fault. If the remote machine crashed, you could potentially return a failure message that allows the caller to continue. So that's the decision. The trouble is that the client that called the RPC then may not know, and generally won't know, what happened on the server. Did the call get executed? Did it somehow get partially executed, or did it not get executed at all? We don't know. And again, that's not part of the standard: some RPC systems implement transactions, and some just try to keep things as simple as possible for performance reasons. The sketch below shows the kind of ambiguity the client faces.
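Here's a hedged sketch of that ambiguity: a client that retries after a timeout gets at-least-once semantics, and unless the operation is idempotent it can't tell whether the server executed the call zero, one, or several times. The call_remote function is a stand-in for whatever transport the RPC system actually uses.

```python
import socket

def call_remote(request: bytes, addr: tuple) -> bytes:
    """Stand-in for an RPC transport: send a request, wait for a reply."""
    with socket.create_connection(addr, timeout=2.0) as s:
        s.sendall(request)
        return s.recv(4096)

def call_with_retry(request: bytes, addr: tuple, attempts: int = 3) -> bytes:
    # At-least-once semantics: on timeout we simply resend. If the
    # server actually executed the call but the reply was lost, the
    # operation may now run twice, which is harmless only if it's
    # idempotent.
    for _ in range(attempts):
        try:
            return call_remote(request, addr)
        except socket.timeout:
            continue  # did the server execute it? we can't know
    raise RuntimeError("RPC failed: server state is unknown")
```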
Okay, so the performance of RPC is pretty intuitive. A normal within-process procedure call is generally much faster; it doesn't normally require any marshalling of arguments. A same-machine RPC is considerably more expensive, because it does involve taking binary arguments and somehow packaging them up so that they can work across potentially different languages and different address spaces. And a remote-machine RPC is inevitably more expensive still, because in addition you've got to take your serialized data and send it across the network. So there's a really nice abstraction hiding where the function call is being executed, but on the other hand, from a performance perspective, it's pretty obvious where things are being executed. In X Windows it's excruciatingly obvious, because, as it turns out, X doesn't seem to work very well over current wide-area networks; it was originally designed at MIT for people working on machines in different parts of the campus, and it really doesn't work well across the wide-area network. So the locus of these procedure calls does become pretty obvious in a lot of cases. You can try to do caching in certain contexts, and we'll look at that, but caching introduces potential inconsistency between caches and their source data, and that makes failure handling difficult again. All right, so in a distributed file system, we have some client over here that wants to read a file across the network, and there's a server over there which holds the data. It's a very common model; actually, in the old days, it was practically universal. Now people tend to operate more with Windows PCs that don't rely as much on a network file system, but the distributed file system still provides these kinds of affordances. The idea, at a high level, is that you want to make things transparent: you want the user to feel as though they're using files on their local machine. What does that mean? Transparency means concurrency transparency, where clients appear to all have access to exactly the same files. Failure transparency: if something crashes while you're not actively using the application, you'd ideally like to come back and find your files still there. Replication transparency: for reliability, you might actually have multiple copies on the same or different servers, and it would be great not to have to be aware of that. And migration transparency: if you're going to migrate files, maybe to do some load balancing, or, in a really big data center, to bring them closer on the network to the client, then that should also be invisible to the client. All right, so those are the desiderata. You also have to figure out consistent and global naming conventions. A complication of having global namespaces is that they get very complicated very quickly, and in fact, remotely mounted file systems are usually aliased because there's a good chance of collisions: they have fairly conventional names like /home/user, which probably exists on many different machines. So the mounting process in whatever operating system you're running manages, basically, a connection to each file server, some sort of registration that gives you a handle on the files or directory structure on that system. The mount point is the root of that directory tree, and the mount machinery manages the translation between the local, mounted file name and the remote file name, which the server understands. So something like this is a translated file name; that helps you avoid the aliasing problems that would otherwise be pretty inevitable in a distributed context. Here's roughly what that translation looks like.
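As a sketch, mount-point translation might look like the following. The mount table and paths here are invented, but the prefix-stripping idea is the real mechanism.

```python
# Toy mount table: local mount point -> (server, path exported by server).
# All names here are hypothetical.
MOUNTS = {
    "/mnt/homes": ("fileserver1", "/export/home"),
    "/mnt/proj":  ("fileserver2", "/export/projects"),
}

def translate(local_path: str) -> tuple:
    """Map a local path under a mount point to (server, remote path)."""
    for mount_point, (server, remote_root) in MOUNTS.items():
        if local_path.startswith(mount_point + "/"):
            suffix = local_path[len(mount_point):]
            return server, remote_root + suffix
    return None, local_path  # not remote: handled by the local FS

# /mnt/homes/alice/notes.txt -> ("fileserver1", "/export/home/alice/notes.txt")
print(translate("/mnt/homes/alice/notes.txt"))
```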
All right, so that's what's most commonly done. You can also try to have a true global namespace; some distributed file systems, like the Andrew file system, try to do this. And there's some cleanness to it, which is that you don't need to worry about what machine you're operating from: as you move around, the address will always be the same. Okay, so here's a simple distributed file system. Basically, every read and write is executed atomically; typically writes are executed and the changes are made to disk before the call returns. That's the simplest way to get a consistent view of the file system from multiple clients. But the trouble is performance. Being very careful about making everything consistent normally means that there will be delays before other clients can access changed content, and the network adds delay both to the movement of the data and to the signaling to other clients that data is ready. All right, so caching often helps. It lets you get data onto the server and make it available to other clients more quickly, before it's actually committed to disk. But then of course you potentially lose robustness, because if there's a crash, those updates are lost. So there are the usual trade-offs to make. Now, what happens when a server crashes under a client? There are a few options for dealing with crashes. It's sometimes implemented with a kind of lock, where the client simply has to wait until the server comes back up; the trouble is that the data in the server's memory is then lost. The other approach is to try to share state across the connection, so that both sides keep a buffer containing part of the update; that way the client can recover what it already has in its local buffer, even if the server's content is lost. But then you have a consistency problem. So this is another one of those very complicated things; it's probably a good compromise, but it entails a lot more effort to make it robust. Then there's message retry: if the server crashes after it's done a remove, but before the client receives an acknowledgement, the client has to retry. NFS is more of a best-effort system here: an rm will start removing files, but not in any particular guaranteed order, and when it removes the blocks of a file, it also unlinks them in an order that's not guaranteed. So you can potentially end up with things partially removed if an rm process fails in the middle. Okay, so for some of those reasons, it's desirable to make the system as stateless as possible. A stateless protocol is one in which the information required to process a request is passed along with the request, so both ends of the request know exactly what's being asked for. And it saves the server from keeping a lot of information about the client, except perhaps hints to improve performance. It normally also means that the request can be repeated: if there is a failure, and maybe the client doesn't know what happened, it can repeat the request.
And it should execute the second time, and it would normally be idempotent, meaning the outcome is the same. Okay, so let's leave client crashes aside for a minute. An example of a stateless protocol is HTTP by itself, with GET and POST. Modulo the side effects of some other client modifying the contents, these are repeatable operations: you can keep doing a GET and it should retrieve the same content, and that's the basis of web caching. POST is a bit looser, but the idea is that if a POST failed and you do it again, it should eventually cause the update that you want. And REST, a design pattern that we'll talk about in detail at the end of the lecture, goes even a little bit further in trying to make something like an RPC that's stateless. All right, so we'll get into that later. Interestingly, REST has recently become perhaps the dominant remote-call mechanism for the web, but SOAP is still widely used as well. And whereas the REST model is a stateless protocol, SOAP is more deliberately and directly a remote procedure call. So SOAP is stateful, but it's implemented over HTTP, which is stateless, okay? That seems like a conflict, but it really isn't, in the same way that it isn't at lower layers: UDP is also a stateless protocol, it doesn't keep track of connections, you just send packets to a host; but SQL Server, which implements transactional protocols, uses UDP in at least one mode of operation, with client and server code maintaining a lot of state to make that work. So SOAP clients and servers maintain their own state; they execute messages more or less atomically over HTTP. With stateless protocols, you have to be careful to specify exactly which layer of the protocol you're talking about, because statefulness is relative to one particular layer. Okay, so NFS. It's a layered file system, designed to make access to remote files transparent, so that it appears the same as local access. So you have a file interface that's familiar, the Unix file interface: you can open, read, write, and close things with file descriptors. The VFS layer, when you look down from the top, just looks like a file system that implements those operations. Under the hood, though, it dispatches the operations either to a physical file system or to the NFS stack. The NFS client code sits below that and implements the actual details of the NFS protocol, the network file system protocol. And VFS works at both ends of the connection. So overall it looks like this: the client code sits up here and makes its read, write, and open calls against the VFS interface. VFS then farms out the actual calls either directly to a Unix file system, maybe a Windows file system proxy, or to the NFS client code. NFS uses ONC RPC, together with XDR, which is the data bundling format that goes along with it; it allows you to package up your data and transport it over the network as network packets. So the NFS client basically does a remote procedure call, like the other RPCs we saw, which turns into marshalling and eventually transmission. The other side unmarshals the data and passes it up to the NFS server code, and then there's another VFS layer on the server, implementing the bridge between the NFS server and the physical Unix file system, which lets it access the disks. Okay, does that make sense? Here's a little sketch of the dispatch idea.
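The VFS layer is essentially interface-based dispatch: callers program against one file interface and can't tell local from remote. Here's a minimal sketch of that idea; the class and method names are invented, and a real VFS is in-kernel C code with many more operations.

```python
from abc import ABC, abstractmethod

class VFS(ABC):
    """Common file-system interface: callers can't tell local from remote."""
    @abstractmethod
    def read(self, path: str, offset: int, count: int) -> bytes: ...
    @abstractmethod
    def write(self, path: str, offset: int, data: bytes) -> int: ...

class LocalFS(VFS):
    """Dispatches straight to the physical file system."""
    def read(self, path, offset, count):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(count)
    def write(self, path, offset, data):
        with open(path, "r+b") as f:
            f.seek(offset)
            return f.write(data)

class NFSClient(VFS):
    """Same interface, but the work happens over RPC on a remote server."""
    def __init__(self, server_proxy):
        self.server = server_proxy   # e.g. an RPC stub for the NFS server
    def read(self, path, offset, count):
        return self.server.read(path, offset, count)
    def write(self, path, offset, data):
        return self.server.write(path, offset, data)
```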
And you can see there's a symmetry here. In a real environment, especially with NFS, either one of these machines could really be treated as a client or a server; either one could be exporting its disks through NFS. So you can see how somebody over here on this machine could also come down through this NFS stack and access the disks on the machine on the other side. All right, so that's the idea; that's the rationale for having this interface at both ends. All right, so NFS is a protocol that you run at both ends that uses RPC to let file data move over the network. It provides the usual capabilities of a file system: reading and writing data, manipulating links and directories, file access, the usual things that you would do on a physical file system. But it doesn't just pass those through to the physical file system, and you can figure out some of the reasons: the links themselves may point to other physical file systems or to other NFS systems; file attributes may actually be different on the client and host machines; and access control has to be managed very carefully. So NFS is not just a middleman in these operations; the NFS software actually has to manage them explicitly. Now, NFS normally does write-through caching, meaning that the data gets committed on the server's disk before the server replies to the client that the write finished. So it's not very performant: the client has to wait, and might have to wait a long time. You're not really benefiting from the cache; the data is sitting in memory, but you're going to wait for it to be written before you can continue. And NFS doesn't do anything smart to let readers notice that you've made a change; readers actually have to ping the server to figure out that there was a change. Okay, so NFS servers are stateless, meaning that the request provides all of the arguments and information needed to complete it. So, for instance, unlike TCP, there's no notion of a connection, and there's no real server-side notion of an open file with a session handle. The client can still open files, but what it receives is essentially information identifying the inode on the server machine, and then it reads at that specific address, which it got from the server in an earlier transaction, at the given byte position. So there's no notion of a persistent open file; the request has to carry an address that's universal for that particular file. Can two clients step on each other? It can happen: NFS doesn't manage locks or prevent people from doing that. There would be a write time that you'd get as an attribute when you access the file, so you'd be able to tell that it was modified recently. And if you're reading a file and somebody modifies it, you would normally get a message about that during the operation, but the trouble is that won't stop the write from happening; you don't have a lock. If you think about it, you're just reading blocks of the file, so the server doesn't even know you're going to read another block in the next step.
Will it tell you, when you read a block, that the file changed? Well, actually, maybe not; I guess there's no particular reason for it to tell you, but the modified date may just be part of the metadata that you get back with every transaction. That would be logical; I don't actually know it for certain. All right, so no, NFS is really not very careful about managing consistency, and it definitely admits a variety of inconsistent and even corrupted files. Sure, ideally any distributed system would prevent that, and NFS can work okay in distributed contexts, but it's rather like using Git or GitHub: the tool doesn't stop you from making inconsistent changes; it just makes you aware that there's an inconsistency and tries to help resolve it. NFS really leaves that responsibility to client code and to the people using it. It's by no means considered a clever solution to this problem, so we'll get to some better solutions soon. Now, an important property that's shared by almost every stateless system is idempotency, which means that when you repeat a request, the outcome is the same, modulo state changes from outside. So calling the same read operation on the same file returns the same result. Intuitively that's different from a read on a local file, because a read on a local file returns the next block and maintains an implicit file pointer. At the low level, NFS reads are always just reading a specified block, so you'll always get the same one with the same arguments. Similarly, when you write, you're always writing an explicit block to an explicit file, so it will always modify just that block. And we'll see this again and again, actually. All right, so the advantage of this is that if the operation somehow fails, or you don't get an acknowledgement that it succeeded, you can just do it again, and the outcome should be the same; it shouldn't matter whether it happened the first time or not. Yeah, it's sometimes said to be side-effect free. Remove is a tricky one, because normally if you try to remove something that's not there, you should get an error. But in NFS it's treated as an advisory error, sort of a warning, rather than something that makes your code fail, because the server can't know whether you're doing the remove for the first time or retrying it. Okay, so the failure model: NFS tries to hide failures from the client system to the extent that it can, in the hope that maybe the problem will fix itself. There are two options NFS provides for dealing with faults. One is to block, which it definitely uses for short periods: if it can't complete an action for some reason, it retries a bunch of times and returns nothing to the client; if it eventually succeeds, it comes back, and the client is blocked until that happens. The other is to return an error. Neither of these is required, so the behavior depends on the particular implementation; I don't know whether it's configurable in the OS, but both options exist in NFS. Okay, so let's see how we're doing. Here's a sketch of why idempotency makes retries safe.
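Here's a contrast sketch, with invented function names: a local read advances an implicit file pointer, so repeating it changes the outcome, while an NFS-style positional read names everything explicitly and can be safely repeated.

```python
import os

# Stateful local read: the kernel keeps an implicit file pointer, so
# calling this twice returns *different* blocks -- not idempotent.
def local_read_next(fd: int, count: int) -> bytes:
    return os.read(fd, count)          # advances the file offset

# NFS-style stateless read: the request itself carries the file handle
# and the byte position, so repeating it returns the same block.
def nfs_read(handle: str, offset: int, count: int) -> bytes:
    # Stand-in for the RPC; a real client would send (handle, offset,
    # count) to the server and get the bytes back.
    with open(handle, "rb") as f:
        f.seek(offset)
        return f.read(count)

# Because nfs_read is idempotent, a retry loop is safe: it doesn't
# matter whether a lost reply means the server ran it zero or one times.
def read_with_retry(handle, offset, count, attempts=3):
    for _ in range(attempts):
        try:
            return nfs_read(handle, offset, count)
        except OSError:
            continue
    raise OSError("read failed after retries")
```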
So a client can poll, get the last-written date, and determine that a file's been updated. Clients also have caches. The server gets notified when a client has done a write into its local cache, which should eventually cause an update on the server; but while the server is told that the update happened, it doesn't try to invalidate the other clients' local copies. So in other words, it's deliberately allowing inconsistency for a time. It's something like eventual consistency: until the writing client successfully pushes its data to the server and the other clients check, you have inconsistent data. Okay. And it's a particularly bad kind of inconsistency when there are multiple writers: because clients can write blocks basically independently of each other, one client may actually win completely, but more likely, if they're writing different parts of the file, you'll get scrambled, interleaved data. Okay, so on the good side, NFS is simple and highly portable. On the bad side, it's inconsistent, and it has scalability problems, too. Very quickly: a better system that followed on from NFS was the Andrew file system, which was a research system at CMU, and basically it was meant to address a lot of the problems that people had noticed with NFS. Instead of the polling mechanism, it used callbacks; it's more like a publish-subscribe system. Clients that have a handle on a particular file, or a file system, notify the server that they have that handle, and the server then basically maintains a list of all of the clients that have a handle on that file system. When there are changes to that file system, the server does a remote procedure call to each such client to say: this has changed, what do you want to do? It's left to the client to actually figure that out. And instead of writing through on every update, it writes through on close. So you can make a lot of changes that just operate on the cache until you close the file, and then they get committed. This is a lot more performant for the client that's making the writes, but it's really hiding the changes from the other clients until you're done. So this is basically a locking model, because if you are modifying the file, no one else can see it until you're done. All right, so that's the first set of things. It also has a two-level caching system. Here's a small sketch of the callback idea.
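A minimal sketch of the callback scheme, with invented names: the server records which clients hold a cached copy of a file and notifies them on change, instead of making them poll. In AFS the notification would itself be an RPC from server to client.

```python
# Toy callback registry in the spirit of AFS. All names are invented.
class CallbackServer:
    def __init__(self):
        self.watchers = {}   # filename -> set of client callbacks

    def register(self, filename, client_notify):
        """Client announces it holds a cached copy of this file."""
        self.watchers.setdefault(filename, set()).add(client_notify)

    def write_on_close(self, filename, data, store):
        """Commit a client's changes on close, then break callbacks."""
        store[filename] = data
        for notify in self.watchers.pop(filename, set()):
            notify(filename)   # tell each caching client: re-fetch

# Usage sketch:
store = {}
server = CallbackServer()
server.register("notes.txt",
                lambda f: print(f"{f} changed, invalidating local cache"))
server.write_on_close("notes.txt", b"new contents", store)
```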
Okay, I think we'd better just move on, because we're going to run out of time; sorry, we have to skip some of that material. So, the midterm's coming up, which should be news to no one: 5:30 next Wednesday, in these two rooms, similar format to last time. You can have one double-sided sheet of notes, and it's going to cover all the material since the last midterm. We do have the review session scheduled from seven to nine in 306 Soda, Sunday, December 1. Don't forget we still have office hours, et cetera, for Monday, Tuesday, and Wednesday next week. Any questions about the midterm? Okay, and Project 4 initial designs are due Monday, so make some plans for that. Okay, so let's take a break, and we'll finish up and look at web-based RPC in the last half hour. Okay, let's wrap up. So let's review some of the things we talked about with RPC and NFS. First of all: does RPC require special networking functionality? Yes or no? No, all right, good. It operates at a high level; all it has to do is serialize data that can go over the network. Client and server for RPC must use the same hardware architecture? No; almost the whole point is to allow things to work across architectures. A local procedure call is much faster than a same-machine RPC, which is much faster than a remote-machine RPC? True. And NFS provides weak guarantees over data consistency? Yes. Okay, so as we said earlier, RPC is still considered a pretty low-level interface for distributed computing. It does some things very well: it does abstract a lot of the details of message passing, communication, and some of the reliability. But, for instance, it doesn't support much persistent shared state between a client and a server. You can argue that you don't want that, but there are some cases where it's inevitable. For instance, if you have a database that's distributed across multiple servers, you need a lot of state on each one, and you need it to be consistent. So just as in regular programming, distributed systems, like any piece of complex software, can benefit a lot from careful software architecture, and especially from object-oriented programming. Object-oriented programming allows you to break up the state of some complex system into objects and simplify their interaction through method calls, and the same idea works at scale with distributed systems. In fact, probably the most effort in making distributed systems work has gone into object systems. Most people have heard of CORBA, I'm assuming? It's not obviously the answer now, and it will probably be even less obvious in the future, because these two systems were incredibly important for a while but have kind of faded from visibility. They're still important, though, for the cases I mentioned, like distributed databases. All right, so CORBA was one of those other monstrous committee standards: Common Object Request Broker Architecture, almost unpronounceable as well. But the idea was to allow distributed objects to exist across machines, to let them call each other's methods, and to maintain consistent state. DCOM is a Microsoft standard: just as we saw with their RPC, they borrowed Microsoft RPC from the earlier consortium effort at DCE, and they similarly built a Component Object Model on top of that distributed computing environment. They also completely appropriated the name DFS, which used to mean an open standard that was part of DCE and now means the Microsoft Distributed File System. So a lot of effort went into that, and it's a lot more complicated than plain RPC. The idea is really to take on exactly the problems that plain RPC avoids, because you're making a commitment to actually have a lot of data shared across servers. It could be clients and servers, but it's especially important between servers, because they really do have a lot of shared state; with client-server systems this is arguably less important, because you can maintain everything on the server and just allow the client to access it. So in addition to remote function calls, you have to have object proxying, which means objects appearing on both sides of the connection, executing methods from either side, and maintaining consistency. And you even have to implement garbage collection; most of these systems have garbage collection, which includes the references from the other side.
So you've got to figure out, for an object that's maybe hosted on one machine and maybe proxied on another, how many other objects have references to that object across the network. You've got to count those, and only when they've all gone to zero, or when some of the machines that held references are definitely offline and won't be coming back with them, can you garbage collect those objects. Anyway, it's very complicated to do all of that. So anyway, they were successful standards; they were implemented, very widely used, and they're still used. And people figured that they would also become the protocols used across the internet for distributed servers, large-scale data centers, and so on. To first order, that didn't happen. Anyone have an idea why? Most of the answer comes, indirectly, from last time: what happened while those standards were evolving is that computer security became a much more acute problem, and the response to the very dangerous worms and viruses that were going around on the net was the introduction and very widespread use of firewalls, corporate firewalls, and packet filtering within routers over the internet. That made it very hard, mostly impossible, for packets on arbitrary port numbers to move around. The one thing that always works is HTTP, and email typically works as well; but to first order, HTTP is the only reliable protocol that goes everywhere on the net. So there are two ways of leveraging that. One way is to try to tunnel other types of payload, in other words sending packets that are not HTTP through port 80. That can work, but it's very tricky: at the endpoint you can demultiplex, with a router that carefully examines the packet headers and routes them to an appropriate local address. The trouble is that a variety of things can go wrong, because, for instance, packet filters in the middle of the network may be looking at the type of payload and filtering out non-HTTP traffic anyway. So the approach that really got deployed widely instead was just taking HTTP as the standard protocol for remote computing: you can treat HTTP as just a transport and send whatever you want across it. And that led to the technologies that have become the web service technologies, which are SOAP, with its service description language WSDL, and more recently REST. These are the dominant systems for what amounts to RPC over the web, and they're heavily built on top of XML and JSON. While XML and JSON are not necessarily part of the web standard themselves, they are the core technologies that enable it, and web services have driven a lot of development and a lot of interoperability around these formats; a lot of application software will speak XML and JSON precisely in order to work across the web. All right, so the idea of SOAP and REST is to allow you to send a more or less arbitrary remote procedure call across an HTTP connection, which among other things implies it has to be text. SOAP is the standard for remote procedure call; it directly implements an RPC model where you send some arguments and receive a result. So it has to specify a message format, because it has to say what kind of message is going to be sent across the connection.
The transport is usually HTTP, though SOAP messages can actually be carried over email as well, so two media can be used for the SOAP message, plus a set of rules and a set of conventions. That's a bit abstract, so let's just look at some examples. At a high level, a SOAP message just has a header and a body, with some XML schema boilerplate at the top to specify the schema, which means the structure of the XML. And here's an actual SOAP message, the calling message. There's the envelope and the body; here's the method name that's being called, OrderGoods in this case; and here are a couple of parameters being passed. Everything's going to be converted to text, in fact to an XML string. The string gets sent over HTTP to a server, and the server is going to build a string that looks like this: again with an envelope, a body, a method name, which in this case is just a return, and then the value, which in this case is some integer ID. Okay, so where are we? Here's a more detailed message. Here's the body, here's the method name, in this case GetFlightInfo. All of this is namespace aliasing. And then down here, we have an airline name, UL, and a flight number, 506; so the flight is UL506. And then the response gives you back more information about the flight, namely the gate number and the status, which is on time. So is it clear what's going on here? We've basically serialized the arguments to those function calls, and the return value, into XML strings. So it's very similar to the other RPCs we described, except that when we described them the first time, we didn't specify what kind of serialization we would use; with web services, it's always an XML or JSON serialization, so that it can run over a text port. So associated with SOAP is a description language, the Web Services Description Language, which allows you to specify exactly what the SOAP messages look like for a particular protocol. It describes what the service does, in other words what operations or verbs it implements; where it resides, which is just a URL; and how to invoke it, which will typically include argument types, and in some cases protocols. All right, and this one is not too bad, actually. Here's the interface section, which specifies a set of messages: each of these is a method name, and each message has parts, shown as sub-elements of the XML. So this is quite abstract; it just tells you what the structure of the messages is. The implementation section contains a list of the same message names, but in addition, the types of the arguments should be visible here, along with encodings and namespaces. Why does it say use equals literal? Probably just because these are strings, so they all end up with string types. Anyway, so there are these two different levels of the specification, and that's typically enough for a compiler to generate the stub code to actually produce the RPCs at both ends. Here's a sketch of what sending such a call over HTTP amounts to.
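As a sketch of the mechanics, here's roughly what a SOAP client does under the hood: build an XML envelope and POST it over HTTP. The endpoint URL and namespace are made up to match the flight-info example on the slide; real clients generate this from the WSDL rather than writing it by hand.

```python
# Hand-rolled SOAP call, just to show the mechanics: serialize the
# arguments into an XML envelope and POST it over HTTP. The endpoint
# and namespace below are hypothetical.
import http.client

envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <m:GetFlightInfo xmlns:m="http://airline.example.com/flights">
      <m:airlineName>UL</m:airlineName>
      <m:flightNumber>506</m:flightNumber>
    </m:GetFlightInfo>
  </soap:Body>
</soap:Envelope>"""

conn = http.client.HTTPConnection("ws.example.com", 80)
conn.request("POST", "/flightinfo", body=envelope,
             headers={"Content-Type": "text/xml; charset=utf-8",
                      "SOAPAction": "GetFlightInfo"})
response = conn.getresponse()       # the reply is another XML envelope,
print(response.read().decode())     # e.g. gate number and status
```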
Okay, so the last topic is REST, which is a reaction to SOAP, a second-generation approach that diverges from it. SOAP was the dominant standard for a while when web services first became popular, but it has the limitations that we described before for RPCs. There's the issue of what happens if there's a failure. There's the question of how the system works with web caching, because caches normally act as intermediaries between clients and servers. Clients also sometimes move around in IP address. So there are a number of things that make even a fairly simple round-trip RPC like SOAP more complicated. REST was the reaction to that, an attempt to make things even simpler. The idea is that every message in a REST protocol contains all the information needed to execute the request. Also, there's just one type of object, which is called a resource; resources can be websites, or objects hosted by a website, or even parts of an object, so you can have these very long hierarchical URIs. REST sits on top of HTTP, so you have the usual operations, but they're simple one-shot request-response operations, not a round-trip RPC session like SOAP. And in particular, they're idempotent: REST tries to be idempotent, and it is idempotent in the same way HTTP is. You can keep doing the same REST operation many times and the outcomes are the same, modulo the effects of an update; if you're the only one interacting with a service, they're idempotent. And REST describes resources using basically a URI together with a representation in XHTML or XML. Okay, so quickly, here are the principles of REST. It's a cleaner client-server separation than in SOAP or the normal web protocols. In particular, it avoids the use of cookies on the client; it avoids distributing state onto the client. In REST, the server should house all of the state, so that if your client has to move, say you transfer from your laptop to a smartphone, you can use the same URL. You might need to get credentialed, which is the one stateful piece of this, but once you're credentialed, it will do exactly the same thing whether you access it from your home computer or from your smartphone. Everything's maintained on the server for that reason. It's stateless, meaning there's no client context stored on the server between requests, and that allows it to be idempotent; it allows repeated requests to work; and, the third reason, it allows you to recover from disconnects or server failures by just doing the operation again. It's consistent with web caching: clients, or somebody in between the client and the server, can cache responses, because the response is going to be the same as long as the URI is the same. So an intermediary that has seen somebody else do the same URL and receive a certain response can cache it and give it to you, reducing load for common URLs. And those two properties go together: more generally, REST supports layering, where you can't really tell whether you're connected to an intermediary or to the real server. The idea is that there's this one-to-one mapping between URIs and responses that an intermediary can copy and reproduce to provide the same service, or an approximation of it. And finally, as we saw earlier on, it's often useful to be able to give the client some custom code for efficiency: basically, give the client stub code so it can execute a more complex call.
For instance, the custom code might send just the data that the server needs for that transaction. That's actually an optional part of the REST principles, but it should be clear why it's useful; we already explained that example in the context of RPC registries earlier. All right, so quickly, here's an example of a REST transaction. Here's the URI, the location of this resource, and here are the contents of the resource. So in order to get information about the resource, the client just needs to do an HTTP GET of that address, and it gets this information back. If the user is suitably credentialed, they can also PUT data back: they can modify some of the attributes here, and it will appear on the server quickly, immediately in most cases; there's no caching delay in this setting. And that information then normally becomes immediately available to other clients. So this is very clean when you're doing simple update operations, as we've seen here. On the other hand, the client has to be fairly smart. It has to be able to produce these URIs, which in many cases involve hierarchical access to data structures. For access to some complicated data set or database, the client is probably first going to download a directory or the schema of the data set, in order to know how to name the subfields and traverse to what it wants. That's not too difficult, but at a minimum the client has to be able to parse XML, understand schemas, and so on. Okay, so a simple way to think about the contrast between REST and RPC is that RPC systems are based on verbs: actions, and arguments to actions. In SOAP, we saw that the operations typically involved adding and removing users or updating user content. In the REST system, the emphasis is on nouns: for each of those transactions, there's an associated user or resource whose state you want to change in response to the action. The first four operations here act on a user, so in the REST version, there would be a URI for the user, and you would read and then update the contents of that URI to modify it. So the client has to be a lot smarter about what this all means, and the server also has to be careful to check that the things being posted by clients are valid: compliant with the schema, and legitimate content for that URI if it's a PUT. But modulo those constraints, REST makes the implementation a lot simpler, and it makes a lot of things just work, across caches, through firewalls, and so on. And it has become the dominant system for distributed messaging and RPC-style computing across the web. Here's a small sketch of what a REST interaction looks like on the wire.
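A minimal sketch of a REST interaction with Python's standard library; the host and resource path are invented. Note that every request names the full resource, so there's no session state to lose.

```python
# Minimal REST interaction sketch. Host and resource path are invented.
# Every request names the full resource URI, so there's no session state.
import http.client

conn = http.client.HTTPConnection("api.example.com")

# GET: idempotent read of the resource -- repeatable, cacheable.
conn.request("GET", "/users/alice/profile")
print(conn.getresponse().read().decode())

# PUT: idempotent update -- the body is the complete new representation,
# so retrying a failed PUT converges to the same final state.
body = "<profile><name>Alice</name><status>active</status></profile>"
conn.request("PUT", "/users/alice/profile", body=body,
             headers={"Content-Type": "application/xml"})
print(conn.getresponse().status)
```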
Okay, so to wrap up. RPC is conceptually a way to execute function calls across machines as though you were executing a simple local function call. The RPC system implements some amount of error checking and some of the difficult-to-write state maintenance between client and server, and provides the programmer with a simple interface where they only have to worry about implementing the functionality of the function call: basically two pieces of code, written much as they would be on one machine. RPC remains a backbone of distributed computing, and it's still widely used as an interface, especially for local communication on one machine. We talked about distributed file systems, NFS and briefly Andrew, which implement remote file access using RPC. And then we talked about object systems, which are a layer above RPC that allow you to view a distributed environment as though it were objects you can access from your machine, even though they're actually spread across many machines. We talked about SOAP, a remote procedure call protocol that runs over HTTP, so it can work across the heavily firewalled web, and its companion description format, WSDL, which typically allows compilers to generate the stub code for a specific set of messages. And finally, we talked about REST, which is a kind of second-generation system that avoids the stateful round-trip machinery of RPC in order to make things even simpler, more robust, and more compatible with the web, in terms of caching, recovery from errors, and people having a kind of mobile life where they access their resources from many different endpoints. So that's it. Have a great holiday, and we'll see you on Monday.