Good afternoon, everyone. Thanks for being here. My name is Fenu. I'm a lead architect at Huawei. I'm here with my co-speaker Irene. She's a researcher at Microsoft and holds a PhD in distributed systems. I would also like to mention my colleague who envisioned this project and thought it would be worth open sourcing. Unfortunately, he couldn't make it to this event. All right, let's get started. Our team and Irene's group have been collaborating for the past few years on a pretty exciting project: to take the research she did in distributed systems for her PhD thesis and expand it into a full-fledged open source project in the CNCF. We are here to talk about that project today.

In today's agenda, we'll cover a brief history of the evolution of applications, the nature of modern applications, the shift from monolithic development to microservices, and the skill sets required of application developers, system developers, SREs, and DevOps engineers. Note that "application developer" as I use it here is not a standard industry term; I'll explain it when I get to that slide. Then we'll briefly touch on the broader effort to create a full distributed operating system, and zoom into the most interesting part of it, the microservice runtime system. Irene will describe the internals of how this system works and show some interesting findings from her research, and then I'll summarize the current ongoing work and show a demo of a face recognition app running on the Amino runtime.

Once upon a time, the applications people interacted with day-to-day were pretty simple. They typically supported a single user and ran on a single operating system on a standalone machine. There were things like text editors, document viewers, calendars, and simple games. Life was good for application programmers. Single-machine operating systems worked fairly well: Linux, Windows, macOS.
All of them provide a great layer of abstraction for application programming, with basic building blocks like local processes, virtual memory, file systems, synchronization primitives, and locks. All of these are indispensable when writing an application for a single machine. Mostly one or two programming languages are required to build an application, and application programmers understand their platform pretty well.

But suddenly everything changed. Cloud computing happened. Everybody moved to smartphones and mobile-first strategies, with network connectivity and high-speed internet everywhere. So the type of applications changed. Suddenly we have social media applications like Twitter, Pinterest, and Facebook, and in addition, some previously single-user applications also became collaborative, multi-user applications.

There are some fairly big differences in the nature of today's applications and the kind of application development we need to do. First, users share these applications, so an application has to run on a number of servers in the cloud. Next, users access these applications through mobile devices like cell phones and tablets. That means applications need to have a mobile component, a cloud component, sometimes a desktop component, and increasingly edge components too, in the case of smart cities and autonomous systems such as autonomous vehicles. And lastly, these applications need to coordinate across distributed components over links that are sometimes slow or unreliable. Today even an aquarium at home can have a Raspberry Pi attached to it that controls the water level, temperature, feed, and chemicals, all of it controlled through software running in the cloud. In summary, all this adds up to significant challenges for application programming. Applications have to coordinate data and computation across distributed nodes.
They have to meet application requirements with guarantees despite performance limitations and failures. And they have to do all this across different programming languages and hardware resources. Which means application development is really very hard: multi-user, multi-program, multi-platform, multi-node, et cetera. This is one of the reasons why the Cloud Native Computing Foundation exists.

Along came the CNCF, containers, microservices, et cetera. Most of you know the things on this slide well. Things have certainly gotten better. Applications can be partitioned into deployable containers and programmatically orchestrated with declarative configurations in Kubernetes. A different language can be used in every single container. And we can connect microservices with service meshes like Linkerd, et cetera. SREs can run these things and developers can develop them.

However, the problem is somebody still has to write the code that runs in these containers. And it turns out to be very difficult to do that reliably, in a way that gives the application the properties you want. Companies like Google and Amazon have been doing it for quite a long time, but they have somewhere around 100,000 pretty good software engineers working on building these resilient and scalable systems. The problem is that a company that needs to build some end-user application doesn't necessarily have thousands of distributed systems software engineers. The kinds of problems people run into very frequently are things like distributed concurrency, synchronization, and reliable RPC. How do you make an unreliable network look reliable from the remote procedure call point of view? How do you handle failures, disconnections, reconnections, replication, leader election, sharding for horizontal scaling? All of these are complex distributed systems problems that we are forcing application developers to understand and handle.
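To make the reliable-RPC point concrete, here is a minimal sketch of the kind of retry-and-idempotency boilerplate an application developer ends up hand-writing today. The names (`ReliableRpc`, `Transport`, `callWithRetry`) are hypothetical illustrations, not part of any Amino API.

```java
import java.util.UUID;

// Hypothetical sketch: the retry/idempotency logic an application
// developer must write by hand to make an unreliable network call
// look reliable from the caller's point of view.
public class ReliableRpc {
    // The unreliable transport: may throw to signal a dropped message.
    public interface Transport {
        String send(String requestId, String payload) throws Exception;
    }

    // Retry the call with the SAME request id so the server can
    // deduplicate, backing off exponentially between attempts.
    public static String callWithRetry(Transport t, String payload, int maxAttempts)
            throws Exception {
        String requestId = UUID.randomUUID().toString(); // idempotency key
        long backoffMs = 10;
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return t.send(requestId, payload);
            } catch (Exception e) {
                last = e;                 // transient failure: retry
                Thread.sleep(backoffMs);
                backoffMs *= 2;           // exponential backoff
            }
        }
        throw last; // retries exhausted: surface the failure
    }
}
```

Even this toy version glosses over the hard parts (server-side deduplication state, timeouts, partial failures), which is exactly the burden the talk argues should be lifted off application developers.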
Some of these areas are partially addressed by things like service meshes, but there isn't a single platform where you can sit down, write a distributed application, and know that you have all the primitives you need in order to build it. If you contrast this with a single-machine operating system like Linux, Linux provides a great abstraction, but we don't have an equivalent of it for distributed systems.

Not surprisingly, these are distributed systems problems. These are the things we study in computer science operating systems classes. We learn how the TCP sliding window protocol works, but we don't expect application developers to implement the TCP sliding window protocol in their applications, or to decide which memory should go on disk and which should go in RAM. This should be abstracted away, which at the moment it is not in the case of distributed systems, unlike on a single-machine operating system, which takes care of all these things under the hood.

So the question is, what do we do about this? We clearly need people to solve these problems, and they are not necessarily application developers; they should be specialists in this. The term I'm using here, "application developer", is not something everybody understands, and it's certainly not a standard term in the industry, so I just want to be clear on what I'm talking about. I think a large amount of very necessary specialization has developed in this area. On one side are application developers. These are the kind of people who know a particular application domain very well, like social networking, travel, finance, banking, et cetera. They know what their systems need to do, what their customers want, what functionality they need. They don't give a hoot about distributed systems problems. In the middle we have system developers. They are the people who are interested in understanding and solving the complex distributed systems problems. They are extremely short in supply.
Problematically, they don't understand your business needs. On the right-hand side we have SREs and DevOps engineers. They are the people who understand how your actual application behaves in real life. What happens when a particular customer hits your application? How much capacity do you need? What does overload look like? What do your alerting systems look like? What breaks and why? Once again, they are not necessarily distributed systems experts, and not necessarily application domain experts, but they do understand how to run systems well. I think it is very useful to understand these distinctions. SREs are good at running reliable systems. Application developers are good at their application domain.

So what we think is that ultimately we need a distributed operating system. Amino OS is an umbrella project, and we think it should probably comprise at least these major systems. The first one is Amino Run, which manages distributed microservices. It is a distributed programming system that provides customizable and extensible deployment of mobile and cloud applications. This flexibility enables programmers to separate deployment logic from their applications. The second one is Amino Sync, a reactive data management service, which provides persistent cloud storage and reliable synchronization between storage and mobile devices, and also provides automated execution of application code in response to shared data updates. The third one is Amino Store, a distributed transaction protocol that provides linearizable transactions using inconsistent replication, unlike two-phase commit, which requires strongly consistent replication within each partition. The last one is Amino Safe, a distributed system that gives users control over how mobile and cloud applications share sensitive user data collected on mobile devices; basically, it deals with the privacy and security aspects. We are going to talk about Amino Run today. Irene will be describing it in detail; I just wanted to give the context and show where Amino Run fits.
This picture is self-explanatory. What we really want you to take away is that distributed cloud native applications shouldn't have to worry about which piece of code runs on which machine; the operating system should take care of that. Amino is a unified system that provides the microservice runtime, memory management, storage, and security components, and it needs to run across all the operating systems in the network.

A lot of research had been going on for several years at the University of Washington. Around the time Irene published her thesis, we came across it and thought it was worth turning into an open source project in the CNCF so that people could use it. The research wasn't originally intended for that purpose; it was really exploratory, to figure out what this space looks like and what we needed. In particular, there are academic papers floating around on the internet, and some of the terminology in those papers is different from what we have described here. Over time we discovered that some of the terms we used were confusing. For example, we renamed the Sapphire system to Amino Run, and Diamond to Amino Sync. So if you see the older terms floating around on the internet, hopefully you'll understand their origin. I'm going to hand over to Irene from here.

Can you hear me? I'm going to talk in English because my Chinese is not that good. You've gotten a great introduction to the overview of the whole Amino system. I'm going to focus today on the very detailed parts of the Amino Run subsystem: sort of what the magic is, and what the knobs are that make it all happen. For this subsection of the talk, I'm going to cover the goals we had in doing the research for this portion of the Amino Run system, and its architecture and how it actually works. Then I'll talk a little bit about deployment managers, which are the magic sauce of this system. And finally I'll wrap up with some experience and evaluation, and then we'll go back for a demo.
So when we started taking a look at this problem space of developing a distributed operating system for mobile and cloud application developers, we had a couple of goals we wanted to achieve in building this runtime system. The first is that we did not want what we considered deployment logic to be mixed in with the application logic. You can think of deployment logic as the code a distributed systems programmer would have to implement, like building RPC, compared to something an application programmer should be building, which is how your application operates, or what a tweet is. We wanted to make sure that the application code itself could be very simple and intuitive, but we still wanted application programmers or SREs to be able to decide how to actually deploy their application. So we didn't want to take all of this control away, even though we were going to separate the deployment code from the application code itself. Then we had some other high-level goals: we want to support a large number of programming languages, we want to provide reasonable performance, and we want to support external infrastructure systems like Kubernetes, Istio, et cetera.

So what we came up with was this Amino Run system, a distributed operating system that supports pluggable deployment managers, which extend the functionality of the distributed operating system in various ways. And I'll show you how these deployment managers actually allow application programmers to decide exactly how their application is going to be deployed.
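As a rough illustration of the "pluggable deployment manager" idea just described, here is a minimal sketch in which deployment logic (fan-out replication of each call) lives in a separate, attachable component rather than in the application code. The interface and class names (`DeploymentManager`, `onRPC`, `ReplicationManager`) are illustrative only, not the actual Sapphire/Amino Run API.

```java
import java.util.List;
import java.util.function.Function;

// Illustrative sketch: a deployment manager interposes on RPCs to a
// microservice, so replication can be attached like a library without
// touching the application code.
public class DeploymentManagerSketch {
    // The application-facing call: method name in, result out.
    public interface Microservice extends Function<String, String> {}

    // A deployment manager sees every RPC before the microservice does.
    public interface DeploymentManager {
        String onRPC(String method, List<Microservice> replicas);
    }

    // Example manager: route the call to ALL replicas and return the
    // first (primary) replica's answer.
    public static class ReplicationManager implements DeploymentManager {
        public String onRPC(String method, List<Microservice> replicas) {
            String primaryResult = null;
            for (Microservice r : replicas) {
                String result = r.apply(method); // same call on each replica
                if (primaryResult == null) primaryResult = result;
            }
            return primaryResult;
        }
    }
}
```

The point of the sketch is the separation of concerns: swapping `ReplicationManager` for a caching or offloading manager changes how the application is deployed without changing what it computes.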
So the best way to understand how Amino Run works is by taking a look at its architecture, and the key thing to notice is that there are three layers. At the top is the distributed application, the application code that runs across all of your mobile devices, IoT devices, and cloud servers. At the very bottom layer we have a deployment kernel, which implements best-effort, basic deployment tasks, like making a call to a different part of the application or finding other components in the application. But the most important part of the Amino Run architecture is the deployment management layer. The deployment management layer extends the functionality of the deployment kernel in such a way that application programmers can actually choose what kind of functionality the deployment kernel will offer each part of their application.

To give you an example: the application programmer takes their application and splits it up into a couple of large microservices. Underneath, we have the deployment kernel, which runs these microservices, but for each microservice there is a deployment manager that provides the exact functionality and deployment tasks that that microservice needs.

Going back to the top level of the Amino architecture: Amino applications are fundamentally large microservices. They run in a single address space, so any Amino Run microservice can call any other Amino Run microservice, but they can execute anywhere, and they can move transparently to the application itself. The application code doesn't have to worry about where these microservices run, and it doesn't have to worry about how to find or reach one of them. What's important about the microservice encapsulation is that it gives a granularity at which the runtime itself, the deployment kernel, can decide to move and allocate instances of the application. So if you had a distributed
game, for example, you might create one microservice that holds your game board, another microservice that holds the scores of all the players, and then maybe another microservice that implements some sort of game logic or physics engine.

If we look down into the deployment kernel: the deployment kernel is a piece of software that is distributed, but it runs on every single cloud server and mobile device. Its entire job is to instantiate and manage the high-level microservices that comprise Amino Run applications. It tracks where each microservice is, it enables microservices to move around transparently to the application, and it also makes sure that microservices can call each other transparently, without having to look up IP addresses and actually set up TCP connections and things like that. So that's the bottom layer of the architecture.

Now I'll talk about the deployment managers, which are the secret sauce. The way deployment managers work is that they implement sophisticated microservice deployment tasks by interposing on calls to the microservice, on RPCs essentially, and then routing or managing those calls. To give you a concrete example, one deployment manager might replicate a microservice by creating multiple instances of that microservice in different containers. Then, every time the application makes a call to that microservice, the deployment manager will route the call to all of the replicas. The nice thing about deployment managers is that they encapsulate deployment code and tasks separately from the application code itself. So programmers can decide that they have implemented this microservice, this piece of application logic, and then later, if they want replication, they can simply add a deployment manager, sort of like a library, onto their microservice and transparently achieve that functionality without having to change their application code. I'll give some examples in the next section about how
exactly these deployment managers work. But to understand it a little better: if you think of each of these as being a microservice, we could decide that perhaps this game board microservice needs to be replicated. So we would attach a replication deployment manager to it, which would ensure that the game board microservice is replicated and instantiated in different containers, and then have each of the RPC calls to the microservice replicated. Then we might have a set of lease caching deployment managers for your scoreboard, or maybe some code offloading for your physics engine.

So we have a large range of deployment managers. The thing to understand here is that almost any deployment task you might want to implement could be implemented in a deployment manager. Essentially, as long as you're able to implement the task by interposing on calls to the microservice, you're able to implement that deployment task inside a deployment manager. And we've already built a fairly extensive library of stock deployment managers that application programmers can go ahead and use, or modify themselves if they're distributed systems people. The other interesting thing is that you can combine them in interesting ways, so you can use more than one deployment manager at a time. We have lots of different categories in which we group our deployment managers. We have a bunch of deployment managers that implement very primitive deployment tasks, like exactly-once RPC. We also have caching and serializability, checkpointing, replication, mobility, scalability, and more. So there is a very large set of these library deployment managers already, and more to come.

Now I'm going to talk a little more about how the Sapphire slash Amino Run deployment manager library works, so you can understand how it's possible for just any programmer to extend these deployment managers. When I talked about how the deployment managers interpose on RPC, the
way that we actually implement deployment managers is through an interface provided by the Amino Run deployment kernel. What happens is that each deployment manager is implemented as three separate components, each of which supports a sequence of upcalls that the deployment kernel automatically invokes when RPCs come in. Let me give you an example. We have a server-side component, which is co-located with where the microservice actually runs; when an RPC comes in, an upcall is made to this server-side deployment manager component. Similarly, for the client side we have another component that runs where the client makes the RPC; whenever the client makes an RPC, the kernel invokes an upcall inside this component of the deployment manager. Then we have a centralized component for every deployment manager, which we implement on something that is essentially an etcd kind of service, to provide replication and fault tolerance for these components. As you can imagine, these centralized components are going to be much slower when they run, compared to the server-side and client-side deployment manager components, so we encourage programmers to put most of their implementation into the other two.

So let's take a look at what happens exactly when an RPC is made to a Sapphire or Amino Run microservice. When you start the microservice, the Sapphire deployment kernel will actually instantiate the deployment managers automatically. Let's say we're trying to replicate this microservice. The deployment manager will receive an upcall saying that the microservice has been created, at which point it will create replicas of the microservice on two other kernel servers. These replicas can talk to each other, and when clients make calls, the client-side deployment manager component will forward those RPC calls to where the microservice is, and then those calls can be replicated to the other copies of the microservice. So
let's look at exactly what happens with a primary-backup, sort of leader-based, replication protocol. We first stand up three copies of the microservice. For a consensus-based, leader-based replication protocol, we designate one of the microservice instances as the leader. Then, when any other microservice makes an RPC to this microservice, the client-side component directs the call to the leader instance, which then replicates the call to the two backups, waits until the RPC finishes running, and responds to the client saying that the RPC was successfully made and, of course, replicated to the two other instances of the microservice. So this is essentially simple state machine replication, but as you can see, we've been able to do it inside a library, completely separate from the application code itself. We're able to essentially attach this deployment manager to a microservice and provide state machine replication as a service, without changing any of the code of the application itself.

I'll go really quickly through how this works for a couple of other things, because we're running a little bit short on time. Another option would be taking a microservice that perhaps runs some code on the client side, like maybe an IoT device of some sort, and offloading that to the cloud. Essentially, we would create an instance of the microservice, the same code, in the cloud, and then just forward calls from the code running on the IoT device or phone to the cloud instance. Depending on which of the two runs faster, we can make a decision about where to actually run that code. We can do sort of the same thing, essentially in reverse, to cache data from a microservice locally on the mobile device. What happens is we have some instance of the application microservice in the cloud; then, when we make a call, we bring the data from the microservice in the cloud to the mobile device
and then make the call locally on the mobile device instead. Those two examples were pretty simple, just so you could understand exactly how the sequence of events happens inside a deployment manager. But you might want to run a much more sophisticated algorithm, for example to automatically deploy an application so as to minimize latency. What we're actually able to do with the deployment kernel is much more sophisticated: we can monitor the latency and throughput of every call between every kernel server and every microservice, and make a decision about where to place things using a distributed algorithm. We'll actually see an example of this migration and dynamic placement happening in the demo later.

We can compose deployment managers as well. It's done by chaining them, so you can imagine that every time you make an RPC, the call goes through a sequence of deployment managers. But of course you can imagine that this rapidly gets very complicated, so we don't imagine people would chain a lot of these deployment managers together.

Alright, so five years ago, when I did this research, we did a little experiment, and I'll show you the experimental results, although the demo that we're going to show later is probably much more interesting. We did this experiment on a couple of mobile devices, which were pretty out of date, and a Dell server that we just had in our lab. All of this was running over Wi-Fi, and also 3G, or whatever we had at the time. Essentially, we built a multiplayer game, and we were able to use a couple of different deployments of that game. What you're seeing here is, in each case, the read and write latency, in milliseconds, for making a move and then reading the game board. The first thing we did was put the game board as a microservice in the cloud, and then we let both
mobile devices make calls to the cloud, both to make moves and to fetch the state of the game board. That means everybody sees roughly uniformly high latency. What we were then able to do is change the deployment manager to move the game board microservice onto one of the phones, at which point the player whose phone the microservice is running on obviously sees very low latency, while the latency for the other player rises. The really cool thing here is that we didn't have to change any application code to turn this game from a cloud-server-based game into a peer-to-peer application. Then we did the same thing again, but replicated the microservice on both players' devices, so reading the state of the game board became very fast; but then every time a player made a move, they had to update both copies of the game board on both devices, which increased the write latency.

I'm going to skip this next experiment, because it shows something very similar. I will also say that the Amino Run system does support multiple programming languages through the use of GraalVM. This was one of our goals: to be somewhat programming-language agnostic. You can essentially write your Amino Run application and deployment managers in any of these languages. So that's pretty much it for the guts of the Amino Run system. I will give it back to Fenu for the next steps and the demo.

Yeah, about the next steps: it's still early, and there are a lot of ways in which we would like to extend Amino Run in the future. Today, microservice migration only works for the state stored inside the microservices; there can be local files and dynamic state, which we are not migrating as of now. We have to work on that. And we need to support...
Definitely we need to support more languages; we support very few now, and with GraalVM we are going to extend that. We also want to build some more DMs to integrate with external systems like load balancers, etcd, and TiKV. And at the end, we still need to build some good edge applications and verify that everything works well, at least on KubeEdge, rather than only on normal Kubernetes. Feel free to get involved. This is already open source; there is a Slack channel, a GitHub repo, and a website. You can have a look and get in touch with us if you are interested.

I was going to run the demo of the face recognition app, but we don't have time now; we will share it offline if people are interested. We are open for Q&A now. Any questions? Here, here.

Hello, okay, so you said earlier that you were able to sort of turn a cloud game into this peer-to-peer game. Is that like a live migration? Or is it some kind of... you need to take the service down, do some sort of reset, and start it back up? I think you have an answer.

So the question was whether the experiment that I showed was a live migration or not. The particular example I showed was not; it was a change of one line of code and then standing up the service again. But the demo we have, if you want to look at it online, is actually live migration.

Adding to that: it's a Java-based application, so it's Java serialization on one end and deserialization at the other end. We don't have dynamic state migration as of now; we are going to extend that support.

Thank you, we're done.