Okay, so the recording is on and thank you everyone for joining the Edge Working Group session today. It's April 11th, 2024 and we are starting with a presentation from Inria given by Hélène about their choreographic infrastructure-as-code project. So I will give the microphone to Hélène to guide us through the presentation. Thank you. So everyone seems to know me on the video, but as this is recorded, I will do as if no one knows me. So I'm Hélène Coullon, I'm an associate professor at IMT Atlantique, an engineering school in France, and I'm also a member of the STACK research group at Inria. Today I'm going to present a set of works I've done with many people; I will mention the people involved in this work later. So the title of the talk is Toward Choreographic Infrastructure as Code, and I will try to be as clear as possible throughout this presentation. I will first start with some concepts that you probably know very well in this working group, such as the concept of infrastructure as code. The basic idea is that now everything, or almost everything, is infrastructure as code, even a simple web application, because you have to provision some bucket to store data, some Kubernetes cluster if you have a containerized microservices application, et cetera. And because the capacity and the complexity of infrastructures grow through time, system administrators and developers thought that it would be great to handle infrastructure as code, meaning using good practices from software engineering, and languages, to write infrastructure and to make infrastructure evolve through time. So, a few definitions and basic examples to start. I will consider an example with three resources that I call A, B, and C. A resource in this talk is any piece of software to deploy and maintain: it could be a microservice, a piece of infrastructure, et cetera.
So what I will call a lifecycle is the set of possible states for a given resource. What I will call a dependency is when a resource requires some information from another: for instance, in the above example, A is using B and B is using C, so these are dependencies. I will call a goal, or a target state, the state that I want to reach for my infrastructure. For instance, I would like to switch the resource C to version 2.3 and then go back to a running state; this is an example of a goal. And finally, I will call a plan the sequence of actions to move from the current state to the target state. So let's take a first very basic example. Let's consider that I have a very limited lifecycle for my resources: I can turn on a resource or turn off a resource. Now let's see what kind of plan we can build with this lifecycle expressivity. The first, very naive, solution to update the resource C to version 2.3 is to first turn off A, then turn off B, because A is using B and B is using C, then turn off C, update C, turn on B, and turn on A. This is a very naive way of doing an update of the resource C, and this sequence of actions, turn off A, turn off B, et cetera, is what we call a plan. In this example, we can see that what is actually happening is a propagation of the turn-off action: we need to turn off the full system to update C, and then turn everything on again. Another example, which is a bit smarter: I have the same assumption, so I only have turn on and turn off in my lifecycle, but this time I'm going to start the new version of C, which I call C prime, then turn on a proxy that will intercept requests from resource B to resource C and load balance the requests, for instance 50% on C and 50% on C prime. It could be another split, of course.
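The naive plan just described can be sketched in a few lines of Python. This is only an illustration of how the turn-off action propagates up the dependency chain; the function name and the action tuples are invented, not part of any real tool's API:

```python
# Sketch only: with a turn-on/turn-off-only lifecycle, updating a
# resource forces stopping everything that transitively depends on it.
def naive_update_plan(chain, target):
    """chain lists resources outermost-first (A uses B, B uses C);
    target is the resource to update."""
    i = chain.index(target)
    above = chain[:i + 1]            # target plus all its dependents
    plan = [("turn_off", r) for r in above]
    plan.append(("update", target))
    plan += [("turn_on", r) for r in reversed(above)]
    return plan

print(naive_update_plan(["A", "B", "C"], "C"))
```

This prints the seven-step plan from the talk, with an explicit turn-on of C added after its update.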
Then, when I consider that C prime is running perfectly well, I can turn off C and turn off the proxy, and now I have the new version, C prime, without ever turning off A and B. With this solution, though, I have to add an additional reverse proxy, and you also have an additional cost because, for a temporary period, you have two instances of C; but you do not have any downtime. So this is another example. And I have a third example, just to illustrate that what you can express in the lifecycle is actually quite important. Let's imagine that I can now have an intermediate state in my lifecycle, which is pausing the service. In this case, my plan could be: pause A, pause B, turn off C, update C, turn on B again, and turn on A. So you have some downtime, but you don't have to turn off A and B, and you will probably have shorter execution times, because it's probably much faster to pause and resume A and B than to stop and restart them. So these are examples of plans. The second thing I want to introduce before diving into the details of my work is declarative approaches to infrastructure as code. The idea of a declarative approach is that we do not want the DevOps engineer to write plans, that is, the sequences of instructions to move from one state to another; we would like that to be automated. Many existing DevOps tools have a declarative approach: the tool automatically computes the set of actions to perform to move from the current state to the desired state. What would be expressed is, for example, "I want to switch C to version 2.3 and go back to a running state." So I will try, through this presentation, to address three motivations in the work I'm going to present. The first motivation is: what if we put as much flexibility in the lifecycle as required? Meaning you have a programmable lifecycle, and you can add as many intermediate states, like the pause we've seen before, as you need.
The associated research questions are: will we have shorter execution times, as illustrated with the pause mechanism before? And will it be possible to keep a declarative approach if we do that? Because by introducing a programmable lifecycle, you also introduce more complexity. The second motivation is: is it possible to offer fine-grained programming support (I will explain through the talk what that means) to write the lifecycle of software entities and the dependencies between lifecycles? In particular, in my work, I'm interested in introducing more parallelism and concurrency to reduce the execution time when deploying or maintaining infrastructures. With the same research questions as before: can I have shorter execution times, and can I keep a declarative approach? Again, because this will increase the complexity. And finally, the third motivation, which is probably the one most related to edge issues, is that existing declarative infrastructure-as-code solutions are mainly designed in a centralized fashion. It means that the plan, the sequence of instructions to apply to move from the current state to the desired state, is coordinated by a central entity. The actions are sent to the distant nodes, but you still need this central entity to make sure that the overall coordination behaves correctly. So the third motivation is to be able to have decentralized infrastructure as code. This is particularly interesting, for instance, when you have cross-functional or cross-geographical DevOps organizations, but also when you have frequent network disconnections, such as at the edge or in a cyber-physical system. So in this talk, I'm going to present first what we call the Concerto reconfiguration model; you can see Concerto as a coordination model for DevOps procedures, and I will explain that.
Then I will explain how we make Concerto a choreography, so I will introduce what a choreography is and how we do that. Finally, I will present a recent work on how we've done declarative choreographies with a tool that we call Ballet, and I will conclude. Okay, so before explaining Concerto, I will do an unusual thing, and I do that because I hope presenting some results first will motivate you to listen to the rest of the talk. Here I present some results on the deployment of an OpenStack system. It is a minimal OpenStack system, composed of 11 components and 36 services. We compare the deployment time to Kolla-Ansible, which is a production tool to deploy OpenStack. And we've done the same deployment with Concerto and with a concurrent solution from the academic literature called Aeolus; Aeolus is something very close to Juju, for instance. What you can see is that with Concerto, we are able to save around 70% of the execution time when deploying OpenStack. Why is that? It's because of the level of parallelism and concurrency that we are able to offer within Concerto. So here you can see, by the way, can you see my pointer? Yes. Yes, okay, perfect. So here you can see the Gantt chart of the execution using Ansible. With Ansible, what is actually happening is that each role is applied one after the other. So in OpenStack, you will first have the facts, then a common set of dependencies that are required by all roles, and then HAProxy, memcached if I remember correctly, MariaDB, et cetera; you have to deploy all the roles one after the other, like that. With Aeolus, which is closer to Juju, as I said, you can introduce a bit more parallelism thanks to the level of the dependencies between the tasks that have to be applied within the roles.
And with Concerto, we increase the level of parallelism a bit further, because in each role, for instance Nova here, you have a few lines for Nova, because we are able to divide the actions needed to deploy Nova and execute them concurrently. The same holds for Neutron, for instance, or MariaDB. Okay, I have another example. This example is extracted from an OpenStack Summit talk, where they use a Galera cluster for the MariaDB database and switch from a centralized MariaDB to a decentralized one. It means that you need to reconfigure the running centralized MariaDB to become a master in the Galera cluster and, additionally, deploy the new worker nodes. And you can see again that, compared to Ansible, we save around 50% of the execution time. So now I will give some details on how we do that and what the Concerto language is. The idea is to introduce as much parallelism as possible, while being sure that the coordination is safe and that the dependencies are guaranteed. The first thing to do, a bit like writing a role in Ansible, is that you, as a developer, have to write what is called a control component. This is how it looks using a graphical notation. What we do is that we have a programmable lifecycle for each software entity. It does not represent the functional code of the software entity, but the lifecycle, so the controllable aspect of the software entity. In this example, it's a server, and I have decided to have two kinds of behaviors in my lifecycle: I can install my server or suspend my server. But as it is programmable, I could add anything I want, such as pausing my server. And I will define the set of tasks required to actually apply a behavior. For instance, to apply install, I will execute the green arrows, and in the green arrows, I will put the real code.
So, installing a package, pulling a Docker image, or anything. And here is the first level of parallelism, because I can trigger multiple tasks simultaneously if I consider, as a developer, that I can do, for instance, the file configuration at the same time as a docker pull. This internal net, as we call it, is a modeling of the lifecycle of my software entity. In addition to that, I have external interfaces, because this component, this lifecycle, like a role in Ansible, will later be connected or combined with other components. So what does that mean? I have two kinds of interfaces. First, use ports, with this semicircular notation: one means that this subpart of the lifecycle will have to use an external database, and another that this subpart of the lifecycle will have to use the IP address of an external database. You also have provide ports, represented by the smaller dark dots. It means that in this subpart of the lifecycle, in this case the running state, I provide the service of my server, and this service can then be used by another component. The second type of interface I've already talked about is the behavior concept, which is a way to hide the details of the internal net: instead of saying "you can do this task and this task simultaneously, then this task and this task", I just say to the outside that there is an installation procedure and a suspension procedure. Concerto is written in Python, and this is how it looks from a user perspective. The component developer, the developer of the piece of software or the piece of infrastructure, will define the set of places, identified by unique strings; an initial place, where to start if the component does not exist yet; a set of behaviors, so installation and suspension in this case; and a set of transitions. The transitions are the arrows, and a transition has a source place, a destination place, and a behavior.
A behavior is a color in the graphical notation. A transition also has a function, a callback function, in which you actually call the real code: installing things, pulling some Docker images, et cetera. You can even call an Ansible playbook in a callback if you want, or anything else. Finally, you also define the set of ports. We have two use ports in this example, and for each you give the group of places concerned by the port. And we have one provide port in this example, which is bound to the running state. Now we have the second part of the tool, which is how we assemble or compose those lifecycles, and this is typically done by the DevOps engineer. So it's a bit more like writing a playbook: writing a role is usually done in Ansible by developers, and writing playbooks is more or less the equivalent of the reconfiguration language in Concerto. The language is quite simple; you have six types of instructions. You can add or remove a component instance, you can connect or disconnect components, you can ask a component to apply a behavior, and you can wait for a component to finish applying its actions. I will give you two examples to illustrate how Concerto works in practice. The DevOps engineer will write this kind of program with the six instructions I've presented. For instance, here we have a program to deploy a server and a database from scratch. What the DevOps engineer writes is: add a server of type Server, add a database of type Database, then connect the server and the database by their compatible ports; in this case, two pairs of ports have to be connected. Then I say: server, I would like you to install, which is the green behavioral interface of the server, and this pushes the green behavior into a queue of behaviors at the level of the component. And then I push the deploy behavior into the database component.
It's also green in this example. These push-behavior instructions are non-blocking, so both components will try to apply their behaviors in a concurrent manner until their ports require some synchronization. The DevOps engineer simply writes this kind of program, and the rest that I will present, the execution, is of course automated by Concerto; you do not have to do anything as a user except write this program. What will happen is that each component will try to apply its green transitions simultaneously. If I have two transitions starting from a given place, then the two transitions are executed in parallel, and when two parallel branches merge, I have to wait for both branches to finish before reaching the next place. In this example, I cannot reach the running place for the server, because I need the database to be running to do that: this part of the lifecycle has a dependency on the database service. But as soon as the database reaches the running place, which also means that the database ends its deploy behavior, the port is activated, and so the server is able to continue and finally be deployed. So this is an example of automated coordination with Concerto. The idea is that we have a lot of parallelism, because we have concurrency between the execution of both lifecycles, but also internal parallelism because of parallel transitions. This is why we have such interesting execution times. Another example: now the database and the server are running, and I would like to maintain my database. As a DevOps engineer, I will push the database maintenance behavior, and then deploy again to go back to a running state. And for the server, I will need to suspend it and then deploy it again. So why do I need to suspend the server?
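Before answering that, the component definition and the deployment program just described can be sketched structurally in Python. This mirrors the shape of what the talk describes (places, transitions with callbacks, use/provide ports, and the six reconfiguration instructions), but every name and data shape here is invented for illustration; it is not the real Concerto API:

```python
# Structural sketch, NOT the real Concerto API.
def install_pkg():                   # callbacks hold the real actions
    print("apt-get install / docker pull ...")

def configure():
    print("writing configuration file ...")

server = {
    "places": ["uninstalled", "installed", "configured", "running"],
    "initial": "uninstalled",
    "behaviors": ["install", "suspend"],
    "transitions": [
        # (source, destination, behavior, callback)
        ("uninstalled", "installed", "install", install_pkg),
        ("uninstalled", "configured", "install", configure),  # parallel branch
        ("installed", "running", "install", lambda: print("starting server")),
    ],
    "ports": {
        "use": {"db_service": ["installed", "running"],
                "db_ip": ["configured"]},
        "provide": {"server_service": ["running"]},
    },
}

# The DevOps-side program from the talk, using the six instruction
# types, sketched here as a plain list of tuples:
program = [
    ("add", "server", "Server"),
    ("add", "db", "Database"),
    ("connect", ("server", "db_service"), ("db", "service")),
    ("connect", ("server", "db_ip"), ("db", "ip")),
    ("push_behavior", "server", "install"),   # non-blocking
    ("push_behavior", "db", "deploy"),        # both run concurrently
    ("wait", "db"),
]
```

The two transitions leaving `uninstalled` under the same behavior are the "parallel branch" point; in Concerto, a coordination engine would fire them concurrently and merge before `running`.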
It's because the server is using the database, and in this example, I cannot maintain the database while it is being used. So how is this executed by Concerto? The database will not be able to apply the blue behavior until its service port is no longer used. This is a kind of safety in the coordination. As soon as the server leaves the group of places that uses the database, the port is deactivated, the database is able to run the maintenance, and then we redo the deployment as I've shown you. So this was the part about Concerto, and now I will talk a bit about choreographies and how we made an evolution of Concerto to support them. But maybe you have questions before moving to the choreographic aspects? No questions? Okay. So if you remember, in the introduction I talked about the third motivation, which was to have decentralized infrastructure as code. Why would we like to have decentralized code for the coordination of the lifecycles? Of course, you have lots of advantages with a centralized approach, but you also have some limitations. One of them is that when you need to execute a plan, in Concerto for instance, or in any other language, you need to know the state of the system; in Concerto terms, you need to know where the tokens are in the system to coordinate the execution well. And sometimes building a full state of the system is impossible, difficult, or unwanted. For instance, it is unwanted when you have cross-organization DevOps: when you have multiple DevOps teams, each handling a subpart of the infrastructure, and you organize things that way precisely to avoid building a global state of your system, because it's too complex or too difficult to do.
You also have this problem when you have frequent network disconnections or frequent faults, which make it hard to keep a consistent view of your global state, or you may have difficulties building a global state because of the scale of the system. The other problem is that if you have central coordination of deployments and maintenance, then you have a single point of failure, meaning that if the central coordinator is unavailable, you cannot do anything, even a simple local action. In the literature, what is called a choreography is actually a central coordination program of a system describing the interactions between autonomous agents. It's like when you're dancing and you follow a choreography: the choreography has been written by a responsible entity and is written in a centralized fashion, but each participant of the choreography has what is called a local projection of the choreography, meaning the set of local actions to perform to respect the choreography. And of course, the decentralized execution of the local projections by all agents must be equivalent to the initial choreography specification. So this is what is called a choreography. We have worked on what we could consider a decentralized Concerto: a language that is very close to Concerto, actually an extension of Concerto, that makes local projections of a centralized Concerto program possible. Here is an example of a centralized Concerto program to update the master node of the Galera cluster of a MariaDB database. We have instructions: we interrupt the master, we update the master, we interrupt the worker and update the worker, we deploy the master again, and we wait for the interruption of the master to be applied before deploying the worker again. And in Concerto-D, what we have are local projections on each node. For the master: I have to interrupt the master, update the master, and deploy the master again.
And on the worker side: I have to interrupt, update, wait for the interruption of the master to be finished, and deploy the worker again. These are the local projections of this centralized program. The idea is that Concerto-D automatically handles the set of communications required to execute those local projections in a decentralized manner. We need communications in four different cases: when we have a connection or a disconnection operation between components hosted on different agents, when a behavior is hanging because of a wait instruction, and when a port is activated or deactivated. To give you an example, here is the same program as before. We have a master MariaDB component and a worker MariaDB component, and we are at the stage where the master has finished interrupt and update and is deploying again, so the master is in its running state. On reaching the running state, the master has to inform the worker that the service port is activated. So a message is automatically communicated by Concerto-D between the master node and the worker node, so that the worker node is informed that it can continue its deployment execution. So now I will move to the declarative choreographies that we achieve with the tool called Ballet. As a reminder, in many DevOps solutions we try to achieve a declarative approach, because we do not want the DevOps engineer to write plans, or this kind of program, which are error prone and very difficult to write when you have very large infrastructures or systems. So it means no human is responsible for writing the centralized Concerto program. But we want to go even further, because we would like to directly compute the local projections, written in Concerto-D, in a decentralized manner. In other words, we would like to have a decentralized planner.
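The local projections and automatic messages described above can be sketched as a small projection function. The instruction encoding here is a toy I made up, not Concerto-D's actual implementation: each instruction is tagged with the component it concerns, and a cross-node wait becomes a receive on the waiting node and a notification on the awaited node:

```python
# Toy sketch of choreography projection, not the real Concerto-D.
# Centralized program for the Galera master/worker update example:
central = [
    ("push", "master", "interrupt"),
    ("push", "master", "update"),
    ("push", "worker", "interrupt"),
    ("push", "worker", "update"),
    ("push", "master", "deploy"),
    ("wait", "worker", "master", "interrupt"),  # worker waits for master
    ("push", "worker", "deploy"),
]

def project(program, node):
    """Keep this node's instructions; turn cross-node waits into
    send/receive message placeholders."""
    local = []
    for ins in program:
        if ins[0] == "push" and ins[1] == node:
            local.append(ins)
        elif ins[0] == "wait":
            _, waiter, awaited, behavior = ins
            if waiter == node:
                local.append(("recv_done", awaited, behavior))
            elif awaited == node:
                local.append(("send_done", waiter, behavior))
    return local
```

Projecting `central` onto `"worker"` yields its pushes plus a `recv_done` before the final deploy, matching the worker-side projection described in the talk.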
That is, compute the plan in a decentralized manner to directly obtain the set of local projections written in Concerto-D. And why do we want a decentralized planner? For the same reasons as before: you have difficulties building a globally consistent state of the system or the infrastructure in some cases; you may want to avoid a single point of failure that may block the whole system when the planner is unavailable; and you also have scalability issues in the case of planning, because plan computation is an optimization problem: you have a lot of possible plans and you have to choose the best one according to some metrics. It is usually an NP-complete problem, and so it does not scale at all. Taking smaller decisions with a few additional communications is probably more efficient at scale. Okay, so this is the overview of Ballet. In this figure, I focus on the cross-functional or cross-geographical DevOps organization, but this could also be true for edge sites, for instance. The idea is that we have, for instance, two DevOps teams working on different subparts of the infrastructure, and each DevOps team can submit a maintenance goal, a maintenance target, to a front end. The idea is to avoid, hopefully entirely, human exchanges between the DevOps teams. To do that, we first need to build global knowledge about the sources of the maintenance, but this is a detail and I will not go into this component, because it's not very interesting from a scientific viewpoint. I also need a decentralized planner, meaning that if this DevOps team asks for a maintenance operation, this planner will be responsible for sending messages to the other planners to see whether other parts of the infrastructure are impacted by this maintenance and need to locally apply some actions. And finally, I use Concerto-D to have a decentralized execution of the overall program.
We've used a case study, which is a multi-site OpenStack deployment. We have a Galera cluster of MariaDB databases to handle the Keystone authentication service, and on each site we have three nodes: the worker node for the database part, a Nova node, and a Neutron node, the same for each site. From a developer viewpoint, it's the same as in Concerto: the developers of the pieces of software have to write the control components of their services. From a DevOps viewpoint, it's quite different, because now the DevOps engineer does not have to write a Concerto program, but can simply express what we call a goal. For instance, this is an example of a goal where I say: I would like the component MariaDB master to apply an update, and I would like, in the end, to have all my components running. The language is quite small, but you can express things like that on behaviors, on components, and on ports. So why is it difficult to automatically generate the set of Concerto-D programs, the local projections, from this kind of goal? Mainly because we need to introduce waiting instructions at the right places; it is actually a kind of scheduling problem in the literature. This is the same example as before, but without the waiting instruction on the worker side. In the worst case, without the waiting instruction, we can imagine that the worker executes the three behaviors locally before the master has made its update. In this case, the worker has deployed again, and the master is blocked, because it can no longer leave its current state, as its port is being used. This is why we need a complex solution to solve this problem. And of course, this complexity is due to the expressivity of Concerto itself: you do not have such complexity with a simple turn on, turn off lifecycle.
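The kind of goal described above can be sketched as a small data structure plus a check. The concrete syntax here is invented; Ballet's actual goal language differs in form, but the idea is the same: name the behaviors that must be applied, and the final state everything must reach:

```python
# Sketch of a Ballet-style goal, with invented syntax.
goal = {
    "behaviors": [("mariadb_master", "update")],   # what must be applied
    "final_places": {"_all_": "running"},          # where everything ends
}

def satisfied(goal, applied, places):
    """applied: set of (component, behavior) pairs that were executed;
    places: current place of each component."""
    ok_behaviors = all(b in applied for b in goal["behaviors"])
    want = goal["final_places"]["_all_"]
    return ok_behaviors and all(p == want for p in places.values())
```

For instance, the goal is satisfied once the master's update has run and both master and worker are back in `running`.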
To do that, we have a solution with different steps; I will not give you too many details. The first step is to compute a local solution for the component concerned by the initial goal request. For instance, for the MariaDB master, I compute that I need to interrupt, update, and deploy: to update, I need to interrupt first, then update, then deploy again to go back to my running state. By doing that, my ports will move from activated to deactivated, and so I need to send messages to the neighbor nodes saying: the components that are using the master's service must disconnect until the interruption of the master ends. A message expressing this kind of information is sent to the neighbors, so the MariaDB workers. The MariaDB workers will in turn reason: if the service is not offered anymore, then I need to interrupt too, and they send the same kind of information to their own neighbors, et cetera. So we have a kind of gossip protocol to propagate port deactivations, or any useful information, to the neighbors. In the end, when we reach the leaves of the tree, we have an acknowledgement phase, and finally, we take into account all the messages received to enrich the local resolution of the problem and do a final local resolution. We've made some experiments, of course, with one, two, five, and ten sites of OpenStack; all artifacts are available on Zenodo. We compared ourselves to an academic solution that has been implemented on top of Pulumi, called Muse, which also offers a decentralized declarative approach. In terms of execution time, we save 40% compared to Muse on the deployment of a multi-site OpenStack, and 25% on the update phase.
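Going back to the propagation mechanism described above, the gossip of port-deactivation constraints down the dependency tree can be sketched as follows. The graph, message payload, and function are all invented for illustration; Ballet's actual protocol (with its acknowledgement phase and constraint solving) is richer than this:

```python
# Toy sketch of constraint propagation, not Ballet's real protocol.
# Edges point from a service provider to the components using it.
users = {
    "mariadb_master": ["worker1", "worker2"],
    "worker1": ["nova1"],
    "worker2": [],
    "nova1": [],
}

def propagate(provider, messages=None):
    """Collect (sender, receiver, constraint) messages down the tree:
    each user of a deactivated service must disconnect, and in turn
    deactivates its own provided service for its users."""
    if messages is None:
        messages = []
    for user in users[provider]:
        messages.append((provider, user, "disconnect_until_interrupt_ends"))
        propagate(user, messages)
    return messages
```

Starting from the master, this produces one constraint message per dependency edge; in Ballet, each receiver would then solve its own local planning problem before answering.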
What is interesting is that the planning phase is actually negligible compared to the execution phase. As a reminder, both planning and execution are performed in a decentralized manner. To give you an idea of the work we have saved the different DevOps teams, here are the number of constraints that we inferred and sent to the other sites, and the number of instructions in the programs. For instance, for the update, which is the most important one, we inferred 200 constraints and 100 instructions, and we had to exchange about 80 messages between sites. As a conclusion: I've first presented Concerto, which is a coordination language for the deployment and reconfiguration of complex systems or infrastructures. Concerto makes the dependencies between interfaces of software entities explicit, and the reconfiguration language of Concerto makes it possible to modify the assembly of components and the behaviors of each component. Concerto offers automated and safe coordination. Concerto-D is a decentralized coordination language, the decentralized version of Concerto, and it offers a way to handle choreographed infrastructure as code, if I can say that, by projecting a centralized Concerto program onto different agents; the communications are handled automatically and safely by Concerto-D. Finally, Ballet is built on top of Concerto-D and offers a decentralized declarative approach: it offers a small language to express a reconfiguration goal, and a plan, that is, the set of Concerto-D programs, is computed in a decentralized manner. In all the contributions I've presented, the focus is on reducing the execution time while offering automatic and safe solutions. Of course, I've done all this with a lot of people, in no particular order; I will not name all of them, but of course I did not work alone.
Regarding ongoing work, we are trying to integrate some concepts and results of Concerto into existing infrastructure-as-code tools like Ansible and Terraform. But as it is already five o'clock, I will end my talk here. Thanks a lot for your attention. Thank you for the presentation. I think we have a question. Yes, from Tangfei, maybe. If you're speaking, you're still on mute. Or maybe you wanted to follow up. Yeah, maybe. Okay, I'm still learning that part of the tool. So we have a question from Guillaume. I cannot hear you very well. Yeah. Thank you. No, I can hear you. Can you- Hello? Yes. Okay, can you hear me now? Yes. Oh, yes, nice. Yes, I had a question; it was a very interesting talk, thank you very much. I think it could be very useful in Kubernetes operators. So do you have plans for mixing Concerto with Kubernetes operators that can manage any kind of application? Well, not really. I do not plan to apply what I've presented today to Kubernetes, because the Kubernetes ecosystem is a bit complex and also has a very different approach than Concerto, in the sense that it does not care at all about dependencies; it is more of a retry approach. But I'm working with Tangfei Han, who is present today, and other colleagues on the self-stabilization of Kubernetes controllers. It does not involve the results about Concerto that I've presented today, though. Okay, thank you. Do we have any other questions? I think Joland had a question before, and Tangfei also, but maybe they were trying to unmute. Ah, yeah, okay. I think it was really interesting, especially as you were taking the approach from a centralized configuration and moving it to decentralized. And I also really liked the data points in terms of the parallel operations. Have you also tested error scenarios, when something in the background is not working out? Good question. Unfortunately, no. So we have some kind of a switch in Concerto.
We can say, for instance, if this part is unavailable, then do this or do that. But we don't have, for instance, exception expressivity, and for now, we do not have an automated way of rolling back if some actions are not possible. This could be an extension of Concerto. For now, what we can do is stop the reconfiguration and start a new one to hopefully reach a previous state again, so a kind of rewinding of the actions. Yes. Awesome, thank you. If there are no more questions, then thank you for doing the presentation. I'm just going to check if there's anyone here who would like to stay to have conversations about edge-related items. I did see David Patterson, whom I mentioned at the very beginning, joining, but I think he had to drop in the meantime. He's part of the edge computing group. Yeah, of that working group I think we are all members. Oh, good, no worries. It was nice, thanks a lot for the invitation, and as it is recorded, hopefully some people will watch the video afterwards. No, of course, I'm really sad that Rob Hirschfeld couldn't make it today, because he and his company are working a lot in the area of infrastructure as code, so it would have been a great conversation, as he could share his experience in this area and compare it with what he's doing. But if needed, I can connect at a future working group meeting to have a discussion about this. That would be great. I will check with the group when people are available to do that. I will also share the recording once we drop from the call and it finishes processing; my laptop is about to melt down right now. And yeah, I will make it available to everyone. Hélène, I will send the link to you as well, so you have it handy if you would like to share it with others, and I'll let you know what the group's plans are. And I'm hoping that there will be an opportunity to have a further conversation about the talk. And okay, great.
All right, then thanks a lot. Bye, thank you for joining, have a good rest of your day. Bye-bye. Bye.