Hello everyone, thanks for coming and for your time. Today we are going to talk about Service Fabrik. You probably know about SAP Cloud Platform, which we released and took public a few weeks back. We are on Amazon now, by the end of this year we will also be on Azure, and we will be on GCP as well. Those are the places you will find SAP Cloud Platform, so go get your own trial there and experience what SAP is doing. Our cloud platform is completely based on Cloud Foundry. So get yourself a trial, try some Docker instances and see how it works. My name is Krishanum Kishwaz, I'm one of the engineering managers at SAP. And I'm Shashank, I work as an architect with the team here. Today our topic is Service Fabrik, and how we manage enterprise-grade services within SAP for Cloud Foundry applications. Moving on, some motivation. If you are building a Cloud Foundry application, you have invariably used one backing service or another: maybe a PostgreSQL, a MongoDB, a RabbitMQ, or a Redis. Every application needs some kind of service, stateful or stateless. The point is: how do you provision, manage and operate these services? Think about a developer who comes in and wants to quickly test his application, code it, try it out, and needs a PostgreSQL service immediately so he can get going, rather than going back to IT and asking for a full-fledged deployment of, say, a 5 TB PostgreSQL, which will take time. So the power now goes into the hands of the developer, yourself. As a developer, you can start your own dockerized PostgreSQL service and begin coding your application. That's where the experience starts for SAP Cloud Platform.
And then, when the developer is ready and sees that his application is working fine, he can go about creating a more robust PostgreSQL or MongoDB that is production grade, with all the production-level qualities like high availability, failover and so on. That's the cycle: you start as a developer and you immediately get the service that you need — and we are all talking about services here. The question you need to ask in your organization is: how fast can I provision such a service? How easy is it to provision? We are talking about cloud, not traditional IT, where you go to the IT department and say: I need this service, here is my specification, my department will be using it, these are the networks and IPs I need. Those days are gone. Now it's in your hands: you pull up a service, connect your application to it and start working. And with Service Fabrik we are trying to make exactly that pain point easier: if you are a Cloud Foundry application developer, you can quickly pull up your own instance and start working with it. Then comes manageability: how efficiently can you update, upgrade and monitor? Say you now have your own PostgreSQL instance running — something like five VMs, with high availability and failover mechanisms all built in. You will quickly realize that managing such a system becomes a nightmare. And that is just one system. What if you have hundreds of them? In your company there could be many departments, all running their own MongoDB, PostgreSQL and RabbitMQ clusters of three to five virtual machines each. How do you manage them? How do you update them? How do the operators operate them? If there is a stemcell update, how do you roll it out?
You quickly run into these kinds of problems, and then you start thinking: maybe there is a tool that could help me do this in an automated fashion. Service Fabrik is not alone, though. Service Fabrik is all about provisioning, managing and operating services, as I said — but without the services themselves, Service Fabrik is nothing. You need high-quality services as well. PostgreSQL, MongoDB — these services should have the qualities we talked about built in: resilience, failover, scale. The services should provide all of that, and they come as BOSH releases. Service Fabrik understands any kind of BOSH release: you give it a BOSH release, and it will deploy it for you. And all of a sudden you have a full-fledged service running for your organization. We talked about the experience, from development to production. And then multi-cloud — that's one thing that is important. These services, in the way they get spawned, have touch points with the underlying IaaS, and the way a cluster gets spawned on AWS does not work the same way on Azure. Service Fabrik takes care of this. All you need is Service Fabrik, and it will deploy services on multiple clouds the way you like: if it is GCP, it will spin up the same PostgreSQL cluster on GCP; it will do the same for Azure; you don't have to care. You just pick your cloud, and you are multi-cloud that way with Service Fabrik. So welcome to Service Fabrik — it's an open-source contribution from SAP. Start using it; our next repo update is coming up in August, and we would like to receive your feedback on it as well. With that, moving on. Service Fabrik means service factory: Fabrik, with a k, is the German word for factory. So Service Fabrik is a kind of service manufacturing machine.
It is creating service instances for you, so that applications can connect to and consume the services. It provisions, manages and operates service instances at scale for Cloud Foundry applications in an automated and managed way. Which means: when your application is up and running, you also need to know how the service is doing. What is the CPU usage right now? Is anything alarming? Is the memory usage fine? Is your PostgreSQL running out of disk space? You can monitor all of this, and if a metric goes beyond certain thresholds, you can raise alerts, so that your operators come to know that something bad has happened to the system — or is about to happen — and can act accordingly. If you are running out of disk space, for example, the operator can move you to the next plan, maybe from an 800 GB PostgreSQL plan to a 2 TB plan. So you can take action. All of this enables a seamless application development and deployment experience for an organization using services: from a single-node Docker service instance for development to a multi-node, highly available clustered setup for production, Service Fabrik is there. It also gives you integrated monitoring, alerting and logging, including audit logging, of course, which is very important: who is using your database, what actions are they taking on it? That needs to be logged for audit reasons, and it is built in. Then backup and restore. You now have all of these instances around your organization — some PostgreSQL running here, some MongoDB there. How do you take backups of all these instances? Some business department asked you for a MongoDB and you gave it to them. But what if something bad suddenly happens? How do you bring the system back?
Service Fabrik takes backups behind the scenes, and in an incremental way — it's a delta backup. It keeps taking snapshots, so you can go back to any point in time and restore your system if something bad happens to it. Update and upgrade we talked about: it's a continuous process. Security patches have to be applied; if a new stemcell update comes in, it needs to be applied, and Service Fabrik can do that on all the instances. And of course security is very important. So that's what Service Fabrik is. Any questions at this point on what Service Fabrik is capable of? This is what it does: it manages services. It not only provisions, it also manages and operates all the services. And this is how it looks. From your CF CLI, as a developer, you tell Service Fabrik: hey, give me a Docker instance — of PostgreSQL, say. Service Fabrik then talks to the Swarm manager, and Swarm places the service you asked for into one of the Docker VMs. Each of these Docker VMs has a lot of containers of different kinds of services running, and one of your development services is also running there. You connect to it, build your prototype or develop your application the way you want. For the other kind of service, when you are done with your development, you tell Service Fabrik: hey, Fabrik, now give me a production-grade PostgreSQL. Service Fabrik then talks to BOSH and tells it: spin up a PostgreSQL cluster for me. And that cluster — it depends on your BOSH release — could be five to seven VMs together, and it is highly available.
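The delta-backup idea just described — periodic snapshots you can roll back to any point in time — boils down to picking the right snapshot chain for a restore. Here is a minimal sketch of that selection logic; the data model (`full`/`delta` snapshots with timestamps) is hypothetical, since Service Fabrik actually delegates snapshotting to the IaaS layer:

```javascript
// Sketch: pick the snapshot chain needed to restore to a point in time.
// A 'full' snapshot starts a chain; 'delta' snapshots apply on top of
// the most recent full one. Names and shapes here are illustrative.
function restoreChain(snapshots, targetTime) {
  // Only snapshots taken at or before the target time are usable.
  const eligible = snapshots
    .filter((s) => s.time <= targetTime)
    .sort((a, b) => a.time - b.time);
  // Find the latest full snapshot; the deltas after it complete the chain.
  const lastFullIdx = eligible.map((s) => s.kind).lastIndexOf('full');
  if (lastFullIdx === -1) return null; // nothing to restore from
  return eligible.slice(lastFullIdx).map((s) => s.id);
}
```

A restore then replays that chain in order: the full snapshot first, then each delta on top.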
And with this particular BOSH release, we also inject an agent into the virtual machines, so Fabrik and the agent can talk to each other. Fabrik can ask the agent: how are you doing, is all good with you? The agent can also take certain metrics and events out of the virtual machine and send them to the central monitoring system. We have Grafana for this: the data goes to Riemann, Riemann stores it in InfluxDB, and from there Grafana pulls it up and displays it — we will see in a moment what the monitoring looks like. That's how the communication takes place. So you have the Service Fabrik broker, which talks to Swarm for Docker services and to BOSH for more production-grade services. And it is a matter of seconds or a minute until your service — the cluster — is up and running. You have a service now, and you connect to it. This is the big picture, and there you see the IaaS blob store. That is where all your snapshots, your backups, get stored. Service Fabrik keeps taking backups of your volume — the PostgreSQL data, whatever you have on the disk, has to be preserved — and this backup is stored somewhere. On Amazon we store it in S3; on OpenStack we store it in Swift. That's what that particular piece is for. So this is a very high-level picture. Let's get a little bit into the details of what it is doing and how. Shashank, maybe take us through that. — Thanks for the great introduction to Service Fabrik; I guess you are finding it interesting. Let's go over the architecture briefly, first of all in terms of components. I think everyone here is aware of what the Cloud Controller is and what the CF APIs are.
The reason we built Service Fabrik over the Cloud Controller, on the Service Broker v2 APIs, is primarily to have the lifecycle managed by the Cloud Controller. So there is not much difference in how CF manages the instances here. If you look at the call trace, it starts on the Cloud Foundry side on the extreme left: there is the Cloud Controller, and then the CF APIs, which invoke the broker. The broker is the Service Fabrik itself — a Node.js process, deployed as a BOSH release. This Service Fabrik broker can then talk to two systems: it can talk to BOSH via the HTTP APIs of the BOSH director, or it can talk to Docker Swarm, again over HTTP. As Krish explained, there are two scenarios: via Docker Swarm we address development-style, single-node Docker instances, and via BOSH the more productive usage, where you get a complete cluster with HA and all the service qualities. What you see here is that from the broker we trigger a lifecycle operation on BOSH: we deploy the manifest, and it creates instances of PostgreSQL, MongoDB or Redis, which are clusters in themselves, deployed across different virtual machines. All these systems can then talk to infrastructure components like Grafana and the ELK stack. Grafana is where an instance pushes all its monitoring information: what kind of events are happening, what the health is, how many connections are active at a certain point in time for, say, PostgreSQL. And then we also log into the audit log, the ELK stack — primarily the lifecycle, like when the process was started or shut down, and all kinds of DDL or DML statements if you configure that. On the Service Fabrik component itself there is also a monitoring agent, which captures the health of the Service Fabrik process itself — for example, are there too many connections on the Service Fabrik node?
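The routing just described — the broker deciding between Docker Swarm for development plans and the BOSH director for production plans — can be sketched roughly like this. The catalog entries and plan names are made up for illustration; they are not the real Service Fabrik catalog:

```javascript
// Illustrative plan catalog: dev plans back onto Swarm containers,
// production plans onto BOSH deployments. Names are hypothetical.
const catalog = {
  'postgres-dev': { backend: 'swarm', image: 'postgres:9.6' },
  'postgres-prod': { backend: 'bosh', release: 'postgres-release', vms: 5 },
};

// Mimics the decision behind the Service Broker v2 provision call
// (PUT /v2/service_instances/:instance_id).
function provision(instanceId, planName) {
  const plan = catalog[planName];
  if (!plan) return { status: 400, error: `unknown plan: ${planName}` };
  if (plan.backend === 'swarm') {
    // Would call the Swarm manager over HTTP to start a container.
    return { status: 202, backend: 'swarm', action: `run ${plan.image}` };
  }
  // Would generate a BOSH manifest and hand it to the director.
  return { status: 202, backend: 'bosh', action: `deploy ${plan.release}` };
}
```

The 202 status mirrors how broker provisioning is typically asynchronous: the Cloud Controller polls for completion while BOSH or Swarm does the actual work.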
Do we need to scale it? All this kind of data is also pushed to Grafana by the Service Fabrik monitoring agent. Compared to the previous slide, this is a more detailed view of how the components sit in the infrastructure. At the top you see the Cloud Foundry CLI, then the IaaS layer, where your elastic load balancers sit in the case of AWS, and then all kinds of object stores and volumes, and the compute part. On the left side, again, are the Cloud Controller and the Diego runtime — primarily the app's perspective and the CF runtime — which talk to the broker over the standard interfaces. The broker internally either pushes a BOSH deployment to the BOSH director or interfaces with Swarm for Docker containers. Now a little more detail about how Docker works here. At the very bottom, in the Docker bracket, you see the Docker Engine — the Docker daemon — and then the storage. We wrote our own volume driver, which abstracts the actual volume you get from EBS or Cinder. It's called the LVM driver, the logical volume driver. What we do is create a sparse file on top of the actual volume we get from Cinder or EBS and then carve out logical volumes for each container. That way we save on cost: as we said, this is all for development purposes, so the idea was to save costs rather than giving each individual container a proper volume of its own. — Does this also help with quota management? — Yes. The idea is that you enforce quotas on the disk, so one container can only take certain resources. Plain Docker primarily enforces two kinds of quotas, on CPU and memory, but not on the actual volume or disk storage. This is how we enforce that a container cannot exceed a certain usage of the disk itself.
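The bookkeeping side of that quota scheme — one backing volume carved into per-container logical volumes, with requests rejected once the capacity is exhausted — can be sketched as below. This models only the accounting, not the actual sparse-file or LVM operations:

```javascript
// Sketch of per-container disk quotas carved from one backing volume.
// Capacity and sizes in GB; real drivers would issue LVM calls here.
class VolumePool {
  constructor(capacityGb) {
    this.capacityGb = capacityGb;
    this.volumes = new Map(); // containerId -> size in GB
  }
  allocatedGb() {
    let sum = 0;
    for (const size of this.volumes.values()) sum += size;
    return sum;
  }
  carve(containerId, sizeGb) {
    if (this.allocatedGb() + sizeGb > this.capacityGb) {
      return false; // would exceed the backing volume's capacity
    }
    this.volumes.set(containerId, sizeGb);
    return true;
  }
}
```

Because the backing file is sparse, the IaaS volume only fills up as containers actually write data, which is where the cost saving for development instances comes from.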
On the service instance deployment you then see a couple of agents apart from the main process. Besides, say, a PostgreSQL instance running, you also have a BOSH agent and a service agent. The BOSH agent is provided by BOSH and automatically deployed on the VM. The service agent is what we package alongside: it's a co-located job, which means it runs on the same VM as your service process — PostgreSQL, Redis or MongoDB. What it primarily does is provide an interface to the Service Fabrik broker: the broker can query the agent over HTTP and check the status or health of the particular process, or trigger certain operations on it, like a backup or restore. The way we implemented it, the service agent exposes certain capabilities — it can do backup, it can do restore, or certain other operations. These capabilities are queried by the broker, the Node.js Service Fabrik process, which then invokes those operations on the specific nodes accordingly. Moreover, when the broker spins up a cluster of a service, it also schedules backups, telling the agent to keep taking backups at a certain time interval. We have detailed slides on that. On the same slide, at the extreme bottom right, you have the jump box, which is our bastion. It allows the ops people to get into the virtual machines. It's an isolation mechanism: not everyone can get into a virtual machine; entry is allowed only via the jump box. Good. This is a simple sequence diagram of how service creation works. I tried to keep it simple, but there are many complexities being handled behind the scenes, like the creation of security groups. But in principle, from the CF CLI you trigger the creation of an instance.
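The capability model just described — the broker queries each agent for what it can do and invokes an operation only on nodes advertising it — reduces to a simple filter. The agent record shape is hypothetical; the real agents answer these queries over HTTP:

```javascript
// Sketch: given agent capability reports, find the nodes that can run
// a given operation — e.g. only the Postgres master advertises 'backup'.
function nodesWithCapability(agents, capability) {
  return agents
    .filter((a) => a.capabilities.includes(capability))
    .map((a) => a.node);
}
```

The broker would then make the HTTP call (backup, restore, and so on) only to the returned nodes.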
The call goes to the Cloud Controller, which invokes the service broker. In the BOSH case, the broker primarily generates a manifest — a manifest that BOSH can understand. It also injects some network configuration, because we control how many IPs a deployment gets at any point in time. Once the manifest is generated, we call the BOSH director over the HTTP API and trigger the deployment. I think Krish already touched upon monitoring and logging. This is specific to a service instance: what capabilities do we need from a service instance to say it has good SLAs? It needs to report its health — not just the health of the VM in terms of CPU, memory and disk, but also how the specific type of process is behaving. Take the PostgreSQL example: the active connections, the WAL logs, the specifics of a PostgreSQL process. Or Redis, or RabbitMQ, say: how does the queue size look, are the queues exhausted? All these kinds of metrics can be exposed and pushed to Grafana. You go on; in the meantime I'll connect to the monitoring. Okay. Similarly, we have the ELK stack where we capture the logs. We use the rsyslog daemon as the initial ingestion point on the VM: all logs are captured by rsyslog and then pushed to the ELK stack, where you can look at them in Kibana. There is also an integration from our side: say you triggered some operation and it failed. That event is registered in Grafana, but from Grafana you can go to ELK and establish the traceability of what actually happened via the logs.
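The manifest-generation step — filling a deployment template and injecting network configuration from an IP pool the broker controls — can be sketched as follows. The field names loosely follow BOSH manifest conventions, but the plan shape and the IP pool are illustrative assumptions:

```javascript
// Sketch: generate a BOSH-style deployment manifest for a plan,
// reserving static IPs from a pool the broker controls.
function generateManifest(instanceId, plan, ipPool) {
  if (ipPool.length < plan.vms) {
    throw new Error('not enough IPs available for this deployment');
  }
  const staticIps = ipPool.slice(0, plan.vms); // reserve from the pool
  return {
    name: `service-${instanceId}`,
    releases: [{ name: plan.release }],
    networks: [{ name: 'service-network', static_ips: staticIps }],
    instance_groups: [{ name: plan.jobName, instances: plan.vms }],
  };
}
```

The resulting document is what the broker would then POST to the BOSH director's HTTP API to trigger the deployment.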
This again is the backup and restore process, where we said there is an agent sitting on the particular VM, exposing its capabilities. Take the PostgreSQL example: there is a master and a slave node, but we only want the master to take the backup, to do the snapshotting. So only the master exposes the capability that it can do the backup operation. When Service Fabrik queries which of the nodes can do a backup, only the master reports back, and then you can invoke that operation over an HTTP call and do the snapshotting. For the snapshots we depict here, we currently use the infrastructure layer rather than service specifics to do the backup. We could have done it via pg_dump or similar, but we rely on the snapshot mechanism of the IaaS layer, in the case of both OpenStack and AWS. I think this is the most critical part: say you want to use Service Fabrik and you have a certain set of services — how do you onboard a service? Currently there is a defined process for us. We have a lifecycle tool, as you see on the left side, and we use a manifest for Service Fabrik which internally uses two files. I'll explain the purpose of these two files: one is an EJS file and one is an ERB file. We parse the EJS file, evaluate the ERB file, and then inject the template into the actual Service Fabrik manifest. What this means is that if you want to bring in your own service, you need the upper three components: a BOSH release, an EJS file and an ERB file. The ERB file is where you define the plans of your service — say, database plans: this is the memory, this is the CPU you need for a specific plan. All the plans are defined in the ERB file, and in the EJS file you define the manifest template that BOSH can accept.
How many VMs do you want? What is the job definition for your release? All these things are captured in the EJS file. Finally, Service Fabrik creates one configuration file out of all the services it needs to take care of: it parses the plans and injects the template for the specific service into the properties. When you then invoke it via the Cloud Controller, Service Fabrik reads that configuration file and triggers the manifest generation for the specific service. Coming back to the advantages we get by using Service Fabrik. As I said, we use the complete Cloud Foundry programming model here; we don't invent any APIs. You always go via the Cloud Controller APIs — cf create-service, for example; maybe we'll show you in the demo. Second, we tried to keep this component as stateless as possible, so that we can scale it horizontally. Whatever state we need, we keep either in BOSH or, in the Docker case, in Swarm. We query the state rather than maintaining it ourselves in a database or something. Then we have done some work around the agent framework. As I mentioned, it provides a pluggable agent framework. What does that mean? The thinking behind it is that any agent you deploy has three operations, three capabilities: it has to collect data from a source, it has to process that data, and then it has to dispatch it. Take the example of collecting statistics from a VM: we have written a health collector which constantly checks how the VM is behaving. Then we process — not all data has to be dispatched to Grafana or any other monitoring stack, so we drop certain data and dispatch only what is needed. If some aggregation needs to be done, we do it right there at this level.
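The three-stage agent framework just described — collect, process, dispatch — can be sketched as a small pipeline. The stage wiring and the sample health collector are illustrative; the real agents push their output toward Riemann and Grafana:

```javascript
// Sketch of the pluggable agent pipeline: collect from a source,
// process (drop/aggregate), then dispatch to the monitoring stack.
function makeAgent(collect, transform, dispatch) {
  return function runOnce() {
    const raw = collect();            // 1. collect (e.g. VM statistics)
    const filtered = transform(raw);  // 2. drop/aggregate what isn't needed
    return dispatch(filtered);        // 3. dispatch what remains
  };
}

// Example: a health collector that only dispatches alarming samples.
const samples = [{ metric: 'cpu', value: 12 }, { metric: 'cpu', value: 97 }];
const sent = [];
const agent = makeAgent(
  () => samples,
  (data) => data.filter((d) => d.value > 90),
  (data) => { sent.push(...data); return data.length; }
);
```

Swapping any of the three functions yields a different agent, which is what makes the framework pluggable.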
Okay, for the sake of time, let's take a look at some of the observability. As we told you, there are systems running and you want to know what's happening inside them — the load and all that. Let me pull that up. This is the dashboard. On the left-hand side there is the Swarm manager — I'm running this from our staging landscape, so you'll see live what is happening. It shows the Swarm manager availability and the broker availability: the broker is up and Swarm is up, that's what it says. On the backup side, the backup stats say that 15 backups succeeded in the last hour, I think, and two backups failed. You can now go into the Kibana system and check what happened at that point in time, because everything is logged: you can see why those backups failed and what you can do about it. That's the traceability part here. So that was the backup stats. Next is PostgreSQL and MongoDB, again part of backup. It says around 500 backups have been taken for PostgreSQL so far, and for MongoDB around 60. And backups are important — they are your last line of defense. Before you lose your system, you have to ensure that at least backups have been taken; this is something you should really be worrying about. Broker system health shows memory usage and CPU usage in percent — how the system is doing. If you look at the Docker part, you see there are currently two Docker nodes, Docker VMs, on this particular landscape, and around 89 containers running.
So this is the aggregated view of how many containers and nodes you are running. Now, if you want to go one level down — what you just saw was an aggregation — you can look at particular instances. Each green box here is one cluster; these are all dedicated PostgreSQL clusters running inside. You see that these clusters are up and in good health, which gives you a good feeling. But of course you can drill down one more level: you can get into a cluster and see what's happening within it. This now shows, for this particular PostgreSQL cluster, the disk usage and the memory usage. The usage is pretty low here, but you can see a lot of vital metrics coming up: the database block hits, the number of concurrent active DB connections on this particular database. So it gives you a lot of visibility, from a very high aggregated level down to a particular instance. And of course you can set your thresholds, so that if a metric crosses a threshold, it alerts you. — Yes, and it supports stream processing on the Grafana side, or rather the Riemann side, where you can define Clojure scripts to trigger alerts. — Right, I think we are running short of time now. But when you have this, you can of course go ahead and try it out. This is our repo; please have a look. Maybe you can just open the templates once... Which template? The main Service Fabrik template. Yeah. This is what I was talking about, the onboarding part. This is the grand template we have here.
It gets parsed, and we will open the EJS and ERB files for the services and show them. This is what is responsible for creating the big configuration for Service Fabrik. If you go into the services directory and take, say, the PostgreSQL EJS and ERB files, and open the ERB file: this is where all the metadata about your plans and other things comes into play. How do you define the container plans? How do you define the managed plans, the dedicated instance plans? What Fabrik finally does is parse the bigger manifest template, and all this information becomes part of one configuration file. Then, when someone makes a request to create an instance of a specific type of service with a specific plan, it reads that configuration — what template it needs, what configuration it needs — and accordingly generates the BOSH manifest, or pulls the Docker image and creates a container. So the point is: if you have your own service, a BOSH service that you want to get deployed on your landscape, bring it in — Service Fabrik can deploy it for you and monitor and manage it for you. You need three files: an ERB file, an EJS file and your BOSH release. That's all. Our next step is to make this onboarding even easier, and we also want to bring in out-of-band BOSH deployments, which means that if you already have a BOSH deployment and you want to bring it within the ambit of Service Fabrik, so that you get all its advantages, you can do that too. Then BOSH 2.0 compatibility, and service bundles, so that multiple services can talk to each other.
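The onboarding composition just described — plan definitions from the ERB file and a manifest template from the EJS file merged into one broker configuration per service — can be sketched like this. File contents are modeled as plain objects for illustration; the real inputs are evaluated templates:

```javascript
// Sketch: merge a service's plan definitions (ERB side) and manifest
// template (EJS side) into one configuration entry for the broker.
function composeServiceConfig(serviceName, plans, manifestTemplate) {
  return {
    service: serviceName,
    plans: plans.map((p) => ({ name: p.name, memory: p.memory, cpu: p.cpu })),
    template: manifestTemplate, // later used to generate the BOSH manifest
  };
}
```

At provision time the broker looks up this entry by service and plan, then renders the template into a concrete deployment manifest.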
Right. The idea there is: today the model of Cloud Foundry is app-to-service binding, but it misses the notion of service-to-service binding. If you deploy a certain service and it wants to consume a managed service, how do you do that? Service bundles are along the lines of that idea. Okay, with that we come to the last slide: try SAP Cloud Platform. You get certain Docker services for free, and you get your own org and space there. Build your own application against the services and have fun. For Service Fabrik you can go to our service-fabrik repo, an SAP repo, and have a look. You can also learn more about SAP's contribution to Cloud Foundry at that link. The slides will of course be uploaded, so you will get access to all these links; they are all public. SAP is doing a lot for Cloud Foundry today — a lot of contribution is happening, including contributions to core Cloud Foundry itself, but beyond that there is Service Fabrik and many more projects, like Abacus, that SAP is contributing to. Take a look and stay tuned. Our next repo update is going to happen on the 21st, and from there on I think we will work together to build a better Service Fabrik. Thank you. Thanks for your time.