Thank you all for coming. My name is Abhisar Chhulya. Today we're going to talk about how we scaled our public cloud, what we learned from the past, why we changed to fabric networking, and what the benefits are. I'd also like to mention my co-authors, Mr. Chan Sinchin Prasad and Mr. Jirapat Supapinan. Both of them couldn't make it because of visa problems, so I'm here alone, but they're with me on a call. If you have any questions, I'm sure I can relay them and answer you. This is the last talk of the conference, so we don't have to rush; we'll go with the flow.

First, before we tell you the problem, let me give you the agenda. Today we're going to talk a little about what NIPA Cloud has done in the past and what kind of public cloud services we provide. Then we'll cover what happened when we scaled the old public cloud, the bottlenecks in the underlay and overlay networks, why we switched to Tungsten Fabric, and the benefits and features behind it. Later on, we'll show how those features let us build multiple availability zones, and the complete architecture of OpenStack and Tungsten Fabric and how we implemented it.

A little about NIPA Cloud. We were established in 1996 in Thailand, where we have our headquarters. We have about 135 people, two availability sites, and a 24x7 NOC. Our mission is to provide limitless cloud computing and storage services with OpenStack. We hold several ISO certifications: ISO 27001, ISO 20000-1, and ISO 29110. We partner with Juniper Networks today.
I'd also like to give credit to Juniper for implementing Tungsten Fabric together with us.

Before we move on, a little history. We started providing a public cloud with OpenStack in 2016, on the Ocata release. In 2019 we launched a private cloud, with customers in banking and telecom. Then in 2022, this year, we launched a new cluster we call NIPA Cloud Space, built on fabric networking and the Victoria release. The old public cloud we implemented on Ocata is still there; we still provide the service, and we upgraded it until the end of the Ocata lifecycle.

When we tried to scale the old public cloud, problems appeared. We never got anywhere near 100 racks; after 10 or 20 racks you already start to see problems. How are you going to scale compute nodes across racks, including the Ceph storage? We ran into underlay network bottlenecks and overlay network performance problems. Those were the big question marks for us: if we wanted to scale, these were the problems we had to solve, and how were we going to solve them?

Multi-AZ is also something we had to consider, because without Tungsten Fabric and fabric networking we could not implement multiple availability zones. Without it, another cluster at a different data center would have to be called a region. But a region is supposed to be outside the country, not within Thailand; within Thailand it should be an availability zone. That's the way we look at it.

So, when we started out, this was the layout of our network and racks. Here's the issue.
When you scale toward 100 racks with this design, the east-west traffic runs into a large broadcast domain. You're handling a lot of broadcast, unknown unicast, and multicast traffic. With all those domain announcements, if some node goes down and you reinstall and restart it, all of that information has to be re-announced across every rack and every switch. It becomes a burden on the switches: it reduces performance and increases latency. When we started out we wanted an SLA of 99.99, but with these problems we could not offer 99.99; the best we could do was 99.9.

Another issue is centralized routing. As you can see, that was a big problem for us: when you scale, every rack has to connect through that one aggregation point. When you have a problem and restart, convergence takes a long time, and with spanning tree you get looping problems that force restarts. That's how we kept running into trouble, and we had to look for a solution.

There's also the protocol behavior of this kind of setup: when you add a new node, its ARP has to be re-announced everywhere, which was a big problem for us. We were using a virtual chassis setup, what we'd call MLAG, as the technology there. And another limitation of the whole cluster is the VLAN ceiling of 4,096 IDs. That's a hard limit if you want to scale.

So, to scale beyond 100 racks, within the same availability zone, the first thing we needed was to transform the network architecture. We changed to a Clos topology.
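To put a number on that VLAN ceiling: a VLAN ID is a 12-bit field, while the VXLAN network identifier (VNI) used in the new fabric is 24 bits. A quick illustrative calculation:

```python
# VLAN IDs are 12-bit, VXLAN VNIs are 24-bit: moving to a VXLAN
# overlay lifts the tenant-segment ceiling by a factor of 4096.
vlan_ids = 2 ** 12    # 4096 (a few IDs are reserved in practice)
vxlan_vnis = 2 ** 24  # 16,777,216

print(f"VLANs:  {vlan_ids}")
print(f"VNIs:   {vxlan_vnis}")
print(f"factor: {vxlan_vnis // vlan_ids}x")
```

With 4,096 VLANs shared across a whole cluster, tenant networks run out quickly; the 24-bit VNI space removes that constraint.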
You have to understand spine and leaf: with a Clos fabric you can scale one switch at a time. In the previous design, you had to scale two at a time because of the MLAG pairs, and you couldn't go beyond eight, because everything was centralized and it's hard to expand after that. That's probably fine if you're building a private cloud, but for a public cloud you need to look further at what network architecture you're going to use. We went with the Clos topology, spine and leaf; I'll talk a little about it.

We also changed the communication between overlay and underlay: we switched to EVPN with VXLAN, the protocol for the IP fabric. And instead of a single gateway, we use an anycast gateway, also called a distributed gateway. This gives us a way to balance load with ECMP, equal-cost multipath, and it gives us high availability as well, which is what we were looking for. Of course, we want an enterprise public cloud; we want to achieve an SLA of 99.99, two nines after the point, not just one.

So why do you need Clos spine-and-leaf? Because it allows you to scale one switch at a time; you don't have to scale two or three at a time, and you don't have the problems MLAG presented. Every link between spine and leaf is active-active, not active-passive where one link sits unused, and there's no blocking from spanning tree; you don't have to worry about that. It allows multi-tenancy, supported through virtual routing and forwarding (VRF); that's the beauty of it, you get more tenants in there. And like I mentioned before, the gateway is distributed: an anycast gateway on any rack in the same AZ. So we can scale both spine and leaf with this Clos topology. Now, this is the new design; this is the update.
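The ECMP load balancing mentioned above works roughly like this: each flow's 5-tuple is hashed, and the hash picks one of the equal-cost uplinks, so a single flow always takes the same path while many flows spread across all active links. A simplified sketch; the real hashing is done in switch hardware, and the field names here are only illustrative:

```python
import hashlib

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto, num_links):
    """Pick an uplink for a flow by hashing its 5-tuple (illustrative)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# The same flow always maps to the same link (no packet reordering)...
a = ecmp_pick("10.0.0.5", "10.0.1.9", 40000, 443, "tcp", 4)
b = ecmp_pick("10.0.0.5", "10.0.1.9", 40000, 443, "tcp", 4)
assert a == b

# ...while many flows spread across the equal-cost links.
links = {ecmp_pick("10.0.0.5", "10.0.1.9", p, 443, "tcp", 4)
         for p in range(40000, 40256)}
print(sorted(links))  # typically all of [0, 1, 2, 3]
```

This is why every spine-leaf link can be active at once: load is shared per flow rather than one link sitting idle as a standby.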
As you can see, I've changed things a little here. Before, L2 was right here; now the underlay is L3 everywhere, running EVPN. You can see this is a spine and this is a leaf, and right here is the anycast gateway on each leaf. Connection-wise, one, two, three, it's a different way of connecting: you don't rely on a central pair, you can just expand one here, one here. And if anyone wants to go out to the internet, you hook it up to the L3 gateway above.

For north-south it's new as well: it's all BGP. BGP for the overlay, BGP for the underlay, and the connection to the outside is also BGP. We all know that when we go out on the internet to talk to everybody else, we use BGP, and that's the beauty now: the whole thing is BGP.

Just to give you a comparison here, this is a really big difference: this side is Layer 3 end to end, that side is Layer 2. The strength is that it gives you flexibility: it's standards-based with multi-vendor support, which means no vendor lock-in. That's a good thing about it. It's multi-homed across links, so there's no blocking and no spanning tree, and you get equal-cost multipath. Operationally it's a very simplified technology and configuration.

There are some cons along with the pros. Tungsten Fabric is perceived as very complex, and I think that's the hard part; we spent about two years trying to understand it and make it work. Another thing is that FCoE (Fibre Channel over Ethernet) and FCIP (Fibre Channel over IP) are not supported on this kind of network technology. And as you scale, you have more devices to manage, which is another real cost. But every time you add or restart a node, the announcement stays within its own rack; it doesn't have to announce "hey, I'm a new guy here" to the whole fabric.
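The VXLAN encapsulation behind that comparison is simple: an 8-byte header carrying the 24-bit VNI, transported over UDP (port 4789), which is what lets the pure L3 fabric carry tenant L2 frames. A sketch of the header layout per RFC 7348:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header (RFC 7348): flags, VNI, reserved bits."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI is a 24-bit field")
    flags = 0x08000000          # I-flag set: a valid VNI is present
    return struct.pack("!II", flags, vni << 8)

hdr = vxlan_header(vni=10001)
print(hdr.hex())  # 0800000000271100
```

The tenant's Ethernet frame follows this header, so the spine-leaf underlay only ever routes IP/UDP between VTEPs.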
Everybody that needs to know, knows, but it stays within the rack. In the old design, if you lost a compute node and restarted or replaced it, it would tell everyone, broadcasting over and over again.

Now, the remaining issues with the Neutron overlay network. In the past, with the old public cloud, we used Neutron; we didn't have Tungsten Fabric yet, only OpenStack, Ceph storage, and Neutron. But with that we still had problems. It doesn't support advanced ACLs, which is something we wanted; we want to be able to do access control lists. And it doesn't support L3 routing for floating IPs. The old way uses Open vSwitch, and there's an issue with that: ports fail, and port failures happen a lot, and what you do is restart. When you restart, you get the large domain announcement problem again, which takes longer: one minute, two minutes, five minutes, ten minutes. If that's the case, how are you going to sustain an SLA of 99.9? Customers complain, "oh gee, you're restarting again, I have to wait." Those are the problems you run into, and we experienced them a lot, because these are operational issues. Maybe it works fine at the beginning; in the first year it works fine. But as we moved along and expanded, people asked, "what happened, why did it restart again?" and we couldn't answer that. It became a big issue for us.

Also, there are many agents, so the Neutron server gets overloaded, and there's a big problem with RabbitMQ. In our old public cloud RabbitMQ worked so hard, with so much traffic, that it was overloaded. Sometimes everything slowed down and we were waiting too long.
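Those multi-minute restarts matter because an SLA is really a downtime budget. A quick calculation of how much downtime per year each SLA level allows:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(sla_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given SLA percentage."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)

for sla in (99.9, 99.99):
    print(f"SLA {sla}% -> {downtime_budget_minutes(sla):.1f} min/year")
# 99.9% allows roughly 525.6 minutes (about 8.8 hours) a year;
# 99.99% allows roughly 52.6 minutes, so a handful of 5-10 minute
# restarts already burns through the whole budget.
```

This is why agent restarts that take five or ten minutes make four nines impossible on the old design.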
RabbitMQ is just trying to distribute the queues to everybody, but it gets overloaded, and that slows the whole thing down. That's the big problem. The last issue is no support for integration with a physical router: you cannot peer directly with the physical router, and that's another problem we were facing. We wanted to change all of this.

So that's why we brought in Tungsten Fabric. It integrates seamlessly with OpenStack. If you know a little of the history of the people who worked on this: I talked to Mark Collier and Jonathan Bryce, and they recognized who this was, though I don't even know him personally. The technology moved around; I think it started off at one company, Cisco or Juniper, then they rewrote it, and it ended up as Tungsten Fabric at the Linux Foundation.

The reason we tell this whole story is that it works well: we've already launched it to production. Everything I'm telling you today is already in production, and you can visit our website, portal.nipa.cloud, and try it out. See how it goes; we have a beautiful portal for you to try.

One thing about why we picked Juniper for the switch and router hardware: Juniper allowed us to put the open source onto their hardware. They used to call it Contrail, or OpenContrail; now I think they've changed the name to something else, I forget. But they allowed us to do that, so we felt this was a good way to have someone guarantee the hardware. I could have gone with white boxes, but with a white box nobody really guarantees the hardware, and putting open source on a white box felt a little too risky. So I said, let's go with Juniper; they already had experience with Contrail.
Basically Contrail and Tungsten Fabric are pretty much the same; one you pay a license for, the other you don't, but you have to work it out and install it yourself. I think they really helped us a lot here.

So, it's vRouter-based between compute nodes. It's simple, it works with orchestration platforms such as Kubernetes, and there's no vendor lock-in. It supports deployment with multiple data-path options: kernel mode, DPDK, SR-IOV, and SmartNIC. It's highly scalable with high performance; you have to try it to believe it. It has a rich feature set, and these are some of the features I find very interesting. Advanced ACLs: in the old networking, Neutron security groups only let you allow traffic; you can't write an explicit deny rule. Now we can do both allow and deny. We can do service groups and address groups as objects in policy rules. We can do service chaining, both transparent routing and NAT. There's resilient EVPN with load balancing via ECMP, equal-cost multipath. And it supports integration with physical routers for DCI, so you can peer straight into the router, and it supports multi-cloud and multi-AZ. That's the good thing about it.

Here's the overall Tungsten Fabric architecture, the architecture we are using right now. You can start out the same way: in the first step you can do around 20 racks, and I think you can begin with eight racks. The network behind it is 100 Gbit, 25 Gbit times four; for the overlay it's all 100 Gbit.

Here's the multi-AZ architecture of OpenStack and Tungsten Fabric.
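Before the architecture details, the allow/deny difference above can be sketched as a first-match rule evaluator. This is only an illustrative model, not the Tungsten Fabric API; the rule format here is invented for the example:

```python
# Illustrative first-match ACL: unlike allow-only security groups,
# a policy can contain an explicit "deny" followed by a broader allow.
RULES = [
    {"action": "deny",  "proto": "tcp", "port": 23},   # block telnet
    {"action": "allow", "proto": "tcp", "port": None}, # allow all other TCP
]

def evaluate(proto: str, port: int, rules=RULES) -> str:
    """Return the action of the first matching rule; default deny."""
    for r in rules:
        if r["proto"] == proto and r["port"] in (None, port):
            return r["action"]
    return "deny"

print(evaluate("tcp", 23))   # deny
print(evaluate("tcp", 443))  # allow
print(evaluate("udp", 53))   # deny (no matching rule)
```

With allow-only security groups, the "block telnet but allow other TCP" policy above cannot be expressed directly; a first-match allow/deny list makes it one line.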
Now we're getting into the details. The key is that you have what I'll call the global zone; it doesn't belong to either AZ. So this is AZ 1, AZ 2, and the global zone. In the global zone sit Neutron, Keystone, Horizon, Glance, and Nova; from there they talk to everyone, and you can see how everything connects. All of this is what we call the global zone: centralized control for management. We need three sites to be active, three sites to get high availability. The global site can be anywhere, as long as it's not in these two AZs, because if one site goes down we don't want to lose the other two with it; with three sites, one can go down and the two remaining still keep things running. That's why you have the extra site.

Then the edge zone; we call this one the edge zone, right here. It has controller, compute, and storage all in one: a controller and a Ceph storage cluster, with high availability and no dependency on the other AZs. Each AZ runs its own OpenStack services: Nova, Cinder (volume), Glance, and networking, which is Neutron with Tungsten Fabric. You'd think we use Neutron, but actually we don't really use it for much; we need it because OpenStack and the Ceph storage talk to Neutron, so we use Neutron almost like a proxy: it passes through to Tungsten Fabric. But you still need it.

Then let's look at Nova cells first. We install Nova cells so we can separate RabbitMQ to support the compute nodes in each AZ. You can see the Nova conductors, and each RabbitMQ separated out, so the load is lower to begin with. And secondly, RabbitMQ doesn't overload the server, because the networking traffic all goes to Tungsten Fabric instead.
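The "three sites" requirement for the global zone is plain quorum arithmetic: a majority of control-plane members must stay up, so three sites tolerate the loss of one. A small sketch:

```python
def quorum(n_sites: int) -> int:
    """Smallest majority of n control-plane sites."""
    return n_sites // 2 + 1

def tolerated_failures(n_sites: int) -> int:
    """How many sites can fail while a majority still survives."""
    return n_sites - quorum(n_sites)

for n in (1, 2, 3, 5):
    print(f"{n} sites: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# With 2 sites you tolerate zero failures, which is why the global
# zone needs a third site outside both AZs.
```

Placing the third site outside both AZs means losing a whole AZ still leaves a two-of-three majority for the management plane.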
And you'll see later that all of Tungsten Fabric runs on this. We've built something that makes networking easier and simpler while still integrating nicely with OpenStack. We deployed a Nova conductor and RabbitMQ for each AZ, plus a local RabbitMQ for scaling compute, because with one per site it scales more easily. Horizon, the API, and the CLI all let you choose which AZ to create resources in, whichever way you prefer.

Next is Cinder. We deploy cinder-volume; Cinder, as we all know, manages volumes on the Ceph storage, and each cinder-volume is responsible for managing its local Ceph cluster. We use volumes instead of local disks. In the old public cloud we had local disks inside the compute nodes, so when you picked a flavor I had to tell you: you pick two cores, four gigs, and 80 gigs of storage. In the new one you can specify anything you want, because we use volumes instead: you take two cores and four gigs, then pick a size like 20, it's up to you, and attach to it. This makes life easier, because if the VM goes down you can just replace the VM, the CPU and RAM, and attach the volume back. There's no local disk, and it's easier for us to expand the volume itself.

We also do volume snapshots for on-site backup. There's a link here between the two AZs, and we try to minimize its use. We want to make sure that if you work in one site, one AZ, you do everything within that AZ, to save the link. If you want to go across AZs, there's some cost attached. But if you don't really need it and just want backups, you can do it within your own site. You can also upload an image and replicate it using the RADOS Gateway. If you want to go across sites, we decided you have to do it through an image; that has to go through Glance.
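The operational win of booting from a volume instead of a local disk can be modeled as a tiny simulation: when the VM dies, only the compute side is replaced, and the volume, with its data, is reattached. Purely illustrative; the class names are invented for the sketch, not OpenStack objects:

```python
from dataclasses import dataclass, field

@dataclass
class Volume:
    """Survives independently of any VM (backed by Ceph in our case)."""
    size_gb: int
    data: list = field(default_factory=list)

@dataclass
class VM:
    cpus: int
    ram_gb: int
    root_volume: Volume  # no local disk: the root is a Cinder-style volume

# Create a VM with a 20 GB root volume and write some data.
vol = Volume(size_gb=20)
vm = VM(cpus=2, ram_gb=4, root_volume=vol)
vm.root_volume.data.append("customer database")

# The VM host dies: replace only the CPU/RAM, reattach the same volume.
vm = VM(cpus=2, ram_gb=4, root_volume=vol)
print(vm.root_volume.data)  # ['customer database']
```

Because the flavor no longer bundles a fixed disk, customers size storage independently and recovery never touches the data.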
We'll talk about that next. That's the way we designed it: we want to save you cost, and if you don't really need it, you don't send traffic across AZs.

Then look at Glance here. We use the Ceph RADOS Gateway as the storage backend, with the Swift API, and multi-site replication of the RADOS Gateway; you can see the two sides of Glance coming through and talking to each other. One thing we decided to use is the Cinder image cache, something we want for saving bandwidth: a feature that speeds up volume creation by caching images that have been used in that AZ before. For example, you download an image for the first time; once it's on this side, the next user who comes in and needs it gets it from the cache instead of downloading it again. That's how we designed the cluster to have the least possible traffic between the two sites. So Glance is also very important here, together with cinder-volume.

And the last piece is Tungsten Fabric, which is the networking. Neutron, like I said before, runs with a Tungsten Fabric service plugin that proxies API requests to the Tungsten Fabric controller. In the old design you had to deploy a Neutron agent onto every node; instead, we deploy the Tungsten Fabric vRouter in place of the Neutron agent, and the vRouter connects to the Tungsten Fabric config node via XMPP. You can see it right here, coming through here, and you see this green line as well: this takes care of the networking that the Neutron router used to do. It really makes life easier for us in terms of operating this. The Tungsten Fabric controller node can peer over BGP with external routers; we actually use Juniper MX routers here to exchange routes over BGP, and it works really nicely. And here's my last slide.
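The image cache behavior described above, where the first user pays the cross-AZ download and later users in the same AZ hit a local copy, can be sketched like this (illustrative only, not the actual Cinder implementation):

```python
class ImageCache:
    """Per-AZ cache: only the first request for an image crosses the link."""

    def __init__(self):
        self._cache = {}
        self.cross_az_downloads = 0

    def fetch(self, image_id: str) -> str:
        if image_id not in self._cache:
            self.cross_az_downloads += 1          # expensive: uses the inter-AZ link
            self._cache[image_id] = f"blocks-of-{image_id}"
        return self._cache[image_id]              # cheap: local copy

az1 = ImageCache()
for user in range(5):                             # five users, same image
    az1.fetch("ubuntu-22.04")
print(az1.cross_az_downloads)  # 1
```

Five volume creations from the same image cost one trip over the expensive link instead of five, which is the whole point of the design.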
This is the network diagram. Let's say you want to build a virtual network on this side; call it VN 01. On it you have VM1 and VM2. If you look at the purple line, these two connect and can talk to each other. If you choose not to let them talk to one another, you put the other one on virtual network 02, and that one won't be able to talk to the other side.

The other thing about this architecture is that each site, again, we decided to make sure can go out to the internet on its own: every AZ has its own external network and ISP. If you had to go through this link and out the other side, the cost is higher, because this is the MPLS link between the AZs. So we designed it to be simple and easy by default, but you can use that path too, any time you want, if you're willing to bear the cost.

So this is the network diagram we have, and that's pretty much how we designed it. If you have a question, I think we can answer it. Jackie, do you want to use the microphone?

[Audience] What happens if the global zone dies, just by itself?

Okay, let me show you this slide. I forgot to mention that the global zone actually runs on three sites, replicated across three, not just one. Yes, it requires three sites.

[Audience] I have a question about your network architecture. You're using Tungsten for the provider network of your cloud, right? So you can solve the L3 problem very efficiently. But I'm asking about...
[Audience] ...how you're dealing with the iptables side, like blocking, the way you would with the Neutron L3 agent. I'm just curious: how can you deal with iptables in this model, for the floating IPs on your tenant networks?

Okay, let me ask my staff how we handle iptables... Right. There are two choices. You can do it the old way, with security groups, or there's a new way, the second choice, which is advanced ACLs. The iptables detail and the floating IP load balancing, maybe we can talk about afterwards.

[Audience] I ask because I have a very similar implementation of this architecture, but without Tungsten Fabric, so I was really wondering about that.

Yeah, no problem; we'll discuss it later.

[Audience] My question is how you handle updates on the Juniper switches, the switch operating system, when you're using the open source version of Contrail, which is Tungsten. The switches will get updates, and then you have to reflect that in your Tungsten Fabric manager as well. I think the setup is quite impressive; it's really the only way to combine overlay networking across switches and Linux systems, since you're using the Juniper vRouter on the Linux side. But when you're using the open source version, are you maintaining it yourselves when the switch operating system is updated and you have to react, because otherwise it doesn't work anymore?

Okay, about updates. Are you asking about the Clos fabric?
[Audience] Yes, when there's a new switch operating system, there's something on the Tungsten side when you connect to Tungsten. How are you going to update that? Do you go and update your Tungsten Fabric and check for the update?

Okay, thank you. No more questions? Thank you very much.