Hello everyone. First of all, I would like to thank the Linux Foundation for this opportunity and for hosting these great events. I think it's a great opportunity for all of us to learn and share. I will be presenting the Telnyx use case of migrating services from the cloud to a data center, enabled by Community SONiC. The agenda: I will basically present the use case, present the solution, and discuss some of the issues and lessons learned during the process.

So, who am I? Who is this little guy, right? I'm Alessandro Veras, a network engineer at Telnyx. I've been in the market for the past 18 years, working with big ISPs in Brazil, mainly with proprietary solutions from big vendors like Cisco and Juniper. So yeah, I know both sides. This was my first experience with white-box switches and open source, and it was actually great. I'm basically a user of everything the community does, so thank you for that.

So what's our use case? Telnyx is a communications-platform-as-a-service company. We have thousands of customers, and for that reason we run thousands of instances and services in big clouds like Google Cloud, Amazon, and IBM. In 2021 we started developing a new product, a storage service based on open-source blockchain technology. The first challenge there was how to implement a reliable and cost-effective data center fabric, because the goal was to run it in on-premises data centers. It was a successful POC, and we now have seven sites operational across the US for that product. The next step was clear: why not leverage this infrastructure for the main products, beyond the storage one? We currently have a POC for that as well: we migrated a lot of services to one of what we call our edge PoP sites, and we are saving thousands of dollars by doing that right now. And of course, we have a plan to migrate everything we have in the cloud to our own DCs.

So what did we do? We deployed a classic spine-leaf architecture. In our implementation we have six spine switches in the fabric and 24 leaves distributed among 13 racks.
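Just to make that shape concrete, here is a minimal sketch (not our actual tooling; the device names are made up) of the fabric as data: every leaf has an uplink to every spine, and that is the only kind of fabric link in a spine-leaf design.

```python
# A minimal sketch of the spine-leaf shape described above (illustrative
# names only): every leaf connects to every spine, and there are no
# leaf-to-leaf or spine-to-spine fabric links.
SPINES = [f"spine{n}" for n in range(1, 7)]    # 6 spine switches
LEAVES = [f"leaf{n}" for n in range(1, 25)]    # 24 leaf switches

fabric_links = [(leaf, spine) for leaf in LEAVES for spine in SPINES]

print(f"{len(fabric_links)} fabric links")     # 24 leaves x 6 spines = 144
```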
For the boxes, we use basically a 32x100G switch, and we are currently running the 202111 release of Community SONiC.

Processing power: what we have in those data centers today is basically 50 compute machines (we are actually expanding in this phase), each with two 64-core CPUs and two terabytes of RAM, plus 20 storage machines with a lot of disk capacity.

For host integration, the hosts are multi-homed: we have two switches in each rack, and the hosts are dual-connected over 100G interfaces using DAC cables. For redundancy we use MC-LAG, and it's working fine with a unique IP gateway, so we provide a single gateway to the hosts for simplicity. For some hosts we use BGP instead.

The routing design is based on RFC 7938: we run a full Layer 3 fabric, and we don't support Layer 2 in the fabric right now. Each leaf has its own AS number, and we use a convention for assigning them: basically 65000 plus the switch ID, so leaf 1, which has ID 1, gets AS 65001 (I'll show a small sketch of that convention below). The spine layer shares a single AS, because the spines are standalone switches and there are no BGP sessions between them. The edge layer, which is responsible for interconnecting the data center with other stuff like the internet and the Telnyx backbone, also runs under a single AS. It's all eBGP, so we don't need an IGP and we don't need route reflectors. It's a simple solution for us.

Regarding server integration, we basically contain the Layer 2 domain within each rack: we have some VLANs that are local to that rack, and we use BGP to announce the subnets related to those VLANs. In this scenario we have a full equal-cost multipath design, so we can load-balance traffic across both leaves at the same time.

Regarding route propagation: everything starts at the leaves. As I said before, we have some VLANs for the hosts at the leaf layer, but we also have some static routes and some routes we learn via BGP: we host Kubernetes nodes running Calico on some of these hosts, so they speak BGP as well. We advertise all of this to the spines, and following the BGP rules (everything learned from an eBGP neighbor is advertised to the other eBGP neighbors), all the prefixes the spines learn from the leaves get re-advertised to the whole fabric. We have a border leaf layer, which in our case is not a dedicated device but basically a leaf with a connection to our edge layer, and at the edge layer we do some aggregation on the routes we receive from the fabric, to optimize the advertisements to the internet and to the Telnyx backbone.

Fabric and internet security: this is something nice, because for some hosts we provide a public IP, so they can natively talk to the internet directly. Some others just need internet access to download packages, install updates, and stuff like that, so we don't assign a public IP to those. To provide basic internet access for those hosts we are actually running VPP. VPP is another open-source project, I don't know if you guys know it, and it basically transforms a bare-metal host into a router. I know SONiC is developing a VPP integration for the data plane, so in the future I will probably be running SONiC on those devices as well. It's really, really performant: we are currently able to forward and NAT more than 200 gigabits per second using only two CPUs on the host.
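To make the AS-numbering convention concrete, here is a rough sketch of how a per-leaf, FRR-style eBGP stanza could be generated from it (SONiC uses FRR for BGP). The spine AS value, the peer addresses, and the function itself are illustrative placeholders, not our production setup:

```python
BASE_AS = 65000      # leaf AS = 65000 + switch ID, per the convention above
SPINE_AS = 64512     # placeholder for the single AS shared by all spines

def leaf_bgp_config(switch_id: int, spine_peers: list[str]) -> str:
    """Render an FRR-style eBGP stanza for one leaf: all-eBGP fabric
    (RFC 7938 style) with ECMP across the six spine uplinks."""
    local_as = BASE_AS + switch_id           # leaf 1 -> AS 65001, and so on
    lines = [f"router bgp {local_as}"]
    for peer in spine_peers:
        lines.append(f" neighbor {peer} remote-as {SPINE_AS}")
    lines += [
        " address-family ipv4 unicast",
        "  maximum-paths 6",                 # load-balance over all spines
        " exit-address-family",
    ]
    return "\n".join(lines)

# Example: leaf 1 peering with two of its spine uplinks.
print(leaf_bgp_config(1, ["10.0.0.1", "10.0.0.3"]))
```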
And related to automation: what do we do to deploy all of that? Because we have a lot of devices, we need to install SONiC and we need to provide configuration.

For day-zero configuration (how a device gets IP addressing and gets SONiC installed), we leverage the ONIE discovery feature. Each time a new device is connected to the management network, it is provided with an IP address and a URL; ONIE kicks in, downloads the SONiC image, and installs it on that device. After that phase, all the devices in the fabric have a management IP and SONiC installed.

Then we have an Ansible playbook that does basically 99% of the fabric provisioning at day one. We built a model for that, so now, when we need to deploy a new site, all we need to do is change maybe 10% or 20% of that model to include the new IPs and the things that are specific to that site; then we hit a button, run the playbook, and everything works fine (there's a rough sketch of this site-model idea at the end). That's another great thing about SONiC: it's easy to automate. And for a network engineer, it's pretty beautiful to run a playbook and see all the BGP sessions and connectivity come up. It's really great.

The monitoring solution we are using right now is built entirely on open-source tools; I will probably add SuzieQ to it. But today we have LibreNMS running for inventory, monitoring, and alerting, Prometheus for metrics and alerting, Grafana for dashboards, and Graylog for syslog.

Lessons learned. First, it was hard to find out that we needed to compile SONiC from source to have MC-LAG enabled; that was one. In this version we also found a memory leak in the SNMP container, so from time to time we need to restart the SNMP Docker container to release some memory. We also found an MC-LAG bug: it basically crashes from time to time. Another thing we found difficult to implement is EVPN, at least with the platform we have now; we have a partner working with us to implement that, because we are hoping to bring some virtualization to this network, and for that, Layer 2 is required.

Yeah, so that's all I have. If you want to contact me, that's my email. If you have any questions, or if there's anything I can share with the community about this experience, I will be happy to. Any questions? Thank you very much.
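And here is the closing sketch I mentioned in the automation part: a minimal, hypothetical illustration of expanding one small per-site model into per-device variables that a playbook or an ONIE/DHCP setup could consume. All names, addresses, and the image URL are made up.

```python
import ipaddress

# Hypothetical per-site model: this is the ~10-20% that changes per site.
SITE = {
    "name": "edge-pop-1",
    "mgmt_subnet": "192.0.2.0/24",   # illustrative documentation prefix
    "image_url": "http://ztp.example.net/sonic.bin",  # what ONIE downloads
    "spines": 6,
    "leaves": 24,
}

def device_inventory(site: dict) -> list[dict]:
    """Expand the site model into per-device management variables."""
    hosts = ipaddress.ip_network(site["mgmt_subnet"]).hosts()
    devices = []
    for role, count in (("spine", site["spines"]), ("leaf", site["leaves"])):
        for n in range(1, count + 1):
            devices.append({
                "name": f"{site['name']}-{role}{n}",
                "mgmt_ip": str(next(hosts)),     # sequential mgmt addressing
                "onie_url": site["image_url"],
            })
    return devices

for dev in device_inventory(SITE)[:3]:   # print the first few, for a taste
    print(dev)
```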