Yes, welcome back to the FeM channel. In a moment, FeM is going to drop its trousers a little. Welcome to the talk "Das ist doch HA" — that is, "But it is High Availability" — by Jenny. This talk is being translated into English for you by Jörn and Attila E. Any feedback is appreciated; please use the hashtag C3Lingo on Twitter or Mastodon. We hope you enjoy the talk. Maybe we'll see each other somewhere in the world later. In this talk there will unfortunately be no Q&A and no breakout sessions, because Jenny doesn't have that much time right now. If you want to hear the English translation, you are already on it. Now, please enjoy "This is High Availability".

Hello, I'm Jenny. I'm going to talk about high-availability constructs that, for funny reasons, might not be as highly available as one thought. So, shortly, who am I? I've been a computer science student since 2017, and I'm also in the technology department of FeM, where I'm responsible for software constructs on the Internet. Here are my email, my Mastodon, my Twitter, and my GitHub, if you want to know what I do.

A quick overview: this is the introduction, then I'll talk about my motivation, and then about the reasons why high-availability constructs fall apart. Why are we doing this? At FeM we have large, self-made software constructs, and they are far more complicated than they need to be. That's why they fall apart, for funny — or actually sad — reasons. Now we'll take a look at those reasons.

Let's start with something I don't have that much exposure to: the GC, our video-conferencing service. FeM operates a GC instance. In the past this was only one VM, but when Corona started and everything went online, the demand rose and the GC had to be expanded. And that's what happened.
There was a master VM, which was globally reachable, and several video bridges that were controlled by the master. The master VM took over the public IPv4 and IPv6 addresses from the old standalone VM, and the standalone VM was turned off. There was a problem, though: if you looked at the Proxmox configuration, you could see that the standalone VM's onboot flag had been forgotten — it was never turned off. So when a Proxmox update came and we rebooted the node, the old VM was on again afterwards. It came back up with its statically configured IP — the same IP that had meanwhile been assigned to the master. This left things half broken, because we now had two masters, and the old standalone VM was out of date. What's interesting is that the GC was still largely usable. You could hold conferences, but sometimes people didn't hear each other, and sometimes people couldn't connect. We asked ourselves why that was. According to our GC admin, you could still connect via SSH, and things still sort of worked. The old VM was turned off again, and everything has worked since then.

And now about something I have a little more exposure to: Web1. What is Web1? Web1 is the web cluster of FeM; I was managing it. It's largely used for web hosting — for us, for FeM, but also for other clubs and other institutions at the university. Lots of institutions have relied, and we ourselves have relied, on this web cluster, so we put a special eye on its high availability. I made an overview diagram of Web1, redrawn in LaTeX — thanks to Nex. Let me show you how this construct is composed. We have two load balancers; on node two, port 22 is also forwarded so that people can log in via SFTP, and the active load balancer forwards requests to the web nodes. These come in pairs per PHP version — currently two for PHP 7.1, two for 7.0, and two for 5.6. This is not state of the art, but I'm actually just maintaining it; I didn't set it up.
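To make the GC failure mode above concrete: Proxmox stores each VM's settings as simple `key: value` lines (you can see them with `qm config <vmid>`), and `onboot: 1` means the VM is started again whenever the node reboots. Below is a minimal sketch of checking a config for a forgotten onboot flag — the VM name and config snippet are hypothetical, not taken from FeM's actual setup:

```python
def onboot_enabled(config_text: str) -> bool:
    """Return True if a Proxmox VM config has the onboot flag set.

    Proxmox VM configs (/etc/pve/qemu-server/<vmid>.conf, also shown by
    `qm config <vmid>`) are simple "key: value" lines; `onboot: 1` means
    the VM is started automatically when the node boots.
    """
    for line in config_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "onboot":
            return value.strip() == "1"
    return False  # Proxmox defaults to NOT starting a VM at boot


# Hypothetical config of a retired standalone VM:
old_standalone = """\
name: gc-standalone
memory: 8192
onboot: 1
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0
"""

print(onboot_enabled(old_standalone))  # True -> the VM comes back after a reboot
```

In the incident described above, exactly such a leftover `onboot: 1` on the retired VM brought it — and its old static IP — back to life after a Proxmox upgrade reboot.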
Additionally, PHP applications on the web nodes can also use MySQL or Postgres, depending on what they need. Our MySQL cluster consists of two MariaDB nodes with master-master replication. The Postgres cluster is the only component I built myself: three nodes running Postgres 13 in hot standby, so we have high availability there, too. All of those web nodes use NFS to access data on the HA storage, depending on what the web page needs. The same files are also exported to Web1Manage via NFS for the people who log in through SFTP, so that they can access their data. Authentication for those logins happens against MySQL as well — simply because we don't have that many logins, so we kept it simple there.

Here, HAStorage1 stands alone. Is there a reason it doesn't have HA? It was originally planned as high-availability storage, and we wanted to realize that with GlusterFS, but then we realized it's actually not a good idea because it's slow, so we rejected the idea. Now we only have one node offering a network file system share, and when this node is down, all PHP scripts are unavailable and all websites are down. Basically, this is a single point of failure.

And now we're going to talk about the next failure: the AdminDB. Some will wonder, what is the AdminDB? The AdminDB is the member database at FeM. It is maintained by me and others — I didn't build it, I just maintain it — and it manages all devices in the FeM network and configures the switches. If people join us with their devices, they have to register them, so we keep track of what kinds of devices are on our network, and the owners have to be university members. The members' devices get a static IP so that we know who did what and when, and that's why the AdminDB also configures the DHCP and DNS servers.
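The single point of failure mentioned for HAStorage1 above dominates the whole cluster's availability: a chain is only as available as the product of its parts, so redundant web nodes can't compensate for the lone NFS node. Here is a quick sketch of the standard series/parallel availability arithmetic — all uptime figures are made-up illustrations, not measurements of Web1:

```python
def parallel(*avail):
    """Availability of redundant nodes: up unless every node is down."""
    down = 1.0
    for a in avail:
        down *= (1.0 - a)
    return 1.0 - down

def series(*avail):
    """Availability of a chain: every component must be up at once."""
    total = 1.0
    for a in avail:
        total *= a
    return total

web_nodes = parallel(0.99, 0.99)       # two redundant web nodes -> 0.9999
nfs_storage = 0.99                     # single HAStorage1 node, no redundancy
site = series(web_nodes, nfs_storage)  # whole request chain
print(round(site, 4))  # 0.9899 -- capped by the one NFS node
```

With these made-up numbers, doubling the web nodes pushes that layer to 99.99 %, but the single storage node caps the whole site at roughly its own 99 % — which is exactly what "single point of failure" means here.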
Here is a very strongly simplified structure of the AdminDB, which I'll try to explain to you. AdminDB2DB is the central Postgres database; it's a hot-standby construct of two masters and two standbys. It's realized so that the data lies on a DRBD — which is actually a recommended way to do Postgres replication — and that's how we replicate: there is one block device, the current primary mounts this block device, and that's how it has access to the data. But this didn't work as intended, and we'll come back to that.

We also have the web frontend for using the AdminDB, consisting of two nodes; one is standby, and only one of them holds the high-availability IP. Then there's AdminDB2Radius: two active RADIUS servers, which get contacted by the switches; the RADIUS server checks the MAC addresses and determines which device gets access to which VLAN. We also have AdminDB2X — and I'm actually not quite sure what the X stands for in this context; according to our wiki it means "extra services". Through it, the switches are configured and managed automatically via SSH. Additionally, this thing is also the primary DNS server for all the internal zones we have in our network; that's realized with PowerDNS, which takes its data from the Postgres database. Here we again simply have two nodes, plus an arbiter node, and they share the services — some things run on one of them, some on the other.

Then the AdminDB2DNS nodes: those are the secondary DNS servers for our internal zones, and also the DNS resolvers our clients receive through DHCP. There are two of them; the high availability here consists of moving the service IP from one to the other if one is not available, and we also have an arbiter here. The clients receive their IP addresses from the DHCP servers, and this is also fully high availability: two load-balancing DHCP servers, where node one handles VLANs 1 to 3 and node two handles VLANs 4 to 6. They also receive their configuration from the AdminDB through scripts. Additionally, on this arrow you can see "arpwatch"; what that is, we'll talk about later — I just wanted to mention it here.

You can perhaps already guess what might arise from this over-engineered design. So let's talk about AdminDB2DB. This is the centralized Postgres database, and through that the centerpiece of the AdminDB's high availability, and as I already mentioned, only the current primary node has the data mounted. The problem: when stopping, Postgres really wants to write to the disk that it has really stopped and was shut down properly. But before that could happen, the DRBD had, in a lot of cases, already failed over to a different node, so the old node no longer had the file system and couldn't record that it had stopped cleanly. The new node then couldn't start, because Postgres seemingly hadn't been stopped properly — and then the AdminDB was broken, and the FeM network was broken completely: the RADIUS service and the DHCP server don't regenerate their configs, you can't sign in to the Wi-Fi, it's difficult, and by and large, one by one, the network breaks when we do these kinds of things.

One last thing about the database — a specialty of ours: you don't talk directly to the SQL tables but to functions. The reason is that rights can be checked this way — we have a simple rights design for writes — and you simply sign in with your Postgres
credentials and call the functions through them. But functions can also have access to very sensitive data. So the intention is correct: we want to log inside transactions, but the log entries shouldn't disappear when a transaction is aborted or rolled back — because otherwise you could do very illegal things, look at data you're not allowed to look at, and then simply cancel the transaction. You can have your own opinion about this, but we put it into practice through a Postgres function written in Python that logs into MongoDB. I know — you can think for yourself about how intuitive that is or isn't. For this, we have a replicated MongoDB on the DB nodes, and the logs are written into it.

Now I have a code snippet from the AdminDB log. I modified it a bit; actually only this part is important — I hope you can see it (normally this line would have run all the way over to here). What you can see is that, for logging, it connects to the MongoDB on AdminDB-Master1 — hard-coded. This had a great effect: once, Master1 was off, Master2 held the Postgres database as primary and had taken over the IP, and everything worked — except the logging. I looked it up, and it's pretty bad: it always connects to node one, even from node two, instead of connecting to the local node, and this little mistake can't simply be resolved by changing the one to a two. If nothing can be logged, even the DHCP config is not regenerated, and that breaks the entire FeM network: the old config still works, but the DHCP updates can't be applied. So this once again breaks the high availability.

Now one more component that breaks sometimes. Every wired device in the FeM network gets a public IP address — if you connect through wire, you get at least a gigabit connection, and NATting a gigabit for several people did not seem like a good idea for many years, so we hand out public IPv4 addresses. Additionally, the IPs are statically assigned according to the MAC address. This has the reason that, in abuse cases, we know
which device had which address. But if an IP address is freed, it should be reassignable really quickly, which is why we have a lease lifetime of 5 minutes. We also have about 5,000 devices in the FeM network. That means every one of those 5,000 devices — let's assume they're all active at the same time — goes to the DHCP server every 5 minutes and asks whether its IP is still valid. So the DHCP server has a lot of fun — a lot of work to do. (Back of the envelope: 5,000 devices renewing every 300 seconds is on the order of 17 requests per second, continuously.)

In addition to the normal DHCP duties, there is also the task I already mentioned: arpwatch on AdminDB2X. What is arpwatch? It watches the ARP traffic in the network and sees which IP address is being used with which MAC address. And when a DHCP lease goes through, a script is called that writes into the AdminDB that this MAC had this IP address at this time. We can argue about whether that's the right solution or not, but at least this arpwatch script also writes an email if an IP address shows up with a MAC address it doesn't know — and there are a lot of those, just to say that. All of this means that every 5 minutes the DHCP server has to hand out a fresh lease to all devices and call the script.

In the past — and that's why I'm talking about this now — the DHCP construct ran on two nodes, but as hot standby: only one was active, and a single DHCP process was responsible for all our VLANs. At peak times, that DHCP server had no fun at all: it simply broke under the high demand. DHCP was broken, the requests kept coming in anyway, and then the legacy-IP (IPv4) internet was broken. You could see it in all the emails, because people reported that only Google worked and nothing else — presumably because Google was still reachable over IPv6.

That was a small insight into our infrastructure. There were some more things that actually failed, but these were the highlights that I had to work with. I thank you very much for your attention — and yes: goodbye!

Thanks for this really nice and interesting talk, and thank you for your attention from the translation booth as well. You just heard the talk "Das ist doch HA" — "But it is High Availability" — by Jenny. It was translated by Jörn and Attila E. If you have feedback for us, please use the hashtag C3Lingo. Goodbye!