Okay, hello everyone. Next up we have Goffi Frux, who'll be talking about issues with large deployments of Debian.

Hello, thanks for joining this session. It is meant more as an experience exchange than as a talk where only I speak. So I first want to introduce you to the issues we are facing in our environment. I will be tracking this stuff in this document; LSD stands for large-scale deployments, in case you wonder. So please join the Goffi session. I will go through it quickly.

I'm working for the Austrian e-health system; we are running the electronic health care system in Austria. The boxes at the doctors' offices all run Debian. There are around 12,000 boxes all across the country, and we do software updates twice a year. To have the same version across all the systems, we need to update them all during one night. Also, because the health care system is quite a critical system, it has to be done all at the same time and quickly, so that people can go to the doctor the next morning without any trouble.

If something in this environment goes wrong, it can be a big deal, because you can't send out technicians to check on 12,000 boxes across the country. There's a fair number of countermeasures that we've taken to work against these issues. One of them, the most important for us, is the rescue system. There's a special partition on the boxes, which is read-only, that resets the system to a known good state.
Technically, it downloads a bootstrap binary, runs it, and on top of that starts the software update again. This rescue system can be triggered manually, but it is also triggered automatically if the system reboots 10 times without bringing up the application.

We have an integration test, which does an overall test: it checks whether the system is still accessible through the interfaces that are exported to software developers, who can base their software on our software stack, and also whether you can still plug a monitor and keyboard directly into the box or use the web interface to work with the system. All this is covered by the integration test. Then there's the second step, the system test, which tests on the application layer and checks the different applications that run on the system.

In our software development we actually do three revisions during the release cycle. All the deeper changes are meant to be in the first one. Then the system goes to integration test and system test, and they test it. There's another revision, which is only allowed to contain minor changes and bug fixes. And there's a last revision only for bug fixes and potentially also documentation changes and things like that.

Then, when the rollout comes, we first do the upgrade on the servers, including the required database changes. There is a first rollout to about 300 clients, which is our first field test, where we check whether there are any issues that we weren't able to reproduce in our local test environments. A week after that, usually, there is the rest-of-world rollout, which runs the update for all the systems across the country. Of course, there are proxies or caching servers involved to work against bandwidth issues; when you do updates for so many boxes in one night, you have to work against these.

So this is our own environment. The software update is its own package.
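
A minimal sketch of the automatic rescue trigger described above, written in Python purely for illustration. Only the threshold of 10 reboots and the manual trigger come from the talk; the names `BootState`, `record_boot` and `next_boot_target` are hypothetical, and how the boxes actually count failed boots is not stated.

```python
from dataclasses import dataclass

# Threshold mentioned in the talk: rescue after 10 reboots without the application.
MAX_FAILED_BOOTS = 10


@dataclass
class BootState:
    """Hypothetical persistent counter of consecutive boots without the application."""
    failed_boots: int = 0


def record_boot(state: BootState, application_started: bool) -> BootState:
    """Reset the counter on a successful application start, otherwise increment it."""
    if application_started:
        return BootState(failed_boots=0)
    return BootState(failed_boots=state.failed_boots + 1)


def next_boot_target(state: BootState, manual_rescue: bool = False) -> str:
    """Choose the rescue partition on manual request or after too many failed boots."""
    if manual_rescue or state.failed_boots >= MAX_FAILED_BOOTS:
        return "rescue"
    return "normal"
```

Keeping the counter separate from the decision makes the rule easy to test without any hardware: feed in a series of boot outcomes and check which target comes out.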
The update package usually runs an apt-get update, then it brings in an updated sources.list, which has the new version name in it, does the update, and fetches the packages. Or rather, there is a two-step software update going on: the first step just updates our specific software update packages, which also have some pre- and post-processing hooks where we can work around certain limitations with respect to config files and things like that. The configuration files from the packages are usually shipped in one of our internal packages. We don't touch the Debian packages that much, but all the packages are installed with --force-confmiss and --force-confnew, plus these package hooks, so that we know the systems out there have the new configuration, because it all has to work unattended, of course. After the software stack has been updated, there's also a hook that might need to reboot the system, and then after the reboot the software update hooks in again, updates all the remaining packages, and in the end brings up the application layer again. So this is how things work for us.

And I wonder if there are people who have stumbled into similar issues, who have more or less similarly big deployments going on, and what problems others are facing in such environments. Maybe first a round of questions, if there are any questions with respect to our specific setup. Yes?

Hello, one, two, three. My first question is: which kind of boxes are these? Are these the desktops at the doctors', or are these some small servers that are used?

Yes, I should go into a bit more detail. They are sort of embedded devices. There are no moving parts in them.
They sit directly in the doctor's office or in a hospital, and in this environment there are various special limitations and regulations with respect to emissions, how much power the boxes are allowed to draw, and things like that. They are small embedded devices running the application stack for the health system. So when the doctor puts in your health card, it checks the authorization, and then the doctor can enter the consultation information for the case for which you are there. So these are not desktop systems; the box just exports, more or less, a web interface in which you can enter the patient data, and it is also used for payment: through that, the doctor gets paid by the health care insurance company.

So, exactly how do you trigger the installation of the update package?

Each of our releases has its own version code, and we have a software distribution server which can tell the clients which version they should update to. This software distribution server tells all the clients that there is a new version out there. There's a cron job running every day on the clients which checks whether a new version is available, and there's a maintenance window within which the clients may hook in the updates. So the central server sets the new version for the clients, in the evening the clients see there's a new version, and they trigger the update within their respective maintenance window. For some areas, especially bigger environments like hospitals, there is also the possibility of triggering the update manually, so it's not done automatically in the night. Especially with respect to emergency rooms and things like that, they need to be able to trigger it themselves and not have it done in the night as for the usual practices.

Does this mean that there is no user data on these devices?
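
The client-side decision described in the answer above (a daily check, a maintenance window, and a manual-only mode for sites such as hospitals) could be sketched as follows. This is an illustration under assumptions, not the actual implementation: the function names, the version strings and the window times are made up, and comparing version codes for inequality is an assumption about how "which version to update to" is evaluated.

```python
from datetime import datetime, time


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` lies in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_update(installed: str, target: str, now: datetime,
                  window_start: time, window_end: time,
                  manual_only: bool = False) -> bool:
    """Decide whether the daily check should start the update right now.

    `installed` and `target` are release version codes; `target` is whatever
    the distribution server announced for this client (hypothetical values).
    """
    if manual_only:
        # Sites that trigger updates themselves never start automatically.
        return False
    if installed == target:
        return False
    return in_window(now.time(), window_start, window_end)


# Example: a client on release "R1" told to move to "R2", checked at 23:00
# with an assumed maintenance window from 22:00 to 04:00.
start_now = should_update("R1", "R2", datetime(2011, 7, 25, 23, 0),
                          time(22, 0), time(4, 0))
```

Handling the midnight wrap-around explicitly matters here, because a night-time window almost always starts on one calendar day and ends on the next.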
The question was whether there is data on these devices, or user-generated data. To some degree, yes. The boxes are also meant to be able to work in offline mode. So if the central servers are not reachable, the doctors' practices need to be able to do their job nevertheless. So the systems can store data offline, and when they get the network connection again, this data is sent to the central servers.

So, any more questions with respect to our special environment?

So how are these boxes connected: wire, virtual private network? And is the bandwidth limited on the client side?

Yes, most of these parts were already covered in my talk on the Debian Day. They are connected either via leased lines or regular DSL lines. The providers are required to run an MPLS service on them for quality assurance. There's a minimum bandwidth which has to be ensured to be able to run the update in a timely manner, because this also gets tricky if the boxes get just a few hundred bytes per second and the update is sometimes a bit bigger. So we really try to keep the update size quite low.

What I didn't mention here, and what goes together with it, is that we strip the Debian packages of their documentation, help pages, manual pages and all this stuff, so that the packages are really narrowed down to the minimum that is required on the boxes. There's no documentation stored on the boxes, because the old boxes, which we are currently in the process of replacing, had just 256 megabytes of RAM and the same amount of flash memory, so it's quite limited in that area. So it's not only for bandwidth reasons that we strip the documentation from the Debian packages, but also because of the space limitations on the embedded devices.

The question was whether we encounter problems with the network. Yes and no. When doing the rest-of-world rollout for the whole group of 12,000 boxes, we usually group them in several chunks of about 2,000 to 3,000 boxes each
and trigger them at different hours over the night, so that the servers don't run into the bandwidth limit. We also do some sort of simulation and graphing beforehand to know where the bigger peaks might be, and afterwards the real data also gets noted down, and we check whether our pre-calculations, our tests, were more or less within the lines, so that we can potentially improve on these limitations. So far, at least within the last few releases, we managed to get things lined out really nicely.

My real question is: are there any other people around here who have that many boxes, or even just a thousand or something like that, under their hands?

Well. Two or three releases ago, we did the upgrade from Sarge to Lenny. If you know the history, there was the Etch release in between. So we skipped the Etch release; we went directly from Sarge to Lenny. This also took a lot of testing during the process, because there were some symbols removed, and the upgrade path within Debian is ensured only for one release at a time. So we had some issues with that, but it was overdue. And we do the system upgrades, the complete distribution upgrades, for the base system. Most of the time, now that we have Lenny on the boxes, the updates consist only of improvements in the whole process, or of adding new applications and bundles along those lines, which offer better support for specific application enhancements. Recently, for example, e-medication was added, which checks for interactions between different medications and presents them to the doctor, so that they can warn you about them.

Yes, Neil, a question from IRC: How do you cope with bandwidth requirements? Do the clients have some sort of randomization for the time to start downloading the updates, well, within the maintenance window, so they don't all try to download at the same time?
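
The staggered rollout described earlier, combined with the kind of per-client randomization the IRC question asks about, might be sketched like this. It is a sketch under assumptions only: the function name `schedule_rollout`, the one-hour spacing between chunks and the jitter width used in the example are illustrative choices, not values confirmed for the real system.

```python
import random
from datetime import datetime, timedelta


def schedule_rollout(client_ids, chunk_size, first_start, chunk_interval,
                     jitter, rng=None):
    """Assign each client a start time: chunk start plus a random offset.

    Clients are split into consecutive chunks of `chunk_size`; chunk k starts
    at `first_start + k * chunk_interval`, and every client adds a uniformly
    random delay of up to `jitter` so a chunk does not start all at once.
    """
    rng = rng or random.Random()
    schedule = {}
    for index, client_id in enumerate(client_ids):
        chunk = index // chunk_size
        offset = timedelta(seconds=rng.uniform(0, jitter.total_seconds()))
        schedule[client_id] = first_start + chunk * chunk_interval + offset
    return schedule


# Example: 12,000 boxes in chunks of 3,000, one chunk per hour from 22:00,
# each client delayed by up to 10 minutes (spacing and jitter are assumed).
plan = schedule_rollout(
    client_ids=range(12000),
    chunk_size=3000,
    first_start=datetime(2011, 1, 1, 22, 0),
    chunk_interval=timedelta(hours=1),
    jitter=timedelta(minutes=10),
    rng=random.Random(0),
)
```

Spreading the chunks over the night limits the peak load on the servers, while the jitter flattens the spike at the beginning of each chunk.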
I think I read that question out myself. How do you cope with bandwidth requirements? Yes, there is some randomization of the start time within the chunks. Like I said, we split the clients into several chunks, so they are spread out over the night. But within that there's also some randomization, which goes over about, I think, 10 minutes, within which the different clients can start. Sometimes clients run into timeouts during the update, which gives us some trouble, but the rescue system is there to help even with those: there's a watchdog that checks for the application start, and if the application didn't start within a certain time, the system reboots by itself. Most of the time the systems are able to heal themselves through these reboots, because the other systems are not updating anymore at that time. So that works fairly well for us most of the time.

With 12,000 boxes, they're fairly much the same hardware-wise, but there are still some different batches from production. So in our last release we had a timing issue with a new kernel update: the watchdog hooked in too early, and some of the boxes were in an endless loop. We had to fix that, but these are just some corner cases, and luckily they appear very rarely.

So how do you fix these issues? When some of the boxes have trouble, like an endless loop, or where the update simply fails, do you have remote access to them, or do you have to
basically go there?

Our operations team has SSH access to all of the boxes. When we are not able to SSH into them anymore, that's the troublesome part. But when we are at least able to SSH into them, we can log in and fix stuff. That's the easy part, even though sometimes things are still pretty tricky, because the system became fairly complex over time. Most of the time SSH access is possible. In that case with the kernel, we had to go into the client distribution system and tell the boxes to deploy an older version and not the most recent one; then they went through the rescue system to the older version and we were able to reach them. Sometimes it's really plain hardware breakage: the old hardware is about five years old or maybe even older, and it starts to wear out. We are currently replacing the hardware anyway, and in case some of the boxes go mad with respect to hardware, technicians have to go there and replace them.

Neil, at the far back.

Do you see issues with the base system getting larger? Is this a problem for you?

Pardon?

Is this a problem for you when the base system gets larger between releases?

Yes and no, because actually we're now in the phase of deploying new hardware, so the issue isn't there anymore. We go from 256 megabytes to four gigabytes of hard disk space, so that gives us a bit of space for future deployments. On the other hand, making it bigger becomes more troublesome for the servers and for the upgrade process; the download times don't scale that well. So we still have to be very careful about how much software we put out there, and sometimes we even have to make decisions like: is this bug really so important that we push out an update, or is it easier to work around it? But yeah, these are decisions that have to be made in such an environment.

Is there no one here who also has a big server farm or something like that under their hands?
Yes, Neil.

Well, where I work it's not Debian-based, but we do have a lot of similar problems. I work for a set-top box manufacturing company. We have a product at Telecom Italia, currently serving about 60,000 to 70,000 boxes, and there are a lot of very similar problems with making sure things stay up to date. One of the requirements was making sure that the hardware can't brick itself at any point in time, so there's a sort of hardware recovery thing. What sort of mechanism do you use for the updates? Is it just HTTP? Do you have a separate kernel that loads this software? How do you make sure that a bug isn't going to take the entire box down?

Well, like I mentioned, there's a lot of quality assurance: system test, integration test, the test rollout to the 300 golden clients, and then the rest of the world. You can never be sure; you can take a lot of precautions, be very careful, double and triple check. There are these three revisions that we do internally before we push out the software for the first time, and all these things that run during the release cycle at least gave us the possibility to go to bed and be able to sleep. So, yeah, the possibility of breaking systems, especially when you have to do kernel-related updates, is rather high. Like I mentioned before, the kernel timing issue was a bit troublesome, but luckily we managed that very well. Does that cover your question, to some degree?

So if there's no... yes, Kristen?

So this is not really related to the boxes you are upgrading here, but still somehow related. Are you actually upgrading the card reader firmware too, or just the box?
No, the card reader firmware also gets upgraded from time to time. Especially recently, we had to upgrade the certificates we were using, since they were running out, and in that process there was also a card reader firmware upgrade going on.

Are these upgrades triggered by the box, or are they triggered through the box?

Yes, because sometimes the card readers are in their own VPN environment locally and we can't reach them from outside. There's also a card reader proxy for bigger environments: in a hospital you usually have a card reader at every desk, but only a few main systems that have access to the health care system VPN. So that's done from there.

One more question on IRC; let me look it up. Which one was that? Yes: do you deploy updates selectively to specific clients, for example if some of them need a slightly different package or configuration? Can you explain how you do it?

Well, no, there's no software difference between the clients. There's configuration related to different clients: we have one main configuration file which stores specific tuning parameters for our software stack. But the packages that we deploy are all the same across the different clients. Not every client has the same set of applications, though. So the sources.list line that is sent by the server contains different components at the end for the different clients, and these components govern which applications the specific doctor is allowed to use, or rather is able to use, because the other applications are not installed on the system. But apart from that, yeah, it's done through the components in the sources.list file. With that we might be able to put different software stacks on the systems, but currently it's just used for the applications that the doctors can use on the system.

Yes, Neil?

How do you make sure that the software on the device is what you put there?
Do you verify the kernels or check the file system, or how do you really verify that it hasn't been tampered with in any way?

Well, the local administrators only have access to a specific dialog-based administration menu, through which they can change just a very limited set of settings, like network settings or which card readers are connected to the device. Apart from that, no one has local access to the system. As for tampering, we rely on apt-key, just like regular Debian does, so there's no additional layer that we put on top of that. We can log in remotely and check things. We also have a specific piece of software that is able to do some checks on the whole tree of clients: if we need to find out how much bandwidth the clients used in the last update, we can hook in some scripts that get uploaded to the systems and give us this data. It's sort of called health check, to be able to check the health of the system. There's no regular Rd tool or something like that running on the boxes, but if we need some specific information on whether there might have been problems with the last software update, we can extract the logs and check them.

So, the next talk that is going on here goes more or less in the same direction, and I hope there will be more people around who work in similar environments and also have bigger numbers of machines under their hands, and who speak up about the issues they are facing. If you know someone who works in such an environment, please tell them to get in contact with me. I would like to discuss the troubles they are seeing with using Debian in such environments. I hope you all know how to do that.

So, thanks for being here. Thanks for your questions. I hope I have answered all your curiosities, and see you later.