 We are back again. This session is we're calling new tools for deprovisioning accounts and we've got John Stewart with us to kind of show to show some of the work he's been doing and to help him make decisions on what account to be deprovisioned and so the purpose kind of showing showing the the work you're doing hearing about your thought process on this stuff and like I said we're kind of ultimately hoping to be able to learn from this as the reclaimed community maybe pulls some of these tools to be usable for other folks as well so that's the end goal. We probably won't accomplish all of that in 20 minutes but we want to get started this is the start. Yeah so yeah you want to show us what you've been working on? Yeah yeah so the the reason for all of this is that we've been running OU create for eight years now. I was looking back at our records just before we started and the earliest users got in in July 2014 and so we've had it around for a while we've got 7100 users in our domains right now and so we've got seven servers I think running and the issue has always been how don't we clean that up occasionally and so I've developed different ideas for how to identify users that aren't active anymore and some of that is emailing people but they don't necessarily read the email or email me back and say yeah you can delete my stuff so one of the ways that we came up with was just seeing you know who hasn't been in their site for a while and so you can see from WHMCS and some of those tools when the last time someone logged in was but that's if they're logging in through single sign-on and through you know the the main portal if they're logging in directly to their website it's harder to find that information and I think we mentioned that earlier you know a couple hours ago even if they're logging in are they actually doing anything to their site or you know is it just sort of sitting there was the next question and so I was trying to find a way to see you know when was the last update that somebody had made on their site was and you know maybe that's a blog post maybe it's updating a piece of media maybe it's you know updating a page or adding a page something like that so it turns out that any updates that you make whether that's adding media adding a post updating a post any of that shows up in the posts table in your WordPress database it all goes into this one table and so if you just look for the last time that a change was made in that one table that's a pretty good indicator of when the last usage was and so the script that I wrote looks at well it first looked at a single user's database and just checked when that last post was and returned that as a I printed it out to a CSV file just so I could look at it as a spreadsheet and I was really excited when I got that because I thought like okay you know I've written the script now it's just a matter of like uploading it to my server and and running it across all of the looping it over all users and I thought that was going to be straightforward and it's it's not at all um I thought I could do that with like a SQL query but every every single one of these databases is its own SQL database you can't run or at least I don't know how to run a SQL query over 7000 databases um but what I figured out how to do was to take the install-a-tron information for the servers and get each of the database credentials across the whole across all of the servers and then to just loop it in a Python script and so it literally just goes server or you know database by database and just asks the question what was the last update to the posts table and so yeah this um I think if we've got the share screen that we can pull up yeah I've got the um so all of this is in a GitHub repository and um I can show this the script real quick it's not particularly interesting but um like I said maybe the the thought behind it is a bit more interesting but it's using MySQL in Python and then it's using uh CSVs both to read and write and so basically the first thing that we do in this loop is we import a WordPress account list dot CSV and so I've got a list of all the WordPress accounts for each server and then for each one of those accounts it tries to connect to the um to the SQL database uh for that WordPress account and then it just looks uh the query down here on line 29 says from the posts table uh look for the most recent post and literally just return me to the top line for that most recent post on the post table and so one of the fields in that is the date and so I pull out the date string on lines I guess 31 through 33 or something like that and then I print the whole thing just to a CSV and so what I ended up getting was something like this so this is what um Installatron gives you is a full list of all of the um Installatron installs on each server and so this is from the original OU server Oklahoma one I think yeah and um and so we have the owner's uh sort of username their email address the path of their um site their URL database username the title of their blog so some of these are just like my blog which is usually a good indicator that they're not doing a lot with it um and then the Installatron will tell you when they installed this um app originally this WordPress app uh or whatever other kind of app it is and then I pulled in this information from this this Python script as to when their latest post was and then I looked for those accounts where the latest post was on the original installation date and so if they made one post and it was a hello world post or even if they made one post that was like you know I just started up this blog I wanted to see those and be able to flag those and then I did you know basic subtraction for uh latest post minus install date and then you know what is this this most recent date and the unique is whether or not this person has more than one WordPress account running so if it's a one that means they've only got one WordPress site running and if it's more than one you know then they have more than one WordPress site running and so I didn't want to you know accidentally delete somebody's account if they have you know two or three WordPress sites running one of them is an active because obviously I don't want to delete their their active sites so I use this to see you know what are the the days and use for all of these so um it's actually more like uh yeah how long have they been using this thing so there are 517 days between the latest post for this user and when they originally installed it and so uh they've got over a year of active use whereas some of these other people were only actively using their site for 97 days in this case uh in 2019 and then I could also look you know for these latest posts I could filter down to everybody that was what I ultimately decided was anybody that hadn't um updated their site in three years uh were the accounts that I wanted to flag for deletion and so for each of our servers um there were a decent number of of sites that hadn't been updated in three years and so this is sort of the the general data for it um how many of the how many WordPress installs are there how many were at least three years old how many are two years old how many are one year old and how many also had no posts since the original installation date and then I I did a couple of other sort of pieces of math I looked at the mean age so how long did the average user um keep using their site and the answer was a little less than a year um and what was the median age and I thought this was really interesting is that the median and the mean are like vastly different and what this is telling me is that most of the users in our domain of one's own system use their website for about three months and that's how long a semester is so I think most users are getting in um setting up a site for one of their classes and then blogging for that class and then um not using their site again after that class is over um and so that's why the median age is about that length and then the mean age is inflated because we do have a lot of users who have been using their site for two three four five years eight years in my case and so the mean is going to be higher than the median and I could graph the the differences and I think we'd see sort of a bimodal distribution where most users are undergrads using it for one class and then there are users like me or other faculty members who have been using it for a long time and then there you know some other people in between but I think it's going to be you know a bimodal distribution so I generated email lists off of this of everybody that I wanted to reach out to and what server they were on and then I worked with Reclaim to to email all these folks and basically tell them you know we're planning on on deleting your account in uh at that point it was in about 90 days I think and um you know to let me know if you don't want me to and we did get a decent number of responses maybe 30 or 40 responses of people saying like actually I'd like to keep my account um I promise to use it more often or whatever and uh and that was what kind of documents make them sign I've heard that before yeah exactly I had several that had never you know done anything beyond hello world but they are like oh no no I really want to keep that it's okay so you know we don't need to I don't I don't need to get it you know I don't need to delete 5000 or anything so as long as I can get a few of them off the server then that's fine did you yeah John did you have any situations where the users were predominantly doing stuff that wasn't application-based like WordPress like maybe HTML or maybe PHP or just a custom thing that they wanted to make almost like the old Tilda spaces they wanted to make a site you know set and forget um or is that really not the case yes so uh this is a like slightly longer answer than than what you're asking for but the um what installatron shows you is the installer and the type of installation and so this is all of the different PHP installations so you can see that for most of our servers like WordPress is dominant but there are other types of installations in here like Omeca and Drupal and all sorts of other stuff and also it shows you the information for your backups and so I filtered down for just the WordPress accounts that are current so not backups and not other installations and so I'm really only running this script against current WordPress installations and I'm ignoring people who just have you know HTML running or who have Drupal or Joomla or Omeca or whatever else and so this this sort of sweep is just focusing on WordPress users but WordPress users represent probably 90 plus percent of our user base so yeah we do have a lot of other types of users but they're excluded from this whole process it's interesting too because your your data your medians and means which I've never seen it broken down like that and I love that does reinforce why you're using WordPress multi-site along domain of one zone like you actually have the data to demonstrate that yeah yeah that was the the sort of wake-up point for me just seeing that that huge gap and um and you know it's it's uh confirmed when you actually start looking through the blogs and like clicking on them and as Taylor was saying you know do snapshot of them is that a lot of them and I know even what class is there for we have a huge journalism class that has 300 students every semester and those students are primarily blogging for that one class and we encourage them to keep blogging after that class but obviously you know some aren't so Audrey asked a question and and this kind of gets into one one of the things that I think is interesting because uh so the question is what what do you do when people request you never ask them again and assume the answer is they always want to keep it um but before we even get into that I just want to mention too that I think it's important to note that there's a lot of different ways that I think schools decide on when to do for deprovision accounts one of the reasons I was really I'm really excited to be having you talk about this John is because I think the way you're making these decisions is cool and I don't see people doing it this particular way a lot and I think it's a pretty interesting way um some folks are very much like no like after you graduate that's it right and so that's another way that's what I did when I was a domain's admin and so obviously that's a little bit different conversation there but I just wanted to mention that before we get into the question so I did you have that come up and how did you handle it yeah a few of the people said like I really want to keep the site and yeah please don't ask me again anything and so I made notes in WHMCS as to you know each of these accounts where the person said like I'd really like to keep this or and so I just made a note saying you know this this account is sort of whitelisted from future sweeps and I tried to note you know when I when they emailed me I tried to note sort of who they were and and what type of rule they were playing so if they were faculty that's a bit easier of a note to like put in there and not worry about for a while whereas if they're a student I might still look in you know two or three more years and just see if they've graduated yet the problem for us and part of the reason I'm doing the sweep this way is that we have a really hard time telling who's left oh you so we've tried to ask our IT department at various times can you send us a list of emails that you've deprovisioned and that way I can run it against my list of emails and just see you know which one to take off and that works a little bit but not great and often it's too late once they've deprovisioned those emails I don't have anybody to email anymore because and so you know this is a bit more you know action oriented and ahead of the curve but even this isn't great and so we're still you know this is one sweep of like five or six that I'm trying out but this one I think is you know definitely caught the biggest yeah I'm envisioning kind of like you know something so an earlier question that I just someone asked about like a root password to to the database so that we wouldn't have to loop through all and I I think actually in a c-panel environment the way you did it here is probably the way I would do it as well and I think it would be possible it would be some work but I think it would be possible to eventually like maybe we could get this to a place where it actually automatically via the install trying api finds an account loops through gets the password runs the type of code you have like that that all seems possible to me obviously it's and I don't know how long but you know it seems yeah and I don't think there is at least as far as I could tell I don't think there is a root my sequel user password because each of the databases and maybe there is like a universal one but as far as I could tell each database is its own sort of distinct thing and that's my understanding but I'm yeah but but I did it in two steps where I downloaded the information from install Tron and as you said I think you could roll that back into the code so that the code goes to each install Tron database pulls this list for you and just does it all and once I fed it the list of accounts that it was supposed to look at it you know took I don't know 10 minutes or something but it wasn't it wasn't like your your backup you know thing that took you know six hours so yeah yeah 10 minutes is actually I mean that's a lot of accounts right like you said 7,071 100 accounts something yeah and I had to batch it for each of our servers so 1200 1200 accounts per server or something like that and so these are these are WordPress installs also so there's going to be usually more than there are c panels because a decent number of people have multiple the other thing I'm thinking of and again this is all I'm just dreaming a tool here really you know building on your work here is I think comparing this adding in you know like bandwidth which is something we can already get from c panel right to have sort of the flip side right here we know how much a site's been edited and here we know roughly how much a site's been viewed obviously very roughly bandwidth is kind of a terrible metric for that but it's one of the ones we have yeah I do that sweep also and so I print off a list of our lowest bandwidth users and then I also do one based off of our lowest storage users sure the storage one is still problematic because as y'all said you know some people just have html sites but even an html site will generally be you know somewhere in the range of four to ten megabytes if if the total storage on a site on an account is less than a megabyte it's you know it's pretty good indication they're not doing anything I think blank c panels tend to clock in at one to two usually yeah and and the other thing I'm thinking about this they're just jogging my brain a little bit as I'm thinking about this I like I wonder if it'd be possible to look at times like date modified timestamps and user folders that may not be that useful like we probably have to be choosy about what folders and files we actually care about right because some log files will be written to all the time but that may be useful too in in be a little bit like for static sites it's interesting because whenever you bring up deep provisioning I think you know questions come up like okay how does your school keep the email like when does they use lose access and moe asked that question and I mean you can lose access to your email moe but then as Jackie pointed out in the comments still log into a particular application like word press or even see panel if you have the the the username and password that's not single sign on so that it really is it's almost like a a policy based in the university and some people say for simplicity's sake six months I know Annika brought this up and you were talking about this you know at UMW we used to say okay we have these students who are graduating we can cross reference that list with our list of emails we have and then those are the ones we're deep provisioning in x amount get the emails ready talk to them hey Domain in one zone is great if you love it so much go set it up on a host of your choice love it so much pay for it yourself exactly so I mean but that's really and I like the the fact of one of the things I think that appeals to folks about W WordPress multi-site and tell me if I'm wrong John is that you don't really have to worry to the same degree about archiving because you can have many many sites and you know it will have problems down the road as rampages and UMW blogs and many old ones suggest but at this time you don't have to worry about cleaning up in the same way immediately because there's no cost associated with it directly yeah that's that's a big part of the appeal and it's um but you know same same types of questions of you know one of the things I wish I'd done differently at the beginning was collecting more of the data as to you know are you faculty staff or student how long would you like to use this I'd really like to eventually have something where people can say I'm only using this for one semester please delete it for me at the end of the semester that's a good use case for the gravity forms right like maybe tapping some of that stuff at the point you know and then knowing that they're part of that journalism class and then that's another field you search against when you're doing these deep revisions yeah the other thing I was going to say is with the script you know right now I'm just using it for deep provisioning but one of the things it's it's pretty powerful in that it can access the post's table and there's that's where everything is and so I was thinking earlier from what Taylor was saying about thinking about screenshots of each of the websites is you could do sort of textual screenshots you could look at the posts and maybe you could look at like the about page and see if they've updated their about page and if you know you could do some sort of text analysis on whether it says faculty or student or something else that tends to indicate one way or the other you could do sentiment analysis across posts over a given time period you could do all sorts of database analysis of the whole of WordPress for your institution um and it's just a matter of I haven't quite figured out what questions asked yet and also there's some not insignificant like ethical considerations over yeah if I'm going to do sentiment analysis like I probably should get buy-in from people first but you know I was thinking about various COVID for example just doing you know some sort of um suite for COVID related posts and then I don't know collecting those that's I I mean the sentiment analysis I know that we are running on time and John as always you're amazing and I love hearing the work you're doing but you know before we make a digital departure you got that I like this whole idea of the you know sentiment analysis it reminds me of like when people run a movie through like a visualizer and you can see the colors of the movie like what's the colors of your domain of one's own instance right like well one of the things I'm thinking about so like I didn't mention this but the web archiving script I have it does do a text it does end up with a JSON file of all the texts that it's crawled as well so you could just plug that into a sentiment analysis script which that requires you to know or I would require I don't know anything about like um something that you know like that's that's like machine learning stuff um but but I think that that would be interesting okay well again another great session John thank you so much for the work you're doing and I'm glad that you know OU has decided to deep provision so many accounts because we all benefit from learning how you do that because it's a question we get regularly so thank you yeah the code's there for everybody and then and then Taylor's gonna make it like all work better so eventually yeah thank you so much thanks again John and we will see you all the next session is why reclaim cloud right so stay tuned