Hi, welcome. My name's Robert Allen. I'm from Houghton Mifflin Harcourt. I suspect most of you haven't heard of us or don't know who we are, but we're a global publisher. Some of our works you might be familiar with are, let's see, The Lord of the Rings and all the Tolkien works. I'm pretty sure most of us are familiar with those. Life of Pi and several others. We're about a 185-year-old publishing company, which in itself brings a lot of challenges. Besides the trade side, the publishing of books and such, we work heavily in K-12 education: assessments, formative assessments. We have a large staff of learning architects. We deliver a lot of high-stakes assessments, things like that. We also do a lot of research. We work with schools to develop practices on various aspects of education, like intervention, when students fall behind, or accelerated students. That's where the professional services come in. We provide a lot of data to schools, comprehensive assessments, etc.

A few years ago, I guess about three years ago, we came into a crisis. The market's changing in education. I'm sure a lot of you have noticed that. A lot of different companies are coming into it. Innovation is accelerating in various areas. We found ourselves having difficulty innovating and developing products. That led us to recognize that we had a problem. A large group of people weren't really sure what that problem was. Some people ignored the problem, some people didn't. We had a lot of trouble getting consensus on what those problems were.

About three years ago, we started looking at that and trying to understand how we get these products delivered. Nothing was really getting delivered to the customers that needed it. We ended up recognizing that we have a portfolio of about 900 different applications. Most of them are different; however, almost all of them share about 60% of the same functionality. And we have a pool of between 400 and 600 engineers, which is a lot of teams, a lot of people building the same things over and over. We recognized that almost all of the development time was being spent doing the same thing every time. When that happens, we couldn't get anything done.

We also had situations where the operations side, what we call the technology group, couldn't get infrastructure delivered. I say months, but in some cases it was years. They would have a product that we needed to develop. We would need infrastructure to run those development cycles on. We couldn't even get access to it for over a year.

Around this time is when I joined the company. I was asked to look at it from a much smaller perspective, more from one or two teams instead of the five or six hundred engineers. I had come from a telecommunications background. I had already been looking at Mesos at the time, but it was very new. There were a lot of things missing, like containers. I think it was version 0.1 or something. It was pretty clear to me at the time that that was the direction we needed to go to orchestrate all these different applications. All these monolithic applications shared the same things, like authentication and logging. In most cases, teams didn't even have time to develop metrics so they could monitor properly. There was a culture of: if I can get it through dev and get it accepted in QA, I can just get it over the wall. It would go into legacy, into support mode, and someone else had to deal with it.
There was this culture of really just running away from the problem and giving it to somebody else, contractors in some cases, who were responsible for the health of it once it was delivered. It led to a lot of upset customers. It led to a lot of things just really not being good. It was bad. There's a few here that can attest to that. I say up to six months; there were literally cases that went as much as 18 months. By the time you would get the infrastructure, the product had already moved on. You're done. You've lost your opportunity.

Other things, like I said, the logging, all of these different aspects: it was clear that if we could narrow that down, we could let our engineers focus on what's core to the business. We're not an infrastructure and operations company. We're an education company. What we needed to do was let all of those hundreds of engineers and teams really focus on what was core to the business, which is just building the application, not necessarily worrying about the security, the authentication, the logging, the metrics, and really define what that was.

We came up with a series of goals. There was a process. We spent six months really talking about what we wanted to do. We wanted engineers to be empowered. To do that, they needed to own it from concept all the way to legacy. If you develop it and you don't develop it right, you're the one that's going to get woken up at night. You've got to fix it. We started enforcing that. That took a cultural shift. We had to really get people to understand that you can't make bad decisions. You have to really think about what you're doing.

We wanted continuous delivery. The release cycle on some of the apps we were delivering to customers was months or years. The fear was that the students and the teachers and educators consuming these products couldn't handle change. We had to really help people understand that there are ways to continually deliver product without imposing massive change on the consumers.

We had to stop preventing failure. We had pools of people, and we still do in some cases, whose responsibility was to prevent things from changing. It's hard to deliver a product when you've got a whole team of people saying you can't do that. We had to change that. We had to get people thinking that it's better to have small change sets, lots of them. When you do that, you're imposing less change and less potential failure. If you deliver a release with a thousand changes in it, you potentially have a thousand problems. What we wanted to get to, and we're still working in that direction, is having one or two changes going in maybe two or three times a day. In some places we have that, in some places we don't.

We wanted to decompose these services with segregation of responsibility. One service handles the authentication, one the authorization, and so on, and again, let those teams focus on that product and that product alone.

In November 2015, my team was chartered to solve this problem. The catch was, we had to deliver a solution in six months or less. That was another crisis moment. As a startup, which is where my background comes from, it's sort of easy to green-field solutions like this when you don't necessarily have a lot of customers. We have millions of students. We have a lot of constraints around data.
There are all sorts of regulations around the data that we deal with: the handling of that data, geographic restrictions on where that data can be handled or stored or processed, and all these things. We had to step back and look for ways to solve this problem. The team was formed, what we now call Bedrock Technical Services. The entire package of ideas, goals, et cetera, we call Bedrock. It's an okay name, but it probably could have been better. It's good, it works. We're currently a team of five people. At the time, it was challenging in the early months, to say the least.

We knew that technology alone can't solve this problem. We knew that we had to have a grassroots understanding throughout the company that this is the direction we're going, these are the problems we're going to solve, and we need to do this together. This is not something, and it never will be something, that you can solve from a command level. You can't tell 600 engineers and 50 or 60 teams of developers that you're going to do it this way, it has to be done this way. You have to get a consensus, or at least an understanding among all these people, that this is the direction we're going and these are the steps we're going to take. You have to be pragmatic about it all. The pragmatism is that we can't come out with a whole bunch of solutions at one time and change everybody's world overnight. We needed a way to provide a level of behavioral design in how the infrastructure is delivered. What we strive to do is make the right thing the easy thing. There are cases where you can't make the right thing easy, so you have to make the wrong thing hard. It turns out that's not as difficult as it sounds, because most of it's hard anyway.

We run Mesos with Aurora pretty much exclusively for our job orchestration. Like I said before, we're currently a team of five people plus myself. I try not to contribute too much because I'm more of the chaos monkey in things. Frustrates the team from time to time. We have approximately 30 individual teams contributing to that today. They are all empowered to do their own deliveries, their own job definitions, etc. We do that with Aurora. It's an ongoing process, because a lot of times engineers don't, I wouldn't say understand, they don't always take into consideration how their decisions impact the entire cluster. My team's job is to make Bedrock safe for engineers, not engineers safe for Bedrock. When you accept that and truly embrace that approach, it changes operations' approach to everything. That means you have to put governance in place, but not restrictive governance. You can't restrict what they're doing, but you have to make it safe for them to go in and not impact other consumers of the cluster. That's not always easy. There are a lot of challenges around that.

That's where we get into other aspects of the cluster. We also use Linkerd now. We originally started with Finagle, and we do still have Finagle in some places, because Aurora originated at Twitter, and they developed Finagle to work with ZooKeeper, the service discovery and group membership patterns, ServerSets, etc. That was sort of a nice fit.
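To make those per-team job definitions concrete, here's roughly what one looks like in Aurora's Python-based DSL. This is a minimal sketch; the cluster, role, service name, and resource numbers are made up for illustration, not one of our actual jobs:

```python
# hello.aurora -- hypothetical example, not one of our real job definitions
hello = Process(
    name = 'hello',
    cmdline = 'java -jar hello-service.jar --port {{thermos.ports[http]}}')

hello_task = Task(
    name = 'hello',
    processes = [hello],
    resources = Resources(cpu = 0.5, ram = 512*MB, disk = 256*MB))

jobs = [
    Service(
        cluster = 'bedrock',        # made-up cluster name
        environment = 'prod',
        role = 'identity',
        name = 'hello',
        instances = 2,
        task = hello_task,
        # announce in ZooKeeper ServerSets so Finagle/Linkerd can discover it
        announce = Announcer(primary_port = 'http'))
]
```

A team ships that with something like `aurora job create bedrock/identity/prod/hello hello.aurora`, and the scheduler keeps the instances running and announced for service discovery.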
As we started scaling out, we also ran into other issues with AWS. We're currently in AWS, which gets into the whole other crisis the company had when they started doing the lift and shift. You guys familiar with what that does to a company? There was an almost eight-figure panic that happened probably my second year at HMH. The budget was bad. It was a crisis. They had no idea what was coming, because cloud providers are like banks: if you don't have a plan for your money, don't worry, they have one for you. They will clean your wallet out quick. The assumption going in for a lot of the leadership at that time was that we go to the cloud and it's an automatic savings. But the reality is, and I think we all know this, that you can't just jump into the cloud without a plan. The cost saving comes in the ability to scale down and scale up.

The way HMH did this at the time, and to some degree we still do, was that every environment, every app, was copied seven times. So you have dev, cert, cert2, in some cases cert3. They had certification environments for the certification environment, and another certification environment. And then you might get to prod, and in some cases you had two prod stacks. When you repeat all of those things, your cost just grows. You can't control it.

So one of the main decisions we made, and this is what really upset a lot of people, was a conscious decision to go into this with Mesos and Aurora with only one environment for the infrastructure. We have two, but nothing's ever truly deployed in the dev environment; we test all of the infrastructure there. To us, regardless of whether the app is dev or production, the infrastructure it runs on is production. That really helps us make decisions and keep things stable for the teams. Now, there's the old-school mentality that you have to keep everything segregated and separate with networks and VLANs and all these things, and there was a lot of panic around that. And then containers: containers are bad, we can't do containers. It was a mess.

Originally we started with CloudFormation. It was great until it wasn't, and it didn't take long to get to where it wasn't. Lots of issues, and as we scaled out, applying updates with CloudFormation would cause all kinds of headaches, because sometimes it would lock up. Resources would get stuck; you couldn't get them torn down. In some parts of our stack it would take days to get a change applied, because we'd have to go through support and somebody at AWS would have to kill it. Everything about it was bad. Terraform came about and we adopted that, which led to keeping secrets somewhere else, which we've now started addressing with Vault. It's gotten much better. It's also opened the door for us to start looking at other providers to move some of these workloads to.

And then finally, we used Jenkins. Now, the one thing I think is almost universal, maybe not quite, is that everybody hates Jenkins. It works, but because the team was so small, we accepted the fact that some things we were just going to use, because we don't have time to develop or work on another solution. Jenkins was that for us, as well as Datadog and some of the more expensive things. Those were conscious decisions early on that, over time, we would work our way out of. We originally started with one monolithic Jenkins deployment using Operations Center, et cetera, that everybody deployed their jobs to. But then the other problem with Jenkins cropped up, where you have plug-in collisions, right? We realized that with 30 teams, we can't have 30 teams doing this. Because everybody wanted a different plug-in, the plug-ins were in conflict, Jenkins was down like every hour. It was a nightmare.
So we split it out. Now the teams manage their own Jenkins, and they manage it through Aurora. They run it just like any other job, and they deploy everything there. That has been a great improvement for everything.

Excuse me. So as we do this, we realize there's a lot we have to watch. We have a lot of expectations. We have all these guidelines. If you don't meet these guidelines, certain patterns, health checks, monitoring formats, logging, et cetera, you can't run there. You can't run your app in Bedrock, and you're stuck with the month or whatever trying to get infrastructure. That's making the wrong thing harder, and that's the concept of behavioral design: you want to run your app, and if you want to do it now, you have to do these things. And so we spend a lot of time inspecting this. We inspect how the apps are being deployed, how they're running, that they're providing the metrics they should, and, for the most part, doing the right things. It's important to realize that you have to constantly be monitoring and watching, not just the performance, but the entire environment, the culture.

This led to really embracing openness: being open about our failures, being open about what worked and what didn't. In a 200-year-old company, that's not an easy thing for people to do. A lot of people's instinct is: it broke, but we're not going to tell anybody. So my team had to act as an example of airing our laundry, so to speak. We've made a lot of mistakes, but we've done a lot of things right, and we've always learned from our mistakes, I think. I say "I think" because that's something that isn't judged until months or years down the road. Getting teams to do that was difficult at first, but it's gotten much easier over time. We use Slack, and Slack is very good for that. When we have outages, we always publish an RFO, a reason for outage. We try to outline exactly why it happened, what mitigations we took, and what our plans are for preventing it in the future. That whole communication chain is important.

You probably came in here thinking this was going to be an entirely technical talk, but it's not, because everybody likes to present technology as the solution to all your problems, and while it's a major part of it, it's important to understand that if you don't have a plan and you don't have a culture that goes with those changes, you're not going to be successful. It literally took two years in a large company like this to really get the momentum going to make these changes. Had it not been for that crisis, not being able to deliver the applications that needed to be delivered for that school year, we'd probably still be doing the same thing, and I might be somewhere else by now. I don't know, but it's been quite a learning experience.

And so the philosophical aspect of this is really: you have to empower the engineers, you have to set those engagements, you have to have accountability. All of these things come together, along with celebrating the successes and the failures and recognizing what is and isn't working, because they're all equally important. Thank you.
So really, I wanted to get through this quick so there would be time for questions, because we have a lot to share about this, but it really depends. I'm interested in what you are interested in and what we can share with you in more detail beyond this. Now, I'm very hard of hearing, so bear with me.

Yeah, so you've mentioned that it took around two years to actually do the whole grassroots change and such. The question is, though: being a 200-year-old company, I suppose you went from this kind of Ponzi-scheme management structure to maybe something else. Your Bedrock team has obviously been composed like a tribe, if you will, to use Spotify's naming. So did something else change in that department?

So that's a great question. Yes, there was a lot of change in management, and I would say it's safe to say that it was involuntary change for those managers. A lot of change, a lot of restructuring in the management. It really wasn't until the new management got in place that we started getting momentum for this. At the time it was just Ryan and I, the two of us, doing this and trying to build adoption momentum, and there was no acceptance of it. In fact, there was a lot of resistance, in some cases institutional resistance. It wasn't until management changed, and the new management knew there was nothing sustainable there and nothing that could really be saved, that we could throw it all out and start over, and we've done that. Now, because we are a 200-year-old company, we still have silos of different business units that have their own engineering groups, right? But management has continued to change up until recently, and in some ways we still have a lot of change going on there. It's through that, I think, that we'll see much more openness. Now, we've actually gone out as a team and sold what we're doing to internal business units to get them interested in it, and that's sort of where the marketing aspect comes in, the Bedrock name, the logos we created, things like that, to help sell it internally. And that's an important part. We also want the teams developing products internally to know that, even though externally our customers don't really know a service is called ORCID or whatever name we call it, internally these services are available for other parts of the business, and how they can adopt them. So that was a big part of it.

Very similar question to the person before. We are undertaking a very similar journey, actually, and I'm wondering: how did you manage to convince the product teams, as we call them, of "you build it, you run it"? That is what we are currently struggling with a lot. Product teams want to have the benefits, right? They want to be on the platform, they want to be agile, but on the other hand they have to take full ownership, support their applications, and own them to legacy. That's what I found really interesting on the slide. So if you could say some words on that.

So where we really managed to improve that was with one team that was working probably 80 hours a week trying to get their product out. That's the identity team, which is not an easy problem.
Identity and authorization and all that. They'd already gone through this for a whole year, and that's where this whole crisis started. So what we did, we actually went as a team, we were traveling to Dublin, and we would go in and sit down with them and work with them and help them through the learning hurdles of it all. I think one of the things you'll find is that most of your resistance comes from fear of change, and fear of change is derived from not having a full understanding of the solution. So what we managed to do was go out there, work with them, and show them how their life gets simpler doing this. And it's just persistence, it really is. You have to start small: start with one team or a small group of people, really get them on board, get them to fully understand it, and then let them be the agents of change. That's where your grassroots level comes from, right? So we worked with them. We trained them. We helped them understand the benefits of what they're doing. Once they understood that if you develop better software you have fewer support issues, then it becomes, well, maybe supporting it through the full life cycle is not as bad as we thought.

But that takes time, because first you have to clean out all the old problems that were always there. In some cases you have to change your whole development life cycle and how you develop the software itself. It's not one-size-fits-all. You can't just say: change how you deploy apps and your life's going to get better. No, you can change how you deploy apps, but if you're still building crap apps, you're still dealing with crap problems. So it's a matter of just iterating and working with them. In some cases we learned that what we thought was a good solution wasn't, so we have to listen to them. To us they are customers, and we support them through that. It's not adversarial; it can't be. Oftentimes there's a culture in operations, going way back, of being adversarial, and it's just not good. You have to throw all of that out, and you also have to work with your own team that's doing this. You have to teach the people you're working with that they have to change how they approach it. You have to work on your people skills. There are a lot of things to it. Like I said, it's philosophical and technical, but in reality the technical facets of solving these problems are very minimal. It's almost all a people problem. So you have to keep that in mind going forward. Does that answer your question?

Hi, short question: how did you solve your database problem?

Short question, short answer: we didn't. But no, seriously, we were very deliberate in what we do and don't do, and we do use RDS for a lot of the databases right now. It's not optimal, it's not great, and it's expensive. It really is, especially when teams want to run four- or five-node Aurora clusters, and that's Amazon Aurora, not Mesos Aurora, things like that. It gets expensive, but it's a trade-off. We use Terraform, and the engineering teams are expected to create their own Terraform pull request.
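In spirit, a request like that boils down to a database resource plus the tags we bill against. A minimal sketch; every name, size, and tag key here is made up for illustration, not our actual modules:

```hcl
# Hypothetical example of a team-submitted RDS request.
variable "db_password" {}   # injected at apply time (e.g. from Vault), never committed

resource "aws_db_instance" "identity_db" {
  identifier        = "identity-service-db"
  engine            = "postgres"
  instance_class    = "db.m4.large"
  allocated_storage = 100
  username          = "identity_app"
  password          = var.db_password

  # The cost accounting keys off tags like these.
  tags = {
    Team       = "identity"
    CostCenter = "bedrock"
  }
}
```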
We review the request, we deploy their database, and then the accounting's done with cost tags, et cetera. That's how we solve that problem. At this time, I don't want to take on putting databases in Aurora. I mean, it's bad enough that we run an Elasticsearch cluster, and I don't even know how many nodes it is right now, but it is the biggest pain to deal with. You see a lot of these presentations: oh yeah, you can spin up an Elasticsearch cluster in Mesos. Well, it's not that simple, not at scale. There's a lot of data that teams are putting in Elasticsearch, and we deal with a lot of problems with that. It's mostly shards. We have over 50,000 shards in our Elasticsearch cluster right now. That creates a nightmare all the time when it's trying to rebalance. So it's not a problem I want to take on, not yet. In the interim, for now, we will continue using RDS or whatever other database solutions external to Mesos there are.

So he did a follow-up question to mine; I'm doing a follow-up question to his, because that's how inception works. Regarding the culture change, I have an interesting situation, and this is actually a question about education as well. You said that you went to other teams and found these people who became your agents of change, your zealots if you will. You have taken somewhat of a Stallman approach: of course everyone should use them. So the question is: you have these very junior engineers coming into the team, engineers who are not taught to think about distributed systems, who are not taught to think about what will happen in the long term, who do not understand why they should log things and what the different log levels are, because, you know, why not just send everything to stdout? So the question is, how do you prepare these very junior engineers to think in this distributed-systems, failure-will-always-happen mindset, rather than: I do something in Node.js and it runs and it's so awesome?

So one of the things we do: my team is mostly in Chicago, right, but the majority of our engineers are in Dublin. So every quarter we go to Dublin for two weeks, and we do two-hour sessions twice a day for two weeks, and we cover things that are Bedrock-specific, we cover development practices, we talk with the teams about the CAP theorem and all these different things. It's something that you just have to reinforce. But probably one of the biggest things we've done is we've moved away from relying primarily on contractors coming in and doing a lot of the engineering, and so we're out recruiting, this is my plug now, we're out recruiting senior engineers, staff-level engineers, to actually come into these teams and act as mentors and provide guidance to the teams. I don't know if many of you have gone through the whole recruiting cycle on the hiring side of things. It's not easy. But we've made progress in that, right? We've got a lot of really good engineers in teams now, and they're providing that guidance, and that has made a huge improvement.
You just have to keep reinforcing it, and when we do see things that are sort of sideways with what we expect, that's where the pull requests come in. Everything that goes in, we review, whether it's ourselves or the engineers, and we provide that ongoing feedback. That's really how you have to do it. Part of the Bedrock principles we applied is that everything goes through code review now. That wasn't always the case, which to me seems extremely odd, but that's just the way it was. They were committing straight to the master branch, or actually Subversion at the time. You can imagine, yeah. That's really it. You just have to keep putting strong people on the teams to act as those mentors and guides.

So how did you get there in the first place? Because you strike me as a very progressive, vanguard technologist voluntarily signing up with a 200-year-old dinosaur of a company, and that's the disconnect for me.

Good question. So four or five years ago, and this is an interesting story, my previous boss from the last place I worked, a startup, ended up there, I think mostly because it was close to his home. It took him almost a year and a half to get me hired, to get me through the 200-year-old company's hiring process. But he was persistent, and he and I are good friends. I liked working with him as a boss, and that's really what helped get me in there. Had it been anyone else, there's no way it would have happened. I don't know about you guys, but the whole interview process, doing the resume and all that, it's just not my thing. I hate doing it. It's a huge pain. So that sort of made it easy. It worked out, but had it been anyone else, no, I wouldn't be there. I probably would have left Chicago, which is okay. I'm from Texas, so Chicago is not my home place. I probably would have gone west or something. I don't know, but it worked out. That's how I got there. Now, it's important to note that that whole year-and-a-half thing has gone away. It's not perfect. There's still a bit of lead time, and we have trouble competing with startups, which are much quicker at executing hiring than we are, but we're working to improve that. The way we hire for jobs has also changed quite a bit. If you're interested and you look at careers.hmco.com, you may see that there are some, well, modern-looking job descriptions in there. Four of which are for my team, if anybody's interested. But anyway, yeah.

So what were those wrong things that you made harder for the teams to do?

The first one that comes to mind is that we started out trying to use Graphite to collect all of our metrics. It was very difficult for a small team to manage at that scale. That's when we pulled the plug and went to Datadog. Datadog is kind of pricey at scale, though, right? So now we're running InfluxDB and Kapacitor, and we're actually working on that now. We're close; I'm hoping by Q1 we'll be off of that and be completely self-monitored and alerting. Some of the other things: we didn't start out with containers per se. We tried to run a lot of the JVM things natively, using the basic Mesos containerizer. That didn't work out so well, because all the teams have different requirements for their JVM, heap management, all that.
It just wasn't working too well, and it was hard for us to support. For a small team, again, trying to support all the different engineers' requirements natively on the Mesos agents was impossible. That's when we really adopted containers. So everything that goes in has to go in some form of a container, and it has to manage its own requirements. Which then led to the other problem, the next wrong thing: how do you take a lot of people that have no ops experience, even at the container level? You still have to have some understanding of operations: putting in the right packages, security, and all these different things. That's gotten much better. What we ended up doing was creating base containers that everybody uses, to simplify that for them and provide constraints around it.

The other thing was relying on our EC2 instance profiles for credentials, et cetera, without having some form of segregation of those credentials. That's where Vault has come in for us, to manage those credentials better. And in some cases we ended up segregating based on auto-scaling groups. That's helped a lot.

And then finally, probably the biggest one was thinking that we could attach 300-plus load balancers to a single auto-scaling group in EC2. Turns out they have a soft limit of, like, 50. Anything more than 50 ELBs on an auto-scaling group, they don't usually do. We've managed to twist their arm, because we spend money, to give us more. We're at, like, 300, but they're not giving us any more, which is why we're throwing all the ELBs out, probably very soon, actually. Patrick's been working on that a lot with Linkerd and HAProxy. It turns out that's going to save us a few thousand dollars a year just in itself, and we'll get a lot more performance out of it. But don't assume the AWS limits are infinite. They're not, nor are any other provider's, for that matter. That was probably the biggest one. That one's actually got us in a pinch now, because we're limited at 300, and I think we're at, like, 295 or something. We're always pulling older apps out so that we can make room for production apps until we can get our Linkerd solution in place.

That's about all the time we have for questions. We have a break outside, so you guys can chat further. Several of the team are here: Sampa, Ryan, Patrick, and myself. If you see us, feel free to stop by and ask us, and we'll be happy to talk about it. Thank you.