Hello everyone, and welcome to this talk about FinOps and observability. I first talked about this last November here in Brno, and the title with FinOps in it wasn't exactly popular then either, so this time it's the same, which is cool. I'm first going to reflect a bit on that November talk about FinOps, and then we will move on to some more technical details of the presentation. This first part was already covered in November, so I will go through it a little faster. The whole engagement started when my friend Andy Thompson and I joined a project with a large company that needed us to sort out their FinOps practice. Back then I was a DevOps engineer, I didn't have any actual FinOps experience, so Andy and I started exploring what FinOps is, what it means for a company, and how to do it properly. At the same time, they had a first task for us: the biggest cost they experienced was related to EBS volumes. Everything was happening on AWS, and the overall monthly spend was anywhere between five and seven million US dollars, and a big pile of that money went to EBS volumes. So, as engineers, we started exploring what was going on with those EBS volumes. We looked into the usage patterns, how teams were using EBS volumes and what they did with them. But at the same time, as I said, we were exploring FinOps itself: what it is, how it is managed, how it is defined, and what we should do as the new DevOps-slash-FinOps team. Working with them was really a great experience, because we were able to see firsthand how engineers in this large company were struggling with being in the cloud yet not being able to use the cloud the way they wanted.
The promise of the cloud is that you get your resources when you need them, how you need them, configured the way you want. That wasn't happening for this enterprise, because it had strict rules about how to use cloud resources, how to manage them, and when to dispose of them. That was frustrating for engineers. At the same time, as a team we were engaged in a lot of talks with management, so we could tell that management was struggling with similar issues: they were looking for ways to keep engineers from overspending in the cloud. So it is a clash. On one side you've got engineers who want to build in the cloud, experiment, and try out new things. On the other side you've got management, which is keen to control the spend, which is natural. Both positions make sense. While preparing for all this and exploring EBS and FinOps, obviously FinOps.org is the place to go if you want to learn about FinOps. And there is a beautiful definition on the FinOps.org site, which is almost completely inaccurate. It is really nice when you are in management and want to explain how FinOps is the next big thing: many parts of the company working together towards the common goal of making the world a better place, spending less money, everyone happy and collaborating. But in reality, what we found after some exploration is this: FinOps is purely about saving money in the cloud. The whole talk about changing the culture of an enterprise so that every engineer would think about cost did not happen. At least I didn't see it, and I was on that project for three years. And the reason I didn't see it was the miscommunication between management and the engineering teams.
That actually guided my colleague Andy and me into building a system that we hoped would help facilitate this communication, because that was the biggest challenge and the biggest disappointment. On both sides you've got people who are willing to do something good for the company; it's just that they're talking different languages and have different goals, and that was always the problem. So, going back to engineering — during this talk I will jump between the engineering side of things and the management side. What you see here is a flat, simple model that we used to analyze what was going on with EBS volumes. We said, okay, we want to rely on AWS giving us the data, so we need pricing from AWS. Then we want some metrics about how EBS is used. If you want to show that an EBS volume is unused — no data going to or from it — you want to support that claim with actual facts. If you can pull metrics and prove to your engineering teams that not a byte has gone to or from the EBS volume for some amount of time, that should be good enough for them to accept that this volume is actually, physically unused. There were also all kinds of enterprise-specific conditions for declaring something used or unused. For instance, they had a lot of AWS accounts that were not labeled as production accounts but were used as production accounts. How this happens is another story. The short version: a team would push a new service and find themselves a client. They wouldn't get approval from the company to create a production account to support this client, but they had a client, the client wanted to pay, they asked management, and management said yes, onboard the client.
So they onboarded the client on the development account, and they just kept rolling new clients onto this dev account. At the end, you've got a dev account supporting production. And there are all kinds of things you find when you dig deep into these accounts. We were in contact with around 400 different teams running all kinds of projects in this enterprise, so we had to come up with custom rules that would help us control the whole thing. On the other side, we would collect the data, apply AWS pricing to it, apply our rules and some logic around that, and we were able to generate two types of results. This was very important for us. One type of result would go to engineering teams: we would meet with the team and present the data. We would say, this is what we have found about your EBS volume usage. This is the list of EBS volumes; these are the ones that are not used; these are the ones that are not used properly; and these are the ones that are already optimized — those you don't need to touch. And we would support all of this with actual data. The teams would get back to us, and we would get into all kinds of heated discussions about what was wrong with the EBS volumes. You know these stories: "We can't touch this EBS volume." Why? "A colleague of ours created that volume and left the company, and we don't know why this volume exists in the first place." And you would say, look, not a byte has gone to or from that volume for three months, so maybe it's safe to delete it, or at least to snapshot it and then delete it. But they would be reluctant, because they would say, well, we've got contracts and SLAs with customers, and we would rather keep paying for that volume, even though we don't use it, than risk some service outage.
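The "unused volume" rule described above can be sketched as a small pure function. This is an illustrative reconstruction, not the project's actual code: the field names, the 90-day window, and the one-datapoint-per-day shape are assumptions; in practice the numbers would come from CloudWatch's per-volume read/write byte metrics.

```python
from dataclasses import dataclass

@dataclass
class VolumeActivity:
    volume_id: str
    read_bytes: list[float]    # one datapoint per day over the window
    write_bytes: list[float]

def is_unused(activity: VolumeActivity, min_days: int = 90) -> bool:
    """True if the volume saw zero traffic for at least min_days datapoints."""
    if len(activity.read_bytes) < min_days or len(activity.write_bytes) < min_days:
        return False  # not enough evidence to support the claim
    return all(b == 0 for b in activity.read_bytes) and \
           all(b == 0 for b in activity.write_bytes)

# Hypothetical volumes: one fully idle, one with a single read near the end.
idle = VolumeActivity("vol-0abc", [0.0] * 90, [0.0] * 90)
busy = VolumeActivity("vol-0def", [0.0] * 89 + [1024.0], [0.0] * 90)
print(is_unused(idle))  # True
print(is_unused(busy))  # False
```

The point of keeping the rule this explicit is exactly the argument from the talk: when a team pushes back, you can show the data behind the flag.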
So again, these are real-life stories, not something we read in books about how to do this. One fun fact: management didn't like this diagram. They preferred this one. It is the same thing, but with colors and different shapes it looks a little more modern, and it resonated well with management. When we had to present to teams what we do, we would use the simple model, and engineering teams responded well — it is a very simple design. But when we went to talk with management, the colorful one was the winner, because then we could say, yes, something magical happens here, we process some data, and then we generate some results. Apart from the first report, which was a recommendation for engineers, we generated a second report that went to management, where we presented recommendations at the company level. These recommendations read like this: there is an opportunity to save X dollars per month if you are willing to perform these and these activities. We had already talked to the teams and collected their estimates of how much time they would need to implement the changes. Usually that would be something like: open the Terraform file, change something, apply the changes, and that's it. But we went through the conversations and collected the data on how long these changes would take. Then we would prepare a report for management saying: if you want to save this amount of money, it is possible; you need to give the engineering teams the resources to spend this amount of time executing these changes. Now management actually had enough data to make a decision. It was palpable, not tacit anymore: you could anticipate how much time and effort you would need, and what kind of money you would save.
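The management report described above is, at its core, a simple aggregation: pair each potential saving with the team's own effort estimate. A minimal sketch, with all figures and field names invented for illustration:

```python
def build_report(findings):
    """findings: list of dicts with 'team', 'monthly_saving_usd', and
    'effort_hours' (the team's own estimate to apply the change)."""
    total_saving = sum(f["monthly_saving_usd"] for f in findings)
    total_effort = sum(f["effort_hours"] for f in findings)
    return {
        "total_monthly_saving_usd": total_saving,
        "total_effort_hours": total_effort,
        # biggest opportunities first, so management sees them at the top
        "items": sorted(findings, key=lambda f: -f["monthly_saving_usd"]),
    }

report = build_report([
    {"team": "payments", "monthly_saving_usd": 12000, "effort_hours": 16},
    {"team": "search",   "monthly_saving_usd": 4500,  "effort_hours": 4},
])
print(report["total_monthly_saving_usd"])  # 16500
print(report["total_effort_hours"])        # 20
```

The value is not in the arithmetic but in the pairing: "save X per month for Y engineering hours" is a statement management can actually decide on.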
So again — and I love to show these people in talks — this is Niklaus Wirth. Wirth is best known as the father of the Pascal programming language, but some of his work goes well beyond that. I personally like his paper "A Plea for Lean Software" from 1995. It's a five-page paper that explains the benefits of a simple design. You can find more modern books about this; David Farley has Modern Software Engineering, which is a great treatment of how you simplify a design, how you start from a simple thing, add to it, and it becomes more and more useful. The whole approach is: don't try to be modern, don't do tech for tech's sake, but build tech that is actually useful, that deals with some real problem. So this is his influence; this is how we built the models. After the initial success of this approach, the company said — as one of my friends likes to say — this is all fine and dandy, but can you do it for some other resources? We've got DynamoDB tables, we've got RDS instances, we've got EC2 instances running around, and we have no idea how to save money on those. And again, my apologies to all the people dealing with the actual software built for metrics and observability in the usual sense of the word. This is a different kind of observability that we are doing. We are not digging into how many processes run on certain instances or how much memory or CPU they use. We wanted to be in the middle, between engineering and what can be good for the company. So we said, okay, if we are going to do this for EC2 or, say, DynamoDB, we would like to build a framework.
We would like to build something reusable that we can extend, where we can just add new pieces into play. So we would have an engine, and we would, you could say, add plugins to this engine: the engine would become aware of how to analyze DynamoDB tables, and so on. So we started simple. One of the colleagues in the previous talks talked about ECS; it is the simplest form of running containers on AWS that you can still orchestrate a little. What we wanted to achieve here — now into the technical stuff — was to minimize the need for orchestration and increase choreography. We wanted a lot of independent services running around that did not need to be orchestrated. If it comes to orchestration, we will deal with it, but until we have to, we would like to avoid it. So we said, okay, let's treat this as a microservices architecture. Let's put the EBS analysis engine in one container. Then we put another one in a second container to analyze DynamoDB. A third one would analyze, say, RDS instances, and so on. We kept adding services following the same pattern. Every service would have a different model, but the model would contain the same pieces: get me AWS pricing, get me the metrics I need to assess whether this resource is used, and some custom parameters — like the question I mentioned of whether an account is a production account, because some rules cannot be applied to production accounts. Again, bear in mind that this is different from observability as it is usually explained in the monitoring world; this is a high level of observability, let's call it that. We would rely on AWS to give us data. So we put these containers in place. Then we needed to somehow start them.
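The engine-plus-plugins idea above can be sketched like this. The registry, class names, and numbers are all illustrative assumptions — the real system's API is not shown in the talk — but the shape is the one described: every plugin answers the same two questions (fetch metrics, apply rules against pricing), and the engine just iterates over whatever plugins are registered.

```python
class Analyzer:
    """Base plugin: one per AWS service, all following the same pattern."""
    service = "base"
    def fetch_metrics(self, account): raise NotImplementedError
    def apply_rules(self, metrics, pricing): raise NotImplementedError

REGISTRY = {}

def register(cls):
    REGISTRY[cls.service] = cls
    return cls

@register
class EbsAnalyzer(Analyzer):
    service = "ebs"
    def fetch_metrics(self, account):
        return {"unused_gb": 500}          # stand-in for CloudWatch data
    def apply_rules(self, metrics, pricing):
        return {"monthly_saving_usd": metrics["unused_gb"] * pricing}

@register
class DynamoAnalyzer(Analyzer):
    service = "dynamodb"
    def fetch_metrics(self, account):
        return {"overprovisioned_units": 1000}
    def apply_rules(self, metrics, pricing):
        return {"monthly_saving_usd": metrics["overprovisioned_units"] * pricing}

def run_all(account, pricing_per_service):
    """The 'engine': run every registered plugin against one account."""
    out = {}
    for name, cls in REGISTRY.items():
        plugin = cls()
        out[name] = plugin.apply_rules(plugin.fetch_metrics(account),
                                       pricing_per_service[name])
    return out

results = run_all("123456789012", {"ebs": 0.08, "dynamodb": 0.01})
print(sorted(results))  # ['dynamodb', 'ebs']
```

Adding coverage for a new service then means writing one new class and registering it; the engine does not change, which is exactly the extensibility goal stated above.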
The idea was to run them every day, collecting the data, and once a month generate the reports that we would push to the teams targeted for optimization and to management, so they could decide whether to invest resources into optimizing the infrastructure and, eventually, saving money. Since we didn't have a lot of orchestration, we kept it simple: we created timed events. In the morning these events would schedule the containers; they would spin up and do their work. It was serverless, Fargate-backed, so all of that was fine. After that, we figured out that a lot of these containers would go and fetch prices from AWS, and we would need some secrets stored somewhere in AWS. Somewhere around this point we started thinking: how about we make everything idempotent and serverless? Idempotent means I can run it many times a day and generate the same results, without messing up my previous data. And serverless — well, events, ECS, Fargate containers, all of that still counts as serverless, so that was cool. Then we proceeded: okay, we want to go through accounts and analyze what is in them. We need permissions; we need some account-roaming machinery that goes around. Around this point we started thinking that maybe this microservices organization was not the most optimal one. Could we factor out the piece of code that goes through all these accounts, have it hand us a token, and continue from there? That is doable as well. In our case, once we put it in motion, it wasn't a good use of our time. Even though it is elegant in a code and architecture sense, it wasn't elegant in terms of what we would have to maintain.
And we always had in mind that at some point we would hand this work over to the team that would maintain it, so we wanted to keep it simple and follow that approach. So we said no: every container will have its own engine that roams through accounts, so we could separate them even more and avoid the orchestration. That was another reason, because a shared roamer would introduce a piece of orchestration and we couldn't scale the services independently. So we kept the isolation. Then we said, okay, since we want to maintain this serverless approach, let's write some of the data to Aurora Serverless. Still idempotent: data written to the database is not kept longer than a day, because we don't need it longer than a day. If I rerun a task, it cleans the data for that task and regenerates it, so the data is fresh for that run. That is how we could keep the serverless approach; we didn't need to keep this data for any longer. Also, obviously, on AWS you need to log things and send information around. And this is the final picture. In the end, the only piece of orchestration we had to introduce was this: once all the tasks finish, we push the data into an S3 bucket through a reporting mechanism. From that point on, the data is picked up by Power BI or Tableau tools, or loaded into Snowflake, or reorganized and processed further. At that point the company asked: can this be used for something else? That something else we didn't do, because the contract was expiring and they didn't want to go through with the whole idea. But the question was: can we use this approach to make the company more advanced on the market? How do you do that? Well, you make your company serverless, or at least use more serverless services.
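The delete-then-regenerate pattern described above is what makes each daily run idempotent. A minimal sketch, using the standard library's sqlite3 as a stand-in for Aurora Serverless (the schema and table name are invented for illustration):

```python
import sqlite3

def run_task(conn, run_date, rows):
    """Idempotent daily task: wipe this run's rows, then rewrite them,
    so any number of reruns for the same date yields the same state."""
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS ebs_findings
                   (run_date TEXT, volume_id TEXT, unused INTEGER)""")
    cur.execute("DELETE FROM ebs_findings WHERE run_date = ?", (run_date,))
    cur.executemany("INSERT INTO ebs_findings VALUES (?, ?, ?)",
                    [(run_date, vol, int(unused)) for vol, unused in rows])
    conn.commit()

conn = sqlite3.connect(":memory:")
findings = [("vol-0abc", True), ("vol-0def", False)]
run_task(conn, "2024-05-01", findings)
run_task(conn, "2024-05-01", findings)  # rerun the same day: no duplicates
count = conn.execute("SELECT COUNT(*) FROM ebs_findings").fetchone()[0]
print(count)  # 2
```

Because yesterday's rows are never needed, the database stays small and disposable, which is what lets the whole pipeline remain serverless.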
This company used a lot of EC2 instances, a lot of monoliths running around. So if you want to be a more modern, more competitive company in the marketplace, from the tech side of things as we see it, you should approach the way you build software a little differently. This is the slide we prepared for management when we went into a meeting. We said: okay, if management wants to be innovative and competitive, and if tech has a strategy to support this. That means: I as a company want to be competitive, cool; what can engineering do? We can build serverless, cool. So we've got two ifs. Then: score teams by their current level of serverless adoption. And finally, management needs to approve the resources to re-engineer, re-architect, and rewrite these applications. This is how it goes. And the final point is this. When we went into the meeting and discussed all of it, you know what it turned out? The company was more keen on saving money than on becoming competitive, because the company was already dominant in the market. Their idea of being competitive was: if we can create a bigger gap between us and the competitors without a lot of effort, that's cool. But then you introduce them to how this works in an engineering mind: you tell us what you want; we propose how we can assist you on the journey; we give you the estimate of how much time, resources, money, and people we need to make it happen; then you say yes, we want to make it happen, and you follow it through. And you can support this approach by using this tool to provide a proper analysis of how serverless each team is, and which teams would be capable of following the whole thing through.
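The "score teams by serverless adoption" step above can be made concrete with a trivial metric: the fraction of a team's resources that are serverless. The service categories and team data here are illustrative assumptions, not the talk's actual scoring model.

```python
# Services we count as serverless for this sketch (an assumption).
SERVERLESS = {"lambda", "fargate", "aurora-serverless", "dynamodb", "s3"}

def adoption_score(resources):
    """resources: list of service names a team runs; returns 0.0..1.0."""
    if not resources:
        return 0.0
    serverless = sum(1 for r in resources if r in SERVERLESS)
    return serverless / len(resources)

teams = {
    "payments": ["ec2", "ec2", "rds", "s3"],            # mostly monolith
    "search":   ["lambda", "dynamodb", "fargate", "s3"],  # fully serverless
}
scores = {team: adoption_score(r) for team, r in teams.items()}
print(scores["search"])    # 1.0
print(scores["payments"])  # 0.25
```

Even a crude score like this is enough for the management conversation described above: it tells you which teams need re-architecture budget and which are already there.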
And this — for those of you around my age, let's say — you remember the Hitchhiker's Guide to the Galaxy and the Babel fish; that was what we had in our heads. You create a tool that is a good tool for communication between engineers and management, who usually don't know how to communicate. I'm sure all of you have been in situations where you wanted to talk with management, but it doesn't really go both ways. So providing this tool felt like providing the Babel fish. On one side, we were able to explain to management what can be achieved with a certain amount of resources. On the other side, we were able to tell the engineering teams what needs to be done, what should be done, and why. Unfortunately for us, they didn't go through with the second part of the idea, modernizing the approach and building something better on top. But the first part was good enough. So the Babel fish, in our case, really worked for the first piece, and that was very much thanks to Niklaus Wirth and the whole idea of keeping it simple. There is one thing I want to mention at the end. It is really tough to build software that you believe will create benefit for the company without introducing some nice and shiny technical things you can brag about, because we as engineers like to be proud of what we do. And if you're not using the latest and greatest — you didn't see Kubernetes in this presentation; you didn't see any actually fancy stuff. It was simple, plain engineering that did some good for the company. But I have to say, it is not easy to defend this approach in front of a purely engineering audience. You get all kinds of questions: why didn't you organize your microservices differently, why didn't you use this AWS service or that AWS service? And it is really hard to defend when you have to offer an explanation like this.
Okay: we knew which team was going to continue this work after us. We were there for a limited amount of time. We wanted to build an application with longevity, one that could keep operating after we were done and that could be successfully managed by the team we handed it to. That influenced a lot of the decisions we made during the architectural phase. If you're building a product or a service that you want to offer commercially to someone else, that's a different thing; you're building it with a different goal in mind. This was made specifically for this company, with the idea that they would then take over the service from us and continue using it — and it really saved a lot of money. Well, it's not that difficult to save money when the company spends five to seven million dollars per month, but still, it was fine, it was cool. In the end, I think the last year went roughly like this: the deal was that if we, Andy and I, didn't save more money than we cost as a team, the company wouldn't have to pay for us at all. And that was a really easy sell to the management panel. But I want to re-emphasize this one more time, and it is the end: if the company has a strategy, if the company is able to communicate that strategy, and if the company is able to stick with it, then the tech people will follow. Otherwise, you will have all kinds of clashes between the management side and the engineering side. And that is it. Once again, thanks to Niklaus Wirth for his appearance. Okay, questions. Yes. Oh, let me repeat the question.
Okay, so the question was: I started as a DevOps engineer, then went into FinOps, then worked on establishing communication between the engineering side and the management side — was there enough management knowledge in the company to ride on this experience? I would say yes. The first piece of the puzzle we put in place is nice proof of that. The company management really went along with the approach. The EBS work was the key — the key we used to unlock the company. When we proved that we could give them an estimate of the money they could save if they facilitated what the estimate required, and when the company actually had results to look at, that was the moment the company said: okay, this can be done. But that was the piece of work that followed the company's strategy: we want to save money, and that's cool. The moment we got to the second part of the puzzle and said, okay, you want to innovate, you want to be more competitive — that was the moment the company had to say no, we are not really into that. Which is fine. The bad thing is if you as a company want to propagate this idea but in reality don't want to follow through: you want to save money and you want to look good, so you talk about these things because they sound good and a lot of people can align with the ideas — but not in reality. So my answer is yes, because it was properly aligned with the current company strategy, and this is why it worked. It was still amazing how these two, I would say, gangs in the company didn't communicate well. Everyone wanted their own thing. If you talked with engineers, they would moan about not being able to use the cloud freely, to create new clusters, to do stuff.
They would moan if someone warned them that they needed to clean up after themselves, and all those things. On the other side, management only wanted control. You know what they did? They imposed strict policies on how much money you could spend. And you know how engineers responded? They said: okay, we will build something below that limit, and as long as we are below it, no one will bother us. And within that threshold they could do all kinds of things, and not a lot of them were good. While analyzing the estate, we discovered that a lot of accounts below the budget limit were very much not optimized, because the teams didn't care: they cared only about doing the daily work and being done with it, because they were below the radar. The company didn't even notice these accounts, because they were below the limit. Some manager leading these teams was a good enough fighter, I would say, to go into a meeting with upper management, fight for a good budget, and then come back with the spoils and say: hey, I fought for this budget, we have enough, now you go and play. These teams were really happy, but still not optimized. This tool actually helped the company analyze and isolate these accounts as well. So again, back to the question: I think yes, because of this. The strategy in place was correct — it was about saving money — the engineering was able to follow, and the model was simple and easy to translate. A lot of the code was written in Python, some of it in Node.js, both very popular languages, so it was really easy to bring new engineers into the game, and they picked things up pretty fast: small, isolated pieces of code, a relatively simple approach. Yes — why didn't we use... accounts management? Cost management, yeah. That is an absolutely good question.
The problem with that approach was that they were using it before us, and properly investigating 400 different accounts using AWS cost management required them to spend a lot of time going through it over and over. In a company that big, things change on a daily level, not weekly or monthly: drastic changes up and down in the use cases, the usage, the patterns, all of it. They wanted something that would automate the whole thing. Cost management is a good approach, but it also fails to fill in the gaps around the custom stuff. Like I said, they would have a production load running on a dev account, and you can't just undo that load — you can't just take it out and move it to a production account. A lot of these gotchas — for instance, they wanted us to introduce something they called the Big Red Button. You know what the Big Red Button is? They wanted a UI with a big red button they would click on the 15th of December, and it would shut down all kinds of resources throughout the company, because they would go into low-usage mode until, I don't know, the 10th of January. That was also part of the whole activity. But we started with one idea, and ten days later we had all kinds of exceptions to that rule, because again, in a large organization a lot of managers were able to fight their way to an exception. So we had to handle those. A lot of these AWS solutions are really good, and you can use them for this as well, but when you go into an enterprise with so many different tweaks and special cases, if you are not willing to dig in and build something custom, it is not really going to work the way they expect. In the end, we were using the same underlying machinery as cost management: we would execute API calls to the AWS pricing API, and CloudWatch calls to get the metrics about EBS volumes.
So underneath, it's all the same stuff; it just wasn't really usable for them at the company level. They had tried two or three specialized FinOps products for this — products fully baked with all kinds of reports and all kinds of things you can think of. The answer would always be: tag everything properly and we will handle the rest for you. "Tag everything properly" does not go well with that kind of company. Tag everything properly with two teams, three teams, five teams — that's really cool. But think about 400 teams you have to go to. We also had a situation where we went to a team and said: hey, you need to tag things properly so the tool can pick things up from the account. And they said to us, and I quote: when you come and do our job every day, when you also have to manage the existing infrastructure and deliver new features in this amount of time, then you can come and tell us to implement this. And they would just refuse. They knew they were below the budget limit, they were bringing money to the company, and they would just say: no, not interested. And that's the reality. It's one thing when we talk about it, but this is the real stuff out in the field. So thank you for your time. I hope you enjoyed it, and see you next year in Brno with something else.