Okay, so next we have Alolita talking about observability considerations for your infrastructure cost monitoring.
All right. Thanks, Richard. Hi, everyone. I know that I am holding you back between the lightning talks and my talk, but thank you, everyone, for your patience. I am going to dig into observability considerations for cloud cost optimization today.
A little bit about myself. I'm Alolita Sharma. I am a co-chair of the Observability TAG, a technical advisory group in the CNCF. I also work on OpenTelemetry, and I'm on the governance committee of the OpenTelemetry project that you've been hearing all about. I also lead AI/ML observability at Apple.
So with that said, I'd like to ask folks: how many of you are users of cloud infrastructure? I'm guessing many folks. All right, so this is something that I think is near and dear to many of our hearts, both from a user perspective and from a vendor perspective. What I wanted to do today was have a discussion-oriented talk. I know we are a bit short on time, but what I did want to call out is that there have been some talks on FinOps earlier in the day, and many of you work on cloud infrastructure as well as building applications on cloud infrastructure. Most enterprises, big or small, leverage cloud infrastructure today for running global applications, right?
Whether that's services, or middleware services, or being able to provide data to your applications, cloud infrastructure forms the baseline for all the scalability, reliability, and performance that we look at from a systems point of view.
We use observability, as many of you are here for today, to understand system behavior well and also to understand the health of the service end to end, both at the application layer and at the infrastructure layer, in real time. Because there are many implementations that have been done over time which are not really real time, right? They have aggregate data. We look at data for days. We often also go in and analyze after collection. But in this case I'm talking about real-time observability. As time goes by, we operate very large services at scale, and operational planning as well as running cloud-based applications cannot be done without real-time observability, right? Both on the infrastructure and on the application layer.
Observability, again, helps us build operational resilience. Many of us are looking at resilience at different layers. How do we provide HA, high availability, across zones, across regions, across different configurations? Availability as well as performance of applications, right? So again, all of these areas go in lockstep, and many of us sitting here are very familiar with the details of how to operate such large-scale systems.
So enter a new area that is intersecting with the observability space. It has been intersecting for a while, but I think that today it's far more pronounced and has really grown and evolved in the space of observability, where you see that a lot of the financial planning, the forecasting, and the analysis for understanding public cloud costs has become more and more often a scenario and a use case for observability, right?
We're coming from systems. We think and breathe systems. But here we are also now thinking about how that intersects with cost, both from an engineering perspective and from a financial perspective. So today what I want to do is focus this talk on how observability intersects with cost, and on the underpinnings that observability is now providing for financial planning.
I do think that in today's world, at the scale at which we operate our services globally, operational and financial planning need to happen in lockstep. That is, you cannot operate your services without thinking about financial planning at the same time. This also means that public cloud providers, who are an integral part of the fabric of the services that are being run, are also key stakeholders and need to provide cost-effective services which really scale on demand for the user.
So I want you to think about the different stakeholders in the equation, right? You have the operation of the systems and the design of systems. We're all familiar with that as engineers. But then you also have the CFO's office, the financial teams, coming in and knocking on your door and saying, hey, how much is this stuff costing? And why are we getting all these massive bills when we really don't have the kind of breakdown that we should have? Right? With any cloud provider that you're using, do they give you what you need? That's a good question to ask.
So if you don't have observability data today, fully instrumented and fully collected for all your application services as well as all your infrastructure, the first question to ask is about your cloud cost analysis and financial planning.
Is it really able to leverage the cloud resource usage data that you're getting from your cloud providers? And cloud providers can be public cloud. You could be running hybrid infrastructure. You could be running all different kinds of configurations. But you really have to question whether the data that you're getting for usage information of resources is really adequate, right? The other part is: is your provider providing you enough data, granular enough, with the additional organizational detail that you may need to clearly account for that cost into projects, or into different domains, or different users, or different cost centers? Again, there's a lot of complexity that goes into the financial space.
So often what ends up happening is that financial planning becomes very much a guesstimate, right? Because you really don't have the kind of detail that you would need to have for resource usage as well as other dimensions. It really becomes a guesstimate analysis, which can very quickly lead to unmanaged cloud costs, to inefficiency in the way you run your systems and in the way you configure your infrastructure, to waste, and hence to organizational overspend. Because if you don't have visibility, which is what observability is all about, then how do you actually get to the next step? And often it is actually many teams just getting up and trying to pull data together.
They do that after the usage has occurred, trying to compile that information back together.
So fast forward to observability and what happens if you do have the data. In a perfect world, you have all the instrumentation, you have the data, and you have your infrastructure emitting the granular detail and the types of data that you need for your infrastructure as well as for your applications. Observability becomes very useful because you can take your resource usage data, you can take your real-time telemetry data collection, and apply correlation and ML trend analysis. That's where ML is becoming more and more useful, and I'm not talking about buzzwords here. It is actually very useful at scale, and it results in a very accurate, very granular, data-driven understanding of the resource utilization and spend patterns that your organization may have, right? So adding observability is imperative for getting accurate cost analysis.
The continuous payoff for your organization is that you have actionable, real-time, accurate financial and operational planning, which really leads to cost optimization for your cloud spend. And in the long run, what every business is after is how to reduce risk in financial planning, so that you can sustain any kind of changes in the economy, for example.
So I want to take this in the mode where I said there are multiple stakeholders in this game. It's not really only engineering.
It's not only application development and developers, but it's also the CFO's office, often financial teams that are operating across the company. What does the CFO's office come back and ask you about? What do they need? As a systems engineering team, as a platform engineering team, as a team that's operating infrastructure, or as developers who are building applications on the cloud, you really need to be very much in lockstep with the CFO's office.
So if the CFO's office were to come back to a systems engineering team and say, hey, what do I need? This is the list of items I need from you at a minimum. One is telemetry data. Even though we have loads and loads of telemetry data, terabytes and petabytes, sometimes it's not enough, because it may not be the right type of data that you need for cost analysis. So: more granular data, and more data that supports streaming at scale for real-time collection, correlation, and analysis on the fly. That also gives you a lot of performance, to be able to calculate cost, apply models, and do a lot more pre-processing on the fly as you are doing collection. What that then does is it also sets up the data to be very easily consumable for intelligent analysis.
And why do I say intelligent analysis and not just regular observability analysis, where you can do correlation or you can just do sampling, for example? That's fine. That's at a basic level. But you actually do want to accelerate the speed of the analysis that you do at scale and in real time, and for that, typically, ML is used a lot.
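
As an illustration of calculating cost on the fly during collection, here is a minimal sketch that correlates usage samples with unit prices and rolls the result up per team and region. The `UsageSample` shape, the `team` label, and the prices in `UNIT_PRICE_USD` are all hypothetical and are not taken from the talk or from any provider's price list.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class UsageSample:
    team: str        # hypothetical cost-center label attached at collection time
    region: str
    resource: str    # e.g. "cpu_core_hours"
    quantity: float


# Hypothetical unit prices in USD, keyed by (region, resource).
UNIT_PRICE_USD = {
    ("us-east", "cpu_core_hours"): 0.04,
    ("eu-west", "cpu_core_hours"): 0.05,
    ("us-east", "storage_gb_hours"): 0.0001,
}


def attribute_cost(samples: Iterable[UsageSample]):
    """Roll usage up into cost per (team, region) as samples stream in.

    Samples with no known price are returned separately instead of being
    dropped silently, so gaps in the pricing data stay visible.
    """
    totals = defaultdict(float)
    unpriced = []
    for s in samples:
        price = UNIT_PRICE_USD.get((s.region, s.resource))
        if price is None:
            unpriced.append(s)
            continue
        totals[(s.team, s.region)] += s.quantity * price
    return dict(totals), unpriced
```

Because the function accepts any iterable, it could be fed from a streaming collector as easily as from a batch export. Keeping the unpriced samples is a deliberate choice here, since missing granularity is exactly the gap the talk describes.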
I do introduce a new buzzword here, but it's an old buzzword actually: AIOps, also known as MLOps sometimes, for understanding reliability patterns. Now, we do this a fair bit in systems reliability, where you are not only building your observability pipelines but you also have ML applied at scale to understand reliability patterns: errors, failures, analysis that really gives you the trends in how your applications are behaving, but also the patterns of usage across regions and the different types of metrics and dimensions that you want to see. That is something which is very useful, again, for supporting financial planning.
Data analysis is also a different area in observability, one where you are working with cloud-provider-specific requirements. And these are, again, areas where systems engineering is not the only stakeholder. It could be capacity reservations, for example, right? One team that is doing reservations for capacity and working with cloud providers may know about that detail, but others may not. Developers may not, when they are building an application. So that data, along with other discount data, cloud provider deals, etc., all needs to factor into your analysis, because you really can apply that at the system scale, at the infrastructure scale, before it even goes into financial planning. That's something that's super useful, and that's something that also needs to be part of the SLOs and SLAs that are committed to by cloud providers.
Why do I call that out? Because of alert management.
Alert management is the next step of observability, if you will, in an observability life cycle. At the systems observability level, we often set alerts for understanding the behavior of systems, right, for the applications that we are looking at. But there can also be a whole range of budget alerts that you can set up for usage. Why that matters is because you want to be able to understand what your infrastructure is being utilized for at any given point in time, in real time. You do want to set thresholds and be able to understand how they trigger in various regions, for various resource types, for various applications, in different dimensions. That's something that's super important.
The other aspect of this category of budget alerts is that you can actually surface that information to different stakeholders. So not only systems engineering, platform engineering, and operations, but also developers, the development teams which are looking at how performant an application is going to be, and also your CFO's team, right?
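
The budget alerts described here, with thresholds per region and resource type that are surfaced to several audiences, might look roughly like the following sketch. The `Budget` fields, the 80% warning ratio, and the audience names are assumptions chosen for illustration, not something specified in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    scope: tuple              # (region, resource_type); hypothetical dimensions
    limit_usd: float
    warn_ratio: float = 0.8   # assumed warning threshold: 80% of the limit
    notify: tuple = ("platform-eng", "finance")  # hypothetical audiences


def evaluate_budgets(spend_by_scope, budgets):
    """Return one alert per (budget, audience) whose threshold is crossed."""
    alerts = []
    for b in budgets:
        spend = spend_by_scope.get(b.scope, 0.0)
        if spend >= b.limit_usd:
            level = "exceeded"
        elif spend >= b.warn_ratio * b.limit_usd:
            level = "warning"
        else:
            continue  # under the warning threshold: nothing to surface
        for audience in b.notify:
            alerts.append({
                "audience": audience,
                "scope": b.scope,
                "level": level,
                "spend_usd": spend,
                "limit_usd": b.limit_usd,
            })
    return alerts
```

Fanning each alert out per audience mirrors the point that the same budget signal matters to engineering, developers, and the CFO's team alike.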
Your financial teams are also working with you in lockstep on supporting your needs for the business. And last but not least, having these budgeting alerts will also convert into reporting, with detailed analysis of the different types of trends, patterns, and usage, not only at an application level but also factoring in all these different parameters, some of which I called out. So it's actually complicated, because it has its own data needs, if you will, and it is more and more intersecting with the same groups across organizations, where it's not good enough to understand how to set up and design large-scale infrastructure; it's equally important to understand what the resource utilization and efficiencies will be.
So at the end of the day, I just want to call out that there are different stakeholders in this whole equation, and they have to be at the table when you're building out your infrastructure. They do need to be at the table at the same time, right? There has to be regular conversation and regular planning, all the way from day one, when you actually start building your scalable applications as well as deploying them worldwide. So you have the engineering function with the CTO here, you have the CFO with the financial offices, and you also have the operating teams from the business, who are working closely with the other teams to analyze and spec out your budgets.
The other area that I do want to call out, which is actually a very important area, and this is one of the takeaways that I'd like to share with the community, is that public cloud providers today do not provide the kind of detailed information and the granularity of data that's needed for really highly efficient usage data, right?
And public cloud providers need to be able to provide that kind of granularity, that kind of customization, and the relevant data that is needed for applications and resource utilization. We are going into a very data-intensive age with AI, and many of you who may be looking at AI or thinking about applying it are definitely going into an age where we need to be more and more conscious about every single aspect of optimizing pipelines and infrastructure. It's not only GPUs or CPUs anymore. It's not only memory usage. It's not only storage. It's a lot more than that, and that fabric really has to be optimized. It has to have metrics and other data exposed so that you really have visibility, from the public cloud providers for starters, as well as the general system design and planning that needs to happen in order to provide the kind of data that you need for financial forecasting at scale.
I'd also like to call out the use of semantic conventions, which has been very popular and very well received in the observability community, especially in the open source projects within the CNCF, where, for example, a lot of work on semantic conventions has been done for other kinds of parameters in OpenTelemetry. Semantic conventions should actually be defined by all the stakeholders at the table for cost observability metrics and cost observability data, because that is a whole layer of data that you need above and beyond what the cloud provider or cloud infrastructure is surfacing today. That even goes back to how we think about Kubernetes, and what Kubernetes itself, as an orchestration environment, is exposing from the infrastructure. So using semantic conventions for cost observability is an area which will become more and more important as we go forward, and not only engineers, but product managers, operations, SREs, as well as financial product managers and analysts need to contribute to that whole process. Because once
that is standardized, I'd really like to see the public cloud providers provide a consistent, interoperable way of surfacing these metrics, so they can be used and customized and don't need the kind of heavy lifting that is needed for cost analysis today.
The other area that I'd like to see, and this goes back to APIs and using standardized APIs, is instrumentation and collection of cost observability data, which first of all needs to be defined. I think there's some work that is ongoing, but I'd like to see more momentum behind that, specifically with some of the open source projects which are in the CNCF, but also a lot more standardized implementation across the larger projects, right? There's Kubecost, if you are familiar with it, and OpenCost. But I think there's a lot more to be done there.
And last but not least, leveraging what I described as AIOps. There are many different aspects of that, not only from an analysis perspective but also using new technologies such as LLMs to use this at scale for accurate forecasting, and having a lot more definition on the kind of real-time analysis you need for your cloud cost management, right? So again, it's super important, because we don't have the ability, at some point or the other, to say, hey, we're going to put in all this customized engineering to really get what we need. It is very important for the cloud providers to be interoperable and to provide a standardized set of metrics and other observability data that can be used very easily.
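
To make the idea of semantic conventions for cost data concrete, here is a small sketch that normalizes provider-specific billing records into one shared attribute vocabulary. `cloud.provider` and `cloud.region` follow existing OpenTelemetry resource attribute names; every `cost.*` name, the provider labels, and the source field names are hypothetical and are not part of any published convention.

```python
# Hypothetical mapping from provider-specific billing fields to shared names.
FIELD_MAP = {
    "provider_a": {"svc": "cost.resource_type", "loc": "cloud.region", "usd": "cost.amount"},
    "provider_b": {"product": "cost.resource_type", "region": "cloud.region", "charge": "cost.amount"},
}

# Attributes every normalized cost data point is expected to carry.
REQUIRED = (
    "cloud.provider",
    "cloud.region",
    "cost.resource_type",
    "cost.amount",
    "cost.currency",
)


def normalize(provider, record, currency="USD"):
    """Rename a provider's fields to the shared names and report any gaps."""
    point = {"cloud.provider": provider, "cost.currency": currency}
    for source_field, shared_name in FIELD_MAP[provider].items():
        if source_field in record:
            point[shared_name] = record[source_field]
    missing = [name for name in REQUIRED if name not in point]
    return point, missing
```

With an agreed vocabulary, the per-provider mapping tables are the only custom part, which is the kind of heavy lifting a standardized, interoperable export from the providers would remove.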
So with that said, I think we covered a lot. It's a big, big area, and I know we're almost at time, but I'd like to ask folks if they have questions. I just wanted to reiterate that this is a very large area, but it's also a very fast-moving area, because the use cases for this are needed now, right? As we start building out a whole generation of new intelligent systems and services, it really is imperative to make progress in this area, and it's exciting to see this intersect with observability. I do think that more stakeholders also need to be at the table. It's not enough for us engineers to sit and do this. It's also important for the other stakeholders to actually come to the table and work with us. And last but not least, making these optimizations effective, because I think that today there's a lot of hand-holding in the way that we do cost management, and that can easily be improved a lot.
All right, so that said, I'm done. Questions? Two folks have questions. Okay, time for one.
What kind of tools do you use for getting the cost monitoring data? Because some of the tools that we have do not provide the detailed information. Do we have to invest in, or from the CNCF are we trying to invest in, a couple of these cost monitoring tools?
Yeah, I think there's some amount of work. And again, the question is that there's not enough tooling for cost management today, and the tooling that exists today is probably not as well developed as it could be. So are there other projects in the CNCF space that are looking at this?
And how can we do more? I would say that I think we don't have enough projects being done. However, I would also say that there are several interesting projects from the cloud providers which can actually be pulled into the CNCF and would help everyone utilize them. Karpenter from AWS is one of them. There are also other projects which could benefit from being integrated into the CNCF.
Any other questions? I think they're done. Thank you, everyone. Have a lovely day.