Hi everyone, I'm Marcin Stożek, and people usually call me Perk. I work at Canonical now, but I spent the last four years at Sumo Logic working on telemetry data collection, and that is the experience I want to share with you: how we went from a zoo of different collection agents to one standard, and what we found along the way. If you think about telemetry collection on Kubernetes, there are many different agents out there, and different teams tend to run different ones: one team collects logs with one agent, another collects metrics with something else, and those setups are not really connected to each other. We ended up on the OpenTelemetry Collector, which is probably not a surprise. You might think, OK, let's just look at what people already use on Kubernetes and treat that as the standard. That is not so easy, because there are a lot of agents; it is not one or two, there are five here and there could be more. We had this problem at Sumo Logic as well: we started with those agents, then found out that they did not work that well for us or for our customers, we looked for something better, and we found the OpenTelemetry Collector. Let me tell you how that story started. If you thought about telemetry collection on Kubernetes a few years ago, you most likely found Fluentd. Why? Because it has a lot of plugins, the community is there, the documentation is there, the features are there. But it is written in Ruby, and because it is written in Ruby it has problems: it works, but it is rather slow. So maybe you would use Fluent Bit instead. It is much faster and has a small memory footprint, but it is written in C, so you have to be careful with it, and bugs and new features in C are never going to be quick or easy. We also had Prometheus, but we did not use it as a database; we used it as a forwarder, and that sent the wrong message to people that were using our stack. We didn't have traces at the time, and for tracing we were lucky, because exactly at this point OpenTelemetry emerged. When OpenTelemetry emerged, we looked at it and immediately saw that, hey, this is a very, very interesting project: it has all of the features that we need and it seems great, so let's try it out. Why? Because OpenTelemetry tries to solve the proper problems in the proper places. It gives you the libraries for your languages and it gives you the Collector to collect the data that you have created with those libraries. It also gives you the protocol, it gives you the standard, it gives you the schema for your metadata, for example, and it does not give you a backend. And what does it mean that it doesn't give you a backend?
It means that the backend can be vendor provided. We have a lot of vendors out there that actually love this idea and want to support OpenTelemetry. They can compete on the backend now; they don't need to reinvent the wheel on telemetry data collection and creation, they don't need to create their own libraries, they don't need to create their own agents. And they all consume the same type of data, so you as a user don't have to choose at the very beginning and be locked into one environment or the other. This makes a lot of sense, because you have a project that is supported by many vendors and by users, and it is hugely popular; it is the second most active CNCF project after Kubernetes, which is great. So like I said, OpenTelemetry gives you a specification, so it is not a de facto standard, it is a real standard that was written down first. You don't need to dig into the code to see what the code is supposed to do, you have the spec for that. You have the language SDKs, with many languages actually supported, and you have the Collector, which is a very smart thing, because it has receivers, processors and exporters, all of them wrapped in pipelines, and all of those components can be combined in whatever manner you like: you can compose your collector using those components, you don't need to use whatever is out there. Why am I mentioning that? Because there are a couple of distributions of the OpenTelemetry Collector, and this is actually a little bit of a zoo as well; hopefully it will be cleaned up a little bit more. What is the situation? You have the core distribution, which has only the core components, and you might find yourself pretty quickly in a position where those core components are just not enough; this is not batteries included. The batteries-included one is another distribution, called the Contrib distribution. The problem is that even though it is batteries included, the OpenTelemetry project says, hey, you should not be using that, because we don't guarantee anything about Contrib. And that is fair, because the vendors that contribute those components to Contrib would need to guarantee that, and OpenTelemetry cannot, so there is a binary, but nobody actually feels accountable for it as a whole at this point. So, what do you do? You could use the custom distributions from the vendors: the vendors take those components and build their own distributions. The Grafana Agent takes some of those components but doesn't use the OpenTelemetry Collector pipeline internally, and that is for a good reason; I've heard that this helps with performance, but we are yet to see that, hopefully in the future. Then you could use the distributions from Sumo or AWS; they are more vanilla distributions, they have the same configuration as core or Contrib, and those vendors actually say, hey, we created those OTel collector builds for you, you can use them, we support them. And if you don't like those for any reason, you can build your own custom distribution as well; this is fairly easy at this point, and that is a great thing too: you don't need to use what is out there, you can just choose your components.
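To make the receivers, processors, exporters and pipelines idea concrete, here is a minimal sketch of a Collector configuration; the backend endpoint is a hypothetical placeholder, and the exact component set depends on which distribution or custom build you use.

```yaml
# Minimal OpenTelemetry Collector configuration sketch:
# receive OTLP data, batch it, forward it to a backend over OTLP.
receivers:
  otlp:
    protocols:
      grpc:                                 # accept OTLP over gRPC

processors:
  batch: {}                                 # batch telemetry before export

exporters:
  otlp:
    endpoint: backend.example.com:4317      # hypothetical backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Whether a given receiver or exporter comes from core, Contrib, or a vendor build, the pipeline wiring stays the same.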
So, what did we do at Sumo? We had those agents that we needed to switch from, and we started with traces. Like I said, in March 2020 OpenTelemetry was there and we thought, yes, we are going to use that for our tracing, and we pretty quickly found out that there is a flood of data attached to tracing. That is not specifically an OpenTelemetry thing, that is just how traces are: you can create a lot of data with logs and metrics too, but with traces it just tends to be a lot of data and people are not prepared for this. You can fight this in two ways, or actually more, but what we tried was filtering. You can filter some data, but the problem with filtering is that you need to have all of the spans for a given trace, and because you need all of those spans, the memory on the gateway that does the filtering can grow pretty quickly, so it is probably better to do this filtering somewhere on the backend, and that is what vendors do, they just filter on the backend. And you can do sampling, but with sampling you never know how much data you are sampling out, so you probably want to use both: some sampling, and then filter out everything that is not an error or a warning, for example. The OpenTelemetry Collector doesn't have a good answer to this as of now, unfortunately, and I hope this will be worked on in the future. Then for the metadata layer, we used Fluentd to enrich the data with metadata, and Fluentd had this problem that it was single threaded, and because it was single threaded, the performance was not great. What we found out when we switched to the OpenTelemetry Collector was that the Go performance is pretty good, especially compared to the Ruby one, but the most important thing was that when we switched from Fluentd to the OpenTelemetry Collector, we removed the back pressure from our Prometheus. Prometheus was sending metrics using remote write, and remote write is a feature that really, really doesn't like being back-pressured: when that happens, your memory consumption goes through the roof, and it is pretty terrible. I know that it is being actively worked on, and I've heard that it is going to be fixed, which is great, but it was not the case back then, so we had problems. When we switched from Fluentd to the OpenTelemetry Collector, we stopped having problems with Prometheus and its memory usage just went down, so not only Fluentd's memory, but also Prometheus' memory. Here I have some data. Across the cluster, for the data that we sent, we needed 38 CPUs for the Fluentds, and for the same amount of data we needed only 13 CPUs for the OpenTelemetry Collector, which is about three times less; that's good, isn't it? For the memory we went down from 220 GB of RAM to just 75 GB, which is again about three times less, but then we started looking more into it and found out that we can actually go as low as 11 GB, so that is 20 times less; that is just money that is not being burnt. With the instances we went down from 85 to 20 at the beginning, and then even to 11, which is about eight times less. This is important because you don't want too many instances poking around your Kubernetes API server to get the metadata, because the API server does not like that very much, and if you have a really huge cluster, this stuff just starts to matter.
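As a rough illustration of that metadata layer, enriching data with Kubernetes metadata in the Collector is typically done with the Contrib k8sattributes processor; this is only a sketch, and the exact fields can differ between versions.

```yaml
# Sketch: attach Kubernetes metadata to telemetry inside the Collector,
# the role the Fluentd metadata layer used to play.
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.node.name
```

The processor watches the Kubernetes API and tags each record as it passes through, which is also why running fewer, shared instances is easier on the API server than running many per-node watchers.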
Then we said, okay, we have OpenTelemetry for traces and we switched it in for the metadata, and it worked okay, so let's see what we can do with logs. We didn't have that many problems with logs, but we wanted our environment to be homogeneous. Even though with Fluent Bit you get great CPU and memory usage, it didn't support metrics and traces at the time (it does now); had it supported them back then, the story might have been different, but it didn't, so we switched to the OpenTelemetry Collector. We found out that the CPU usage is actually very good and the memory usage is reasonable, especially compared to code that is written in C, but the most important thing is that there was no major feature missing: all of the features that we needed were there, and if something was missing, we just added it. Some of our customers asked for features and those were added, and other people contributed there as well, because it's a huge project. It is funny, because sometimes you find yourself in a situation where you would like to add this one feature that you found missing, and then you come back after three months, because that is when you finally find the time to do it, only to find out that the feature is already there because someone else has added it. Which is great, you don't have to do it anymore, even if it means you just lost your chance to contribute; that is something that is actually funny and very, very cool about this project. Then, last but not least, we switched the metrics side, which was Prometheus. Like I said, we misused Prometheus because we used it as a forwarder, which is not what it was meant to be back then, and we found out that there are some quirks around the names, like the dot versus underscore; they are just different. But again, on resource usage: even though once we stopped back-pressuring Prometheus it no longer needed all of the RAM in the world, we still found that the OpenTelemetry Collector used even less. One other thing to add here is that there is this thing called the Prometheus receiver that lets the Collector scrape data just like Prometheus does, and huge shout-out to the team that did that: it reuses the Prometheus libraries inside and it works great. You have an almost drop-in replacement: whatever scrape configuration makes sense in the Prometheus world just works in the OpenTelemetry Collector, and it is so, so easy to switch; we are very, very happy with that. What were the outcomes in numbers? We went down five times on CPU, and on memory we went down five times as well. So we were able to switch from those agents to the OpenTelemetry Collector, and now we can send all that data to your backend, or actually backends: you don't need to send it to only one backend, because practically all of them support OpenTelemetry as of now, which is a huge win for you as a customer, as a user, isn't it?
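To show what that drop-in replacement looks like in practice, the prometheus receiver embeds ordinary Prometheus scrape_configs; this is a minimal sketch, with a hypothetical scrape target and backend endpoint.

```yaml
# Sketch: the Collector's prometheus receiver reuses standard
# Prometheus scrape configuration almost unchanged.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: node-exporter            # hypothetical job name
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9100"]    # hypothetical target

exporters:
  otlp:
    endpoint: backend.example.com:4317       # hypothetical backend

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```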
Were there any issues? Yes, there were some issues, some of them were more funny, some of them less funny, but most of the time I think we survived. Let me tell you about one issue, for example: when you wanted to get the logs from your file and you had an empty line, then depending on where this empty line was and how many empty lines there were, you could have your file not read, misread, or read in a loop, right? So that was funny, but that was in 2022, and this bug, and many other bugs that were that easy to hit, are obviously not there anymore. As of now the state is actually good. If you use components that are used a lot, the bugs you find there are going to be a lot more sophisticated, not low-hanging fruit at all. If you use a component that is not used that much, you might find bugs there, but that's just life; I think it's the same for any other project, so no surprise here. All in all, Sumo Logic was very, very happy with the stability, and we didn't hear much complaining from customers either. That being said, I hope that you enjoyed this presentation and that you feel encouraged to try the OpenTelemetry Collector yourself if you haven't used it yet. If you'd like to see the slides, they are available here. Thank you very much and I'll see you around. Happy collecting. Hi everybody, so we sort of have an extended Q&A for the end, and I also have some data to present, because Perk had some really interesting insights on the migration path going from Fluentd to using the OpenTelemetry Collector. Interestingly enough, let me switch this so everybody can see what I can see. Is that three? Yeah, so if you wanted to look at how the benchmarks were done artificially, rather than using the production workload, I made this handy thing and you can look at it on your phone. A co-worker of ours, Matt Rumian, put together how the tests were actually done in an open source environment, and I can switch to that now. There were two different Helm chart versions that were tested here: one of the v3 versions versus our v2. The v2, again, was using Fluentd directly, I guess somewhat like middleware between the OpenTelemetry Collector and us, versus using the Sumo Logic OTel collector directly. Andrej here, did you want to go through some of this? Okay, so this, I think, only covers Fluentd versus the OpenTelemetry Collector, but the interesting part that I saw is: if for one minute you send five megabytes of logs per second, which is 300 megabytes of logs during that minute, Fluentd will use a whole single CPU, because that's all it is capable of using, for three and a half minutes. So logs have been sent for one minute, but it takes Fluentd three and a half minutes to send them on. So, yeah, you can use Fluentd, of course, but you will use a lot of resources. For the OTel collector, the same amount of logs is basically instantaneous and it uses 16% of a CPU. But that's an easy one, comparing things to Fluentd is an easy target, right? If we compared to Fluent Bit it would probably look completely different, maybe not the other way around, but different. I'm not seeing Eduardo Silva here, I think he originally wrote Fluent Bit, so he could probably tell. Fluent Bit is definitely very, very performant, but we actually had really serious issues with it in big environments, and those issues aren't easy to fix, because it's C, because it's a mature codebase. We just hope not to get into that place with the OTel Collector in the future. So Perk in his presentation said that it was about five times more efficient in terms of CPU usage, and that was with everything running in production.
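For reference, on the Collector side a log benchmark like this typically just tails the generated files with the Contrib filelog receiver; the path below is hypothetical, and this is only a sketch of that piece, not the actual benchmark setup.

```yaml
# Sketch: tail generated log files with the Contrib filelog receiver.
receivers:
  filelog:
    include:
      - /var/log/benchmark/*.log   # hypothetical path to generated logs
    start_at: beginning            # read existing content, not only new lines
```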
The way that Matt did the benchmarks here was that he actually created sort of a log generator, and you can look at the file and nerd out at some point, whenever, if you're curious. This was all done just to run a variety of different benchmarks when upgrading between different Helm chart versions, and you can see that it's consistent; the data is actually consistent with our production, although I think everybody individually might have different experiences. As for Prometheus: there was a very nice talk from Bryan Boreham about Prometheus just a couple of minutes ago. I think my takeaway is, if you can, just don't use Prometheus at all. We were using Prometheus as a data forwarder, as Perk mentioned, and we're currently using the OTel collector and it's fine. I mean, the OTel collector still uses Prometheus libraries under the hood when you use the Prometheus receiver, so it's still five times better. But I think ideally, in the future, the whole world will switch to OTLP, including the Kubernetes API server and other components, and we will not have to scrape those crazy amounts of metrics from one big source. Prometheus is a whole other story, metrics are a whole other story; we would have to do a whole other set of benchmarks and tests to see the improvements on the metrics side, like we saw on the logs side. But are there any questions from folks here? You can just come up, or you can yell. Yelling also works. So, what is the output there? Is it always OTLP? Because we did some other tests at Grafana, and ingesting Prometheus and outputting Prometheus is always going to be more performant with the Prometheus agent than with the collector. We output OTLP. So you convert Prometheus to OTLP and then you spit out OTLP. For the tests here with logs, we would have to create a whole other test bench based on our newest Helm chart deployment, which was just announced this morning, against the prior one, which was using Prometheus as middleware. So the performance gains, I mean, we would have to just re-run the benchmarks, so we don't have that data on hand right now. We can talk about what happened in production, right? In production we had already replaced the metadata layer with OTel; we knew that we wanted everything to be OTel. The setup with the metadata layer replaced was that Prometheus would scrape and send the data over remote write to the remote write receiver of the OpenTelemetry Collector, and we replaced that with an OTel collector that scrapes with the Prometheus receiver and sends out with the OTLP exporter. Two quick questions. So two quick ones, really. The first one: there was a part of the talk about moving metrics. We looked, not too much, but we looked in early 2022 at OpenTelemetry for tracing, and we also looked at it for metrics, and it had metrics, the generic kind of system metrics, but it didn't have the maturity of, say, something like a Telegraf agent. It felt like we were going to put OpenTelemetry on, only to then talk to something like Telegraf that we would end up running ourselves to get a lot of the metrics out, so it felt like kind of a backward step if the goal was to reduce the number of branded agents on the node. I don't know what your thoughts are now about that, because I haven't looked again, but what it gathered was very generic and it wasn't really mature as a metrics service. And then secondly, a follow-on question: how does this work at scale?
So Prometheus works quite well at scale when you isolate them, say, across different clusters, and then you have them forward on, and maybe you leverage something on top. Those two questions are kind of unrelated, but just to save me getting up again, what were your thoughts around those? What was the first one? I'm stressed. OpenTelemetry. It's funny that we're fully off Fluent Bit and Fluentd, no longer talking about them; we're now fully talking about Prometheus. Metrics, maturity. So, the maturity of metrics in OpenTelemetry; compared to Prometheus it's the same thing, OpenTelemetry is just not as mature, and yeah, I agree with that. What we did at Sumo Logic, which I wouldn't recommend, is that we devised a Telegraf receiver where you can use Telegraf inputs, but integrating the Telegraf codebase with the OpenTelemetry codebase is not easy, because you need to import a lot of stuff. It makes sense, though, because it gets around that maturity problem; that was our way of doing this, but it's a pain. So ideally we just try to contribute everything upstream, and we encourage anyone to contribute more. That's my answer. And then from the scaling perspective, when you've got all those OpenTelemetry agents forwarding stuff on, imagine you've got a million of those little things chatting and trying to forward data on, how does that work at scale? You understand how Prometheus works at scale. Well, you know, in terms of metrics we use this Prometheus receiver, that's one thing. The other thing is, I think it's much easier to shard it. In Kubernetes we use the OpenTelemetry Operator with the target allocator, and we shard all the scrapes among as many collectors as we want. It's much easier with the OTel collector, actually; I'm talking about Kubernetes specifically. With the target allocator and the OpenTelemetry Operator, it is much easier to scale the load than with Prometheus. I think we're close to time anyway, but if anyone else has questions, let us know. Maybe not. I'm going to do one last shameless plug, which is: if anyone wants to look at this ebook, it's free, and you can learn to navigate Kubernetes monitoring with it. I'll give people two seconds for that, and then we might just move on to the next thing, because Austin's like, get off the stage. I'm getting the hook. Thanks.