Okay, hi everyone. Thank you very much for joining me today. My name is Yosef Arbiv, and this is Steve Ballmer. He's the former CEO of Microsoft. He's a great guy, and I'm sure he likes you because you're all developers, and he really likes developers. But what he didn't like 20 years ago was open source. 20 years ago, he called open source communism and also cancer. But now, 20 years later, Microsoft has acquired GitHub, and you can run Linux from your Windows machine with WSL, the Windows Subsystem for Linux. So Microsoft has a completely different approach to open source than it had 20 years ago.

Today, I'm not going to talk about Microsoft or Linux. I'm going to talk about our journey with open source, about our journey with OpenTelemetry. When we started Epsagon five years ago, we started with closed-source libraries and proprietary protocols. Today, we are part of the OpenTelemetry community, we are building products that support OpenTelemetry natively, and we are part of Cisco. So I'm going to talk about how we got from where we started to where we are today, about the mistakes that we made along the way, and what you can learn from them if you want to incorporate open source into your products. We are going to start with an introduction to observability in general. Then we are going to talk about the different stages in our journey with OpenTelemetry. We are going to talk about our plans for the future, and we'll have some time for questions at the end of this talk.

A little bit about myself. My name is Yosef Arbiv. I'm the father of three adorable young boys, and I'm the group manager at Cisco ETI for OpenTelemetry. My group is responsible for contributing code to OpenTelemetry, and I have 12 years of experience as a software engineer and as a team leader.

So let's start by talking about observability in general. What is observability? This is the Wikipedia definition: observability is a measure of how well the internal state of a system can be inferred from its external outputs. A system can be in one of many different states. If you can tell which state your system is currently in by looking at its external outputs, you can say that your system is observable. But let's try to understand what exactly that means in the context of software development and software maintenance.

Let's say we have a simple application. Our logic is running on a single host, and we have an API gateway and a database. We want to know if the system is working as it should, right? We want to know if everything is running or if we have some issue, maybe an issue with the database, and so on. So what we want to do is collect some metrics from this application, so we can know if everything is running as it should be, and then we can set up some alerts, for example if there are latency issues, and so on. But this is not enough. When we have an issue, we want to be able to look at the logs from our system. So we will run another agent on this system that collects logs from the different modules in our system and sends them to an aggregator, hopefully outside of the application, so we can look at the logs and see the current status of our system.

But what happens as we move to a microservices architecture? Nowadays, more and more companies are moving from a monolith architecture to a microservices architecture, and this is happening for a couple of reasons.
First, microservices are easier to scale. If you want to be able to scale your application to support more and more customers, it is much easier if you are using a microservices architecture. The second reason is that microservices can reduce the blast radius of bugs. If you have a bug in one of the microservices, it only affects the specific flows in the system that this microservice is involved in; other flows in the system are not affected by this bug. Microservices also allow for smaller teams, so you can have teams that are each responsible for a subset of the microservices.

So there are great reasons to use microservices, but they also present some challenges to the development team. This is how a small demo application that we wrote at Epsagon looks on a microservices architecture. As you can see, there are many different microservices talking to each other, and it can be difficult to track what exactly is happening inside our system. And this is how an actual application looks, not a demo one, so you can see that it is much harder to understand what exactly is happening in the system.

So what are the challenges that we are facing here? We have three main challenges when we are talking about maintaining these applications.

The first challenge is modeling your system. Even understanding those pictures that I showed earlier is not that obvious, right? Getting to know what the different microservices are, who is talking to whom, what flows we have in the system, and which microservices participate in each flow can be very hard. If we go back to the definition of observability, I said that the goal is to know which state the system is currently in. So even understanding what the different states of our system are is very hard when the system is built on a microservices architecture.

The second challenge is troubleshooting our application. Let's say we get an alert that we have an issue with one of the flows in the system. It can take a lot of time to understand which flow exactly it is, and then we have to look into the different microservices that participate in this flow. We have to search the logs of each one of the microservices. Maybe we need to correlate logs from different microservices to understand what exactly this flow is and what the root cause of the bug is, and only then, only after we search and find the exact microservice that is responsible for this behavior, can we start investigating the root cause and hopefully fix the issue in our system.

The last challenge that I want to talk about is optimizing our system. Let's say we have a customer who complains that a certain flow in the system is not as fast as it should be. Again, we need to search for the different microservices that participate in this flow, we need to find the specific microservice that is responsible for this behavior, and only then can we start fixing the issue and maybe optimize our system.

So in order to deal with these challenges, we have the three pillars of observability: metrics, logs, and traces. These pillars help us with the challenges that I explained before.

The first pillar is metrics. Metrics tell us what is happening inside our system. Metrics are essentially numbers on a timeline. We can collect business metrics, for example how many transactions we have in our system.
Are these checkout transactions or return transactions, if we're talking about a shop? What is important when we are talking about a microservices architecture is to be able to collect the different metrics from all of the different microservices in our system into a single dashboard, so we have one place to look at all of the metrics from our whole system.

The second pillar is logs. Logs are the most basic kind of telemetry data that we get from applications; every system produces logs. But when we are talking about distributed systems, there are some things we should take care of regarding our logs. The first is that we would like our logs to be structured. We want the logs to be in a certain format and not just random strings where each developer decides on their own exactly what to write. When we have structured logs, it's easier to store them and easier to search them later. Another important thing is to add identifiers to our logs. When we add identifiers such as user ID, transaction ID, customer ID, and so on, it is easier to correlate the logs between different microservices, so we can connect the logs that belong to one flow and then search for all of the logs from all of the microservices that are part of that single flow.

The last pillar is traces, and when we are talking about distributed systems, we are talking about distributed traces. So let's explain what exactly distributed traces are. This is an example of a distributed trace in a timeline representation. A distributed trace tells us the story of a transaction or an event in our system as it propagates through the entire distributed system. Here we have operation A, which triggered this entire trace. Each tile here is called a span. A span is a representation of a single unit of work, and all of the spans together are a distributed trace. So we have operation A, then operation A triggered operation B, which in turn triggered C and D; when operation D completed, the entire B operation completed, then operation E was triggered, and then the entire A operation was completed. So we can understand which microservices were part of this flow, of this operation, and we can see them on a timeline.

This is another representation of a distributed trace, but this time as a graph. You can see the different microservices that participated in this flow. You can see that we had a checkout operation, and in order to complete this operation we had to go to the Mongo database and to a Redis cache, and then you can see that we had an error when we tried to access the discount service. So here you can see how a distributed trace can help us understand where exactly an error happened inside our system, and how from this failed checkout operation we can continue to search the logs of the discount service and see what exactly the root cause of this error was.
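To make this span structure concrete, here is a minimal sketch in Python (assuming the opentelemetry-api and opentelemetry-sdk packages are installed) that reproduces the A → B → (C, D) → E trace described above, and also emits one structured log line carrying identifiers so the log can be correlated with the trace. The tracer name, user ID, and transaction ID are invented for the example.

```python
import json
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout for the sake of the example; a real setup
# would export them to a tracing backend instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-demo")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout-demo")


def structured_log(message: str, **identifiers) -> None:
    """Emit a JSON log line that carries correlation identifiers, including the trace ID."""
    ctx = trace.get_current_span().get_span_context()
    identifiers["trace_id"] = format(ctx.trace_id, "032x")
    logger.info(json.dumps({"message": message, **identifiers}))


# Operation A triggers B (which triggers C and D) and then E; each
# start_as_current_span call opens one span, and together they form one trace.
with tracer.start_as_current_span("operation-a"):
    structured_log("checkout started", user_id="u-123", transaction_id="t-456")
    with tracer.start_as_current_span("operation-b"):
        with tracer.start_as_current_span("operation-c"):
            pass
        with tracer.start_as_current_span("operation-d"):
            pass
    with tracer.start_as_current_span("operation-e"):
        pass
```

When this runs, the console exporter prints each finished span together with its parent span ID, so you can see C and D nested under B, and B and E under A.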
Okay, so now we understand better what exactly distributed traces are and what observability is, and we can talk about our journey, the Epsagon journey with OpenTelemetry.

The first act is what happened before OpenTelemetry existed. When we started Epsagon, we targeted the serverless market. Serverless is a concept where you are using microservices, but you are not managing them on your own; the cloud provider provides those services and manages them for you. You don't need to care about which server is running them, and you are not dealing with Kubernetes clusters or anything like that. Essentially you have small pieces of Lego that you can build whatever you want with: small pieces of services provided by the cloud provider, and you build your application on top of them.

We saw that customers using a serverless architecture had many issues with troubleshooting their applications, so we decided to create a product that would give those customers a better way to troubleshoot. Initially we targeted customers using the AWS cloud. It was the most popular cloud back then, and it is one of the biggest clouds to this day. We needed some way to collect spans from the customer's application in order to be able to create such graphs in our backend. So we looked into how we could create those traces, those spans, from the customer's code, from the customer's microservices running on AWS.

Back then there was no industry standard for generating distributed traces or for their format. What did exist back then was OpenTracing. OpenTracing was an open source project for creating distributed traces, but we had a couple of issues with it. The first issue was that OpenTracing only supported manual tracing. That means the customer needed to write the tracing code on their own, and we wanted to create something that would be automatic, so that the customer would not need to configure anything or write their own code. We wanted the onboarding to be as smooth as possible for our customers. The second issue was that back then OpenTracing was backed mostly by one company that was a competitor of ours, and we were a little bit afraid of making our entire business depend on a single competitor. So eventually we decided not to use OpenTracing and to build our own SDKs, our own libraries and our own trace format.

We created those libraries and we started to gain more and more customers using them and troubleshooting their applications with our backend. At first we wanted those libraries to be closed source. We thought that the way we instrument the code and generate those traces automatically for our customers could be part of our intellectual property and one of the ways we could protect our product. But soon enough we discovered that customers did not want to install closed-source libraries into their code. Our customers wanted to know what was happening inside their systems; they were afraid that we would crash their systems. So we decided to open up our libraries and publish them as open source. We were also hoping that we would be able to create a community around those libraries, so we could get contributions from customers and from outside of our organization. Maybe customers could fix their own bugs sometimes, or contribute a new instrumentation for a new framework in the future.

So this is the first act, the first stage of our story, and I want to talk about the lessons we learned from this stage, the era before OpenTelemetry. The first lesson is about product defensibility.
When you are building a new product, it is important to understand which parts of the product should be defensible and which parts can be open to the public. When you try to protect the entire product, you usually waste too much energy and, as in our case, you can end up hurting your own business. So you want to decide exactly which parts should be defensible and which can be open.

The second lesson is about building an open source community. Building an open source community can take a lot of resources from a company. You need to really invest in the community, and this is not easy to do, especially when you are a small startup. In our case we opened the sources, so our sources were open, but it was not really an open source project. There weren't really a lot of contributions from outside of the company, because as a small startup we didn't have the resources to build a strong community around our libraries. So it is much easier to join a community than to build your own open source community around your own libraries.

The second act is the standardization of the market. We had good success with the serverless market, we were very popular, but we understood that the serverless market was not big enough to build our entire business on. We searched for other ways to expand our business, and we decided to expand to customers using Kubernetes clusters. But when we examined those clusters and what the libraries for them would have to look like, we found out that it was much more complex than the serverless case. With serverless there was a relatively small number of languages supported by the cloud provider and a limited set of frameworks in use; usually you have one cloud provider and you interact with the different components that this cloud provider provides. So essentially there was a small set of frameworks that we needed to support and a limited set of SDKs that we had to develop. But when we are talking about Kubernetes clusters, there are a lot of different languages, and for each language there are tons of different frameworks that can be used, and it is really hard to develop all of those different SDKs. We realized that there was no chance we would be able to create all of them on our own.

So we checked again what was happening with OpenTracing, and OpenTracing was much more mature by then. It was more popular, it had a lot of different companies supporting it, and there were already some distributions of OpenTracing that were using automatic instrumentation. So we created forks of those distributions in order to fit our own needs: we changed them to match our proprietary trace format, we added our easy onboarding to them, and we published them as our own SDKs.

And then what happened, a very exciting moment for the community, was the announcement of OpenTelemetry. OpenTelemetry was announced as a merger of OpenCensus and OpenTracing. OpenCensus was another open source project that dealt with automatic instrumentation and was focused on metrics, and together with OpenTracing, OpenTelemetry was announced as the new version of both projects. I want to expand a little bit on what exactly OpenTelemetry is and what you can do with it. OpenTelemetry is a collection of tools, APIs and SDKs.
So it is not just one library or one framework; it is a collection of a variety of tools and SDKs. You can use it to instrument, generate, collect and export telemetry data. OpenTelemetry deals with the generation of the telemetry data, which can be metrics, logs and traces, and with how you collect it and how you export it. And it is all done to help you analyze and improve your software's behavior and performance.

This is the basic architecture of an application using OpenTelemetry. You can see that we have APIs and SDKs that run alongside your application. You can also process the data and export it using OpenTelemetry libraries from within your application. Then you can send it to a backend, either directly from the exporter within your application, or alternatively through an OpenTelemetry collector. The OpenTelemetry collector is a tool that can run beside your application on the same host, or on a different host as a gateway. It can receive data from one or many different sources, and it also supports sources that are external to OpenTelemetry. It can have one or many processors that process the data, and then it has exporters that export the data to a backend. So OpenTelemetry deals with the generation and collection of telemetry data; it does not deal with the backend, that is, visualizing or storing the data. There are different open source projects for that, or vendor-specific products that support OpenTelemetry.

OpenTelemetry became popular relatively fast because it already had the big communities of OpenCensus and OpenTracing behind it. And we started to use OpenTelemetry libraries as the basis for our libraries. We took OpenTelemetry code, we modified it, and we added our logic to it, logic that collected more data. OpenTelemetry usually collects metadata from the application, and we wanted to also collect actual data from the transaction so we could visualize it for our customers, because we found out that this was really helpful for understanding the root cause of issues in their systems. So we created more and more forks of OpenTelemetry, and this way we were able to support more and more customers as we expanded our business using OpenTelemetry libraries.

So what are the lessons we learned from this phase? The first lesson is about forks. As I mentioned, we created forks of OpenTelemetry in order to create our libraries, and what we found out is that forks are really fast to create. We could create more and more forks really fast; we knew what to change in order to fit our needs, our proprietary trace format. But forks were hard to maintain. Each time there was a new version of OpenTelemetry, or a bug was fixed in OpenTelemetry, we had to cherry-pick those changes into our code. Because we had changed the code, we couldn't simply use newer versions of OpenTelemetry, and maintaining those forks became a headache.

And this brings us to the second lesson, which is about balancing velocity and tech debt. As we created more and more forks, we actually created tech debt, because we had more forks to maintain, and maintaining them was really hard. But on the other hand, being able to create those forks very fast helped us get more customers and expand our business. So this is something that you need to balance when you are building your business.
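To make the exporter-to-collector pipeline described above concrete, here is a minimal sketch (assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed and a collector is listening on the default OTLP gRPC port 4317 on localhost) of an application configuring the SDK to ship its spans to a collector. The service name and endpoint are illustrative, not anything specific to our products.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the service so the backend can group its telemetry.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# Export finished spans in batches over OTLP/gRPC to a local collector; the
# collector's own processors and exporters decide where the data goes next.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("checkout"):
    pass  # application work happens here
```

From there, the collector decides where the data ultimately goes, which is exactly what lets you change backends without touching the application code.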
And this is what led us to the third act, when we joined Cisco and the OpenTelemetry community. What happened is that we realized that all of those libraries were really hard to maintain. We noticed that we were investing too much time in maintaining more and more libraries, and we realized that this was the time to change how we create them. So we decided to run an experiment with our Java agent. We created a new version of our Java SDK, the Java agent, but this time it was not a fork; it was a distribution of OpenTelemetry. We used the OpenTelemetry code, but we didn't change it. We used it as is and we only extended it, using the extension mechanism of OpenTelemetry, to collect the additional data that we needed for our customers. We also added the zero-code onboarding to it, so our customers could just install the agent and have nothing else to do. We also had to make some changes to our backend, because our backend supported only our proprietary format. So we made some changes to our backend to support the OpenTelemetry trace format, and then we were able to collect data from the OpenTelemetry distribution in our system.

This experiment went well. We managed to create a distribution of OpenTelemetry in a relatively short time, without too many changes to our code and to our backend. And we built more and more libraries this way, by creating distributions of OpenTelemetry, by extending OpenTelemetry.

But then we had a change of plans. Epsagon was acquired by Cisco, and we decided to stop working on the previous Epsagon product and to join forces with Cisco in order to create a full-stack observability platform for Cisco customers. Today we have taken another step towards the community: we decided that we want our customers to be able to use OpenTelemetry natively, meaning that our product will support OpenTelemetry natively and not our own distributions, so we know that our product is moving in the same direction as the entire community.

So today we're working on two products as part of Cisco. The first product is AppDynamics Cloud, which supports OpenTelemetry natively, and we are working on the distributed tracing experience within AppDynamics Cloud. The second product we are working on is an open source product for visualizing distributed traces. It also supports OpenTelemetry natively, and we hope to publish it next month, so it will be available for the entire community as a complete open source project that you can self-host on your own systems. You can send telemetry data from OpenTelemetry into this open source product, visualize your telemetry data, and have a complete open source stack.
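The extension we described was built with the OpenTelemetry Java agent's extension mechanism. As a language-neutral illustration of the same "extend, don't fork" idea, here is a hedged Python sketch of adding vendor logic through a custom span processor while using the upstream SDK unmodified; the class and attribute names are invented for the example and are not our real ones.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider


class EnrichingSpanProcessor(SpanProcessor):
    """Attach extra attributes when spans start; leave everything else to the stock SDK."""

    def on_start(self, span, parent_context=None):
        # Illustrative vendor-specific attribute, not an Epsagon attribute.
        span.set_attribute("vendor.payload.captured", True)

    def on_end(self, span):
        pass

    def shutdown(self):
        pass

    def force_flush(self, timeout_millis=30000):
        return True


# The processor is registered on top of the unmodified TracerProvider:
# an extension of OpenTelemetry rather than a fork of it.
provider = TracerProvider()
provider.add_span_processor(EnrichingSpanProcessor())
trace.set_tracer_provider(provider)
```

Because the extra logic sits on top of the stock SDK, upgrading to a newer OpenTelemetry release does not require cherry-picking anything.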
In the future, we hope to become a significant part of OpenTelemetry. We are working on contributing the code from our distributions back to the community, back to OpenTelemetry. This is what we are doing now: the things we figured out that were helpful for our customers, we want to make available for the entire community, so we are working on contributing those changes to OpenTelemetry. And as part of Cisco, we want to help create a better observability future for the entire open source community. So thank you very much for joining me, and I'd be happy to take any questions if you have them.

[Audience question] Your second product, the one you said will be published next month, is it meant to replace Jaeger in the future?

Yes. The idea is to have something like Jaeger, but with a better focus on distributed traces and better visualization of distributed traces, yes. Thank you. Any other questions? Maybe from our remote audience? Okay.

So thank you very much for joining me. If you have any other questions, I will be here for a couple of minutes, and you can also reach out to me on these platforms, on LinkedIn or Twitter. I'll be available and happy to take any other questions or ideas you have. So thank you very much, it was a pleasure.