Okay. Hello everybody, welcome to this DrupalCon session. Today we are talking about observability, so we are looking at how to understand how our Drupal instance is doing just by looking from the outside. I'm Luca from Italy. I'm mainly a Drupal and PHP developer, working for an Italian company called SparkFabric. Here you can find some of my social accounts; the name is the same everywhere, so it's easy. Just a couple of words about the company, SparkFabric. We are a tech company. We do a lot of custom application development, mainly focused on cloud-native infrastructure. So we develop applications that will be deployed on some cloud vendor, including Drupal: we have a lot of Drupal instances on the cloud. For that, we are a partner of all of the main cloud providers, including Alibaba in China. Basically, right now, we are all working with distributed systems of some form, because we have microservices, containers, cloud, serverless, headless, message queues, and so on, and above all a lot of combinations of those technologies. All of this complexity increases the number of failures that our systems may encounter in production. And because in a distributed environment the components can be implemented with different technologies or different languages, it's usually complex to understand, when a problem occurs, where it occurs, and it's difficult to predict possible failures in the future. To overcome these problems, we can use a technique called observability, which is a measure of the internal state of an application obtained just by looking at it from the outside. We don't want to log in to every component of our infrastructure just to understand where a problem is; we want to look from the outside. And we need more data than classic metrics like CPU and memory: we need data from the inside of the application to understand how it works and how it behaves. Observability is based on three pillars, basically.
So we have logs, metrics, and traces. During this presentation we will understand how to expose logs, metrics, and traces from our Drupal websites in order to observe them. The end result we want to achieve with this talk is to have a dashboard that shows us, for example, the heat map of the request and response times of our application, the number of users registered, the number of nodes created, the distribution of logs over time, and the details of logs from Drupal and, for example, from an external microservice. We want all this data in a single dashboard to understand how the system works. Okay, to do that, we need different tools, different technologies, to collect logs, traces, and metrics. In this presentation, for logs we will use Monolog, Promtail, and Loki; I'll show you everything later. For metrics, we will use Prometheus. For traces, we will use OpenTelemetry and Tempo. And Grafana is the tool we will use to create the dashboard showing everything we collect. Grafana is an open source project and it allows you to query, visualize, and alert on everything regarding logs, metrics, and traces, no matter where they are stored, so you can collect data from basically everywhere. So let's start with logs. Logs are about storing specific events that occur during the execution of our code. To do that, we are using Monolog. Monolog is a standard PHP library, used by every kind of PHP project, like Symfony, and it can be used in Drupal through a contrib module that is also called Monolog. With Monolog you can send logs, for example, to files, to sockets, to syslog, to email inboxes, to Slack, whatever. And Monolog implements the PSR-3 interface, so it's compatible with the logger of Drupal, and they can interoperate without any problem. So first of all, you have to download the module.
You can do that using Composer, because you need both the module and the library that implements all the logging. Then, after enabling it, you have to configure Monolog. The Monolog module for Drupal doesn't have a user interface, so you have to configure it manually using a YAML file, for example this monolog.services.yml file in the sites/default folder. In that file, you have to specify the handler that will manage your logs. In this case, we are using the rotating file handler, which saves one log file per day; in this example, after 10 days it will delete the oldest one and create a new one, and it will log everything from the info level up to fatal errors and so on. After that, you define the channels. In this case we are logging everything, so we are using the default channel; if you want, you can log different channels to different places. We are using the rotating file handler we defined in the previous slide, we are using JSON as the format of the log (we will see why), and then we define a set of processors that alter the log record before it is written somewhere, to add some useful information like the current user, the IP, or the line and the file where the log was generated. With this configuration in place, we have to register the file in the service container of Drupal: in the settings.php file, we add this file to the list of YAML files of the service container. And if we write this line somewhere in our code, so we log a notice with some text using "DrupalCon" as the channel, what we get in the log file, for example the one of today, is all the information in JSON format. JSON is useful because we can query all the information easily: we don't have to parse the line, because it's in a structured format. And if we want, we can add more processors to Monolog to add other information to a log line.
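Putting the last two slides together, a minimal sketch of such a configuration might look like this. Treat the exact parameter names, service names, and the log path as approximations, since they depend on the version of the Monolog module:

```yaml
# sites/default/monolog.services.yml (sketch; names depend on the module version)
parameters:
  monolog.channel_handlers:
    # Route the "default" channel to the rotating file handler, as JSON.
    default:
      handlers:
        - name: rotating_file
          formatter: json
  monolog.processors:
    # Enrich every record with extra context before it is written.
    - current_user
    - ip
    - introspection

services:
  monolog.handler.rotating_file:
    class: Monolog\Handler\RotatingFileHandler
    # One file per day, keep 10 files, log from "info" level up.
    arguments: ['public://logs/drupal.log', 10, 'info']
```

The file is then registered in settings.php with something like `$settings['container_yamls'][] = 'sites/default/monolog.services.yml';`, which is the standard way to add a YAML file to the Drupal service container.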
For example, in a cloud infrastructure you can add the name of the pod or the ID of the cluster; or, in an e-commerce website, the ID of the order, to filter all the logs generated by a user with a given order, and so on. The problem here is that in a cloud-native environment we have different servers, different pods, and we need to extract the logs from every instance of our infrastructure: we have to discover them, scrape them, and send them to a log collector. So we are using Promtail, which is an agent from Grafana that can scrape and extract logs from files, for example, and send them to this other project, again from Grafana, called Loki, which is a log collector. Loki is the storage that persists the logs. So, in a cloud-native environment, we will have an instance of Promtail deployed on every server or on every pod, sending data to an instance of Loki somewhere. Configuring Promtail is quite easy. The important parts are that you specify the path where the Drupal logs are, the pieces of information you want to extract from the JSON and expose as labels to Loki and Grafana, and then the endpoint of the Loki collector that will aggregate and store all the logs. At the end, on the Grafana dashboard, we can create a panel like this with all the logs generated and recorded. So basically we are using Monolog to generate logs on the file system, then Promtail reads the logs and sends them to Loki, and then we use Grafana to query and show all the logs. Okay, the second pillar of observability is metrics. A metric is a measure of a specific value at a specific time (we will get to Prometheus in a moment). For example: the number of times we receive a request, how much time we spend creating a response, how many users or nodes have been created, the number of modules with security issues, the number of orders in an e-commerce website, and so on. So, basically, specific values that change over time.
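Before moving on to metrics, here is a sketch of the Promtail configuration just described. The paths, labels and Loki endpoint are placeholders you would adapt to your own deployment:

```yaml
# promtail.yml (sketch; paths and endpoints are placeholders)
clients:
  # The Loki collector that aggregates and stores the logs.
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: drupal
    static_configs:
      - targets: [localhost]
        labels:
          job: drupal
          # Where the Monolog rotating file handler writes its files.
          __path__: /var/www/html/sites/default/files/logs/*.log
    pipeline_stages:
      # Extract fields from the JSON log lines and expose them as labels.
      - json:
          expressions:
            level: level_name
            channel: channel
      - labels:
          level:
          channel:
```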
To do that, we are using Prometheus. Prometheus is a time-series database that collects events with the specific timestamp at which they occur. Prometheus is open source software; it was developed by SoundCloud, but right now it is an independent open source project, and it was the second project incubated by the CNCF, the Cloud Native Computing Foundation, just after Kubernetes. Okay, to expose metrics to Prometheus we are using a module for Drupal called Observability Suite, abbreviated O11y. This is a standard Drupal module, so I can install it using Composer and then enable it using Drush; for this example, we enable the metrics and metrics-requests submodules. Then we have to configure it. For example, we can enable the node count and user count collectors to expose the number of users and the number of nodes in our Drupal site, and for the nodes we can specify the content types we want to expose. Then (the slide is a little truncated) in this example we have to enable access to the Prometheus data for the anonymous user, so an external system can query data on this /metrics endpoint. The module will collect data about, in this case, users, nodes and requests, and expose them on a /metrics endpoint of our Drupal website. And because PHP doesn't use an application server, so it doesn't keep anything in memory, we have to store data somewhere between requests: in the default implementation we use the database, but the module also supports Memcache or Redis if you want to store the information in those systems. The main observability metrics module exposes to Prometheus data about the PHP info, the number of nodes, the number of users, the number of items in queues (if you use the Queue API of Drupal), and the number of extensions enabled. But there is a set of submodules you can install that provide metrics about caching, configuration, database requests, updates, comments.
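The installation steps just described would look roughly like the following. Note that the package and submodule machine names below are assumptions for illustration; check the module's project page for the real ones:

```shell
# Sketch only: the machine names here are assumptions, check the project page.
composer require drupal/o11y_metrics
drush en o11y_metrics o11y_metrics_requests
drush cr
```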
It's quite easy to write a new submodule to collect and expose some other kind of information that you may need. If you enable the module, configure it, and then go to the /metrics endpoint, you will see a text response in the standard Prometheus format. In this case it exposes a histogram, which is a data structure in Prometheus, saying, for example, for the user.admin_create route and for a set of time buckets (these are in seconds), how many responses took that long to be returned. So for example, we have four requests that take two and a half seconds, six that take five seconds, and so on. With all this information we can, for example, generate heat maps showing all the response times, say from five milliseconds to 10 seconds, and here we can see the points collected by Prometheus. The lighter the color, the more data we have at that point; if the color is near zero our site is faster, if it is near 10 seconds it is slower. So just by looking at this graph we can understand if we have some problem with the response time. Configuring Prometheus is quite simple in our case: we just give it a job name, we say that Prometheus should call the /metrics endpoint every five seconds, and we give the URL of our website. So again, we are using the observability metrics module to store metrics in the local database, then Prometheus collects them every five seconds, and then we use Grafana to create dashboards on that data. Okay. The last pillar is traces, or distributed traces, because we probably have a distributed system; with distributed traces we can understand the whole request/response flow, all the layers the data passes through from the beginning to the end. For collecting traces we are using this project called OpenTelemetry, which is quite a new project.
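The Prometheus scrape configuration described a moment ago can be sketched like this; the target is a placeholder for your Drupal site:

```yaml
# prometheus.yml (sketch; the target is a placeholder)
scrape_configs:
  - job_name: drupal
    scrape_interval: 5s   # call the endpoint every five seconds
    metrics_path: /metrics
    static_configs:
      - targets: ['drupal.example.com']
```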
OpenTelemetry was created from the merge of OpenCensus, from Google, and OpenTracing, from Uber. OpenTelemetry is also incubated by the Cloud Native Computing Foundation, and it can be used to collect metrics, logs and traces. We are using it only for traces in this case, because the PHP implementations of the SDK for metrics and logs are not ready, so we are just using it for traces. But it has become the standard for collecting all this observability information, and all the main cloud vendors are adopting it. As I said, the logging support has not yet been implemented for PHP; metrics and tracing are in pre-alpha status, with tracing usable and metrics not yet. The JavaScript SDK, for example, is more mature. Okay. OpenTelemetry is made of different components, different pieces. One of them is the OpenTelemetry Collector, which is basically a middleware that receives all the data from your application and can deliver it to other systems. This is useful because it is vendor agnostic, so you can switch the storage for traces, logs and metrics while all of your code stays the same. In our example we are using Tempo to store traces. So we use the OpenTelemetry Collector to collect the traces, then we send them to Tempo, which is a storage engine for traces, again from the Grafana project, that can be used to store and then query trace information. The Observability Suite module has a submodule to deal with traces, so we can enable it: observability traces. And because usually you want only one library that instruments your code to extract information, while you may have different modules that deal with that data, we created a third project called Tracer, which contains all the logic that instruments the application, plus different plugins that send the traces to different backends. One of them is the Observability Suite module.
The other is the Web Profiler module, which is useful in a local environment when you develop a website. To configure which backend to use, you specify it in the settings.php file; here it says that the Tracer plugin, in this case, is the observability one. If you want more information about how this works, there is a blog post on the SparkFabric website that explains the whole integration between the observability module, Web Profiler and Tracer. So, we are using tracing to understand the journey of our request and response through a distributed system. In this case, to give you an example, we are using a Drupal 10 website that renders a page with data from an external microservice. Basically, it's quite simple: we have a custom controller that renders this microservice-one route. The controller contacts an external service via HTTP, takes the response and renders it on the page. The controller is quite easy: we use the HTTP client service from Drupal core to call an external microservice, which is written in Go, just to use a different technology. Then we take the contents of the body of the response, log the message, and return a Drupal render array that prints the message. To expose all the traces, we send them to the OpenTelemetry Collector and then to Grafana Tempo. The configuration of the OpenTelemetry Collector is quite easy: again, we are using HTTP as the communication protocol, and we are sending all the collected traces in batches to an instance of Tempo. Quite easy. The Observability Suite automatically instruments your application; you don't have to do anything if you just want traces about the events dispatched by the event component of Drupal, the Twig templates, the HTTP calls, the database queries and the services, that is, every time Drupal requests a service from the service container. But you can also trace your own code very easily.
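The collector configuration mentioned above (an HTTP receiver, batching, a Tempo exporter) can be sketched like this; the endpoint and ports are placeholders:

```yaml
# otel-collector.yml (sketch; endpoints are placeholders)
receivers:
  otlp:
    protocols:
      http: {}       # receive traces over OTLP/HTTP from Drupal

processors:
  batch: {}          # forward traces to the backend in batches

exporters:
  otlp:
    # A Tempo instance speaking the OTLP protocol.
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```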
To trace your own code, for example after you receive the response from the external microservice, you can call some complex method, and in this complex method you retrieve the Tracer service, create a new span with some labels and some values, then do the hard work (okay, sleep one second, for example), and then stop the duration of that span. After you do that, if you go in Grafana to a panel created from Tempo data, you can see that there are a lot of lines before and a lot of lines after, but at some point Drupal calls the controller that will render that page. The controller does an HTTP call that lasts one second. Then we see, in a different color because these two traces come from the Go microservice, the external call that takes one second, for example, to complete. Then some services again from Drupal, and then this custom trace that takes one second because of the sleep. So we can understand the whole flow of the request through all the services and layers of Drupal, then the external services, then the response reaching Drupal again, and all the service calls and Twig calls and database calls needed to render the page to the user. So in this case we use the tracing module, which sends data to the OpenTelemetry Collector, which sends data to Tempo, and then we use Grafana to show all the information. This is useful, for example, if some error occurs. For example, this is another controller that calls another endpoint in some other microservice, and that endpoint returns an error. For the user we need to print some pretty error to tell them that something is wrong, but internally we have to understand why the microservice returned an error. This is easy, because all the requests between Drupal and all the microservices contain a trace ID that correlates the same request across every layer it passes through. For example, this /microservice-2 request that returned the error has this trace ID.
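The custom span from the slide might be sketched like this. The service ID and the method names here are assumptions based on the description in the talk, so check the Tracer project for the actual API:

```php
<?php

// Sketch only: the service ID and method names are assumptions; check the
// Tracer project for the real API.
function complexMethod(): void {
  // Retrieve the tracer service from the Drupal service container.
  $tracer = \Drupal::service('tracer.tracer');

  // Open a new span with a label and a value describing the work.
  $span = $tracer->start('custom', 'complex_method', ['caller' => 'microservice controller']);

  // Do the hard work (here, just sleep one second as in the demo).
  sleep(1);

  // Stop the span so its duration is recorded and shipped to the backend.
  $tracer->stop($span);
}
```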
Looking for the same trace ID in the logs of the microservice, for example, I can see that "endpoint 2 is not implemented": we are calling an endpoint that does not exist, and we can easily understand where the problem is in this case. Without this trace ID correlation, in a production environment where there are many, many more logs, it may be difficult to understand which log of the microservice is correlated with a log in Drupal. We can do this easily because we can create a processor for the Monolog module we saw in the beginning that adds the trace ID to every line we log. The observability module does that for you: there is a processor for Monolog that adds the trace ID to the logs. So, in the set of Monolog processors, we can add this tracer processor that adds the generated trace ID. Then we can configure Grafana to link from logs to metrics to traces, and we can easily find the information we need. And the same information, the trace ID of a page, is added both to the Web Profiler toolbar, in this trace ID line, and as a response header, so you can just find it in the response of every page. Okay, now I just want to show you a quick demo. This is a standard Drupal 10 website with the Umami profile. If we go, for example, to this microservice-one endpoint, we see that we received this data from the remote microservice, and in the dashboard created by Grafana, okay, for example, this line has been logged from the controller that renders that page with this data from the microservice. And here, if we click on this Tempo link on the trace ID in the log, we can see all the information that has been traced from the internals of Drupal and then to the microservice. There is a lot of information here, but basically we can see all the services that Drupal requests from the service container, starting from the module handler, then the stream wrapper, and so on.
So here we have the name of the service extracted from the service container. Then, at some point, Drupal performs a query to the database. And then, okay, at some point, as we saw in the slide, it calls the controller that renders that page. The controller will use the Tracer service, because it's injected in the constructor. Then it performs the HTTP call to microservice n.1. The microservice itself is instrumented to expose traces to Tempo and Grafana, so this "n.1" and this "sleep" trace come from the microservice in Go. Then, for example, because in the code, just after we receive the response from the external microservice, we log it, we need to take the logger factory from the service container: we see that just after the response of the microservice our code requests the logger factory service, which probably requests the Monolog rotating file handler service, and so on. Then we trace our complex method, which takes one second to complete. Then we have this event, kernel.view, which is the event Drupal dispatches to render the page. And then at some point, okay, at some point Twig renders page.html.twig, and so on until the end of the page, because there are a lot of services called during a page request. And this is the dashboard. So here we have all the request and response times from our Drupal instance and, for example, the number of users created. So if I go to the website and add a new user, the Grafana dashboard, which refreshes every five seconds, right now shows 12 users; if everything works, the next refresh will show 13.
So we can see in real time all the metrics and the logs and, if you want, the traces of our system. And if we have a different dashboard, like this one that measures the CPU, memory, network and so on of our cluster, we can correlate, maybe, a spike in CPU and memory with a spike in node creation or user logins or errors or whatever, without logging in to Drupal or logging in to the server to extract the logs, everything from a single place. Okay, if you want all of this, the whole Docker stack containing all the services, Tempo, Grafana, Loki, is online on this repository, so you can use DDEV to spin up the whole stack locally. In the repository there is also the Go code of the microservice, if you want to look at it, and also an example that uses (maybe it's too small) a tool from Grafana called k6, which is useful for doing stress tests and performance measurement: you can write a test plan using JavaScript and then execute it to stress test your server. If you go to the k6 folder and, okay, if you run k6 with the test script and go to the dashboard, you will see data flowing, because we are creating logs, creating nodes, performing calls to the microservices and so on, so you fill the dashboard with data without going to Drupal and creating it by hand, to demonstrate that everything works. Okay, so, are there any questions? We have five minutes.

Hi, thanks for the very interesting talk. I have one question: what effect do you observe in terms of performance?

Okay, of course there is some impact on performance when instrumenting an application. I don't have specific data about it, but of course some CPU cycles go into the instrumentation. You can mitigate this by doing the cheap things first, like writing logs to a file, or sending traces to a collector that is on the same server where you are, which is quick, and then shipping logs and traces and so on to the real storage afterwards. So you can optimize where you spend the time, okay.
Thanks for the presentation. Just a quick question: is there any reason why you overlooked something like Splunk for log collection?

Okay, you can do the same with Splunk, it's the same. We are using Grafana just because we can use the Grafana dashboards to create everything we need in the same place. But you can deliver logs to Splunk and then query them from Grafana. So you can use Splunk just for storage and then use Grafana to query the logs from Splunk.

Okay. I wanted to ask which of these tools needs a subscription?

It's all open source. You can use Grafana Cloud, which is a paid service, if you don't want to manage all the components: they will provide you an instance of Tempo, Loki, the Grafana dashboard and Prometheus on the cloud, and you have to pay for that. If, instead, you want to use the open source version of everything, you just have to deploy it somewhere. It's all free.

How do you deal with sanitizing log data, like personal information?

Good question. For example, you can write a processor to remove that data: in a processor you receive a log record from Monolog with all the information logged and structured, so you can remove, for example, everything that is sensitive from the logs. The module doesn't do anything there, because we don't know the actual message you are logging.

Hi. Does the observability module provide any sort of JavaScript error handling, so traces or errors picked up on the client side can be shipped?

Not at the moment, only the back end. Thank you.

Okay, I think time is over. So join us at the Contribution Sprints, because every project in Drupal needs some hands, and fill in the session survey on the mobile app. Thank you.