Okay, so my name is Olivier, I come from Switzerland, and I'm very, very happy to be here. We are going to talk about data visualization and how to bring it into Backstage. One objective of the session is to share a recipe that you can apply to bring visualization into your developer portals. Before that, I'm going to provide some context and explain why we're interested in this topic. The broader context is software analytics, something we've touched on during the day even if we have not used the term. So what is it? It's a research field that has been active for more than 20 years, and it's basically analytics on software data to provide information to software development teams and leaders. In the previous sessions we've talked about all this data that is available and can be presented to developers: there is data in the code base, there are metrics, there is data in the activity that developers perform, and, in green on the slide, there is data about how developers feel, the friction and the satisfaction they experience. So when we talk about software analytics, the first questions are: what data are we going to work with, and what questions will we try to answer with it? Very often we then come to the idea that visualization is a very powerful tool for exploiting data and answering those questions. Before talking about software data, I want to run a quick demo. This is an article that was published in the New York Times several years ago about the taxes that companies pay in the US. Behind the visualization there is a data set. What do we have in it? We have companies, each with earnings, a tax rate, and an absolute amount of tax paid.
And what we see here is a non-conventional data visualization where the author, a visualization expert, has mapped data properties to graphical attributes. What are the graphical attributes? There is the position of each circle on the x-axis; every circle represents a company. There is the color, used to create buckets, and the size, which represents the amount of tax paid by the company. This is a visualization I can use to explore the data; I can run queries. I can also use animations to do a breakdown by company. In the previous session we were talking about telling stories with data, and this is a very good example of storytelling with data: there is a lot of text, a lot of annotations, and at the same time it's an interface I can use to explore the data set. This kind of visualization provides inspiration, and the question is: can we apply this kind of approach to the data that is managed by Backstage, or through Backstage? If we have the data catalog, can visualization make it easier to get insights about the information in the catalog? We've seen before that the catalog can grow to hundreds, thousands, even tens of thousands of entities. How can you make sense of this data, and how can data visualization help? On the right you see some of the benefits of using visualization. You get faster insights. You also get deeper insights: there are things that stand out in a visualization that would not stand out in a standard table or in text. And it's also something that's very enjoyable: exploring a data set, if the interactive visualization is well crafted, triggers curiosity. If you think about the adoption of the developer portal, having a tool that invites users to explore, to query the data, to understand what is going on in the development organization is something very powerful.
So in the rest of the session I'm going to go through three steps, and the idea is to build an end-to-end demo: get some data from a development team, build a data visualization using a tool I'm going to present, Vega, and finally bring this visualization into Backstage with a plugin. The data, as we've discussed in several sessions, comes from everywhere: from all the tools we use to develop and operate software, and from Backstage itself. Tech Insights and the Backstage analytics are also data sources that are interesting to absorb and render visually. So how do you get the data so that you can use it for visualization? On this slide I've tried to describe four patterns that range from very simple to more sophisticated. On the left is usually where we start when we create our first Backstage plugins: we create a front-end component that talks directly to a data source. Maybe it's GitHub and you use the GitHub API; maybe it's one of the APIs exposed by Backstage. Either way, you extract data that is ready to use and you make it visual. The second pattern is a slight variation where you also have a back-end plugin, usually because you have some authorization or CORS constraints to deal with, or possibly because you want to optimize performance with some caching. But the idea is very similar: your plugin interacts directly with the data source. These first two patterns have benefits. They are very simple, you don't need to deploy additional infrastructure, and because you are talking directly to the data sources, you're querying and exploring live data. However, there are many use cases where you need more, where you want to decouple the data collection and processing from the data querying and visualization. These are the two patterns on the right.
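The first of those patterns fits in a few lines. Here is a hypothetical TypeScript sketch: the endpoint is GitHub's real public contributors API, but the function, its shape, and the injectable fetch are my illustration, not the talk's code.

```typescript
// Pattern 1 sketch: a front-end component fetching ready-to-use data
// straight from a source, here GitHub's public contributors endpoint.
type Contributor = { login: string; contributions: number };

export async function fetchContributors(
  owner: string,
  repo: string,
  fetchFn: typeof fetch = fetch // injectable so the sketch is testable offline
): Promise<Contributor[]> {
  const res = await fetchFn(
    `https://api.github.com/repos/${owner}/${repo}/contributors`
  );
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
  // The endpoint already returns one record per contributor with a
  // contributions count, so the data is ready to hand to a chart.
  return (await res.json()) as Contributor[];
}
```

The appeal, as noted above, is that there is nothing to deploy and the data is live; the trade-off is that every page load pays the cost of the upstream API call.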
In the third pattern, you have a decoupled pipeline which queries the data sources, applies some processing, and stores the result in what I call a static metric store. What do I mean by that? I mean that I simply generate files, maybe CSV files or JSON files, and these files are then served to the renderers. In the most sophisticated pattern, on the right, you have a dynamic metric store to which you can submit interactive queries. A very good example is Elasticsearch or OpenSearch: you deploy it, you feed it with data, and then you use the aggregation query language to run interactive queries and get the data back. One example where you need this last pattern is visualizing data from Jira. I guess we all know Jira, and we probably all know the cumulative flow diagram (CFD), which shows the amount of work in each state over time. This is the kind of visualization that you cannot build by interacting directly with the Jira API. The Jira API gives you the ability to query and obtain the list of transitions between the states of the tickets, but you definitely need some processing to compute the chart. So it's a good example of a case where you need this decoupled pipeline. The pattern that I'm going to use for the demo is much simpler. Call it KISS analytics: the data source is Git itself, so I'm not even going to run queries against an API; I'm going to run the git log command to obtain the raw data from my source code history. For the pipeline, I'm going to use GitHub Actions, and then I'm going to store the results directly in GitHub. Of course, there are many reasons why this is not a very good idea; the main reason I made this choice here was to have an end-to-end demo, and I will come back to the things to watch out for. The first step is the workflow, and the slides are available online.
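That extraction step boils down to reformatting git log output as CSV. A minimal sketch follows; the `%H;%an;%aI` format string and the `;` delimiter are my assumptions for illustration, not necessarily what the talk's pipeline uses.

```typescript
// Turn raw `git log --pretty=format:%H;%an;%aI` output (one line per
// commit: hash, author name, ISO date) into CSV text with a header row.
// Format string and ";" delimiter are assumptions for this sketch.
export function gitLogToCsv(rawLog: string): string {
  const header = "hash;author;date";
  const rows = rawLog.split("\n").filter((line) => line.trim() !== "");
  return [header, ...rows].join("\n") + "\n";
}
```

In the workflow, this would be fed from the git log command run against a full clone of the repository.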
In the slides you have links to the GitHub repo, so you can access the code and rerun everything I'm showing today yourself. The first step is implemented as a GitHub Action. It starts by cloning the repo; just be aware that by default a clone does not fetch the entire history, so you need a special parameter for that (with actions/checkout, that's fetch-depth: 0). I run the command, and again, I store the metrics files directly in Git. Why is that not a good idea? Well, it depends on the kind of file you generate, but if you generate files bigger than a few kilobytes, it's certainly not a very good idea to store them in Git. An easy evolution of this pattern is to use a cloud storage bucket. It's very similar to what is done with the TechDocs recommended architecture, where you build the documentation in the CI pipeline and push the outcome to S3 or another cloud storage. The pipeline also runs custom code, and what I'm doing here is computing aggregations. When you run git log, you generate a CSV file with one line for every commit. This is very interesting because you have all the information, but you can imagine that if you have to send this data set to the client side for rendering, it's going to pose performance issues. The purpose of this code is to work on the git log output and compute aggregations; in this case, one of the aggregations is the number of commits per author. As a side note, you see that the highlighted code uses a library called D3. D3 is closely related to Vega, which I'm going to introduce later, and is best known as a visualization library. Here I'm on the server side, not generating any visualization at this stage, but D3 has very useful aggregation functions that I'm using. This is a very simple example; in this case I don't do any data cleaning, so be aware that the data in Git is never really clean. You always have pseudonyms.
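The commits-per-author roll-up can be sketched without D3. The talk's pipeline uses D3's aggregation helpers for this; the plain-TypeScript equivalent below (types and names are mine) shows what the computation amounts to.

```typescript
// Roll up one-line-per-commit records into commits-per-author counts.
// This is the kind of aggregation D3's helpers (e.g. d3.rollup) do in one
// call; written out here so the sketch has no dependencies.
type Commit = { hash: string; author: string; date: string };

export function commitsPerAuthor(commits: Commit[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of commits) {
    counts.set(c.author, (counts.get(c.author) ?? 0) + 1);
  }
  return counts;
}
```

The payoff is size: a few kilobytes of counts instead of megabytes of raw commit lines shipped to the client.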
You always have duplicates in the author names. If you want to build a proper software analytics pipeline, these are things you need to handle. The GitHub Action has been deployed: for the demo, I forked the Backstage repo, added the action in a custom pipeline, and I'm now able to run it and compute the metrics for the Backstage code. This is what we see on the screen: the list of files generated by the pipeline. If we have a look, the raw data, with one line per commit, is 2.7 megabytes, while the aggregation by Git author is 16 kilobytes. This gives you an idea of the difference in size when you apply the aggregations. That's the preparation part. Now the interesting part: how can we create the visualization? There are, of course, many approaches to data visualization. I'm going to talk about Vega, an open source library built on top of D3. Vega is described as a visualization grammar, a declarative language for creating, saving, and sharing interactive visualizations. The idea when you use Vega is that you don't write JavaScript code, you don't apply an imperative approach. Rather, you describe what you want your visualization to look like. You start from a data set, with rows and columns, and you transform these data attributes into a visual representation. Am I going to represent the elements with squares, with bars, with lines? Am I going to use some of the columns for the y-axis, the x-axis, the size, the color? This is the approach you apply, and so the elements of the grammar are the axes, the legends, and the marks; below that, for this to work, you work with scales that map values from domains into ranges, and you work with data.
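To make the grammar concrete before the demo: here is a minimal, hypothetical spec in that style, written as a TypeScript object. The data values and field names are made up for the sketch; in the talk the data comes from the pipeline's CSV files.

```typescript
// A minimal Vega spec showing the grammar's building blocks: data, scales
// mapping domains to ranges, axes, and marks. Values are invented.
export const spec = {
  $schema: "https://vega.github.io/schema/vega/v5.json",
  width: 300,
  height: 120,
  data: [
    {
      name: "commits",
      values: [
        { author: "alice", count: 28 },
        { author: "bob", count: 13 },
      ],
    },
  ],
  scales: [
    // Map commit counts onto the horizontal extent of the chart.
    { name: "x", type: "linear", domain: { data: "commits", field: "count" }, range: "width" },
    // One band per author along the vertical axis.
    { name: "y", type: "band", domain: { data: "commits", field: "author" }, range: "height" },
  ],
  axes: [{ orient: "left", scale: "y" }],
  marks: [
    {
      // One rectangle per row; its width encodes the commit count.
      type: "rect",
      from: { data: "commits" },
      encode: {
        enter: {
          y: { scale: "y", field: "author" },
          height: { scale: "y", band: 1 },
          x: { value: 0 },
          x2: { scale: "x", field: "count" },
        },
      },
    },
  ],
};
```

Notice there is no drawing code anywhere: the whole chart is a description, which is what makes specs easy to save, diff, and share.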
What you see on the screen are examples of data visualizations realized with Vega. There are standard-looking graphs, like bar charts and scatter plots, but also visualizations that are a bit more original. What I'm going to do now is show you concretely what a Vega spec looks like and how you can create one. The important point is that at this stage I'm really just looking at the data set and using Vega to create a visualization, with the idea that later on this will be integrated into Backstage. To do this work, I don't need to build the Backstage codebase, and I don't need any specific knowledge about it. This ability to decouple the data visualization work from the Backstage implementation is quite interesting. If I move to the editor: the Vega editor is an open source project as well, by the same community, and it's an easy way to create your JSON specification, test it with a data set, and have your visualization ready to go. I've prepared a first example here, and what you see is a blank sheet of paper. The first step when you use Vega is to prepare the data. Maybe in some cases you have a data set that is ready to use: tabular data with lines and columns, exactly what you want to render. But very often you have data and you want to apply transformations, filtering, aggregation, and so on, on the client side. So let me zoom in a bit. On the left is the data part, where we have a sequence of transformations. In the first one, I define a data source called commits and pull the data from this URL; in this case, it's the CSV file generated in my pipeline by the GitHub Action. What's interesting is that this URL could point to an endpoint exposed by Backstage.
So with Tech Insights or a metrics endpoint, I could talk directly to Backstage or to a data source, receive the CSV data, and have it ready for exploration and rendering. You see that I can apply transformations like aggregations. For example, I want to group by author, count the number of commits, and obtain the date of the last commit. When I specify a transformation like this, I'm transforming one data set into another, and the Vega editor gives you the list of data sets here. So authors now has this shape: a list of authors with, for each one, the number of commits and the date of the last commit. I'm not going to go through all the transformations, but what this pipeline does is the aggregation, then the creation of buckets so that I have, let's say, the top 10 contributors plus a bucket for all the other contributors. In the end I have my data set here, with a column named bucket, a rank, a number of commits, and the date of the last commit, sorted in descending order by number of commits. The data is prepared. Note that you can do some of the data preparation in the back end, computing some of the aggregations there, and some of it on the front end; if you do it on the front end, you get the ability to do live filtering and to let the user specify parameters. If I open the second visualization here, it has exactly the same data transformation pipeline, but you see that now a graph is rendered. This is the second part of the Vega specification, with the following elements. First, we define scales; in this case I have three. The first scale I use for the width: its domain comes from the buckets data source, the number of commits. So let's say the smallest number of commits is one and the maximum is 4,000.
That's the domain, and the range, in this case, I want to map to the width of the diagram. I have a second scale for the y-axis; here I work with the height of the graph, and remember I have this field named bucket, which contains the name of the bucket. The last scale is a color scale, for which I use the date of the last commit. I then define the axes, and by changing an attribute here, let's say the grid color, you see the change applied immediately. What you see on the screen are some of the attributes supported by the Vega specification, but you have access to a lot of parameters to fine-tune the visualization. The documentation of the project is very good, and you get a lot of flexibility. The last part of the specification is what are called marks. A mark is any element drawn on the diagram. In this case I have these blue rectangles and two text labels, one on the right and one on the left. This is where I say: I have rectangles; the width of each rectangle depends on the value of the number-of-commits field through the corresponding scale, and for the position on the y-axis, I take the bucket field of the specific row and map it through the buckets scale. So this is what a Vega specification looks like, and in the slide deck you have the step-by-step explanations I just gave. Again, the important point, and one of the major selling points of Vega, is that the visualization is expressed in JSON, so it's something you can easily port across tools. We have a tool called Avalia Slides here, an interactive slide deck: I can move through my presentation, and some slides contain visualizations.
So Avalia Slides is meant to be used as a presentation tool, and what we have at the center is a Vega spec. If I go to the next slide, it's similar: another one, a bar chart, and you can even do fancier visualizations with a force layout. All of these are expressed as Vega specs. So to conclude, the question was: if I have these visualizations, what can I do with them in Backstage? We've created a plugin that you can use. It's open source: you add it to your Backstage installation, and then in your page or in your tab, you just add a fragment where you reference your Vega specification and specify the size of the widget, and it will appear in your portal. If I show you a demo, this is the kind of thing you can do. We even have the ability to use styles: if I have a look at our portal, you see that you can apply the visual theme of your Backstage implementation to the visualizations. With that, I'm coming to an end. You have the references to the plugin, and in the slides you have all the examples. We are very eager to work with people who are interested in visualization, people who want to try the plugin, and more generally people who are interested in visualization in the context of Backstage. There are tens, hundreds of use cases, and we would be very interested to discuss and explore these ideas.