Thank you. Hello, everyone. I am Paru, an engineer at Red Hat, and with me I have an intern from Toronto. You will see him later, but right now you will just hear his voice. Guy, can you say hi to the audience? Hi, everyone. Yeah. So that is the voice.

Today we are talking about how we can have sustainability in computing, and we are going to introduce our cloud native sustainability computing stack. We'll talk about a few projects and see what we can do with them. We'll start with a little background on the current state of computing and sustainability, then introduce the sustainability stack, and then we'll have some code demos.

To give you a little background: in 2021, an ACM technology brief estimated that the Information and Communication Technology (ICT) sector accounts for 1.8% to 3.9% of global carbon emissions, close to 4%, which we think is a lot. That prompted us to ask ourselves some questions. First, how can you measure energy consumption indirectly? You cannot physically go to the data center, and you don't get access to the cloud provider's hardware to attach power-monitoring equipment. Second, how do you attribute or measure the energy consumption of particular workloads? That would give us an estimate of how much carbon and greenhouse gas a particular computation is responsible for. And last, how do you attribute power on shared resources, since the cloud is a shared resource that runs containers and pods?

There are various ways you can measure energy. This slide shows the ideal targets, what actually happens in reality, and how you can tackle those problems. One measure is frequency: ideally you would monitor every circuit's frequency, but in reality you cannot, because the Linux kernel dynamically changes CPU frequency for power saving; the solution is to use the average frequency as an approximation. You could also measure capacitance, by monitoring the number of circuits that are powered on during a computation; again, there's no direct way to do that on current CPU architectures, so you resort to counting CPU instructions to approximate it. And lastly there's execution time, the duration during which the computation keeps circuits powered on; there's no direct way to measure that either, so you use CPU cycles to approximate it.

So our methodology, which is based on papers we have published, is to say that the CPU utilization of a process is roughly directly proportional to its share of the total power consumption of that workload. Say you have one pod with one container that is consuming 10% of the CPU: then you can attribute 10% of the CPU power consumption to it. Similarly, if you have multiple pods with multiple containers running, and each container has several processes, then 50% of CPU usage means 50% of power consumption. Our methodology uses software counters to measure the power consumption of hardware resources, we have trained machine learning models that approximate the power consumption, and we use hardware resource utilization — CPU, GPU, memory — to attribute the total power consumption to processes, containers, or pods.
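To make that ratio-based attribution concrete, here is a minimal sketch in Python. The numbers and names are hypothetical; in Kepler the node power and per-pod CPU time come from power readings and eBPF counters, not hard-coded inputs.

```python
def attribute_power(node_power_watts, cpu_time_by_pod):
    """Split a node's measured power across pods in proportion to CPU time."""
    total_cpu_time = sum(cpu_time_by_pod.values())
    if total_cpu_time == 0:
        return {pod: 0.0 for pod in cpu_time_by_pod}
    return {pod: node_power_watts * t / total_cpu_time
            for pod, t in cpu_time_by_pod.items()}

# A pod using 10% of the node's CPU time is attributed 10% of its power.
print(attribute_power(100.0, {"pod-a": 1.0, "pod-b": 9.0}))
# {'pod-a': 10.0, 'pod-b': 90.0}
```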
Our cloud native sustainability stack right now has three pieces: the project Kepler, which we'll talk about more; the model server, which Guy is going to introduce to you; and lastly PEAKS, a Kubernetes power-aware scheduler that schedules workloads onto the nodes best suited for them.

The first project we are talking about is Kepler, the Kubernetes-based Efficient Power Level Exporter — a very complicated name. It uses software counters to measure the power consumed by the various hardware resources and exports them as Prometheus metrics. The most interesting thing about Kepler is that it gives you per-pod energy consumption, which is quite tricky, and you can use that to directly correlate how much carbon or greenhouse gas emission a particular workload was responsible for. Kepler gives you per-pod energy consumption broken down by component — CPU, GPU, and RAM, as I mentioned — and it supports both bare metal and cloud: run it on either infrastructure and you get per-pod energy consumption. And it supports the cloud native stack — Prometheus, Grafana, and exporter metrics — so you can easily deploy it on any Kubernetes cluster.

One interesting thing about Kepler is that it is very lightweight. We don't want a power-measurement tool to have a heavy footprint — it should not consume 50% of your CPU cycles just to tell you what the power consumption of your computation was. So it uses eBPF — programs that can attach themselves to Linux kernel trace points and other performance counters — and in our experiments it takes about 2% of the CPU, which is acceptable. And it uses machine learning models to approximate energy consumption: it observes a pod's resource usage and, based on the models we have trained, estimates how much power it consumes. If you are interested in the experiments that were conducted, you can follow the references.

This is our architecture, going from bottom to top. At the ground level, eBPF programs attach themselves to Linux kernel trace points and other performance counters and collect data — process ID, CPU ID, CPU cycles, CPU frequency, instructions, cache misses, and so on — and export that data for aggregation in user space, where other stats come in: cgroup stats, GPU stats, hardware monitor stats. Then the model server, which we are going to talk about, trains regression models based on the information pushed to the aggregation layer. Kepler exports these metrics to Prometheus, where they are consumed by PEAKS, the scheduler, as well as by the model server to train models.

Here is a short demo showing how you actually get to see the energy consumption of a single pod or container. Let me try to run Kepler on a Kubernetes cluster. We are going to install a bunch of manifests. So let's do that. Okay, now let's look at the pods that have been created. We see that the Kepler exporter is running; now it's time to install Prometheus and Grafana.
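Once Prometheus is scraping Kepler, consuming the exported metrics programmatically looks roughly like this hedged sketch, which uses the standard Prometheus HTTP API. The metric name kepler_container_joules_total and its pod_name label are assumptions based on the naming described above; check your deployment for the exact names.

```python
import requests

PROMETHEUS = "http://localhost:9090"
# rate() over an energy counter (joules) yields average power (watts).
QUERY = 'sum by (pod_name) (rate(kepler_container_joules_total[5m]))'

resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                    params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    pod = series["metric"].get("pod_name", "<unknown>")
    watts = float(series["value"][1])
    print(f"{pod}: {watts:.2f} W")
```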
Okay, so we have Prometheus and Grafana both installed. Now I'm going to port-forward Grafana and load the dashboard to see what Kepler is capturing in terms of the energy consumption of this cluster. Here is the Grafana page. In the first quadrant we have the current pod energy consumption in the monitoring namespace — these are the various pods in that namespace and their energy consumption. Over here we have the total pod energy consumption per namespace: these are the different namespaces and the energy they have consumed.

Now I'm going to deploy a computationally heavy task that will drive up CPU utilization, and we'll see how it changes the energy consumption in our namespace. I've run a simple Python app that uses a Monte Carlo algorithm to estimate the value of pi, and I set it to a thousand intervals. We can see that, because of the CPU load, the power consumption has increased, and it's easy to see in the time series as well: you can see precisely when the energy consumption spiked and when it flattened out. This could be useful for deciding when to scale — what a good time to scale is — or for spotting patterns, like at what time of day there's usually a spike in power consumption. So that's about it. Thank you so much for watching our demo.

Now we will have Guy Portion explain the model server. We recorded this because, again, he is in Toronto.

So let's look at the model server and how it fits into the Kepler project. Let's begin by talking a bit about the Kepler model server. One of the core goals of the Kepler project is predicting single-pod energy consumption for Kubernetes clusters. This prediction can be done by learning a relationship between performance counters and energy consumption, and that is what the Kepler model server aims to accomplish. The ultimate goal of the Kepler model server is to provide reliable machine learning models that Kepler can use to accurately predict pod-level and container-level energy consumption, given the relevant performance counters. The Kepler model server is currently implemented with TensorFlow Keras and Flask. The main reason we chose TensorFlow over other options like scikit-learn is that TensorFlow operates more efficiently when it comes to deep neural networks.

Now let's talk about the models within the Kepler model server. Currently, it implements two linear regression models. One model predicts CPU core energy consumption given the CPU architecture, the current number of CPU cycles, the current number of CPU instructions, and the current CPU time. The other model predicts DRAM energy consumption using the CPU architecture, the current resident memory, and the current number of cache misses. Both models follow a supervised machine learning approach, and the main reason we implemented them first as linear regression models is that they are simpler to set up for experimentation. If the linear regression models are not effective enough, or are too simple, we plan to implement more models in the future, including deep neural networks, to try to achieve better-performing models and results.
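As a rough illustration of what such a model looks like, here is a minimal Keras sketch of the CPU core model's shape: a single dense unit over normalized numerical counters plus a one-hot architecture feature. The feature names, sizes, and training data are assumptions for illustration; the model server's real preprocessing and datasets differ.

```python
import numpy as np
import tensorflow as tf

# Assumed features for the CPU core model: three numerical counters
# (cpu_cycles, cpu_instructions, cpu_time), already normalized, plus a
# one-hot encoding of the CPU architecture over, say, four known values.
num_numeric, num_arch = 3, 4

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_numeric + num_arch,)),
    tf.keras.layers.Dense(1),  # linear regression: one unit, no activation
])
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

# Synthetic stand-in data; the real server builds its datasets from the
# Prometheus metrics Kepler exports.
X = np.random.rand(256, num_numeric + num_arch).astype("float32")
y = (X @ np.random.rand(num_numeric + num_arch, 1)).astype("float32")
model.fit(X, y, epochs=5, verbose=0)
```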
So let's talk about the implementation of the Kepler model server. Now that we have the models, how do we train them, and how can they be exported back to Kepler to predict pod-level energy consumption? As we know, Kepler exports node performance counters and energy metrics from a Kubernetes cluster, and that all comes from the Kepler agents present on each Kubernetes node. It exports this data as time series Prometheus metrics. This constant flow of performance counters and corresponding energy metrics for each node, at each specific time in the cluster, can be used to continuously train and retrain our models. Oh, sorry, let me clarify the attribution: nodes provide the energy metrics and the performance counters, and the same performance counters are also provided by Kubernetes pods. So, in theory, the Kepler model server can use a pod's performance counters to make predictions about its energy consumption.

Using this strategy, the Kepler model server periodically scrapes the Prometheus time series metrics exposed by Kepler's per-node agents, which are aggregated in the Kepler metrics collector and exported as Prometheus metrics to the Kepler model server. The model server scrapes them and converts these Prometheus metrics into a sufficiently large TensorFlow dataset. The dataset is then used to train the DRAM energy consumption model and the CPU core energy consumption model. Metrics like root mean squared error and R-squared values are used to check whether a model is acceptable for Kepler to use.

Once the models are trained and deemed acceptable, they need to be exposed to Kepler, so that Kepler can take advantage of a model and use it to predict pod energy consumption from pod performance counters. There are three ways the models are exposed to Kepler. The first option is a Flask HTTP endpoint that exposes the desired model as a TensorFlow SavedModel; the SavedModel can be downloaded directly by Kepler's agents and used for predictions by calling model.predict on the model in Kepler. The second option is another Flask HTTP endpoint that instead exports the weights of the desired model to Kepler; Kepler can then use those weights to calculate a prediction itself, given the pod performance counters it provides to the model. Note that the mean and variance are provided alongside the weights for all the numerical features, because all the numerical features in the models are normalized; weights are also provided for each category of the categorical features, since the categorical features are one-hot encoded. For now, we only have one categorical feature, the CPU architecture. The third option also exports the weights of the desired model to Kepler, but in this case they're exposed as Prometheus metrics.
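To illustrate the second option, here is a hedged sketch of how a client could reconstruct a prediction from exported weights. The payload shape is invented for this example — the model server's actual schema may differ — but it captures the pieces just listed: per-feature weights, mean and variance for normalization, and per-category weights for the one-hot architecture feature.

```python
import math

# Hypothetical response from the weights endpoint (illustrative values).
payload = {
    "bias": 0.1,
    "numerical": {
        "cpu_cycles":       {"weight": 0.8, "mean": 5e8, "variance": 1e16},
        "cpu_instructions": {"weight": 0.5, "mean": 4e8, "variance": 9e15},
    },
    "categorical": {"cpu_architecture": {"skylake": 0.7, "zen2": 0.4}},
}

def predict(counters, arch):
    y = payload["bias"]
    for name, p in payload["numerical"].items():
        z = (counters[name] - p["mean"]) / math.sqrt(p["variance"])  # normalize
        y += p["weight"] * z
    # One-hot categorical: only the matching category's weight contributes.
    y += payload["categorical"]["cpu_architecture"].get(arch, 0.0)
    return y

print(predict({"cpu_cycles": 6e8, "cpu_instructions": 5e8}, "skylake"))
```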
So moving on to the code demo. The demo starts with Kepler scraping the nodes of a simple Kubernetes cluster. The Kepler model server will then scrape the Prometheus data exported by Kepler, train our models from scratch, and demonstrate how to export the models via the Flask routes and via Prometheus.

For this code demo, Kepler is recording and exporting performance counters, CPU core energy consumption, and DRAM energy consumption from the nodes in the Kubernetes cluster, exported as Prometheus metrics. The Kepler model server periodically scrapes these Prometheus metrics to build a sufficiently large dataset for fitting our models. We're currently in the Kepler model server project; if we go into the server directory, there's a script called energy scheduler. Running energy scheduler will periodically scrape Kepler's exported Prometheus metrics, convert these metrics into TensorFlow training, testing, and validation datasets, and then finally fit and evaluate the models with the new datasets. As you can see, the models are currently being trained, so we'll skip to the moment when they have finished. Now that the models have been fitted, the script will sleep for a set amount of time before scraping Kepler's Prometheus metrics again and fitting the models again with more data.

Note that the test results of the CPU core energy consumption model and the DRAM energy consumption model — the R-squared values and the root mean squared errors on the testing and validation datasets — are not ideal. If we look first at the DRAM energy consumption model, we can see the coefficient of determination and root mean squared error here, as well as the coefficient of determination for the validation datasets. We can now also look at the CPU core energy consumption model results: the coefficient of determination and root mean squared error for the test dataset, and the coefficient of determination for the validation datasets. The main reason the results are not ideal is that we're training, testing, and validating these models from scratch with an insufficient amount of data. Of course, when we deploy the Kepler model server, we will make sure the models are pre-trained with a lot more data — more varied data, under a variety of different conditions and Kubernetes workloads — so we get a more accurate model. If the linear regression model proves too simplistic for our problem, we will definitely also develop a variety of other models for experimentation in the future, especially deep neural networks, to see if we can improve the results. And once deployed, beyond being pre-trained, the models will also be continuously trained with real-time data from the Kubernetes clusters they're deployed on, to ensure they stay up to date.
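For reference, the two quoted evaluation metrics are straightforward to compute. A quick sketch in plain NumPy (not the server's own code):

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 is no better than
    predicting the mean, negative is worse than the mean."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

print(rmse([1, 2, 3], [1.1, 1.9, 3.2]))       # ~0.14
print(r_squared([1, 2, 3], [1.1, 1.9, 3.2]))  # ~0.97
```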
So, in the background, the Kepler model server also exposes Flask endpoints for exporting the trained models back to Kepler for predictions. Currently there are three ways it exposes the models to Kepler. As you can see here, our Flask application is running. The first way is to let Kepler directly download the TensorFlow SavedModel; once it has downloaded the SavedModel, it can use it directly to make predictions by calling model.predict. We do this by going to the models endpoint and passing in a query parameter for the model we want: for the CPU core energy consumption model we pass in core model, and for the DRAM energy consumption model we pass in DRAM model. This endpoint allows Kepler to directly download the desired model.

The second way is exposing the model to Kepler via its raw weights — returning the weights of the desired model. We do this by traversing to the model-weights endpoint and passing in the desired model, say core model. We can then see all of the weights for the CPU core energy consumption model, including the weights for each category of the one-hot encoded categorical feature (in this case there's only one, CPU architecture), and we also have the weights, mean, and variance for the normalized numerical features. We can do the same for the DRAM model here.

The final way the Kepler model server exposes its models to Kepler is via Prometheus metrics: specifically, it exports the weights of the desired model as Prometheus metrics. By default, Prometheus metrics are always exposed at a metrics endpoint, so if we traverse to the metrics endpoint we can see the weights of all the models here.

So you've seen that Kepler exports all the metrics from the nodes, and then we have the model server that trains the models. Lastly, let's talk about the Power Efficiency Aware Kubernetes Scheduler, or PEAKS. We haven't started working on this project — we have worked on Kepler, whose code repo is there, and we have made sufficient progress on the model server; PEAKS exists only at the architecture level. The idea is that PEAKS would obtain the thermal, cooling, and power consumption metrics from Kepler, which runs on each node and exports them. PEAKS would then download the models from the model server — the ones you've seen us train — and use the appropriate model to predict which node is ideal for scheduling a particular workload.

For example, if someone specifies that their workloads should only be scheduled on nodes powered by green energy and not fossil fuel, PEAKS would take that into account when scheduling. Another metric PEAKS can use is power capacity: if it sees that a node has already reached its maximum power capacity, and it predicts that the pod to be scheduled would push energy consumption higher, it will not schedule that pod on that node but on some other node. So PEAKS is still in progress. Our future plan also includes working on a VPA, but that again is in process; we are not there yet.
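Since PEAKS exists only at the architecture level, here is a hedged sketch of the scheduling idea just described: filter out nodes that are at their power cap or not green-powered, then prefer the node where the predicted post-scheduling power is lowest. All fields and the predict_pod_watts helper are hypothetical stand-ins, not PEAKS code.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    current_watts: float
    power_cap_watts: float
    green_energy: bool

def predict_pod_watts(node: Node) -> float:
    # Placeholder: in the design described above, this prediction would come
    # from a model downloaded from the Kepler model server.
    return 5.0

def pick_node(nodes, require_green=False):
    candidates = [
        n for n in nodes
        if (n.green_energy or not require_green)
        and n.current_watts + predict_pod_watts(n) <= n.power_cap_watts
    ]
    # Prefer the node with the lowest predicted post-scheduling power.
    return min(candidates,
               key=lambda n: n.current_watts + predict_pod_watts(n),
               default=None)

nodes = [Node("n1", 180.0, 200.0, False), Node("n2", 120.0, 200.0, True)]
print(pick_node(nodes, require_green=True))  # picks n2
```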
And we are ready for questions. So many! Okay — sorry, I didn't quite get the question; let me repeat it. The question is: where did the labeled data come from — the actual power you're using for training the model? You mean the features and the label? In other words, did that data include actual node power consumption over that time? Yes, yes. And where did that come from?

So for this experimentation we had a test machine where we were running particular workloads to capture the data, but it's not very reliable, as you could see from the error numbers. So we don't have ideal data yet, but we are working with the Open Cloud, and we expect to get better data once it is integrated with that. Right now our dataset is not very accurate.

Two things — I was just curious, because there's the RAPL interface from Intel; I assume that doesn't give you accurate enough data? So if we don't already have an estimated model, then we are using RAPL — but that is not the only way we are going to do it; when it's available, we use it. Okay. And I think at the MGHPCC we do have per-rack and per-node measurements, if you want to run some experiments there. Right — we are working towards that; we're still trying to integrate it. We have a pull request with Operate First and we are getting to that stage. Awesome.

Yeah, please. So I think this is really cool — I'm really excited you guys are doing this; it's about time people looked at the energy side. I don't know if you've seen the work that Sanjay and Jonathan are doing, related to this, on the control side: in OpenShift you have a lot of knowledge about demands — the SLAs things are supposed to have, et cetera — and by doing batching and controlling DVFS, rather than just relying on the standard Linux policies, you can achieve the same SLA with dramatically reduced energy use. Is that something you've looked at, the control side of things?

So first, we are working with OpenShift — well, we are just at the brainstorming stage of how to integrate with the existing node optimizer in Kubernetes and OpenShift. The difference between us and them is that we are concentrating on measuring the workload of a pod and predicting, not optimizing; we are not at the level of optimizing. Right now we don't even know how much the energy consumption is, so we are focused on getting the value of how much energy every workload on a node costs — and then on actually working with other teams to optimize it. But we are not there yet. We are very new: a team of three people, one intern and two engineers, and we just started working.

I think it's a really exciting direction, so it's really cool — I can point you to some other folks working in related areas that you might want to talk to. Yes — we are also working with the CNCF to start a sustainability working group inside the cloud native community and to incubate more projects.

Thanks for the talk. I was wondering — one of the nice properties of a linear regression model is that there are formulas for incremental updates: as a new data point comes in, you can literally update the weights, so in theory you don't even need batch retraining; you can just update as data comes. Have you explored leveraging that as a substitute for batch retraining? Guy, did you get the question? Yeah — right now we only have simple linear regression models set up. And we have an expert back here.
I mean, we would definitely consider it. Again, we are very new and we just had to come up with a POC for DevConf, so right now everything is very simplistic, but that is in the plan. Yep. We have two minutes.

Thanks for the talk — I'll make it quick. My first question: you said this is pretty new. Have you pictured Kepler being part of the default? We have this cost management, and there is an operator called Curator, which is aligned with cost management as well. Have you thought about actually talking to those people to see if they can ship Kepler in the default package? That's my first question. My second question is about user adoption: there are a lot of startups and companies out there, like Datadog and others, who are concerned about the cost side of Kubernetes but are now looking into the sustainability part as well. Have you interacted with that community, and what has the reception been?

So I have about 15 seconds to answer. Definitely, we are looking for collaborators. We are talking to a lot of teams within Red Hat — the first team we started collaborating with is Seth's, and we have one of their people at the talk. We are also aggressively pursuing community outreach: that's why we are starting this working group in the CNCF and presenting at every venue. We are not just looking for people willing to consume our project, but also for open source contributors. So that is definitely the plan. And I exceeded my time by 20 seconds, sorry. I will be around the venue, so please feel free to reach out to me and we can talk about this. Thank you so much. Thank you.

Thank you. Slides are on me. Awesome — are we starting at any point, or are we going to get a cue? I have a background myself in architecture and project development, so I touch on all sorts of things when it comes to the built environment; lately it's been digital twins, for owners who would really benefit from monitoring their equipment. So, about our talk today, to give you a sense of what we're covering: IoT, digital twins, and being able to display them in a visual format — visualization, data, and what kinds of data formats are out there. Because if you have all the data out there, that's great, but how do you actually let a person — whether on a project team, an executive, anybody — see that data in intuitive and different ways? I'll also give a global perspective on how digital twins are used on projects, and their benefit across the project life cycle, so you get a sense of what's possible out there, all together in a global scope. We won't be diving into the weeds, but we will touch on lots of different things related to this scope of visualization and digital twins, and how it all comes together.

Here's the table of contents to pace us: what are digital twins — that's always a fun topic; what kinds of data and standards exist; an overview of the current tech stacks and solutions; and operations — digital twins can be anything from a tiny little machine to entire cities.
Then different kinds of data visualization options, and how to plan a project — how you can start a first project, big or small. Cool.

So let's start with the fundamental question: what are digital twins? Here's a basic definition I got: a digital twin is a digital representation of a physical object or system. You have a physical object and you have a digital representation, and that could be done a hundred different ways. The twin is constructed so it can receive input from sensors on its real-world counterpart — it can't just be make-believe; it has to be based on something out in the world. For example, my Fitbit watch here is reporting my heart rate; the real-world counterpart is my heart. And you can simulate the physical object in real time: whatever is happening in the world, you can mirror it with your sensors in real time. And it produces outputs that can help you create predictions and simulations of how that physical object will behave.

For example, we've got this machine arm right here, a robot, and we've got the digital representation of that arm. The real-world object is the robot, and the digital representation copies everything the machine arm is doing in the real world. It does not have a mind of its own; it is a one-to-one twin of the physical environment. In this example, the guy in the inset is controlling a machine — it's rotating around — and the digital twin is just copying whatever the machine does, so that a person somewhere else, anywhere else, monitoring this can see what's happening in real time without being in the room. At the same time, you see information being monitored out of the device, so you know what's happening in a lot more detail than by just looking at it. These are the kinds of things a digital twin is supposed to support: it shows you what is happening to an object so you can get a lot of information in real time as it changes.

The ingredients of a digital twin are: the Internet of Things — equipment, hardware that has some kind of connection to the internet; the cloud, for all your server needs; and data — you can't get anything out of this if you don't have data operations. Those are the ingredients of all of these systems.

The use cases are often manufacturing, just like the robot arm: factory automation and simulation, seeing what's happening in your factory and how well it's operating. Vehicles: autonomous vehicles and sensors, to see what the vehicles are doing, where they're going, how efficient they are, their speeds — moving objects. Healthcare: there's a myriad of devices in healthcare, large and small, from pacemakers to large medical equipment, that can be used for monitoring the human body — that's a digital twin. And even entire structures, buildings and cities, to monitor the energy used and the movement of people in and out of a building, or even things like stress — they've been using digital twins to measure the stress on bridges, because you have monitoring devices on them. So digital twins can be used for a hundred different things; the question is just: what is the real-world object, and how do you bring it into a digital environment? That's the point of a digital twin.

And you might think this is fun and all, but what's in it for the average person? Well, there are a lot of benefits to digital twins.
This is one of the reasons people are investing in them. You get a real-time look at what's happening to your digital assets, so you don't have to guess; it gives a lot of information for anybody to view, as long as they have a portal to view it in. And you get things like the history of what happened to that object — metrics, information you can use for analysis or record-keeping — again, without having to guess. You don't have to rely on paper or binders; this is still a real-world issue in many, many different industries, where hard copies are used for all logs and records. If you don't have Gus going down to the boiler room to check on the equipment — and he has the records of what was going on in the boiler room for the last three years — you have no idea what's going on in your boiler room, or any other equipment, or HVAC, or anything. So having things digitally stored and attached to an object is a big benefit.

It can scale operations: this can be scaled easily, because as long as the infrastructure is there, it just grows. And you can use it for analysis: you can actually predict outcomes from historical data, like maintenance logs, health history, or the paths of vehicles — anything you want to study and predict. So it has a lot of benefits if you own equipment of any kind or want better outcomes.

And in terms of what it empowers, here are real-world examples. If you're the owner of a campus like this one, Boston University, you can access data about your assets and explore ways to improve operations: maybe things can be improved in the HVAC, or in maintenance logs, or return on investment — better tracking of your equipment and hardware. Facility managers, again, can see what's happening in real time based on information reported from the physical object to the twin. And even equipment manufacturers: if your equipment is hooked up to an IoT device that can report to a database, you can inform the owner of the building, vehicle, or whatever about what's happening in their device — remote monitoring of their equipment. So a lot of genuinely useful things come out of digital twins for day-to-day operations.

That said, where the tech stack is less straightforward is that there is no universal solution to interoperability in digital twins. There are lots of solutions, but nothing that comes together as a one-stop shop for digital twins. Basically, you have lots of different hardware devices that connect either directly to a device or to a portal — anything from phones to home IoT, to equipment, to camera monitors, you name it. They have to go through a network gateway, like a Wi-Fi portal or RFID, then through middleware like cloud servers, and finally you get to an application. And these are two-way streams going back and forth. So there's a lot of content to manage here, and digital twins themselves tend to be an exercise in managing complexity. At any point, you may have no interest in the hardware portion; you may have a lot of interest in the middleware portion or edge computing; or you may have no interest in any of that.
You may be a person studying the application side, so there are literally multiple points where you can get involved in digital twin infrastructure and applications. And here's a map of the solution providers at every scale, from manufacturing to urban design, to the equipment itself, to large-scale surveys and 3D engines — you name it, these are all components in a digital twin environment. So you can't just spin up one piece of software at this point; I think it may get to a point where that's possible, but right now you have lots of different sources and endpoints, and you have to put it together yourself. And it goes even further if you add something like data science: if you're trying to get real analysis and information out of what your IoT and digital twin devices are reporting, as you can see here, there's a lot going on, and that can expand what you're doing into a customized, bespoke workflow.

So, to help you wrap your head around this, keep it simple. Basically it's all about three things: sensors in the physical world that report on the hard, actual object you're talking about; models — some kind of digital model, the visual part, which could be a literal 3D model or a chart, different ways to display that information; and hosting — the services that connect sensors to models and other endpoints. Those are the three main ingredients, and it's basically mix and match depending on what resources you have and how you want to go about it.
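To ground those three ingredients, here is a minimal sketch of the sensor-to-twin path in Python, using MQTT as the transport. The broker address, topic layout, and device names are assumptions; real deployments usually sit behind a gateway and a cloud IoT service rather than a bare broker.

```python
import json
import time
import paho.mqtt.client as mqtt

twin_state = {}  # the "model" side: last known state per physical device

def on_message(client, userdata, msg):
    reading = json.loads(msg.payload)
    twin_state[reading["device_id"]] = reading  # mirror the physical object

client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("broker.example.com", 1883)  # the "hosting" side (assumed)
client.subscribe("site/+/telemetry")
client.loop_start()

# The "sensor" side: a device reporting its real-world state once a second.
for _ in range(10):
    client.publish("site/pump-1/telemetry", json.dumps({
        "device_id": "pump-1", "temp_c": 41.7, "ts": time.time()}))
    time.sleep(1.0)
print(twin_state)
```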
So, real quick, let's talk about industry formats. We've covered the use cases of digital twins and the tech stacks — what formats are people using to get this organized? This cartoon says a lot about it: every time people have an idea about a format, they want to make a new one, and that happens a lot here too. In IoT, industrial IoT, digital twins, and everything around them, right now you have multiple formats and standardizing bodies: Industry 4.0, the Industrial Digital Twin Association, the Open Industry 4.0 Alliance, Gaia-X, and many more. They all exist for a reason — they're not just redundancies; they address specific industry use cases. But they're starting to coalesce over time into more, let's call it generalized, concepts that everybody can agree on, and that has started to become the Digital Twin Consortium. So if you want resources on what it takes across the spectrum — small equipment, large equipment, server infrastructure, whatever — the Digital Twin Consortium is a good basis. It brings everything together and shows use cases for how terminology and architecture can be used. For example: what does a computational representation look like? Very simple — you have an input, the reference data goes to an algorithm, you have an output. It's something everybody can agree upon; that's the definition they're offering. The same goes for stored representations and structural information, describing what kinds of structured information exist: CAD, GIS, point clouds, stream histories, snapshots in time.

So you have this shared terminology, a shared basis for discussing digital twins in a way that's comfortable across many different organizations, points of reference, professionals, and solutions, so you're not mixing up terminology. That's been a good trend lately: bringing it all together under one umbrella. They also talk about what you shouldn't be using — for example, Excel should not be your database of choice for something like a digital twin. Maybe it's fine for a quick contribution, but not as a platform for any kind of large-scale operation. So it's also about what you should not be doing.

Now one of my favorite topics: large digital twins. If we're going to have intelligent ways of operating digital twins and make them bigger, how big do they get? Well, they could just fit in your apartment: in a given apartment or house you have dozens of objects that are IoT-enabled to get you started — thermostats, cameras, appliances, your phone. There's a lot in there. So it wouldn't be much of a step to go from something you buy yourself to something installed as a modular component in, say, an apartment complex, a hotel, a dormitory, a hospital. They're starting to do this now that modular construction is very popular — it's cost-effective, and it's efficient if you do it right — and they can embed lots of little devices inside these projects that can be used right away, or over time as the need arises, to get the data they need out of operations.

In this example, just to give you a sense of how these come together: in the top left corner you have a plot of land; you build your vertical access, like elevators and stairs; and this handy-dandy crane pops in these little modules, each one packed with equipment, stacked up until you have your apartment complex. This one is being used as a design concept for providing housing for the homeless. And again, with modular construction and digital twins packaged together, you build all of this in a factory setting — so it's not just equipment here and there on a factory floor, not just small-scale equipment; it could be your entire building complex. You marry the construction process with the asset of the built project, and you can get a nice dashboard. This is really useful if you're the owner of the building, a developer, a construction manager, or an architect — being able to get the data out of the project. And part of the visualization we're trying to figure out here is how to make that visually useful during project operations, during closeout, during ongoing operations.

And it could be as simple as an Arduino board — a handy-dandy Arduino board as a microcontroller to get hardware into your project that can be ported into a digital twin format. Product integration like this is a very simple thing, and they can do it in a modular format, tying these microcontroller boards, with everything built into them, onto the other side of a wall, a piece of equipment, or a desk in the project — install it, forget it ever exists, and turn it on when you need it. Again, this is something they're working on with a lot of housing projects. But it can also get very big: this is equipment at the Ford Motor Company's Dearborn campus, where they've installed
IoT-enabled digital twins on huge pieces of equipment — motors, engines, everything across their entire factory. They're trying to monitor it remotely so they don't have to worry about Gus going down there, or everything falling apart so they have to shut down the factory. They can get live updates from that facility on the fly without worrying about things going wrong. So the environments people use these twins in are getting bigger and bigger. In this case, you see a 3D model of the environment that is tied to hardware from the actual equipment on site, all coming together in a visual that anybody can walk through in 3D as though playing a video game.

So now let's talk about visualization. Visualization is partly a technology issue and partly a UX issue. This is an example of bad UX, where you're not getting what you want out of the experience and it's hard to enjoy being in that environment. And there are lots of different visualization options. There's, of course, good old-fashioned 3D: looking at your model as though you're playing a video game, walking through the 3D model of the digital twin's environment. There are charts: bar charts, graphs, data points, simple charts. There's VR/AR — immersive, a type of 3D, but in that case you also experience it as though you're there, not just looking at it. And maps — maps are great: whether it's a floor plan, a photogrammetry survey, any kind of map will do, so 2D maps are also a pretty useful visualization.

So let's look at an example of each. 3D models: here's a time series of a building showing different status markers during the day — this could be, for example, temperature, or the activity of people walking through the building. What I like about this — what's the great takeaway from this visualization of data — is that anybody can intuitively understand it; you don't have to think too hard about it. Even a non-expert, someone unfamiliar with floor plans, gets a sense of change over time, and it's easy to track and see with your eyes. That makes it a valuable output, a valuable visualization, for anybody who wants to interact with it. Another takeaway example — I believe this one is temperature in a building — and you can see this person zooming around, looking at the building and seeing what's happening. People like playing around with things; they like to look around and see what the highs and lows are, and it's very intuitive to understand. So these are the front-end visualization options people use to actually display the data, and this can be the payoff of the investment in building the hardware, the technology, the backend — everything coming together in a front-end tool for seeing what's happening in the environment.

But it could also be a simple chart. Here's a generic time-series chart — it could be anything: temperature, the activity of a motor in some piece of equipment, lots of different things. Charts are great if you just want some straightforward data points; that's another very useful, valid visualization.
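As the simplest concrete example, a few lines of Python produce exactly this kind of time-series view. The data here is synthetic; in practice it would come from the twin's telemetry store.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0, 24, 0.25)  # hours in a day, one sample per 15 minutes
# Synthetic stand-in for a sensor feed: a daily temperature cycle plus noise.
temp = 21 + 2 * np.sin((t - 9) / 24 * 2 * np.pi) \
          + np.random.normal(0, 0.2, t.size)

plt.plot(t, temp)
plt.xlabel("hour of day")
plt.ylabel("temperature (°C)")
plt.title("Sensor reading reported to the twin")
plt.show()
```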
Charts can be a lot more diverse too: a radar chart showing different kinds of activity, charts showing people's movements through a space, proximity — all sorts of things. So there are a lot of options when it comes to charts if you want to get good information out of the metrics and the changes over time. Charts can tell a story very quickly, without needing a 3D environment or any of that, if you don't think that's valuable. A lot of engineers do find this kind of information at their fingertips very useful. I work on projects where large, heavy equipment is being moved around, all over the world, and people want a sense of what's happening on their projects in real time. So these kinds of charts tell you very quickly: a digital twin readout of a physical, real-world object, sensors reporting the data of an active object, and you can remotely check its status from anywhere in the world. It's very useful at a glance. Depending on what you're trying to check, it could be a full time series, a radar chart, a cumulative chart of things adding up over time — it could be a lot of things.

It could also be a VR setting. For example, in this case, they're showing a use case of using VR to train somebody on equipment. If you're trying to train a lot of people in a short amount of time, and the equipment is far away from them, it's a great way to get them oriented to the actual hardware — the keystrokes, the buttons, the actual environment they're going to work with — without having to be there themselves, for one reason or another: your flight got canceled, or you have conflicts and can't attend. You can still get a lot out of a VR scenario of a digital twin that corresponds to physical equipment on site.

And again, maps — people love maps, people love familiar formats. This one is just showing the temperature of an office: an easy way for anybody to study the environment of an office, a classroom, any setting whatsoever; a house can do the same thing. It's another great way to visualize things: after you install the equipment and get the data reporting, all you need is a simple front end interfacing with the data endpoints, and away you go — you can see whether there's a temperature comfort zone in your space.

But you can also do things like a snapshot in time — remember the stored data we talked about. What you get in this example is every piece of equipment in a hospital: every stretcher, all the doors, every kind of bed. This kind of inventory tracking of a facility's assets might seem mundane, but it's very important for people to know what's inside their campus — it could be cars, it could be small hardware devices. It's genuinely hard to figure out what you actually have, which is bad for accountability: you can't tell what you have or what state it's in, whether anything needs attention, how old it is, what its warranty is. This is a type of digital twin that's a snapshot in time.

And you can also go all out and do something very customized, like this Unreal Engine approach, using a video game engine to show the environment of a manufacturing center, a factory, and everything around it.
And in this case, you see on the left all the different platforms coming together — CATIA, Maya, SOLIDWORKS as your data sources for the model — and on the other side the API connections for the data itself, and you can host it in an environment that is very appealing to somebody unfamiliar with it, while they get lots of great dashboard information at a glance in a heads-up display or hub. So something that works very well in a video game setting can work really well in a facility management setting.

But you can also get much, much bigger: you can actually make a digital twin of an entire city. This example is built on the city of Helsinki in Finland — their capital. It's an entire urban region where the topography data, building data, attributes, water, land uses, zoning, and everything else get reported into a CityGML model, which goes into a database, which becomes part of a platform that is then served as the web platform you see on the right, for people to interact with. So this is the growth stage of digital twins: they get as big as you want them to be, and they can be as small as you want them to be. There's huge variance in size and scope, so there's a lot of opportunity here for digital twins and visualization to be used for just about anything; the sky's the limit.

Now some notes about planning for digital twins. This all sounds really exciting, but how do you actually get it going? There's a big question you have to ask: who pays for all this? The owner pays for all this — whoever is in charge of the budget: the owner of the fleet of cars that needs telemetry sensors, of the buildings, of the health device monitors. Somebody out there — the owner — pays for it. So you need a sense of how to get buy-in from them to spend that kind of money, depending on how involved they are, especially if you're starting from zero and building everything in: the cloud servers, the IoT equipment, the visualization, the whole shebang. What you have to do is establish a business case. Usually you include this in an RFP or RFQ — request for proposals, request for qualifications — and try to include some kind of basic digital twin service: any part of the hardware, the server, any part of the service to the owner, depending on where you're coming from in the process. And what really works great is a pilot program or project with your client, if you have a client: say, hey, let's just try it out — maybe low cost, no cost — see how it works depending on what we have to invest, and see what the ROI is. Are they getting better results on whatever they're trying to improve — better performance from their equipment, better user outcomes? Look at the ROI, and then brag about it: market yourself and share the results with the proposal process people. And when it comes to all this, on the adoption curve it's better to start early with these kinds of services than late. You don't want to be playing catch-up once this becomes an established part of every kind of project.
Again, a lot of you might be working on these kinds of projects yourself in some capacity — on the edge computing side, the server side, the application side — and these might become things you need to address as this market gets bigger and demand rises. Five years ago, nobody in my world cared about digital twins; then very quickly, within a year or two, people really wanted at least some kind of starting point for digital twins. And now my current company is investing heavily in getting real, quality infrastructure in place — IoT servers, edge computing, and eventually a qualified digital twin model we can use for operations. We'd rather be on the leading edge of that and have an advantage than wait until somebody tells us to do it and fall behind.

So, bringing it all together: there is no single software solution. You have to find your own way and navigate that environment. There are lots of suites and packages that can provide that kind of data and visualization, but you have to figure out what you want to use your content for as a digital twin. Some I've been using lately: Azure has a whole digital twins and IoT platform — pretty good; it has pretty much everything covered for you: IoT Edge, IoT Hub, SQL databases, Databricks, visualization with Power BI, and much more for presentation. So it comes with pretty much everything; they call this Cosmos DB — check it out if you're interested. This is TwinMaker by AWS — just like everything else, AWS has a package for IoT and a package for twins — and this is an example of a 3D environment hosted in the cloud directly from AWS: activity monitoring in the upper left, a close-up of the 3D environment, and a time series of what's happening in that environment, all built into one platform. But you do have to provide the pieces separately — the 3D model separately, the equipment separately — so it's still a package you have to assemble, though it's easier now than it used to be. And this is something from Autodesk called Tandem, which does a really excellent job of showing front-end 3D environments with low latency, and also gives you charts, bars, graphs, and time series. You can walk around in 3D, use it as a VR environment, use it for planning — really great for visualization, but you do need a backend like AWS or Azure to feed into it; it won't do that itself, if I'm getting that right. So there's lots of great stuff out there; I recommend trying those out if you're curious and want to take a spin.

When it comes to life cycle planning: always start by assessing your current circumstances and what you've got. Plan ahead, document, train people — train yourself — on the CDE, the common data environment: are you an Azure shop, an IBM shop, an AWS shop? What is your common data environment for all of this? Pick your cloud host, analyze the data as it comes in, and study the trends over time — that's where the value comes in, in being able to understand trends. Update, and repeat forever, because it never really ends; the cycle just keeps repeating. I'm sure you all experience a cycle like this in your own day-to-day work — it never really ends, with new trends and evolving processes. So I highly recommend having some kind of cyclical, continuous improvement.
And just to get started, if you want some inspirational tips, there are four easy steps that might launch the path for you — and I want to revisit these myself.

Know what you want. It doesn't have to be anything complicated — just: what do you care about? The server side, the edge side, the equipment side? Pick something. Figure out where the opportunities are — what target can you achieve in that environment? Can you make things run faster? Can you get better data? Can you make better visualization? What kinds of targets can you possibly achieve? Plan it out: figure out what you need to do to get from point A to B to C through your process, and identify the critical paths you need to bridge along the way. And adjust as you go along, because — trust me, I've been in these situations — it's very easy to get lost in the nebulous world of putting a digital twin together. It's not always just one person or one team. What happens in practice is: I'm a program manager on a project, so I'm trying to bring it all together with a project manager who's in charge of a team, along with perhaps an executive who wants some kind of result out of this, along with engineers who are packaging these things together, along with vendors who are trying to give you the right services in the right package. So there's a lot happening and a lot of different people you've got to talk to; you just have to be able to adjust as you go and make sure your opportunities and your plan are being met.

Also figure out what you want to achieve. Are you trying to get better performance? Better user feedback — is that your way of getting information back from people? Are you trying to automate things, smooth out your operations, speed things up? Whatever it is you're trying to achieve, target that.

And just a couple of notes on bringing it together on a project: create the connections to the physical environment — that's essential; without it you do not have an IoT to use with a twin. Get live updates from the equipment. Have a place to store the data, of course. Analyze it for breakdowns, trends, notifications, and so forth. Update your equipment and your twin assets along the way — always try to get the latest updates for firmware, software, anything. Repeat forever; just like the real world, the ride never ends with digital twins. That's half the fun: it's always a learning curve, a process, and if I were to give this talk again a year or two from now, it would be somewhat different.

Key takeaways, to wrap it up. Digital twins are communications between digital and physical assets. All digital twins exist in a scheme of connected devices and services — meaning there's a network of things talking to each other. Common ground can come from the standardizing bodies for a digital twin strategy, so I recommend checking out something like the Digital Twin Consortium if you're interested in that kind of process. There are many, many different modes of visualization available for digital twins, so don't think it has to be either 3D or charts or anything in particular — there's lots of room to experiment, and sometimes you've got to throw things against the wall and see what sticks.
So that's part of what's half the fun and part of what's most of the headache. Think about different ways you can make the most out of the digital twin environment for visualization. Create a business plan and roadmap for your organization; the more structure people see out of this, the more they'll want to work with you. And again, because you have so many different people to work with, it's hard to make this just your own personal project unless it's very, very small in scope. To really grow it, you do need to have a larger plan in mind. And find opportunities to use digital twins. It could be as simple as implementing a digital twin project in your own company; again, it could be a small pilot program. Here are some resources, and I'll give you a link to the deck at the end here, but there are lots of resources for digital twins, including the consortium docs. Lots of stuff out there, basically starting points. That's the QR code if you want to get this deck, and you can always reach out to me. I'm search engine optimized: just put those two words together and you can't miss me. And that's it for me. If you have any questions, complaints, whatever, we can talk about them now, but thank you so much for attending my talk. Thank you for the talk. So I'm watching this, trying to figure out what a digital twin is, and I guess the theme I kept seeing here is that it's basically when you're combining telemetry with a spatial model. Otherwise, let me put it this way: how would you draw the line between a digital twin and just good old-fashioned dashboarding? Is there a distinction you would draw there? I would say the big, big difference is real-time data and monitoring. And I would say immersive information and immersive environments go a long way too, because digital twins represent the equipment, the space, the activity. You want to get more than a one-dimensional sense of what's happening with a digital twin. So while dashboards and data are great, and it's totally valid to show the operational efficiencies that are happening, and you could call that part of a digital twin on its own, it wouldn't be enough. You want to capture it like that TwinMaker example we saw, where we had the chart data and the time-series data together with the environment, the equipment, the logs and records. You want to be able to really show what's happening and what that space, environment, activity, and equipment look like. So you want an immersive approach. I'd say if it's not just one-dimensional but multi-dimensional, you're closer to the digital twin mark. How much of this environment is open source? It depends on your definition of open source, but in my experience, as far as what works, it tends not to be very open source; not guarded exactly, but proprietary. You can definitely do some projects, for example with Arduino boards; I believe the project is called OpenRemote. OpenRemote works with a lot of different microcontrollers, like Arduino and Raspberry Pi, to help you report information from the equipment on a microcontroller to a dashboard. So those smaller-scale things are open source, and that city model we saw from Helsinki, a lot of that's open source. But when it comes to using the hardware from Siemens or GE, or Azure or AWS, a lot of that tends not to be very open source, or open source friendly, or open source extensible.
So I'd say if you're talking about real hardware, and a lot of this is tied back to the hard equipment, it tends to be very tethered to wherever you buy it from. If you buy equipment from Siemens, a lot of the stack comes from Siemens, and there's not much open source there. I think it's changing, though, because no one can solve the entire puzzle of a digital twin, either on the front end, the back end, or the middleware. So I think over time it'll become more open source friendly, now that people see this as a really beneficial use case for monitoring and getting ROI. But right now, I'd say it's not particularly open source for the larger, more industrial commercial operations. There are definitely open source solutions out there for smaller-scale projects or less intense operations where you're not trying to get as much data. Some of these digital twins will report data a thousand times a second to get really high-frequency, high-accuracy data, while others might update once every second. So if you're talking about low-latency, low-commercial-expenditure projects, there's lots of open source, but as you get higher and higher into the commercial scale, like a lot of the examples we saw, they tend to be less open source. That's the kind of sliding scale there, but it's starting to get better. Sure. So you mentioned Helsinki. I wonder, two questions here. First, about the data the city collects: is the data open? And how much are concerns about privacy involved? Good question, actually. Let me put it this way; I'll rephrase privacy as security, because a lot of people are concerned about security. On projects I'm working on, there can only ever be a one-way reporting schema, from the equipment to whoever is looking at the digital twin visualization, to avoid hacking. Otherwise you have really large equipment that could potentially hurt somebody walking around, or someone hacking a vehicle or machines. So security and privacy are taken very seriously in that world. There's a lot of concern about that, especially when it comes to stuff you spend money on and stuff that involves safety. Usually it's much more intense there, and again, all based on real-world considerations. As for Helsinki: I can't say off the top of my head if their data is open or freely available, but I believe their architecture and the resources around it are open source; I don't know if they published the data outside of a report package. There was a very similar concern with Google's Sidewalk Labs in Toronto. You might have heard about that one. A couple of years ago, they were trying to do something similar, but more on Google's platforms, basically tracking where people are going in a district of the city, and there were a lot of concerns about that. So I think that's an evolving environment as far as who's seeing the information and who's accessing it. And related to that: a friend of mine works at a utility for gas pipelines, and they're hypersensitive about privacy. They don't want private data about household information leaked. So the privacy side is, I would say, typically protected, as is usual for whatever industry you're working in.
So that kind of information is usually protected; there are already protocols that guard it, and I think people are trying to find ways to, on the one hand, make information usable at large scale and, on the other hand, protect people's privacy. But I would say it's taken very seriously, both on the security front and the privacy front, while the actual architecture of the solution tends to be freely available and the data is guarded. That's the best way I can answer that question. Healthcare, okay. Well, I actually don't have a good slide here; I was going to go to that slide earlier, but that's more of an icon. I'd say in healthcare, again, it's almost like whatever equipment you have, and it could be Fitbits or Apple Watches, gets reported to a dashboard, and it's not very sophisticated, because it literally just goes to the front of an application. But what doctors want to do, for example, is package devices like a pacemaker, or something you can wear on the outside of your body, in a way that lets them monitor your health remotely, so you don't always have to come into the hospital to see what's happening. That's starting to gain traction as they miniaturize some of these things and also make the devices themselves a lot more accurate, so the data they get is actually reliable. And I've seen things emerging that try to actually map your body in 3D, including your internal anatomy, as it relates to whatever they're trying to track, so that they have a digital twin of you and can track it with equipment worn on the outside, inside of you, or monitored while you're in the hospital. So it's becoming more relevant and more useful as the technology gets better, though doctors tend to be very conservative in that regard. There's a lot happening, but it's not mainstream just yet. It would be very useful, for example, if somebody is disabled and can't move around much: you can monitor them remotely. Those kinds of applications would be the early points where you'd want to use this. Once that becomes established, over time it would become more and more normalized, to the point where the average person might be scanned and have a digital twin of themselves generating data, but we're some years away from that becoming mainstream healthcare. It's definitely on the horizon, though. Any other questions? Sure, go ahead. Can you tell us about an insight they've had from the Helsinki project? An insight from the Helsinki project? Yeah, like, what interesting results have they had? I'm not working on that project myself, but going back to this slide, let's see if I can scroll through it real quick. I'm not working on it myself, but it's something you're seeing in more and more cities; Hong Kong and Singapore are doing something similar. The stuff on the left here is their schema and architecture: CityGML, and the Kalasatama district. They're trying to see how they can manage the city as an asset. That's really what this is at a high level: how can they manage their hydrology, water, and infrastructure. They're starting to put sensors, like I mentioned earlier, in bridges and tunnels, trying to get a digital version of their city. So it's about trying to understand what's happening in their city.
I'm not sure if they've gotten an actual insight out of it yet, but this is something they're working on to get better information on the inner workings of the urban environment and get better, more fine-grained analysis of it. So on the one hand, they avoid throwing away money on something that's not working at an urban or infrastructure scale; on the other hand, maybe they're finding areas that can correspond better with people's needs and the city's needs. That's the big-picture takeaway. And this is really, really big data, because a lot of these 3D models are pretty cool, but they're also trying to track rainwater, sun exposure on buildings, all sorts of things here. So it really depends on the city's particular use cases, but I brought this up because if you can scale up to a city, you can scale up to a country, and it wouldn't really be that much different when you think about it. That's the kind of thing they're trying to do: how can we make our city better? And that depends on what they're trying to achieve and what's driving results. But I would say check this one out if you want to see more, and the architecture here should be mostly available; I won't speak too soon, but there are a couple of examples out there. Any other questions? Thank you all so much. If you have any more questions, again, you can check out the QR code, hit me up, whatever. And I'll be around today as well; I have another talk in the afternoon, so you can't miss me. Thank you so much. I'll let you get to lunch. Yay. Oh, sure, go ahead. Now, these attempts or efforts to get from sequential to spatial code are actually called high-level synthesis. And here we'll look deeper at how high-level synthesis actually works. We can see that we start with the C code at the front end, and that is converted into the intermediate representation of the code. By intermediate representation, I mean that it's made independent of the language we are feeding in: we get it into a raw form, and that raw form is further optimized. By optimization, I mean that developers can have different goals: to reduce area or resource utilization, to reduce latency, to increase parallelism, or to reduce power and energy, et cetera. So depending on the requirements, you optimize it. And towards the back end there is the HDL generator, in which those operations are scheduled and the resources are allocated onto the specific FPGA being targeted. Now, this looks like compilation, and indeed, HLS is like a compiler if we look at it that way. The challenge is that high-level synthesis gives us poor performance. The slide here shows the work we did in our lab, published in Elsevier, where we show that OpenCL, which is a high-level language, gives around six to 64 times worse performance compared to code written in a hardware description language. So if HLS performs so poorly and is so hard to do, then why are we trying to do it? This slide shows findings from VU: they have shown that you can indeed get good performance out of HLS if you vary your C code. You do the code transformation at the level of your C code, but you write the C code such that it is adapted to the specific specialized back end you are targeting.
And this is a remarkable strategy. It has indeed given pretty good performance, sometimes even better than HDL. But where's the problem? The problem is that, again, this is very specific, and we are talking about portability here. We want an approach that can be translated across compilers and across applications: a robust way of solving this problem. Another thing is our observation that compilers can work; we just have to find the magic spell that gives us good optimizations and good performance out of the compiler. One way in which we are currently moving forward is to train our compiler so that it gives really good performance for any type of C code you feed into it. Now, all of this sounds simple. Well, not so much, as the slide shows: there are many intricacies involved. The CPU-oriented strategies do not work for FPGAs, as we saw before. It varies across workloads. And, importantly, compilers do optimizations in the form of passes. There's a pass for everything: loop unrolling, inlining, optimizing your code. On top of those dozens of passes, another problem is that most of those passes have different parameters, such as how much loop unrolling to do, and so on. And then there are effectively infinite combinations of how we can put those passes together. All of those things affect your performance. So here I tried to illustrate it: with these boxes, I'm showing the different passes or strategies you tell your compiler to use to get performance out of your code. The result is impacted not only by the ordering of passes, but also by the frequency, meaning how many times you repeat a particular pass, and by the length of the pass sequence, meaning how many passes you feed into the system to actually reach ideal performance. So we need a strategy to solve this problem. Machine learning is popular; it's everywhere these days and supposedly the solution to most of the world's problems. But people have actually used it here. For example, there are very recent works by Google and Facebook, et cetera, currently using machine learning for compiler optimization. However, very little work has been done in the direction of high-level synthesis or specialized back ends like FPGAs. And machine learning itself offers several options. You could use supervised learning; it may be a good option, but it needs massive initial computation. There is another strategy, unsupervised learning; that may not be meaningful for the compiler optimization problem, since our search space is too large. Now, one thing that looks more plausible is reinforcement learning, because with that we can train on the go, per program. What is reinforcement learning? I'll go through it quickly. It's an approach in the machine learning world built on an interplay between the environment and the agent. The environment is the problem you have to optimize, for example your C code. And the agent is the brain, the model, the policy, which suggests the best possible action to take in a given state. There are many ways reinforcement learning can be applied to the high-level synthesis compiler optimization problem; a rough sketch of the basic loop follows.
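To make that loop concrete, here is a minimal, hypothetical sketch in Python: the "environment" holds a program's IR, an action is choosing one compiler pass, and the reward is the measured speedup. The pass names and the apply_pass and measure_runtime helpers are stand-ins for illustration, not any real tool's API.

```python
import random

PASSES = ["loop-unroll", "inline", "mem2reg", "licm", "gvn"]  # illustrative pass names

def apply_pass(ir, name):
    # Hypothetical: returns the transformed IR (here, just records the pass).
    return ir + [name]

def measure_runtime(ir):
    # Hypothetical: in reality the HLS tool would report latency or runtime.
    return 100.0 / (1 + len(set(ir)))

# Tabular Q-learning where the state is just how many passes have been applied.
q = {}
epsilon, alpha, gamma = 0.2, 0.5, 0.9

for episode in range(200):
    ir, state = [], 0
    baseline = measure_runtime(ir)
    for step in range(5):                      # bounded pass-sequence length
        if random.random() < epsilon:
            action = random.choice(PASSES)     # explore a random pass
        else:                                  # exploit the best known pass here
            action = max(PASSES, key=lambda a: q.get((state, a), 0.0))
        ir = apply_pass(ir, action)
        reward = baseline / measure_runtime(ir) - 1.0   # speedup over unoptimized
        best_next = max(q.get((state + 1, a), 0.0) for a in PASSES)
        q[(state, action)] = (1 - alpha) * q.get((state, action), 0.0) \
                             + alpha * (reward + gamma * best_next)
        state += 1

print(max(q, key=q.get))  # most promising (state, pass) pair found
```

A real setup replaces the toy cost model with calls into the compiler and HLS tool, which is exactly why the reward signal is expensive to obtain.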
One state-of-the-art way is shown here: the agent recommends the best possible pass to apply; that pass is applied to the intermediate representation of the code, optimizing it; and the HLS tool is used to give us the reward, that is, the performance when that specific pass is applied. Now, this is our work: we have looked at learning strategies for how we can enhance this process. The reason is that the developer needs to have some choice about what they want the agent to learn and how much they want it to learn. In order to give that choice to the developer, to give them the flexibility to choose according to their learning goals, we apply different strategies. We look at how we can vary different things and thereby make the compiler smart. We also looked at learning quality. Normally for compiler optimization, the literature always reports the speedup over O3, where O3 in the compiler world means the state of the art, what the compiler already does: how much better you can get than the traditional compiler strategy. We have looked at several other metrics as well, which may be of interest to the developer. For example, the learning speed: how much time you take to reach 90% of your steady-state learning. The fluctuation band: once the agent has learned, how sure it is that it has learned what it was supposed to learn, since there is a certain ripple in the results. And the performance potential achieved: how close it comes to the maximum reward. Okay, so here are some pretty graphs. There are a lot of them, but I'll just give a brief overview. Base is the state of the art so far, and our approaches or learning strategies show how much better we can do, for standard applications, in terms of each of those metrics. For learning speed, for example, for this application called ADPCM, we see that our strategy four gives pretty good results; these are normalized results with respect to each standard application. For matrix multiply, strategy two with method 100 works well. And so on: the greater the value, the better the result, and for performance potential, similarly, the greater the value, the better the result. For the fluctuation band, the smaller the value, the better the result; here, for example, strategy two for ADPCM gives pretty good results. And for speedup over O3, the higher the value, the better the result. So in this table we tried to list which strategy, which options, and which approach give the best result for each learning-quality metric and each benchmark. Okay, so we saw that we can find an optimum strategy for every metric and every application; that's what we saw in the table. It depends on the developer's goals and on the application. The next step is to make it more intelligent, add some more neural nets into it: basically a recommender system that can tell, based on the goal, which strategy would work best for a specific C code or application, and then a state representation of the code.
That is, an efficient way to feed the features of your code into the reinforcement learning framework. And then we are currently working on other HLS tools. This work was done on LegUp, which is open source, and now we are working on Vitis, which is not open source, configuring it and getting better results. Okay, the last thing to look at before we wrap up: in parallel, we are also working with GCC. We saw previously the results for LLVM, and now, briefly, my colleague Robert is doing some brilliant work on it. It's not so easy, because GCC has certain limitations. Pass ordering is typically harder, because certain passes must come before others, otherwise you will get errors, and each pass must sit under its designated parent node. He's been able to accomplish that pass ordering. He's also been able to look at the GCC front end and GCC middle end, and towards the back end he's currently using Bambu from the PandA project, which is another open-source HLS tool. He's setting up the system in which you can do machine learning, and then applying the same strategies to optimize the overall system. Thank you. That's the end of my talk today. Thank you so much, everyone. The question was regarding the scope of the code. What we're trying to enable is for people to specify their applications, not the infrastructure that runs under the applications, which is what DISL and the hardware operating system do, and which will be the next talk after this one. So this one is just: you want to implement a workload, maybe some filtering application, maybe a Kubernetes offload. What's the scope of the kinds of applications you're going to offer, and is there really the opportunity to do that with the package that's provided, or do you have to go to the infrastructure? Yeah, absolutely, as long as you can specify it. So far, we've been testing with standard high-level synthesis applications, like the ones I showed, standard benchmarks: CHStone, Rodinia, and so on. The goal is to train the overall system over a standard set of applications, the hardware design patterns it's expected to see in an HLS environment, and then gradually build up and increase the scope of the applications we cover. The strategies, I'm sure, will be applicable to all the different types of patterns. And yeah, thank you. If LLVM seems substantially more flexible, given that you can order the passes any way you want, why bother with GCC? GCC is mainstream. Another reason is that you want to look at things that others have not done and explore new things, and hardly anybody has ever done the work Robert is currently doing. And with the Red Hat expertise, there are great people working on this, so we are honored to work alongside them and explore this area further. Fundamentally, GCC is much more powerful and mature: instead of having, say, 60 passes, it has hundreds of passes. I guess one comment on the applicability.
One thing that no system can do is take an application that's mismatched to the technology and make it better, but that's pretty much the only constraint. We hope. Thank you so much, Afsa. Our next talk is going to be about hardware operating systems, in about four minutes. So the next talk we have in the open hardware initiative series is by Sahan Bandara, a third-year PhD student, and Zaid Tahir, a second-year PhD student, both working with Professor Markman as well. This talk is going to be about hardware operating systems and BIOSes, things you've been seeing in CPUs for decades, now coming to FPGAs. Thanks, Ahmed. Good afternoon. My name is Sahan Bandara. Zaid and I will be presenting a dynamic infrastructure services layer for reconfigurable hardware, or DISL. Let's start off with some general motivation. As we all know, the amount of data captured and consumed is increasing exponentially everywhere, and everything about the cloud is growing. Today's data centers have hundreds of thousands of servers and many constraints, including security and intense power and packaging requirements. Crucially, applications are changing very rapidly. Economies of scale also fit in here, with this much incentive to create replicable, beneficial technology. This has led to a virtuous cycle of technology advancement and new requirements. For instance, software-defined networks require new technologies that, when deployed, enable other new capabilities; these create new requirements and demand for further technological innovation. In the last 10 years, we've seen this play out in the Azure cloud, with software-defined networking, crypto, compression, and hardware router support leading to the deployment of FPGA-based smart NICs, which then enabled providing applications as a service, such as machine learning, applications that were not thought of when the initial hardware was deployed. So data centers are undergoing a transformation from being compute-centric to being data-centric, or perhaps network-centric, depending on whom you're reading. Here we have two talks out of potentially dozens. From Hitachi, we see that the compute-centric data center is running out of gas and that CPUs are being replaced by GPUs as the central component. From The Next Platform, we have a description of some of Intel's activities with their IPUs, or infrastructure processing units, and their place in the cloud server architecture, especially in running offloaded code from cloud service providers. To summarize, the data center is turning into a more network-focused environment: compute nodes, storage, accelerator services, and application-specific services are all connected through high-speed networks which also have embedded compute capability. So let's take a minute to go through some of these requirements. Everything needs to be processed at line rate. Some processing may be inline, and other types of processing may involve complex or look-aside operations. The components need to process a large number of streams simultaneously across many types of applications, and they should also support frequent updates and upgrades. To summarize: we need software flexibility at the speed of hardware. These requirements have led Microsoft and other cloud providers to turn to FPGAs. Why?
For the last 30 years or so, FPGAs' primary market has been communication processing. Because of that market niche, FPGAs have to have a large number of the highest-performance transceivers possible in a given process technology. These transceivers can then be combined with computation, which we'll describe in a minute. So this is existential for FPGAs: FPGAs are communication processors. Here we see an example from an Arista switch. While many routers and switches have FPGAs for data plane functions, this one also has an FPGA for application-level computation, in particular low-latency trading. Now let's take a step back and see where the FPGA's efficiency comes from. Like GPUs, FPGAs allow the programmer to use a huge number of compute pipelines and local memories. But FPGAs have the additional capability of being able to configure these pipelines to fit a particular computation. And perhaps most significantly, FPGAs also have configurable internal routing. That means an FPGA can be configured to have complex, application-specific communication among all of these pipelines. What it means is that FPGAs do not have the artificial limitations GPUs have about blocks not talking to other blocks. So going from being basically a flexible GPU to being a communication processor is rather straightforward: we can simply replace connections from the processors to local memories with connections to the communication ports. Therefore, we have direct application-level inter-node communication, from the application layer straight to the physical layer. This gives us ultra-low communication latency of just a few clock cycles, and the large number of transceivers gives us massive bandwidth. Going beyond the switch and the NIC, FPGAs are used throughout the data center. Here we have co-processors in storage and as disaggregated compute components. Actually, it's only a short step from these remote FPGAs to edge and IoT devices, so, not surprisingly, those are the other major markets for FPGAs. Looking at the requirements for edge and IoT scenarios, we see immediately that they are exactly the same as in the emerging data center. While edge and IoT devices may have to meet certain other standards to be used in critical applications like automotive or medical devices, we still have the same needs: line-rate processing at multiple nodes, processing many applications, and providing support for frequent updates. So again, FPGAs, these communication processors, are ideal in this scenario as well. So new paradigms and hardware capabilities are emerging in the data center, edge, and IoT, largely based on a smart communication fabric: smart NICs with FPGAs. It's time to continue our virtuous cycle. Now that we have all these cool new technologies, what else can we do with them? The demand for inline instrumentation gave us smart NICs, which gave us additional types of processing. Now, what can we use those for? A very simple example is a firewall as a bump in the wire. Instead of passing everything through, packets can be filtered through rules which are easily configured and updated at runtime. So we can have hardware processing of regular-expression matching, or any other type of rule, while simultaneously processing many flows at line rate.
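As a software analogy for what such a bump-in-the-wire filter does (the FPGA would implement this as parallel match engines, not sequential Python), here is a minimal sketch; the rule set and packet contents are invented for illustration.

```python
import re

# Illustrative runtime-configurable rule table: (name, compiled regex, action).
RULES = [
    ("block-telnet", re.compile(rb"^TELNET"),            "drop"),
    ("flag-sqli",    re.compile(rb"(?i)union\s+select"), "drop"),
    ("default",      re.compile(rb""),                   "pass"),  # matches anything
]

def filter_packet(payload: bytes) -> str:
    """Return the action of the first rule whose pattern matches the payload."""
    for name, pattern, action in RULES:
        if pattern.search(payload):
            return action
    return "pass"

# Example flows; a real bump in the wire does this per packet, per flow, at line rate.
for pkt in [b"TELNET login", b"GET /?q=1 UNION SELECT *", b"GET /index.html"]:
    print(pkt, "->", filter_packet(pkt))
```

The point of doing this in hardware is that all rules can be evaluated in parallel on every packet, so adding rules does not add latency the way it does in this sequential sketch.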
In this prototype system, we envision the FPGA as the gateway between the data in the tenant-controlled node, represented in red, and the communication in the public cloud network, represented in black. The FPGA can contain many different accelerators, such as firewalls, custom encryption engines, and hardware root-of-trust support, which enable increased protection for the tenant data. That's pretty much it. Trusted execution: a very straightforward application is secure enclaves. The FPGA can provide processing elements and memory physically separate from the host system. So, unlike a typical accelerator use case, the host is not given free rein over the FPGA; the FPGA can dictate the terms for interactions with the host OS and other processes running on the host. A more complex example is offloading parts of the hypervisor functionality in order to ensure that guest memory is protected from a compromised hypervisor. The FPGA can perform tasks such as setting up page table entries and other data structures to manage the memory; the host would only be responsible for performing compute and would not be allowed to manage memory. In this sort of deployment, the FPGA has to be controlled only over the network interface, and the host should not be allowed to control the FPGA at all. FPGA as a service: what makes HPC systems so expensive is communication, but with network-connected FPGAs, latencies can be in the small number of microseconds, even across the data center. This is particularly important for applications such as molecular modeling, which we show a small slide of. So now that we've talked about the motivation for this project: why do we need OS-like abstractions in hardware? We need them because we get fast turnaround times, since we don't have to build any of the components being used, such as memory controllers, PCIe controllers, et cetera. With consistent APIs, we don't have to rewrite or modify application logic to match APIs when going from one vendor to another, or from one cloud service provider to another (a toy sketch of this idea follows shortly). This in turn gives us application portability, where the same application should work on different FPGAs. And it lowers the expertise requirements: you don't have to be an expert on PCIe, memory, or network controllers; instead, the developer can focus on their own application. This also enables complex usage models like multi-tenancy for FPGAs. And innovation can be targeted at specific areas of the OS, because the separate functions are clearly defined and separated. So we claim that we can create these new capabilities, and the point of this project is that current support environments should be augmented with a dynamic infrastructure layer. Before we dive into the requirements, design, et cetera, of DISL, let's see what the current state of the art is. AWS offers FPGAs as co-processors as a service; their hardware-OS-like implementation provides memory interfaces, host interfaces, et cetera. The Microsoft Azure shell is exclusively for internal use; apart from the capabilities we saw on the previous slide, they also provide bespoke IO memory interfaces, inter-FPGA ports, and routing support. And when we move to IoT devices, some additional flexibility is needed: this can include embedded soft processor cores and support for a variety of peripherals.
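Before looking at the current state of the art, here is a deliberately simplistic Python sketch of the "consistent APIs" point made above. This is nothing like DISL's real interfaces, just the shape of the idea: application code talks to an abstract subsystem, and a board-specific backend plugs in underneath.

```python
from abc import ABC, abstractmethod

class MemorySubsystem(ABC):
    """Vendor-agnostic contract that application code programs against."""
    @abstractmethod
    def read(self, addr: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, addr: int, data: bytes) -> None: ...

class VendorXDDR(MemorySubsystem):
    """Hypothetical board-specific backend; swapping boards should not touch app code."""
    def __init__(self):
        self._mem = bytearray(1 << 16)   # stand-in for a real DDR controller
    def read(self, addr, length):
        return bytes(self._mem[addr:addr + length])
    def write(self, addr, data):
        self._mem[addr:addr + len(data)] = data

def application(mem: MemorySubsystem):
    # Application logic only sees the abstract API, so it is portable across backends.
    mem.write(0x1000, b"hello")
    assert mem.read(0x1000, 5) == b"hello"

application(VendorXDDR())
```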
So, the current state of the hardware OS: lots of proprietary systems from different vendors and different service providers, with a variety of capabilities which vary drastically depending on the function and who's using it. There are limited open-source options, and none are portable, because they're very vendor-specific and there's no clear separation between board-specific and generic RTL. In short, vendors distribute their own, and large end users develop their own; for everyone else, FPGA configuration is inaccessible or too device-specific. So, the solution: what is DISL? DISL is an automatic and scalable hardware operating system and BIOS generation framework. Why do we separate the OS and BIOS here? So that the OS can be fully device- or chip-independent, while the BIOS component includes all the device-specific files and other pieces. Will we always need both the BIOS and the hardware OS, and will it be similar to, say, a software OS and its interaction with the BIOS? No: the composition and interaction will depend on each individual use case, and only the required components will be generated, on a per-use-case basis. Here we provide some example illustrations of one possible hardware operating system plus BIOS that DISL can generate. We have the FPGA with a variety of external interfaces. In the first stage, a bitstream for the BIOS is loaded from flash memory and programs this part of the FPGA. The partial reconfiguration controller is in here; that's the part which allows us to reprogram part of the FPGA rather than reprogramming the whole thing. This BIOS component includes all the controllers we need to access the external interfaces. There could also be another stage of BIOS, call it BIOS stage two, which could include a soft-core CPU that can perform a kind of built-in self-test and maybe provide some minimal functionality. To clarify what a soft-core processor is: a low-complexity CPU implemented in the FPGA fabric itself. It usually runs at a lower frequency, but it's tightly integrated with the surrounding logic. At this point, the hardware operating system bitstream can be loaded, either from nonvolatile memory, over PCIe, or even over the network. The partial reconfiguration controller allows the hardware OS to reuse the region previously used by BIOS stage two, so we reduce the resource usage. And finally, the hardware operating system loads the user application bitstreams, again from any of these sources, and places them in the remaining area of the FPGA. There could be multiple bare-metal applications on the FPGA, or virtual applications running within a hypervisor implemented on the FPGA. Or, if you want to use an IoT FPGA with fewer resources, we can use an IoT-optimized hardware operating system, which has less functionality, and the applications could also lean more on software as opposed to hardware, to implement more flexible functionality within a given resource budget. Here are some of the ways FPGAs can be used: the columns show the different deployment scenarios, and the rows show the different components needed to support them. One thing we can notice is that the support required for different deployments is not uniform. We can also notice that there are certain components, like soft processor cores, which are applicable across all the different types of deployment.
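As a toy illustration of "generate only the required components per use case" (not DISL's actual input format, just one hypothetical shape for it), a deployment could be described declaratively and validated before generation:

```python
# Hypothetical per-use-case description; component names are invented for illustration.
USE_CASE = {
    "board": "some-iot-board",
    "components": ["softcore", "ethernet", "uart"],   # no PCIe, no DDR on this one
}

# Which components each deployment scenario supports (cf. the table in the talk).
SUPPORTED = {
    "datacenter": {"softcore", "ethernet", "pcie", "ddr", "smart_switch"},
    "iot":        {"softcore", "ethernet", "uart"},
}

def plan(use_case, scenario):
    """Return the minimal component list to generate, or raise if unsupported."""
    wanted = set(use_case["components"])
    unsupported = wanted - SUPPORTED[scenario]
    if unsupported:
        raise ValueError(f"not available in {scenario}: {sorted(unsupported)}")
    # Only what was asked for gets generated; nothing else is instantiated.
    return sorted(wanted)

print(plan(USE_CASE, "iot"))   # ['ethernet', 'softcore', 'uart']
```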
Now that we have discussed the motivation behind DISL and an overall view of what it could look like, we can dive into the more specific details of the progress we've made. Stage one of the project focused on figuring out the taxonomy: what components we need and how to achieve flexibility. For instance, we figured out that the vendor-provided intellectual property cores have very little flexibility. We have tested what's out there, tried to make modifications to it, and we now have a sufficient understanding to move on to the next stage, where we build the required components to be open source; vendor-, chip-, and board-agnostic; and flexible. Stage three and beyond will be putting these components together to build hardware operating systems, then providing dynamic operation capability through partial reconfiguration, using this as a platform to run experiments on, and then implementing new functionality such as security, trusted execution, and many other things. We have identified five components which we believe are applicable across the multitude of different use cases and essential for the whole framework: the memory, soft processor core, smart switch, network, and PCIe subsystems. For the rest of the talk, we'll go into the details of each of these subsystems. We'd also like to note that a subsystem is not just the controller itself: it's not just the memory controller, for instance, since what we refer to as the memory subsystem can include other things, like a cache and any other mechanisms needed to implement new functionality. Going back to this slide: we have already established that FPGAs are awesome and that they're going to be everywhere in the network and in the data center. Each of these FPGAs is deployed in different configurations and has multiple communication interfaces. Typically, these communication interfaces are built around vendor-provided intellectual property cores. While there are non-vendor-provided IPs, both open source and proprietary, each of these is still designed targeting a small subset of devices. Why is that? Because a modern FPGA is not simply a sea of configurable logic: there's a lot of hardened logic to perform specialized functions, ranging from floating-point operations to transceivers and their related functionality. The intellectual property cores are usually built using both hardened and reconfigurable logic. These logic blocks and IPs differ from one vendor to another, and not just from one vendor to another but also from one device family to another. So any communication infrastructure we build will have these vendor- and device-specific aspects to it. And that doesn't stop at the hardware boundary, because the drivers the host uses to communicate with the FPGA also need to take those specific details into account. So what are we going to do to change the current state? Let's look at that with the DISL PCIe subsystem. This is what a typical host-to-FPGA PCIe communication infrastructure looks like. In order to make it simpler for a developer, we would like the DISL PCIe infrastructure to be vendor- and device-agnostic on the host side, and we would like all the specifics to be handled on the FPGA itself. Ideally, the user should not have to design or maintain either of these components. And we would also like this to work with in-tree drivers which are part of the standard distribution.
So what's the solution that satisfies all of these requirements? We think it's virtio. This shows the typical usage model for virtio drivers: a guest application is using a virtio driver to access a virtio device which is emulated by the host. If there's a corresponding physical device, there's a driver running in the host kernel space which is specific to that particular device. But do we need to go through all of these layers just to talk to the FPGA? Can we do this instead? That's exactly what we did. We have implemented the virtio virtqueue functions directly on the FPGA, eliminating the requirement for both the device-specific drivers and the emulated backend device. So the FPGA presents the same interface, now to both the host and the guest OS. At this point, logically, there's no difference between a virtio driver in the guest kernel or the host kernel accessing the FPGA; it's the same driver, and it's the same interface it sees. So far, we have successfully implemented a virtio console device on a Xilinx FPGA as a proof of concept. Our first task was adding some virtio-specific capabilities to the capability list of the PCIe device. We achieved this by making modifications to the vendor-provided IP core itself; the main challenge was that some of the capabilities required to achieve this functionality were not exposed to users by the standard FPGA tool flow. The next step was implementing the virtio functionality. Here again, unlike the typical use case, where a driver running on the host controls the DMA engine, the virtio logic on the FPGA directly controls the DMA engine to perform data movement between the host and the FPGA. The next steps for the PCIe subsystem include integrating the virtqueues with the rest of the subsystems, implementing other virtio device types, potentially even defining new virtio device types if that's deemed necessary, then replacing the vendor-provided IP components, except for the hard PHY, with open-source components, and adding new functionality required by the hardware OS. Okay, so the question is about where the IOMMU sits. That's still handled by the virtio driver. Everything to do with programming the IOMMU and setting up the memory regions is still handled by the virtio driver; the FPGA is not involved in that. It just receives the notifications and knows to move data. Yeah? No, that's why the driver has to set those addresses up before passing them to the device itself. So now we're going to see a little demo. What we're going to see is: on the host, we write something to the virtio device file, and the FPGA moves that data through the user logic, where it goes through a simple function. In this case, it's just switching upper- and lowercase letters, nothing fancy. And then we perform a read on the host side as well. Let's have a look at how that goes. Logging into the machine, we're just listing the PCIe devices here. You can see the last device is a Xilinx FPGA. It has three capabilities: MSI, PCI Express, and Power Management. If you list the device files, you don't see any virtio devices at this point. Next, we program the FPGA with a bitstream we have compiled. This goes on for a while, so I'm skipping ahead here a bit. We've now programmed the FPGA, so we don't need that anymore. Now we have to power cycle the machine.
Again, we'll skip ahead while the machine reboots. When we're back, we log in and list the PCIe devices to see what we get now. Okay, so now we see a Red Hat virtio console device. This is the same device, in the same slot, and at the bottom we see the virtio-specific capabilities added to the capability list. If you list the device files, you'll now see hvc0 through hvc7, the virtio console device. At this point, we'll try to write something to the device file. I'm using two terminals; I'm opening up minicom, and I'll just type in my name there. When we read the data back, we see the upper- and lowercase letters have been switched; you see the output there. So the FPGA has done some processing on the data. I'll skip the next slide because it's the same thing again, just typing in more data. At this point, I'll hand over to Zaid to talk about the other subsystems in DISL. Hi everyone, Zaid here. I'll be piggybacking off Sahan's slides. As he mentioned, communication is really important for FPGAs in data centers and IoT environments. So, apart from PCIe, the network subsystem is also extremely important for these environments, and in DISL we're looking to provide flexibility in the network subsystem so that you can use it in different configurations. For the network subsystem in DISL, we're using an Ethernet controller for transmission and reception of network packets. Why can't we just plug and play existing FPGA Ethernet controllers in DISL? The answer is that we can't, because the existing Ethernet controllers aren't that configurable, and one of the main aims of DISL is to give users a framework so they can build their own custom hardware operating systems and configure these subsystems according to their needs. For example, a user who wants to connect a soft-core processor to the network subsystem needs a different configuration than one who wants to connect custom hardware logic to it. Keeping that in mind, we designed the network subsystem with flexible intra-subsystem connectivity in mind. If you start from the right: Ethernet packets are received, going through the PHY, and then from the PHY into the MAC. This yellow box, which includes the gray box, is the FPGA. The received packets go into the MAC and from there into the Ethernet pipeline. We can have our own hardware logic further process the Ethernet packets, stripping off the Ethernet headers and the IP headers and storing the data as UDP or TCP payloads, or the Ethernet packets as a whole are readily available. This part, which the arrow points to, is what gives the intra-subsystem connectivity its flexibility. As you can see, we have streaming connectivity, we have DMA support, we have soft-core connectivity, and users can add their own connectivity as well. This is what gives DISL the flexibility to generate a network subsystem for different use cases according to the user's needs. Let's go deeper into the soft-core connectivity. Say a user wants to connect their soft core, for example a RISC-V processor, to the network subsystem. DISL provides open-source packet-processing C++ libraries, so the software developers would only see functions like a transmit-Ethernet-packet call, which transmits an Ethernet packet.
What happens behind the scenes is handled first in the libraries, and then the decoding happens in this soft-core connectivity block. Another example would be a user who just wants the streaming interface: they want custom logic to simply stream the data in and out. They would use that interface, and they can add their own interfaces as well. Another important thing for the network subsystem: since the software and hardware code is open source, users can add filters, such as eBPF-style filters, for packet parsing, steering, firewalls, and other security applications like host isolation, et cetera. Coming back to my earlier point, the limitations of the existing Ethernet controllers are: they're limited in their intra-subsystem connectivity options, which restricts them to a specific configuration, for example soft core plus custom HDL, or custom HDL only. The throughput falls short in some cases, especially for soft core plus custom HDL configurations. A few vendor-specific Ethernet controllers are out there, but their code isn't even visible: it's encrypted and limited to their boards only. And debugging and simulation tools for network subsystems were limited as well. So these were the two areas of improvement: network subsystem improvements and simulation and debugging tool improvements. We focused first on the network subsystem improvements. This network subsystem works in various configurations, as I said before: soft core plus custom HDL, custom HDL only, DMA, or any custom configuration, thanks to the flexible intra-subsystem connectivity. We focused on high throughput, especially for soft core plus custom HDL configurations. DISL provides open-source packet-processing C++ libraries, integrated so the Ethernet controller can be used with the soft core. And the hardware and software code, as I said, is open source. For the simulation and debugging tools improvement, we have full packet simulation in DISL using Python, cocotb, GTKWave, and Icarus Verilog, which are all open-source tools. Hence, complex packets can easily be formed using Python libraries for simulation. One of the important points here is that we have integrated the soft core into the simulation for Ethernet packet processing in C++. So, for example, you write C++ code, upload its hex file into the soft core, and you can watch in simulation, at nanosecond resolution, what the soft core is doing and how it's interacting with the network subsystem. You can also do live debugging of the soft core and the C++ code using a UART core and simple functions like printf in the C++ code. Here's a screenshot of the simulation. A high-level overview: the yellow arrow points to the RISC-V processor reads; the purple arrow points to the RISC-V soft processor writes; the blue arrows are the received Ethernet packets from the PHY; the orange line here, LED2, going high means the RISC-V has read an Ethernet packet; and LED4 going high shows that the received Ethernet packet's UDP destination port is 1234. A hedged sketch of what such a packet-level testbench can look like follows.
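To give a flavor of that flow (the signal names below are invented; a real DISL bench would match the actual RTL ports), a minimal cocotb test might build a UDP frame with scapy and drive it into the design byte by byte:

```python
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge
from scapy.all import Ether, IP, UDP  # complex packets are easy to build in Python

@cocotb.test()
async def udp_frame_reaches_core(dut):
    # 125 MHz clock on a hypothetical 'clk' port.
    cocotb.start_soon(Clock(dut.clk, 8, units="ns").start())

    frame = bytes(Ether() / IP(dst="192.168.1.128") / UDP(dport=1234) / b"hello")

    # Drive the frame one byte per cycle into hypothetical rx_data/rx_valid ports.
    for byte in frame:
        dut.rx_data.value = byte
        dut.rx_valid.value = 1
        await RisingEdge(dut.clk)
    dut.rx_valid.value = 0

    # Wait a few cycles, then check a hypothetical 'match' flag set by the user logic.
    for _ in range(10):
        await RisingEdge(dut.clk)
    assert dut.match.value == 1, "core did not flag the expected UDP port"
```

Run under Icarus Verilog, the resulting waveform can then be inspected in GTKWave, which is the nanosecond-level visibility described above.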
Here's a screenshot of the live C++ UART debugging happening. And this is a block diagram of an example network subsystem integration in DISL. As I said, there's an application block in the network subsystem, and since this code is open source, users can add their own custom functionality as well. That custom functionality is what you can see here: an eBPF filter added in front of the received Ethernet FIFOs for packet steering and forwarding. So this is basically the demo of the network subsystem, and this is its block diagram. We're using an Arty 100T board. The demo: we've written C++ code for UDP packet processing to run on the RISC-V soft core, and the logic in the code is that if the received Ethernet packet's destination IP address is 192.168.1.128 and the UDP destination port is 1234, the RISC-V core will display on the UART port that the correct packet has been received. Here's a video of that. We're using a Python script with which we're sending Ethernet packets (roughly like the sketch after this paragraph). We have the IP address we want, 192.168.1.128, and 1234 as the UDP destination port number. Here we have our serial monitor. First we change the UDP destination port number to a wrong port number that the soft core isn't looking for, so nothing should get printed on the serial port. The FPGA is receiving Ethernet packets, but nothing gets printed. But once we send the right IP address and UDP destination port number, we can see the RISC-V soft core printing on the serial monitor that we are receiving packets with the correct IP address and UDP destination port number. And here's a video of the FPGA: the green LED blinking shows Ethernet packets being received, and the yellow LED blinking shows the RISC-V writing over UART to the host PC. The to-do list for the network subsystem: we're looking to implement various applications in the network subsystem, we want to integrate it with new incoming DISL subsystems, and we're exploring further areas of optimization in the network subsystem.
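The actual script wasn't shown, but the sending side of a demo like this can be done with scapy; this is a hedged reconstruction, and the interface name is a placeholder.

```python
# Needs root privileges; scapy builds and sends raw Ethernet frames.
from scapy.all import Ether, IP, UDP, sendp

def send_probe(dport: int, count: int = 5, iface: str = "eth0"):
    # Target IP and port that the RISC-V firmware in the demo is matching on.
    pkt = Ether() / IP(dst="192.168.1.128") / UDP(dport=dport) / b"hello fpga"
    sendp(pkt, iface=iface, count=count)

send_probe(4321)   # wrong port: nothing should appear on the serial monitor
send_probe(1234)   # right port: the soft core prints a match over UART
```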
The next subsystem for DISL is the soft-core subsystem. As you've seen in the previous slides, soft cores are pretty useful in FPGAs, and to support a soft core you need a bunch of other subsystems as well, as you can see in the slide, with a standard interconnect in between. DISL will also provide an SDK to compile your C++ code, and also to compile C++ code into hardware via HLS, and you have your debugging IO blocks as well. In one of our experiments, we're using a RISC-V soft core, and the good thing about using a RISC-V soft core is that the code is open source and you can add your own custom instructions and custom pipelines, which gives you a lot of flexibility. Another subsystem for DISL is the memory subsystem, which is pretty important because, traditionally, the memory subsystems that were used were provided by the vendors, and they weren't very configurable: if you changed a few things in them, the whole IP would break. So for DISL we are working on providing open-source DDR controllers, cache controllers, and other memory subsystem blocks. The good thing about that is that a memory controller has many modules inside it, and different applications require different configurations to work optimally; with open-source code for these controllers, you can configure all of the modules inside the DDR controller according to your application's needs, which is one of the basic aims of DISL. And another subsystem, one that connects everything together, is the smart switch. A smart switch, as you can see in this diagram, connects everything together, but unlike, for example, a one-to-N crossbar, it has end-to-end connections, and the really important thing about it is that it's interface-independent. You could be using the Wishbone bus architecture, AXI streaming, or custom bus architectures; the smart switch is independent of that, so you don't need to worry about implementing and connecting them in hardware. In order to connect all of these modules, you don't need to change the RTL code; you just add your module in a configuration file and tell DISL: this is my module, these are its IOs. DISL will automatically connect it to the smart switch. One example of the smart switch's functionality: say you're using the network subsystem and your soft core is communicating with the network subsystem, but now that you're done, you want your Ethernet packets to go directly into the cache. The smart switch can change these connections on the fly; once you tell it, the smart switch will change the connection while running, for example switching the network subsystem's connection from the RISC-V core to the cache. So this gives us a lot of flexibility as well. Combining all three of these subsystems gives us demo three. We have our soft core, our smart switch, and our custom memory subsystems. On the right, you can see that, as I mentioned, we're connecting all the modules to the smart switch. You don't need to write a lot of complex RTL code; you just write in a configuration file that these are my connections, and DISL will do it automatically for you. In this sorting demo, we run different sorting applications, written in C++, on the soft core, and we see how having a configurable memory subsystem helps us improve performance. This is the schematic, and you can see the blue block with the smart switch in the middle of it. Here's the demo. First, the bitstream is uploaded onto the FPGA board, an Arty 35T. As soon as the bitstream is uploaded, we give it a reset and set the programming switch high as well; now it's low. And now we upload the hex file, the instruction memory for the C++ code, onto the FPGA. You can see the blue light lighting up: that's the hex file, the instruction memory, going through the program loader into the cache and into memory. As soon as it's loaded, the RISC-V starts reading the instruction memory and executing the sorting application that we wrote in C++, and you start seeing the results on the UART port. The first algorithm is a selection sort of 10,000 items; the others were quicksort, then shell sort, then heapsort, then insertion sort, and then bubble sort. Now look at the results and the timing of these algorithms. In this graph, the y-axis is normalized execution time, where lower is better, and the x-axis is what's really interesting: these are the cache parameter configurations. We have cache size, associativity, cache line size, and cache write policy, and along the x-axis are the different combinations of these. For example, from the purple and black dotted lines, you can see that one configuration of your memory subsystem might be best for one algorithm but not for the others; the sweep sketched below illustrates the point.
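As an illustration of that kind of design-space sweep (the parameter values and cost model below are made up; the real evaluation ran on the board), one might enumerate cache configurations and pick the best one per algorithm:

```python
from itertools import product

# Hypothetical cache parameter space, echoing the axes in the talk's graph.
SIZES        = [4096, 8192, 16384]            # bytes
ASSOCS       = [1, 2, 4]                      # ways
LINE_SIZES   = [16, 32, 64]                   # bytes
WRITE_POLICY = ["write-back", "write-through"]

def simulated_runtime(algo, size, assoc, line, policy):
    """Stand-in cost model; a real flow would run the RTL simulation or the board."""
    penalty = {"bubble": 3.0, "quick": 1.0, "shell": 1.5}[algo]
    cost = penalty * 1e6 / (size * assoc)
    cost *= 1.2 if policy == "write-through" else 1.0
    return cost + 100.0 / line

best = {}
for algo in ["bubble", "quick", "shell"]:
    configs = product(SIZES, ASSOCS, LINE_SIZES, WRITE_POLICY)
    best[algo] = min(configs, key=lambda c: simulated_runtime(algo, *c))

for algo, cfg in best.items():
    print(f"{algo:>6} sort -> best cache config {cfg}")
```

With a generated, configurable memory subsystem, each workload can actually be deployed with the configuration that wins its sweep, which is the flexibility argument made next.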
So that's why having this kind of flexibility, and being able to generate hardware according to your needs, is the basic aim of this work: to generate your own hardware operating system with your own components and your own interconnect. We think this would be a good first step towards building our own user-defined hardware operating system. So that's the end of the presentation. Thank you. Did you do any testing? I'm sorry? Did you do any testing with utilizing peripherals with VirtIO as opposed to native? Is there a huge difference in latency? Oh, you mean for the PCIe unit? We actually haven't done the comparison yet, but we suspect that VirtIO might be slightly faster, though not by much. The reason we believe it would be faster is that in a typical use case, the driver running on the host has to do multiple rounds of communication to program the DMA engine. In the VirtIO scenario, what happens is that when the driver loads, it sets up the used ring, the available ring and everything, and passes the descriptors to the device. Then the only thing that needs to be sent per operation is a notification, a single packet, and the device programs the DMA engine directly. So we believe it will be slightly faster, but no, we haven't actually tested it. In practice, for a developer, how will all of these different components eventually be packaged up so that someone could use them for an application or some end use? I think what we envision is something similar to what you saw with the smart switch. There will be a file that describes the functions I need: I need PCIe, I need a memory controller, but I don't need Ethernet. You specify that, and then the tool automatically generates the RTL. That's what we envision. That's perfect, just to add to that: that's the level certain developers will want to work at, because they want that much control. But for developers who don't even want to work at that level, what we can do is integrate this with high-level synthesis. You describe the application you want to run, and from that we automatically derive, based on which FPGA you're going to put the application on, all the different pieces needed to support that application, and automatically build those in. Hi, I was wondering about the potential hardware overhead the operating system would use on the FPGA. Usually when you study these things you have some small models used for undergrad classes and so on. Is this intended for small FPGA devices, or is it more centered on higher-end parts? Because if the overhead is too much, maybe certain bigger applications cannot fit on the FPGA if you use an operating-system-like solution. Yeah, that's a great question. That's where we believe the configurability aspect would help: we can configure what's needed, so we only include what's absolutely necessary. Contrast that with a shell provided by the vendor: those have all the components fixed. Even if you didn't need a PCIe controller, it's going to be there; even if you didn't need the network controller, it's going to be there. So we're hoping to actually reduce the overhead in comparison to what's out there, with this new configurability aspect. Thank you. Yes, to add to that:
that's where soft cores start playing a major role, because if you have a small FPGA but you really need some functionality, you can trade off some performance and get a much lower footprint for implementing that additional requirement. If you've got a PCIe-class board, you can do pretty much everything; if you're on an IoT device, you can use a soft core for most of it, and maybe for some application where you really do have a constraint, like line-rate processing, you can write some HDL to implement that part. So, this is definitely a leading question: having seen this development day to day, I don't think you got across just the heroic effort it took to overcome the lack of support. Maybe you could give a few seconds on the different motivations of the FPGA vendors, in contrast to what you're trying to do, or what we're trying to do, and what needs to get done in order to overcome that. Yeah, there are actually a couple of aspects to that as well. The first one is driver support. As I covered in the slides, these drivers need to take some device- or vendor-specific information into account. The FPGA vendors do provide reference drivers, but honestly, in my experience, they do not do a very good job of maintaining those. For instance, within the few months I was working on this project, I had to fix the driver for the board I was using two times. The first time was when I initially cloned the repo: it was not working to begin with. And the kernel update that broke the driver had happened over two months before I first tried it. So for those two months the vendor was not updating it, and everyone who was using the driver had to fix it on their own. That's our motivation for the in-tree drivers, as in the title. The other aspect is that when the vendors provide their intellectual property cores and related infrastructure, I feel they tend to believe the user should use them exactly as the vendor intends, so they don't think it's necessary to expose some of the options. Take what I said about adding capabilities to the PCI capability list. Let me walk back a little. Typically, the hardened logic in the FPGA responds to configuration space accesses with whatever capabilities are already implemented in the hardened IP. But it also allows you to forward some of the accesses to user logic, so that you can respond with custom capabilities. However, the option to change the next pointer of the last capability already implemented is not exposed; you simply can't do it. So even if you implemented your custom capabilities, they don't get recognized when the system boots up. That led me to just hacking through the RTL. Luckily it was a cheap FPGA, so the IP was free and not encrypted, and I had to go through and change it at the RTL level, because that option is there in the hardware. It's not a hardware limitation; the software is limiting you, because the vendor does not think you need that advanced feature. So those are two aspects where we need this flexible open source framework. Have we, or will we at some point, begin talking with some of the vendors about what you're doing and the possibilities for collaboration?
I'm sure in a lot of cases, implementing the software on top of this hardware is not something the vendors really like to do. You mean about the plan? Yeah, we are trying to make the case for it. One of the reasons we're trying to do all these things is to make the case that this approach works, that it's not going to break down the first time it's used in practice. And there are certainly opportunities we're finding where this can happen, not just at the chip vendor level, but also at the board vendor level. For example, in the slide we had with an example illustration of this all being implemented: the stuff that gets loaded in the BIOS, the stuff that then allows you to further load an operating system, that is board specific. There's nothing in there you don't want to customize, because it's so close to the hardware and it's going to use the hard IP blocks. So that's something that could very well ship with the board. I can see the value of the test designs these boards ship with today, where the LEDs blink and it's very fancy, but they could start shipping with stuff like this instead. Then, at the very least, when you have a board, you know that if I need to use an interface, the complex stuff, the stuff that has to be tuned for the latencies of these chips and whatnot, is already there. So there are definitely opportunities coming up, and the more we explore, the more opportunities we hope will come up, so we can have a very specific discussion: here are the things we feel you could do, and they're very practical to do. Sahan also found, for example, that they could put in VirtIO support by default. They're just not doing it, but there's nothing that says it can't be done. One quick thing: our hope is that it'll be the same as what happened with open source software. When we have more and more of this out there, it will force the vendors to actually open up theirs. Well, thank you very much. So the next talk is supposed to be at 2:30, but we're going to start at 2:35. The next talk is going to take us deeper into the rabbit hole, and we're going to look at what happens at lower levels in the hardware stack. Thank you. So now we have the third talk in the open-hardware initiative series. This is by Shachi Kadilkar, who is a third-year PhD student in the ECE department at UMass Lowell, supervised by Professor Martin Margala. And this talk is going to be about a depth of hardware that only a few hardware engineers dare to work at. Thank you so much, Ahmed. Well, I've already been introduced, but I'm Shachi, and my talk is on open-source toolchain optimization for FPGA CAD tooling. I'm a PhD student at UMass Lowell, I'm working with Ahmed Sanaullah at Red Hat, and my PhD advisor is Professor Martin Margala at UMass Lowell. So let's get into this. I think we've already seen multiple slides to this effect. Basically, as we say, FPGAs are highly reconfigurable and very flexible; like the name suggests, they can be reconfigured in the field, giving the developer a lot of flexibility. They are massively parallel, which is why they can be used for acceleration. And I think we've already spoken about FPGAs being ubiquitous in data center and edge computing.
So, yes, FPGAs are awesome, and we're going to move on from this slide. Since my project mainly deals with the lower-level hardware details, let's quickly look at what an FPGA really looks like. This is what the architecture looks like on the left side of the slide. FPGAs are basically very large arrays of logic blocks, called configurable logic blocks, plus hard blocks, which are custom blocks like DSPs and block RAMs, and programmable routing wires, the programmable interconnect. On the right, we can actually see what's inside, and we'll be talking about the details of these primitives. Basically, the logic blocks are made of lookup tables and flip-flops. Lookup tables are used for the computation part; they can implement a variety of logic functions. The flip-flops are the storage for the computation. With computation and storage together, a large amount of functionality can be achieved with FPGAs, limited only by the resources available and by timing. So, we have these lookup tables; this is something that will keep coming up throughout the talk. This is what a three-input lookup table looks like. Typically, most FPGAs in use today have six-input LUTs, but the functioning remains the same. It's basically an 8-to-1 multiplexer with a one-bit output, and there is a programming vector, E1 through E8, shown in blue. The programming vector basically decides how the LUT behaves. A very simple example is how an XOR gate would be implemented using a LUT. Looking at the truth table of the XOR gate, if we want I2 XOR I1, the outputs follow the XOR truth table: for the same inputs, the output is low, and for different inputs, the output is high. That's how a programming vector can be used to implement an XOR gate in a LUT, and LUTs can implement all sorts of logic gates, custom ones as well as the standard ones. Now, on to FPGA CAD tooling, which I think we've already seen in one of the earlier talks. We need the CAD toolchain to convert FPGA code into the final bitstream that programs the hardware. At the top of the diagram we have high-level language code; yes, high-level synthesis is being used, so software methodologies are being applied to hardware, which brings a number of advantages. Then there is Verilog, or any HDL code, which leads to the third part, the logical netlist. Compiling FPGA code to hardware is different from compiling other standard languages, because the synthesized code doesn't map to machine code; it maps to low-level hardware on the device. Typically there are three steps: first synthesis, then place and route, and then bitstream generation. The parts in green are the synthesis step, which converts the HDL code to a netlist that shows the connections between different blocks, in terms of the resources available on the FPGA, but it is still a logical graph. So we are still in the logical space, and the next part is place and route, where that logical graph actually has to map onto the FPGA hardware. The place-and-route stage needs to be very aware of what is on the FPGA, all the components, and also the timing between them.
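The LUT behavior described above, a programming vector selected by the inputs through a multiplexer, is easy to mirror in software. Here is a toy C++ model of a three-input LUT programmed as XOR; it illustrates only the concept and is not tool code:

```cpp
#include <bitset>

// A 3-input LUT: the programming vector is the truth table, and the inputs
// form the select lines of an 8-to-1 multiplexer over its bits.
struct Lut3 {
    std::bitset<8> program;  // E1..E8 from the slide, one bit per input combination

    bool eval(bool i2, bool i1, bool i0) const {
        unsigned index = (unsigned(i2) << 2) | (unsigned(i1) << 1) | unsigned(i0);
        return program[index];
    }
};

int main() {
    // Programmed as "out = i1 XOR i0" (i2 ignored, so the pattern repeats):
    // the output is high exactly when the two inputs differ.
    Lut3 xor_lut{0b01100110};
    return xor_lut.eval(false, true, false) ? 0 : 1;  // 1 ^ 0 = 1, so returns 0
}
```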
That's why you can see the FPGA database there: all the low-level device details are needed for the implementation, the place-and-route stage. Even if we do optimize the rest of the stack, place and route is really important, because by improving and optimizing here we can greatly influence area, power, resource utilization, and delays, so maximum circuit speed. It is a very important stage, and typically it is difficult to work with. To look a little at the toolchain and how it really works: this example I synthesized and placed and routed with Xilinx Vivado, targeting Artix-7 devices. This is what the Verilog code looks like. It is behavioral and, as you can see, highly abstracted; it just adds two signals. There is one clock buffer there which we don't have to worry about, but it's pretty high level. This is behavioral HDL code, and it is the input to the synthesis stage. After this, we have the post-synthesis logical netlist. As we've seen before, behavioral HDL is converted to logic, which is represented by LUTs, plus flip-flops for storage. The adder example basically converts to three LUTs and three flip-flops. There are also some input and output buffers, which isolate the circuitry inside from signals coming from outside. So this is the stage where mapping to LUTs happens, and typically there is also a tech-mapping step which maps these LUTs and flip-flops to the specific device, since these blocks are not the same across different FPGAs. The next step is the packed netlist; we are particularly interested in this step. Basically, what the packer does is look at the input netlist, analyze the primitives on the hardware, and, based on that analysis, decide what needs to be kept together. For example, primitives with close logic connections have to be packed together, into what are known as clusters, so this step is also known as clustering. The goal of clustering is, of course, to serve the overall objective, which could be optimizing area or reducing delays, and basically it needs to make the routing step easier. Next, the LUTs and flip-flops are actually placed on the physical FPGA device: after the primitives are clustered, the next step is finding locations on the FPGA, and that step is placement. Here you can see those primitives, the LUTs and flip-flops, placed, and within a site you can see the green wires running; routing is done within the site. Depending on the optimization parameters and the algorithm being used, the goal changes, but the goal of placement is always also to help the router find the most optimal solution. So this is what a placement looks like. The most compute-intensive and most demanding stage of place and route is routing; this is the final stage. Post-routing it is really difficult to see, and I'm sorry about that, but in the left corner of the FPGA the placed elements form a complex graph, built from all the placed elements and the timing information associated with them, and the router has to find the optimal paths connecting these placed elements, based mostly on timing information. This is extremely compute-intensive.
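To make the clustering step more tangible, here is a toy sketch of the greedy flavor of packing described above. It is not VPR's actual algorithm, just the general idea: seed a cluster with an unpacked primitive, then absorb closely connected primitives until the site capacity is reached:

```cpp
#include <cstddef>
#include <vector>

// A netlist as primitive-to-primitive connectivity: adjacency[i] lists the
// primitives sharing a net with primitive i.
struct Netlist {
    int num_prims;
    std::vector<std::vector<int>> adjacency;
};

// Greedy clustering sketch: every net absorbed inside a cluster is routing
// the router never has to see.
std::vector<std::vector<int>> pack(const Netlist& nl, std::size_t capacity) {
    std::vector<std::vector<int>> clusters;
    std::vector<bool> packed(nl.num_prims, false);
    for (int seed = 0; seed < nl.num_prims; ++seed) {
        if (packed[seed]) continue;
        std::vector<int> cluster{seed};
        packed[seed] = true;
        // Breadth-first absorption of connected, still-unpacked primitives.
        for (std::size_t i = 0; i < cluster.size() && cluster.size() < capacity; ++i)
            for (int nb : nl.adjacency[cluster[i]]) {
                if (cluster.size() >= capacity) break;
                if (!packed[nb]) { packed[nb] = true; cluster.push_back(nb); }
            }
        clusters.push_back(std::move(cluster));
    }
    return clusters;
}
```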
So, between clusters, and from the IOs to the clusters, these wires are routed, and the goal is always to minimize them. In the detailed view of the router, the nets can be seen in green, running between clusters and from IOs to clusters. Now, as far as the synthesis stage is concerned, a lot of work has already been done in that space, and bitstream generation and FPGA assembly are essentially a one-to-one mapping: the implemented netlist is converted to an intermediate FASM form, and then finally you have the bitstream that programs the FPGA. But place and route really doesn't get enough attention, because it's difficult to work in this space, and there are a number of issues with the open source and closed source tools that are available for it. This brings us to vendor CAD tools. Of course, there's Xilinx, there's Intel, there's Lattice; several closed source vendor tools are available for FPGAs. So the obvious question is: why not just go ahead with these? Well, they do have quite a few problems. The major issue is that they're closed source, so licensing is expensive and it is essentially impossible to customize these tools. A lot of them are also architecture specific, and they change with every generation of FPGAs. Moreover, vendor CAD tools apply a set of generic optimizations aimed at a large number of generic use cases, and this often leads to long turnaround times. So although these tools are available, there are a lot of problems with using them. This brings us to open source CAD tools, and there's definitely been a lot of work in this space as well. Here I mention a few tools. One of them is, of course, Verilog-to-Routing, VTR. This is one of the oldest open source tools around, and a lot of groups have actually built on VTR; it basically gives you a full open source flow from synthesis to bitstream. But for the longest time, open source FPGA tooling only targeted theoretical architecture models, because a lot of device details are needed to build these tools for commercial FPGAs. That's why, in recent years, there have been projects like Project Trellis, which have actually documented the low-level device details for us. So now, with open source tooling, we are able to target vendor FPGAs, which is a really good thing. F4PGA is also in the flow; that's another tool, which modifies VTR a little so that it can be used with Xilinx 7-series devices. YosysHQ's nextpnr is meant for Lattice devices, and RapidWright is one more tool I've put up there. A few interfaces to vendor tools also exist; they're limited, but they do give some information on how the vendor tools are built. We've tried RapidWright, though, and it really didn't give us that much information. So, open source CAD tools definitely have their advantages. They're highly customizable, they do not restrict the developer, and we can leverage community support with them. But the issue with the current open source CAD tools is that the hardware quality is just behind the vendor tooling, so they're not really practical for real workloads. Also, a lot of the tools contain a lot of device-specific custom coding, so they are very architecture specific; they're just not generic enough. And this is where we come in.
So, yes, our focus is place and route, and we basically want to... it's time for a break; let's take a five-minute break. Okay, I'll start again. So, this brings us to where we come into the picture. Our research objective is, of course, to reduce this gap between open source and closed source tooling, and we want to do this by automatically tuning hardware generation policies. The generic algorithms do produce good results, but they're just not good enough, and the proprietary tools are likely employing heuristics for different chips to get good results, which we do not have. So our goal is to do this by building up our own set of heuristics, and ideally in a scalable way. We are trying to learn how a critical policy in the implementation algorithms should change according to input circuit patterns. For now we're doing this manually, but the goal is to eventually automate this with machine learning. This is the framework that we have built. The goal is to find a policy map, which basically tells us how the policy should change for different hardware patterns. A detailed study of the CAD algorithm is required; it can really be any CAD tool with any architecture, but a detailed study of the algorithm helps identify the policies that are critical to the process of packing. Along with this, there are three important pieces. One is the set of policies that we identified. The second is the synthetic benchmarks: these represent sets of hardware patterns which are built to give information on the target policy. And third, it's important to have a full set of metrics that helps judge hardware quality at each stage. We look for suboptimal policies and tune them so that the tool can make the best possible decision for a given hardware input pattern; we want to learn how policies should adapt. Typically this is run once per device architecture. We have been using VPR from the VTR project, which is an open source place-and-route tool, targeting 7-series devices. So this is what we built for getting the policy map for improving the tool. Once we have the policy map, it is used to improve the CAD tool for a better logical-to-physical netlist mapping. We have used this with the packing stage: the policy map determines the best packing algorithm policies specific to a given input netlist. Doing this semi-manually is fine as an initial step, but it's not scalable, so we will be using ML and other statistical methods to fully automate it. Coming back to this: for the three stages of implementation, different algorithms are used for each. VPR uses a greedy clustering algorithm, which is architecture aware, simulated annealing for placement, and PathFinder for routing. For each algorithm, for each stage, we need a set of policies, a set of metrics, and a set of synthetic benchmarks. To show the effectiveness of our framework, we are using packing. Just to refresh: packing is the stage where LUTs and flip-flops are placed into clusters, based on which set of elements is best kept together. So this is the framework we have built; we're using it with packing, or rather, we show its effectiveness using the packing stage. This is the process of getting the packing hardware quality metrics.
So, like I said, there are three stages, and the first is getting the packing hardware metrics. For this, we've used VPR, and any set of input design patterns is fine as long as we have a logical netlist. Typically we want to analyze resource utilization, and the tools give out a lot of timing information: percentage device usage, delays, longest paths, and things like that, plus information specific to clustering. So we do have a set of metrics that are specifically post-packing. Then we go all the way through the flow and do the same thing for routing, to get a set of post-route metrics. Since routing is the most critical part, the various packing algorithms, which optimize a bunch of different things, always look at how packing affects routing. An additional thing we need to know is: when just the netlist has been packed, what heuristics can tell us whether packing has done a good job, rather than going all the way to routing? So, here's a look at the hardware quality metrics. Based on our literature review of VPR, this is one set of things we're measuring: for example, a site count and a logic block count. On the right is the FPGA device; Xilinx has a very good graphical representation, so that's where the screen capture is from. Basically, you can see the tile is a configurable logic block with two sites inside it, and a cluster belongs to one site. So we have the tile, and the site, and then the site has the primitives we talked about: four LUTs, a carry chain, and eight flip-flops. We take site counts and tile counts, we get all the details of resource utilization, and then fanouts within clusters. Fanout here means the number of nets that get absorbed inside the cluster, because those reduce the routing that is required outside it. Then, whenever there are two primitives that can implement a certain type of netlist element, we look at what choice the tool makes and whether it adapts that choice based on our input circuits. Input pin utilization per site is significant because it gives you a measure of how tightly the clustering is being done. And the last two are standard post-route metrics: routed wirelength and the maximum circuit speed that's achieved; these are commonly used. So this is one set of post-pack and post-route hardware quality metrics that we're using. All of the metrics correspond to policies. The site count and tile count impact, of course, utilization and routing; resource utilization impacts the area; and maximum fanout within a cluster: the more the cluster can absorb, the less external wiring is needed. Then logical element type and block type: this basically tells you how the tool is making its choices, and more than anything else, it's good to know whether anything changes in what the tool does when you add an extra element or push the resource utilization. Input and output pins determine packing density, and the last two are connections that cannot be absorbed by the cluster, and delays along the longest path. For each metric, there are policies that it points to, and we need to have a full set of these. The last piece is actually building the synthetic benchmarks.
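As one concrete example before moving on to benchmarks: the fanout-absorption metric above can be phrased as the fraction of nets that end up entirely inside a single cluster. A small illustrative sketch, with assumed data structures rather than VPR's own:

```cpp
#include <unordered_set>
#include <vector>

struct PackedDesign {
    std::vector<std::vector<int>> nets;  // each net lists the primitives it connects
    std::vector<int> cluster_of;         // primitive id -> cluster id (post-packing)
};

// A net is absorbed when every primitive it touches landed in the same
// cluster, so it needs no external routing at all.
double absorbed_net_fraction(const PackedDesign& d) {
    if (d.nets.empty()) return 0.0;
    int absorbed = 0;
    for (const auto& net : d.nets) {
        std::unordered_set<int> clusters;
        for (int prim : net) clusters.insert(d.cluster_of[prim]);
        if (clusters.size() == 1) ++absorbed;
    }
    return double(absorbed) / double(d.nets.size());
}
```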
So, there are a lot of benchmarking options for CAD tools; the Titan benchmark is very popular, and there are also VTR7 and MCNC. But the thing is, they do not have any isolated hardware patterns, and synthesis and placement tools typically optimize a whole lot of the circuitry away. Synthetic benchmarks are simple, but there are very specific patterns that we are trying to build into them. On the right is how we build synthetic benchmarks: identify the patterns with target policies, then build a set of templates. It's good to have some sort of automation here, so that we can generate a large number of files and vary the resource utilization. It's also best to use an open source synthesis tool, so the optimizations it applies can be reviewed. And almost always there is a need to adjust the IO footprint, just so that all the circuits we're building fit on the FPGA that we're using. So, this is an example of a synthetic benchmark. Typically, a synthetic benchmark will have a simple circuit, like a chain of LUTs, and here it is a chain of LUTs and flip-flops; then we basically increase the size of the circuit until we fill up the whole device. I don't think I have enough time, but if anybody wants to know, I can talk afterwards about the rationale for how these kinds of simple circuits can help reveal specific policies. This is another synthetic benchmark: this one uses 32 of the site pins. This is another very good case, because if we replicate this module multiple times, we get to see whether the cluster targets change at all, so we get to know whether the tool is doing any tuning on its own. Now, let's take a look at the initial experiments and the proof of concept. We've used Yosys and VPR for our toolchain, on Xilinx Artix-7 devices. The metric we're looking at is how many of the input pins of a cluster are utilized. Typically there is a target that the open source tool sets, and for VPR we noticed that, regardless of the input circuit, the utilization target is fixed at a certain number. So what we're trying to do here is empirically tune that number and see the change in hardware result quality. Here is the first result: wirelength. For wirelength we expect, of course, to see less wiring. What's in blue is the default, what the open source tool does out of the box, and on the right is what we have empirically tuned it to do. This is for six experiments, and lower is better. We do see reduced wirelength. The next metric is soft logic blocks: resource utilization reduces significantly. Based on the empirically tuned value, we got better numbers for soft logic blocks and, correspondingly, for the area they were using. And finally, this is the maximum circuit speed. Again, this is a metric that is often used, but here the results are slightly mixed: in some cases, the tool's default does better than our empirically tuned value. But this definitely says that one policy will not fit all kinds of hardware patterns; the tool needs to adapt, and that's where our policy map comes in. So if we look at speed, wirelength, and logic blocks across the cases: for speed, yes, in a couple of places we did have the defaults doing better than what we tuned it to be.
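As an aside, the empirical sweep behind these plots can be driven by a small harness along these lines. This is a hedged sketch: it assumes VPR's --target_ext_pin_util option is the knob controlling the fixed input-pin utilization target discussed here, and the architecture and circuit file names are placeholders:

```cpp
#include <cstdio>
#include <cstdlib>

// Re-run the VPR flow across a range of input-pin utilization targets and
// keep one log per run; wirelength, soft logic block counts, and Fmax are
// then scraped from the logs to build one column of the policy map.
int main() {
    const char* arch = "artix7.xml";         // device architecture (placeholder)
    const char* circuit = "lut_chain.blif";  // synthetic benchmark netlist (placeholder)
    for (double target = 0.4; target <= 1.01; target += 0.1) {
        char cmd[512];
        std::snprintf(cmd, sizeof cmd,
                      "vpr %s %s --target_ext_pin_util %.2f > run_%.2f.log 2>&1",
                      arch, circuit, target, target);
        if (std::system(cmd) != 0)
            std::fprintf(stderr, "run failed for target %.2f\n", target);
    }
    return 0;
}
```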
But for wirelength and for soft logic blocks, that is, resource utilization, the empirically tuned value consistently does better, and that is the way we expected it to behave. What this shows is that the tool cannot keep using a static policy. So, this is our proof of concept. To conclude: we have built a framework which identifies suboptimal policies, or magic numbers, and tunes them to generate better-quality circuits, and we've demonstrated this with packing, which is a critical part of implementation. Basically, the same framework can be used for any other implementation stage as well, as long as all of the missing pieces are figured out. So, that's it. Thank you. I actually have a couple of questions. The first one is about the hardware quality metrics: I didn't see routing resource usage, specifically, in there. Was it just not put in, or will those be covered indirectly by the other metrics you've shown? So, we've only applied this framework to packing for the time being, and most of the metrics I showed were taken just after clustering is done; because it's VPR, you don't even have to go all the way to placement, so the netlist hasn't really been routed. At this point, we weren't really looking at which routing resources have been used. Actually, I was referring to that table you had earlier, not the results; you mentioned so many metrics there. Yes, the actual routing resource usage is not in the table. Is it covered by some of these indirectly? No, not really. Basically, except for the last two, all of these tell you how packing was done; we're just focused on packing. The other question is about the graphs: how does this compare to the actual vendor tools? Do you have that comparison? Excellent question. I did run the comparison with Vivado. For this set of policies and this synthetic benchmark, post-route VPR was doing better, but it was doing better without our changes as well. So it's comparable, but it doesn't really show an improvement over the vendor tool. That's one of the next things we are trying to do: our policy tuning has to work with some real benchmarks and achieve at least comparable performance to what the vendor tools, like Vivado, are doing. Thank you. Thank you. So, at 3:30 we have our last talk in the series, on an application that leverages FPGAs to get significant improvements in functionality and performance versus traditional CPU operation. Relational database systems are widely adopted in commercial systems, like data analytics, banking, or making a travel booking; in all of those you are using a relational database. These days even your phone has a relational database inside; for example, your Android phone is equipped with SQLite. Here I'm using a basic healthcare table that has name, ID, age, height, and weight. If we want to query something like: find the name and ID of the people whose height is over 170 centimeters, then you probably need to look at two major parts of the table.
First, you check the height field to find the rows that qualify the condition you have given, and then you fetch the fields selected by the SELECT. So this is the logical view of the database you are using. Now, how do we store these tables? There are two representative data layouts, row stores and column stores, and you can probably guess from the names: row stores store the table row by row, and column stores store the data column by column. So, after storing all the names in the first field, you store the ID field, then age, then height. Here we focus on in-memory databases, so we only consider the data movement from main memory to the CPU. Let's assume that a cache line is identical to a memory line. What happens then? In a row store, first you need to fetch the height field from the first row, so you fetch the first line and check the condition you've been given. It's not qualified, so you move on to the height field of the second row. That one is qualified, so you fetch the line again to get the name and ID. Answering queries like this, as you can see, there are lots of white parts: data that is fetched but not used in the actual query. That means it has very poor cache locality. So what about the same query on a column store? Since the heights are organized and packed together, you can check several height values from the one line you just fetched. The second row is qualified, so you fetch its name and ID, and you do the same for the other qualifying rows, such as row four. In this case, the column store can be better than the row store, but it still fetches some fields that are not necessary for this query, and which fields you need to fetch varies depending on which query you are given. So adaptive layouts have been proposed: the data is initially stored as a row store, and then slowly converted into column stores, so that the system can answer queries that touch only some of the columns. This is useful, but it has some problems, because it has to maintain multiple copies of the data in different layouts. That increases complexity, which means it slows down your process, and because of the heavy bookkeeping it is less scalable and harder to maintain. So: what if we could access only the desired columns, without storing and maintaining multiple copies of the data? I'm going to explain with some more examples. We have this kind of table that we want to access. When a new query comes in, the optimal layout will be like this: in this query, we ask for name, ID, and height, so the optimal layout is a table with those three columns. And what if your query changes? For this query, the optimal layout is a table with only one column, the height column; while if the query selects three other fields from the table, then the optimal layout is something like this. So how can we provide the optimal layout for any given query? Here we introduce the fmr variable. The fmr variable acts just like any other variable, but it is never instantiated in memory, so we do not have to maintain multiple copies of the data. However, it always points to the optimal layout based on your query.
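To make the layout discussion concrete, here is a minimal C++ sketch of the same selection query against the two layouts. The field sizes are illustrative; the point is which bytes the predicate scan drags through the cache:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row store: one record per row, so the height check touches every field.
struct Row {
    char     name[16];
    uint32_t id;
    uint8_t  age;
    uint16_t height;  // centimeters
    uint16_t weight;
};

std::vector<uint32_t> ids_taller_than_170_rows(const std::vector<Row>& table) {
    std::vector<uint32_t> out;
    for (const Row& r : table)  // whole rows stream through the cache
        if (r.height > 170) out.push_back(r.id);
    return out;
}

// Column store: heights are packed and contiguous, so the predicate scan is
// cache friendly; qualifying positions then index into the id column.
struct Columns {
    std::vector<uint16_t> height;
    std::vector<uint32_t> id;
};

std::vector<uint32_t> ids_taller_than_170_cols(const Columns& table) {
    std::vector<uint32_t> out;
    for (std::size_t i = 0; i < table.height.size(); ++i)
        if (table.height[i] > 170) out.push_back(table.id[i]);
    return out;
}
```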
So, if your base structure is a table with five fields, the fmr variable will expose just the three necessary fields, and this optimal layout is generated on the fly using a technique called programmable logic in the middle. If your query changes, then the optimal layout changes with it: once the fmr variable is defined, the data is reorganized into the optimal layout. So far, I have briefly introduced what relational databases are and how we can use the fmr variable, and I have explained that our fmr variable points to the optimal layout at all times using the programmable-logic-in-the-middle technique. From now on, I'm going to explain what that technique is. A traditional system looks something like this: there is a CPU and a memory, and they communicate with each other. But there are platforms where an FPGA sits between the CPU and memory, so that the CPU's memory transactions go through the FPGA, and there are a lot of commercially available platforms that have both on the same device. The programmable-logic-in-the-middle technique captures the transactions issued by the CPU, does whatever manipulation needs to be done, and returns the data without the CPU's knowledge. The major benefit of this technique is that you don't need to change anything on the software side; the jobs you want done can be hidden inside the programmable logic. So, without changing anything in software, you can achieve the goal you want. Here we implement our relational memory engine in the FPGA using this technique. Our relational memory engine sits in the programmable logic between the processing system and memory, and it is composed of four major components. First, the Trapper: the Trapper is the interface module between the relational memory engine and the processing system. The Monitor tracks the completion and availability of the reorganized data. The Requester is the brain of this module, and it orchestrates the accesses to memory. And the last component fetches chunks of data from main memory and extracts the useful data from what was read. Once the fmr variable is defined, our RME module gets the DB geometry. The DB geometry is things like the word size, the word count, how many columns you want to read, the width of each column, and their positions. Once the fmr variable is defined, the CPU will try to read data from the fmr variable, and the read transaction for the fmr variable is first trapped by the Trapper. The Trapper asks the Monitor whether the requested data is in the data SPM, the on-chip scratchpad, and the Monitor holds the corresponding metadata on whether the data is available or not. If the data is already in the data SPM, the Trapper gets the data and transfers it to the core. If the metadata says the data is not in the data SPM, the Monitor notifies the Requester, which needs to fetch the data from memory. The Requester then programs the fetch and fires the read transactions to the DRAM. For example, say we have the data laid out in rows, and only the highlighted parts are wanted; the rest are fields we are not interested in. The Requester only issues read requests for the lines that contain the desired fields and skips the others, and based on the information from the DB geometry, the useful pieces of each line are extracted and written into the data SPM.
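For illustration, the DB geometry the engine is programmed with when an fmr variable is defined might be described by something like the sketch below. The field names and the packed 25-byte row are assumptions for this example, not the paper's actual API:

```cpp
#include <cstddef>
#include <vector>

// Where each wanted column lives inside a row of the base table.
struct ColumnSpec {
    std::size_t offset_bytes;  // position of the column within a row
    std::size_t width_bytes;   // width of the column
};

struct DbGeometry {
    std::size_t row_bytes;            // size of one (packed) row
    std::size_t row_count;            // number of rows in the table
    std::vector<ColumnSpec> columns;  // only the columns the query needs
};

// For "SELECT name, id WHERE height > 170" on the earlier healthcare table
// (name 16 B, id 4 B, age 1 B, height 2 B, weight 2 B, packed into 25 B),
// the fmr variable's on-the-fly layout would be described roughly as:
DbGeometry geometry_for_query() {
    return DbGeometry{25, 10000, {
        {0, 16},  // name
        {16, 4},  // id
        {21, 2},  // height (needed for the predicate)
    }};
}
```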
So, after screening that data, the data SPM holds only the columns we want, and in this way we can provide the optimal layout for any given query. Once the data is filled in, the Monitor updates the metadata table, and then the Trapper knows the data is ready and completes the pending transaction. From now on, I'm going to share the experimental results using our RME prototype. We have implemented our RME on the Xilinx Zynq UltraScale+ ZCU102 platform. It provides a quad-core CPU with data and instruction caches, the processing system runs at a 1.5 GHz frequency, and our programmable logic is clocked at 100 MHz. This is a schematic from Xilinx that gives some information about the connections; our read transactions are trapped in the programmable logic. Here the resource utilization of our relational memory engine is shown. Because of the data SPM and the metadata SPM, we use quite a lot of block memory, over 50%, but for lookup tables, flip-flops, and DSPs we are using less than 3% of the resources. So we believe we can even target FPGAs with fewer resources, which I think makes this suitable for edge computing and IoT devices as well. To evaluate the relational query aspects, we have used five queries. The first one is to measure the performance of projection, the second the performance of selection, the third both selection and projection together, and the last ones measure the performance of more complex queries, like group-by and join. We have compared the performance of our RME with a typical row store and a typical column store. First, we measured the performance across projectivity. This is the execution time normalized to the row store, so the row store is always at one. When the projectivity is less than four columns, the column store shows the best performance; however, as we increase the number of columns that we need to fetch, the performance of the column store gets worse. There can be two reasons: one is that the reorganization cost is high, and the other is the cost of reconstructing tuples out of columns. The performance of our RME, however, is stable across the various projectivities. So the column store outperforms our RME when there are only one or two columns to fetch, but in general RME shows a very stable performance. And I want to emphasize that those two baselines, the row store and the column store, run fully on the processing system, meaning they work in the 1.5 GHz frequency domain, while RME has to go through the programmable logic, which runs at 100 MHz. This is the performance on the selection query. Here the selectivity goes over 90 percent, which means that 90 percent of the values satisfy the given conditions. As you can see, the performance of RME remains stable regardless of the selectivity. And this is the query with both selection and projection. Here we vary the number of fetched columns and the columns used for selection as well. Darker green means the speedup from our RME is higher, and the red cells are where the column store is better than RME.
So, from the first experiment that I showed you: when the number of columns is less than four, the column store outperforms our RME. Similarly here, when the number of columns is less than about five, the column store is better than our RME. But in most cases our RME outperforms the column store, and in the best case it gets up to 2.23 times faster. This is the comparison with the row store, and similarly, darker green means our RME outperforms the row store. Here our RME is never worse than the row store; the benefit is not as large as against the column store, but on average it is 1.4 times faster than the row store. This is the complex query that uses group-by, and in both cases RME outperforms the row and column stores, as you can see. This is another complex query, with a join. Here we use the classical hash join algorithm, which means the CPU cost is high, so we show a breakdown that separates the CPU cost from the data movement cost. If we consider only the data movement cost, the RME also reduces it, so the overall picture is mixed. In order to evaluate our RME with a more realistic setup, we borrowed two queries from TPC-H. TPC-H is a well-known benchmark intended to evaluate the performance of relational databases. We use two of its queries, Q1 and Q6, because they each use only one table, as shown here. If we look at Q1, there are two complex operators, group-by and order-by, and those two require sorting, so there is a significant CPU cost; Q6, by comparison, has a lot of predicate comparisons. At this point we can measure the performance gain of using RME. As we have seen, the performance gain on Q1 is not so obvious, because it is a CPU-intensive query, while for Q6 the benefit of RME's reduced data movement shows through. So, we have proposed relational memory, a novel software-hardware co-design paradigm that can answer any query with the optimal data layout. We also proposed the fmr variable, a simple and lightweight abstraction to use the relational memory. And as shown, our relational memory engine can achieve better performance than the well-known data layouts, the row store and the column store. We believe that our relational memory engine can be useful in other fields as well. One example is to implement these features inside the memory controller. In the previous talk, the speakers mentioned configurable memory controllers; configuring the memory controller will be beneficial to users, but you can also add more capability to it, and I think we can aim higher by adding these features into the memory controller. We put this statement in the paper, and one of the reviewers said it didn't make sense to add this logic into a memory controller simply for database purposes. Maybe it doesn't, but we believe that this kind of data reorganization is quite common across lots of applications, like tensor slicing: in tensor slicing you have to do matrix transposes and that kind of reorganization, and I think our RME engine can be easily adapted to solve this problem. Thank you very much for your attention. This work has been accepted and is going to be presented next March at EDBT, but the paper is already available online, so if you want to know more about our relational memory engine, it
contains more experiments and some techniques for how we can provide concurrency-control-like features, so please check our paper. Thank you very much, and I'm also happy to take questions. Thank you. So we do have a panel until the next talk in 20 to 25 minutes, but we'll keep it informal; the speakers will stick around for 15 or so more minutes if people have any questions, and we can do that informally. But you still need the microphone. Are we still on video? Yeah, so we need the microphone. I've got so many questions that I could monopolize the next 20 minutes, but I will yield the floor after a minute or two. So, some really basic questions. On the platform you're using right now: how big is that FPGA? In other words, it's an UltraScale+ of some kind, but is that a small one or a big one? Where does it fit in the family? I'm not sure about the exact numbers, but as far as I understand, the UltraScale+ ZCU102 platform is a relatively large member of the family. And how does that relate to the size of the memory in the database? Because you said you use around 60% of the block RAMs. If your database is bigger, if it's more complex, how does that relate to the resources you'll need? That is a very good question. This is a comparably large FPGA, but when we think about the scale of the databases people are using, it's still far too small. People are dealing with gigabytes of data, sometimes terabytes, but the block memory an FPGA can offer is on the order of megabytes. Here we are using two megabytes for our data SPM, and some more for the metadata. Because of that, the amount of data our relational memory can cache is quite limited, even though it only stores the desired columns. Because of the time limit, I didn't go into the details of the reconfiguration, but if you look at the numbers here, our data SPM is 2 megabytes, which is quite large here, and because we only keep the desired columns, it can effectively hold up to about four times that much table data. And to support data larger than our data SPM, whenever data has been consumed we reuse that space and stream new data through, so we can support data that exceeds the size of the data SPM. So, if I can ask one more question, then I'll yield the floor. You mentioned very large databases, and that kind of takes us into real I/O. FPGAs have been used for database acceleration on IO devices for 20 years or more, and there are whole companies that do that. How does that approach differ from your approach?
It is true that the database community tries to offload some of its functionality onto hardware accelerators, but our approach has a fundamental difference. The majority of the approaches that have been proposed focus on accelerating a few complex operators, or on offloading the operator itself. We took a different path: instead of accelerating specific types of operations, we tried to optimize the data layout, so that we can maximize cache locality, while we leave the hard work to the CPU. So we focus more on optimizing the data movement from memory to the CPU, and on top of our relational memory, other techniques that accelerate, say, the join or group-by operators can also be used. The other part of the question: how much latency does it add to go from memory through the FPGA to the CPU, instead of directly from memory to the CPU? I think that is a very interesting and important question, but to be honest, I'm not a hardware expert; the majority of the implementation and testing was done by the folks from the other group, and I focus more on the database aspect of it. So, honestly, I don't have exact numbers for that question. For your part, I'm guessing it doesn't make that much difference, because you're streaming: you're getting a large amount of data, so the latency gets amortized over all of that. I was curious, and I don't know if this is the appropriate question because you said you're not a hardware expert, but does this actually change the number of actual DRAM accesses? If you're using the row or the column store on the CPU, you're dealing with a certain number of memory accesses; with the RME, does the actual number of DRAM accesses change drastically because of this?
I think I have the numbers, in some form: we measured the number of actual reads from the DRAM in the end. The RME is based on the row store, so its count is quite close to the numbers for the row store, while the column store has a totally different layout, so it has a different number of memory reads and transactions. So I think our relational memory fetches data from the DRAM similarly to the row store. In the paper we have another experiment that I didn't share: if you look at the paper, there is more comparison where we change the offset of the columns, which affects the number of actual reads from the DRAM. Actually, one more fundamental question, I guess: the CPU still has to take this data, put it in the cache, and use it for something. How does the CPU know that the layout has changed somewhere in the middle? That's a very big question. For now, the users explicitly know the geometry of the database and which layout they want, and in our experiments we explicitly assign the specific fields that we need. As future work, we are trying to build an API that generates the fmr variable, in the form I have shared, from the table structure, so that we can automatically generate the fmr variable and perform the query with it. Thank you. All right, there are no more questions, so I guess we can wrap up the panel. Thank you so much to all the speakers for the wonderful talks.