I'm using Golang and JavaScript, so quite an interesting combination. My PDF is online at this URL, so you can download it later; the link will also be shown at the end, along with my email. I was a systems developer doing kernel development, drivers and platforms for quite a while, then more recently silicon design and silicon verification, and over the last couple of years some front-end web work. I'd say it's interesting, almost enlightening, going from the low level all the way up to high-level JavaScript front-end systems, compared to working only at a single layer. I'm going to talk about one of the projects I've worked on over the last two years. I'll go through the problem statement, the different ways I could have solved it, and how I approached the project design. I'll give a demonstration, take a look at some of the core parts of the code, and then talk about some of the limitations. In our company I was working on a project to expose silicon chip counters. Silicon chips run at high frequency, and they have counters that increment at many megahertz. So how do you count events, and how do you show them live, as a kind of graphing dashboard, so that engineers or customers can see: how is my system working, is it working efficiently? With that much data, how do you capture it efficiently, and how do you graph it in real time? That was the goal here.
Having high rates of data streamed in real time with reasonably low latency presents quite a conflicting problem: the volume of data is hundreds of megabytes, and you only want to stream a matter of kilobytes to your front end. So how do we solve that? Well, cloud services are the future, trust boundaries are established and security is a solved problem, so there are surely no issues with pushing this data to your front end via a service running in the cloud; obviously the solution you want. Now, in terms of choosing a language: there are a few newer languages these days, but in the past you had C, which was more mechanical; there were libraries, but a lot of integration work to use them. With C++ you had more layering; Java likewise has many libraries you can integrate, and so on. As you get to higher-level languages like Python, you end up with slower performance, less scalability, and more memory overhead. So something is missing here, and one key piece is JavaScript: if you offload a lot of the processing onto the front end, data reduction and that kind of thing, you can have quite a responsive, efficient dashboard or interface. Then there are Rust and Go, both from roughly the last ten years, and they've been instrumental in changing how you implement cloud services and web servers, giving you integration across that whole stack. Previously you'd have PHP running under Apache, with separate middleware running alongside, whereas with Go or Rust you can integrate everything, which makes it much more efficient and easier to manage.
So I chose Go. There's now a huge amount of information out there, it's well supported, and you can cross-compile to different architectures. How many people have heard about ARM in the cloud? Good, about half the audience, that's really good. That's the future, and it's happening now: many hosting providers, and many companies making ARM chips with 64 or 128 cores on a single chip. It's going to take off in the next few years. One of the key things is that with Go you can very easily cross-compile to ARM, so you can deploy your services efficiently. In terms of the architecture I used for my application: quite simply, we need two-way communication between the client and the server, so that the server can send data asynchronously back to the client. Because HTTP only delivers data when the client requests it, you need a channel back from the server, so we use WebSockets for that. A key thing about Golang is that it's compiled, unlike Python, so it's quite efficient in CPU usage. It's easy to integrate libraries, either from GitHub or the standard library, and you have rapid build and run times; if anyone has used Boost, C++ build times can run to many minutes. You have concurrency built into the language, which is really nice, plus channels for communicating state and data. Also, everything is built into a single binary, so you don't have a pile of shared libraries to ship. In fact, you can get the equivalent of a Docker-style container: if you run your binary and edit its systemd unit file, you can isolate the network namespace, the PID namespace, and so on.
So you can actually secure and isolate your binary, as it's a single binary. As for JavaScript, it's really mature and very well known, so there's a large pool of talent available; a good choice for this approach. A key step when developing an application like this is to map out your protocol: the messages, probably encoded in JSON, that pass between the client and the server. Once you've worked out the message flow and all the behaviour surrounding it, you can begin to implement. What I did is a very simple handshake for signing on from the client; then I send state, for example any information needed for drawing the UI on the client; and then the client asks which events it wants shown. We then transmit those events at regular intervals, and rendering happens on the client. In terms of the structure of the code, it's broken up into modules in different files, and we have various threads. The main thing is events being read from different sources: in this case our silicon chip, kernel VM counters, and processor counters as well. So when we run a workload on our silicon chip, we can see what the kernel is doing, the page faults, and the load on the processor cores. That data is ingested and written to a memory-mapped file, so you have history information, encoded in a fixed binary format for efficiency. It's also sent as JSON to the HTML clients, via the web server threads; those also serve the static content, so everything is self-hosted.
We also have an SSH-based client, so from a session logged into a server you can monitor these stats and very easily see what's happening on the system. Okay, I'll show a live demonstration. Here I'm logged into a system with a lot of these processors. I'm running htop, which shows 144 cores, so this is one of our larger servers with our silicon chips inside. I'm going to run my NumaScope project, which monitors the cache coherency events in our chips. I'll run that; it's now listening. Not sure if you can all see that, is that clear? Better now, right. It's listening on port 80, and I'm forwarding the port over SSH so I can connect from my browser to the process running here. It's now showing live updates from our silicon. Since this is showing cache coherency events occurring on the CPU interconnect, it's quite hard to explain the meaning of each event, but right now we can see something like 50,000 occurring. At the moment there isn't any workload, as we can see in htop, so we'll run a benchmark. This is the NASA NAS Parallel Benchmarks, which do mathematical operations, and we should see the load climb significantly. We have real-time graphing via D3.js, and now we can see different things happening. We can pause and zoom in, and there are 17 million events happening right now. Meanwhile this is updating down here as well, and we can resume the scrolling. I also want to show the kernel VM stats I'm capturing. The UI is all done in Bootstrap, actually, and it's pretty simple.
Here I'm showing some interesting VM events, for example how many page faults are happening, which tells you when memory is allocated and consumed. I can also select all these different event counters on our silicon chips, so we can get a lot of interesting stats about the chip, for example whether it's working as we expect. Finally: right now it's averaging these over all six of the servers we have, because our chip lets you boot many servers as one big server, and we have six servers here. I'll show how it looks when I disable that averaging, so it's broken down per server; that also explains why there are 144 cores. Now we can see there's some workload imbalance: the workload from this benchmark lands differently on different servers. On two servers it's around two million cache-line reads a second, and on four servers it's only around 30,000. So the benchmark is not utilising all the servers' resources efficiently. Good. Next I'll show a little of the structure of how I serve these files, the JavaScript, the HTML, and the actual Go code as well, so let me go back to the presentation. Okay, let's look at some of the code. We have these different files. Generally it's nice to break things down into files, because then you can manage it: in the Git history you can run git log on a certain file, and it's easier to track changes. I have these two files, interact.js and index.html, which are served by the built-in Go web server, and those are the only content served. That means you can run this application on an internal network that isn't exposed to the internet; no external files or resources are needed. The actual events and the measurement of the samples are handled by these three files here.
There's the events test file, which lets me run go test for automatic verification of the different functions, so it's easier to address bugs. Then I have the live web version and also a sampling version I can run offline, so I can save a trace and load it later; I'll show that as well. The top-level files handle the arguments. I'm going to show you some of the nice things in Go. In main.go we have the argument handling, very simple, just parsing whatever flags you pass, for each of the different modes of the binary. "live" is the mode I was showing with the web interface; "stat" is a client version that shows information in the terminal; and "record" captures all those stats to disk so you can load them later. Here we have the function that starts the HTTP server. In effect, all it does is call the HTTP file server on a certain directory, and say that if the client accesses /monitor, call the monitor function, which handles the WebSockets. The "go" prefix here makes the call to http.ListenAndServe asynchronous, so it's now serving HTTP requests. The monitor function handles incoming connections for the JSON WebSocket. It reads whatever message was written down the socket and compares the string: if it matches the secret key it falls through, and if not it just terminates that connection. After that it sends down a bit of state, a map of the available events, so the UI can build the event lists. On to the connection to, and sampling of, the events: how do we actually access the registers on the silicon chips?
What we do is open /dev/mem and then mmap it. mmap lets you access, from your application, as if from an array, any registers in your silicon; they could be in any part of the system. You then cast this to an array in Go, which is considered unsafe, because you can then access registers you're not meant to. It then checks whether this is really our chip: do the vendor and device IDs match? After that it simply accesses some of the registers. Later it's able to sample the events by accessing this array: it looks at the number of cycles elapsed since it last sampled, reads out each of the events of interest, and normalises them by the clock speed. Those samples are then passed back and marshalled to the client. So what issues did I find? Well, if you're sending data at, say, a thousand hertz over a socket back to the client, there's a lot of congestion. It isn't efficiently packed as JSON, and when the JavaScript parses it, it's doing lots of very small loops, so a lot of work for very little data. The answer is batching: batching these events into blocks, with their timestamps, really helps. Then when the graph is drawn, you scroll it at a lower frequency; it's still smooth enough, but it isn't being redrawn a thousand times a second. Also, clearly, choose a library that's efficient for graphing: D3.js is quite mature, whereas some of the others aren't as mature and can't handle millions of points.
One thing I still have to implement in the future is loading in chunks, because if I load a really large trace file it jams the main loop for many seconds and blocks the rendering thread. In fact, I'll just show you loading traces. Here we can simply access some traces I have. This one has an issue, actually, but we also summarise various stats, so we can see how many events occurred and the rate of events, and normally you'd be able to zoom around here and activate and deactivate different traces. All of the events are captured on the chip, so you can really work out what you want to see. Also, when I'm in live mode, you can vary the sample rate on this slider, and that's why, when you're sampling at high frequency, say every few milliseconds, you really must batch the updates. Finally, I'll demonstrate the CLI-based output, because that's quite useful as well. Here it's running a default set of events. Page faults, at the end here, shows how many pages are being used by the kernel across all running applications, and then these events, victim block sends and so on, are all cache coherency events. If I run my workload, we'll see those spike; let's just resize this. There we go: they went from around 50,000 to around 1.3 or 2 million. And finally I can record this for later analysis, so I'll do a quick capture and then load it into the UI so you can see. Okay, I've copied it across; now I can just load it and see what's going on.
Okay, so now we can analyse the events that occurred, and here I can average across all servers; if I load it again, that will make it easier to read. Good, we can see something interesting here: the number of wait cycles, if I zoom in a little. On average, each server spent probably about 80% of its cycles waiting for resources on our interconnect. So in this case we can see that during this benchmark our interconnect is slow and blocking, reducing the throughput of the execution. Good, so that wraps it up. This is all published on GitHub, along with the history, so I'll show you that. It has all the documentation as well, so you can clone it and build it. You have to run it as root, because it accesses and maps the chip counters, and it also tells you how to run it. Thank you.