My name is Anthony Driscoll and I'm from the University of Washington. I'm going to talk about BlackParrot: we're announcing our new agile, open-source RISC-V multicore, and we're positioning it as a host core for accelerator SoCs. This work is a collaboration between the University of Washington and Boston University.

The motivation here is that traditional hardware design methodology usually relies on proprietary tooling and proprietary IP libraries that don't promote reuse and are not accessible to hobbyists or researchers. At the same time, new open-source EDA tools and open-source IP libraries are providing a way for hobbyists and researchers to participate in this ecosystem. And it's no coincidence that this is happening in an era where emerging workloads demand customization: accelerators let us continue making performance gains now that traditional transistor scaling has petered out.

But while custom accelerators are the secret sauce of these new systems, they don't stand alone. You need all sorts of components, like the uncore, I/O, and chip testing capabilities, and you would have to maintain all of that infrastructure just to develop a small accelerator for one specific task. Getting any of these components wrong can completely destroy your acceleration gains. So general-purpose host cores are needed to coordinate all of these accelerators, all these moving parts, all these uncore systems.

When you're adding a host core to a system, there are several considerations. It needs to be efficient, performant, and reliable: able to do what you want within your power and performance envelope. It needs to be able to control a diverse set of accelerators, with both standard and custom interfaces. It needs to be trustable: it should use standard design principles, standard design methodologies, and standard design languages, so that people can inspect it and know that it's quality. It needs to be extensively verified: if you have to redo design and verification, you lose all the benefits of reuse. And last, it needs to be free and open source. You should be able to play around with it and figure out whether it fits your system without purchasing a license or signing a stack of NDAs. Ideally, you end up just being able to click in an accelerator.

To fill this niche, we developed BlackParrot. We call it a base class for accelerator SoCs because we envision people taking the code, extending it to their own circumstances, and then contributing back so that other people can reuse it. We've designed the architecture to be flexible enough to allow for this extensibility, and we've released the code under a permissive BSD license. But we're not trying to make BlackParrot the biggest or highest-performance core on the market; that's not our goal. We want BlackParrot to be small, efficient, and unobtrusive, so that it can go into as many systems as possible without restricting performance or dominating area.

On our route to going global, we monitor our progress along four distinct axes. First is quality: will people trust the code, and will they be able to justify putting it into silicon? Second is ubiquity: will it fit into all use cases, so people can use it everywhere, with any accelerator? Third is completeness: does BlackParrot have everything that it needs?
Can it run all the general-purpose code and host all the kinds of accelerators that you want? Just as importantly, does it have a minimal set of features? Unused functionality in silicon takes up power and area, and most importantly it adds complexity, which increases verification costs. And fourth, is it efficient: does it do what you want within the power and performance envelope that you need?

From these axes come the design principles we've settled on for BlackParrot; whenever we consider adding a new feature, we weigh it against these three factors. Be tiny: we want a small code base and a small design, so that BlackParrot fits in as many places as possible, and so that people can wrap their heads around the code and find issues without facing an overwhelming code base. As part of that, we reuse open-source libraries like BaseJump STL, Google's riscv-dv, and Berkeley HardFloat: components that are well verified and well documented, so people can come in with their own experience and already be familiar with the BlackParrot code base. Be modular: we want people to be able to modify one small part of the code and make their own changes without having to re-verify other parts of the system, avoiding cascading effects that are extremely difficult to debug. And be friendly: as engineers, we all want to reinvent the wheel because it's super fun, but it makes more sense to reuse libraries and code that already exist, and to avoid not-invented-here syndrome as much as possible.

That sets the stage for why we developed BlackParrot and what our design goals and methodology are. Now let's talk about the system architecture, in particular the uncore components that make it easy to integrate accelerators into the system. I want to note that we're not trying to build an SoC generator. Our role is to provide a reference SoC methodology and a set of building blocks that people can use to plug their own accelerators into a BlackParrot SoC, or to integrate BlackParrot SoC components into their own systems.

A BlackParrot system comprises three networks: the BedRock network, which manages coherence transactions; the I/O network, which manages communication with peripherals; and the memory network, which manages DRAM communication. These are all connected via wormhole routers, which are small, efficient, and fairly easy to verify, and we use regularized tiles to make sure the design fits into a normal hierarchical physical design flow.

I mentioned BedRock: this is our fully cache-coherent multicore infrastructure, which provides traditional MESI-style coherence using a novel protocol. The protocol divides the system into two kinds of components: local cache engines, or LCEs, and cache coherence engines, or CCEs. The interesting thing is that the CCEs hold all of the actual protocol knowledge. Local cache engines only respond to very simple messages and keep limited shadow state rather than knowledge of the entire system. By doing this, we can eliminate the transient states of the protocol, which makes it vastly simpler to verify.
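To make that split concrete, here is a minimal sketch of what message types in an LCE/CCE-style protocol could look like. The enum and struct names, field widths, and encodings here are illustrative assumptions, not BedRock's actual interface definitions; the point is that the LCE-facing messages are simple commands, while all protocol decisions live in the CCE.

```systemverilog
// Illustrative sketch of an LCE/CCE message split, in the spirit of
// BedRock; names and encodings are hypothetical, not BlackParrot's
// actual interface definitions.

// Requests an LCE sends to its CCE on a cache miss.
typedef enum logic [1:0] {
  e_lce_req_rd   // read miss: ask the CCE for a readable copy
  , e_lce_req_wr // write miss: ask the CCE for a writable copy
} lce_req_type_e;

// Simple commands the CCE sends to LCEs. The LCE just obeys; it keeps
// no protocol state machine of its own, which is what removes transient
// states from the LCE side.
typedef enum logic [2:0] {
  e_cce_cmd_data         // fill data plus the coherence state to install
  , e_cce_cmd_set_state  // change the state of a cached block (e.g., downgrade)
  , e_cce_cmd_invalidate // invalidate a cached block
  , e_cce_cmd_writeback  // write a dirty block back to memory
} cce_cmd_type_e;

typedef struct packed {
  lce_req_type_e msg_type;
  logic [39:0]   paddr;   // physical address of the block
  logic [3:0]    lce_id;  // which LCE is asking
} lce_req_s;
```

Because the LCE vocabulary is this small, any component that can speak these few messages can participate in coherence, which is what the tile types below exploit.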
Now let's take a look at a sample BlackParrot SoC: a reference design with all the bells and whistles you can build into one. BlackParrot tiles fall into four distinct categories. Core tiles are the most basic type: a single coherent core of a BlackParrot multicore, along with the uncore components needed to communicate with the rest of the system. Streaming accelerator, or I/O, tiles do not cache coherent memory; instead they stream data to and from the cores. This could be something like a GPU or a fixed-function accelerator with a DMA engine. There are also coherent accelerator tiles, which do fine-grained communication between the cores and the accelerator through cache-coherent memory. This is useful for cooperative workloads, and it's one of the interesting parts of BlackParrot: cores and accelerators can communicate at very fine granularity, which opens up a whole class of SoCs that people can develop. And last, L2 extension tiles let us extend the memory throughout the system: they extend the distributed L2 and provide more directory tags and L2 cache, which lets you tune your memory-to-compute ratio for a specific application.

Now let's zoom into a BlackParrot core tile and look at the components that make integration easy. Core tiles have essentially everything you need, and you'll see many of these components repeated in the other tile types. First, there are BaseJump STL wormhole routers. This leverages an open-source library, and since we don't maintain the router code ourselves, our code base stays lean. There are also wormhole concentrators, which allow all the components on a tile to talk through one router. There's a CCE, which manages the directory tags and coherence information for one slice of the system's memory: all the cores request data in that slice from this one CCE, which handles all of the protocol. There are BlackParrot's tightly integrated L1 caches and a slice of the distributed L2 cache, which provide fast on-tile memory. And last is the actual BlackParrot core logic, which is the only custom logic in the tile; everything else is reused across the other tile types. In fact, if anyone has their own RISC-V core, please come talk to me after this: we'd love to integrate more cores into the BlackParrot family.

Now that we've looked at the uncore components of a BlackParrot core tile, let's look at how the other tile types build on it. L2 extension tiles keep the coherence directory tags and the standard CCE, but pack the rest of the tile area with L2 cache; you basically rip out the core and reuse the rest of the tile. Coherent accelerator tiles have a standard LCE but a simpler version of the CCE, which only manages configuration data, scratchpad memory, and local on-tile memory exposed to the rest of the system. Streaming accelerator tiles reuse that same minimal CCE but have no need for a coherent LCE; instead, they communicate over the I/O network, or simply stream results back and forth with the rest of the system.

Even among these accelerator tiles, BlackParrot supports several different levels of integration. Because all of the true coherence state is managed by the CCE, LCEs fronting different types of components can be integrated seamlessly without writing any custom coherence logic. A streaming accelerator only needs to generate memory requests and responses, which can be carried over any standard memory protocol. To attach a coherent accelerator, the designer can reuse BlackParrot's LCE and data cache, and only write custom logic for the accelerator's fixed-function datapath. If a designer wants more control over the implementation, they can use a completely custom cache and still reuse BlackParrot's LCE, by implementing the standard cache service interface that we've defined. And lastly, designers are free to implement fully custom accelerators that obey the simple LCE functionality required by the BedRock protocol; this option offers the most flexibility, at the cost of longer development time. The nice thing about this modularity is that you can start with your custom accelerator and standard BlackParrot components, then go down the stack one step at a time, customizing for the best performance while maintaining testability at each step.
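As a concrete illustration of the lightest-weight integration level, here is a hypothetical skeleton of a streaming accelerator that only issues simple memory requests and consumes responses over ready/valid ports. The module and port names are assumptions for illustration, not BlackParrot's actual accelerator interface.

```systemverilog
// Hypothetical streaming-accelerator skeleton: it speaks only plain
// memory request/response, so it can be bridged to any standard memory
// protocol without touching coherence logic. Names are illustrative.
module stream_accel_skeleton
  #(parameter addr_width_p = 40
    , parameter data_width_p = 64)
   (input  logic                        clk_i
    , input  logic                      reset_i

    // memory request out: address plus write flag/data
    , output logic [addr_width_p-1:0]   req_addr_o
    , output logic                      req_we_o
    , output logic [data_width_p-1:0]   req_data_o
    , output logic                      req_v_o
    , input  logic                      req_ready_i

    // memory response in
    , input  logic [data_width_p-1:0]   resp_data_i
    , input  logic                      resp_v_i
    , output logic                      resp_ready_o
    );

  // The fixed-function datapath would live here, e.g., streaming an
  // input buffer through a compute kernel and writing results back.
  // Tie off the interface by default; real accelerator logic would
  // drive these signals.
  assign req_addr_o   = '0;
  assign req_we_o     = 1'b0;
  assign req_data_o   = '0;
  assign req_v_o      = 1'b0;
  assign resp_ready_o = 1'b1;

endmodule
```

The deeper integration levels replace this simple request/response port with BlackParrot's cache service interface or a full LCE, trading development effort for tighter coupling.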
So now that we've looked at the system architecture, we'll dive a little into the BlackParrot core architecture. This part of the design is more fluid: this is only the first version of the BlackParrot core. Because we've defined standard interfaces, you can seamlessly swap these core components in and out without any impact on the rest of the system. The BlackParrot core itself is a simple in-order pipeline with full forwarding, an FPU, and a non-blocking, non-stalling backend. Interestingly, the directory controller is not a standard FSM-based directory controller like you'd find in most MESI-based systems. We instead have a programmable cache coherence engine, which uses a custom RISC instruction set to add functionality. This gives you the flexibility to program in different handling of coherence traffic, for example for security, extra functionality, or debugging, even after you've taped out silicon.

And all these building blocks, as I've alluded to, have standard latency-insensitive interfaces. We spent a lot of time making sure these interfaces admit a wide range of implementations on either side without impacting the functionality or timing of other blocks. This is key to BlackParrot's scalable development model: developers around the world are able to modify different components independently and still integrate them together.
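To make the latency-insensitive idea concrete, here is a minimal sketch of a ready/valid handshake stage implemented as a two-element "skid" buffer: either side may stall arbitrarily, and neither side needs to know the other's timing. BaseJump STL provides a hardened equivalent (its two-element FIFO, bsg_two_fifo); this standalone version is for illustration only.

```systemverilog
// Minimal latency-insensitive ready/valid stage: a two-entry FIFO that
// fully decouples producer and consumer timing. Illustrative sketch,
// not BlackParrot's actual code.
module handshake_stage
  #(parameter width_p = 64)
   (input  logic                 clk_i
    , input  logic               reset_i

    , input  logic [width_p-1:0] data_i
    , input  logic               v_i       // producer asserts valid
    , output logic               ready_o   // stage can accept

    , output logic [width_p-1:0] data_o
    , output logic               v_o       // stage asserts valid
    , input  logic               ready_i   // consumer can accept
    );

  logic [1:0][width_p-1:0] fifo_r;
  logic [1:0] valid_r;
  logic wptr_r, rptr_r;

  assign ready_o = ~valid_r[wptr_r];  // accept while write slot is empty
  assign v_o     = valid_r[rptr_r];
  assign data_o  = fifo_r[rptr_r];

  always_ff @(posedge clk_i) begin
    if (reset_i) begin
      valid_r <= '0;
      wptr_r  <= '0;
      rptr_r  <= '0;
    end else begin
      if (v_i & ready_o) begin            // enqueue on input handshake
        fifo_r[wptr_r]  <= data_i;
        valid_r[wptr_r] <= 1'b1;
        wptr_r          <= ~wptr_r;
      end
      if (v_o & ready_i) begin            // dequeue on output handshake
        valid_r[rptr_r] <= 1'b0;
        rptr_r          <= ~rptr_r;
      end
    end
  end

endmodule
```

Because ready_o depends only on local occupancy, never on the consumer's ready_i, two such stages can be chained without creating combinational loops, which is what lets independently developed blocks compose.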
So now we've talked about BlackParrot's design philosophy and system architecture; next, let's talk about how we're building a flock of users and developers around the project. First, we have our modular, interface-compliance-based test suite. This is driven by the insight that test benches are expensive, so developers simulate small components of the system and use standard interface compliance tests to check that their modules comply with the standard interfaces. This testing structure makes BlackParrot much more tractable for distributed verification.
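To illustrate what such an interface compliance check can look like, here is a sketch of generic ready/valid handshake assertions that a component testbench might bind onto a module's ports. These are standard handshake rules shown for illustration; they are not BlackParrot's actual compliance suite.

```systemverilog
// Generic ready/valid compliance checker: once a transfer is offered,
// it must be held stable until accepted. Illustrative sketch only.
module ready_valid_checker
  #(parameter width_p = 64)
   (input logic                 clk_i
    , input logic               reset_i
    , input logic [width_p-1:0] data_i
    , input logic               v_i
    , input logic               ready_i
    );

  // Valid must stay asserted until the consumer accepts.
  property p_valid_stable;
    @(posedge clk_i) disable iff (reset_i)
      (v_i & ~ready_i) |=> v_i;
  endproperty

  // Data must not change while a transfer is pending.
  property p_data_stable;
    @(posedge clk_i) disable iff (reset_i)
      (v_i & ~ready_i) |=> $stable(data_i);
  endproperty

  assert property (p_valid_stable)
    else $error("valid dropped before handshake completed");
  assert property (p_data_stable)
    else $error("data changed while transfer was pending");

endmodule

// Example: bind the checker onto a ready/valid port pair of the
// hypothetical accelerator skeleton shown earlier:
//   bind stream_accel_skeleton ready_valid_checker #(.width_p(64))
//     req_check (.clk_i, .reset_i, .data_i(req_data_o),
//                .v_i(req_v_o), .ready_i(req_ready_i));
```

Binding the same checker onto every implementation of an interface is what lets a small component be verified in isolation and still be trusted when dropped into the full system.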
We're trying to develop BlackParrot as an example of community-driven microarchitecture. Just as Linux has global stewardship and a distributed developer base, we want BlackParrot to be a reliable community asset. The first method is recognizing that, ideally, BlackParrot's users will become its developers. We want to make it as easy as possible for someone to go from downloading BlackParrot to making their first pull request. People shouldn't need intimate architectural knowledge of the whole system to improve their small area of expertise: if you have a TLB expert, you don't want to require him or her to understand the branch predictor in order to modify the TLB. Second is building a development infrastructure based on freely available tools. This lowers the barrier to entry so that hobbyists and researchers are able to get the system up and running. And last is focusing on the out-of-the-box experience. Users should be able to go from GitHub to simulation, or even to ASIC, in a matter of a few clicks, without having to debug tool chains or install proprietary software.

We've successfully run BlackParrot on Genesys 2 FPGAs. We're also diligently working to get BlackParrot into LiteX, an open-source, Python-based FPGA design environment. This lets people run BlackParrot on their own FPGAs, with all the peripherals accessible to BlackParrot, while we don't have to maintain that code in our central infrastructure. We're very excited about this, because we don't want to maintain FPGA infrastructure ourselves, but we would love to get BlackParrot into the hands of all the FPGA users in the world.

BlackParrot has also been taped out in an advanced 12-nanometer process, with many more tapeouts on the way. It's a process-portable design which has been validated in several different technologies. And not only is BlackParrot's RTL open source, we've also open-sourced the tapeout directories that we use for taping out our chips, at github.com/black-parrot-examples. You can go today and look at our synthesis scripts, timing constraints, and memory configuration information, and all our tapeouts, past and future, will be hosted there. We're also hoping that anyone who tapes out BlackParrot outside of our group contributes to that repo, so that we can share insights on how to get good quality of results for a performant SoC. We also participate in the OpenROAD project, a free and open-source, 24-hour, push-button RTL-to-GDS flow being developed right now. This lets you use OpenROAD's FreePDK45 flow to push BlackParrot all the way through an open-source CAD flow today.

This is a picture of BlackParrot's Genesis release. We say Genesis release because we view our role as bootstrapping a development movement, rather than permanently remaining the stewards of the code ourselves. We don't want to throw BlackParrot over the fence and have people fork it, adapt it to their own needs, and never contribute back, ending up with a fragmented ecosystem. We want a Linux-style development model, where everyone works on the same code base and people can parameterize it enough to fit their own specific use cases.

So I present BlackParrot, a Linux-capable RISC-V multicore. It's silicon-validated and ready to be included in your next accelerator project. If you want to help build up the BlackParrot user base, please use it and explore it, break things, raise issues, and submit pull requests. We're looking for particular help with porting benchmarks; we need a lot of person power on the software side. Ports to various FPGAs and different development environments are also interesting to us. We'd love to hear about bugs and get help debugging, and we'd love to collaborate at whatever level of the system you work at. Please come talk to me if you're interested in working with us. Together, we can make BlackParrot the default host core choice for accelerator SoCs. Thank you.

[Moderator] You have about a minute for questions.

[Audience] How many contributors have you had outside of the two universities?

[Speaker] Let me repeat the question, since we don't have an audience microphone: how many contributors have we had outside of the two universities? BlackParrot is developed collaboratively with BaseJump STL, and we have many outside contributors to BaseJump STL; issues that were raised in BlackParrot have been fixed there. For BlackParrot itself, we're still trying to build up that community.