Hello, everyone. I hope you've all been enjoying OSS Summit, and I'd like to welcome you to my talk, "From the Ground Up: How We Built the Nanos Unikernel." My name is Will Jhun. My background is in systems programming, and I work for a startup called NanoVMs; I'm one of the developers responsible for building Nanos.

While the thrust of this talk is about kernel development, we'll first present a general overview of unikernels as a concept and then talk about what we're trying to achieve with Nanos. I'll do a demo of Nanos and our staging tool, Ops, and show you how easy it is to build a Nanos image with an arbitrary Linux ELF binary application. We'll show you how to run this image on your own host under QEMU, as well as spin up an instance on a cloud platform.

The bulk of the talk will cover the internals of Nanos, with a special focus on the composable nature of our kernel code. I'm going to cover aspects of the Nanos code that are atypical of kernel programming in C, where some constructs are borrowed from functional and more modern languages, but with a purpose. We avoided basing Nanos on any existing kernel infrastructure, which has allowed us to employ such constructs to compose complex operations from simpler ones. We'll take a look at how such composability has guided our development of the kernel. We'll also take a look at some current developments in the area of management, particularly an evolving interface to Nanos's hierarchical, tuple-based configuration and management store.

From a usability perspective, Nanos runs an expansive set of common services and language runtime environments, with many available as pre-built packages. With the aid of Ops, Nanos can be easily deployed on a wide range of hypervisors and cloud platforms.

So what are unikernels anyway?
There are some very well-researched talks around the web aimed at answering this question, and this is not going to be one of them, but I'll give some exposition here for those of you who are new to the concept. We'll talk about some of the practical problems in deploying existing unikernel technologies and show where Nanos comes into the picture.

A unikernel is a type of OS kernel that runs a single application. While mainstream general-purpose operating systems support multiple running processes (each with its own address space), multiple users, and even separate namespaces for partitioning a system into containers, a unikernel supports just a single process and program address space. This greatly simplifies the kernel as well as the user space environment, and in doing so, the so-called attack surface, or the amount of possibly vulnerable code, is reduced. A unikernel image is purpose-built and ideally contains only the components necessary to run the program with which it was staged.

Okay, so why might someone be interested in running their application on a unikernel? Some of the reasons most commonly cited are that unikernels boot almost instantaneously, with startup times typically measured in tens of milliseconds. With only a single process to run and minimal OS facilities to support it, it's not difficult to see why unikernels start so quickly. And since they contain a single process, without a shell or other utilities, there is much less code available to exploit on a compromised instance. But it's important to note that many unikernels don't have any privilege separation between the application and kernel code, since the application and kernel are built together in a single image running with full kernel privileges. The argument here is something like: well, if you exploit the application, you can control the whole VM instance, but so what? There's nothing else in the instance besides the application.
But it's really not that simple, and we'll discuss some of the debate on this matter in a moment.

Unikernels are small. Unikernels that take the library OS approach can be very small, as the application links directly against the kernel components it needs, leaving everything else out. Nanos doesn't take this approach, but it still has a very small footprint: if I take a mainline Nanos kernel build today and strip it of debug symbols and debug info, it comes out to about 750 kilobytes. And that's with all the features built in, including platform and PV drivers supporting hypervisors like KVM, Xen, ESXi, Firecracker, and Hyper-V. So we could whittle it down even more for a given deployment if needed.

Now, given their small size, a single box can in theory run hundreds or thousands of unikernel instances. Existing cloud providers and hypervisors on the market aren't really designed to scale in this manner, but they could eventually, especially if they're written with unikernels in mind. Such scalability could mean that unikernels have the potential to displace containers in the future, without the risks inherent in relying on kernel namespaces for isolation.

Unikernels are a nascent technology. While some academic projects from the late 90s started to experiment with merging kernel functions directly with a single application, unikernels have only recently found a home in virtualized environments in the cloud, where microservices are deployed and mainstream OS features like multiple running processes and multiple-user support can be jettisoned without consequence. Unikernels are increasingly being used to isolate services in edge and embedded applications as well; this is spurred by the emergence of hardware virtualization features on systems-on-chip, such as those based on AArch64. We're exploring these applications with Nanos too, but for the purposes of this talk, we'll focus on what has been the primary target for Nanos: services in the
cloud. So what sets Nanos apart in the field of unikernels? One of the aims in developing Nanos is to lower the barrier to entry for deploying common services in the cloud. Many unikernels either have limited support for languages, require that applications be explicitly targeted or ported to the unikernel's unique environment, or are POSIX source-compatible but require extensive work to compile a large application. Nanos is one of a few unikernels that aim to offer binary-level compatibility with Linux. We support Linux ELF program binaries, in many cases straight from packages of major Linux distributions like Debian, and we use standard, unmodified builds of glibc and other shared libraries. There are times when an application requires some patching to run under Nanos, for instance where a call to fork could be eliminated or replaced with the use of pthreads. But keep in mind that it is not our goal to run every possible application or to have complete feature parity with Linux; we're focusing on running a core set of services in the cloud really well.

Our orchestration tool, Ops, aims to ease deployment. Ops can take an arbitrary Linux ELF binary, extract the shared libraries it depends on from the host environment, package it up as a bootable disk image, and run it either under QEMU locally or deploy it to a cloud provider. For convenience, Ops offers pre-built packages of common services and language runtimes. These can be used to quickly spin up an instance with only your source code and/or config files, bypassing the need to stage the application or build it yourself.

Unlike unikernels that focus on specific language support, we concentrate on binary compatibility, and as such support virtually any compiled language out of the box. As for language runtimes: Python, Perl, Node, Ruby, Java, PHP, Clojure, R, Elixir, and Lua are just some of the languages known to work under Nanos. Cloud platforms currently supported by Ops are AWS EC2 (and Firecracker), Azure, Google Cloud
Platform, OpenStack, vSphere, and Vultr; and, despite limited support in Ops, Nanos images may also be deployed on Kubernetes and DigitalOcean. Of course, Nanos is open-source software, released under the Apache 2.0 license. We currently only support 64-bit x86 binaries, but 64-bit Arm support will be arriving soon. Ops and the Nanos build environment work under Linux and macOS, with Windows support for Ops currently in the works.

Okay, before we move on to a demo of Ops and Nanos, I'd like to talk about our approach to security. In addition to forgoing the library OS model in favor of providing a compatible binary interface, we stray from the purest sense of a unikernel, if you will, by implementing standard security features typical of more mainstream OSes. For one, we employ kernel/user privilege separation, and we use standard page protections like read-only and no-execute. We default to using address-space randomization, at least for user space mappings; this is similar to ASLR in Linux. We also have an optional hardening feature called exec protection, which will only allow execution of pages mapped from the program binary and from files explicitly marked as executable in the file system, like shared libraries; such files cannot be written to under any circumstances. For programs that do not use JIT or otherwise generate code pages, this can help lock down a Nanos instance by making executable pages immutable should the application be compromised.

As I alluded to earlier, some people debate the use of such security measures in a unikernel, arguing that the greatly limited attack surface justifies running all code at kernel level, and that kernel/user separation comes at the cost of an expensive context switch on a system call. Keep in mind that, for the latter concern, a syscall context switch is not nearly as expensive as a context switch between processes, which involves changing address spaces. And using SYSCALL and SYSRET on x86-64 is even
much lighter (less than a quarter of the clock cycles, according to the AMD64 systems manual) than using traditional software interrupts and IRET. And while a SYSCALL is no doubt more expensive than a direct call, the practical cost of handling syscalls must be considered in context and weighed against the cost of forgoing kernel-based security. Another factor to consider is that system call interfaces are evolving, as with the introduction of io_uring in Linux (also supported by Nanos), which amortizes the latency of the system call over multiple queued requests. The performance question is still a matter of debate, but we're really not convinced that the speedup of a trap-free system call is worth the loss of protection from forgoing privilege separation.

So let's get back to security matters. It's outside the scope of this talk to debate the security risks of using unikernels, but there is a very compelling paper and talk produced last year by NCC Group, titled "Assessing Unikernel Security," by Spencer Michaels and Jeff Dileo. They took a look at a number of unikernels that don't implement the aforementioned features and demonstrated how they can be handily exploited to execute arbitrary code at kernel level. Lacking privilege separation means that virtually no useful security measures can be implemented in a kernel at all; attempting to lock down pages or implement exec protection would be entirely pointless if the application can run with kernel privileges. Keep in mind that for these unikernels lacking privilege separation, a compromised user application has full access to paravirtualized devices, including the network interface, and can do things like send raw packet frames. Hypervisors would need to have strong capability restrictions in place to limit damage from such compromised VMs.

So you should all now have some basic background on the concepts and major issues surrounding the use of unikernels, and we've talked about the goals of Nanos in making
unikernel deployments easy and practical. But that's just talk, so let's see it in action with a demonstration.

Let's start off with a quick little demo of Ops using a pre-built package. First we'll visit the website ops.city, which is running on Nanos, and snarf this little install script to get Ops up and going. Here we go. Okay, now Ops is installed in .ops in our home directory, and a search path has been added to our .profile. So let's open another login shell so we'll have it in our path, and we can just run ops to get some usage. Now we can see the package-related commands, so we'll check those out, and now let's get a list of packages. Okay, so there's a bunch of packages available here to make it easy to get something going.

Let's do a quick demo of Ops staging a Nanos unikernel using a pre-built nginx package. We'll just use nginx with the default configuration, without even staging any content to serve. So we're telling Ops to load the package and to use port 8083. Okay, that was fast, and that even included downloading the package and the latest Nanos release. So let's go back to Safari and see if we can hit the page here. Okay, so it works, and yes, it's supposed to be an error page. That's not too interesting, but how about we add a static page? Okay, so here I have a simple little static page, and here's an Ops configuration file in JSON which simply tells Ops to stage the page along with nginx; we're still using the default nginx configuration. Okay, great. So that's a very simple test.
It shows you how to add a file to the image. Now, just to reveal what's actually happening under the hood, let's run it in verbose mode so you can see the QEMU command line, and let's have it dump out the manifest file too, so we can see how Nanos is being configured. Then we'll just kill it, since we're only interested in the output from Ops. Okay. Note the netdev user command line option: this tells QEMU to enable user-mode networking, which is handy for development because, unlike using a tap device and bridging, it requires no special privileges. And here we can see the manifest file, which is constructed by Ops and fed to the Nanos staging process to tell it where to find nginx, shared libraries, and configuration files; plus it has the arguments and environment variables for the program. We'll talk about this file a bit later in the section on management. So as you can see, using Ops to stage and run a unikernel on your local machine could hardly be any easier.

Okay, let's try running a little Go-based web server. We'll test it locally, and then we'll run it on Google Cloud. And just to make things a little more interesting, we're going to build our own Nanos kernel and use that instead of the release binary. Okay, the kernel is built now.
So let's build a simple Go web server and run that. Okay, that's the Go web server, and we're going to use Ops to run a local ELF binary rather than the package build, but with a config file that tells Ops to use the MBR bootloader, kernel, and mkfs tool from our own build. Okay, and there you go; you can see that it loads just fine.

Okay, now let's create an image using our server binary on Google Cloud Platform. We've already set up our Google Cloud credentials, so we can just use Ops to create the image for us. Okay, this will take a little bit, and then after that we're going to create and start an instance. Feel free to twiddle your thumbs here; this takes a little bit of time, but it's not too painful (AWS EC2 takes quite a bit longer). So we're creating the instance here, specifying our project name. As you see, the JSON above for Ops has some of the cloud config stored there, but actually I think we're being a little redundant in specifying some of those options again.

Okay, so now we're back in the Google Cloud console. We're trying to find the instance, because we need to edit it and add a network tag to open up port 8080; we have that saved as the nanos tag. So we click edit, scroll down, and here we go, the tags; then we add nanos and save it. Okay, so now we're ready to test out the instance by sending a request to the public IP. Okay, and there it is.

Okay, so now let's have a little fun and do a speed test. I have two Debian instances running on the same subnet, so we can use one of them to run ApacheBench against the Go web server on Nanos, and we can run the same server binary on the other Debian instance and profile that too. The Nanos image was provisioned using a g1-small machine type, whereas the Debian instances are n1-standard-1, so I stopped the Nanos instance to change the machine type so that they'd all be the same. You can see here that they're all
the same. Okay, and here we're just checking the server on Nanos again. Then we're going to grab the internal subnet address and switch over to one of the Debian instances. Here we go, so that we can run ApacheBench. Okay, so we're going to do 1,000 requests, 100 at a time. Okay. So we're getting just under 16,000 requests per second, with a total transfer rate of just over 2 megabytes per second.

So let's switch to the other Debian box and run the Go web server there. Getting the internal subnet IP; okay, we're running the server. Okay, we started the server; let's just make a request to check it and make sure we have our ARPs in place and everything. Okay, here we go. So now for the same test. Okay, so the same server binary, running under Debian on the same machine type, is getting around 12,500 requests per second, with a transfer rate of just over 1.7 megabytes per second. There's nothing else running on these Debian instances, and I started them up just for this test. So while this is purely anecdotal and not any kind of serious benchmark, it is interesting to see that this web server is running over 25 percent faster on Nanos than under Debian for an equivalent instance type. So I think it's fair to say, at least in this case, that Nanos is pretty fast.

Okay, so now you have a sense of what unikernels are about, and you've seen Nanos in action. So now let's get to the part of the talk that is most interesting to me, and that is the internals of the kernel. Let me just say up front that there is a lot of material here and I'm going to move at a pretty fast pace. There's only so much I can cover in the time we have, and if you're really interested, you're probably going to check out the source tree anyway. So I'm going to skip over any rote enumeration of the various parts of the kernel, and instead talk about some of the design decisions we made, with emphasis on what I personally find most interesting.
Now, my hope here is that the aspects of the kernel that are most unusual or compelling to me will be thought-provoking for you as well. Some unikernels, like OSv and rumprun, borrow heavily from more established OS kernels, while yet others go with a clean-slate approach. When I joined NanoVMs, development was already headed down the latter path. At that time the kernel was entirely the work of a single developer, Eric Hoffman. Eric's code was quite different from any code I had seen before, particularly kernel code written in C. I quickly recognized that certain decisions were made with specific intentions, and that the runtime environment was designed in such a way that would aid composability and minimize redundant housekeeping. So I'd like to place a special focus on these aspects that have informed Nanos development, and it may be that a deep dive into these nuances will be most useful to developers who are beginning to work on the Nanos kernel itself.

The Nanos kernel is built on a foundation which we call the runtime. As we are self-contained and running on bare metal, we need some kind of runtime library for formatting output, allocating memory, managing timers, and things like that. We also need implementations of basic data structures. Most of the data structures we have here are quite ordinary: we have a linked list type, resizable buffers and vectors, a priority queue (which is used for timers), a lock-free queue, a left-leaning red-black tree, a hash table, a symbol table, and a tuple type. So we'll spend our time here focusing on aspects of the foundation that are less common, particularly within C kernel code.

Okay, let's first consider a primary function of any runtime system: memory allocation. Now, if I were given the task of building a memory allocator for a new kernel, I might just stick to the interface that is most familiar to me: malloc and free. That's what we're accustomed to using in user space with libc, and that's what the other major OS kernels use. You
know, Linux uses kmalloc and kfree, and FreeBSD uses malloc and free; each has more than 4,000 calls to allocate alone, not including other interfaces for getting cached objects. But this interface is rather inflexible, and there's a great deal of complexity hidden behind it. Many times we just need a place to stash a struct, but in a kernel we often have special constraints that need consideration. kmalloc in Linux may be invoked with dozens of flags and modifiers to convey special constraints, such as to specify a zone, whether user memory might be DMA'd to, or even whether memory should be zeroed after being allocated. So why hide the handling of such constraints behind a single door? And if this weren't complicated enough, Linux and FreeBSD end up creating more allocate and free interfaces to manage cached objects of a given type, and these are global as well. Now, to be fair, this isn't to say that such interfaces are unmanageable; they're the result of years of evolution in very large systems, and it isn't trivial to change many thousands of invocations over to a new interface. But could we address the general problem of memory allocation in a more flexible way?
In Nanos, our allocators are called heaps. We have an abstract heap base type with familiar allocate and deallocate methods. When we create an instance of some data structure, initialize a subsystem of the kernel, or invoke a probe function for a device, we specify the allocator or allocators to use as arguments. Perhaps most allocations are used for internal data structures with no special constraints needed; these are typically covered by a general heap parameter. Another type of allocation is for physically contiguous memory; such heaps are usually just called contiguous.

Now let's think about what allocation is in an abstract sense. Allocating is taking some amount of an available resource and apportioning it or displacing it. But at the lowest levels in the kernel, we aren't really thinking about carving up memory as a tangible resource; sometimes for us it means carving up number space. When allocating, we simply try to find a range of numbers to satisfy the requested size and constraints. And it so happens that we're carving up numbers for all kinds of purposes in the kernel: there are disk blocks, file descriptors, thread IDs, interrupt vectors, and so on. All of these number spaces are managed in similar ways, primarily via allocate and deallocate methods. So why should our allocators be written for memory allocations alone?
Okay, so let's take a look at a concrete example of a heap in Nanos: the ID heap. The ID heap is a general-purpose allocator for a number space; it serves and manages allocations from a pool of number ranges. It extends the base heap interface to allow the setting of constraints or attributes, like randomized allocation (which we use to implement ASLR), first-fit or next-fit, or allocating only from a given subrange. It can draw from a parent heap when more space is needed. In the Nanos kernel, virtual address space is allocated in 4-gigabyte chunks from the virtual huge ID heap, the ranges for which are initialized statically according to architecturally defined limits. Page-sized allocations of virtual address space are made from the virtual page heap, which in turn draws from the 4-gigabyte-sized pages served by the virtual huge heap. The physical heap serves allocations of physical memory, the ranges for which are initialized according to the available system memory probed at initialization.

The ability to specify a parent heap on ID heap creation points to the composable nature of heaps in Nanos. So let's take a look at how we actually get some usable pages of memory using these elemental ID heaps. The physically backed heap does just this: an allocation from this heap translates into allocations from both the virtual and physical parent heaps, followed by a call into the page table code to map the virtual allocation to the physical one. The returned virtual address is usable, physically contiguous memory. On deallocation, the mapping is removed, and the virtual and physical allocations are returned to their respective heaps.
It's very simple. A single instance of the backed heap is used to feed general-purpose allocators as well as physically contiguous buffers for device and DMA use.

Okay, now let's say we want to carve such mapped pages into smaller allocations. There's more than one way to do this, but we typically use an object cache heap, called objcache, to draw from backed page allocations and serve up smaller allocations of a fixed size, maintaining an in-place free list of objects within a page as well as a list of pages with free objects available. This is much like what the SLUB allocator does in Linux. Finally, to make a general-purpose heap for serving allocations of arbitrary sizes, we have the mcache heap, which contains object caches for a range of power-of-two sizes. This is typically what is used for general allocations. Such a heap is really a container or a wrapper, as it does little more than pass allocation requests on to heaps that do the actual work.

We have some other container-type heaps which simply wrap access to another heap. These wrappers make it possible to augment a heap's behavior in some way without rewriting it, and they're great to use during development and debugging. We have a debug wrapper which can be inserted during development to simply log allocations from some heap. Another wrapper can be used to place and check poison values around allocations. We have a locking wrapper which simply holds a spinlock while handing calls off to be served by a parent heap; we use this wrapper to serve allocations that might be made or released outside of the confines of the kernel lock, such as during some deferred processing scheduled on behalf of a device interrupt. Before the object cache was introduced,
we used a free-list wrapper to cache freed objects of a given type. And as we improve our SMP support, we could create a multiplexing heap wrapper that selects the heap specific to the current CPU or node, if available.

Another avenue to explore is the notion of ephemeral heaps: heaps that exist only for the lifetime of an operation or the handling of some event. This could make use of a simpler type of heap that serves allocations by carving up pages from a parent heap, only returning pages to the parent once the heap is destroyed. In other words, the deallocate is a no-op, and the allocations just sort of leak until the end of the operation. For example, we could have a per-syscall heap or a per-I/O-operation heap.

I hope this little tour through our memory allocation scheme illustrates the power of parameterizing memory allocators and composing complex allocators out of elemental ones. While the internals of the allocators themselves are not novel or unusual, the change in the interface is transformative.

Now we'll discuss how we approach the handling of concurrent operations in a kernel. Generally speaking, a kernel handles events and services requests.
There are external events, like the reception of network packets or the completion of some I/O operation. There are time-based events that were scheduled to run in the future. And there are precise events stemming from the program causing an exception, including system call traps. Each of these stimuli initiates some chain of operations to be executed. Many can run to completion without ever needing to block or require a context switch, but others will include an operation that must be completed, or at least continued, asynchronously.

While it is common for OS kernels to manage such concurrent operations within kernel threads, we try to take what we hope is a lighter-weight approach to concurrency. The thread model is undoubtedly convenient for the programmer: as with threads running in user space, there isn't a need to explicitly save state or build an activation record for a blocking call. There is the expense of a context switch, though: the saving and restoring of program state in and out of frame storage, more so if such an operation is preempted rather than suspended on a call. While threads can be implemented at the kernel level in Nanos, they're not used by default to manage such asynchronous operations. Nanos typically constructs such operations using continuations, or more specifically, closures.

Okay, so what are closures? While closures are a common feature in languages like Go, Python, Swift, and Scheme, they don't seem to be commonly used in OS kernels, and given that C is a language where functions are not first-class objects, closures aren't typically found in C code at all. It's much more common to see callback functions pinned to some event or passed as a call argument, typically accompanied by a pointer to some private, opaque data. Such callbacks are used to implement continuations, synchronous or asynchronous, or to register event handlers. For our purposes, closures are callbacks with some saved variables.
These variables, said to be closed over, bound, or captured, constitute the environment necessary to carry out an operation apart from the lexical environment and stack from which the operation originated. For a continuation, this environment is the activation record. Closures contextualize the callback arguments as well as the captured environment. We use closure types to define various kinds of handlers: for example, a closure that takes a buffer as an argument is defined as a buffer handler, and one that takes a status is a status handler. Such arguments, given on closure invocation, are of course independent of the callback function and the enclosed environment.

Without closures, we might be inclined to create a type to encapsulate the variables needed to complete an operation, along with helper functions to allocate, initialize, inspect, and free such a record; perhaps hundreds of such apparatus would be littered across the system. Closures hide the redundant aspects of managing these records while adding type checking for the enclosed environment as a whole, making them more than just syntactic sugar.

While the necessity of decomposing complex operations into partial completions and continuations may take some getting used to, it has some nice side effects. For one, it gives us some very useful standard interfaces, like status handler and buffer handler, that greatly aid plumbing and factoring within such operations. It increases inspectability too, because any such completion waiting in a containing structure somewhere can be identified by its closure type, and the enclosed environment, with typed and meaningfully named variables, can also be inspected. In contrast, when a thread is blocked, you would need to take a look at its saved frame, use an instruction-level debugger or stare at its disassembly, and wade through stack activation records to gain insight into its state.

Okay, let's take a look at the anatomy of a closure in Nanos through a thoroughly useless example. Here we have the creation of a
closure consisting of a closure function called bar and a closed-over argument a. stack_closure creates a closure which lives on the stack. We apply the closure immediately after, specifying the free variable b as the argument, and the sum of the two numbers is returned. Now, this isn't a sane use of a closure, but it's a simple one to illustrate the form. closure_function is a macro that defines a type of closure; here the environment consists of a single word, l0, and the closure application accepts a single argument, r0. The first two number arguments to the macro are the number of enclosed (or left-hand side) arguments and the number of applied (or right-hand side) arguments, respectively. Bound left-hand side variables are accessed using the bound macro. They are mutable too, so they can be on the left-hand side of assignment statements.

Okay, now let's consider an example of an asynchronous continuation. Here we have an excerpt from our Unix sync syscall. An instance of a status handler, a closure type which takes a status as an argument, is made from the static sync_complete closure function and a closed-over reference to the current thread. storage_sync is called with the status handler, which initiates the file system sync routine. Note that, unlike the previous example, this closure is allocated from the general heap, for it may be called after the sync function has returned. thread_maybe_sleep_uninterruptible checks if the completion was invoked synchronously, immediately returning from the syscall if so, and otherwise calls the scheduler to do something else while the thread goes into an uninterruptible sleep.

Let's imagine that the thread did go to sleep, and that some dirty pages needed to be flushed to disk. After all file systems are synced, the supplied status handler is scheduled to run. sync_complete is called with its saved environment in tow, which calls syscall_return, setting the error return value in the saved thread
frame and scheduling the thread to resume execution Finally closure finish deallocates the closure and scheduling resumes after sync complete returns Okay, let's move on to a more complex example This is the read function in our logging extent based file system The read function itself is a closure created per file with references to the file system and file object closed over The read closure is called With a scatter gather list which is just a vector of buffers in which to store the result A range that is essentially a file offset and length and a completion to apply when the operation is finished The file extent map is something called a range map Which is a search tree of ranges in this case ranges of blocks indexed by file offset Embedded within the file extent records The lookup function here traverses the extent map for a given range of blocks Invoking a read extent closure for any extent, which intersects the query range and zero hole for any gap or file hole Note that the merge m allocated above is passed as an argument to read extent So now let's look at the merge A merge provides a way to join together the completions of multiple operations issued in parallel Culminating into a single completion here the complete argument to file system storage read Each call to apply merge returns a newly created status handler And when all such status handlers are called the completion passed to allocate merge is invoked and the merge deallocates itself Now why do we pass the merge to read extent? 
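To make the merge mechanism concrete, here is a rough, plain-C sketch of a refcounted merge joining parallel completions. All names here (allocate_merge, apply_merge, merge_one, demo_merge) are illustrative only; in nanos, apply_merge returns a newly created status handler closure, whereas this sketch simply shares one callback and the merge pointer:

```c
#include <stdlib.h>

/* A completion callback: receives an environment and a status code. */
typedef void (*status_handler)(void *env, int status);

typedef struct merge {
    int refcount;            /* outstanding handlers, plus the initial hold */
    int status;              /* first non-zero (error) status wins */
    status_handler complete; /* upstream completion, invoked exactly once */
    void *complete_env;
} merge;

merge *allocate_merge(status_handler complete, void *env)
{
    merge *m = malloc(sizeof(*m));
    m->refcount = 0;
    m->status = 0;
    m->complete = complete;
    m->complete_env = env;
    return m;
}

/* In this sketch, "applying the merge" just bumps the count; each issued
   operation later reports in by calling merge_one(m, status). */
merge *apply_merge(merge *m)
{
    m->refcount++;
    return m;
}

/* One operation completed; when the last reference drops, fire upstream
   and let the merge deallocate itself. */
void merge_one(merge *m, int status)
{
    if (status != 0 && m->status == 0)
        m->status = status;
    if (--m->refcount == 0) {
        m->complete(m->complete_env, m->status);
        free(m);
    }
}

static int upstream_calls;
static int upstream_status;
static void upstream_done(void *env, int status)
{
    (void)env;
    upstream_calls++;
    upstream_status = status;
}

/* Simulate the file system read: take an initial hold (k), issue two
   extent reads, then release the hold; upstream fires only after both
   asynchronous completions have arrived. */
int demo_merge(void)
{
    upstream_calls = 0;
    upstream_status = 0;
    merge *m = allocate_merge(upstream_done, 0);
    merge *hold = apply_merge(m); /* k: keeps merge open while issuing */
    merge *req1 = apply_merge(m); /* first extent read */
    merge *req2 = apply_merge(m); /* second extent read */
    merge_one(req1, 0);           /* async completion 1 arrives early */
    int early = upstream_calls;   /* still 0: hold and req2 outstanding */
    merge_one(hold, 0);           /* done issuing requests */
    merge_one(req2, 0);           /* async completion 2: upstream fires */
    return early * 100 + upstream_calls * 10 + upstream_status;
}
```

Dropping the initial hold only after all requests are issued is what prevents a fast completion from firing the upstream callback too soon.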
Well, if read_extent determines that there are blocks to be read, it will issue a request to the underlying block device. But before doing so, it will call apply_merge, bumping the internal refcount within the merge and creating a status handler. This status handler is passed to the block device read call as the completion to be invoked asynchronously once the I/O has finished. Note the first call to apply_merge in the file system storage read function, returning the status handler k. This takes a single reference to hold the merge open for the span of the function. Once the requests for all individual file extents have been issued, k is applied at the end of the function. This is to prevent an asynchronous read completion from prematurely invoking the upstream merge completion before all requests have been issued. This excerpt from nanos illustrates the power of composing complex operations from smaller ones; together with read_extent, zero_hole and the function which issues the block reads, the whole file system read operation consists of just over 70 lines of code.

Let's take a look at one more application of closures in nanos: scheduling. Our scheduling is incredibly simple. There are a handful of lock-free queues for runnable operations. Each operation is embedded in a closure that takes no arguments, which is called a thunk. Servicing a queue amounts to this: attempt to dequeue from a scheduling queue, and if a valid thunk was dequeued, apply it. The code here is simplified slightly for clarity, but this is pretty much all there is to it. What is being scheduled is entirely orthogonal to the scheduling process itself. The scheduler doesn't know about Unix threads, or program state frames, or other such details; it just calls thunks, leaving the details encapsulated within the captured closure environments.

Here's another illustration of this orthogonality. Recall earlier when I said that we could have kernel threads if we wanted them. We would only need to create a runnable thunk to restore a saved frame and stack, much like we do for user space threads. No special handling would be required on behalf of the scheduler.

As with any run-to-completion type of scheduling, this scheme avoids the need to perform frame saves and restores, standing in contrast to a heavily kernel-thread-based approach such as what is used in FreeBSD. While this may seem advantageous with regard to performance, there are some caveats. For one, we are allocating memory in many, or perhaps most, cases where we need a continuation. This is ameliorated somewhat through the use of something rather awkwardly called a closure_struct, which just embeds a closure environment within an existing activation record. We could do more work around the system to fold smaller closure environments into larger ones, thus reducing allocations. When it isn't possible to associate a continuation with a larger context, the use of ephemeral heaps, as mentioned earlier, could also cut down on allocation overhead by simplifying closure allocations and simply allowing them to leak for the lifetime of an encompassing operation. An argument could be made that copying variables in and out of closure environments would counterbalance any potential advantage over thread switching; again, this expense is inversely proportional to the persistence of closure environments. To be clear, this approach of using explicit continuations is not being proposed as a better alternative to the use of kernel threads. The two are not mutually exclusive, and we are currently exploring the use of in-kernel threads to implement modular management services within the kernel. But we feel the discipline of decomposing complex operations into partial applications has paid dividends in terms of composability and inspectability.

Let's now move along to talking about how we approach configuration and management within nanos. Perhaps the most interesting aspect of our approach to management is that we have one common space to store values used for configuration, as well as for revealing information about a running instance. This key-value store is created when a nanos image is built, as part of our logging, extent-based file system called tfs. The system manifest, which begins life as a human-readable, JSON-like text file, contains all the essential values needed to create a nanos image for a given user program: the structure of the root file system and the location of files to be included, the location of the user program and its command line arguments, environment variables and debug flags. Let's take a quick look at an example of a manifest used for our little test Go web server. Now don't worry: you won't need to write such a thing to spin up a nanos instance with your desired program, because ops will take care of that work for you. But it may be helpful to get a visual sense of the ingredients needed to configure a nanos instance. This manifest file is fed into our makefs tool, which ingests these values and stores them within the file system log. The referenced files are taken from the staging environment and written into the file system. Note that even file system metadata, such as the location and size of file extents, also lives in this common value store. When nanos boots, the log is parsed from storage and a root tuple is created in the system. This construction here might be a familiar one.
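As a rough illustration, a minimal manifest for a small Go web server might look something like the following. The keys and values shown are hypothetical approximations of the JSON-like tuple syntax, not nanos's exact schema, and ops generates the real manifest for you:

```
(
    children:(webg:(contents:(host:webg)))
    program:/webg
    arguments:[webg]
    environment:(USER:demo PWD:/)
)
```

The outer tuple is the root; nested tuples describe the file tree, and the remaining keys supply the program path, arguments and environment.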
A tuple is a key-value store, and values can be either tuples or other data types. For now, we just have tuples and buffers, where buffers may contain text strings or serialized types like integers. This root tuple is the sole source of nanos configuration, again including the file system tree and file metadata. So, from the kernel's point of view, the root tuple describes the system in its entirety.

This common store not only holds configuration information but also provides live data describing a running system. One active area of development in nanos is that the values in a tuple do not necessarily need to be stored in memory; they may be backed by methods that produce values on demand. This model allows data to be pulled as needed without any intermediate storing of values. So, for instance, a network driver would not need to periodically push interface counters into the root tuple. It can just provide backed access methods that serve up the latest counters on demand, yet access through the common store shows such statistics as if they were live.

We're also in the process of creating optional external interfaces to this information store, to enable management services as well as debugging of a live instance. This might be better illustrated with a demonstration. Here we have a nanos instance running and exporting an experimental management interface. It's using a tiny HTTP server running inside the kernel, which serves up a JavaScript client and passes live data using WebSockets. And yes, don't worry: this isn't something that would ever be exposed to the outside world, or even built into a nanos image by default. It isn't pretty yet, but it is just some JavaScript code.
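The idea of function-backed tuple values can be sketched in plain C. Everything below (the attr and tuple structs, tuple_set_backed, read_counter) is a hypothetical illustration rather than nanos's tuple implementation; it just shows how a get on a backed key pulls a fresh value on demand instead of reading a stored copy:

```c
#include <string.h>

/* A backing method: produces the current value at access time. */
typedef long (*value_fn)(void *env);

typedef struct attr {
    const char *key;
    long stored;  /* used when fn == 0 (a plain stored value) */
    value_fn fn;  /* if set, the value is generated on each access */
    void *fn_env;
} attr;

typedef struct tuple {
    attr attrs[8];
    int count;
} tuple;

/* Store a literal value under a key. */
void tuple_set(tuple *t, const char *key, long v)
{
    attr *a = &t->attrs[t->count++];
    a->key = key; a->stored = v; a->fn = 0; a->fn_env = 0;
}

/* Store a backing method under a key; no value is kept in the tuple. */
void tuple_set_backed(tuple *t, const char *key, value_fn fn, void *env)
{
    attr *a = &t->attrs[t->count++];
    a->key = key; a->fn = fn; a->fn_env = env; a->stored = 0;
}

/* Get: a backed value is produced fresh on every access, so readers
   always see the latest state without anyone pushing updates. */
long tuple_get(tuple *t, const char *key)
{
    for (int i = 0; i < t->count; i++) {
        attr *a = &t->attrs[i];
        if (strcmp(a->key, key) == 0)
            return a->fn ? a->fn(a->fn_env) : a->stored;
    }
    return -1; /* key not found */
}

/* Example backing method: a counter a driver updates on its own. */
static long live_counter;
static long read_counter(void *env) { (void)env; return live_counter; }

int demo_tuple(void)
{
    tuple t = {.count = 0};
    tuple_set(&t, "mtu", 1500);                          /* static config */
    tuple_set_backed(&t, "rx_packets", read_counter, 0); /* live stat */
    live_counter = 7;
    long a = tuple_get(&t, "rx_packets"); /* 7 */
    live_counter = 9;                     /* device keeps counting */
    long b = tuple_get(&t, "rx_packets"); /* 9: no push required */
    return (int)(tuple_get(&t, "mtu") + a + b); /* 1500 + 7 + 9 */
}
```

The second get observes the updated counter even though nothing was written back into the tuple, which is the point of the on-demand model.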
So the possibilities here for the interface are unlimited. For this example, we just expose a few select tuples that aggregate some interesting data about the running system.

Okay, so let's click on children. children is a key whose value is the contents of some directory, in this case the root directory, so we just see some file system metadata here. Now, if we click on the dev subdirectory, we'll see more metadata. And if we click on children again: oh look, there's /dev/urandom and /dev/null. Okay, so we can browse around and look at some of the static configuration. Now let's go back, and we'll take a look at interrupts. Here we see the various interrupt sources that we have in nanos. So let's click on one of the sources, the virtio network transmit interrupt. And here we see the interrupt count being updated live; so we have streaming updates, too.

Note that this little web interface is just a component placed on top of the value space, and that it could just as well be a telnet interface, or a command line shell utility, or something else. But this is all made possible by the common value space and its get/set interface.

There's a bunch more work to do here and possibilities to explore. We're working to add a degree of introspection for objects around the kernel that can be usefully tracked or managed in some way. The heap allocators, network interfaces, sockets, threads, CPUs and futexes already exist as objects in the kernel, and they would be able to directly export a tuple-like interface and be inspectable, or even mutable if enabled, using this common interface. Now imagine the possibilities here, not only for system management but for in-depth debugging of live instances. And with the use of address tagging, a developer could readily identify any managed kernel object from its address alone, with no other context about it. Just imagine how useful this could be when analyzing a core dump. The tree itself could contain hints and schema for presentation, for example to describe fields' valid ranges or enumerated possible values, or to indicate that a field could be plotted against time, or to suggest a certain layout or visualization. Such a hierarchical namespace could become part of a global distributed namespace, and management tooling could use this to manage sets, and even whole hierarchies, of nanos instances.

Okay, so I hope this overview of the nanos kernel internals has not only piqued the interest of systems programmers in the audience who might be interested in working on nanos, but also illustrates what is possible when a system is built from structures that are designed to help make code more composable, even in a language like C. To me, the clarity of structure within such a system and the agility of development are well worth the learning curve. I also hope that the intro to unikernel concepts, the walkthrough of nanos and the peek into what's coming on the management side will be of interest to the DevOps folks out there, and that you will head on over to ops.city to give ops and nanos a test drive. If you have any questions for me that haven't been addressed in the Q&A text chat, feel free to send me an email at wjhun@nanovms.com; that's w-j-h-u-n. Thank you for coming to my talk, and I look forward to hearing from you.