Hi, so today I'm going to talk about our work, SEAL-Embedded, a homomorphic encryption library for the Internet of Things. I'll start with the IoT vision. There are two facets to the Internet of Things, or two things that come to mind when someone mentions IoT. The first is IoT's ability to give us an increased amount of control over our environments; examples include smart lights and smart appliances generally. But the other important facet of IoT is its ability to give us an increased understanding of our environments. Examples of this type of IoT use case include wearables for continuous healthcare monitoring, monitoring factories for optimized manufacturing, precision agriculture, or understanding things like traffic in smart cities.

This facet of the Internet of Things can be broken down into three components. The first is the data collection step, which involves collecting real-world data at the source of the data itself. After data collection, we would like to be able to perform analytics on the data at scale. And finally, we hope that these analytics lead us to useful insights. In practice, these components need to map to actual devices. Most simply, at the insights end of the flow, we can provide insights to users on their personal devices, such as user-facing PCs or phones. The data collection step, though, happens on specialized IoT embedded devices. These devices are necessary, and a good fit for wide-scale deployments in a variety of scenarios, because they have a small form factor and are low-cost and energy-efficient. But there's no such thing as a free lunch, and as a trade-off these devices are often constrained, particularly in terms of memory.

We would now like to perform analytics on this data, but analytics services often require a lot of data and storage themselves, so we often need the embedded devices to send their data to an external computing resource like the cloud for processing. The cloud is a really convenient place to aggregate and analyze data for several reasons. First, the cloud is resource-abundant, and certainly much less constrained than these embedded devices. It's also easier to manage a cloud infrastructure than several deployments of scattered devices, and it additionally allows applications to use third-party analytics services. So the cloud has all these really nice properties, but the problem is that sending data to the cloud involves a trust domain transition. While both ends of the flow involve user-facing devices in the user domain, the cloud exists in a different trust domain, and the data that is sent to and decrypted in the cloud for processing is vulnerable to attacks.

A brief note that there are two kinds of reasons why the cloud may be considered to be in a different trust domain. The first is that the cloud provider themselves may not be fully trusted by the user. But it may also be a more indirect reason, simply a result of the nature of cloud computing stacks: cloud computing stacks are really large and difficult to verify and debug, which means that third-party attackers can often exploit vulnerabilities in the cloud and obtain access to user data that way. So these are two different reasons why the cloud may not be trusted to maintain user data privacy. And the question we ask in this work is: can we enable data privacy in IoT deployments while enabling analytics at scale and maintaining these nice properties of the cloud?
A promising solution for this is homomorphic encryption, which I'll briefly review here. HE is a cryptographic technology similar to standard encryption in that there is a secret key, and optionally a public key, where the holder of either key can encrypt a message, but only the owner of the secret key can decrypt an encrypted message and obtain knowledge of the underlying data. But HE also offers an additional property: it enables computation on encrypted messages that produces meaningful results in the decrypted domain. What do I mean by this? If we start with some data and use HE to encode and encrypt this message into a ciphertext, then we can apply some function f(x) to the ciphertext and end up with another ciphertext, which I'm showing here as ciphertext prime. And when we decrypt and decode ciphertext prime, we end up with data that is the same as if we had applied f(x) to the initial data directly. The advantage of this is that it allows us to outsource computation to untrusted parties: we can outsource the application of the function f(x), which is called the evaluation step, to an untrusted party, and then we only need to perform the HE encode and encryption steps and the decode and decryption steps in a trusted domain.

Okay, so coming back to this IoT deployment flow, how could we apply HE to enable data privacy? It would seem to map nicely onto this setup: we perform HE encoding and encryption on the embedded devices, have the evaluation component performed in the untrusted cloud, and then when a user needs a result, they simply decrypt and decode the result from the cloud. This is nice because, using HE, we only really need to trust the user endpoints in this flow. And this is where our library, SEAL-Embedded, comes in to enable this HE flow in IoT deployments.

But before I introduce that, I want to take a moment to talk about all the things we would need, and would like to have, in an HE library solution. We would need the library to have low memory use and to be compatible with embedded device constraints. But we would also like not to have to sacrifice too much performance. Ideally, the library would be applicable to a wide variety of embedded devices too, since the devices can be very diverse in capabilities. Then there are some additional things we would like it to do, like be compatible with, or enable, an end-to-end HE deployment flow like the one shown here. And finally, it would be great if we could support public-key HE encryption on devices in addition to symmetric-key HE encryption. There's a reason we want this last point. In truth, symmetric-key HE encryption may be sufficient for many deployments, but if we were to deploy secret keys on these embedded devices and a device were compromised by an attacker, the attacker would be able to learn the secret key and decrypt all previously sent messages that are perhaps now stored in the cloud. If instead we only enabled public-key HE encryption on the device and the device were compromised, the attacker would not be able to access legacy encrypted data. So that's not a requirement for an HE library, but it's something we would like to have.

So that brings us to our solution, SEAL-Embedded. SEAL-Embedded is an HE library for embedded IoT devices. Devices can use SEAL-Embedded to encrypt data on the device and then send the encrypted data to the cloud, where it can be read by what we call the SEAL-Embedded adapter.
The adapter collects the data sent by the device library into an object that is interpretable by the Microsoft SEAL library. The SEAL library implements all the components of HE, so since SEAL supports HE evaluation, we can use the adapter to pass this data off to SEAL and perform homomorphic computation on it. And then we can use a separate instance of the SEAL library on the user device at the decryption end to allow a user to view the results when needed. So by virtue of this adapter, we are able to create an end-to-end HE flow. And overall, SEAL-Embedded satisfies all the requirements we talked about wanting on the previous slide.

Next, when creating SEAL-Embedded, we needed to decide on an HE scheme to support initially. We decided to start with CKKS, which was introduced in 2017 and enables approximate homomorphic encryption. We chose CKKS because it offers the advantage that it's able to operate on encrypted floating-point or real-valued messages much more efficiently than other popular HE schemes like BGV or BFV. And it's because of this property that it is considered the best scheme to use for applications that operate over floating-point data or can tolerate approximate results, such as machine learning. We feel that these types of computations are the most likely to be used for the kinds of analytics we would like to perform over IoT data. So that's why we target CKKS in our library initially.

Now, there are several existing libraries that already implement CKKS, but these libraries are insufficient for IoT devices. Some of the ones we list here are Microsoft SEAL, HElib, and PALISADE. To see whether we could use any of these libraries, we measured their memory usage with three different methods. What we found was that, even using modest parameters and just for HE encoding and encryption, these libraries consume way too much memory to be deployed on embedded devices. Our target is embedded devices with just 256 KB of RAM, and these libraries consume up to thousands of times this amount. And we note that just making minor changes to these libraries, like removing dependencies, changing configurations, and doing some simple code rewrites to get them to fit on an embedded device, would not fundamentally change their memory requirements. So the SEAL-Embedded approach of creating a new library is definitely needed here.

So here's an overview of the SEAL-Embedded library and all the components it includes. At the bottom, we have optimized addition and multiplication operations, including optimized assembly that works on ARMv6T2 devices and above. We do include a C fallback here, though, for maximum portability, in case the assembly does not work for some devices. Above this, we have algorithms for modular multiplication and modular addition and subtraction, including an algorithm for efficient modular reduction using Barrett reduction. Above this, we have polynomial multiplication, which uses the NTT, or number theoretic transform. We also have modules for the inverse fast Fourier transform and a particular type of transformation, denoted here by pi inverse, which is used in the encoding procedure. Finally, we have modules for sampling, and this includes random sampling from uniform distributions over certain intervals and also sampling from a centered binomial distribution.
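To give a flavor of that modular arithmetic layer, here is a minimal Python sketch of Barrett reduction under the same assumption the library makes, namely prime moduli that fit in 32 bits. It is only an illustration, not the library's C implementation, and the modulus and operands below are made-up values.

```python
def barrett_precompute(q, k=64):
    # One-time precomputation per modulus: mu = floor(2^k / q).
    return (1 << k) // q

def barrett_reduce(x, q, mu, k=64):
    # Reduce x mod q without a runtime division, assuming 0 <= x < q^2 < 2^k.
    t = (x * mu) >> k                # estimate of the quotient x // q
    r = x - t * q                    # lies in [0, 2q)
    return r - q if r >= q else r    # at most one correction is needed

# Example: modular multiplication of two values already reduced mod q.
q = (1 << 30) - 35                   # illustrative ~30-bit modulus
mu = barrett_precompute(q)
a, b = 123_456_789 % q, 987_654_321 % q
assert barrett_reduce(a * b, q, mu) == (a * b) % q
```

The appeal on an embedded core is that the per-reduction work is just multiplies, shifts, and a conditional subtraction, with the division pushed into a one-time precomputation per modulus.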
And CKKS encoding and encryption uses all of these components, so we'll talk about how these modules are strung together over the next few slides. One of the key aspects of our library that makes it work is its efficient memory management scheme. In particular, we employ certain optimizations throughout the library with the purpose of lowering memory usage while retaining high performance. Additionally, we offer three main configurations of our library: high performance, memory efficient, and a balanced configuration, with the main difference between them being the degree of pre-computation we perform in some of these algorithms. I'll talk about our memory management scheme in a little more detail in a few slides as well.

But first, in order to describe these optimizations, I need to talk a little bit more about the details of the CKKS scheme itself. Elements in CKKS are polynomials that can be written like this. Polynomials can easily be represented in memory by just storing a vector of their coefficients, so for the rest of this talk I'll often use polynomials and their vector representations interchangeably. In CKKS, encrypted elements are polynomials in a ring R_Q, as shown here. This basically just means that the coefficients of the polynomial, the a_i here, are integers in the range 0 to Q minus 1; they're integers modulo a particular modulus Q. And the degree of these polynomials is n minus 1. In CKKS, n is typically a power of 2 between 1k and 16k.

Now we'll describe the process of encoding and encrypting input data using CKKS at a high level, starting from the encoding step. I won't go into too many details on why we need each step per se, so please read the paper for more details. We can start by assuming we have a light sensor that outputs the following values, maybe a value of 50.11 and then later a value of 0.12. We collect these values into a vector, and then we choose a value called a scale. The purpose of the scale is to preserve some bits of precision of these floating-point values when representing them as integers, which we'll do in a second. Here I'm choosing a simple scale of 10. To encode this vector, we first apply a projection, which permutes the input vector and also doubles its size. Here the permutation is a simple flip of the values, but that's just because our n is equal to 4; for larger n, this projection permutation is more complex. Then we apply an inverse fast Fourier transform to the result, which leaves us with something like this. We then scale up the values by the scale we chose in the beginning. Note how the scale pushed some of the least significant bits up into the integer component. Finally, we round the values to obtain an integer polynomial. Note that this rounding is actually rounding with a modulus, so the negative values wrap around depending on the value of the modulus, which is quite large in this case. The final result is what we would normally call the CKKS plaintext, and this entire process is called the encode procedure. Note that we would never really use polynomials with such a small degree as I'm showing here; in practice, the degree would be at least around 1k and up to 16k, like I mentioned before. This is just meant for illustrative purposes.
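As a concrete companion to this walk-through, here is a small Python sketch of the encode procedure for n = 4 and scale = 10 on those two sensor values. It is an illustration only: it inverts the canonical embedding with a direct linear solve rather than the inverse-FFT-style routine the library uses, and the small modulus is a made-up value.

```python
import numpy as np

def ckks_encode(values, scale, n, q):
    # Step 1: the "pi inverse" expansion -- append the complex conjugates in
    # reverse order, doubling the vector from length n/2 to length n.
    z = np.asarray(values, dtype=complex)
    expanded = np.concatenate([z, np.conj(z[::-1])])

    # Step 2: invert the canonical embedding -- find the real polynomial of
    # degree < n whose evaluations at the odd powers of the primitive 2n-th
    # root of unity equal `expanded`. (A direct linear solve is used here
    # only to keep the sketch short; the transform is the same.)
    roots = np.exp(2j * np.pi * np.arange(1, 2 * n, 2) / (2 * n))
    V = np.vander(roots, n, increasing=True)      # V[i, j] = roots[i] ** j
    coeffs = np.linalg.solve(V, expanded).real    # imaginary parts are ~0

    # Steps 3 and 4: scale up, then round, wrapping negative values mod q.
    return np.round(coeffs * scale).astype(np.int64) % q

# Illustrative parameters only (real deployments use n >= 1024 and a large q).
plaintext = ckks_encode([50.11, 0.12], scale=10, n=4, q=(1 << 20) - 3)
print(plaintext)
```

The integer vector it prints plays the role of the CKKS plaintext that gets fed into the encryption step next.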
Now, coming to the encryption procedure, CKKS is similar to other modern HE schemes in that it derives its security from the ring learning with errors, or RLWE, problem. Here I'm going to show you how an RLWE encryption of 0 would work in an implementation. First, we would sample the coefficients of a polynomial randomly from an integer uniform distribution from 0 to Q minus 1. We would do the same thing for another integer uniform distribution from negative 1 to 1 and obtain a vector s, which would be our secret key. The first sample becomes the second component of our ciphertext, and the negation of the ring multiplication of these two values becomes our first ciphertext component. So far we have something that is insecure if we reveal the ciphertext like this, because it would be easy to recover the secret key s from just these two components. So we introduce a small error term, which is a vector with coefficients sampled from this error distribution here. In our implementation, we choose the centered binomial distribution; other common choices of error distribution are a discrete or rounded Gaussian. In SEAL-Embedded, we choose this distribution in particular because it's faster and isn't considered to weaken security. Once we add this error term, we have two components, one that is uniform random in R_Q and one that's indistinguishable from uniform random in R_Q, and this essentially means that we cannot recover the secret key from these two components. So this is an RLWE encryption of zero. But how do we encrypt a message this way? Well, it's as simple as adding a message plaintext, like the CKKS plaintext we calculated on the previous slide, to the first component of the ciphertext. Now we have hidden our message in the ciphertext as well.

One thing to note is that all of these operations must occur in R_Q. The problem is that Q is often quite large in HE schemes, which makes a lot of these computations very difficult and expensive. How large can Q get? Often it gets to be too large for even native computer data types. And Q needs to be large because it's directly related to the amount of HE evaluation operations we can perform. Here I'm showing a typical bit length for a modulus value Q for various sizes of the degree N, and you can see that in some cases Q can even be larger than 64 bits and wouldn't fit into a uint64. If we stored all coefficients modulo these larger Qs, even small polynomials would have really large coefficients and occupy a lot of memory. Not only that, but sometimes even a uint64 is undesirable for embedded devices, because these devices may have much more efficient arithmetic implemented for values stored in uint32s instead, and that would be more desirable to target.

To deal with these issues, we can choose a Q that is a product of smaller primes Qi and use the Chinese remainder theorem, also known as the residue number system or RNS. Here I'm showing a Q that is a product of L smaller primes Qi. We can then use this property to map each operation that would normally occur modulo Q to L operations that occur modulo the Qi. So coming back to our RLWE encryption, everywhere we used Q before, we now replace with a Qi, and then we just repeat these components L times for all primes Qi that make up Q. We take advantage of this optimization in our library, and it has the effect of enabling us to store certain values in less memory than we would otherwise, because we can now reuse the memory across the primes, and we never have to store polynomials with really large coefficients at any time.
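To make this concrete, here is a toy Python sketch of a symmetric RLWE encryption along the lines just described. It is a rough illustration only, not the library's C implementation: the tiny parameters, the schoolbook negacyclic multiplication, and the binomial parameter eta = 2 are assumptions chosen for readability, and in SEAL-Embedded the same steps would be repeated once per RNS prime Qi rather than for a single large Q.

```python
import numpy as np
rng = np.random.default_rng()

def poly_mul(a, b, q, n):
    # Schoolbook multiplication in Z_q[x] / (x^n + 1): x^n wraps around to -1.
    res = np.zeros(n, dtype=np.int64)
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res

def encrypt_symmetric(m, s, q, n, eta=2):
    # a: uniform in [0, q); e: centered binomial noise with parameter eta.
    a = rng.integers(0, q, n, dtype=np.int64)
    e = (rng.integers(0, 2, (n, eta)).sum(axis=1)
         - rng.integers(0, 2, (n, eta)).sum(axis=1))
    c0 = (-poly_mul(a, s, q, n) + e + m) % q   # first ciphertext component
    return c0, a                               # ciphertext is (c0, c1 = a)

# Toy parameters (a real deployment would use n >= 1024 and RNS primes).
n, q = 8, (1 << 20) - 3
s = rng.integers(-1, 2, n, dtype=np.int64)     # ternary secret key
m = rng.integers(0, q, n, dtype=np.int64)      # e.g. a CKKS plaintext
c0, c1 = encrypt_symmetric(m, s, q, n)
# Decryption check: c0 + c1 * s = m + e (mod q), i.e. m up to small noise.
# Prints e mod q: small values, or values just below q where e is negative.
print(((c0 + poly_mul(c1, s, q, n)) - m) % q)
```

The final check confirms that c0 + c1·s recovers the plaintext up to the small noise e, which is exactly the property approximate decryption in CKKS relies on.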
In more detail, here I'm showing the logical flow for symmetric CKKS using this optimization. We have our input data on the left, which we pass into the encoding procedure to end up with a plaintext m. At the top, I'm showing that we sample vectors e and a, like we described earlier, and then use them to perform an RLWE encryption of zero. We do this using a precomputed secret key value s that we read from flash, and we add this encryption to the plaintext m.

I'll use this flow to point out a few of the memory-focused optimizations we make. The first is that, like I mentioned, we operate with respect to one RNS prime at any given time, and we send the values away over the network in between the primes. So we operate with respect to the first prime, then free the memory of the components we don't need anymore, compute the next prime's values, and so on and so forth. Another thing we do is reorder certain steps in the flow. Here the plaintext m is reduced and added to each of these encryptions of zero, but instead we can combine it with the storage of the error e and reduce the two together. This doesn't change the final result at all, but it's more efficient in terms of memory and even computation. Another thing we do is enable compression of certain values in certain places. The secret key s, for instance, can be stored as just three bits per coefficient, and the error polynomial only needs a few bits per coefficient as well, so we store these values in compressed form. Sometimes we even opt to sample error polynomials directly where they're needed, coefficient by coefficient. What I mean by this is that, since we pull out this error polynomial and add it to the plaintext, we never need additional storage for the error, because we can sample each coefficient and add it in place to m, coefficient by coefficient (a small sketch of this idea appears below). I'm not showing the asymmetric flow here, but it's given in the paper, and the optimizations we perform there are very similar to what I'm showing here.

In addition to these optimizations, SEAL-Embedded offers three tiers of additional configurations, which affect these highlighted components in the library. First there's the high-performance configuration. Here we pre-compute various values for some of these transformations; the NTT in this configuration, for example, includes extra precomputed values that enable even faster modular reduction when multiplying by the NTT roots, so-called lazy reduction. This tier does require a lot of additional flash storage, however, so on the opposite end we have the memory-efficient configuration, which takes the opposite approach and calculates all roots on the fly as needed. And then we also offer a balanced approach that's a middle ground between these two.

In our paper, we give a detailed description of how the operations of encoding and encryption should be efficiently ordered and placed in memory to enable optimal memory usage and performance, all given in terms of the HE parameter N and assuming prime moduli that fit into a uint32 data type. Here I'm showing an example memory layout for the high-performance configuration. We can see that the IFFT roots, the NTT roots, and the public key values are stored in flash. We use these numbered circles to show the order of operations and the dashed lines to denote memory reuse in RAM.
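Coming back to the sample-where-needed optimization mentioned in this flow, here is a tiny Python sketch of the idea. It is an illustration, not the library's C code, and eta = 2 is an assumed binomial parameter: each error coefficient is drawn from the centered binomial distribution and immediately folded into the plaintext buffer, so a separate error polynomial is never allocated. The same buffer-reuse mindset carries over to the per-prime streaming and to the planned memory layout discussed next.

```python
import numpy as np
rng = np.random.default_rng()

def cbd_coefficient(eta=2):
    # One centered-binomial draw: the difference of two sums of eta random
    # bits, yielding a small integer in [-eta, eta].
    bits = rng.integers(0, 2, 2 * eta)
    return int(bits[:eta].sum() - bits[eta:].sum())

def add_error_in_place(m, q):
    # Sample each error coefficient where it is needed and add it straight
    # into the plaintext buffer m (mod q); the full error polynomial is
    # never materialized, so it needs no storage of its own.
    for i in range(len(m)):
        m[i] = (m[i] + cbd_coefficient()) % q
    return m

# Usage on a toy plaintext with a toy modulus:
print(add_error_in_place([5, 17, 0, 3], q=97))
```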
Now, we could just blindly allocate memory and free it as we go without mapping it out, but this strategy doesn't work well for embedded systems, because these systems often don't have sophisticated memory managers, and not planning out the memory allocation exactly can lead to things like memory fragmentation, which can really affect the performance of these device deployments. Here I'm just showing the asymmetric high-performance configuration, but the paper gives more details about the other configurations as well.

Finally, we evaluated our library on two different platforms to demonstrate its range of use. The first platform is the Azure Sphere, on which we target the Cortex-A7 core. The Azure Sphere is a device that is marketed for IoT deployments, but its Cortex-A7 is quite powerful: the Sphere even includes its own custom Linux operating system, and it has a caching system and a high clock speed. The other device we target is the Cortex-M4 on the nRF52840. This processor is much less powerful than the A7; it has a much lower clock rate and no cache, for example. Interestingly, though, we had similar memory requirements for both devices, even though the Sphere's platform would normally have much more RAM available, because the Sphere reserves much of its RAM for the operating system. So in both cases we only had 256 KB to work with, and actually our target RAM usage is much less than this, because we need to leave room for the user application, which collects the initial data, to run as well.

Here are our results for both symmetric and asymmetric encryption for the three configuration tiers of our library. We provide the data memory usage stats and also the runtime for the setting of n equal to 4k, using three RNS primes to represent our modulus Q. These parameters are modest, but they're still enough to enable homomorphic linear inference, for example. Our results show that the Sphere ran in less than 0.2 seconds in all cases, while the nRF runs took around 0.7 seconds in some cases and double that for the memory-efficient version. Keep in mind, though, that these results are for an encryption of 2048 values in a batch, so this cost may not need to be paid per sensor sample and can potentially be amortized.

In summary, we presented SEAL-Embedded, an HE library for the Internet of Things, which enables privacy-preserving computation for IoT deployments. SEAL-Embedded is configurable for a variety of devices, requiring between 65 and 136 KB of RAM and 1 to 264 KB of flash. The library takes anywhere from 0.06 seconds to about 1.5 seconds to run HE encoding and encryption of 2048 values, and it's compatible with the Microsoft SEAL library for an end-to-end HE deployment solution. And finally, we made the library open source, so you can check it out at the link in the description. So that's the end of the presentation, and thanks so much for listening.