So hello, and good morning. I'm Shriram, a senior architect at CalSoft, and I have Swapnil with me as my co-presenter; he's an active contributor to OpenStack. Before we jump into the nitty-gritty details, I should mention Bipin and Kashish, two dynamic young folks who helped me with each and every re-edit, maybe 10 to 20 of them, before we got this into shape. OK. So we can quickly go through the agenda here. The agenda is plain and simple. First, the problem statement: essentially, what is it that we are trying to address. Second, the proposed solution: an overview of what we are proposing and how it is going to change things. Then the solution prerequisites: a set of changes that are required no matter which of the following approaches we choose. Approach one is client-unaware striping; essentially, existing clients would work as-is without any changes, though possibly with some limitations. Approach two is client-aware striping, where even the client contributes to the effort, and this can make a much bigger impact. This is followed by a set of use cases. We have five use cases that we have tried to take from the real world. First, how to export an object store as a volume. Second, can we implement redirect-on-write (ROW) based snapshots? Third, can a CDP solution be built on an object store that has striping support? Fourth is a bit different: TRIM, or UNMAP. In the typical storage world, especially in a SAN environment, and even with the latest SSDs, you have TRIM or UNMAP support; so how can that be brought in with striping? And finally, can we build an object-based file system?
So these are the five use cases we are thinking about, and you can start relating to them when we discuss the approaches. Following this, there are some enhancements under Future Scope. One of them, which I can quickly mention, is that we are also asking whether something like erasure coding could be used instead of the current multi-copy mirroring that happens in Swift. And there are some more Future Scope items. The Appendix typically has things like the JSON headers for the HTTP requests that would be modified in this implementation; it is more for developers who believe in this idea and want to implement it, just a kind of guideline. Apart from that, there is a See Also section, which lists the rest of the papers we are working on; they didn't get selected here, but they will certainly be put on our website. There are five more papers we are working on that should be available on CalSoft's website soon. Moving on to the problem statement. Traditionally, an object store has been viewed as a store for smaller-size objects. Having said that, Swift does support large objects, in the form of a dynamic large object (DLO) or a static large object (SLO), and it does so by segmentation, via a manifest file. You are probably aware of this: the manifest file is, you can say, the root file of a large object, storing all the segment-related information. Essentially, you hit the manifest file, get all the segment information (where are the segments lying?), and then fan out to those particular segments. That's how GET and PUT requests are executed. The limitations for large objects today are that they are still costly in terms of time, network bandwidth usage, and even storage utilization, I'd say.
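To make today's segmentation mechanism concrete, here is roughly the shape of a static large object manifest: a JSON list the client uploads, which every GET reads first to locate the segments. The paths, etags, and sizes below are made up for illustration.

```python
import json

# Illustrative SLO-style manifest: each entry points at one segment object.
# A GET on the large object first fetches this root list, then fans out
# to the listed segment paths.
manifest = [
    {"path": "/segments/obj/0001", "etag": "a1b2c3", "size_bytes": 1048576},
    {"path": "/segments/obj/0002", "etag": "d4e5f6", "size_bytes": 1048576},
]
body = json.dumps(manifest)

# The logical object size is just the sum over the segments.
total_size = sum(seg["size_bytes"] for seg in manifest)
```

The manifest grows with the number of segments, which is exactly where the management pain starts.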
The manifest file is limited by the number of segments; as the number of segments grows, it becomes really difficult to manage large objects. Well, I have been discussing large objects a lot, but the more important point we are trying to cover is varying-size objects: be it a regular, small, medium object, or a large object, they are treated separately, or differently, in Swift today, and that is what we would like to get away from. If there are any questions in between, please feel free to raise your hand and ask; I would prefer an interactive session. So, the proposed solution. What is it? Object striping is a technology; it is not a solution by itself. But when it is blended with sparse objects, parallel reads and writes to multiple object servers, and vector IO at the object server end, then this equation builds up into the solution we are looking for. Delving into the details: object striping is not much different from today's segmentation or chunking, but the most important thing is that it is not confined to large objects only. Essentially, as the object size increases, you'll see more and more stripes, and all objects are treated similarly. This also helps in storing the objects. Question from the audience: speaking of this in comparison to erasure coding, as you work through it, can you distinguish the areas where this proposal looks at things, or addresses them, in a different way? So, erasure coding is certainly not where we started from, but when we went into the details, we realized it is something that can certainly be added. That, though, would be a problem solved at a larger scale: essentially removing the replication part of whatever the proxy does today, the whole replica-count machinery.
So all of that multi-copy mirroring would be totally removed. But that is not exactly what we are trying to solve here, at least in this paper. Here, we are trying to solve for a better way of segmentation, better handling of objects, and what can be achieved based on that. Does that answer? Yeah. So, going ahead: object striping will also enable an object to be spanned across multiple object servers; we'll see exactly how in just a couple of minutes. Sparse objects. A sparse object essentially means that not all the stripes need to be consumed. You can say it's a 1 GB object, but it possibly has just a single stripe stored in it, at an offset of maybe 1 GB. So the disk space consumption, if you consider it, is just 1 MB, but the object size is possibly 1 GB. Virtually, you can say we are also getting away from the limitation on object size: you can write to any offset of an object, and it should just be done. Following this, parallel reads and writes. Essentially, since the stripes are going to be pushed into multiple partitions, and so into multiple object servers, the read and write operations are going to involve multiple object servers. In a literal sense, we can get parallelism, because object servers are completely different entities to which you can send requests: you read all the stripes in parallel from all the object servers involved. But here we also have to be really optimal. We can't just say that we'll use the maximum number of them; it should still be an optimal maximum, and it can't be too large. Following this, vector IO. Here is the way to look at this complete object: its life cycle, or its topology, you can say.
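The sparse-object arithmetic above (a 1 GB logical size with only about 1 MB consumed) is what a sparse file on a POSIX filesystem already provides. A minimal sketch, assuming a filesystem with sparse-file support such as ext4 or XFS:

```python
import os
import tempfile

STRIPE_SIZE = 1024 * 1024   # a 1 MB stripe, illustrative
STRIPE_OFFSET = 1024 ** 3   # the lone stripe sits at the 1 GB mark

fd, path = tempfile.mkstemp()
try:
    # A positioned write far past EOF leaves a hole; no blocks are
    # allocated for the skipped range.
    os.pwrite(fd, b"\xab" * STRIPE_SIZE, STRIPE_OFFSET)
    st = os.stat(path)
    logical_size = st.st_size            # 1 GB + 1 MB
    physical_size = st.st_blocks * 512   # roughly 1 MB actually on disk
finally:
    os.close(fd)
    os.remove(path)
```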
A single object, you can say, is divided into a set of sub-objects, and these sub-objects go into unique partitions. The term sub-object has been introduced already: it is nothing but a smaller-size object holding one or more stripes, but certainly not all of them. Essentially, the sub-object itself will be something like a sparse file, or sparse object, in its own right. Suppose there are just two stripes to be written to a particular sub-object; they would be written at their particular offsets. Being a sparse-file kind of implementation, it is not going to consume any extra space for the holes that get created; it just writes to those particular offsets, and that is the maximum space consumed. Considering this, vector IO really makes sense, because we might want to write or read multiple stripes from a single sub-object on every partition, and hence every object server. So vector IO is something we propose to use to do the required IO at the object server end. Looking at the advantages to be gained: the maximum number of object servers involved in reads and writes, an increase in Swift's throughput, and load sharing across multiple object servers. When we did a bit of experimentation with this, we found that even with the existing hashing functionality, the stripes got very evenly distributed across partitions, and hence across the object servers. With an even distribution of stripes, or at least movement toward one, the storage consumption on every partition, and hence every object server, is evenly done. This not only helps with parallel reads and writes, but can possibly help us with rebalancing too.
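A rough sketch of the vector IO idea against such a sparse sub-object: each stripe sits at its own offset inside the sub-object file, and several stripes are read or written in one request. The layout and helper names here are illustrative; a real object server might batch the positioned IOs with preadv()/pwritev() or io_uring instead of looping.

```python
import os
import tempfile

STRIPE_SIZE = 4096  # illustrative

def write_stripes(fd, stripes):
    # stripes: {stripe_id: data}; each lands at its offset, leaving
    # holes (no allocated blocks) between them.
    for sid, data in stripes.items():
        os.pwrite(fd, data, sid * STRIPE_SIZE)

def read_stripes(fd, stripe_ids):
    # One logical request that returns several stripes of the sub-object.
    return {sid: os.pread(fd, STRIPE_SIZE, sid * STRIPE_SIZE)
            for sid in stripe_ids}

fd, path = tempfile.mkstemp()
try:
    write_stripes(fd, {3: b"c" * STRIPE_SIZE, 7: b"g" * STRIPE_SIZE})
    got = read_stripes(fd, [3, 7])
finally:
    os.close(fd)
    os.remove(path)
```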
When a new object server node is added or removed, we move the partitions, and with most partitions now consumed evenly, it really makes sense: the rebalancing effort is helped a lot. The next advantage is that object size has no limitation now. You can go to whatever size, be it 10 GB, 100 GB, whatever you want; it's just the offset at which a stripe is going to be written. And most importantly, all objects, be it a small object or a large object, are treated in a similar way or fashion. Any questions about this? So, a quick glossary. We already introduced the term sub-object, but here it is spelled out: a portion of an object consisting of one or more stripes, stored sparsely in a partition, or a partition file, I'd say. There are other terms. We have added stripe ID, a unique identifier for a stripe within an object. Stripe size is something that just has to be defined and stored in the object's entry in the container database, as the striping information. Container DBs: this is a change in the database schema that we propose. Object ID already exists; stripe size is something we want to add to it, as is object size. Object on-disk size is there so that the actual storage consumption can be well known. Object version delta size is a term that will be used heavily when we discuss the use case of snapshotting over object stores, and also the CDP, continuous data protection, kind of solution. Essentially, there are some rules of thumb we would like to follow. Different objects may have different stripe sizes, but for a given object, the stripe size remains the same: once it is decided, it becomes immutable. These choices were made because we don't want to make drastic changes in Swift; adapting to it would then be really difficult.
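Swift's container DB is SQLite, so the additions listed in the glossary could look roughly like this. The table layout below is a simplified illustration, not the actual Swift schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE object (
        name               TEXT PRIMARY KEY,
        object_id          TEXT,     -- exists today
        object_size        INTEGER,  -- logical size (proposed)
        stripe_size        INTEGER,  -- immutable once set (proposed)
        on_disk_size       INTEGER,  -- actual storage consumed (proposed)
        version_delta_size INTEGER   -- for ROW snapshots / CDP (proposed)
    )
""")
# A 1 GB sparse object that so far holds a single 1 MB stripe.
conn.execute(
    "INSERT INTO object VALUES (?, ?, ?, ?, ?, ?)",
    ("vol1", "obj-0001", 1024 ** 3, 1024 * 1024, 1024 * 1024, 0),
)
row = conn.execute(
    "SELECT object_size, on_disk_size FROM object WHERE name = 'vol1'"
).fetchone()
```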
So we are putting a certain set of limitations on our approach as well. Here are some formulas. What is a stripe ID? A stripe ID can be simply said to be a numbering, but we have tried to define it more as a function: a function of stripe offset, stripe size, object path, and partition size. Stripe size and offset are very well understood by everyone; the reason we add the optional parameters of object path and partition size is that if you want to implement a better function to distribute the stripes evenly, you can make better decisions based on them. Similarly, there is another function, stripe offset, which can be calculated from stripe ID, stripe size, object path, and partition size. The relationship between these two functions is that they are inverses of each other: if you have the stripe ID, you can find the stripe offset, and if you have the stripe offset, you can find the stripe ID. That's how these functions need to be defined. One more thing: the stripe ID is never stored as such. There's no database entry for it, as you have seen in what we have put up; it is something that can always be calculated by any of the entities. Now, the role of the stripe ID hash. We know what the object ID is; the stripe ID hash is the hash of a new URL that we form by appending the stripe ID to the object path. The salient point here is that today the object ID is used to determine the partition. It won't be anymore: the stripe ID hash would instead be used to decide the partitions. So essentially, a partition is now a set of stripe ID hashes, not of object IDs.
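The simplest such inverse pair, ignoring the optional object-path and partition-size inputs, together with a stripe-ID-hash-to-partition mapping in the spirit of Swift's md5 ring, might look like this (the suffix scheme and partition power are assumptions for illustration):

```python
import hashlib

PART_POWER = 10  # 2**10 partitions, illustrative

def stripe_id(offset, stripe_size):
    # Inverse of stripe_offset(): which stripe does this byte fall in?
    return offset // stripe_size

def stripe_offset(sid, stripe_size):
    # Inverse of stripe_id(): where does this stripe start?
    return sid * stripe_size

def stripe_partition(object_path, sid):
    # Hash the object path with the stripe ID appended, so the
    # partition is chosen per stripe rather than per object.
    digest = hashlib.md5(f"{object_path}/{sid}".encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)
```

Because the two functions invert each other, no stripe ID ever needs to be stored; any component can recompute it from an offset.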
The advantages we have possibly discussed already: even distribution, optimal participation of multiple object servers, and ring rebalancing. The extended attributes: today they are object-based, and we are saying we move from object-based extended attributes to stripe-based extended attributes. A stripe ID is added, and the rest of the attributes remain pretty much the same, but they are kept with respect to a stripe ID. Essentially, you'll have multiple sets of extended attributes, because a single sub-object can hold multiple stripes. Next is the basic change in the HTTP request that we propose, with respect to the communication of a PUT request from the proxy to the object server. There is a new header we really want to add, which holds the object ID, stripe ID, stripe size, and HTTP request offset. This is the data of the incoming stripes that needs to be sent to an object server, and all the fields enable the object server to get the required information: stripe content length, object size, and so on. Note here that stripes of multiple objects can travel in a single HTTP request; we'll see how that makes sense when we discuss the approaches. The response would be as follows: for every stripe that has been written, the object server sends back a state saying whether it was a new write, an update, or a trim. Trim here means we are deallocating the stripe. Here you can see the stripe delta size being reported. Essentially, whenever there is a new write, there is a change in size, and that is reported; when it is just an update of an existing stripe, there is possibly no change in the disk space utilized, so it's zero; and trim means you're actually removing the stripe.
Essentially, when this response is received by the proxy, it can simply add up all these values and determine the actual change in the object's on-disk size. It just collects all the HTTP responses coming from all the object servers involved for a particular object, adds them all up, and updates the container database with the resulting on-disk size. Similarly, the stripe max offset is something that can be used to determine the object size. These functions are defined in a simple form; it's a max of so-and-so things, really. Moving to the GET request, it is along pretty similar lines. You can see the request carries the stripe size, stripe content length, whatever the IDs are, and the object offsets. On the object server side, based on the stripe ID, it identifies the offset, gets the stripes, builds the HTTP response, and sends it all together. Here too, a collated request for multiple objects and stripes is sent to a single object server; it's like we're piggybacking multiple stripe read requests in a single HTTP request. Moving on, there are some miscellaneous changes. Fingerprinting would be used at the stripe level. Essentially, in a case where the same stripe has been sent over with no changes, the object server just calculates the fingerprint for that particular stripe and compares it with the fingerprint stored in the extended attributes. If it is the same, possibly it doesn't need to do any IO and just reports success. With regard to the replicator, auditor, and updater, they will now have to work based on stripes, not objects. Essentially, whenever the auditor has to check fingerprints from the hash table, it will check a stripe's fingerprint rather than an object's. They have to work at the granularity of a stripe, not an object. So yes, we're coming to approach one.
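The proxy-side bookkeeping just described can be sketched as follows. The response field names are made up for illustration (the proposed header layout lives in the appendix), and a trimmed stripe would simply report a negative delta:

```python
def apply_put_responses(responses, stripe_size, old_object_size):
    # Fold the per-stripe PUT responses from all object servers into
    # the container-DB update: total on-disk delta and new logical size.
    on_disk_delta = sum(r["stripe_delta_size"] for r in responses)
    max_end = max((r["stripe_id"] + 1) * stripe_size for r in responses)
    new_object_size = max(old_object_size, max_end)
    return on_disk_delta, new_object_size

responses = [
    {"stripe_id": 0, "status": "new_write", "stripe_delta_size": 1048576},
    {"stripe_id": 1, "status": "update",    "stripe_delta_size": 0},
    {"stripe_id": 9, "status": "new_write", "stripe_delta_size": 1048576},
]
delta, size = apply_put_responses(responses, 1048576, 0)
```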
So there are these two approaches: one is client-unaware striping and the other is client-aware striping. In client-unaware striping, the proxy is the one involved in striping the objects and collating the requests, so this is totally transparent to the client. The client sends objects just as it does today; they get striped on the proxy, collated, and then sent to the destined object servers. The proxy is the one that calculates the stripe ID hash and determines the partitions, and hence the object servers. Collation: what we are trying to do here is just optimize things. Not only are requests for a single object sent to the object servers; requests coming from multiple clients for multiple objects, whose stripes are destined for the same object server, are collated into a single HTTP request and then sent down to that object server. Essentially, that has to have some collation criteria, and we have defined the simplest ones: it could be timeout-based or size-based, a typical SLA, nothing different. In this approach, approach one, the stripe size is going to be the same for all objects. The reason: how do you typically determine the stripe size? The stripe size would typically be a function of things like the available network bandwidth, and maybe processing power and other factors. But here, since the striping happens at the proxy and is just sent down to the object servers, you are still inside your private network; you know exactly what the network bandwidth is, and it is probably the same throughout. So there's no point in having different stripe sizes for different objects; it would just complicate things.
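A minimal sketch of such a collation buffer, with both criteria; the names and thresholds are illustrative, not from the paper:

```python
import time

class CollationBuffer:
    # Collects stripe requests bound for one object server until either
    # a size threshold or a timeout is reached, then flushes them as a
    # single collated HTTP request.

    def __init__(self, max_bytes=4 * 1024 * 1024, max_wait=0.05):
        self.max_bytes = max_bytes
        self.max_wait = max_wait
        self.pending = []
        self.nbytes = 0
        self.first_at = None

    def add(self, stripe_request, size):
        if not self.pending:
            self.first_at = time.monotonic()
        self.pending.append(stripe_request)
        self.nbytes += size
        return self.ready()

    def ready(self):
        if not self.pending:
            return False
        return (self.nbytes >= self.max_bytes
                or time.monotonic() - self.first_at >= self.max_wait)

    def flush(self):
        batch, self.pending, self.nbytes = self.pending, [], 0
        self.first_at = None
        return batch
```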
So here, all the objects would be striped at the same size. This is for approach one. So typically, how would the PUT operation work? The request comes as a complete object to the proxy. The proxy stripes it and decides the partition based on the stripe ID and the calculation it is going to do. It even collates requests from multiple clients for multiple objects. The HTTP requests are sent in parallel to all the object servers; the sub-objects are essentially written by the object servers using vector IO, and the response is sent back. The diagram just shows all the rings, the Swift proxy, and the clients, and you can see the stripes of the various objects: S1 is a stripe of the first object, and so on. Moving on, this is how a typical HTTP request would look, with the request going from the proxy to the object server and the response coming from the object server back to the proxy. It's a typical scenario where client one is writing object one and client two is writing object two: how they get collated, how they are sent down to a single object server, and how the response comes back. On similar lines, the GET request: multiple requests coming from various clients are again converted into stripes and forwarded to the object servers. When the responses come back from the object servers using vector IO reads, the proxy collates all of them and sends them back to the clients. Essentially, the overhead here is assembling the object again and then sending it back to the clients. This just explains the scenario with regard to the objects: how the HTTP requests are going to look and how they are going to be formed. So, what are the pros and cons, and any other miscellaneous changes?
The pros: modifications are confined to the Swift server; the client is not involved at all. There is an increase in the effectiveness of vector IO at the object server, as the proxy collates requests from multiple clients. Since it collates all the requests coming from various clients and sends them down to the object server together, it is simply good for vector IO, and performance can be much better. The cons: as even today, the proxy can be a bottleneck in this approach. Collation of stripes of objects from multiple clients leads to synchronization issues; the collation criteria that we mentioned, which are SLA-based, timeout-based, or size-based, will certainly have an impact on synchronization. Miscellaneous changes: essentially, the changes are in the Swift container server, proxy, and object server, and the replicator, auditor, and updater have to work at the stripe level. Moving on to approach two. Here, what we are suggesting is that if the client can really be involved in the striping effort, there is much more benefit. Not really striping as such, but when the Swift client is used by applications above it: take the use case of building a volume over an object store, where a couple of objects, or a few large objects, are collated to form a volume at the client side. In that case, IOs happen on that volume from some application at random offsets. If such a request is sent directly down to the Swift client, it can determine: OK, this is at such-and-such offset, so it is part of a particular stripe; I just need to send that particular stripe as a PUT request to the object server, and that should be it.
So essentially, what we are saying is that the Swift client is now capable of taking requests at a stripe level from the applications using it, and passing them on. Similarly, this applies for read operations on the volume built over it. We are also suggesting that we can piggyback the striping information in the GET response; the PUT obviously carries it, but in the GET we will piggyback it too. The one thing we are considering is that if there are updates to the object that a particular Swift client is not aware of, it can get notified via the GET response itself. Well, there is some amount of metadata cache the client will have to keep, similar to what we discussed in the container DB changes. Stripe ID, being a function, can now also be calculated by the Swift client. As for the stripe size of an object: here, the stripe size can be determined during the very first PUT of an object, as a function of the network bandwidth and other factors, based on whatever the client sees. So now we have an optimization starting from the client end to the proxy, and from the proxy to the object server: the network bandwidth is used very optimally, end to end. The client can determine, based on its own network bandwidth, that it wants maybe half an MB of stripe size, because that suits it better; in other cases it could be one MB. It depends on the client and its own parameters which stripe size suits it; it's an optimization the client itself can do. I mean, this is going to complicate not really the Swift client, but the applications that use the Swift client; but that is the price of the optimization you are going to gain from it. So, the PUT request.
So essentially, whenever an application sends a request saying that data at such-and-such offset of this particular object is being changed, a stripe is built out of it. It could be that some plus-or-minus amount of surrounding data has to be picked up by the Swift client before sending it to the object servers. In the use cases we have tried to discuss, it is very much possible that the application itself is doing something like vector IO on the volume built over the objects. So the Swift client can get multiple read or write requests, which are GET or PUT requests from Swift's perspective, and all of these requests can then be collated and sent to the proxy. Essentially, striping happens here and the result is sent to the proxy; the requests are then sent down to the various object servers, they give back the responses, and the respective clients are given the required set of stripes. It's never the complete object; it's always going to be in terms of stripes now. This is a scenario where Swift clients want to write or modify some stripes of object one and object two. Here, for illustration, we have taken stripe sizes of 1024 and 1536 bytes, so one KB and one and a half KB. The Swift client forms a single request, clubs the information of the modified stripe data into it, and sends a write, that is, a PUT request. This is the PUT request that would come from the client to the proxy. The proxy, based on the stripe ID, identifies the object servers, as in approach one, and then sends it to the respective object servers. S2 of object one and S1 and S3 of object two reside on the same object server, so all three of these stripes would be sent to the same object server. This is how the response would look, this being a PUT request of all new writes.
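The client-side mapping from a random-offset volume write to the stripes it touches can be sketched like this; it is a minimal version of what the Swift client would compute in approach two:

```python
def stripes_for_write(offset, length, stripe_size):
    # Return (stripe_id, offset_in_stripe, nbytes) for each stripe a
    # volume write at the given offset and length touches.
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    plan = []
    for sid in range(first, last + 1):
        start = max(offset, sid * stripe_size)
        end = min(offset + length, (sid + 1) * stripe_size)
        plan.append((sid, start - sid * stripe_size, end - start))
    return plan

# A 1000-byte write at offset 1500, with the 1024-byte stripes from the
# illustration above, spans stripes 1 and 2.
plan = stripes_for_write(1500, 1000, 1024)
```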
The response from the object server to the proxy is going to carry the status 'new write', with the stripe delta size as an addition, and when this information is received by the proxy, the proxy sends back a pretty similar response to the clients. So here, in this second approach, I would say the proxy acts more or less like a pass-through: it is not doing any heavy lifting. Whatever heavy lifting there is happens mostly at the client end. Essentially, we have pushed the processing to the clients, and we are utilizing more resources overall, rather than all the load coming to Swift itself; the clients also do some amount of the processing. On similar lines, the GET request: whatever stripes are requested by the application from the Swift client are sent, as collated GET requests, to the proxy. The proxy then identifies where these stripes have to be read from, essentially the partitions and hence the object servers, and fetches them all. And here is another benefit: the proxy is not going to assemble all these stripes. Whatever order it receives them in from the multiple object servers, it can just club them together and send them back. In the earlier approach, if you recall, the proxy had to actually assemble the stripes, because the clients were not aware of striping, so the proxy had to have a complete, assembled object. Here, that's not the case: even the client is aware of the striping, so you can send the stripes in whatever order you received them, and the client is efficient, or knowledgeable, enough to take the decisions. This slide is just an illustration of how the HTTP request would look. These are special headers; they are not part of existing HTTP, but headers we are adding. Well, I think we can skip this.
This is the response. Here you can see the piggybacked information in the GET response: we are saying that the object size has changed, which you were possibly not aware of, so you can update your metadata cache at the client; it helps its further striping. The pros of this: the proxy is offloaded from the striping and collation tasks. They are no longer its tasks; they are tasks of the client itself, or rather not even the client, because the applications using the Swift client are anyway going to do vector IO or multiple-offset IOs. So essentially, that work has been offloaded in this approach; we are utilizing more of the client's resources than the Swift server's, and we are reducing network bandwidth usage from the client end all the way to the object server end, not just within the Swift server, so it is much more optimal in network usage. The cons: the proxy still has to wait for all the stripes coming from multiple object servers. As we know, since the sub-objects can reside on multiple object servers, the reads are going to be done in parallel, but the proxy at least has to wait until all the reads are done from the object servers. Maybe we can think of an optimization where whatever stripes have been read are sent down to the Swift client right away, with a note saying: wait, another set of stripes is coming, we are not yet done. So this con can still be removed. Miscellaneous changes: actually, we are touching possibly all the components of Swift, in a limited fashion, and the replicator, auditor, and updater have to work on a stripe basis, not at the object level. So we're kind of done with the two approaches; we'll move into the use cases, unless there are any questions on the approaches. Yeah, sure. Comment from the audience: OK, there are projects on GitHub that do striping entirely on the client side, with no server support at all.
And it's horribly slow, because all of the writes come from different object servers, and all of the writes go through one container server, because of the object updates; the container server is still a bottleneck. How much is stored in there? So essentially, if you have seen the container database changes, we have kept them really minimal. We are not storing per-stripe information at all; we just store the basic striping information. The way I would distinguish striping information from stripe information: striping information is the information required to do the striping, while stripe information would be keeping every stripe's details out there. And that's why it was very important that we defined stripe ID as a function and not as a database value. The stripe ID is not a database value; it is a function, and that makes a big difference: you don't need to go to the container database. Right. Any other questions? Yeah. Question from the audience: is your solution intended to enable both of these cases, or are you suggesting that operators choose one or the other? Well, we would recommend approach two, but it would be the bigger change. So, yes, certainly both can be done; there's nothing that is going to stop it, and based on the client version, we can just decide whether the proxy has to be involved in the striping or not. Exactly, yeah. Any other questions? Yeah. Sorry. Question from the audience: I'm just wondering, if you modify a single stripe, stripe number five, let's say, how do you then deal with generating the ETag for the whole object if I just want to do a HEAD? If I understand the question right, you're saying multiple clients are updating? No, no: I have my object, and it's built from 10 stripes. OK. And then, as far as I understand, I can modify one of those stripes. Yes, you can, yeah.
So I'll modify stripe number four. OK. Now I just want to do a HEAD for the whole object, which returns the ETag, which is basically the MD5 sum of the whole object, right? So how do you deal with that case? Because you need to regenerate the ETag, right, if I modify a single stripe. So, we are not really storing the fingerprint at the object level; we are saying the fingerprint will be stored only at the stripe level. Right, but still, if I just do a listing, I get the information that my hash is one specific value, right? So if I do a listing for the object, which contains all the stripes, it should contain the proper MD5 sum; otherwise, if I do a listing and then GET after GET, my validation of the MD5 sum will be incorrect. Right. So, essentially, if we go by approach two, then possibly even the client does not really require any object-level MD5 sum. Yes, but the client may be unaware of that. Right; in that case, possibly something like an incremental hash function would have to be implemented. Fine. Following this are just the use cases; I'll take just a minute or so. Because striping and collation are now supported, object storage can be used as a volume. That is the first use case. Second, we were also thinking that once that is done, we can possibly put it behind a SCSI target: the Swift client goes into a SCSI target, and that SCSI target then exposes those volumes over various fabrics, like iSCSI, FC, FCoE, anything. So essentially, it can be extended to that level too. Yeah, I mean, striping certainly helps build it, or with striping we can certainly help build it; possibly I'm not getting your question. Yeah, maybe. So, the presentation will be available at a URL which will be published, and if there are any questions, you can certainly mail me at my official ID, shriram.poray at calsofftinec.com.
So, to finish the use cases: a redirect-on-write snapshot can be implemented; moreover, continuous data protection can be implemented; and something like TRIM, or UNMAP, can really be supported, which you'll find relevant with SSDs today. And finally, something like an object-based file system. So maybe you can read through this presentation, and if you have any questions, offline or otherwise over mail, I'll be more than happy to answer them. Thanks.