Okay, so hello everyone. My name is Eran. I am the unofficial PTL for Storlets, unofficial because it's not yet an official project. With me is Yosef Moatti from IBM, and this work represents both use cases and work that we've done with GridPocket, whose CEO Filip could not make it. GridPocket is a company that does personal smart grid solutions, and this talk is about what types of use cases we can do with OpenStack Storlets.

So first of all, what are Storlets all about? Storlets are about collocating storage and compute. What does it mean to collocate storage and compute? It means that if you have a lot of data you need to process, then instead of copying the data over from the storage cluster to the compute cluster, you put the compute near the storage and you don't have to move any data: you bring the compute to the data. More specifically, Storlets are about collocating Dockerized computations inside OpenStack Swift nodes, in a serverless fashion. So I've said three things: I've said Swift, I've said Dockerized computations, and I've said serverless fashion. Let's talk a little bit about each of them.

Swift is a massively scalable object storage system that takes care of data redundancy through, for example, replication across failure domains. At its base is a very basic API: put a data blob, or get a data blob. It's very simple, but on the other hand it is massively scalable, both in terms of the capacity of data it can hold and the number of concurrent users that can access the data.

Docker: well, I don't need to explain what Docker is. The reason we're using Docker is so that the compute that runs inside the storage system is done more securely and in isolation, because after all we need to make sure that the storage system continues to work.

What do I mean by serverless fashion?
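Before answering that, the put/get blob API just mentioned can be sketched with the Python standard library. The endpoint, account, container, object name, and token below are all placeholders; the requests are only constructed, not sent.

```python
import urllib.request

# Placeholder Swift endpoint and credentials, for illustration only.
BASE = "http://swift.example.com:8080/v1/AUTH_demo"
TOKEN = "AUTH_tk_placeholder"

def swift_request(method, container, obj, body=None):
    """Build (but do not send) one of Swift's two basic object requests:
    PUT a data blob, or GET a data blob."""
    return urllib.request.Request(
        url=f"{BASE}/{container}/{obj}",
        data=body,
        method=method,
        headers={"X-Auth-Token": TOKEN},
    )

put_req = swift_request("PUT", "meters", "day1.csv", body=b"id,kwh\n1,0.7\n")
get_req = swift_request("GET", "meters", "day1.csv")
```

Everything else (authentication, replication, placement) is Swift's job; the client only ever sees this blob-level interface.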
The idea here is that the end user brings his own computer program and uploads it to Swift as if it were a regular object, and we take care of the rest, without the user needing to do any server-side configuration. So basically this is what we're doing: we're allowing Dockerized compute within OpenStack Swift.

What is it good for? One thing it is good for is big data. What happens when the big data gets big, and what does that mean? It means that the big data no longer fits into the expensive primary storage, and so we move it to a secondary storage. Don't get confused: the data didn't shrink, it's the storage system that grew larger. But when we move the data to secondary storage, it doesn't mean that we no longer want to produce information out of it. We still want to query that data, and we want to do it efficiently. By efficiently we mean that we don't need to copy the data back to the primary storage in order to do something. In case you're confused here, the boat is going from the right-hand side to the left-hand side. Storlets can give us just that: we can put the compute where the big data is.

We'll now move to a concrete example of this use case, given by Yosef from IBM.

Hi. So in the next 10 minutes we will discuss GridPocket's use case. We will see the problem, our solution, and performance measurements, both the test beds and the results, and then finally we will have a short demo. So what's the problem? The problem is data ingest. GridPocket is a smart energy grid company. Its platform supports up to millions of smart meters, and these meters produce tens of terabytes of CSV data per year. Now, how do we analyze all this data?
The infrastructure is composed of a Spark compute cluster and a Swift object store, which are disaggregated clusters. So now the problem is that for each SQL query that you run, you have to ingest terabytes of data. What could we do about that? Could we cache the data? A very good idea; the problem is that in general you don't have the necessary terabytes of memory, so you can't do that. Could we index the data? Also a good idea, but the problem is that data scientists evolve their SQL queries, so you just can't do it. So what happens is that when your infrastructure grows, both in terms of the Swift cluster and the Spark cluster, you have to grow your network as well, and this is a problem.

So what's our solution? Our solution is to bring to the Swift object store part of the traditional database smartness, more specifically, user defined functions. What does that give? It gives the user the possibility to run, at the data side, code that he wrote. Specifically, in our case we want to filter out unwanted data, and we want this to be done at the data side, not after bringing the data to the Spark, or whatever compute, side. So here is our solution: we bring to the Swift side the ability to filter out the data by writing code that runs there.
That's what we did: a storlet which filters the CSV data according to whatever we want, passed as parameters. Obviously, we also had to modify the Spark side, and we did that by extending the spark-csv library so that it now implements the necessary APIs which permit pushing the filter down to the Swift side.

Now we will present experiments which we did at OSIC. OSIC is the OpenStack Innovation Center. We had a large and very strong cluster: a strong Spark cluster and a strong Swift cluster, linked through a load balancer at 10 gigabits per second, and very quickly we saw that, as we explained previously, we experienced a bottleneck at the connection between the two clusters.

Before we present the results of our experiments, what data did we use? We used real obfuscated data, composed of ten columns, where each row was about one hundred characters. We used two kinds of queries: real industrial queries from GridPocket, and synthetic queries, which we composed in order to target a given selectivity. The queries focus on ingest, that is, ingesting the data from the object store to the Spark side. Therefore, we do not present results for queries which have long post-processing. For instance, you could run machine learning at the Spark side, which connects to SQL, brings the data, does some SQL processing, and then uses that for one hour of machine learning computation; that is not our focus.
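The filtering storlet itself isn't shown in the talk. As a rough sketch of the idea: Python storlets are classes invoked with input handles, output handles, and invocation parameters, so a stripped-down CSV filter in that shape, with made-up parameter names and a much simpler predicate than the real one, could look like this:

```python
class CSVFilterStorlet(object):
    """Illustrative only: stream a CSV object and emit only the rows whose
    numeric value in a given column is at least a given threshold. The
    column index and threshold arrive as invocation parameters, which is
    how the filter is 'pushed down' to the data side."""

    def __init__(self, logger):
        self.logger = logger

    def __call__(self, in_files, out_files, params):
        col = int(params["column"])       # hypothetical parameter names
        low = float(params["min_value"])
        src, dst = in_files[0], out_files[0]
        dst.write(src.readline())         # always keep the header row
        for line in src:
            fields = line.rstrip(b"\n").split(b",")
            if float(fields[col]) >= low: # keep only the wanted rows
                dst.write(line)
        src.close()
        dst.close()
```

With, say, 90 percent selectivity, only one tenth of the bytes ever cross the network to Spark, which is the whole point of the approach.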
So we used queries with small post-processing. Here is a GridPocket query, where you can see a row selection, and also the column projection with the date and index columns, which means that two columns will be used out of the ten columns of the data.

What did we measure? We measured mainly two things. The first one is the speedup: how long does it take to run an SQL query without pushdown versus with pushdown, what's the ratio? And the second is the utilized resources.

Okay, so this first chart shows that for three terabytes, the leftmost column, it takes more than one hour just to run a simple SQL query. That means you have to wait for one hour, which is unacceptable for data scientists, who expect to analyze the data. By the way, all the plots presented belong to research work under submission.

Now, what speedup did we experience? For half a terabyte, if we are at 90 percent data selectivity, we see that we have more than a 10x speedup. If we move to three terabytes, then we have an even better speedup. The important point to notice is that in the real world the selectivity is typically higher than 80 percent, so our work gives a very big speedup for real-world queries. This chart shows that for queries with a selectivity close to 99 percent, or even more than 99 percent, the speedup went up to 30 or even more. I will skip this chart for lack of time.

Now, resource consumption. At the Spark side, there is no big difference in mean memory used. In terms of CPU, when you don't have pushdown you use on average more than twice the CPU that is used with pushdown. Average bandwidth is three times higher.
I forgot, by the way, to mention that all these numbers pertain to a specific query against a three terabyte data set, with a selectivity of 99 percent, for which we experienced a speedup of 20. So the average bandwidth is three times higher without pushdown. At the Swift side, there is no big difference in terms of memory; however, much more CPU is used on average with pushdown. This is clear, because at the Swift side we are running the storlet, and it takes CPU. Now, what is important to notice and remember is that the query duration is 20 times longer without pushdown. So when we say, for instance, that the average bandwidth is three times higher without pushdown, it's not only three times higher on average, it is also sustained for 20 times longer. That's a very big difference.

Now we move to the demo, which is a recorded demo. We had that nice test bed for three weeks; after that we came back to our original test bed, which is a bit different. So the demo has been recorded on a small test bed with three low-end Spark machines and three low-end Swift object nodes. We also used a small data set, because we had to show this quite swiftly, and with this setup the speedup is also quite modest, three to four, which is nice, but not the 30 that we saw. What we will see in this demo is in fact a comparison of running a series of queries without and with pushdown. So I'll play it and I'll talk. Thank you.

What we see at the upper hand is the loading of the Swift data sets, and then we have a series of queries. They first run with pushdown, and the time each takes is printed after it completes. Then the GridPocket guys use the results for a graphic demonstration, as you can see. So here you see that it took 20 seconds, and this is in real time. However, when we move to the no-pushdown run, we
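The compounding of those two ratios is worth spelling out with the rounded figures just quoted: the no-pushdown run moves data at roughly three times the bandwidth for roughly twenty times as long, so the total bytes crossing the network differ by the product of the two.

```python
# Rounded figures from the talk, for the 3 TB / 99% selectivity query:
bandwidth_ratio = 3    # avg network bandwidth, no-pushdown vs pushdown
duration_ratio = 20    # query duration, no-pushdown vs pushdown

# Total data moved scales with bandwidth * time, so the ratio of bytes
# moved is the product of the two ratios.
bytes_moved_ratio = bandwidth_ratio * duration_ratio
```

So, with these rounded numbers, the no-pushdown run pushes on the order of sixty times more bytes through the inter-cluster link.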
shortened the times in the recording, because without pushdown it takes long. So here we progress through all the queries; 21 seconds for this one. Perhaps we need to mention that this recording was done by the GridPocket team, and of course the queries and the IPython notebook on which it's running; yes, I forgot to mention Jupyter. Now we're almost done, and now we re-run the queries without pushdown, and this part is accelerated, not real time. And finally, you can stop here, thank you. Here is the interesting part: in fact we ran twice without pushdown and twice with pushdown, and you see the time comparison for each of the queries. That gives you the speedup of about three. All right, thank you very much.

Okay, so this was the big data related use case. I'll now move to the next use case. Oh, by the way, I forgot to mention that this is all in the open. From the storage side of things, this is part of the OpenStack Storlets repository, and from the Spark side of things there is this repository: the code, the tests, the README with how to set it up and test it, and so on and so forth.

Okay, so the next use case is data privacy. Thank you. So it's data privacy, not data piracy. The idea is that we're holding a sensitive piece of data in the object store; however, we may benefit from sharing this data once it's obfuscated. So perhaps sharing with pirates isn't exactly beneficial, but in other cases sharing is, such as in another use case we have from GridPocket. Smart meters are here: as evident from this quote, there were over 60 million smart meters deployed by the end of 2015, and that's only part of the deployment. And they're not only here, they also produce data.
So with a low-resolution collection we can get to 10 terabytes of data per year, and with a higher resolution it can get to tens of petabytes. However, this data is highly sensitive, as evident from this graph here. Actually, just by looking at the data one can tell whether a refrigerator is turned on, or an oven element is turned on, and Filip once told us that with a high enough resolution one can tell which channel is being watched on TV. So definitely this is sensitive. This is also made clear by this White House report, which says that smart meters can turn homes into transparent fish tanks, completely penetrable to marketers, police, and criminals. As you know, in the EU, in Europe, data sensitivity is something important, and it is expected that in the year 2018 there are going to be regulations that will impact energy service providers, which will have to make sure the data is safe.

So how can Storlets help? Here is what we can do. Oh, sorry, most important: on one hand there's a lot of data; on the other hand it's sensitive. However, sharing it might be beneficial, as I've mentioned before. It may be beneficial to share it with the provider, which can improve the service if it has the data. Alternatively, if we give the data to a third party that runs an energy efficiency tool on it, we may save money: it can recommend to us how to consume electricity in a smarter way. So on one hand it's sensitive; on the other hand it might be beneficial to share it. So what can we do?
We can use storlets to obfuscate the data, so that the utility company, for example, doesn't get the raw data, but rather an average, or some low-pass filter, that makes it less sensitive. This way, if the data owner on the left-hand side allows access only via a storlet, then the utility user wouldn't be able to get the data as is, but would be able to get it obfuscated via a storlet. We're going to demonstrate that. What the demo will show is that the data owner can give access via a storlet to a user, that the user cannot access the data before it is given access, and that it only gets access via the storlet.

The demo is going to use a Firefox REST client plug-in. We're going to use the same Firefox instance to drive the requests of both the data owner and the utility user, so to make sure we differentiate the two: for the data owner we're going to use an X-Auth-Token starting with B. X-Auth-Tokens are the standard credential that a user needs to show a service before it can call its API. So the data owner's X-Auth-Token is going to start with B, and the utility user is going to have an X-Auth-Token starting with C.

Okay, so let's move to the demo. Here is the Firefox plug-in. Before we start, I'll just show you how the data looks. I thought that instead of showing average numbers, it would be nicer to do this on pictures. So this is the data owner; as evident here, you can see that the X-Auth-Token starts with B. We're doing a GET request; this is how a GET request looks in Swift. We're going to look at a dragonfly picture that is in my "objects" container. So let's run this query, and we've got this nice, oops, how do we share it, this nice creature here. So we want to protect this nice creature's identity, and so we won't let the utility user access it and see the face. Let's try to do a GET request by the utility user. Here's the request: it is a GET to the same place, but here, as you can see, the X-Auth-Token starts with C.
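The requests being driven in this demo can be sketched in code; the URL, the tokens, and the storlet name below are placeholders, and X-Run-Storlet is the invocation header that asks Swift's storlet middleware to run a storlet over the object on the way out.

```python
import urllib.request

# Placeholder object URL for the dragonfly picture in the demo.
URL = "http://swift.example.com:8080/v1/AUTH_demo/objects/dragonfly.jpg"

def get_request(token, run_storlet=None):
    """Build (not send) a Swift GET; optionally ask for a storlet to be
    run over the object by adding the X-Run-Storlet header."""
    headers = {"X-Auth-Token": token}
    if run_storlet:
        headers["X-Run-Storlet"] = run_storlet  # e.g. the blur storlet
    return urllib.request.Request(URL, headers=headers, method="GET")

owner_get = get_request("B-owner-token-placeholder")
utility_get = get_request("C-utility-token-placeholder",
                          run_storlet="blurstorlet-1.0")
```

In the demo, the plain utility-user GET comes back forbidden, while the same GET with the X-Run-Storlet header succeeds and returns the blurred picture.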
So this is done on behalf of the utility user. Let's do a send here, and: forbidden. Let's now also try to add the X-Run-Storlet header. Now I'm trying to do the same GET with the extra header that tells Swift to run the blurring storlet there. Let's try that. Oops. Okay, it won't work at first, because I forgot to delete the old credentials. So here it is, once the storlet is executed.

Okay, what I didn't show was this request here. This request is the request that needs to be run by the data owner so that the utility user gets access. This is a POST request targeted at the container; in Swift, ACLs are done on the container level rather than per object. And we can see here, this is the data owner's request: with this header it says, give the utility user read access via a storlet, and the storlet needs to be the blur storlet. This was done beforehand, and this is why we saw that we can succeed with the storlet. All right, so this was the data privacy use case.

Presenter mode, one second. And now I move to what I call the cheap bakers use case. The idea behind the cheap bakers use case is that for some workloads it might be more beneficial, in terms of price, to invest in more CPU on the storage side rather than to copy the data to the compute side. I've tried to develop a concrete pricing model that can help decide, for a certain workload, whether it would be more beneficial to add more CPU to the storage side rather than move the data across the network. It's very initial; I would love to get feedback on it, and I already have some improvements that I need to put in. It's really initial.

The next use case is what I call the super user use case. In the super user use case, a deployer might want to add some functionality through storlets to the base system. Why would you want to do that? Maybe it's proprietary functionality.
Maybe it's something that cannot go upstream for any reason. So we can do this via storlets. There is a European project called IOStack that we've been working with. The IOStack project actually took this idea one step further: not only do they add more functionality via storlets, they also add a layer of policy that can give different SLAs to different users, so that a certain user would be able to use this type of functionality and another user that kind of functionality. This is going to be presented later today, at a quarter to three, in the Brown Bag session.

Okay, some short history and status of the project. It was first opened by IBM in August 2015. Then, from December to June 2016, Kota and Takashi from NTT did a major refactoring work and became core members of the project. They also added the unit testing that was missing. Then in July we added new documentation, and we added functionality so that storlets can create new objects from existing objects. We've done a lot of work around Spark integration. In August we added various additional functionalities, written here; I won't go over them. Then we approached the OpenStack TC to become part of the Big Tent. They were really positive about it; we were given some homework and a mentor. We did pretty much most of the work, and we'll approach them again once we're done and we have a stable Newton release. Questions?

[Audience] Thanks, Eran, for the great session. One question on the security demo: do you have custom middleware to enable the default blurring storlet?

[Eran] So, you were asking whether the storlet also does the face recognition, or just the blurring? Okay, it's a good question.
So, I had the face locations as metadata. You can think of it as there being another storlet that does metadata extraction, metadata enrichment, so that when you upload, you actually do a PUT with an extra storlet; we implemented this before in other demos. That storlet would identify the face (I didn't try it on insects, though) and would run and enrich your Swift metadata, and the storlet that I actually ran looks at this metadata and does the blurring. So just the blurring. Oh, by the way, it is done with OpenCV. Any other questions? I guess okay, so thank you very much.
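The demo's blur used OpenCV; as a dependency-free sketch of the idea, here a face bounding box (as it might arrive in object metadata) selects a region of a tiny grayscale "image", and that region is replaced by its mean value. The metadata header name in the comment is purely hypothetical.

```python
def blur_region(pixels, box):
    """Replace the pixels inside box = (top, left, bottom, right) with the
    region's mean value: a crude stand-in for the OpenCV blur in the demo."""
    top, left, bottom, right = box
    region = [pixels[r][c] for r in range(top, bottom)
                           for c in range(left, right)]
    mean = sum(region) // len(region)
    for r in range(top, bottom):
        for c in range(left, right):
            pixels[r][c] = mean
    return pixels

# Face location as it might arrive in Swift metadata, e.g. a header like
# X-Object-Meta-Face-Box: "0,0,2,2" (illustrative name, not the real one).
img = [[0, 2, 9],
       [4, 2, 9],
       [9, 9, 9]]
blur_region(img, (0, 0, 2, 2))
```

The blurring storlet in the demo does the same thing at a larger scale: read the box from the metadata the enrichment storlet stored, blur that region, and return the modified image, never the original.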