All right, welcome to the session on adding metadata search to OpenStack Swift. This will be a talk with my co-presenters Eran Rom and Nilesh Bhosale, all of us from IBM, and this was joint work with Paula Ta-Shma and Guy Hadash from the Haifa lab of IBM.

The first thing we want to do is go over what object metadata is and why it is useful. The second thing is, now that you have metadata in your object store, how do you search it, and what are the use cases and benefits of having such a feature? Then we'll have a short demo of some work that the team has done, and go through some of the implementation and future work.

All right, so what is metadata? You know, it's metadata, right? It's pretty generic, but in the context of object stores metadata is beginning to have a lot more importance. There's user-defined metadata: this is metadata that users, applications, all of us are adding onto our objects and onto the system to describe what we are putting in. There are hundreds of petabytes now. What the hell is it? That's where a lot of the user metadata comes in. But if you're a system administrator, system metadata is probably what you spend your time looking at: which objects came in, what time they came in, all of the aspects of the system that you might be trawling through to help manage the system in a better way. So I think a key point here is that metadata is not just something you might throw away because the data is the important thing, right?
So the metadata is just as important as the data, and you're really building up two sets of data. If you think about it this way, the metadata is the structured part, and it's becoming a structure over all of your massive unstructured data. That structure is what we're leveraging in this talk for metadata search. There are a lot of different examples. There are the basic things that you and I all know about, like the metadata on our images and in our music files. But a lot of scientific areas are now defining their own metadata formats and adding them onto their systems, and this is metadata as well. All of these structured formats provide more structure to the massive amounts of unstructured data, and this is something we can index and leverage.

All right, so in Swift. Swift and object stores in general are a little unique with respect to metadata. If you've used file systems, I'm tempted to take a poll about how many people use xattrs, right? A lot of people sort of know what they are, but how many people have actually set xattrs as a file application developer? Whereas in object stores today, probably every single person who has written some kind of object app has already used metadata on those objects to describe them in some way. So it is integral to the use of objects. In Swift, you can set user metadata at several levels of the hierarchy: at the highest level, which is the account; on the container, which lives inside the account; or on the objects themselves. And you can delete metadata as well.
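As a minimal sketch of the three levels just described: Swift carries user metadata in HTTP headers whose prefix depends on the level (`X-Account-Meta-`, `X-Container-Meta-`, `X-Object-Meta-`). The helper name `metadata_headers` and the sample keys are hypothetical and only illustrate how such headers could be assembled before a request is sent.

```python
# Header prefixes Swift uses for user metadata at each level of the hierarchy.
META_PREFIX = {
    "account": "X-Account-Meta-",
    "container": "X-Container-Meta-",
    "object": "X-Object-Meta-",
}


def metadata_headers(level, items):
    """Build the HTTP headers that carry user metadata for one level.

    `level` is "account", "container", or "object"; `items` maps
    metadata names to values. (Hypothetical helper, for illustration.)
    """
    prefix = META_PREFIX[level]
    return {prefix + key: str(value) for key, value in items.items()}


# Example: headers for tagging a photo object with a city and a time of day.
photo_headers = metadata_headers("object", {"City": "Rome", "Time": "day"})
```

These headers would accompany a PUT or POST to the account, container, or object URL; the sketch deliberately stops short of making the request.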
Obviously, system metadata is then the inherent metadata that is tracked as part of the system: when you uploaded the objects, and a lot of the other internal aspects of the system.

There are some interesting semantics in how Swift handles metadata. For accounts and containers, just as you add new objects to a container, you can add new metadata to that container as well, and keep appending more and more of it. With objects, though, things work on a whole-object basis, and it's the same concept for metadata. When you add new metadata, you're setting all of the metadata for the object at once. You're not adding additional metadata items; you're uploading all of the metadata for that object. If you want to update it, you upload all of the new metadata for that same object. So if you just want to add one piece of metadata, you read all the old metadata, add your new item, and then upload it all back. There are also some semantics around how copy works. And if you want to retrieve the metadata, you use the HEAD command on your account, container, or object, and you can view all the metadata for that element.

So metadata search, well, it is basically what it sounds like. The goal is to take the structure that you're building with all of this metadata, index it, and provide a simple, easy way to access all of the objects in your system. The goal is also to provide a REST API for the searching, just like everything else in OpenStack, and make it easy to use. An important point is that there could be other implementations out there, and in fact some version of search is already available in the IBM SoftLayer Swift object store.
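The difference between container and object metadata updates described above can be sketched with plain dictionaries. This is an in-memory model only, not Swift client code: the function names are hypothetical, and each function stands in for what a POST at that level does to the stored user metadata.

```python
def post_container_metadata(stored, update):
    """Container (and account) POST: new items are merged into what is stored."""
    merged = dict(stored)
    merged.update(update)
    return merged


def post_object_metadata(stored, update):
    """Object POST: the supplied set replaces all stored user metadata."""
    return dict(update)


def add_object_metadata_item(stored, key, value):
    """Read-modify-write needed to add one item to an object."""
    full = dict(stored)   # step 1: read the existing metadata (HEAD)
    full[key] = value     # step 2: add the new item locally
    return post_object_metadata(stored, full)  # step 3: upload everything back


photo = {"city": "Rome", "time": "day"}
# Posting only the new item would silently drop city and time:
lossy = post_object_metadata(photo, {"tags": "John"})
# The read-modify-write keeps the old items:
safe = add_object_metadata_item(photo, "tags", "John")
```

The `lossy` result is why an application that wants to append a single item to an object has to carry the old items along in the same request.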
All right, so I've already given a little background on why metadata search is valuable. It sounds dramatic, but if you have hundreds of petabytes, or even just petabytes, of objects out there and you have no way to find anything, it's almost like the internet back in the day, before Google, when we had to index everything ourselves and remember where everything was in order to find it. It's the same with object stores today, to some degree. It's like having a Linux system with all of this file data but no find command. You have to remember where everything is, or you have to maintain an index in an application external to the system. So the goal here is to support large object stores, with millions or billions of files, and still find that needle in the OpenStack, if you will. This can help users with their data and their user metadata, and it can help administrators who want to search the system metadata to manage the system.

As for use cases, once you have the ability to find what you need quickly, there are a lot of them, and we're going to go through several. Take data mining or data warehousing: effectively you're saying, I have all of this data, so what are the important parts that I want to analyze? If I'm using Spark or any other framework out there, what are the key pieces that I can bring in, without having to read in all of the objects, scan all of the metadata, and only then find out which are the important ones that I really want to look at?

All right, I think that's my intro, and Nilesh will take over.

Thanks for that, Dean. So Dean gave the background on metadata and what search means for metadata.
So we'll go over some of the use cases. To start off, I'll go over a couple of sample use cases to get everyone on the same page, and then we'll go over a couple of real-world use cases that we are working on.

The example shown here is an advanced photo album, where users upload their photos and also tag them, that is, put metadata on these objects. In this case it is very simplistic metadata added on top of the objects, like the name of the city and the time as day or night. Then we have the search query down there: get all the objects that have city equal to Rome and time equal to day. This is a complex query in the sense that it has two constraints, and it returns the objects whose metadata matches the query. Then there is another search query here, with time equal to night, and you get two objects back because there are two photos with time equal to night.

That is a very simplistic example. To make it a bit more complex, we can do searches based on date ranges, free-text matching, and integer comparison, as shown in this example. Here you are searching your photos, in the "my photos" space (a container or an account), based on tags like John, Bob, or Alice. This is free-text matching. Photos uploaded to the object store can be tagged with the names of the people in them, as we do on Facebook. Those tags are stored in the object store, and you can search on them with free-text matching. The tag could be John Dickinson, the person's full name, and you just search on John. So it's a kind of substring or free-text matching.
That is what you are doing here. Then you can have dates associated with the photo or object, and you can do date-range searches. In this example you are searching for objects uploaded, or dated, within a range running from 2012 to 2013. All of these complex searches are possible with the metadata search APIs.

Now let's look at some real-world examples, as I said. This one comes from RAI, a television broadcaster in Italy. They put their video files into the object store. They have some metadata associated with the files, but we also run tools within the system to enrich this metadata. In this example we are using storlets. Storlets are a technology that IBM has contributed to the open source community, with which you can run an engine where the data resides and process that data in place. Here we are using a metadata-enrichment storlet that looks at the data and calculates a loudness value. In a typical video or audio application, each object can have a different loudness value, right? The storlet calculates it for the objects being uploaded to the object store and adds it as metadata on each object. Let's see how this helps going forward. You can search for objects with a faulty loudness value. Since the object carries the enriched metadata, the loudness value, you can easily search.
Give me the faulty objects, say those with loudness less than minus 15 or something like that, and it will return them. You can do further processing on those objects or whatever you want. This is how the search capability really helps you in this example.

Let's go on to another use case, where we show how metadata search helps analytics applications. In this example a Swift object store is used as the back-end store for the objects, and you run Hadoop or Spark analytics applications on top of that data. You use Spark SQL, which has SQL syntax, and run a SQL query: give me the objects with time frames from 8 a.m. to 12 p.m. You get back just that set of objects and then do further analysis or processing on it. So you have, say, a machine learning algorithm running over your objects, and you want to do the processing on only a subset of the data. Search really helps here too. In this particular example, the SQL query is translated internally into metadata search API calls and search queries, and you get back the responses. The slide also shows the advantage in time spent and the improvement there. This is the use case I was talking about: just give me the objects whose metadata falls within this range.

The next example is again a real-life scenario, which we are developing with the EMT bus service in Madrid. This is the IoT use case. The search capability lets you understand the traffic on a particular day in a particular time slot, and that can be used for further analysis and for planning future events.
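The kinds of criteria used in the examples so far (equality on city and time, free-text matching on tags, a numeric comparison on loudness, and ranges) can be illustrated with a small in-memory filter. This is only a sketch of the matching logic, not the search API from the talk: the operator names, the `search` helper, and the sample objects are all hypothetical.

```python
# Each operator takes the stored value and the value from the query.
OPS = {
    "eq": lambda actual, wanted: actual == wanted,
    "lt": lambda actual, wanted: actual < wanted,
    "contains": lambda actual, wanted: wanted.lower() in actual.lower(),
    "between": lambda actual, wanted: wanted[0] <= actual <= wanted[1],
}


def matches(metadata, criteria):
    """True if the metadata satisfies every (key, operator, value) criterion."""
    return all(
        key in metadata and OPS[op](metadata[key], value)
        for key, op, value in criteria
    )


def search(objects, criteria):
    """Return the sorted names of objects whose metadata matches all criteria."""
    return sorted(name for name, md in objects.items() if matches(md, criteria))


photos = {
    "p1.jpg": {"city": "Rome", "time": "day", "tags": "John Dickinson, Alice"},
    "p2.jpg": {"city": "Rome", "time": "night", "tags": "Bob"},
    "p3.jpg": {"city": "Paris", "time": "night", "tags": "Alice"},
}
videos = {"clip-a": {"loudness": -20}, "clip-b": {"loudness": -10}}

rome_by_day = search(photos, [("city", "eq", "Rome"), ("time", "eq", "day")])
with_john = search(photos, [("tags", "contains", "john")])
too_quiet = search(videos, [("loudness", "lt", -15)])
```

A real deployment would push these predicates down to the index rather than scanning metadata in application code, which is the point of the feature; the filter here only makes the semantics of the criteria concrete.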
So in this example, what happens is this. You have IoT devices mounted on all the buses running in Madrid, operated by the EMT bus service. These devices emit IoT logs, which are forwarded to centralized IoT servers. From the logs on those servers we create objects, put them into the object store, and associate some metadata with those objects.

Some more details on this. This is how it looks. The buses are here, and there are logs such as the current location of the bus, whether the door is open or closed, and the current time. All of these are generated by the devices and stored on the central server. Then a storlet runs in the object store, takes pieces of these IoT logs, and creates objects out of them. IoT logs come in continuously, so you cut them into chunks and store the chunks as objects, and on each chunk you define some metadata. I'll go over that metadata: for example, the start time of this bus trip section. A bus trip can be long, and you chunk it into multiple sections, right? Each section has a start time, an end time, and coordinates, a starting coordinate and an ending coordinate, the geo points as we say.

Then we'll go over a demo. In this demo we have integrated a Google Maps kind of application into a web application, where you give the object storage URL and an authentication token, because a token is required even for metadata search queries. Then you provide the container in which you have stored the objects, the pieces of the IoT logs, right?
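The chunking step described above can be sketched as follows: split a stream of log records into sections and derive per-section metadata (start time, end time, and the two corner points). The record layout, the field names, and the fixed chunk size are assumptions made for illustration; the talk does not specify how the storlet cuts sections.

```python
def chunk(records, size):
    """Split a list of log records into consecutive sections of `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def section_metadata(records):
    """Derive the metadata for one section of a bus trip.

    Each record is assumed to be a dict with "time", "lat", and "lon".
    Top left is (highest latitude, lowest longitude); bottom right is
    (lowest latitude, highest longitude).
    """
    lats = [r["lat"] for r in records]
    lons = [r["lon"] for r in records]
    times = [r["time"] for r in records]
    return {
        "start-time": min(times),
        "end-time": max(times),
        "top-left": (max(lats), min(lons)),
        "bottom-right": (min(lats), max(lons)),
    }


# Hypothetical log records from one bus (ISO timestamps sort correctly as strings).
log = [
    {"time": "2015-10-01T08:00:00", "lat": 40.41, "lon": -3.70},
    {"time": "2015-10-01T08:05:00", "lat": 40.42, "lon": -3.69},
    {"time": "2015-10-01T08:10:00", "lat": 40.43, "lon": -3.71},
    {"time": "2015-10-01T08:15:00", "lat": 40.44, "lon": -3.72},
]
sections = [section_metadata(part) for part in chunk(log, 2)]
```

Each entry in `sections` would become the user metadata of one stored object, which is what later makes the bounding-box and time-range searches possible.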
What we are going to do here is draw a geo bounding box on this map, with geo coordinates for the top left and bottom right. That translates into a search query, which runs behind the scenes and brings back the objects that match it.

So let me play this demo. Okay. The object storage URL, the token, and the container are entered here. What this essentially shows is the rich capability you can have with metadata search. You have data types associated with the metadata, so you can do geo bounding box searches, date ranges, timestamp ranges, all these things. Now we are drawing the geo bounding box. This is the bounding box, and the results come back. As you see, there are all these different colored boxes within the big box. These are the search results, and each box represents a section of a bus trip. You see lots of them being returned within this time frame, which ends at 12 p.m. Now we reduce that to 9 a.m., and you can see there are fewer search results because you have narrowed the time frame. We change the time frame again, to 3 a.m. to 5 a.m. This is very early morning, and you see there are not many trips happening in this bounding box. We widen it again, and you see more being returned.

Now you click on one of these boxes, which represents one search result, one object, and we'll see what details are there. When we click on a box, it shows this section of the bus trip. This is the route that the bus followed. You can also see the object URL. This is the object that was found and returned (091). Then we'll see what data is inside this object and what the headers, the metadata, are.
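The matching that the demo performs can be sketched as a predicate over section metadata. The rule that both corner points of a section must fall inside the drawn box is taken from the talk's description of the query; treating the time filter as full containment of the section in the chosen window, and using whole hours for times, are simplifying assumptions. The function and key names are hypothetical, and the sketch ignores boxes that cross the antimeridian.

```python
def point_in_box(point, box_top_left, box_bottom_right):
    """True if a (lat, lon) point lies inside the box given by its two corners."""
    lat, lon = point
    top, left = box_top_left
    bottom, right = box_bottom_right
    return bottom <= lat <= top and left <= lon <= right


def section_matches(md, box_top_left, box_bottom_right, t_from, t_to):
    """True if a bus trip section falls inside the box and the time window."""
    inside = (
        point_in_box(md["top-left"], box_top_left, box_bottom_right)
        and point_in_box(md["bottom-right"], box_top_left, box_bottom_right)
    )
    in_time = t_from <= md["start-time"] and md["end-time"] <= t_to
    return inside and in_time


# Hypothetical sections; times are hours of the day to keep the example short.
section_a = {"top-left": (40.43, -3.71), "bottom-right": (40.41, -3.69),
             "start-time": 9, "end-time": 10}
section_b = {"top-left": (40.50, -3.71), "bottom-right": (40.41, -3.69),
             "start-time": 9, "end-time": 10}
box = ((40.45, -3.75), (40.40, -3.65))  # (top left, bottom right)

morning = [s for s in (section_a, section_b)
           if section_matches(s, box[0], box[1], 8, 12)]
```

Narrowing the window, as in the demo, shrinks the result set: with a window of 3 to 5, neither section matches.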
This is the actual object, the data, which is a chunk of the IoT logs. Then we'll see what is in the headers, which show the metadata associated with this object. This is the metadata: the top-left and bottom-right geo points of this bus trip section, and the start and end times. As I said, these were put on top of the data by the storlets running inside the object store. These are the coordinates on which we searched, and these are the time frames.

That's pretty much it on the demo side. Next we'll get into the implementation details, how this is actually implemented behind the scenes. I'll invite Eran to do that.

Thanks, Nilesh. Let's see. It's a Mac. Wait, is my mic disabled? I'm also very jet-lagged, so do bear with me.

Okay, so let's talk about the behind-the-scenes of metadata search. As we mentioned, we need to cover two flows here. One flow indexes the objects' metadata, and the other serves search queries.

Let's assume this is the storage system's input data path. Every storage system has such a data path, which typically ends with some persistent storage where the data is written. On the data path we place our indexer, so that we can intercept requests for uploads of new objects or updates of metadata. Whenever the indexer intercepts an upload, for example, it copies the metadata out of the request and asynchronously sends it to a queue, so that the original request can continue along the input data path as soon as possible. The information is then picked up from the queue by an index and search cluster.

In our case, the input data path is the Swift proxy pipeline, and the indexer is a WSGI Swift middleware. The persistent storage on the right-hand side is the Swift storage tier. We use RabbitMQ for the queue and Elasticsearch for the index and search cluster. That is how we index, at a high
level. How do we serve search requests? Here is the output data path, which is also in the proxy pipeline, and we add yet another middleware, which we call the MD search middleware. Whenever a search query comes in, the middleware intercepts the request, synchronously routes it to the Elasticsearch cluster, gets a response back, and returns it to the user. As you can see, the storage tier is not involved here at all. We don't need it. We just need the results of the metadata search, and that information resides in the Elasticsearch cluster.

Here is a possible overall architecture of the system. We've got our Swift cluster, with the proxy nodes up there and the storage nodes down there, and an Elasticsearch cluster at the side. We've got a load balancer up there. Whenever a request comes in, it hits the load balancer, which routes it to one of the proxies, where we have the indexer with RabbitMQ to do the indexing, or the search middleware to forward query requests to the Elasticsearch cluster.

Let's dive a little deeper into the query API. Nilesh talked about it a little, but I'll go a little deeper. What we see here is a GET request from the bus demo. It's built like a typical Swift GET request. Here's the GET, this is the host name, then v1 (currently Swift has only v1), then the account name, and then the container. So the request is targeted at a container. If we had targeted the request at the account level, we would have searched all the objects in the whole account, not just in one container.

The query itself comes inside the query string of the request. We've got query equals, and then we look at the x object metadata top left item, which is the metadata we're interested in, and we're searching for whether it is in the bounding box. Right, so we have
two coordinates. Here is one coordinate, and here's the other. These define the bounding box that Nilesh mentioned. What we actually want is all the objects whose top left is in that bounding box and whose bottom right is also in the bounding box. This is how we got only the buses we were interested in. Then we have an extra header here, called x content search, which tells our middleware to kick in, intercept the request, and forward it to the Elasticsearch cluster. And that's it. There's a little redundancy between the query in the query string and the header, but that's an implementation detail.

What we can take from this example is the set of features. One feature is multi-criteria queries: here we have an AND between two criteria. We support various operators. In this example we use the in operator, but there are others as well; I'll mention that this one here stands for free-text search. And we support metadata types. Why is that important? So that we can use the in operator properly: it is treated differently when we're talking about coordinates than when we're using integers, right? If the top-left item were not a set of coordinates but an integer or a string, we would need to run something else behind the in operator. This is why we want to keep data types that define the values in the metadata items. I hope this point came across.

Okay, Dean, let's wrap up.

Thanks, Eran. So that's the system that they've built. Where do we go from here?
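The indexing flow from the implementation part of the talk (intercept an upload or metadata update in the proxy pipeline, copy the metadata out of the request, hand it to a queue, and let the original request continue) can be sketched as a plain WSGI middleware. This is a minimal illustration, not the team's code: the class name is hypothetical, only object metadata headers are copied, and Python's in-process `queue.Queue` stands in for RabbitMQ.

```python
import queue


class MetadataIndexer:
    """WSGI middleware that copies object metadata to a queue for indexing."""

    def __init__(self, app, index_queue):
        self.app = app                  # the rest of the pipeline
        self.index_queue = index_queue  # stand-in for a message broker

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") in ("PUT", "POST"):
            # WSGI exposes the header X-Object-Meta-City as HTTP_X_OBJECT_META_CITY.
            metadata = {
                key: value for key, value in environ.items()
                if key.startswith("HTTP_X_OBJECT_META_")
            }
            try:
                self.index_queue.put_nowait(
                    {"path": environ.get("PATH_INFO", ""), "metadata": metadata}
                )
            except queue.Full:
                pass  # indexing must never block or fail the upload itself
        # The original request continues down the data path unchanged.
        return self.app(environ, start_response)
```

A separate consumer would drain the queue and write the entries to the index, which is what keeps indexing off the request's critical path and also what makes the index eventually consistent with the store.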
I think that, first off, object stores are one interesting aspect here, but once you're building an indexed layer of metadata, you might want to use it with your other storage systems as well, your file system especially. We do a lot of work with the Swift on File project, integrating file and object into a single system, and in that case we want to be able to index all of the files as well as the objects in that single system. So we want to make sure that the back-end APIs being pushed are standardized, so that whether we use RabbitMQ and Elasticsearch, or customers want to use MongoDB or whatever, it can all plug into the single system. And of course, once you've built the system, you want to be able to visualize the index databases through Kibana and your typical efforts of that type.

There are also other things just starting to move inside the OpenStack community. There's been some work around notifications inside Swift, so we would like to integrate with those and see where they fit in this architecture. The OpenStack Searchlight project is also just starting. As far as we know, it's built on the same set of tools, Elasticsearch and RabbitMQ, that this prototype is built on, so hopefully there's a lot of nice overlap. I believe they're starting with Glance, and they have Swift on their roadmap. One of the real benefits there is a standardized search API. As Eran mentioned, there are a lot of different aspects to how this search API works, and as a user I want to write against an API that I can use anywhere, right? So I don't have to worry that if it's an integer in one project I describe it one way, and if it's an integer in another project
I describe it another way. Making sure that everything is standardized across projects is good for everyone, and that's something we want to follow up on as well.

Okay, so how do you get what we've built? We're initially releasing it with IBM Spectrum Scale and the object part of that product. It's going to come in a couple of different ways. One is through a virtual appliance, a try-and-buy type thing. Another is a white paper being released this fall, which will include the required code. That is the pieces of middleware that are required in the proxy server to do all the different functions. That's our current way of delivering it. But again, we want to work with the community, make sure we standardize all this, and get it out there moving forward. So that's everything. Thank you very much.

Any questions? Oh, yeah, sorry, your hand's up. Do you want to wait for the mic, or yell? Okay.

Have you given any thought to this? Take an object, say a piece of a map. One app could be adding metadata that says here's where the water pipe goes, another could say here's where the roads are, and a third could say here's where the IEDs have exploded, et cetera. Any thought on customizing the search engine so that it doesn't index every piece of metadata it sees, but only builds indexes for certain pieces of metadata that I'm interested in?

Right, right. Yeah, I think you need to build your custom schema for specific types anyway, so the more efficient you are in how you do that, the better the results will be, I think. Do you have anything to add?

During the work that we did with RAI, the Italian broadcaster, they were using really complex schemas to describe their metadata, and part of that was deciding which metadata to index and which not. So, yeah. Thanks.
Yeah, any other questions? Yes.

Hey guys, it's really great. We work on the Searchlight project.

Perfect.

Come to our session. We have Nova servers, Glance images, and Designate data now, and it's a very similar architecture. So we want to work with you for sure.

Thanks, that sounds great. Anything else? Yes, over there. Maybe just wait a second, the mic is almost there. Yeah, we'll get it on the video as well.

You talked about standardizing the interfaces, right? And some of the earlier presenters talked about the storlet functionality that is used to extract the metadata.

Yes.

So are you standardizing the storlet functionality, or what exactly do you plan to standardize?

Just a quick comment, and Eran can speak to this, but there's actually a session on storlets on Thursday morning. I don't remember the exact hour, I think nine or so.

So storlets, and I'll mention this in the talk, provide complementary functionality: metadata search can help you narrow down the number of objects you actually need to look at, and then you can use storlets to do the actual computation over those objects. I'll talk about it on Thursday.

Here the storlets were used just for extracting more metadata, right? So it's a different path. Yeah, go ahead.

Sorry. As Eran mentioned, storlets can be used in multiple ways. You can use storlets for analysis based on the metadata, or you can use storlets for enriching the metadata. One of the examples here showed storlets being used to add more metadata on top of the objects that you have, right? But storlets are not the only way. You can have your own custom tools to enrich the metadata. Storlets just happen to be one of the ways, right?
Yeah, it's actually an important point that we didn't really mention before. This is all built on Swift metadata, and Swift metadata is set either by users explicitly setting it, or it has to get in there one way or another. In the scientific communities, again, with the way a lot of the applications write data, the metadata is inherent in the objects themselves, right? So the only way this architecture works today is if you have a way to extract it and get it added. How metadata gets into the system is actually an interesting area: some users might want to set it explicitly, some might want to write storlets or other things to extract it, and there might be other ways for it to get into the system, to again provide that structure. As far as the metadata search API goes, storlets are just another part of it. They are not tightly coupled.

Yes?

Does Swift's eventual consistency model present any challenges in terms of representing the current state of what's in Swift?

Well, in the current architecture the Elasticsearch part is eventually consistent as well, because it is updated asynchronously. So it does present challenges, I would say, in the Spark example for instance. Let's say you have literally just uploaded the objects, and now you're searching for them and want to analyze them. It is possible that they haven't actually been written yet, because of the asynchronous nature: everything is happening independently, with all the issues that come with that.
So yeah, right now it's all happening independently, and therefore you could in fact end up with a search result that gives you an object that you're still waiting to actually appear in the system. So it is possible. But I think the key here is unobtrusive indexing. If you want to be a little more obtrusive, you could do it synchronously. Yes?

Is there a possibility of being permanently out of sync? Let's say you get two creates and two deletes. How do you know who actually won those races, inside the object store versus inside the indexing part? How do you ensure that the current state of the object store, the permanent final state, is accurately reflected in the metadata in Elasticsearch?

So Elasticsearch is a distributed NoSQL database, and it's near real time. It maintains its state across the cluster. You can have multiple nodes in the Elasticsearch cluster maintaining the index data, and it keeps the state consistent across the cluster. In the example shown, the Elasticsearch cluster was outside the Swift cluster, right? You can have it on top of the Swift cluster as well, but here it was outside, and it maintains its own state. So when objects are being uploaded via multiple proxy server nodes, the nodes can be temporarily inconsistent.
They will be eventually consistent, but there can be inconsistency in the meantime. When you upload an object through server B and you search through server A, even if the object is not yet synced, the query goes to the search database, which is in sync, and it returns you the results. But in certain cases it can happen that you get the results while the object is not yet there on that particular system.

Just a specific point on your question, though. The proxy servers timestamp the objects as they come in, and Swift's consistency rule says that the last timestamp wins. So we can do the same thing on the index side: we need to make sure that the Elasticsearch cluster is in sync with those timestamps. So yeah, this is a challenge that we need to work through.

Any other questions? We may be near time. Okay, great. Thank you very much.