Welcome to the IBM Accelerate Discovery Forum, Distinguished Speaker Series. It's my great pleasure to introduce our speaker today, Romney Rory. Romney is a senior researcher and manager of the Cloud System Analytics group at IBM Research Almaden. He has been with IBM for more than 18 years. In his role, he continues to invent, design, and transform next-generation data center solutions, ranging from cutting-edge underpinning cloud technology to innovative cloud management systems using advanced big data analytics. The essence of those solutions lies in the unique integration of data science with domain-specific applications in the large-scale cloud system analytics paradigm. Romney has made a significant impact by architecting and implementing key components of the IBM Spectrum Control product, the IBM Cloud platform, and the IBM Service Delivery Framework. Romney has been awarded the IBM Corporate Award, the CEO's Best of IBM Award, and multiple Outstanding Technical Achievement Awards. He is an IBM Master Inventor and a member of the IBM Academy of Technology. Most recently, he has also served as the Almaden anchor for the IBM Global Technology Outlook 2017. Okay, I'll give it to Romney. Thank you.

Thank you, Mo, for the introduction. In the talk today, we'll be covering data-driven cloud systems analytics and beyond. Before I go into my talk, I would like to draw your attention to three motivation points. From there, I will take you through some key storyboard scenarios that show how we have implemented an Apache Spark based analytics-as-a-service system, which is driving cost and performance optimization in IBM Cloud as well as in IBM GTS (Global Technology Services) environments.

So let me go through a few points to draw your attention. The first is the IDC survey, which is the graph on the left-hand side. The key point I would like to draw your attention to is the blue line. The blue line shows the physical server installed base from 1996 to 2013, and the red line is the logical server base. The gap that you see is mostly because of the virtualization that has been introduced, and there is a significant amount of instrumentation in these environments, which means a lot of parameters are being collected. So essentially a lot of moving parts have been introduced because of virtualization, and with the introduction of containers it becomes even more granular. There is a huge amount of data being collected from these environments, but that data is only as good as what you do with it.

The second point I would like to draw your attention to is the installed base, or the share, from a private/public cloud perspective. There is also a significant amount of traditional IT that might not be called a cloud environment, because it may not have the capabilities that would qualify it as a cloud. If you look again at the IDC forecast going from 2014 to 2019, the point is that a significant chunk will still be traditional IT through 2019.

The third motivation point is that there is a significant push toward the convergence of business strategy with IT strategy, and many points are given as requirements.
So there is the availability of the public cloud, and in the traditional IT environment there is significant budget pressure, along with all those requirements with respect to agility, availability, resiliency, and so on. Essentially, the goal for the enterprise is to provide infrastructure as a service that is efficient, automated, and dynamic.

Given these three points, let me go to the core of the talk. In today's talk I will cover how we create automated, actionable insights from the significant amount of data that is collected through the powerful instrumentation that all the data center managed elements provide. Second, I am going to show you some constructs for how we have evolved from the infrastructure upwards to the application, where we look into the data and not only into the infrastructure. What I mean by that is, from an infrastructure perspective you might only look at files, objects, and blocks; but if we look from the top of the stack, then we might look at the application data: what files are there, what objects are there, do they contain any sensitive information, and so on. Those are the two main topics, and I will skip the API economy portion for today's talk.

At a very high level, a 50,000-foot view, the way I would structure our work from a cloud analytics perspective, or from a cognitive delivery perspective, is its positioning across these three circles. The first is what we are calling cloud. When I say cloud here, I mean the public cloud, like IBM SoftLayer or IBM Cloud Managed Services, but also the traditional IT environment: the hundreds of Global Technology Services managed strategic outsourcing accounts that are spread throughout the globe. The second bubble that you see is data science: using technology such as Spark, we collect a huge amount of data from the environment, and we trend it, predict on it, or detect anomalous behavior, to come up with insights that give us cost and performance optimization. The third one is the open ecosystem. The insights that we are generating are actionable insights, so if the enterprise wants to go ahead and implement these changes, it can leverage the open API ecosystem that exists: invoke an API and implement those changes to achieve that optimization. We drive that mostly through OpenStack, because given the heterogeneity of the environment, OpenStack is one of the key components we rely on to achieve that end goal. That is the third point, but for today's talk I will skip it.

Going to the technicality of it: how do we implement it? First, on the left-hand side, as you can see, these are our managed environments. A managed environment, as I mentioned, could be the public cloud like SoftLayer, or it could be on premise, IBM hosted, or shared; many of these are traditional IT enterprise environments that are managed by IBM GTS. In some cases we deploy a lightweight sensor in that environment, which collects the configuration and performance information.
Configuration means the end-to-end configuration, such as how many VMs there are, what the physical and logical server configuration is, the network and storage configuration, and so on. Performance is, for example, a time series record for every 5-minute interval: how many I/Os occurred from a storage perspective, and out of that number of I/Os, how much is read versus write, random versus sequential, and so on. In some environments we deploy a lightweight sensor; in certain other environments we do not have to deploy any sensor or agent, because we can simply pull the data: a port is open and we have the user and password credentials.

That data is pulled to the right-hand side, which is a big data cluster, as you see. In this cluster we have a few components. One is HDFS, which provides a large-scale file system. The data from these sensors lands in this file system, where it is curated and organized, and sometimes it effectively goes through ETL processing; the data is then stored in an ingestion-ready form. The third layer that you see is Spark, which accesses that curated data and produces the insights. So the high-level point is that we are providing data-science-enabled infrastructure optimization. As you might see here, there are a few examples of the optimizations we provide, such as storage reclamation and thick-to-thin conversion, which I will go into a little later. This platform effectively provides software as a service: it is an automated advisor that provides cost and performance optimization. Apart from cost and performance optimization, I will also highlight some other analytics, such as predictive analytics, anomaly detection, and collective insights, just to show other aspects of this platform as well. One quick contrast I would like to make is that, as you know, this can be achieved on premise as well. When I cover the collective intelligence aspects, I will highlight the additional advantages we get because we are collecting all of the data from thousands of these environments into one place.

First, a quick point on why this is a big data problem, with some scenarios. In one healthcare enterprise, the total number of DICOM files, VCF files, and medical images is about 430 million files, so you can imagine the amount of metadata generated from just one environment. Second, from a velocity perspective, in one of our installations the total number of backup jobs we are collecting is more than 3 million in every 24-hour period; every 24 hours we collect metadata for 3 million backup jobs. And for variety, we collect data from more than 20 storage platforms, 12 backup products, and 8 management software platforms, such as Spectrum Control and vCenter, in these environments. As I mentioned earlier, modern monitoring tools are producing ever more granular metadata, but we can only show value if we generate insights from this data and put them to use.
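To make the data flow just described a bit more concrete, here is a minimal sketch of the Spark side, assuming the curated 5-minute performance records land in HDFS as Parquet. The paths, column names, and schema are hypothetical; the talk only says that sensor data lands in HDFS, is curated, and is then read by Spark.

```python
# Minimal sketch: roll the 5-minute performance samples up into per-volume
# daily profiles that downstream insights (reclamation, tiering, thick-to-thin)
# could consume. Paths and column names are assumptions, not the real schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("volume-perf-curation").getOrCreate()

# Hypothetical schema: one row per volume per 5-minute interval.
perf = spark.read.parquet("hdfs:///analytics/curated/volume_perf/")

daily_profile = (
    perf.groupBy("volume_id", F.to_date("sample_ts").alias("day"))
        .agg(
            F.sum("total_ios").alias("ios"),             # total I/Os that day
            F.avg("read_pct").alias("avg_read_pct"),     # read vs. write mix
            F.avg("random_pct").alias("avg_random_pct"), # random vs. sequential
            F.max("capacity_gb").alias("capacity_gb"),
        )
)

daily_profile.write.mode("overwrite").parquet("hdfs:///analytics/profiles/volume_daily/")
```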
Before I go into the exact use cases and take you through them, let me, going in reverse order, prove the point from a scale as well as a value perspective. In one of our environments we are analyzing more than 350 petabytes of storage. We saw 1.8 million storage volumes, out of which we detected 190,000 orphan volumes; I will go into what we mean by orphan volumes in a bit. Similarly, from a value-add perspective, a few enterprises used this analytics as a service, and within two weeks they could see a 12% improvement in the backup success rate. And third, out of these 1.8 million storage volumes we detected about 345,000 volumes that could be converted from thick storage to thin storage.

So let's get into the detail. One quick example: as I mentioned, out of 1.8 million storage volumes we detected about 150,000 or so to be orphaned. What orphan means is that we first detect the volumes in use in the managed environments, and for each of those volumes we look at whether it is connected to a host or not, whether it is participating in any copy relationship, meaning whether it is part of backup or replication, and whether it has seen any I/O activity in the last 60 days. A volume with no copy relationship and no recent I/O would qualify as an orphan. We call it a potential orphan, because from a machine perspective we detected that it is an orphan, but the accounts do due diligence on it, and out of the list that we generate they come back with a subset, or maybe the complete list, where they agree that it is an orphan. Then the accounts go through a simple, account-specific process to delete that storage. For example, some accounts might unassign the volume from the host and wait two weeks to see if any ticket is generated; if not, they go ahead and delete it. At the very end, when the subsequent collection occurs, we detect whether our insight has been implemented or not, meaning a volume that was detected as orphan has now been deleted. From the before and after, we know that our recommendation has been accepted, and we count that as savings.

Some quick examples follow in a very similar fashion. By collecting data and correlating it from an application perspective as well as a backup and replication perspective, we can determine the copy data scenario: how much is actually application data versus how much data is dedicated to replication. This gives a big insight into how much storage we are spending on copy data versus application data, and end to end we can get an overall view as well.

The third one is performance-aware thin provisioning. One of the key insights we provide is how to maximize performance as well as cost optimization. In a given environment, when doing a thick-to-thin conversion, most of the focus goes to capacity optimization. But here, because we are allocating different amounts of storage with different performance demands, we calibrate not only the capacity but also the performance, and then come back with a recommendation on how the thin-provisioned storage can best be laid out. I will skip through these slides, but the high-level idea is a very similar set of recommendations.
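As an illustration of the orphan rule just described, here is a minimal sketch over plain Python records. The field names and the sample data are hypothetical; the real system evaluates the rule over the collected configuration and performance metadata, and the accounts still do their own due diligence on the resulting list.

```python
# Minimal sketch of the potential-orphan rule: no copy relationship and no
# observed I/O in the last 60 days. Host mapping is carried along because,
# as the talk notes, a volume can be flagged whether or not it is mapped.
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Volume:
    volume_id: str
    mapped_to_host: bool          # is the volume assigned to any host?
    in_copy_relationship: bool    # part of any backup/replication relationship?
    last_io_date: Optional[date]  # last observed I/O activity, None if never seen
    capacity_gb: float

def is_potential_orphan(vol: Volume, today: date, idle_days: int = 60) -> bool:
    """Flag a *potential* orphan; the account still performs due diligence."""
    idle_cutoff = today - timedelta(days=idle_days)
    idle = vol.last_io_date is None or vol.last_io_date < idle_cutoff
    return (not vol.in_copy_relationship) and idle

volumes = [
    Volume("vol-001", mapped_to_host=False, in_copy_relationship=False,
           last_io_date=None, capacity_gb=512.0),
    Volume("vol-002", mapped_to_host=True, in_copy_relationship=True,
           last_io_date=date(2017, 2, 20), capacity_gb=256.0),
]

today = date(2017, 3, 1)
candidates = [v for v in volumes if is_potential_orphan(v, today)]
print(f"{len(candidates)} potential orphans,",
      f"{sum(v.capacity_gb for v in candidates):.0f} GB reclaimable")
```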
One such recommendation is static down-tiering, meaning identifying storage that is positioned in hot tiers, placed on high-cost, high-performance storage tiers, but not showing the corresponding characteristics. For example, say you have provisioned storage on a high-end device like XIV, or Spectrum Accelerate, but if you look at the I/O density, which is I/O per gigabyte per second, it shows very low values, meaning the data is very cold or maybe lukewarm. In that case the recommendation would suggest that you can down-tier that storage and get a saving. Similarly, we would suggest compression enablement if the data is very compression-friendly: we should go ahead and compress it. You can optimize your licenses from a copy services perspective, and the tool can also provide certain insights into hotspot detection, noisy neighbors, and so on.

One key insight here is that, as we project these insights, we also introduce something called a predictive insight, or predictive procurement deferral. From these insights, as you can see, you save cost because you are saving storage capacity and optimizing performance. The top line that you see is the set of data points showing how storage is growing in a given environment. What we project for this environment is when, at the current growth rate, they would hit 80 percent of the maximum capacity they have on premise. The drop points that you see in the second line show where we have optimized or saved certain capacity, so we have deferred their procurement for a certain period of time. The third, pink line shows that there are still a lot of outstanding recommendations; if you accept them, we can defer your procurement for a further period of time. Also, just to highlight the point, it does not have to be a linear prediction: if an administrator knows that a certain workload is coming, or has some other insight, they can introduce a step function that adjusts the projected run-out date for that environment as well.

The next insight, as I mentioned, is collective intelligence. Compared to the on-premise model, where we collect data from one enterprise, analyze it, and give advice, in the cloud delivery model or software-as-a-service model we are collecting data from thousands of these environments into a single cluster, or maybe only four or five installations throughout the globe. In that scenario, one of the advantages is that, without divulging privacy, you can pick an account, find who its ten best neighbors are, and then compare and contrast them across multiple metrics. For example, if account A has 10% orphan storage but all its neighbors show only 3% orphan storage, that means the particular account we selected is not operationally efficient; its storage reclamation process is not good. A second example: once we generate these insights, all the other accounts take on average maybe 30 to 45 days to do due diligence and delete the orphan storage, whereas this particular account may sit on it for 120 days. Not only can we do this across accounts, we can also do it for different departments or different data centers of a single enterprise.

This basically introduces a social context to systems management, meaning different data center administrators or practitioners can actually talk to each other, share scripts, share reference architecture paradigms, and things of that nature.

So far, what I showed focuses on the storage optimization aspect. Let me go to a closely connected domain, where we focus on analytics as a service for the backup, or resiliency, environment. What I mean by a resiliency environment is that typically you have a protected environment, which is your data center where the applications are running, and then we introduce something called a data mover. A data mover could be as simple as an agent that goes on a VM, for example Spectrum Protect, which copies the data and protects it somewhere else behind the backup server. Or you could be doing storage replication, in which case the storage controller's replicator is the data mover that is moving and replicating the data, and it could be synchronous or asynchronous replication, and so on. In all of these scenarios, when the data is moved from the protected entity, from on premise to on premise again, or from on premise to cloud, we collect a significant amount of metadata. You can see some examples: for a simple backup we collect things such as which client machine the data is being backed up from, which backup server it is going to, when the backup started, when the backup completed, whether it was incremental or full, the file count, the error count, the job duration, and a lot of other parameters.

By collecting this data, we provide a basic set of reporting that helps improve the backup success rate or the service level agreement: are there any consecutive failures, what is the daily backup success rate, how do you do problem determination and fault isolation, and so on. We also do things like backup coverage validation, which we do by comparing and contrasting different management towers. If the asset tower shows there are new assets, we can say that these are not getting backed up. The argument might be that they have been procured but have not been configured on the raised floor; but if we are receiving CPU, network, and memory metrics, that means the server has been activated. So we take the difference and show that the server has been activated, it is on the raised floor, but it is not getting backed up. Similarly, if a server has a database but we are backing up the server as a file system, then the backup is of no use. By doing that, we can come back with something called application coverage validation: is the application being backed up with respect to application semantics or not.

Apart from doing the basic set of reporting to improve the service level, we also look at the successful scenarios and detect whether any anomalous behavior is happening. For example, say a particular application or server was getting backed up for the last 45 days, and throughout that 45-day period the incremental backups were around 20,000 files and the full backups were around 20 million files. But what we notice is that after some configuration change, suddenly only 2,000 files are getting backed up. That would show up as anomalous behavior, which raises an alert; the system automatically compares several characteristics and then gives you insight into which config parameters seem to have changed.
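Going back to the procurement-deferral projection mentioned above, here is a minimal sketch assuming a simple linear growth model fitted over historical used-capacity samples. The talk notes the prediction need not be linear and that a step function can be layered on top; all numbers, names, and the 45 TB figure below are illustrative.

```python
# Minimal sketch: fit a linear trend to used capacity and project when the
# environment hits 80% of installed capacity, with and without the outstanding
# recommendations (orphan deletion, thick-to-thin, ...) applied.
import numpy as np

def days_until_threshold(days, used_tb, installed_tb,
                         threshold=0.80, reclaimable_tb=0.0):
    """Days from the first sample until used capacity reaches
    threshold * installed_tb, assuming linear growth. reclaimable_tb models
    capacity that accepted recommendations would free, deferring the date."""
    slope, intercept = np.polyfit(days, used_tb, 1)   # TB of growth per day
    if slope <= 0:
        return float("inf")                            # flat or shrinking usage
    target = threshold * installed_tb + reclaimable_tb
    return (target - intercept) / slope

days = np.array([0, 30, 60, 90, 120])
used = np.array([610.0, 624.0, 641.0, 655.0, 672.0])   # TB used over four months

baseline = days_until_threshold(days, used, installed_tb=900.0)
deferred = days_until_threshold(days, used, installed_tb=900.0, reclaimable_tb=45.0)
print(f"projected run-out in ~{baseline:.0f} days;"
      f" ~{deferred:.0f} days if outstanding recommendations are accepted")
```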
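The backup-anomaly check just described can be sketched very simply: compare the latest job's file count against the recent history for the same client and job type and flag a large deviation. The history, thresholds, and alerting below are illustrative assumptions; the production system additionally diffs configuration parameters to surface what changed.

```python
# Minimal sketch of the backup-anomaly check on file counts.
from statistics import mean, pstdev

def is_anomalous(history, latest, min_history=14, z_threshold=3.0):
    """history: file counts of prior jobs of the same type (e.g. incremental
    backups of one client over the last 45 days)."""
    if len(history) < min_history:
        return False                      # not enough baseline data yet
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

incremental_history = [19800, 20150, 20040, 19920, 20500, 20100, 19850,
                       20230, 20010, 19990, 20310, 20080, 20120, 19970]

print(is_anomalous(incremental_history, 20180))   # False: within the normal range
print(is_anomalous(incremental_history, 2000))    # True: raise an alert
```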
Similarly, as we restore any backups, we also detect whether there is any anomalous behavior in the restore times, which gives you insight, based on the characteristics we observe, into whether there is any threat to the recovery point objective or recovery time objective that might have been defined as the baseline when we started this backup as a service.

At this point, let me move from infrastructure analytics, where we essentially pulled data from the infrastructure (backup, storage, server, application, and related areas), and go into the data, or move upward in the stack. The point is that all the analytics you have seen so far is basically built on the storage map: from that layer we collected a significant amount of data, and from an application perspective we got some insight as to what application, what server, and things of that nature. In this project, StackInsights, what we are introducing is something called a data map. Essentially we use IBM components like StoredIQ to peek into the data and ask whether it contains any sensitive information: if there is a file, does it contain any regular expression match for a social security number, does it contain keywords like copyright or sensitive, and things of that nature, or if it is a medical image, does it contain a medical record number. Once all of this information is captured from these three layers, we correlate it together, and the output is presented, at a very high level, on X and Y axes. On the X axis we show all the storage consumed, going from cold to hot from an I/O density perspective, and on the Y axis we show the data going from low sensitivity to high sensitivity.

This can have multiple use cases. The classic one is that if we identify which part is low sensitivity and cold, then the set of applications and associated storage in that quadrant is a potential candidate for moving from on premise to the cloud, which is where we have been collaborating with IBM's new acquisition, Gravitant. Similarly, from a Box perspective, say you are using a sort of hybrid cloud, to give an example environment, where you are sharing a lot of documents and using something like enterprise content management as a service. From a governance perspective, you want to know whether any sensitive data is being moved off premise. You can do the same classification on the Box folder and highlight whether there is any data that needs to be moved back, and things of that nature. Here I am using Box as an example, but it could equally be a set of storage elements in a public cloud environment.

The point to note here is that the metadata we collect from the storage map, the application topology, and the file or object metadata such as size, owner, and age, call these S1 and S2, is pretty easy to capture, because you go to the control plane and capture it; whereas S3, the content-level sensitivity, is a little harder, because it is intrusive and it also sits in the data path. For that we also have a mechanism to sample and intelligently derive what would be confidential and what would not, essentially to reduce the intrusiveness when we are trying to assess a given environment from a data assessment perspective.
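Here is a minimal sketch of the data-map classification described above: a regex and keyword pass over file content, combined with the I/O density from the storage map, places each file on the sensitivity-versus-temperature grid. The patterns, thresholds, and placement hints are illustrative stand-ins, not the StoredIQ rule set used in the actual system.

```python
# Minimal sketch: content-based sensitivity plus storage-map I/O density.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # social security number
KEYWORDS = ("confidential", "copyright", "sensitive")

def sensitivity(text: str) -> str:
    if SSN_PATTERN.search(text):
        return "high"
    if any(word in text.lower() for word in KEYWORDS):
        return "medium"
    return "low"

def placement_hint(level: str, io_density: float, cold_threshold: float = 0.01) -> str:
    """io_density: I/O per gigabyte per second, taken from the storage map."""
    if level == "low" and io_density < cold_threshold:
        return "cold and low-sensitivity: candidate to move off premise or down-tier"
    if level == "high":
        return "sensitive: keep on governed, on-premise storage"
    return "no placement change recommended"

sample = "Patient discharge notes - CONFIDENTIAL - SSN 123-45-6789"
level = sensitivity(sample)
print(level, "->", placement_hint(level, io_density=0.002))
```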
So the convergence here is that we showed you a lot of work focusing on the information lifecycle management (ILM) aspect, and the second angle we are pursuing is the information lifecycle governance aspect; the convergence of these two brings deep insight not only from the infrastructure perspective but also from the application or data sensitivity perspective. The high-level goal is that, by connecting all of these metrics, we will be able to derive how sensitive the data is, how critical the data is, and what its locality is: where is it placed right now, is it on premise or on cloud, is it on a higher tier or a lower tier, and things of that nature. We have exercised several cognitive APIs, such as visual recognition, for medical images. The output is basically a sort of three-dimensional structure: on one dimension we project how sensitive the data is from a PHI, PII, or SPI perspective; the second is which user or which folder is generating that sensitivity, meaning contaminating the data folder with respect to sensitivity; and the third is what time that occurred. To put it simply, as a Box administrator for a given enterprise you can actually detect whether the data is continuously getting contaminated, or, after you have given advice, whether it is getting decontaminated and returning to its regular state.

So this gives you a quick overview of the different analytics that we are providing. By connecting metadata from the infrastructure metrics as well as the application metrics, together with insights from the data, we correlate everything to provide optimization for the infrastructure from a cost and performance perspective, as well as giving insight into the value of data, given the digital explosion, and how better to manage it from a hybrid cloud perspective or an information lifecycle governance perspective. Here I would like to conclude, and also highlight that we have published most of this work in several conferences; please feel free to ask and I can point you to the different publications as well. So that's it. Thank you.

Hello, my question is: what parameters do you consider to identify the neighbors for the collective intelligence?

So the question is how we figure out the neighbors for doing collective intelligence. In collective intelligence, when a single account is selected, without divulging privacy we want to find who the ten best neighbors are. Sometimes there is a constraint that the neighbors have to be within the same geography or country, and things like that, because of operational practices. But at a very high level we look at the size of the account from a capacity perspective. We also look at something called vendor spread and model spread, meaning the heterogeneity of models and vendors, and we also look at the industry and sector. Apart from that we also have several secret components, the secret sauce, so to say. By looking at these parameters with different weightage, we come back with who the best neighbors are, so that we can compare and contrast across different parameters.
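A minimal sketch of this kind of weighted neighbor scoring is below. The features, weights, and sample accounts are illustrative stand-ins for the proprietary ("secret sauce") components the speaker mentions; only the named criteria (geography constraint, capacity size, vendor spread, model spread, industry and sector) come from the talk.

```python
# Minimal sketch: score candidate neighbor accounts by weighted similarity.
from dataclasses import dataclass

@dataclass
class Account:
    name: str
    geography: str
    industry: str
    capacity_tb: float
    vendor_spread: int   # number of distinct storage vendors
    model_spread: int    # number of distinct device models

WEIGHTS = {"capacity": 0.4, "vendor": 0.2, "model": 0.2, "industry": 0.2}

def similarity(a: Account, b: Account) -> float:
    cap = 1.0 - abs(a.capacity_tb - b.capacity_tb) / max(a.capacity_tb, b.capacity_tb, 1.0)
    ven = 1.0 - abs(a.vendor_spread - b.vendor_spread) / max(a.vendor_spread, b.vendor_spread, 1)
    mod = 1.0 - abs(a.model_spread - b.model_spread) / max(a.model_spread, b.model_spread, 1)
    ind = 1.0 if a.industry == b.industry else 0.0
    return (WEIGHTS["capacity"] * cap + WEIGHTS["vendor"] * ven
            + WEIGHTS["model"] * mod + WEIGHTS["industry"] * ind)

def best_neighbors(target: Account, pool, k: int = 10):
    # Operational constraint from the talk: neighbors may have to share geography.
    eligible = [a for a in pool
                if a.geography == target.geography and a.name != target.name]
    return sorted(eligible, key=lambda a: similarity(target, a), reverse=True)[:k]

acct_a = Account("A", "NA", "healthcare", 1200.0, 4, 12)
pool = [Account("B", "NA", "healthcare", 1100.0, 3, 10),
        Account("C", "EU", "banking", 900.0, 5, 14)]
print([a.name for a in best_neighbors(acct_a, pool, k=2)])   # ['B']
```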
Yes, so essentially, as I mentioned at the beginning of the talk, from the cloud and data science perspectives we generate a significant amount of analytics. But the third one, which I skipped, is the open ecosystem: not only do we generate the insights, but, as I mentioned, these are actionable insights. What we also do is provide a rich set of APIs that can be stitched together to implement a process-specific workflow where you can take action. The simple example I was giving was: say the system generated an insight that said here are 25 volumes which are orphans. After due diligence you might say, here is my subset of 23 volumes that I would like to delete. Then there is a workflow in place where you invoke a certain API, which confirms that the recommendation still holds good. It then goes into unassigning the volume, waiting for a two-week period to see if any ticket gets generated or somebody raises an alert, and if not, going ahead and invoking the delete volume command. That is where we leverage OpenStack, but I skipped that portion in this talk. Thank you.

Are all of the recommendations that you make storage related?

That is a good question. Essentially, in this environment, in the management towers that we engage with, the practitioners are storage practitioners and storage subject matter experts, and not only storage but also data: the people who deal with storage and data management in a given environment. Most of our insights are geared towards them, because they are the people who log into the system, take the insights, and act on them. But this system can be extended, for example, to advise on orphan VMs, or on security vulnerabilities and best-practice violations, and things of that nature. The platform and the architecture remain the same; what needs to be extended is that you collect additional metrics from a security, server, or network perspective, analyze them from that angle, and then create that advice. Right now we are very storage and data centered, because those are the people who gave us the requirements, and they are the ones who are taking these recommendations as well. Thank you.