Hello, my name is Natasha Gadjic and I'm coming here from Rackspace, where I work in the IT department, in Enterprise Business Intelligence. Enterprise Business Intelligence processes large amounts of data from various sources in various formats. Some information is structured, some is unstructured. Some information provides answers to known questions, while some we just need to store and make available to the business community for their research.

As we all know, the amount of data around us is growing exponentially, and it is harder and harder for the analytical environment to predict even near-future capacity requirements. For traditional consumers of analytical databases, like business analysts, marketing analysts, and data scientists, a certain delay in the data in the analytical environment is acceptable. However, there is more and more demand on the analytical environment to provide data to support staff or external customers, where delayed data just doesn't cut it anymore.

On the market today we have access to various technologies for data stores: columnar, relational, and the Hadoop file system. Each of these technologies has been built with a certain purpose and an optimal use case in mind. Our idea was to expose these various technologies behind a uniform interface and allow customers to select the optimal data store for the use case they are working on. We had access to the Rackspace private cloud powered by OpenStack, and that allowed us to think in this direction. We named this solution the analytical compute grid, or ACG.

First, I'm going to go briefly over the current Rackspace EBI environment. We run a lot of Windows and Linux operating systems with Oracle and Microsoft database solutions. Our data loading is done through SSIS and Informatica, where Informatica is our newer ETL choice and loads both the Oracle and Microsoft databases. All these processes run on dedicated servers, and currently we are faced with rapid data set growth.

Moving into the big data arena with the current environment, we envisioned a lot of problems. First and foremost, the cost of purchasing licenses. We are concerned about the time it takes to set up new hardware. We see increased demand for DBA resources. Overall, we are concerned about system performance, scalability, and capacity.

So in order to solve this big data problem, we put our heads together and asked: can the Rackspace private cloud powered by OpenStack help us here? During our discussions, we envisioned a system with the following features. We would like our system to host an ever-growing set of data, provide quick data collection and retrieval, scale rapidly up and down, be easy to maintain, and provide a standard data access API. In addition, we would like to provide a variety of storage types (columnar, relational, and HDFS) and enable users to select the optimal type of storage for the information they collect, on a use-case-by-use-case basis. We definitely wanted it to leverage the Rackspace private cloud powered by OpenStack and open source technology in general.

At the same time, we also defined the quality attributes that we would like our new ACG system to possess, and you can see them on this slide. We believe that this list of features and quality attributes is quite desirable for any big data analytical environment.
Later on, we will see how the Rackspace private cloud powered by OpenStack and ACG work together to attain these attributes. But before we get there, we have to get to know our ACG system a little better. So on the next few slides, we will go over the ACG high-level architecture and try to explain the dynamics of the system.

At the base of it all, there is the Rackspace private cloud powered by OpenStack, and below you see a very high-level specification of the system. Keep in mind that this is the very beginning of this system, and we foresee significant growth of this environment in the next year.

The ACG system starts with the creation of an image. As I mentioned earlier, we would like to support three data stores, so we created three images: columnar, relational, and HDFS. At the same time, we selected the database engines that will run under each of these images: for columnar we selected Cassandra, for relational, Postgres, and for HDFS, of course, Hadoop.

A member of the ACG system is a node, and a node is instantiated from the corresponding image. The new node has all the information and all the processes needed to join the ACG system. As you can see, there is the corresponding database engine and there is a data store controller. The data store controller is a database-specific process that manages the database on the node during the lifetime of the node in the ACG system. When the node starts up, the data store controller gathers all information about the environment the node is joining, prepares the database configuration, and starts up the database accordingly. During the lifetime of the node, the data store controller manages the database during system reboots, maintains knowledge about the configuration of the system and the activities within the system, keeps track of the health of the node, and issues any corrections if necessary.

In addition to the data store controller, there is a system statistics collector, and as the name says, it collects statistics about the node it is running on. We collect three types of statistics: OS statistics, which are CPU utilization and free memory; JVM statistics where applicable, mostly focusing on the heap size; and database statistics, such as the number of reads, the number of writes, the size of the database, and any other statistics the database can provide.

In addition to these processes, there is the ACG indexing structure. This structure is at the core of many of the quality attributes we saw previously, so we will go into the details of that structure a little later in the presentation.

For this system to function, we need a controller. On the controller side, we have the following components: the ACG manager, which is a RESTful web service, the rule engine, the node manager, and a persistent data store. The ACG manager, as a RESTful web service, is the central communication point of the system. Data store controllers contact the ACG manager to learn about the environment that the node operates within. Depending on the data store, the details differ. For Cassandra, that would be the seed nodes and the name of the cluster at startup. For a relational database, it can be whether the database is starting as a primary or a replica. For HDFS, it is the configuration of the system itself: the location of the name nodes, job trackers, and also the data files.
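As a rough illustration of this reporting flow, here is a minimal Java sketch of what a node-side statistics collector might send to the ACG manager over its RESTful interface. The endpoint path, host name, and JSON field names are assumptions made for illustration; the talk does not describe the actual wire format, and database statistics (reads, writes, size) would additionally come from each store's own counters.

```java
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/**
 * Minimal sketch of a node-side statistics collector.
 * The endpoint ("/acg/stats") and JSON field names are assumptions;
 * the actual ACG wire format is not described in the talk.
 */
public class StatsCollectorSketch {

    public static void main(String[] args) throws Exception {
        // OS statistics: CPU load and free memory (JDK built-ins).
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        // JVM statistics: current heap usage.
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();

        String payload = String.format(
                "{\"node\":\"%s\",\"cpuLoad\":%.2f,"
                        + "\"freePhysicalMemory\":%d,\"heapUsed\":%d}",
                java.net.InetAddress.getLocalHost().getHostName(),
                os.getSystemCpuLoad(),            // fraction 0.0..1.0
                os.getFreePhysicalMemorySize(),   // bytes
                mem.getHeapMemoryUsage().getUsed());

        // POST the statistics to the ACG manager's RESTful interface.
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://acg-manager:8080/acg/stats").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Manager responded: " + conn.getResponseCode());
    }
}
```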
The system statistics collector submits its collected statistics to the controller via the ACG manager. As the ACG manager accepts the statistics, it stores them in the ACG data store so that they can be presented in the UI. In the UI, we can see the nodes currently running in the system, their current statuses, and a brief history, about 24 hours, of the utilization. In addition to storing the statistics in the data store, the ACG manager places the statistics into a queue.

The rule engine consumes the information from the queue. The rule engine is quite robust and allows us to define the rules that control the elasticity of the ACG system. We have two types of rules: system utilization rules and scheduling rules. System utilization rules can be created from any combination of the system statistics collected. For example, we can say: if the percentage of CPU utilization goes above a certain number, and available memory is less than another number, and the database is experiencing high IO. In addition, we can throw in the time dimension: we can say for how many consecutive readings this needs to occur, and what percentage of the nodes in the system needs to experience the bottleneck.

As for the scheduling rules, they can be used to prepare the system for heavy activities. Since we are talking about an analytical environment, and lots of tasks in analytical environments are batch in nature, we can add more CPU power to the system to support the heavy activities and then pull it back once the jobs are done. Scheduling rules can also be used to periodically check the health of the system and verify that database capacities are available.

Regardless of the rule type, once the rule threshold is met, the rule engine contacts the node manager and requests certain operations to be performed. In the case where we experience high utilization of the system, the node manager will contact the Rackspace private cloud powered by OpenStack to add nodes to the environment. As these transactions occur, the node manager records the information into the data store. As the new nodes come up, the data store controllers on these nodes learn about the environment by contacting the ACG manager and position themselves properly in the new environment. At the same time, the UI is updated with information from the data store, and the existing ACG nodes learn about the changes by contacting the ACG manager.

On the data access side, we are currently working to provide a standard interface to access data in the system. For now we are working on completing the JDBC driver, while we will also expose the native libraries of the underlying data engines and the native bulk loaders. We see those being used mostly from Informatica, as the ETL tool.

Now that we have seen a little bit about how the system functions, let's take a look at the indexing structure. So what is the indexing structure? The indexing structure is a single entry point into the system that is fully load balanced and replicated, and it resides on a set of Rackspace private cloud powered by OpenStack instances. It is a set of pointers ultimately pointing to the database entities, and in our case the database entities are, in the relational sense, a table; in HDFS, a file; and in columnar, a column family. It is a structure that is fully controlled by the controller, the same way the data nodes are controlled, as we saw previously.
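To make the rule semantics concrete, here is a small Java sketch of how a system utilization rule with a time dimension might be evaluated. The thresholds, class names, and the per-node counting of consecutive readings are illustrative assumptions, not the actual ACG rule engine.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of a system utilization rule: fire when CPU is
 * above 85% and free memory below 1 GiB for 3 consecutive readings
 * on at least 50% of the nodes. All thresholds and names are
 * assumptions for illustration only.
 */
public class UtilizationRuleSketch {
    static final double CPU_THRESHOLD = 0.85;
    static final long MEM_THRESHOLD = 1L << 30;   // 1 GiB in bytes
    static final int CONSECUTIVE_READINGS = 3;
    static final double NODE_FRACTION = 0.5;

    // Per-node count of consecutive readings over threshold.
    private final Map<String, Integer> streaks = new HashMap<>();

    /** Feed one statistics reading; return true if the rule fires. */
    public boolean onReading(String node, double cpuLoad,
                             long freeMemory, int totalNodes) {
        boolean hot = cpuLoad > CPU_THRESHOLD && freeMemory < MEM_THRESHOLD;
        // Extend this node's streak, or reset it when the node cools down.
        streaks.merge(node, hot ? 1 : 0, (old, v) -> hot ? old + 1 : 0);

        // Count nodes that have been hot long enough.
        long bottlenecked = streaks.values().stream()
                .filter(s -> s >= CONSECUTIVE_READINGS)
                .count();
        // Fire when enough of the cluster is bottlenecked; the real
        // rule engine would then ask the node manager to add nodes.
        return bottlenecked >= Math.ceil(totalNodes * NODE_FRACTION);
    }
}
```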
Therefore, when the rule engine finds that the index nodes are overutilized, it can issue a change in the indexing structure. So the indexing structure is elastic itself. However, as we need to maintain a single entry point into the system, this structure dynamically expands vertically and horizontally, and it also addresses the growing data set.

So what does it enable us to do? First of all, it enables us to distribute databases, in the traditional sense of the databases we know today, across many instances. It also allows us to split large data sets, like tables or structures, across many instances. Consequently, it allows us to run large queries in parallel across the available instances, and it allows us to deploy data stores with their optimal configuration so that we minimize maintenance: we can have a configuration that doesn't need to change during the lifecycle of the node within the system. Also, as a single entry point, it allows us to access the various storage types via a uniform interface. Here we see several additional components: the sorter and aggregator, which are required to resolve queries that need any kind of summarization or aggregation, and also to combine results from various data stores. In the future, we are looking to add an option where we will be able to run queries that combine data from heterogeneous data sets.

Now that we know a little more about our system, let's take a look at the quality attributes we listed at the beginning and see how the Rackspace private cloud powered by OpenStack and ACG work together.

The first thing we can look at is performance. There are two types of performance here: elasticity of the system and query performance. The Rackspace private cloud powered by OpenStack creates a single instantiated node within 30 seconds. So from the image that we created to a VM that has joined the environment and is running the database, it takes about 30 seconds. We can also create nodes concurrently. Some rules, when they fire, can say we need to double the number of nodes in the system; all requests are issued concurrently, and that number of new nodes is available almost at the same time. Why is it important to control the data set size in this environment? So that when the system expands and the data distribution occurs, it can complete fast. That is why we need the indexing structure to control the data set size. It also allows us, on data retrieval, to run queries in parallel and therefore get the results quickly.

Another feature we want to accomplish is scalability. Much of the functionality that enables the performance of elasticity is the same functionality that enables scalability, but it's very important for us. The system will scale quickly because we can quickly create nodes, and we can do that concurrently. We also have the ability to resize existing nodes. For some rules we don't need to add a new node; we can add CPU power to the existing nodes to perform the tasks and then pull it back out. And we also have the ability to remove nodes, which allows us to scale down. Again, on the ACG side, the indexing structure and the controlled data set size stabilize the system quickly after an expansion or contraction occurs.

Availability. As we saw, we can rapidly replace failed nodes.
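To illustrate the idea behind the indexing structure, here is a toy Java sketch of pointers from logical database entities (a table, a file, a column family) to the nodes holding their pieces, which is what lets a query fan out in parallel. The types, method names, and node names are my invention for illustration; the real structure is replicated and load balanced across a set of instances.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Toy sketch of the ACG indexing structure: a set of pointers from a
 * logical entity (relational table, HDFS file, or Cassandra column
 * family) to the nodes holding its partitions. Names are illustrative;
 * the real structure is replicated and load balanced across instances.
 */
public class IndexingStructureSketch {

    enum StoreType { RELATIONAL, HDFS, COLUMNAR }

    /** One pointer: where a slice of an entity physically lives. */
    record EntityPointer(StoreType store, String entity, String node) {}

    private final Map<String, List<EntityPointer>> index =
            new ConcurrentHashMap<>();

    /** Register a new slice, e.g. when the rule engine adds a node. */
    public void register(EntityPointer p) {
        index.computeIfAbsent(p.entity(),
                k -> new CopyOnWriteArrayList<>()).add(p);
    }

    /** Look up all nodes for an entity, so a query can run in parallel. */
    public List<EntityPointer> locate(String entity) {
        return index.getOrDefault(entity, List.of());
    }

    public static void main(String[] args) {
        IndexingStructureSketch idx = new IndexingStructureSketch();
        // A column family split across two Cassandra nodes:
        idx.register(new EntityPointer(StoreType.COLUMNAR, "monitoring_ts", "cass-01"));
        idx.register(new EntityPointer(StoreType.COLUMNAR, "monitoring_ts", "cass-02"));
        // A query against "monitoring_ts" would fan out to both nodes.
        idx.locate("monitoring_ts").forEach(System.out::println);
    }
}
```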
And underneath, within ACG, because we are deploying existing database engines, we actually get each data store's native availability mechanisms: replication, data distribution, anything the corresponding data store comes with is all available.

There is a significant gain on the maintainability side from using the ACG system. Adding new nodes increases our storage capacity, our CPU power, and our RAM, while at the same time it does not require any intervention from a system administrator or database administrator. The fact that we are controlling the data set size enables us to run the databases with an optimal and stable configuration. Once we come up with the configuration, it's in the image and doesn't need to change during the lifecycle of the node within the system. We are also reducing the demand for managing data store objects. In the traditional sense, on very large databases we would have indexes and partitions, and we would need to take care of their placement and many other things. In this situation, indexes of course still exist, but special management of those objects is no longer required.

In addition, we benefit from stable query execution plans. In large databases, often even when we go through a full tuning exercise and have the right indexes in place, the database keeps growing and growing, and when it reaches a certain threshold, the query plan just decides to take a different path, and our performance suffers. That cannot be resolved until another major tuning exercise, which usually results in database configuration changes, table configuration changes, and even changing the queries themselves to introduce hints and find different ways to get the query executed fast. Now, in this situation, because the database size is controlled, and when it grows over a certain threshold another one starts, we shouldn't experience these kinds of problems.

Flexibility is very important for us because we receive and process data in various formats. As we mentioned earlier, we would like to enable three different storage types, and we believe each storage type has its optimal use case. For columnar we chose Cassandra, and we see that it's a really good choice for time-series data, which is a very big subject in analytical databases. For relational we use Postgres; for example, a legacy data warehouse would be the kind of workload that would run in that environment. And of course, for HDFS it is Hadoop, and we would like to use it for unstructured data, because we are not trying to make something else out of it; we would like it to run as intended. We will enable users to select the optimal storage type for the data they are working with, and that selection would occur on data intake.

Usability. As we mentioned earlier, we are looking to provide a standard interface. We will support the SQL language and the JDBC API for now. We will look into an ODBC option, but JDBC is something that suits our needs at the moment. We will enable the data stores' native calls as well, and the native bulk loader utilities, and in the future we would like to support joining heterogeneous data sets: for example, a query that would request data from both columnar and relational stores.

And this presentation really comes with a use case.
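To give a sense of what the uniform interface aims at, here is a hedged Java sketch of client code going through a JDBC driver. The URL scheme "jdbc:acg://", the host, and the table name are placeholders I made up; the talk says only that the JDBC driver is still being completed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Sketch of how a client might use the planned ACG JDBC driver.
 * The URL scheme "jdbc:acg://..." and the table name are placeholders;
 * the talk states only that the JDBC driver is under development.
 */
public class AcgJdbcSketch {
    public static void main(String[] args) throws Exception {
        // The indexing structure is the single entry point, so the
        // connection targets it rather than any individual data node.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:acg://acg-index:9000/analytics");
             Statement stmt = conn.createStatement();
             // One SQL query; the grid decides whether the entity lives
             // in Cassandra, Postgres, or HDFS and fans out in parallel.
             ResultSet rs = stmt.executeQuery(
                     "SELECT device_id, AVG(availability) "
                     + "FROM monitoring_ts GROUP BY device_id")) {
            while (rs.next()) {
                System.out.printf("%s -> %.3f%n",
                        rs.getString(1), rs.getDouble(2));
            }
        }
    }
}
```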
We would like to tell you briefly about a very recent experience we had with this environment, keeping in mind, again, that this is the very beginning of it; we've been working on this for, I would say, less than six months. Recently, we were faced with a very high-profile project that processed very large amounts of data and performed a calculation that produced a large number of records. We already had some of the data required for this calculation in our legacy database, which is a Microsoft SQL Server data warehouse environment. So as a first step, we decided to add the sources we did not have and develop the calculation using stored procedures to get the result within the data warehouse environment.

At the same time, we were feeling ready to do the same in our ACG environment. We had already developed the ETL process, and it would load either of these environments just by selecting the destination and providing the rows or records you want to load. This gave us the opportunity to compare the two systems side by side, and that is what this use case is about.

The subject was a complex availability calculation, sourcing only three months of monitoring data and creating one billion records in the initial calculation.

The first environment was our data warehouse environment: SQL Server, with SSIS for data loading. We were at the same time in the process of moving our data warehouse to another server, so we had that new server at our disposal for the initial calculation; it was doing nothing but this calculation. We ran the calculation in a stored procedure, and the results and sources were in traditional data warehouse structures, like a star schema.

For the second environment, we didn't have an environment standing by, so through our node manager we instantiated the whole environment, and in about a minute it was there. It consisted of four Cassandra nodes plus the indexing structure, and the environment for the calculation was registered with the indexing structure. Each node had two CPUs and eight gigabytes of RAM. We developed the calculation in Java, and the sources and results were stored in a columnar structure, which we see as a suitable structure for time-series data.

These are some of the results. What ran on Microsoft SQL Server lasted about five days, while on ACG it finished in three and a half hours. On the storage side, we gained substantial savings; granted, in Microsoft SQL Server there are many indexes that take a chunk of that space, and we don't need those in the columnar structure, they are not required. And as the columnar structure is better suited to time-series data, our Java program was much simpler than the stored procedure doing the same task.

So what did we gain? We can summarize it as follows: we gained a substantial performance improvement, we reduced storage demand, we simplified our processes, and we now have the ability to process terabytes of data per day, close to real time or on demand. If you remember, at the very beginning of this use case I said this was done for just three months of data.
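To give a flavor of why the columnar structure suited this time-series workload, here is a small Java sketch of a common Cassandra-style modeling idiom: a composite row key of entity plus time bucket, with readings ordered by timestamp within the row. This is a generic illustration of the technique, not the actual schema used in the calculation; the device name and bucket granularity are made up.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/**
 * Generic illustration of a Cassandra-style time-series layout:
 * row key = deviceId + day bucket, columns ordered by timestamp.
 * Bucketing keeps rows bounded, so data redistributes quickly when
 * the grid expands. Not the actual schema from the use case.
 */
public class TimeSeriesKeySketch {
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    /** Partition key: one row per device per day. */
    static String rowKey(String deviceId, Instant ts) {
        return deviceId + ":" + DAY.format(ts);
    }

    /** Column name: timestamp in millis, so columns sort by time. */
    static long columnName(Instant ts) {
        return ts.toEpochMilli();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-04-15T10:30:00Z");
        // A monitoring reading for device "lb-042" lands in that day's row:
        System.out.println(rowKey("lb-042", now) + " / " + columnName(now));
        // -> lb-042:20130415 / 1366021800000
    }
}
```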
So at this point we feel confident that we can go ahead and continue to provide the results of this calculation, and the sources that contribute to it, continuously. We also improved our trending and reporting. When we started working with Microsoft SQL Server, we knew that we would be about 24 hours behind. Right now the data for our calculation just keeps coming in, and it's almost real time: all the records that contribute to this calculation are available as they occur in the sources. So there is a significant improvement there, and we achieved a significant cost reduction, because we will not be expanding our legacy environment as we would have had to if we wanted to support the amount of data we are faced with. Thank you.

[Audience] Just a quick question. Did you consider Hive or some existing SQL-like query language?

Right now, for Hadoop in ACG, we run it under Hive. We would actually like to remove Hive and have a MapReduce service available instead. We don't want to make these stores do something they were not originally intended to do, which I think Hive is kind of doing.

[Audience question about expansion] Well, usually you would double: if your data distribution is on three nodes, you would add another three nodes so that it evens out, and you keep following that progression as you add more nodes. Elasticity is most significant at the Cassandra level here. Hadoop by itself is really about storing large amounts of data anyway, so expansion is not something that will occur very often in the Hadoop environment. But in Cassandra, because Cassandra actually redistributes the data as it expands, you have to make sure that the size is controlled so that the process doesn't take a very long time.

[Audience question about scaling back] Yes, it is supported. We see that scaling back will mostly happen when we add CPUs and then take them back down, but we can also take nodes out. We have run into the situation where we actually take nodes down and the data redistributes itself; it takes a little bit longer. Again, all of this elasticity has to be considered in the context of the real environment; scaling up and down is mostly about adding and removing CPU.

[Audience question about storage] We have three types of storage, columnar, relational, and HDFS, so whatever they provide, we use. What is underneath right now, and keep in mind this is the very beginning of our environment, is local storage, that is correct. If we change that, and eventually we will, we will have a completely new environment with the ability to dynamically add storage and make use of storage within our private cloud, as a consumer of the Rackspace private cloud. That will happen next year. For now, we will talk with our private cloud team about our needs and see what will be best.

As for localization of the data: with Hadoop in general, you don't really need to redistribute. You keep whatever is there, and as it grows you can add new VMs that then take over, and inserts start going there. When queries run, they know to run in parallel across the nodes. That is how we resolve expansion and contraction.

[Audience question, partly inaudible] Yes, there is a web server. The ACG system is composed of the nodes and the data stores; the web server and the other components sit in a totally different environment. They have access to each other, and the controller manages the nodes, but it does not impact them in that way; the loading actually occurs through the indexing structure. Thank you.