I am Avinas, and since we are running short on time already (we started late), we will quickly go through the use case and the demo and cover the decisions we have taken. After that, if time permits, we will do questions and answers and I will introduce myself and my work. It is better that we get this covered first.

The use case is about 15 billion computations on an investment portfolio; it is a firm-wide computation. In an investment bank you have around 3 million positions. A position is something like: I own 10 shares of Microsoft, that is a position; you own, let us say, 15 shares of Cisco, that is another position. For one day there are going to be around 3 million positions. A position is all about quantity, how many stocks you hold and in what quantity; you could have futures, you could have many other things.

The other aspect is the models. People in the financial markets think about what the price of a stock is going to be if it rains, if the oil price goes up, if gold comes down. There will be some 5,000 to 6,000 such models, and for each model you multiply it out to get the total portfolio value, and based on that the risk is calculated. So the business case is 3 million times 5,000: you have to touch each and every combination, calculate, and find out the risk. That is 15 billion computations. On top of that, a lookup happens for each model: there are 2 million products per model, so for each calculation you look up the price. Imagine it is a hash map and you are doing continuous lookups; it is a lookup plus a calculation every time.

And we want to do it in real time. The current system, which looks at just 5 days of data, is not good enough; we want to see 6 months to see how my risk model is working, and currently that data is on tape, so there is a big 2-3 day process to bring it from the tape, load it, and then look at it. I am going to show you a demo of how we are doing it.

Before the demo, let me show you the data, simple data. This is an account, this is the product, and this is how many stocks or futures I am holding. So this is one type of data, which is all about quantity. Now I will check: say this is product ID 1, the holding says 18323. Then we go to the prices: in model 1 there are going to be around 2 million of them, in model 2 the same, and so on; I have taken 5,000 models. Each model tells us what the price is, and then we compute and total it at the end. So 5,000 models, and in each model what is my total portfolio value: 15 billion computations.
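To make that arithmetic concrete, here is a deliberately naive sketch of the valuation loop those numbers imply. The names are hypothetical, and the talk's whole point is that the real system does not hold these as JVM maps; this is only to show where the 15 billion lookups and multiplies come from.

```java
import java.util.List;
import java.util.Map;

// Naive illustration only: ~3 million positions x ~5,000 models, one price lookup per pair.
// The real system never builds in-memory maps like this; it scans binary HBase cells instead.
public class NaivePortfolioValuation {

    public static double[] valueAllModels(Map<Integer, Long> positions,            // productId -> quantity (~3M)
                                          List<Map<Integer, Float>> modelPrices) { // one map per model (~2M prices each)
        double[] totalPerModel = new double[modelPrices.size()];                    // ~5,000 models
        for (int m = 0; m < modelPrices.size(); m++) {
            Map<Integer, Float> prices = modelPrices.get(m);
            double total = 0;
            for (Map.Entry<Integer, Long> pos : positions.entrySet()) {
                Float price = prices.get(pos.getKey());                             // one lookup per position
                if (price != null) {
                    total += pos.getValue() * price;                                // one multiply-accumulate
                }
            }
            totalPerModel[m] = total;                                               // portfolio value under model m
        }
        return totalPerModel;
    }
}
```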
I have this infrastructure. The machines are from Amazon, and the front server where we have hosted this application is in Azure, so we have mixed Azure plus Amazon. We hit the Azure front machine, then we hit the Amazon cluster: 8-CPU machines, 10 of them, and the hard disk is just regular SATA; we have EBS there as well. Total memory is 300 GB; each machine has 30 GB, so 10 times 30 is 300 GB. Now I am going to fire the computation; after this quick run I will go through the design and then we will open up for questions. So I am going to run it. When I run it, it starts my VM, and that is going to take around 40 seconds.

The whole objective for us was to get idle time below 10% with maximum CPU usage, and we are running it in a 4 GB JVM. So the JVM is 4 GB, and we have simulated an ultra-fast SSD by reading the data from the OS cache: all the data is in the OS cache, we read it from there, and it acts like an ultra-fast SSD. In real production the SSD drives would be there; for this demo the OS cache is giving me very fast reads, but the JVM size is only 4 GB. It is over, because I can see 100% CPU and the other things are done, so let us go ahead. The result has come, and it is 44 seconds this time. These are the different models, the total portfolio value under the different models got calculated, and this is the risk chart that comes out.

So that is fine. Now let me start opening up the architecture behind it, and I believe at that point you can correlate a lot of things. If we had 64-core machines, and a lot of them, millisecond performance would come; I have just 8-core VMs with Amazon, regular AMD processors, and it comes to this 40 seconds. I will share one of the benchmarks with you once we do it in the Intel labs next week, so you can directly access the demo online and see the findings.

The business benefit is that they do not have to work with just 5 days of data; they can see all six months, readily available for computing. Second, the solution is not a JVM-cache-based solution: it is only a 4 GB JVM cache per machine for the processing, so it is only 4 GB times 10 machines, which is 40 GB in total. You can see the infrastructure for this run here; this is a historical run I have pulled up. Across the cluster the CPU usage is above 95%, the IO wait time is absolutely zero (there is no IO wait, because the ultra-fast SSD is simulated by the Linux OS cache), and the idle time is less than 10%. The memory you see here is the overall memory, 300 GB; this is the JVM memory we are using, which is only 40 GB across all the JVMs; and the used memory, which is what the OS cache is doing, is around 200-plus GB.

Okay, so let us go through the design. Please feel free to stop me and ask questions at any time, because now we are going to go through the design decisions.

One HBase cell contains either 16 MB or 18 MB, so one key-value is 16 MB or 18 MB. We do not store one string or integer or float per cell; we take one big 16 MB chunk and store it there. The position table is what we slide as we process: we load it and then keep passing it along, computing across the different models, and the models get computed one by one. All these 16 MB or 18 MB byte arrays are binary sorted, and we do lazy deserialization. We never deserialize and put the data into a list; we deserialize just enough to look at it, or we search and find it within the binary data itself. We never load it into a list or a map and then look it up, because the JVM cannot handle it if we do that.

The schema design: for the positions, the key is the date, and for that date the value is all the positions, which are 3 million; the total binary size is 18 MB (the text size would be 50 MB). For the models, the key is the price model and the value is that model's prices: all 2 million of them in one cell, about 16 MB.
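A minimal sketch of what looking up one product's price inside such a cell might look like, assuming the fixed-width layout described later in the Q&A (a 4-byte product ID plus a 4-byte price per record) and assuming the records are sorted by product ID. Names are hypothetical; this is not the actual HSearch code, just an illustration of searching the raw bytes without materializing a collection.

```java
import java.nio.ByteBuffer;

/**
 * Sketch of lazy deserialization over one ~16 MB HBase cell: the cell is a byte[]
 * of fixed-width 8-byte records (int productId, float price), binary sorted by
 * productId. We binary-search the raw bytes directly and never build a List or Map.
 */
public class PricedCell {
    private static final int RECORD_SIZE = 8; // 4-byte productId + 4-byte price

    /** Returns the price for productId, or NaN if the product is not in this cell. */
    public static float lookupPrice(byte[] cell, int productId) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        int lo = 0, hi = cell.length / RECORD_SIZE - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int candidate = buf.getInt(mid * RECORD_SIZE);   // decode only the 4-byte key
            if (candidate < productId) {
                lo = mid + 1;
            } else if (candidate > productId) {
                hi = mid - 1;
            } else {
                return buf.getFloat(mid * RECORD_SIZE + 4);  // deserialize only the matching price
            }
        }
        return Float.NaN;
    }
}
```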
Now the cross product happens between these two. As I was mentioning, for one model we are keeping all 2 million prices in one cell. If we do not do that, we found that in HBase one day of data takes around 500 GB if you just load it row by row, and we have 6 months of data. If we instead put one model in one cell, everything in one big chunk, it comes to only around 70-80 GB. That is the benefit we are getting, and that is why we decided to load it in big chunks.

The second thing is why we moved from RAM-based to SSD-based processing. We do not want to process in RAM because of the amount of garbage being generated; the Java process collecting it could not keep up, it was generating more garbage than it could collect. So we came out of that and moved to SSD, and what we achieved is what I was showing you: 95% CPU usage with only a 4 GB JVM per node, so we do not have any Java GC issues. We do not like GCs because we do not know how to tame them. However much tuning we do, going through the Cloudera JVM recommendations, going through all of Stack Overflow or who knows what, doing our own thing, we still do not know how to tame the GC. This was a big problem for us. So finally we said we are not going to handle GC. And for disk latency: if we do not do it in RAM, then it is on disk, and for the disk we went with SSD. We hook into the SSD, and the amount of disk IO we can read keeps the CPU maximized, so the disk is not a bottleneck with SSD. The third thing is that for this demo I have simulated an ultra-fast SSD with the Linux OS cache; again, that is not Java, it is the operating system's cache. So no IO wait, no garbage collection. We, and I personally, have had enough of GC.

The second concept was that we wanted to compute everything locally, in parallel. It is a very big concept. Whenever I am reading the data, there is a DFS client which HBase uses to read it, and the DFS client reads it chunk by chunk over the network; even though it comes from the local node, it still goes over the network. We wanted to read directly from the disk, compute, and process, with no intermediate network. We did that with the HDFS short-circuit read: there is a flag for short-circuit reads, and if you turn it on, it reads directly from the local disk and just gives the data to you, so no network comes in between. Again, no network call, no DNS lookup, completely local, as if my program is simply opening a file and reading it.

The second piece is the DFS client cache. We did not want every read to go to the NameNode for the data block locations, so we cache them, and the cache is very efficient because there are only about 500 blocks, since we have such big chunk sizes. We have set it up that way: only 500 blocks get cached, so it does not make repetitive calls to the NameNode. Everything goes local; you will see no network calls, it is just a local read of the file from the SSD and compute.
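For reference, a minimal sketch of how the short-circuit read flag can be set programmatically on a Hadoop Configuration. The property names below are from Hadoop 2.x distributions and may vary by version; the domain socket path is only an assumed example and must also be configured on the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfig {
    /**
     * Sketch only: enables HDFS short-circuit local reads on a client/RegionServer
     * Configuration. The socket path is an assumed example, not the talk's value.
     */
    public static Configuration withShortCircuitReads(Configuration conf) {
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
        return conf;
    }
}
```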
The next thing is how we calculated our region server size. The way we wanted to process is: one core, one price model; second core, second price model. We do not want interrupts, we do not want context switches, we do not want any of those conflicts. Our game is one core, one price model: compute it, finish it. That is how we calculated it: one thread per price model, 16 MB is the cell size, let us say 5,000 models, and the number of CPU cores available. If we give those settings, the region servers get created appropriately and one thread goes per model.

Please wait, I wanted to show you the Intel labs case; it is not here, we should have finished it by now. Next week I am going to finish it, I will share it with you, I will post it there. This here is all on small machines.

This is a MapReduce architectural pattern; it is not MapReduce the way you normally run a job. Through the HBase coprocessor we make calls to all the regions, each 16 MB key-value gets processed there, and finally everything gets combined again at the coprocessor/region server level and pushed back, so we have a MapReduce kind of pattern implemented. API-wise it is very simplified, something everybody knows. There is no sorting and shuffling, which you might have seen in the last presentation; we are not doing any sorting and shuffling, and no network. It is just an API, a framework we have introduced, because the people who come to us are already doing MapReduce, so we have introduced the same calls so that they do not have any additional learning; they can just quickly pick it up. So the parts get processed in the coprocessor and finally we get the output; internally we are using this MapReduce, which I call real-time MapReduce, plus this byte-array deserialization.

There is an open source project called HSearch which is on GitHub; you can check it out, hsearch-core. Primarily I have used it for pharma, we are using it for the financial market, and we have used it for telecom, so we are using this one a lot. The new problem which has come to us is to use it for processing around 18 terabytes in real time, for oil wells and other things. So please do take a look at it.

This is the broad architecture: we do feed loading into an HSearch index, an indexer where big blocks get collected and finally written to HDFS; on the other side we do real-time MapReduce, where we process and get the data out. We like HBase because it is scalable, real-time, and Apache licensed, and we built HSearch for the sorted binary lazy deserialization, the real-time MapReduce, and the extreme parallelization. So I am open to questions, please shoot.

Question: Can you talk a little about your serialization and deserialization, and how you came up with the 18 MB size? Did you benchmark it? How do you think it played with your RAM and your disk, and would that 18 MB change in some other situation? And one other question: you said you are using SSD instead of RAM, so are you deserializing and putting it on disk? What did you mean by that statement?

Answer: The 16 MB is very simple. It is one product ID and one price: 4 bytes for the product ID, 4 bytes for the price, 8 bytes per record, 2 million records, 16 MB.
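Spelling out the arithmetic behind those figures, as a back-of-envelope restatement assuming exactly 2 million products, 5,000 models, and the 10 machine x 8 core cluster from the demo:

```java
// Back-of-envelope restatement of the sizes quoted in the talk (assumed round figures).
public class SizingMath {
    public static void main(String[] args) {
        long products = 2_000_000L;              // prices per model
        long recordBytes = 4 + 4;                // 4-byte product ID + 4-byte price
        long cellBytes = products * recordBytes; // 16,000,000 bytes: the ~16 MB cell per model

        long models = 5_000L;
        long totalBytes = models * cellBytes;    // 80,000,000,000 bytes: roughly the 70-80 GB per day quoted

        long cores = 10 * 8;                     // 10 machines x 8 cores
        long modelsPerCore = models / cores;     // ~62 price models per core, processed one at a time

        System.out.printf("cell=%d B, total=%.0f GB, models/core=%d%n",
                cellBytes, totalBytes / 1e9, modelsPerCore);
    }
}
```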
Question: Could you have taken more data? Could you have put multiple product IDs together with, say, a 32 MB chunk size? Would that have given you better performance?

Answer: Yes, you are correct. For this demo I have used that size; it is very subjective. We give a somewhat bigger region size so that the cell does not get split: to fit it we can say the maximum cell size is 24 MB, so two will not fit but one will. That is how we want to process. We can make it bigger, actually, that is not a problem; the only thing is that then, for one price-model computation, we have to write the parallelization logic to read it and split it up ourselves. We have done that too: we make it very big, read it in one shot, and then parallelize. And those 16 MB arrays we just reuse: we refresh them every time, fill, process, free, fill, process.

To answer the second question, when we deserialize, do we put it on disk? No, it is not deserialized; we never fully deserialize, it is lazy deserialization. As I am deserializing, I am computing, because we do not use any list inside the code, no list, no map, nothing. There should not be any list, because if we build a list the JVM comes in again and we hit that problem. We just read it. It is like a visitor pattern: we hand the computation over to it. Imagine this 16 MB cell; we have a visitor over it, and the code is continuously calling it and processing it as it deserializes.
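A minimal sketch of that visitor idea, with hypothetical names and assuming the same 8-byte record layout: the scan walks the raw cell bytes and calls back per record, so nothing is ever collected into a list or map.

```java
import java.nio.ByteBuffer;

/** Sketch of the visitor-style scan: deserialize and compute in one pass, no collections. */
public class CellScanner {
    /** Callback invoked for each record as it is decoded. */
    public interface PriceVisitor {
        void visit(int productId, float price);
    }

    /** Walks one ~16 MB cell of fixed-width records (4-byte id + 4-byte price). */
    public static void scan(byte[] cell, PriceVisitor visitor) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        while (buf.remaining() >= 8) {
            visitor.visit(buf.getInt(), buf.getFloat()); // compute while deserializing
        }
    }
}

// Hypothetical usage: accumulate one model's portfolio value while scanning,
// where quantityOf(id) stands in for the position lookup.
//   double[] total = {0};
//   CellScanner.scan(modelCell, (id, price) -> total[0] += quantityOf(id) * price);
```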
Question: Hi, you explained about using the direct file method to improve the speed. Is it the same HDFS, with some parts used for raw data computation and some parts for real-time computation, or how was this speed achieved in terms of direct file access?

Answer: Let me rephrase the question. If I understand correctly, you are asking about real-time versus batch, two different things. The way I look at it, the real-time configuration and the batch configuration are totally different, and the clusters should also be different; it should not be the same cluster.

Question: Yes, but you talked about local reads, you started using local reads. What is that configuration, and how did it improve things? Otherwise I could just use a simple file system, right? Why do I need Hadoop? I could put the data into a directory on an SSD and access it, and then it becomes real-time.

Answer: In the newer versions of the Hadoop and HBase distributions you will find a configuration for short-circuit reads, and you can turn it on or off. But the thing is, even if you turn it on, the data still may not be available locally. To make it available locally you have to touch it once, or you need some other way of knowing in which location the DFS data block actually sits; otherwise the easiest way is to touch it once, and then you have it. The second thing is that this problem only comes up if you restart your cluster, not the first time.

Question: Do you have a single reduce shot? Meaning, how are you avoiding the shuffle?

Answer: Because at the time of writing the data we have kept the whole price model in one place, there is nothing to shuffle: it is one price model, and all the related data never goes to another server. That holds good even for the pharma industry, where one clinical study goes to one cell, and for building energy efficiency, where one building's data goes to one cell. Everywhere it is the same thing: one big chunk becomes the merge key, and on that we do the grouping, distributed grouping.

More questions? How much time do we have left? Very good. Okay, as we have a little time, let me introduce myself. I am a committer of HSearch, and I presented HSearch for the first time in 2011 at the Yahoo Big Data Summit in Bangalore. That is when we started writing it. We wrote a lot of code, but we did not focus on the adoption side, no documentation, nothing. One good thing we did, though: as we applied it at different customer locations on different use cases, we had to make it really easy for people to use.

The second thing is which use cases HSearch addresses: distributed grouping. For less than 1 terabyte, Solr is very good at distributed grouping; for 1 terabyte or more, HSearch is very good. That is how we are doing it. It is all real time; we are focusing only on the real-time nature of it, and all the language features of Solr, stemming, lemmatization, all of that is supported. Whenever we do the indexing it goes into these big cells, and from there we process it, and we have given a plug-in architecture where you can write the MapReduce. You can take a moment to read it; otherwise, if somebody comes up with questions, I am here after this. If you want to discuss your use-case-specific design, I will be available here as well. Thank you for listening.