Hi, everybody. My name is Billy Bosworth, and I'm not Matt Pfeil. As it says on the sheet, I'm the CEO of DataStax; Matt is our co-founder. He got caught down in Austin, Texas, on some travel. So, much like in my company, I am largely overhead here for this presentation, and Scott is going to be the one giving you all the meat and the details. We had a really great time over the past couple of days catching up on two different technologies, and one of the things that is interesting about Scott's case is that it's not the typical big data case. It's not a world where they were overwhelmed with a velocity problem or a volume problem; what brought them to the world of NoSQL was actually quite different. We'll be discussing a few of those points as he goes through the presentation, but I'll turn it over to him now and let him get it kicked off, and hopefully you guys can get some good data out of this. So Scott, thanks for coming, and take it away.

Thank you. Hello, everybody. Today we're going to be talking about getting big data healthy. The company I work for is called HealthX. We're a pretty small company in Indianapolis, Indiana. We have around 60 employees, with about half of those being development and IT staff. What we do is provide insurance companies with web portals: when you think you're going to your insurance company's site, you're actually coming to us and doing all your tasks on our site. Some of the things we provide are ways to talk to your HR department, ways to ask questions of the insurance company, and things like looking at your claims and getting new temporary ID cards. We started in a smaller market called TPAs, third-party administrators, and have since grown to serve health plans and Medicaid and Medicare.

As for HealthX's technology: we are a 100% Microsoft shop, or, to be more accurate now, a 97% Microsoft shop. We run on a Microsoft SQL Server 2005 database. It's about six terabytes and about 3.8 billion rows, and we run it mirrored. So we're not tiny, but we're not by any means big. We run Microsoft IIS web servers on the front end, our middle tier is Microsoft also, and of course we use C# as the programming language.

The service we moved over to Cassandra was our provider directory search. In SQL Server, the team that wrote this service went with an extremely normalized structure, and we ended up with about 20 tables and about 5.7 million records. What we found was that when the first client requested this new feature, we only added a small segment, just what they needed. Then the next client found out we were doing this, and they said, oh, we want to do this too, but we need this added, and we need this added. Every single new client we added to the service caused us more index pains.

What we've ended up with today is that, as a member, you can go on our site and find the right doctor for you. Let's say you're having something wrong with your foot: you're going to look specifically for a podiatrist, you want them within 10 miles of your house, and maybe you want the podiatrist to speak German. You can run that query to find doctors who meet those criteria. Well, when we originally designed it, it was just to find doctors close to my house. So the next thing we knew, we were adding new indexes to handle the new searches, like what their specialty is or whether they're affiliated with a certain hospital, and it kept getting worse and worse. We were adding new indexes all the time, we would start missing indexes, and with so many different tables we'd run into some bad joins, which caused us poor performance.
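For flavor, here is that index creep in miniature. This is a toy sqlite3 sketch with invented table and index names, not HealthX's actual schema (the real system was SQL Server 2005 across roughly 20 tables), just to show the pattern of one new index per new client request:

```python
# Toy illustration of the index creep described above; sqlite3 so it runs
# anywhere. Table, column, and index names are invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE provider (
    name TEXT, specialty TEXT, language TEXT, hospital TEXT,
    lat REAL, lon REAL)""")

# Original feature: find doctors close to my house.
db.execute("CREATE INDEX idx_provider_geo ON provider (lat, lon)")
# Next client: also filter by specialty.
db.execute("CREATE INDEX idx_provider_specialty ON provider (specialty)")
# Next client: spoken language; then hospital affiliation; and so on.
db.execute("CREATE INDEX idx_provider_language ON provider (language)")
db.execute("CREATE INDEX idx_provider_hospital ON provider (hospital)")
```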
So we looked at a different option besides SQL Server. The reason we did that is that we're always trying to lower the reads and writes on our main SQL Server. We do that by continual optimization of all our stored procedures, but this was an opportunity to just move an entire subsystem off our SQL Server and put it into something else. We looked at adding a different SQL Server to do this, but not only do we have the issue of SQL Server licensing costs (with the new licensing being core-driven, it was going to be even worse in our case), we also have a SAN behind it that we'd have to upgrade. So we started researching different NoSQL solutions. The way that research worked was that my boss, the CTO, would hear of some company running some new database, whatever it was, and he would throw it over the wall to me. I would fiddle with it, try to get it to work, see what the results were, and move on, and he just kept doing this. So we tried different ones over the years; some of them, I admit, I couldn't even get installed, so I'm like, that's not a good option for us, because we're a Microsoft shop. And finally one day he threw Cassandra over the wall, and it just clicked.

The biggest thing for me, being more of an IT guy than a developer, was that my database servers become a farm. One key thing that caught me, with Cassandra in particular but NoSQL in general, was that I don't have this big iron box anymore; I can put in lots of little boxes and treat them just like I do my web servers today. If a web server pages me in the middle of the night and says it's blown up, we turn off the page and go back to bed. If we get one bad page on our database servers today, everybody gets that page, and we're all up determining whether this is a moment where we have to go to the mirrors or something we can fix real quick. So that was very intriguing to me: making the database server not so critical, so that we could have multiples and it was no big deal. Another thing is that maintenance tasks on our system today are kind of ugly. When I need to shut down the database server, I have to choose either to flip over to the mirrors, which can be a long outage, or to kind of sneak in outages sometimes and get some patches put on as we need.

With the provider directory having this search problem: DataStax Enterprise glued Solr to Cassandra, so now I have DSE Search available to me, and for the provider directory the approach is "index everything and we'll deal with it later." As for choosing to go with the commercial side of the product instead of just doing Cassandra only: this was a major transition for us, going from a completely Microsoft stack to an open-source stack. We were going to need some help, and DataStax was there for us and was really helpful in getting us going early on.
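To make the "index everything" idea concrete, here is a minimal sketch of the earlier podiatrist search against DSE Search's Solr HTTP interface. The host, core name, field names, and coordinates are all invented for illustration; DSE exposes each indexed column family as a Solr core, and the spatial filter assumes a Solr LatLonType location field:

```python
# Hypothetical sketch: the "podiatrist within 10 miles who speaks German"
# search from the talk, run against DSE Search over HTTP. None of these
# names are HealthX's real schema.
import requests

SOLR_URL = "http://dse-node1:8983/solr/providers.directory/select"

params = {
    "q": "specialty:podiatrist",          # what kind of doctor
    "fq": [
        "languages:german",               # spoken-language filter
        # geofilt needs a LatLonType field; d is in km (10 miles is ~16 km)
        "{!geofilt sfield=location pt=39.7684,-86.1581 d=16}",
    ],
    "wt": "json",
    "rows": 25,
}

resp = requests.get(SOLR_URL, params=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("name"), doc.get("specialty"))
```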
So Scott, before you go off this slide: we were talking quite a bit about some of the conversations here at the show. There is no small amount of passion in the debate between the traditional relational technologies and these new technologies. You, being a 100% Microsoft shop, told some good stories last night that I think might be helpful for people. What cultural barriers were you hitting, in addition to the technology barriers? Because you had the technology considerations on the one side, but you had an army of people who had been doing this for a long time, right? The 100% Microsoft shop. What was that like?

First, it was really interesting. My lead developer, whom I was going to need on the Cassandra project, wears Microsoft socks; he bleeds Microsoft; he uses Microsoft exclusively, and that is his choice for everything. And here I was telling him, you need to learn Java. He's like, "I know Java, but I choose not to use it, because C# is better," you know, whatever. So he really struggled with that at first, but I gave him the ability to just do whatever worked for him. He actually ended up with pieces of it written in Java and pieces written in Python with Pycassa. He's just been learning new tools, and now it's fun for him. I know it's not really part of this, but we just implemented some new monitoring, and he actually wrote it all in Python; it's becoming his little language of choice right now. As for the rest of us: I came from more of a Unix background, so the transition wasn't as bad for me. The other developer was more excited about the ability to do something different. And the third developer kind of got into it eventually, but it took a while; it was a rough transition.

So, our solution. Our DSE ring is three commodity-class machines; I specifically went with our standard web server config and added disk, and that was it. When I looked at buying a new database server, just the server alone: the three machines run about half the price of what one of my database servers would have run.

We still had some things in the provider data that we used inside our application, so what we decided to do for our solution was to let SQL still handle the ETL. We do all kinds of tasks on the data as it comes in from our clients, before we put it into our database; why would we want to rewrite all that for Cassandra? So we still load the data through SQL, then extract it back out of SQL, getting the exact answers we needed to solve these queries. We only made a small change to our data access layer: basically, if the DSE record exists, we convert the query to a Solr search. The goal was to have standard results pulled back into our system from either side, whether it's Cassandra or SQL. And then we needed the ability to move clients back and forth at will, because we knew this was going to be a big change for us, and we needed a way back home to SQL. One thing also, as this has moved forward: we've used a lot of different terms. If I say DSE, if I say Solr, if I say Cassandra, if I even throw in a Solandra, it's all the same to us; I'm talking about DSE using the search abilities.

So, loading the data. For the actual extraction of the data, we use a mix of Java and Python with Pycassa. The reason we chose Java for the first part was specifically because it had Microsoft drivers; it was something to make our developers comfortable, because they knew we had good drivers talking to the SQL Server. The Java program, all it does is literally extract the data we needed from the stored procedures. We had two stored procedures to worry about; we looked at what the results were from those and pulled all the data we needed from them. Then we have our Python piece, and the Python piece transforms the data: it creates a hash field from the multiple keys we needed to get unique records, and it converts the dates and geolocation formats to what Solr needed.
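A minimal sketch of what that transform-and-load step might look like in Python with Pycassa. The keyspace, column family, and column names are invented, and a literal row stands in for the output of the Java extract:

```python
# Hedged sketch of the Python transform-and-load step, assuming Pycassa.
# All names here are invented for illustration, not HealthX's real schema.
import hashlib
from datetime import datetime
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool("providers", server_list=["dse-node1:9160"])
directory = ColumnFamily(pool, "directory")

# In the real pipeline these rows come from the Java extract of the two
# stored procedures; one literal row stands in here.
extracted_rows = [{
    "provider_id": "12345", "plan_id": "67", "location_id": "3",
    "name": "Jane Smith", "specialty": "podiatrist",
    "effective_date": "08/14/2012", "lat": "39.7684", "lon": "-86.1581",
}]

def transform(row):
    # Hash the multiple business keys into one field to get unique records.
    key_src = "|".join((row["provider_id"], row["plan_id"], row["location_id"]))
    row_key = hashlib.md5(key_src.encode("utf-8")).hexdigest()

    cols = dict(row)
    # Cassandra takes any string, but Solr is strict: dates must be
    # ISO 8601 with a trailing Z, and LatLonType wants a single "lat,lon".
    cols["effective_date"] = datetime.strptime(
        row["effective_date"], "%m/%d/%Y").strftime("%Y-%m-%dT%H:%M:%SZ")
    cols["location"] = "%s,%s" % (cols.pop("lat"), cols.pop("lon"))
    return row_key, cols

for row in extracted_rows:
    row_key, cols = transform(row)
    directory.insert(row_key, cols)
```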
At that point we just used Pycassa to load the data into the DSE ring. Our data set is actually small, a very, very small segment of our system, but we wanted to test and see how this would work. We can reload our entire DSE ring in under an hour; it runs about 64 gig. So if you think about our six-terabyte database, which is our SQL production system, we pulled off a little teeny tiny chunk into 64 gig and used it as our test case, but a test case we were taking all the way to production.

So you didn't have to redevelop the entire world against Cassandra; you just took one small subsystem and moved it over with your small three-node ring. Talk a little bit about your data center considerations. We talked about how you have a single data center but you kind of treated it like two; it was important for you to spread that load across, correct?

Yes. Our data center is unique for us in the fact that each room is like its own mini data center: each room has its own AC supply, each room has its own network, each room has its own power. We put ourselves into two different rooms for redundancy, so one room is powered by company A and another room is powered by company B, and that gives us good redundancy even though we're in one data center. We've had power issues and things like that where we've lost one room, but the other side of the data center was still fine. When we put DSE in place, we took advantage of the same thing: I actually have every one of the three nodes on a different rack, and they're also split between the two different rooms. We haven't seen any issues with this. It took us a bit to get the settings right and get the networking configured correctly, but once we figured it out, it just runs.

Now, the data access layer changes. We went back and forth on this a lot, determining where and when we were going to make the switch. Our app is very standard: we have the UI, which talks to our business layer, which talks to the data access layer, and that's what chooses our source. On the SQL side, we use standard C# to get into our database, which returns us an IDataReader. So what we did was write our own customized JSON data reader, so that it returns the exact same results to our application; the data transport object coming from the data access layer back up to the business layer doesn't care whether it came from Cassandra or from SQL, it works the same either way. The way we did it: when the stored procedure call comes in, we wrap it in our stored procedure class. At that point we can grab it and check whether this stored procedure is special or not. If it was flagged for the Cassandra project, it pops over to another table in the SQL Server that marks the transition (it's known to go to DSE), and that's where we put the query string for Solr. Then it pulls that up, the JSON reader grabs the data, and it pulls it back into the IDataReader, just like we always do.
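A rough sketch of that routing idea. The real HealthX layer is C# and returns an IDataReader; this Python version is only meant to show the shape of the logic, with an invented mapping table standing in for the SQL Server table that flags the special stored procedures:

```python
# Python-flavored sketch of the stored-procedure routing; invented names.
import requests

# Stands in for the SQL Server table that flags which stored procedures
# have moved to DSE and holds the Solr query endpoint for each one.
DSE_PROCS = {
    "usp_ProviderSearch": "http://dse-node1:8983/solr/providers.directory/select",
}

def run_sql_proc(proc_name, params):
    raise NotImplementedError("normal SQL Server path, not shown here")

def execute_proc(proc_name, params):
    solr_url = DSE_PROCS.get(proc_name)
    if solr_url is None:
        return run_sql_proc(proc_name, params)  # untouched SQL path

    # DSE path: run the Solr query, then flatten the JSON response into
    # the same row shape the SQL path produces, so callers can't tell
    # which source answered (the role of the custom JSON data reader).
    resp = requests.get(solr_url, params={"q": params["q"], "wt": "json"})
    resp.raise_for_status()
    return [{"name": d.get("name"), "specialty": d.get("specialty")}
            for d in resp.json()["response"]["docs"]]
```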
Some observations. We find that SQL does take longer for the query than DSE. One of the reasons for this would be that there are no joins (we have a lot of joins in our data right now), and the index is pre-built as you throw your data into Solr, so the index is always clean for us. We do find that, because we're using an HTTP layer, SQL is more efficient on larger data-set pulls. So if somebody goes to the provider directory and asks for Dr. Smith and doesn't care about anything else, they can get a lot of results back, and SQL does a better job of throwing that across the wire than HTTP does. We could obviously switch off going through Solr and go down to Lucene to get around that issue, but it's something we've noticed.

Fun things we've noticed afterwards: as we've been getting the data into Solr, working on it, and figuring out the issues we had with it, we've started to see that it would be better if we had designed it this way inside DSE Search from the start. So if the day ever comes for a rewrite, we will most likely rewrite the provider directory structure to take advantage of the way Solr works in searching.

We had a client (we have lots of clients, and they elect to test on us at different times), and out of the blue I got a case saying their provider directory results were not working anymore. I was panicking: oh crap, what do we do now, do we have to flip back again? I was looking through it, and they didn't give us the queries they ran; they just gave us their total numbers. So I got back with them and said, okay, what results were you looking for? What were you doing in your searches? We took all that into our test tools and tried to figure out what in the world was going on: why is SQL reporting different numbers than Solr? And they weren't lying. When we ran our tests, sure enough, the results were way off, and we could not figure out what was going on. It was just a surprise, but Solr actually produces better results than our SQL does. In our SQL, if you have a zip code stored as nine digits and you send it a five-digit zip code, you're not going to find a match, because it didn't match the full text. In Solr, that's tokenized on the dash, so five digits would work, nine digits would work; either way would work. If you typed in the name Jose, you might not get any results on the SQL side, but on the Solr side it would understand that you want Jose, or maybe San Jose. So it came down to telling our client: oh, by the way, we switched you to a new system that does the searches better. They were happy it wasn't broken again and that we hadn't done something wrong. That was a neat thing we found out.

Could you have forced that to work properly, or as expected, in SQL using the different wildcards? And I wonder why that wasn't done from the outset.

It's painful on cost and ease once you start doing LIKEs inside your queries. Zip codes are a small field that could easily have been handled that way, but on larger text fields in particular you can get yourself into a mess with the timing of your LIKEs. Our company really focuses on the reads and writes of every single stored procedure, so the developer knew for sure that if he was adding extra things to the query and getting bad performance, he was going to get tagged for it. I'm sure that was a choice by the developer: okay, I know this will be a tighter query, it won't pull the exact results back, but it should be good enough.
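A toy illustration of that difference, with a deliberately simplified tokenizer (Solr's actual analyzer chain is more involved, but the matching behavior is the same idea):

```python
# Why Solr matched where SQL Server's exact match did not: Solr tokenizes
# the indexed text, so "46383" finds "46383-2210" and "Jose" finds
# "San Jose". Simplified stand-in tokenizer, not Solr's real analyzer.
import re

def tokenize(text):
    # Split on whitespace and dashes, lowercase everything.
    return set(re.split(r"[\s\-]+", text.lower()))

stored_zip = "46383-2210"
query_zip = "46383"

print(query_zip == stored_zip)            # False: SQL-style exact match
print(query_zip in tokenize(stored_zip))  # True: token match

stored_city = "San Jose"
print("jose" in tokenize(stored_city))    # True: "Jose" finds "San Jose"
```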
Challenges. There were challenges to making this transition. We load the data into Cassandra, and Cassandra handles the indexing, matching the schema. But what we realized and found out was that Cassandra would take whatever data we put in: if the date format was just a little different, or the lat/long was a little different, Cassandra didn't care, but Solr did care. So we ran into times when everything was broken on us. During the time I was trying to fix these issues, we ran into a lot of schema changes, and if you're not careful with your schema changes, the index can get out of whack with the data that's in Cassandra. We went round and round and round trying to figure it out. To solve it, we were lucky in the fact that our data set was small, so we did a lot of drop-it-and-rebuild-it, drop-it-and-rebuild-it, trying to figure out what was going on. When DataStax Enterprise 2.0 was released, it actually included a new command under dsetool that allowed us to do full index rebuilds when we needed them. So if we mucked with the schema and got some inconsistent results, we were able to use dsetool rebuild_indexes to solve any index issues we had caused ourselves.

On the SQL side, we do lots of monitoring. We track every single query against our system, we track the reads and writes and report on them, and we are real hardcore about any kind of bad reads and writes. We've been doing this for 10 years and have really tightened down what good reads are and what bad reads are, and we know really quickly: you need to work on this stored procedure, you need to fix this query. We have yet to get that from Solr in exactly the way we're used to seeing it, so we've had some issues monitoring Solr beyond the standard IT things, just "is everything okay or not." The developers would rather see that this stored procedure performed this way; that's what they're used to seeing. So we're working on ways right now to get our monitoring, even though it's coming from Cassandra, to look more like SQL, so they can make comparisons.

So, the future of DSE at HealthX. We started with the new project on the provider directory. It's now been live for about three months, and we haven't had any issues with it; performance-wise it's been running well. The next question was: what's the next thing to put on DSE? We had a new issue come up, which was file logging with email updates. We get a lot of files from our clients, and they want to notify their members when a file is updated. Sometimes that's notifying the HR person that here are your billing statements, and sometimes it's notifying all the members that, hey, there's been a plan change we want to tell everyone about. We get about 50,000 files a day from our FTP servers. So we decided to build this new piece on DSE also, and this piece uses only Cassandra, not Solr, which was a little different for us. It actually just went live yesterday. So now we use both SQL and DSE to send our notification emails: the logging is all tracked in Cassandra, we run a job against that to see who we need to send emails to, look up the email addresses in our SQL Server, and then push the emails out through our email servers.
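A hedged sketch of what that Cassandra-only logging write might look like with Pycassa. The keyspace, column family, and columns are invented, since the talk only describes the flow (log the file event, then let a batch job read it back to decide who gets notified):

```python
# Hypothetical sketch of the file-logging piece; names invented.
import time
import uuid
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool("filelog", server_list=["dse-node1:9160"])
events = ColumnFamily(pool, "file_events")

def log_file_update(client_id, file_name):
    # One row per event; the notification job scans these later, looks up
    # the email addresses in SQL Server, and sends the emails.
    events.insert(str(uuid.uuid4()), {
        "client_id": client_id,
        "file_name": file_name,
        "received_at": str(int(time.time())),
        "notified": "false",
    })

log_file_update("client-123", "billing_statement_2012_08.csv")
```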
Our next task we're looking at is moving more of our application logging over, to use Hadoop for reporting. We track every single click on our system, including all the Web 2.0-type asynchronous transactions, and that can cause us hotspots on our SQL Server at times. So we're looking at moving that application logging to the DSE ring and then taking advantage of new reporting. The log doesn't need any indexes for itself, but for reporting we need lots of different indexes on it, which causes issues on the inserts; so we're always balancing the number of indexes against the speed of the inserts.

A dream for us on DSE is to eventually implement the one search box. The idea (the owner always likes to say we want a Google search) is that a doctor could come in and say, show me all my patients, or use the one search box to say, tell me everything you know about Joe Smith, or, are we having a lot of broken arms for some reason? You search in human text and just click this, click this, click this to get answers from our data. That's something on our list; we keep running little tests with it, and we haven't had great success yet, but we're continuing to work toward it.
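A speculative sketch of the catch-all approach Scott describes in more detail in the Q&A at the end: everything searchable gets copied into a single tokenized field at load time, and the box queries only that field. The names are invented, and in Solr the copying would normally be a copyField in the schema rather than client code:

```python
# "One search box" sketch; field names and core name are invented.
import requests

def build_catch_all(doc):
    # At load time, concatenate the fields that matter into one blob that
    # Solr tokenizes on whitespace.
    fields = ("name", "specialty", "city", "languages")
    return " ".join(doc[f] for f in fields if doc.get(f))

doc = {"name": "Jane Smith", "specialty": "podiatrist", "city": "Indianapolis"}
doc["catch_all"] = build_catch_all(doc)  # stored alongside the other columns

def one_box_search(text):
    # At query time, the one box only ever searches the catch-all field.
    resp = requests.get(
        "http://dse-node1:8983/solr/providers.directory/select",
        params={"q": "catch_all:(%s)" % text, "wt": "json", "rows": 10},
    )
    resp.raise_for_status()
    return resp.json()["response"]["docs"]
```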
At this time we're staying away from PHI on the cluster; we have a lot of security concerns with the data involved. We do have a partial solution for this, which is encrypted hard drives: we run IBM servers that allow us to do encryption at the RAID level on the hard drives, which would let us have our data at rest encrypted just like it is in SQL Server. But until we get that all settled and have better security on the front end, we're going to stay away from PHI and look at DSE more for logging at this point. As soon as we have a full security solution on Cassandra, we'll start looking heavily at moving some of our larger data sets over, to take advantage of Solr searches in particular, but also to try the full transition of SQL onto DataStax. We don't believe SQL will ever go away; we have 10 years of applications written on top of our current system. But we see that new things we do will always be Cassandra-first, and anything that's giving us pains, we'll look at moving over. As we fix different things, like the PHI issue, we'll look at new options, possibly duplicating data, with the Cassandra ring just becoming another option. So we might have stuff running in both SQL and Cassandra, or try to move some of our bigger subsystems over, like our claims system.

So Scott, you had a small project, you took a piece of it, and you worked on the integration. 64 gig, not big by any standards, right? We could put that on our phones or our tablets. Earlier in the presentation you talked about open source. You're using DataStax; why didn't you just go pure open source? Why not just get Cassandra from the Apache site, Solr from the Apache site? How about it?

Well, when you're used to the Microsoft stack, you're used to everything working together. Once you have Visual Studio running, it's all there: you can talk to SQL Server inside Visual Studio; if you need to write an application, you're going to do it in C#, or you can even change your language, all inside Visual Studio. So the idea of losing that glue: now maybe we have to use Java for a piece, maybe we have to use Python for a piece; oh, we've got a Linux server in here that we're not managing the same way we manage everything else. There were a lot of things to look at at one time, so we chose to go with the DSE product to get that expert knowledge of the open-source stack. The conversation we had with DataStax was: hey guys, it is what it is; there are only a few people here who can even spell Linux, let alone use it, in our company; you're going to have to help us through this transition. And sometimes we asked stupid questions, and it just is what it is: we didn't know the answer. But they were always good about letting us ask the stupid questions, and maybe ask a stupid question more than once, and that was really important. Because if you have a developer who's been coding for 20 years, he knows his job, he knows how to write programs, and he did not like having to ask what he considered very simple questions just because he could not find the answers in what was, for him, a different stack. DataStax didn't care; we could ask the stupid questions. There's always that little bit of fear, when you send your messages off on the web, of getting pounded because you're "stupid" for asking that question, and that was something we did not have to worry about with DataStax. It was kind of like: we paid, so too bad, you have to answer our stupid questions. It was very helpful for us, and it helped a lot of us get the project going faster and gave us a higher comfort level in what we were doing to make this a success.

One thing I'll leave with, and then if anybody has questions we can take some: has this changed your recruiting makeup, the people that you're looking for now to come on as developers or as operators or whatever? How has it changed the skill sets you're looking for and the types of people you're hiring? Are you finding that they're coming from broader backgrounds now? You were very narrow before; it was SQL. Has that helped, or has it become harder? A lot of the challenge people are having with these new systems, particularly our customers, is that it's great once they get the technology going, but then they have to source it: they have to find people to develop against this stuff, and it's a new ecosystem. How has that been for you guys?

It hasn't hurt us in any shape or form; if anything, it has probably helped us a little bit. Our newer developers tend to be younger, so the idea that we're running an open-source product gives us a cool factor that they enjoy. It's kind of funny: they're not the ones actually working on this project yet, but they talk about it as something they'd enjoy doing. So it has definitely opened things up, and it has also opened up our eyes on developers. Before, we obviously wouldn't talk to you unless you were a Microsoft-stack guy, because we ran Microsoft and that was it. Now, if somebody has some open-source experience, we look at it differently than we used to.

Yes, so in May of 2011 was when we finally decided that this was it. We started working with Cassandra as a very, very side project, just trying to get our feet wet and figure out what was going on. We have ways to generate data sets, so we just started building data sets and throwing things into Cassandra to see what happened. This was on Cassandra 0.7, so there were a lot of things that weren't there yet. Then, as you do, we'd get busy and put it down for a bit. Around November of 2011, we ran into Solandra, which was Jake Luciani's project; he did the first gluing of Solr on top of Cassandra. Then we got excited about the search, because we knew we had this problem with the provider directory, and it seemed like it could fit, since the provider directory was a small subset. So we started going really hard down this path of looking at Solandra. Right around the end of the year, we got into serious talks with DataStax. We had talked to them a couple of times before, but now that this was going to happen for real, I needed to have some support on this project. So around December we started working with DataStax, and DataStax told us: let's get the NDAs in place, and just so you know, we are going to be adding search. We immediately got on the early access program, got basically Jake's Solandra pieces that they had already put into DataStax Enterprise 2.0, and started working on it then.
For us, the hard work started in January. The application developer who did the glue piece on the data access layer probably didn't even start until March; it was more me on the server side, and my other developer, trying to figure out what we were doing. We had lots of issues at that time, working with DataStax, saying, hey, we need to do this or that, and them saying, no, actually, you just need to do it directly. So the project really got into full swing in the first part of this year. The moment DataStax Enterprise 2.0 was released, we upgraded everything, threw it up, ran tests, and got into some schema problems; it took us probably a month to really get that nailed down. We went live in April. So the real time period was all the way from May to April, but we were really working hard on it for the first part of this year. And a lot of that was just: we're going to try Hector, we're going to try Pycassa (at first we were fighting with both of those), and we were going to use Aquiles, I believe, in the C# arena. We were trying different things just to see what was going to work for us. On the logging piece, we did end up switching back to C#, using the fluent library that's out for Cassandra. So we're kind of letting everybody do what they want to do right now, to figure out what the best tool is for this.

Yes, our schema problem was that Cassandra would accept the data and Solr wouldn't, and in the early releases, Solr did not return the errors to us. We were having some performance issues getting the data loaded, and then once we got it loaded, Solr didn't know the answers. Once we got working with Jake on that, we realized that Solr didn't understand our date format. Once we figured that out, we flipped our dates to the correct format and, boom, magically it started working. We had the same issue with the lat/long: we had to get that into the correct format. And because there was such a delay before the data access person started writing, we had all the Solr stuff done when he started, and then he'd say, oops, I need another field, and we'd go back to the schema and change that field. Then we'd get that solved and he'd say, oh, this field needs to be tokenized, it can't be just text, and we'd tokenize that field. And as we were doing this, we were constantly updating the data; at that point we were running live streams, so we were constantly making these little tweaks, changing the schema with live data coming in. That did not work well for us at the time. Now that we've stopped changing the schema, we haven't had any issues like that.

Yes, maybe Solr will handle it. What we're looking at right now, and this is still very preliminary, we're just trying to figure it out, is taking all of our keywords and throwing them into a single search field; the search itself is plain Solr, which gives us our answers back. We're trying to work out the business logic part of how you translate that to give them the right results. The tests on the Solr side, taking all the fields that matter and throwing them into one field, are working fine. You literally rip out all the things you don't want, put the answers into one field, tokenize it on the space, and that works fine for us. It's taking that and giving you good results that's the hard part.

Any other questions? All right, thank you, guys.