Good morning and welcome to the November 27, 2012, Peer Incite. The topic for today's discussion is Optimizing Infrastructure for Analytics-Driven Real-Time Decision Making. Just a couple of reminders before we get started. If you're not speaking, please mute your line; you can press star six to mute, and if you have a question, you can press star six to unmute. This call is being recorded. For those of you who are tweeting, please use the hashtag wikibon, W-I-K-I-B-O-N. Also, we are streaming live on siliconangle.tv, SiliconANGLE TV. And I think that's it for the intros. First of all, I want to thank Dag Liodden, who is co-founder and CTO of Tapad, for joining us. If you are watching live on SiliconANGLE TV, you will see him live streaming; he's Skyping in from New York today. Is that right, Dag? Welcome. That's correct. Thank you. Great, thanks for being with us. Also dialing in, from the airport, is Jeff Kelly. He'll be on mute except when he's asking questions or providing insight, so we won't get too much background noise. But Jeff, thanks for joining us. I love you. And here in the studio with me is Dave Vellante, who's a co-founder and one-percenter here at Wikibon. Hi, John. Hi, everybody. And David Floyer is on as well, as are Bert Latamore and the Wikibon community. So, Dag, again, thanks for joining us today. We are going to talk about analytics and real-time decision making, and you've got an interesting environment there. For those people who aren't familiar with Tapad, could you just give a brief background on what Tapad does? Then we'll get into the discussion of the infrastructure you're using there. Sure, thank you. From a very, very high level, Tapad is an advertising technology company. What we do is give advertisers the ability to look at individual consumers as they move across multiple devices, so we're able to do what we call personalized advertising for users, or consumers, based on the behavior that we observe on multiple devices. A typical consumer these days is using multiple devices to access internet services: they might have a tablet and a phone and a laptop, and maybe a set-top box, a gaming console, or a TV. For most ad tech companies, those devices present a very fragmented view of the user, whereas we try to give a more holistic perspective on how a single user is accessing services across the different devices. And we work with publishers and advertisers. We enable publishers to sell, for instance, audience information if they have specific behavior associated with users visiting their regular web environments, and we might be able to help them leverage that data and sell it on other devices later. So we're talking about real-time decision making. How real time is real time? How long do you have to make a decision? Most ad tech decision making these days is based on real-time bidding, which basically means that as the ad impression is being served, that's the only time and place where you know exactly which user you're looking at, and you have to make a decision on which ad and which campaign you want to show to a specific user as the page is rendering. So the time frame from page load until the ad has to be shown to the user is typically somewhere below 100 milliseconds; otherwise the user will perceive that as lag.
So from the time that the page load begins until the whole chain of decision making has happened and the ad is actually being served, you have about 100 milliseconds. And in the ad tech world, the ad that you serve up is dependent upon the user device and upon what you know about the user. Yeah, exactly. It's all about the information that we have about the device or the user. Those can be simple things, such as which kind of site they are on, or what their geographic location is, but these days it's also more about data that we know based on previous behaviors. So let's say that we have seen a device, or a device associated with this device, that has visited a site about cars, and we have an auto vendor that is running a campaign with us. Then we might take a look at the device, see that this device has actually visited a site about cars, and that makes the expected value of this particular ad impression higher for this auto advertiser. So it's about taking a look at all the data that we have available about the specific device and also the devices associated with that device. And all of this, of course, is something that needs to be stored for very, very quick retrieval. Just in terms of volumes in ad tech, we have a lot of ad units and ad views flowing through the system at all times, and they're all very small, data-driven requests. I know there are companies that are much bigger than this as well, but we're looking at about 150,000 ad units per second at peak time. And all of these decisions are data driven, so it's an interesting scaling challenge. So you've got information, you're making 150,000 decisions a second, right? We're actually making way more than that, because each individual advertiser campaign that is running in our system will need to make its own decision about the specific ad impression that's coming in, so there's roughly a thousandfold increase there. So we're talking about millions and millions of decisions, but the decisions are made based on the same data. For each request that we see, we will typically make approximately two lookups in our data store to find the information we have about that individual device. Okay, so you've already sort of profiled the individual as a person that might be interested in this kind of thing, and when they come online, or when they make a request, then you pop up? Yeah, I would be cautious about using the word profiled, but if we have seen the device interacting with sites that have signals that statistically indicate that the user will be more prone to buy a product, then we will factor that into the decision. Okay, I'm going to have to close the door in just a second. Well edited. No, but it's very important to debunk some of the privacy myths regarding the ad space, and we try to use less inflammatory words; that is one part of it, but it's definitely not profiling, it's all statistics. Okay, so let's spend a little bit of time talking about the infrastructure, because when you were making the decisions regarding infrastructure for Tapad, you must have looked at a variety of back-end databases that you might deploy. What did you look at, and what did you rule out right away? Yeah, tell us about the tech behind all this magic.
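To make the flow Dag describes concrete, here is a minimal sketch in Python of a per-request decision loop. Every name in it (kv_store, campaigns, evaluate) is a hypothetical illustration, not Tapad's actual code or API:

```python
import time

# Hypothetical sketch of the per-request decision flow described above.
REQUEST_BUDGET_MS = 100  # end-to-end page-load budget from the call

def handle_ad_request(device_id, kv_store, campaigns):
    started = time.monotonic()

    # Roughly two key-value lookups per request: the device itself,
    # plus the devices believed to belong to the same user.
    device = kv_store.get(device_id)
    linked = kv_store.get(device.get("linked_devices_key"))

    # 150,000 requests/sec, each fanning out to every running campaign,
    # is how you get to "millions and millions of decisions" per second.
    bids = []
    for campaign in campaigns:
        bid = campaign.evaluate(device, linked)  # may return None
        if bid is not None:
            bids.append(bid)

    elapsed_ms = (time.monotonic() - started) * 1000
    assert elapsed_ms < REQUEST_BUDGET_MS, "blew the latency budget"
    return max(bids, default=None)
```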
Well, it's a big topic, but you have to look at the kinds of data sizes, the amount of data we need to store, the latency and throughput required from a storage solution like this, and also the access pattern. That's what you need to take into account when you start looking at data storage solutions. I am not part of the camp that will apply these new NoSQL solutions to any problem in sight, but to us it became apparent pretty quickly that regular solutions would probably not cut it. I guess we could have made traditional relational databases work somehow, but they have a lot of features that we don't need. The access pattern for our system is very straightforward: we have an identifier associated with each device that we see, and we need to read a little binary blob which contains the information we have about the device. We need to read that on every single request, and whenever we see something, or we serve an ad or something, we need to update it. It's a very, very straightforward put-and-get access pattern, so this lends itself very nicely to a key value store. In NoSQL there are a lot of different types of data stores these days; they range from column-oriented stores to document stores, and there are many other kinds, but the most simplistic ones are the pure key value stores that basically just support querying and updating single IDs. And we looked at a variety of solutions for that, and a variety of deployment options. In terms of infrastructure, and I don't know how deep you want me to go right now, but basically, with key value stores, scaling them typically requires a lot of RAM. Storing values in RAM is pretty straightforward; it's very high performance and it's generally very fast. The problem is that in our case we are storing in excess of a billion devices, and we have quite a bit of information about each of them, maybe a kilobyte or so of data. So, if we include the keys, we very quickly end up in the terabytes range. And a terabyte of RAM is still something that is very expensive. What's a terabyte of RAM cost? Ballpark? That's a good question; I would have to Google the prices right now. But if you are looking at server-class hardware with a terabyte, that's going to set you back several tens of thousands. Of course, one server wouldn't be enough anyway, so you would need to look at multiple servers. And there's another problem with RAM: it's not just about cost, but also about the time it takes to bootstrap something. If you need to boot a new server and put it into the cluster, a restore will require you to read a terabyte of data from rotational drives into RAM, and that's going to take a lot of time. Of course, if you have your storage partitioned over, say, six or ten servers, the amount of data that needs to be read per server is way less, but it's still going to take a lot of time. In either case, storage such as, in our case, the SSDs we ended up using... I mean, SSDs are expensive, but they're way less expensive than RAM, and you can easily put half a terabyte or a terabyte of SSD storage in a server without breaking the bank completely. What SSD storage are you using? Well, let's see... I'm actually not sure about the make that we ended up on. I think we're running IBM SSD drives.
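To put rough numbers on the bootstrap problem he just described, here is a back-of-envelope sketch; the ~150 MB/s sequential-read figure is an assumed drive speed, not a number from the call:

```python
# Back-of-envelope warm-up time for a RAM-based store, assuming
# ~150 MB/s sustained sequential read from a rotational drive.
data_bytes = 1e12                     # 1 TB data set
disk_bytes_per_s = 150e6              # assumed ~150 MB/s

seconds = data_bytes / disk_bytes_per_s
print(f"one node: {seconds / 3600:.1f} hours")       # ~1.9 hours

# Partitioned across ten servers, each node still needs ~11 minutes.
print(f"ten nodes: {seconds / 10 / 60:.0f} minutes each")
```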
Well, basically there are two very different classes of SSD storage: you have the consumer grade and you have the enterprise grade, and there is still a significant difference between these two types of drives. It comes down to the fact that SSDs actually have write fatigue, or whatever the technical expression for it is; they wear out, so as you use your SSDs, the flash memory will actually decay and become less useful, with lower performance. So you have to get the more expensive kind, but still, compared to RAM it's way, way cheaper. So Dag, this is Dave Vellante. It sounds like you're using the flash as an extension of main memory to give you a persistence layer, am I understanding that correctly? Yes, that's correct. Our key value store software is called Aerospike, and what they do is store the actual indices of the keys in RAM, and then they store all the data on SSDs. The indices are, of course, a couple of orders of magnitude smaller than the actual data, so they require less RAM and it's faster to start up, but the data is always stored on the SSDs and will never be cached in RAM. This turns out to be very, very predictable in terms of performance: any read that we do for a key that actually exists will always hit the SSD, and as you know, the access times on SSDs are really good, as in very, very small. So the combination of using RAM for indices and SSDs for the actual storage turns out to be a very performant solution. Great, okay. And so you're saying that the Aerospike database is exploiting that architecture in a particular way that gives you, I guess, consistent performance, and judging from your earlier statements about the cost of RAM, much more cost effective, presumably.
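For readers who want to see the put-and-get pattern in practice, here is a minimal sketch against Aerospike using today's Python client, purely as an illustration; Tapad's 2012 deployment would have used the clients of that era, and the namespace, set, and bin names here are hypothetical:

```python
import aerospike

# Minimal put-and-get sketch. Host/port assume a local server.
config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

key = ("ads", "devices", "device-1234")  # (namespace, set, primary key)

# Write the binary blob of attributes for one device.
client.put(key, {"attrs": b"serialized device data"})

# Read it back: the primary-key index lookup happens in RAM, and the
# record itself comes off SSD when the namespace is configured with
# device (SSD) storage, which is what makes the latency so predictable.
_, meta, record = client.get(key)
print(record["attrs"])

client.close()
```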
Yeah, it's more cost effective, and it's easier to scale in terms of hardware. Let's say you provision your servers with 64 gigs of RAM, which is a fairly sizable amount; if you need two terabytes of storage, that's going to run you a lot of servers right there. And the problem with RAM-based solutions, in combination with the access patterns that we are seeing, is that we don't really have any inherent caching heuristic that we can apply. All of the 1.5 billion devices that we see can occur at any time. Of course, there are some heuristics, such as: if a page is loaded and it has five ads on it, then it's likely we'll see five requests, and we will access the same device five times. But other than that, we're seeing reads across our entire key space at all times, so a caching solution is not really that effective. What will happen is that your cache will very quickly be saturated, and then you'll start hitting your storage. On a rotational drive you can get access times down to maybe three or four milliseconds or so, but you're still limited by that, which means that even if you have multiple heads in your drives and so on, you still have a very, very hard physical limit on how many reads you can do. So the problem with these RAM-plus-rotational-drive solutions is typically that once you start getting a lot of cache misses, your performance is just going to drop completely off a cliff, and the only way to resolve that is by adding more servers with more RAM. But again, 64 gigabytes is not a lot of storage, and adding a new server for every 64 gigabytes is of course very inefficient. You can then add more and more RAM, but that will require you to get more and more expensive hardware to go with the RAM sticks. Cost-wise, it's way more efficient to use SSDs. And again, the predictability of knowing that every value being fetched will always have the worst-case performance characteristic, which is hitting the persistent drive, is very, very convenient, especially when you have the very hard SLAs that we do: we have to respond within a certain amount of time, otherwise we're going to cause trouble for the website the ad is running on. So the performance criterion is predictability and consistent performance, not necessarily top-end performance, is that right? Yeah, that's right. If you have a pure RAM-based solution, then you will definitely have the potential for lower average response times, depending on the actual parameters, but using the RAM-and-SSD solution gives you a very, very predictable response time. Okay, and you mentioned before the wear issue of flash. Can you talk about that a little bit more? Is that a major concern of yours? How are you dealing with that, or is the business case such that you can just burn through the stuff and keep replacing it because you're able to monetize the data? Can you just elucidate that a little bit?
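The cache-miss cliff is easy to see with a little arithmetic. This sketch assumes uniform random reads over the full key space and a hypothetical six-node, 64 GiB cluster; the device count and record size are loosely taken from figures mentioned in the call:

```python
# Expected cache hit rate when reads are uniform over the key space:
# roughly cache_size / data_size.
devices = 1.5e9
bytes_per_device = 1024                    # ~1 KB each, per the call
data_bytes = devices * bytes_per_device    # ~1.5 TB

servers = 6                                # hypothetical cluster
cache_bytes = servers * 64 * 2**30         # six nodes, 64 GiB each

hit_rate = cache_bytes / data_bytes
print(f"hit rate: {hit_rate:.0%}")         # ~27%: three of four reads miss
# Every miss costs a ~3-4 ms rotational seek, so the disks' IOPS
# ceiling, not the cache, ends up setting your throughput.
```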
So I'm not an expert on SSD wear, but I know that it depends on certain access patterns. Basically, if you have a very hot spot on your storage, let's say an index file in a traditional database, a certain area of the drive that is consistently being read and written to, then you're going to have more wear on that specific area. And there are certain techniques that can be applied to spread the wear more evenly across the SSD. Aerospike has this built in: it actually uses the SSD drives as raw block devices; it doesn't put a file system on them, it does everything against the raw device, and that includes wear protection. I think I saw that Twitter had a patch to MySQL that did something similar for their own fork of MySQL, so there are some well-known techniques for leveling that kind of wear on the drives. We use the Aerospike solution for this, and we haven't had any issues yet. We've been running with the same SSDs now for a year and a half, so we're expecting some to fail soon, but if you can do a year and a half on each drive, you can probably do more than that. For us, we could easily replace all of them now and still have gotten great value out of this; this is the very core of our system, so for us to replace a couple of SSDs every year is not a big deal, and SSDs are dropping dramatically in price. Going back to the RAM question just for a second: there are servers that can take you up to, I think, 512 gigabytes these days, maybe even a terabyte. Why isn't that a solution, with fewer servers? It probably is a solution, but again, the price. I can't really quote any prices, but first, the server-class hardware that can accept a terabyte is pretty expensive, and then getting the sticks of RAM that will take you up to a terabyte is going to be a very, very expensive solution; it's a traditional scale-up kind of situation. And the problem I mentioned about getting this data into RAM is still going to be a problem. Let's say you have a single server, or two for fault tolerance, and you could fit your data in a terabyte of RAM. If you bring a server online and it has a terabyte of rotational disks backing it, you're still going to have to read a terabyte of data into RAM before the entire data set is hot, so to speak. And again, as I mentioned earlier, we don't have any heuristics that say this section of the data set will be less frequently accessed, so load that part later. If we did caching and loaded into RAM only on demand, we would have worse performance at first; I don't know how quickly we would warm up, but there would be a significant start-up time. That said, the performance, or at least the latency per request, would probably be lower in RAM, until you start hitting the limits of the network adapter and so on, which you very quickly do at these kinds of volumes. I was also wondering, when you get to very large RAM: you mentioned that there is very little locality of reference within your key pairs, so would that also mean that the larger the RAM, the more the L1 and L2 caches would get overwhelmed, and the performance starts to drop that way? So, definitely. At some point your RAM bus will probably become a bottleneck as well, and if you're on one huge machine, then possibly, depending on access patterns and
depending on the architecture of the server and so on. This is not something you can be very black and white about, but definitely the caching behavior and the use of the CPU caches can become an issue. The reason we chatted about this earlier was a blog post by one of the Aerospike engineers about how you can set up a single machine to do a million requests per second, and part of the tuning there was actually pinning individual ethernet ports to individual cores on the physical CPU dies to make sure the caches were used in the most efficient manner. That is beyond what we're doing, but it's definitely an interesting topic, specifically now that the multi-core CPU revolution has just barely started; just a couple of weeks ago, Intel finally announced their Knights Corner processor series, which is basically a sixty-core processor, and you really want to leverage the CPU caches on these kinds of server hardware going forward. For our key value stores this has not been something we've been looking into tuning, but I'm sure it's going to be an interesting area of research going forward. Can you give some... Go ahead. I was wondering if we could turn for a moment to the actual use case and maybe dig into some of the details of the process. Maybe you could walk us through, in a little more detail, what actually happens when the user logs on with a certain device, what occurs on the back end in terms of communicating with the advertisers, and what that kind of system looks like, particularly for people who are not that familiar with ad tech. What does the architecture look like in terms of how you connect with the actual advertisers, and maybe also speak to the analytics side, the types of analytics you're running on the data. The request chain for an ad being served typically starts in the user's browser, with the publisher having a script or an iframe included on their page that loads a URL pointing at some sort of advertising technology company. For us specifically, this can either be something that we're serving directly off of our servers, or in many cases, something that is served through what we call an ad exchange. I'll spend a little time on that model first, because the ad exchange model has become more and more prevalent in the ad tech industry over the last few years. Basically, the browser requests an ad unit from an ad exchange's server, and this ad exchange usually doesn't have any direct advertiser relationships of its own. What it does is broadcast information about the device: typically a cookie ID; they might have the IP address, an ID for the site, some keywords about the page that is currently loading, and so on. They pass all that information to multiple parties that are willing to potentially buy this single ad impression, in what we call a one-off auction. Every potential buyer then looks through the campaigns it has running, and the data that was passed through from the ad exchange, and makes a decision about how much it is willing to pay for this specific ad unit. In our case, if we look at a performance-based advertising campaign, by performance-based I mean a campaign that has a certain end goal that
we want to achieve for the advertiser. That can be signing up for some sort of online service, or visiting a site; the goal can be pretty much anything, and they associate a target value with it. If you go back to the auto industry example, the goal might be to get the user to that auto company's online car configurator, or to submit for a test drive, for instance. So we have a monetary value on each of those actions. What our system then does is take all the attributes, or features as we call them, from the ad unit itself: what site is it on, what time is it, what geography is this user in, and so on. Then it complements this with attributes that we load from our data store, which might be what other sites we've seen this user on, whether they have visited the advertiser's site before, what other types of devices they have, whether there has been any other behavior on a device associated with it, and so on. Then we have machine learning algorithms that are used for predicting the probability that this specific ad unit will, at some point, result in that user doing what we want them to do, and the bid that is placed is actually based on that probability. We have a target goal value, we have the probability that the machine learning models, the probabilistic models, assign to the event happening, and we basically make a bid back to the ad exchange based on that. Then the ad exchange reviews all the bids it got from the different potential buyers, and the winner is the one whose ad gets placed, all in just one full request cycle. And that whole chain lasts 100 milliseconds or less, is that right? Yes, that's typically 100 milliseconds, and that includes all the hops, so if we're running this through an ad exchange, it means we will have somewhere between 50 to 80 milliseconds, maybe. So you mentioned, for instance, that an advertiser will be running a campaign for a while with certain targets in mind, and you help them determine the odds that a person will reach one of those targets. You do that for each potential advertiser, which is a customized type of analytics against all this data, for each advertiser, each time there's a request. Do I understand that right?
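The bid computation he outlines reduces to expected value: the advertiser's goal value times the model's predicted probability of the goal action. A minimal sketch, where the margin parameter is a hypothetical addition rather than anything from the call:

```python
# Expected-value bidding: goal value times predicted probability.
def compute_bid(goal_value_usd: float, p_action: float,
                margin: float = 0.3) -> float:
    expected_value = goal_value_usd * p_action
    return expected_value * (1.0 - margin)   # margin is an assumption

# A $50 test-drive sign-up with a 0.1% predicted probability:
print(f"${compute_bid(50.0, 0.001):.4f} per impression")   # $0.0350
```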
Not only for every single advertiser; an advertiser might also have multiple campaigns running in the system, and a campaign might have multiple targeting strategies associated with it. These might have different targeting constraints, but they also might have different goals, depending on how the campaign is set up. So basically, every single targeting line in the system makes a decision about how much it is willing to pay for this individual ad request. If the request is outside of the targeting, the answer is very simply no, they're not going to pay anything. If it's a machine-learning-backed model and the targeting matches, then the model is applied to the features that we have: the data sent from the ad exchange or from the browser, combined with the features that we have in our data stores. That is used to make a prediction and then a bid value. Every single applicable buying strategy in the system will place a bid; this is how our system is built, and there is nothing predefined about it, no static predefined values; it's all done dynamically in real time. Well, that's impressive. I wonder, could you put it in perspective a little bit? What did this sector of the market look like even just three, four, five years ago? How has this evolved from the way advertising was done at the dawn of the web, in the early 2000s, until now? It must have been quite a revolution to reach the point we're at now, with the types of use cases and applications we're talking about. Yeah. So first of all, Tapad is now approximately two years old, and we've been fortunate that we weren't the first company to try to do this, so technology such as SSDs and software such as Aerospike were available to us; we didn't have to build it from scratch and try to figure it out ourselves. So we're very fortunate. But yes, definitely, the ad tech space has evolved significantly over the last six years. If you go five or six years back, from the advertiser perspective it was mostly non-data-driven. Well, maybe that's not entirely correct to say, but you were able to set up targeting tactics or constraints: targeting a certain type of browser in a certain state, for instance, based on an IP lookup, or saying I want to target sites which have these kinds of keywords, and so on and so forth; pretty statically predefined things. Now, with the advent of what we call audience buying and retargeting and behavior-based targeting, the picture changes quite a bit. There's a wide range of sophistication and dynamism available in these real-time buying systems, but for us, everything we do is based on decisions or rules that are applied to the data as we see the ad request coming in. And the only reason we're able to do this is these storage solutions that allow us to retrieve information at the scale and at the latencies that we require. But it's also about modern CPU architectures, and having very high-speed multi-core server hardware that allows us to do these kinds of things. If we had started just a couple of years earlier, the hardware cost involved with building out a solution like this would have been a lot higher, and this even supersedes Moore's law, I would think, because just the access to affordable gigabit ethernet and 10-gigabit routing equipment and so on is something that has become so much, much
cheaper over the last few years. Outside of the ad tech industry, where do you see technology like Aerospike being applied? What kinds of applications make sense for a NoSQL approach? So, the basic idea of a key value store is a very simple abstraction, and therefore it can be applied in many areas. I would not ditch traditional data stores; we're running MySQL servers as well, because they're really good at many things, so I wouldn't try to replace anything that fits well in a relational database with any kind of NoSQL solution, for one. But key value stores specifically, such as Aerospike, lend themselves very well to data sets where you are looking up specific IDs, so you don't have to query and you don't need secondary indices. I know Aerospike is actually working on that, but it's not currently in there. You just want something like a memcache; I guess memcache is almost equivalent to what you can do with Aerospike, except that Aerospike is persistent, it supports transactions, and it's more fault tolerant and redundant. There are many, many use cases, and there are also things you can build on top of key value stores: you can build other types of analytics engines based on a very solid storage engine, which is basically what Aerospike is right now. I know they are also working on some more advanced analytics features on top of that, based on a system called AlchemyDB; I'm not very familiar with it, but that will definitely broaden the range of applicable use cases for the system. But anywhere you are using memcache and you would like to have the data persistent, I would say that Aerospike is a good fit. I want to make sure that we leave time; we have quite a few people online here with us, so let's make sure people have a chance to ask questions. And people should know, too, that you can tweet us; I'm @dvellante at Wikibon, and we will get your tweets and your questions online. Just before we take a few questions, I wanted to jump in. Some use cases that come to mind are really any data-driven, real-time bidding processes; those are applicable to the utilities industry and certainly to the financial services industry, just two use cases off the top of my head where I can see this type of technology being very useful, where the time given to make a decision is sub-second. This is David Floyer. I'd like to ask who your competitors are and what you need to do to be competitive with them; obviously, Google comes to mind. So, who are Tapad's competitors? Yes, well, that's a very open-ended question. We work with publishers and advertisers, and currently the offering that we are providing is unique in the market space; we're not competing head to head right now with someone offering the exact same product as we are. But there are definitely players, as you mentioned: anyone that has a lot of cross-platform audience information. You have Google, as you mentioned; you have Facebook. We do work with... I don't think I'm going to call out specific companies now, but in general, anyone that tries to maximize value for advertisers, we're competing with them. We have an offering that we feel very strongly will increase the value to advertisers, and we also feel very strongly about this holistic approach of looking at users as they move across multiple devices, and providing analytics on how ad impressions on a mobile device may affect users signing
up for services on their tablet at a later point, for instance. Those kinds of analytics, we feel very strongly, are the right way to go about advertising in the coming decade. You've mentioned cost, you've mentioned ease of scaling, and you've mentioned performance. What were the biggest drivers for your decision, and how fast are you having to scale? So, the decision was made a year and a half ago now. We had been running a couple of other key value stores, one based on the JVM and one based on the Erlang ecosystem, and they both had pretty good performance characteristics. Well, when I say good, they're fast, and this is still several times faster, but the main difference was actually how we were able to do failover efficiently and add nodes to the cluster without having any problems with rebalancing. That was the problem we were seeing with the other solutions we were running: when we were testing things, adding a node, or having a node fail and then adding a new one and having them rebalance and repartition the data, it worked fine, but when we actually had to do it in real-life production, we had issues with it. So we took a little gamble; we trusted that Aerospike would help us if we ran into any issues like that, and we haven't had any issues with rebalancing at all. We have actually had several servers fail at the same time, due to human error, as in unplugging the wrong equipment, and we have also increased the capacity of our server cluster a couple of times over the last year and a half, and it has been working really well. I think one of the architecturally smart decisions that were made is that the client is actually not a very thin client: it is aware of all the nodes in the cluster, and it will be aware of the latency associated with each node in the cluster, and it can use that for routing traffic. But it's not pure client-based routing; during a rebalancing, when your data or partitions are moving between the nodes in the cluster, the nodes themselves will actually also reroute traffic. So you have fault tolerance both in the client driver and in the servers themselves; it is a very, very stable client-to-server system, and we have had very, very few issues with it. Can you give us some sense of the size of the infrastructure there, how many terabytes of data you're dealing with? Sure. We're currently running five of these Citrusleaf nodes, as Aerospike was called until recently, and each server has six 120-gigabyte drives, so a total of 3.6 terabytes of raw storage. Now, that is replicated, with a K factor of 1, which basically means that we can afford to lose one node without losing any data. It means we have a certain redundancy of the data, so we're not storing our full data set in that space; if we extracted it, it wouldn't be a full 3.6 terabytes. We're also over-provisioned in terms of storage, but I would say that the total data set is maybe in the terabyte-and-a-half area, or something like that. So it's not huge, although maybe it is in the world of databases.
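The capacity figures quoted here line up if you do the math. A quick sketch, where the replication and over-provisioning factors are inferred from the description rather than stated outright:

```python
# Capacity math for the cluster as described in the call.
nodes, drives_per_node, gb_per_drive = 5, 6, 120

raw_gb = nodes * drives_per_node * gb_per_drive
print(f"raw: {raw_gb / 1000:.1f} TB")                  # 3.6 TB

# Keeping one extra copy of every record (to survive one node loss)
# halves usable space; the 20% over-provisioning here is an assumption.
copies, overprovision = 2, 0.8
usable_gb = raw_gb / copies * overprovision
print(f"usable: {usable_gb / 1000:.2f} TB")            # ~1.4 TB
```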
Just a reminder for those on the call: I'm trying to figure out who's got the piano going in the background; if you could mute your line when you're not speaking, that would be great. That's gone, thank you. It was nice piano. If you compare this data set to the data sizes you see in more tabular data sets used for analytics processing, like a Hadoop cluster or something, one terabyte is not a lot. But if you're looking at a data store that actually enables you to access this one-point-whatever terabytes of data in less than a millisecond, then that's something very different. This key value store is not an analytics database; you can't compare it to something like Netezza or Vertica or the Hadoop file store, it's not like that at all. This is about having this data accessible in a very low-latency, high-throughput way. You will still need big data analytics solutions, and we use them as well for doing batch analytics, but that's not what we use this for. That being said, a lot of the processing and algorithms that we run, we can now do as we are processing a request, instead of having to pre-compute them using a Hadoop cluster, because we have access to all of this data in a very low-latency, high-throughput way. I think that's a really important point: we're talking about big, real-time transactional applications versus some of the end-user analytics that are often thought of as big data analytics. Returning to that question: you laid out very eloquently why the approach that Aerospike takes is appropriate for this kind of workload. What's your take on more analytics-type workloads? Does that paradigm apply, or are RAM-focused databases that don't employ flash appropriate for those types of use cases? What's your take on that, specifically for more traditional analytics, counting and aggregating things? We're seeing things like SAP HANA, for instance, positioned for transactional applications but also for a lot of the more end-user analytics. Sorry, can you repeat the first part of the question? Well, I'm just curious: the architecture you laid out, the way Aerospike uses SSDs in combination with RAM, versus RAM only. We're seeing some new in-memory databases on the more end-user analytics side that don't employ flash, so I'm just wondering whether that is a reasonable approach, in your opinion, for more end-user-type analytics. So, if you're looking at a more traditional type of application or database, let's say MySQL or a Postgres server or a Microsoft SQL Server or Oracle, then I think SSDs will probably increase the performance of your applications in many cases. With the analytics solutions such as Vertica or Netezza, well, Netezza is actually a different question because they had a lot of custom hardware, but with any of the big analytics engines, or Hadoop and MapReduce built on top of HDFS, usually your applications will actually be CPU bound more than they will be IO bound. Maybe "usually" is wrong, but in many cases they will be. A rotational drive, for instance, can have very, very high bandwidth when it comes to reading data sequentially. SSDs are fast there too, but compared on the cost per megabyte read off of disk per second in a linear fashion, you might get a long way by using rotational drives. And for a lot of the queries that are being run, like counting the number of unique users that have visited your site, or summing up the number of page views, those kinds of analytics, you're typically sifting through a linear data set and doing grouping and so on on the fly as you go through it. I think that rotational drives with high capacity in RAID setups will actually live on for a while longer before they get completely
overtaken by SSDs, just because they lend themselves very well to that. For smaller data sets where you want more random access, SSDs are definitely a good thing. Dag, we've got a question from Brian that came in through Twitter. He asked: could the matching optimizations be applied to massive social gaming? Massive social gaming... well, I'm sure that gaming, or anything that is high volume, will have a lot of similar characteristics. It all depends on what kind of data you need to have in order to make decisions. If you're updating inventory somehow, say a store of weapons and things like that... it's a tough question, but anything that needs to random-access a lot of data very quickly will probably find this at least applicable. We've had some folks on in the past who are in that space, and I know they analyze a lot of the kills and things like that. I'd like to go back, if I may, to what you were mentioning before, that you were doing some analytics in real time. Could you describe that a little more? Because, from what I understand, this is a little more than just a transactional system; you're doing real-time analytics to decide things. Could you talk a little more about what those analytics are, their limitations, and things you might want to do in the future to expand them? So, it's not just analytics but also the logic that we can apply. Typically, if you wanted to do this in a more traditional, old-school way, let's say implement a frequency cap, which is what we call it when you don't want to show the ad too many times to a single user, and you don't have access to a solution like this, then either you could use the cookie store of the device to say, okay, I've shown you a number of ads and I'm going to stop, but that bloats the cookie store of the device a lot, and in many cases you won't have access to it; or you would have to just store the ad event and go back on, like, an hourly basis to run an offline batch job, counting how many users have been exposed to this ad and how many are above the threshold, and then put that somewhere and make it actionable in the next decision. Whereas, since we have access to the real-time data store, we see an impression coming in and we can store it on that device immediately, and when we see another ad request coming in from that device, we will immediately be aware that this device has now capped out. That's the kind of logic I was mentioning when I said there are things we can do in real time. When it comes to analytics, we are not actually using the Aerospike platform as much for analytics as we might be able to. What I meant was additional analytics of the type you just mentioned; so basically, bringing forward into this platform things you might otherwise do offline is what I meant, rather than pure analytics. We have a lot of these kinds of situations. For instance, we're working with a lot of other ad networks, and we do what we call multi- or cross-device attribution reporting, which basically means that if we are working with a mobile ad network that is showing ads through mobile apps, for instance, and they are tracking the impressions of those ads through our system, we're storing this in an impression history. Then, at some later point in time, we see that user on a different device, one that is associated with the device that the other ad network showed an ad to, and we will see that
user actually sign up for a service. Then, as that request comes in to us, the notification that the user signed up for a service, we will immediately have access to all the ad impressions that led up to that, what we call a conversion, the user actually doing something. We will be able to pull out the impression history, figure out exactly which partners we worked with that were actually participating in building up to this event happening, and we can call them out immediately. The alternative approach to this would be to have, like, a nightly batch job that went through all the conversion events, tried to line them up with the impression history, and then called them out. But we can do all these things in real time because we have read and write access to this data continuously, which makes some things much simpler, and it's a big gain. In terms of analytics, there are some interesting cases. We're using Redis for a lot of our real-time analytics, because it has these primitives for counters and bit operations and so on. The problem with Redis is that it's purely RAM based. Aerospike has, in the last few months, actually gained support for things like atomic counters, which makes it more interesting for more pure analytics. There's a lot of interest these days in stream processing and doing real-time analytics based on stream processing, and to do stream processing efficiently you need to store the results of those streams, the results of those analytics. Putting that into a real-time database, a data store such as Aerospike or Redis or some other key-value-based store that supports atomic operations like increments or even bit operations, will be very, very valuable going forward, I think. And if I understand it correctly, the real value of this is that you can change the weightings and the inputs to the bidding algorithms in real time as things are developing? Both. The actual bidding algorithms themselves, because they're looking at the data in real time, the data that is there at exactly that point in time, will pick it up immediately: for a single device, the second that device has been tagged or had some attribute added to it that is relevant for the decision making, the next second that data will be part of the decision making.
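The frequency-cap example from a moment ago is a natural fit for the atomic counters being discussed here. A minimal sketch using Redis, which the call mentions for exactly this kind of counter work; the key scheme, cap, and window are hypothetical:

```python
import redis

# Frequency cap via an atomic counter: the kind of check that moves
# from an hourly batch job to the live request path once you have a
# fast key-value store.
r = redis.Redis()            # assumes a local Redis instance

FREQ_CAP = 5                 # max impressions per device per campaign
WINDOW_SECONDS = 24 * 3600   # cap window: one day

def should_serve(device_id: str, campaign_id: str) -> bool:
    key = f"freqcap:{campaign_id}:{device_id}"
    count = r.incr(key)                 # atomic increment-and-read
    if count == 1:
        r.expire(key, WINDOW_SECONDS)   # start the window on first hit
    return count <= FREQ_CAP
```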
So, this is John McArthur. I want to thank you for joining us today; very interesting. What I'm hearing, a little bit, is that there is no consolidation around any single database, and that databases are really optimized for specific use cases. Is that a fair summary? Yes, I think that's a very, very important thing to realize. The reason for the growth of the NoSQL solutions is that there are certain use cases and usage patterns that specific types of architectures lend themselves to very well, and if you want to cope with the scale and the growth you have with a lot of the modern services provided to users now, you can't just use a single solution; you can't just buy an Oracle machine, or cluster Oracle servers, and then you're done. You actually have to look at what your use case is, what the access patterns are, and what your scaling needs are, and then use a number of different technologies. I don't think you should overdo it and have your whole system end up as a quagmire of fragmented technologies; that's definitely not a good idea. But trying to apply one thing to all the data patterns is not a good idea either. Again, thank you very much for joining us today. Also thanks to David Floyer, Dave Vellante, to Brian, and to Jeff at the airport; I appreciate you dialing in, and thanks for your questions and comments today. Just a reminder: we'll have six research notes up later this week discussing various aspects of what we learned from today's call. We are publishing on a wiki, so feel free, and Dag, this goes for you as well, to hit the edit button and correct, enhance, or improve the documents; I'll send out to you a list of the documents as they're published. We'll also have a podcast of this research meeting up on the Wikibon site and on YouTube later this week. Thanks again for joining us, and watch the Peer Incite page for upcoming Peer Incites. Thanks very much. Thank you. Thanks, Dag. Thanks, bye.