Today's presentation is about Flipboard, a social aggregation mobile app that aims to transform how people discover, view, and share content by combining the beauty and ease of print with the power of social media. This session will give an overview of the data life cycle. Ashish and Rob will discuss data strategy and architecture, the power of Python and in-memory processing, and the role of Qubole, its Python SDK, and other cloud technologies in quickly and easily accessing data, analyzing it, and feeding it into other models. Today's first speaker is Ashish.

Thank you. Hello, folks. My name is Ashish Thusoo. I'm the CEO and co-founder of a company called Qubole. Today's talk is going to be divided into two parts. The first part, which I'll be presenting, covers the big data landscape in general and some of the technologies and open source projects that have emerged there. I'll also focus a little on the issues organizations face when running at very large scale, and especially how the cloud helps with that. After that, I'll pass it over to Rob, who will go in depth on all the goodness around Python and in-memory processing at Flipboard. That will be a very strong use case of how these technologies are used in real-world environments.

To start with, a little bit about myself. As I mentioned, my name is Ashish Thusoo, and I'm the CEO and co-founder of Qubole. We offer a cloud-based platform for big data; that's what Qubole does. Before Qubole, I was at Facebook, where I was instrumental in building out the data infrastructure and led that team for a number of years. In the process, my co-founders and I also created Apache Hive. So, having been in this environment since before big data was called big data, I've seen this industry grow and seen the advances and changes that have happened in this area.

So, a little background. Why has big data become so important recently? A lot of this has to do with changes that have happened in the data landscape itself. If you go back to the '90s, a lot of the data being produced came from business applications. This is what I would call transactional data: data that sits in a database, created as a result of filling out forms or completing transactions. Banking transactions are the canonical use case on which the whole RDBMS industry was built. But as things moved online, and as web and mobile applications expanded, we saw a shift in the nature of the data itself. What has become more important in the last 15 to 20 years is what I call interaction data. This is data generated by entities interacting, whether those entities are human beings or machines. Inherently, this interaction data is a lot more unstructured and has a lot more volume and velocity compared to the previous generation of data sets, which were mostly transactional.
That, essentially, is why the previous generation of systems was found wanting: they couldn't handle these data sets or provide tools for processing them, and that is really what led to the evolution of big data. Having seen data infrastructures run and operated for this type of data, the most obvious realization is that you need infrastructure that scales with commodity nodes, because these are really large data sets; you have to think about horizontal scalability, not vertical scalability. But something else has become important: as these data sets have grown and more data has become available, the number of use cases where data is applied has mushroomed as well. So apart from scalability on the infrastructure side, a modern data infrastructure also needs to scale in the types of users and types of use cases it can support. It's not just traditional SQL data warehousing; it's also machine learning. Streaming analytics is becoming more and more important as well, especially with the emergence of the IoT industry, whether that's consumer wearables or the manufacturing sector. And of course data preparation as well. All of these use cases have mushroomed, and they have become more complex and more advanced. As a result, a modern data infrastructure not only needs horizontal scalability to deal with the volumes of data, it also needs to support multiple interfaces and multiple user personas doing different types of analysis.

If you dissect the data infrastructure of a modern company further, you can categorize systems into different categories, and those categories essentially follow the life cycle of the data. Data emerges; it's created in apps. It then moves through infrastructure that collects it, which is usually called data ingestion. From ingestion it goes into different systems: some used for ETL, some used for ad hoc analysis, some used for machine learning and deep learning by data scientists, and some used by developers for building applications. Finally, the payloads or summaries arising from that analysis are used either to drive applications or to drive dashboards and visualizations. The consumers of dashboards and visualizations are your end users, who take actions on the basis of those insights; applications could be anything, consumer facing or driving certain optimizations, and so on. That is typically how the different layers of data infrastructure have to be in place for all of these things to be covered, to really have a platform that can subsume all these use cases and make big data very simple. What has also happened is that, because there was a void of systems able to support these different categories at scale, a lot of innovation has happened in open source trying to fill each of those voids.
So in almost every one of these boxes there is some open source technology available today, and in certain cases multiple open source technologies, that solve that particular box's functionality; when you plug all of these together, you get a full-blown data platform that can handle big data use cases. On the collection side, a standard that has emerged quite rapidly in the last few years is Apache Kafka. There are other technologies available in the cloud as well, for example Kinesis in AWS, but these technologies have made it simple to collect large data sets generated from multiple endpoints and put them in a place where they can then be analyzed. On the processing side, streaming analytics has actually undergone multiple evolutions. It started with Storm, one of the technologies under the Hadoop umbrella that was used heavily for streaming analysis. Spark has recently emerged as a technology solving this use case, especially because of its strong roots in in-memory processing, which is something you need for streaming analysis. For classical batch processing, Hadoop has been there for a long time, and Hive attacks that use case quite heavily for doing large-scale data processing in a batch-oriented manner. For machine learning, again, Spark has emerged as a strong contender, and a few other projects have also emerged to attack that use case; there used to be a project called Mahout under the Hadoop ecosystem that addressed it. And on ad hoc analysis, you're seeing the emergence of technologies like Presto, and Cloudera has Impala, and so on, which attack those use cases. Net-net, what has happened in the last seven or eight years is a lot of innovation in the open source stack, so you now have plenty of options to plug into each of these boxes in a comprehensive data infrastructure that you would put in place for a company. At Qubole we essentially see a microcosm of how different use cases use different technologies and what their relative merits and demerits are, and really, a modern data team needs all of these things in place. It's not one piece of technology that cuts across every use case; you need all of these technologies in place to serve your analysts, your data scientists, your developers, and even your line-of-business users. So that is all great: there's a lot of open source technology available, templates have emerged, architectures have emerged, and so on. But a lot of companies still find it extremely difficult to put this stack together, to put together a big data infrastructure. And in many companies you see silos of users using big data.
It could be a developer team using big data, or a data science team using big data, but it becomes very difficult for a company to really raise the bar and create a central platform that all of these personas can use for accessing data and doing data processing at scale, both in terms of the scale of volume and the scale of use cases. And that is where, in my view, the cloud plays a very massive role. One of the reasons companies are not able to achieve a true vision of a central data platform that can solve all these problems is that when you build this infrastructure on-prem, it is inherently static. There is a fixed capacity you have to deal with, and the result is that the central team managing that infrastructure tries to lock it down, because the capacity is not elastic. Because the cloud is elastic and gives you on-demand infrastructure, it gives that central team a lot of flexibility to open up the infrastructure to many different use cases. So the cloud plays a very fundamental role there because of its elasticity and flexibility. It also plays a fundamental role in enabling self-service: a lot of technologies in the cloud are geared towards building things that are self-service, and if a company wants to move toward a self-service data infrastructure, that is central. So my message here is that, in order to really be successful with big data, the cloud becomes a very strong substrate to build your data infrastructure on top of: it allows you to open up data to a lot of users while keeping the operational overhead extremely low. However, the way these technologies are built for an on-prem data center versus the cloud is fundamentally very different, and I'll highlight a few differences to give you a flavor of what it means to build this for the cloud. The first difference is storage. A lot has gone into the Hadoop ecosystem to make HDFS the de facto storage for big data in on-prem clusters. In the cloud, the storage mechanism is completely different: object stores are the right place and the right technology to store this data, because they do a bunch of things. Number one, they make things extremely elastic; because they are super scalable, you can store small data as well as large data and they adjust accordingly. Second, by separating the abstraction between compute and storage, they make compute elastic as well: you don't need infrastructure running all the time, only when the processing needs to happen. So object stores are a very fundamental building block of the cloud, and if big data is to be done properly in the cloud they should be heavily leveraged, because they give fundamental advantages in operational efficiency as well as elasticity. Along with the object store, the other fundamental building block of the cloud is elastic compute. For the first time, the cloud gives you the ability to fit the infrastructure to the application. We've all been taught to set up an infrastructure and then try to make the applications fit into it.
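As a rough illustration of that storage-and-compute separation, here's a minimal sketch, assuming the boto3 library and a made-up bucket and key, of an ingestion step writing events into an object store while no cluster is running, and a separate, short-lived compute process reading them back later:

    import json
    import boto3

    # Hypothetical bucket and key names, used purely for illustration.
    BUCKET = "example-event-logs"
    KEY = "events/2015/07/01/usage.json"

    s3 = boto3.client("s3")

    # An ingestion process can write events without any compute cluster running.
    events = [{"user_id": 42, "action": "flip", "ts": 1435708800}]
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(events).encode("utf-8"))

    # Later, an ephemeral cluster (spun up only for the job) reads the same data.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    print(json.loads(body.decode("utf-8")))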
You know: I want to optimize this job, I want to change this, but my infrastructure can't do something like that, my machines don't have enough memory, and so on. With elastic compute and the flexibility it offers, you can change the infrastructure on the fly to fit the application, and this becomes extremely handy in big data use cases, which can be extremely bursty with a wide spectrum of needs. You might need very high-memory machines for machine learning use cases and low-memory machines for ETL workloads. That flexibility is very important when you're doing big data on the cloud, and it's fundamentally different from what you get on-prem with fixed boxes that have a certain memory, CPU, or disk profile. When you combine an object store and elastic compute, you can get to a platform that makes big data self-service, does it in a completely on-demand manner, and also future-proofs you against any evolution in those projects or in your use cases, because of the fundamental flexibility of the cloud. Now, one thing that is constantly brought up as a counterpoint is: how secure is my data in the cloud? That picture is also changing very rapidly. AWS has a lot of product features built around encryption, as do the other clouds, and there are a lot of product features built around compliance. Those are just some of the features that have been implemented in the cloud to address security, but fundamentally, when I talk about this, I always ask people whether they think their data is safer in their own data center or in a cloud data center. The analogy is: do you store your money in a safe at home, or do you put it in a safe at the bank? Some of the best practices around security are undertaken by the cloud providers, and as a result I feel the security perception of the cloud is changing very rapidly. So it becomes an even stronger option for putting together a unified data platform that has all these different engines plugged in for various use cases, all provided within a secure environment. I think that is what most modern data teams should aspire to and run towards in order to make big data successful for the enterprise. So I will stop there and pass it over to Rob, who will talk about Flipboard and how they are using big data at a much more detailed level; this was the high-level picture, but he will go into a lot more detail around usage and real-world use cases. Thank you.

Can everyone hear me? Those were definitely some really good points. Notice that we're not just using transactional data anymore; we're using interaction data. People weren't using phones that much a while back, and now everyone has a device, so we have insane amounts of data going through our platforms, and we need big data tools to analyze those large data sets. Okay, so I'm going to talk about how we use big data at Flipboard. First I'll give an introduction to Flipboard, what it's about and how we envision the user experience, and then I'll talk about some of the techniques and products we use, including a little bit about how we use Qubole and what it makes available to us.
So, a little bit about Flipboard. Flipboard is a news aggregation, or news recommendation, platform. We think of Flipboard as a place for your interests. This is the first-launch experience in Flipboard: you open Flipboard for the first time and you're presented with a questionnaire that asks which topics you're interested in: news, crafting, entrepreneurship. What we aim to do is take your usage data and the information from this questionnaire and pipe really high-quality articles to the reader, so they have a great experience reading nice, long-form content. There's another component of Flipboard, which is collection: you can collect things on Flipboard and add news articles to what we call magazines. You can create your own magazine, your friends can subscribe to it, they can read the things you're reading, you can have your own point of view on coffee or surfing or mountain biking, and other users can consume that. It's kind of a social ecosystem around content. So basically, whenever a user uses the application on their phone or on the web, we take into account their interactions with particular entities within Flipboard. We'll look at a user, and say this user reads ten articles: we take the usage data they provided to us and come up with a good strategy so that the next time they enter the application they're presented with even better material. The idea is that Flipboard learns your preferences as you use it more and more. Right now we have about 80 million plus active readers. As for the data infrastructure, that's part of what my job involves: I'm a data scientist at Flipboard, and I work on recommendations of these particular entities, like magazines, articles, and topics. There are two different sides of the data pipeline.
One side is the ETL process: working with Hive and getting all the data into our SQL data store, which is Amazon Redshift. That serves as our analytics platform, and it feeds a lot of our tools for analyzing A/B tests, looking at, say, our MAU numbers over the course of the last year, and generally serving as the analytics platform. There are also a few other recommendation products that come out of the Redshift data store, but most of them actually come from our recommender index, whose data is pulled from S3 and then moved into our graph database, and that graph database produces recommendations. So that is another topology, another pipeline. Okay, so we have ETL and reporting, and here are some of the tools. Is everyone mostly a Python person here, or any Java people? A few, okay. So we use Spark, and Spark Streaming; we're not using that for production-ready tools yet, but it's something we're starting to implement, and Spark is actually kind of new to Flipboard, we haven't been using it long. We've been using Hadoop for a long time, and SciPy and NumPy; those are things I work with a lot as a data scientist. They're generally in memory, and we do a lot of our prototyping in memory. Okay, so people ask: why use Python for data? Python has a lot of really great libraries, there's a really great community around it, and it lends itself well to data analysis because the scientific community uses Python a lot. What's nice, in my view, is that an analyst, someone who may be an expert at Excel or SQL, can pick up Python fairly easily, and I can show that with this example. I'm going to talk about the word count example, which is just a MapReduce job for counting the words in a corpus of documents. You can see here we just have a small program: a mapper and a reducer. The mapper reads a file line by line, splits each line, and creates these tuples of a word and a one, and then the reducer combines these together, grouping them and summing. What's nice about this is that you don't have to know anything about inheritance; you don't even really have to be a programmer, let's say with a degree in computer science.
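The slide itself isn't reproduced in this transcript, but a minimal Hadoop-streaming-style sketch of the mapper and reducer Rob describes looks roughly like this (in a real streaming job the two would be separate scripts with the shuffle in between; here they're chained locally for illustration):

    import sys
    from itertools import groupby

    def mapper(lines):
        # Emit a (word, 1) pair for every word on every line.
        for line in lines:
            for word in line.strip().split():
                yield word, 1

    def reducer(pairs):
        # Group the sorted (word, count) pairs by word and sum the counts.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Hadoop streaming would sort and shuffle between the two steps;
        # chaining them over standard input keeps the example self-contained.
        print(dict(reducer(mapper(sys.stdin))))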
You could pick up Python, and it's something where you know the inputs and outputs and what you want as a result, and the rest comes a little more easily than with a compiled language. You understand standard in and standard out, you understand looping line by line, and you just need to know how to run this and where to run it from. Now, this is word count in Java. Sorry, it's kind of small, but you can see there's a lot here: you have to understand classes, the concept of inheritance, and a lot of other entities within the language that may not be as easy to grasp for an analyst as opposed to a programmer, someone who went to school and learned C++ or Java. I think most universities now require Java as the compiled language; I'm not sure why they chose it, but it's kind of an industry standard that everyone learns Java. So for word count in Java you need to know about objects, classes, and inheritance, there's a context variable, there's a lot of boilerplate code, and there are a lot of things to import. It just makes the barrier that much larger for analysts, for people who are new, and for math people who don't have programming backgrounds and would need to learn all of this. We just want them to hit the ground running and be able to start doing data analysis with Python. There are a few advantages to Java: obviously it's faster, and it's more modular. I know YouTube was using Python for a while and is now moving towards Go and Java because it packages more easily and is easier to ship around, and, like I said, more programmers know Java than many other languages because of the requirements when you're in school now.

Okay, so this is the part where I talk about our data stores, our pipelines running from S3, where the sources and destinations are, how data gets from point A to point B, and all the transformations we make. Usage is by far the most important component of Flipboard's data, because without that usage we don't really have a product. If an hour or two goes by and we don't get usage data, that's a serious problem; everything has to stop, because that pipeline has to run every hour, and all of our usage pipes into our products. We basically can't have products without our usage data. We have two different pipelines. One of them is the Redshift pipeline, which I talked about earlier; that's our analytics platform. A lot of this data lives in S3 in a JSON-serialized format; we do some transformations on it, turn it into a columnar format, and then insert that into Redshift, and a lot of our products pull from Redshift.
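As a rough illustration of the last hop of that Redshift pipeline, loading transformed files from S3 into Redshift; the connection details, table, bucket, IAM role, and file format below are all placeholders, and Flipboard's actual load step may look quite different:

    import psycopg2

    # Placeholder connection details; the real pipeline's names will differ.
    conn = psycopg2.connect(
        host="example-cluster.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl", password="secret")

    copy_sql = """
        COPY usage_events
        FROM 's3://example-bucket/usage_columnar/2015/07/01/'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
        DELIMITER '\\t' GZIP;
    """

    with conn:
        with conn.cursor() as cur:
            # Redshift's COPY pulls the files from S3 in parallel across slices.
            cur.execute(copy_sql)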
The other one is the recommender index: that's a separate data store and a completely separate pipeline, but it still reads from the same location, S3. We have a collector which collects all the data, transforms it, and throws it into Kafka, and then the recommender index, the graph database we use, houses all of our recommendations, our latent factor models, and basically all the tools we have for providing recommendations to users. It serves off of Kafka and updates in near real time, so if we have some new interactions from a user, it may be an hour or two and then we're able to provide recommendations based on that new knowledge.

Okay, so a little bit about the recommender index. It's in memory; this is the graph database, and it's just one giant box, about 250 gigs I think, and we run all of our optimizations on it because it's very fast. A lot of what we do on the recommender index has to do with clustering: we have a few collaborative filtering models running on there, and we also have every user's latent factors, every user's usage information condensed into these factor models, which give us a good means of providing recommendations to users. Now, back to Redshift: this is our ETL setup for getting data into Redshift. We use Qubole, and there are just two steps here: you have a Qubole configure step, and then you create a Hive command using HiveCommand.create; there's someone on our team who's a better person to ask if you have any questions on that. This Python ETL setup is basically what executes whenever we want to run a job on Hive. You can see here we have a few retry attempts, we have a timer, and we have a HiveCommand.run that runs whatever query is passed into this execute function; sorry, this is really small. So this is part of the Redshift pipeline, a general execution of a query: whatever query we have, if we want to run it on Hive, this is what runs it (there's a small illustrative sketch of that setup below). Now I'll talk about the process that takes our usage data straight into Amazon Redshift. This is our usage data pipeline: it comes in through S3 as rows of online data and then runs through a Hive job. There are some other operations that happen on the data, like IP-to-geolocation user-defined functions that get applied, and once the transformations happen, the data gets put into a columnar format and stored back into S3. Then there's a process that takes it from S3 and puts it straight into Redshift, and from Redshift a lot of our analytics platforms consume the data store, as well as ad hoc queries and so on. This is just the process that puts it from S3 into Redshift. Next, instance management. Previously we had a bunch of Amazon Reserved Instances in a VPC and we managed them ourselves; we have about three people on the data infrastructure team now, and with a fixed-size VPC you have to manage everything yourself, requirements change all the time, and it just gets unmanageable and unwieldy. So it really helps us to have some type of auto scaling, and the cloud infrastructure we use is really nice for us because we spend a lot less money on Reserved Instances: if we're not running a job on an instance, we don't have to pay for it, which is really nice. A lot of these jobs run on Qubole, and as far as configuration and tearing down and bringing up instances, Qubole does that on the fly, so we don't have to manage that either, which is also really nice.
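Going back to that Qubole ETL setup: the slide isn't shown here, but a minimal sketch of its shape, assuming the qds-sdk Python package, a placeholder API token, and a made-up ip_to_geo UDF and S3 path, would look roughly like this:

    import time
    from qds_sdk.qubole import Qubole
    from qds_sdk.commands import HiveCommand

    # Placeholder token; the real one comes from the Qubole account settings.
    Qubole.configure(api_token="YOUR_API_TOKEN")

    # Hypothetical query: transform raw usage events into a columnar layout
    # on S3, applying an IP-to-geolocation UDF (all names here are invented).
    query = """
        INSERT OVERWRITE DIRECTORY 's3://example-bucket/usage_columnar/'
        SELECT user_id, event_type, ip_to_geo(ip) AS country, ts
        FROM raw_usage_events
    """

    # Submit the Hive job and poll until it reaches a terminal state,
    # with a couple of retry attempts, roughly like the wrapper on the slide.
    for attempt in range(3):
        cmd = HiveCommand.create(query=query)
        while not HiveCommand.is_done(cmd.status):
            time.sleep(60)
            cmd = HiveCommand.find(cmd.id)
        if cmd.status == "done":
            break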
So let's talk about auto scaling. We have a dynamic data infrastructure, and all the jobs run on the same cluster; if there are, say, two jobs running at one time and a new job comes in, the auto scaling deals with the demand of the new job and comes up with its own strategy to handle running all of these jobs at the same time. When we're running on Qubole, the instances are hosted by Qubole but owned by us, so whenever we're running a new job and need more machines, Qubole commissions them and then decommissions them as the job finishes. That has saved us a lot of money compared to the past, and it's something we're really happy to have. Okay, so this is one application we use Spark for: we have a follower suggestion recommendation within Flipboard. As you're going down your top stories, your main feed of articles, a recommendation module pops up, which is basically a user recommendation module, like "people you know" on Flipboard. This uses a Spark graph. What we were doing in the past was computing follower suggestions using Hadoop, and that took somewhere around 40 hours to complete, and sometimes it wouldn't even complete. A lot of the reason is that MapReduce is not in memory, so it takes a really long time to get through a clustering algorithm. So we've moved everything to Spark, and what's nice about Spark is that everything's in memory. These connections from user A to user B, instead of living on different partitions, each connection is seen as being completely separate, so if you have a node for a user and that node is referenced many, many times, it's able to sit in memory where you can access it easily. This slide shows how long the Hadoop job took to run, 40-plus hours, versus Spark, which was 20 to 30 minutes; it handles vertices individually, uses caching, and holds on to the most-visited ones. You can imagine that if there's a user who is very popular on a particular social media platform, that user is going to be accessed more than any other user, and it's better to have those nodes decoupled so they're not living on one particular partition on disk that you have to revisit over and over again. That makes computation time a lot faster. By a user connection I mean a social media connection: there are nodes and edges, and if I'm following you on social media, I would be a node and there would be a directed edge from me to you. Okay, I'll talk a little bit about Spark Streaming and then do a quick demo at the end. The demo is basically a topic cloud of user affinities: a user has particular affinities to topics, and the more affinity a user has to a topic, the larger the node will be; I'll show that at the end. This is basically an interest graph visualization. Our interest graph feeds into a lot of our products: each user has a topical affinity vector, and we use that vector to feed you articles, topic recommendations, and magazine recommendations, and it serves our ad platform as well. Okay, so this is a real-time demo. It uses Spark Streaming, listens to Kafka, and reads events for articles: if a user reads an article, we look at what topic the article is in and then grow that user's affinity to that topic. There are just a few steps here: it creates a streaming context and reads an input data stream from Kafka, and based on the user's time within an article it applies more weight to that article. Say the user reads an article for ten seconds: that carries a larger weight than if the user goes into the article and then jumps right back out. The reason we chose this is because users who click into an article and then jump right out may be a sign of clickbait or some type of content the user is not really satisfied with.
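The actual demo code isn't in this transcript, but a minimal sketch of that kind of job, assuming Spark 1.x's Python Kafka receiver, a placeholder ZooKeeper address and topic name, and JSON-encoded read events with invented field names, might look like this:

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="topic-affinity-sketch")
    ssc = StreamingContext(sc, batchDuration=60)  # one-minute micro-batches

    # Placeholder ZooKeeper quorum, consumer group, and topic name.
    stream = KafkaUtils.createStream(
        ssc, "zk-host:2181", "affinity-demo", {"article-reads": 1})

    def weight(event):
        # Longer dwell time on an article implies a stronger topic affinity;
        # a quick bounce (possible clickbait) gets almost no weight.
        seconds = event.get("time_spent", 0)
        return min(seconds, 60) / 60.0

    # Each Kafka message value is assumed to be a JSON-encoded read event.
    affinities = (stream.map(lambda kv: json.loads(kv[1]))
                        .map(lambda e: ((e["user_id"], e["topic"]), weight(e)))
                        .reduceByKey(lambda a, b: a + b))

    affinities.pprint()
    ssc.start()
    ssc.awaitTermination()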
So when the user jumps into an article and reads it for a while, we assume the user has a higher affinity to that topic, and we apply a larger weight. Let me do a quick demo; I'll show you guys this tool. Let's see if I can get this to show up here. These are the topics in Flipboard for this particular user; you can see that they read more articles about Google, machine learning, and psychology, and from these sources, Huffington Post and Business Insider. This is reading from Kafka, and for each user interaction with an article, a certain weight is applied to a particular topic based on the article's topic. A few other users here have different topical affinities; these users' topical affinities are more general. Yes? Oh no, this is something that was developed internally; I think it's just reading off of Redis. It basically just shows you what the person is interested in. This is a graph for one particular user, and you can see they read a lot about space, photography, Apple news, and Facebook, and it gives us a good idea of exactly what users are interested in and what this user might be inclined to buy if we wanted to show them an ad: maybe they're into photography, so we'd show them an ad for a camera. The more we know about users' topical affinities and their interests, the better job we can do providing them with articles and content they're interested in. And that concludes my talk; do you guys have any questions?

So, the devices send data to one of our servers; we have what we call a flap server, the server infrastructure that talks directly with the phones and all of our devices. The data is sent to that service, which takes it, puts it into JSON format, and throws it into S3, and from there we consume it. So you could say it's not really real time, but it's basically chronologically ordered: whatever we have in the recommender index is all the data and usage for a particular group of users up to, say, an hour ago. Whenever we get new data, it's put into Kafka, and then the last hour of usage gets put into our recommender index. Users don't generally come back every hour or so, but we like to have data up to a certain point and then keep feeding in the most recent. Generally, as a rule of thumb, if you've been a Flipboard user for the last four years, we don't try to analyze the last four years of your interests and every article you've ever read; we just look at the last few months, say six months, of interaction, one because your interests change over time and two because we then analyze less data. So we use a sliding window. Yes? You're doing the MapReduce still with Hive, on a Hadoop box, or in this case Spark, so why are you trying to move to Kafka; is there something Kafka does in addition? We just use it as a queue: we want to read the most recent usage data off of this queue and use it to update our recommender index and all of our stores, so the most recent usage data lives on there.
We also use SQS; we're trying to move over to Kafka, but onto our SQS queue we're throwing article content, and that's also consumed by the recommender index. The recommender index also has our topics, so for each article we're able to get a ranking of topics for that article; that way we have a better idea of exactly what the article is about, and we can send it to the right people who are interested in reading it. Yes, more details as to why we're moving from SQS to Kafka? Well, SQS is really expensive, mostly. From what I understand, the reason we're doing it is that we already have Kafka, and if we're going through the trouble of maintaining and using it, then we might as well use it for one extra purpose instead of paying Amazon more money. We probably spend about a million dollars a month on AWS, so it's a lot of money, and we don't want to have to pay more for something we already have the means to deal with using Kafka. We really don't have that many people on the data infrastructure side, so it helps to have a few chosen tools, favorite tools, that we want to use. I think also, at a larger scale, as your scale increases, you see people moving from, say, MPP databases towards Hadoop and Hive, and SQS-style technologies become less scalable compared to what Kafka and those systems offer. I don't know their specific reason, but what I have seen generally is that as scale increases you need to start moving towards technologies that cater to higher-volume data, and that also helps. So I think it's a combination of both, but primarily these are all message buses, and moving from an enterprise message bus to a cloud-scale or web-scale message bus, which is where the Kafkas of the world reside, usually happens because of volume. Good point. That's a really, really good question, and there are a lot of parts there. She wants to know how we use topical affinity to recommend content to users, and she's also asking whether there must be more to it than just looking at the topical content and then giving recommendations. There are basically two components. At a high level, for recommender systems, and let's just say in news, there are usually two types of approaches: there's the topical component, which is content filtering, and then there's collaborative filtering. Content filtering has to do with the content of the sample. Let's say I want to recommend a news article to someone, and the article is about big data and machine learning and Elon Musk. What we do is take the article, extract topics from it, and create what's called a topic vector for that document, and we can use that information in the future whenever we want to generate recommendations. You can imagine that you have a topic vector of size 40,000, and a user's topic vector, which represents their affinity for all the different types of topics, is the same size; you can just perform a dot product of those two vectors and come up with a score. For a particular user you have this vector, and you basically match it against the last 50,000 or 100,000 documents that came down the pipe recently, and then you just sort based on score. That's just the content part.
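To make that scoring step concrete, here's a tiny, made-up sketch of the dot-product ranking Rob describes; the vectors and article names are invented, and the real topic space is on the order of 40,000 dimensions rather than five:

    import numpy as np

    # A made-up user topical affinity vector over five toy topics.
    user_affinity = np.array([0.9, 0.0, 0.3, 0.0, 0.7])

    # Made-up topic vectors for a few recently ingested articles.
    articles = {
        "big-data-article": np.array([0.8, 0.1, 0.0, 0.0, 0.6]),
        "cooking-article":  np.array([0.0, 0.9, 0.1, 0.0, 0.0]),
        "space-article":    np.array([0.2, 0.0, 0.0, 0.1, 0.9]),
    }

    # Score every candidate with a dot product and rank them, highest first.
    scores = {name: float(np.dot(user_affinity, vec))
              for name, vec in articles.items()}
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(name, round(score, 3))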
The collaborative filtering part is a little more advanced and a little more powerful, in that it looks at the users that are similar to you based on your usage history and creates kind of amoebas, or clusters, of users around the content or around the usage. To facilitate this we use an alternating least squares algorithm, which is an optimization algorithm, also known as a latent factor model. A latent factor model is a model that can take a bunch of user data: say you have a lot of users and a lot of documents, and each of those users has a particular score for each of those documents. After you've run the optimization, you get a bunch of factors for each user and factors for each document, and what these factors represent are the interactions of the user with the content. Then, in basically the same way as with content filtering, you match these factor vectors representing the user and a document, you go through each of these documents, come up with a score, and rank them (a small illustrative sketch of this follows below). We merge both of these models together to come up with our recommender system, and the users, the pipelining, everything kind of lives on this graph database; that's where all the optimizations happen and where all the factor models are computed. Does that answer your question? Oh, the algorithm, I can't tell you exactly which one it is, but it's an alternating least squares algorithm. Alternating least squares? You're welcome. Yes? So, our graph database is internally developed; we call it Podex, and it was written in C++ by one of our developers. We have Boost-based Python bindings, so you can act on this database almost natively in Python, it feels like, but under the hood it's all running C++ and it's really fast. It's just one giant behemoth box, all the data lives in this box, it's just a large amount of memory on this instance, and that's pretty much all we need. This was actually before Neo4j, before a lot of these things; I'm not sure what year it was written, but it was before all these different tools. I guess I could explain how it would be used: I create a database object, and I can just type db.user and then brackets, just like a dictionary, and grab a user ID, or pass in a string user ID, and it comes back with this user, which is a node in the graph. Then I can say, okay, I'd like to go through all the views of the user, and basically their magazines and anything else lives within this user's part of the graph. It's really useful for analysis, and it's a lot more useful for certain things like getting the most recent, chronologically ordered documents or likes the user had; it's a lot better suited for applications than, say, SQL, because SQL is almost just one giant spreadsheet. But yeah, that was a good question. I don't think he's going to, I'm hoping he's going to open source it soon, I've been pestering him about it, but maybe it's something he developed that he just wants to bring along whenever he goes to a new job, or maybe he just doesn't want to have to support it, because there's a lot of work in supporting that.
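Returning to the latent factor idea: Flipboard's factor models actually live on that internal graph database, but purely as an illustration of the alternating least squares technique, here's a minimal sketch using Spark MLlib's ALS on invented user/article interaction scores:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-sketch")

    # Invented implicit interaction strengths: (user_id, article_id, score).
    interactions = sc.parallelize([
        Rating(1, 101, 3.0), Rating(1, 102, 1.0),
        Rating(2, 101, 4.0), Rating(2, 103, 2.0),
        Rating(3, 102, 5.0), Rating(3, 103, 1.0),
    ])

    # Alternating least squares learns a small factor vector per user and
    # per article; "rank" is the number of latent factors.
    model = ALS.trainImplicit(interactions, rank=10, iterations=10)

    # Scoring then works like content filtering: a dot product between a
    # user's factors and each article's factors, sorted highest first.
    print(model.recommendProducts(1, 2))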
I think it's probably like 4,000 or 5,000. No, it's not; it's just in memory, so it's not like, yeah, I haven't really dug through it yet, but it works really well for our purposes. Generally that's written in Java, and for ETL it's mostly Java; well, no, I'm sorry, that's in Python, I can't remember exactly. It really depends, I guess, but for whatever I showed you there, the ETL going from S3, that's all in Python. What types of distributed processing are we doing? We're doing Hive processing; there are a lot of jobs running, but that's probably a better question for one of our infrastructure people. That's right, no, I think we are more of a provider, and they use our platform to do that. How did we discover each other? I don't know, I just met you like an hour ago. But I guess there is a pain point and there is a solution to that pain point, and essentially that's how we came together. We specialize a lot more in big data on the cloud, Flipboard is using the cloud very heavily, and that's where the intersection happened. So there is a programmatic interface to Qubole, and then there is an infrastructure orchestration piece behind the scenes. The thing the cloud allows you to do is create infrastructure on the fly, so if you marry the programmatic interface with the orchestration piece, you get a very powerful paradigm where everything is created on demand and fits the purpose of your application. This is auto scaling: it's not just spinning things up and down, it's spinning things up and down based on demand, with algorithms that figure out whether we probably need more machines or whether we want to shrink things down. Auto scaling a stateless application is easy; in a stateful application where data is involved, if you are scaling down a Hadoop cluster, for example, and you knock off a couple of replicas, what happens to that data? You have to handle the complexity around data placement and so on, and that is the central part of our orchestration piece when you are using the cloud. Companies like Flipboard can write applications on top of this which do the higher-value things, creating these affinity graphs and recommendations and so on; behind that application there is a whole pipeline of things, and one part of that is solved by Qubole and another part is solved by the graph database and so on. We program straight against AWS, because a lot of configuration management software is general purpose, and when you are doing something very specific to big data you need access to the daemons and the state of the daemons, because those are the inputs that go into the prediction algorithm that figures out how auto scaling has to be done; so we have kind of built that out ourselves. Caffe? So yeah, you can use Caffe, a library for neural networks; in what context do you mean, like in the cloud? I don't know; I think I built a Caffe box with CUDA and I had to build everything myself. Are you guys looking to go more in that area? We do support integration with R, but we have not looked at Caffe as such. The platform allows you to, but you have to draw a box around what is supportable by the platform and what is not; we essentially draw a box around Hadoop, Spark, Hive, and Presto, those are the four technologies. Now, you can use Caffe libraries, or any generic library, you can absolutely use that in Qubole; however, it has to fit into the paradigm of one of these infrastructures.
For example, if you want to write a Python streaming job in Hadoop or Spark using Caffe or any of these other libraries, you can certainly do that, but that goes beyond the scope of the platform itself. Any programmatic library you want to use, like MLlib, a lot of people use MLlib with Spark, you can use that with Spark. Right, so that's the programmatic interface; that programmatic interface talks to our interface for job submission, execution, results, movement of data, and so on. That is the interface we provide, and below that is the orchestration piece. So I guess it depends on what level you enter at. Right, that's very focused; I'm sure you can use Caffe for many other things, but what I use it for is a deep learning network that basically looks at an image and classifies it into one of a thousand different categories. That is so niche right now that it's not even that useful for us; it may be useful for a few other companies, like a camera company or Dropcam or something like that, but I think maybe soon the adoption of deep learning will grow more and more. Does MLlib have any deep learning? I think MLlib has some deep learning stuff now; I'm not sure, and I don't know if anyone uses it. But Caffe you can set up with a GPU, though you have to do all the management yourself if you're doing it that way. I set it up, it took me about a day, it's a lot, and actually they killed my box because I wasn't using it, so now I have to set it up again; it happens. What's that? I haven't used Docker yet, I haven't had the pleasure; I use an AMI, an Amazon Machine Image, to set it all up, and that makes it a little bit easier. But yeah, I think we are out of time, is it? Thank you very much.