Breaking conformity, moving out of the mainstream. In some ways this talk is also about what we at Indix have tried to do to break out of the mainstream in implementing a large-scale data processing pipeline. We started with the tried-and-tested ways of doing things, and we failed; then we tried applying functional programming principles, and to a large extent that was very successful for us; and then we scaled the platform using Hadoop. Those are the three major themes of my talk: what data systems are like, what challenges we face, what functional programming principles you can apply to tackle those challenges, and how you scale them.

Okay, I'll begin with what a data pipeline actually is, because a lot of us, including me before joining Indix, come from typical web-facing consumer apps, where you usually have one database and it's mostly people's information, records, transactional in nature. Data processing systems are a little different. Typically your pipeline looks like this: raw data comes into your system, this blue guy here, gets forked, and is then processed, serially or in parallel, through all the subsystems, and each subsystem can also use some metadata from outside. All of this gets massaged and finally joined, and out comes structured data.

Now look at our own pipeline at Indix. We offer product intelligence solutions to our customers: you have products being sold by brands and retailers, we go crawl them, and you end up with HTML product pages that look like this Samsung Galaxy S5 page. Fairly unstructured; you could call it semi-structured because it's HTML, but people use horrible HTML, so it's almost unstructured data. We put it through multiple systems, just like in this diagram, not necessarily sequentially, and at the end of it you have structured product information: this is actually a title, this is a price, this is a mobile phone, and so on. So that's a very quick introduction to what a data system looks like, and I'm going to use it as the running example.

Let's begin with some of the challenges you face when you have to build a system like this. First, the data is going to change continuously, and this appears in two forms. One, if you're crawling like we do, then as you discover more and more pages you have a constant stream of new data entering your system. Two, existing data gets reprocessed: say you roll out an algorithm update in the component that classifies a page as a shoe or a mobile phone or whatever; when that classification logic changes, all the downstream systems that depend on its output need to reprocess the data. So there's always data flowing through the system; it never ends, it goes 24/7. And since, as we saw in the diagram, the data is touched by multiple systems, you always have to join and fork and stitch the data together at various points to come up with the final output you want.
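To make that end state concrete, here's a hypothetical sketch of the kind of structured record the pipeline produces; the field names are made up for illustration:

```scala
// Hypothetical shape of the structured output; a real pipeline carries many more fields.
case class Product(
  url: String,       // unique identifier: the crawled page's URL
  title: String,     // e.g. "Samsung Galaxy S5", extracted from messy HTML
  price: BigDecimal, // e.g. 599.00
  category: String   // e.g. "mobile phone", assigned by the classifier
)
```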
This whole orchestration, stitching the data together, is another challenge. And with data, you have no control over what the sites can throw at you: you think you're crawling an HTML page and you end up with a zip file or an exe file or God knows what. So you have to recover from such bad data, rerun your pipeline, ignore the bad data, things like that, and how nimble you are at rerunning matters, because a constant stream of data keeps flowing in; recovery is very tricky in that sense. And I can't emphasize enough how important metrics and aggregations are, because when you have millions of records flowing through the system, that's the only way you know something is going wrong. Many of the metrics are aggregations: if I'm crawling a hundred stores, and I know each store has 10,000 or 15,000 or 20,000 products, then that per-store count helps me figure out that the store I generally expect to have 10,000 products only has a thousand today, so something happened. These metrics and aggregations need to be implemented at every single system, and also for the system as a whole. So those are some of the challenges; there are many more, but I'm going to take these four, and we'll see how we can address them by taking a leaf out of the functional programming book.

Scale, I thought, deserved a separate slide, because whatever we do, we need to be cognizant of the fact that we're dealing with a lot of data. At least this is what we deal with at Indix: we get about 4 terabytes of fresh data in every day, and we have about 450 million product URLs, plus systems that are machine-learning dependent. So scale becomes very important as we look at this whole thing.

Okay, here's our first mainstream attempt at building this. A lot of us came, as I said, from a consumer-facing web app background, so we thought: you just have a database, shared mutable state. Each row is what gets processed by all your systems, say the URL of a page, and you give each component one column. Component one goes and crawls and puts the HTML here; component two reads the HTML, processes it, maybe extracts some structured data from it and puts it here; component three reads that, and so on. This was the simplest thing we could start with, and it obviously didn't go well. We faced multiple issues, and most of them could be traced back to the fact that in any given row, any given cell, the values are constantly mutating. Each component has its own lifetime: take these to be bulk jobs that you run on your data. One might run every hour, because it deals with price and price changes very frequently, so we need to run it frequently. Another doesn't change things very often, because product attributes don't change often; a title is going to stay the same title, so that system can move slowly.
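Just to make the pain concrete, here's a minimal sketch of that shared-mutable-row shape, with made-up column names:

```scala
import scala.collection.mutable

// One row per URL, one "column" per component, each component mutating
// in place on its own schedule (hypothetical column names).
val row = mutable.Map[String, String]()
row("html")   = "<html>...</html>"           // component 1: crawler, runs hourly
row("parsed") = """{"title":"Galaxy S5"}"""  // component 2: HTML processor
row("price")  = "200"                        // component 3: pricing job
// If "200" looks wrong, there is no history left to tell you how it got there.
```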
Then what happens is that in a given row you have so many cells, and each system writes to its cell at a different pace. This causes a huge pain point when something goes wrong, because you have no idea how a value got there: you see a value of 200 in one cell, and you don't know the exact sequence of steps that arrived at that 200. Error recovery is also very painful, because you have one big database with writes happening constantly and reads happening too; how do you take snapshots, how do you back it up, how do you scale it? It's a huge nightmare. So, mainstream sucks. That was our mainstream approach to implementing this whole thing.

We went back and thought, okay, this sounds very familiar: it's almost like you have a program with shared mutable state, a class with variables all over the place, and you're trying to do concurrency or multithreading on top of it. That's how this whole thing looks. So we thought there should be a way to tackle these problems by building a system centered around functional programming principles. Look at some of the principles we know, and whose value and utility I'm sure a lot of us see: immutability, containing side effects, pure functions, idempotence, monads. Can we apply these to bring some sanity and order to our system? That was the idea we began with.

The very first thing we thought was that we should embrace immutability, because as I said earlier, when you see a value of 100 you need to know how it became 100; that's the only way to find out what went wrong: oh, actually component A did this, component B did this, and there's an error right there, and that's how the 100 came into the picture. When you have a shared mutable database, where a cell just gets its value updated or deleted, you lose all the history behind it; every update and delete statement actually tells a story about what happened. So we decided we should never lose data. You can still do updates and deletes, but the fact that you did them should not be lost. We thought of treating the operations we do on the data as events on a log: every time an update happens on a row, we record that this was updated, the old value was this, the new value is this, and you get a consecutive sequence of events appended to your log. What this also means is that instead of having one big database where reads and writes happen in the same place, we have one dedicated append-only write store where only writes happen, and whenever you want to read from it, you do some kind of reconciliation to form your view. You could use a Kafka-like append-only queue here; in our case we just decided to go with HDFS, writing append-only files straight to the file system. So the idea was to have two different stores: the write side is primarily concerned with high throughput.
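A minimal sketch of that idea, assuming a couple of made-up event types; Kafka or an HDFS file would stand in for the buffer here:

```scala
// Updates and deletes become events; appending never destroys history.
sealed trait Event
case class Updated(id: String, field: String, oldValue: String, newValue: String) extends Event
case class Deleted(id: String) extends Event

val log = scala.collection.mutable.ArrayBuffer.empty[Event] // stand-in for Kafka / HDFS
log += Updated("url-1", "price", oldValue = "599", newValue = "549")
log += Deleted("url-1")
// Even the delete is just one more event: the fact that it happened is never lost.
```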
The read side is primarily concerned with reads, and you need to do something in between to reconcile those events and form your views. Stop me if I'm going too fast, feel free. Since we deal with large amounts of data, and it's not customer data or financial data, we were fine with eventual consistency.

So let's look at how we do the reconciliation. Say at time t1 a bulk job ran on your data: it read everything and put some value in for each row, and it failed. Failure could mean there was a bug in the code and it wrote the same value for all the rows, or the code didn't actually run, or it failed halfway through; that's fine. What do you do? You rerun the job on the same data one more time. In the mainstream, traditional database world, for the key k you'd have v1, the wrong value, and the second run corrects it by just replacing that value with v2; that's how the old model worked. How do you do it with an append-only store, where you don't get random writes to specific rows in your database? You run a job, and this is the reconciliation step I was talking about. Every record has an ID; in our case we crawl product pages, so the URL is the unique identifier for us. You group by that ID, so if you ran your job two times, the group is going to be a list of two records with the same ID, and you apply a function that picks the latest, because in this case we want the latest one. The cool thing is that since you have the entire history of your records, you can do this read-time resolution in multiple ways: you don't necessarily have to pick the latest; maybe you want fields 1 and 2 from the latest run and fields 3 and 4 from the previous run. That's possible. This read-time resolution gives you a lot of flexibility, because you have a complete history of what happened to your record along the way.

There are other benefits to this append-only model and to separating your writes and reads. Idempotence: you can run the same operation multiple times, and based on your read logic it will give you the same output over and over again. And append-only storage is very reliable. The primary reason is that in a random-write store, take MySQL, or HBase, which is what we were using, the implementation details vary, but generally the data gets written to memory, a write-ahead log gets written, these get merged, and your indexes need to be updated; there's a lot of writing and a lot of complexity in maintaining a random-write store. With append-only storage none of that applies, so it's really easy to scale, and as I said, you can use Kafka or just HDFS files and simply write to them.
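Here's a rough sketch of that read-time resolution, with hypothetical record and field names; the merge function is where all the flexibility lives:

```scala
// Each job run appends a full record; nothing is ever updated in place.
case class ProductRecord(url: String, price: Double, runAt: Long)

val log = Seq(
  ProductRecord("http://example.com/s5", price = 99.0,  runAt = 1L), // buggy first run
  ProductRecord("http://example.com/s5", price = 599.0, runAt = 2L)  // corrective rerun
)

// Read-time resolution: group by the ID (the URL for us) and pick the latest,
// though any merge over the history (field 1 from this run, field 3 from that) works too.
val resolved: Map[String, ProductRecord] =
  log.groupBy(_.url).map { case (url, history) => url -> history.maxBy(_.runAt) }
```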
This whole approach forms the basis of the term "lambda architecture", coined by Nathan Marz of Twitter. I'm not going to go into too much detail on it, but we actually use the lambda architecture at Indix; if you're interested, our engineering blog has a multi-part series about how we implement some of the things I just touched upon. Okay, let's move on.

We spoke about immutability; now let's talk about containing side effects. You now have a system that only does writes in an append-only fashion; there are no random writes, and reads are segregated away. But that doesn't mean you're done, because you want to treat every component, every circle in that initial diagram I showed, as a pure function. What does that mean? You give it an input and it gives you an output; there is no state anywhere outside; one component doesn't inadvertently change the state out from under another component. In functional pseudocode, you can imagine the transformation of your data like this: I have HTML pages; I take a page, parse it, and get a parsed record; then I classify it, and so on. I can chain the data transformations in this manner, and that's exactly what I want, because each function here can be a component: this could be the component that parses pages, that could be the component that classifies pages, and so on. It's a very simple pseudocode way of looking at your data pipeline. In fact, if you look at the function signature, it looks like this: you have a collection of type M of products of type T, and you're transforming T to U; in this case the HTML page becomes a parsed record, but the container M stays as it is, so M[T] becomes M[U]. Can anybody tell me what this is? So I'm trying to link these together: I'm saying you should treat your components like pure functions, drawing inspiration from how a pure function behaves: a very clear contract on the input, a very clear contract on the output. That's right, yes. And can anyone say what this signature is? Right: this is nothing but a monad. I'm not going to dive into monads, the definition and the monadic laws and so on, because that would take some time, and for some of us, myself included, it was a very new thing when I learned it. Instead I'm going to explain how a monad is useful. Very quick explanation: a monad lets you express a series of transformations, a sequence of operations on your data, such that the data type can potentially change after every operation: type T becomes U and then maybe V. But if you look at it as a whole, it's boxed inside this monad type M; the same monad is preserved as the data inside gets transformed. Why is this way of looking at it useful? I'll touch on this again later, but it gives you a general abstraction over your whole system: your monad has certain properties, and if you think of your system like this, it gives you a really nice abstraction, especially for those of us who come from the functional world, to look at your system this way.
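In Scala-ish pseudocode, that chaining and the M[T] to M[U] signature might look like this; the types and stub bodies are made up for illustration:

```scala
// Hypothetical record types for two pipeline stages.
case class HtmlPage(html: String)
case class ParsedRecord(title: String)
case class ClassifiedRecord(title: String, category: String)

// Each component is a pure function: input in, output out, no outside state.
def parse(page: HtmlPage): ParsedRecord           = ParsedRecord("Samsung Galaxy S5")      // stub
def classify(rec: ParsedRecord): ClassifiedRecord = ClassifiedRecord(rec.title, "mobile")  // stub

// M[T] becomes M[U]: the container (List here) is preserved while the data transforms.
val pages: List[HtmlPage] = List(HtmlPage("<html>...</html>"))
val classified: List[ClassifiedRecord] = pages.map(parse).map(classify)
```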
And as I mentioned before, scale is a really important factor in all of this. So far we've been talking about taking principles from functional programming; all that is fine, but how do you deal with scale, and with the orchestration I spoke about: reading from multiple sources, forking, joining all these different pieces of data? There's a quote that goes something like "Hadoop MapReduce is the Enterprise JavaBeans of our time." In many ways, if you look at building your data platform, I think Hadoop is definitely the best option available to do that today; it's been put to work at big companies and it's really good. But the problem is the raw MapReduce API that Hadoop gives you: say you want to sequence various jobs in the form of MapReduce jobs; it's really, really ugly. What I mean by ugly is that it's very low-level. It's good in the sense that it's very flexible, you can do a lot of things, but it's not intuitive; you definitely can't easily express your algorithm in terms of map and reduce. Every time you come up with an algorithm, you now have to go and think: how is this going to fit into the MapReduce model? Is this operation going to be a map, and is this going to be a reduce? What should I emit from the mapper and the reducer? What key should I group by? It's not a very natural way to work, even though Hadoop, in my opinion, is the best platform available for data crunching at large scale. So we definitely didn't want to use the raw MapReduce API, and that means you have to build some abstractions. Why do that when you already have Cascading? Cascading is a Java API on top of the MapReduce APIs that allows you to forget about MapReduce and think about data flows. What does that mean? They've borrowed a lot of concepts from the world of plumbing. You have taps: a tap can be either a source or a sink. A source is something you read from, like your input file, and a sink is something you write to. You read from a source and write to a sink, and you can read from a text file and write to a sequence file, or read from a CSV and write to a CSV; all of this is abstracted away from you. Once you read from a source, you get a pipe, and the pipe represents an immutable stream of data. If you're reading, say, a CSV file, that pipe is going to contain lines: every record in the pipe has a line. Then you can apply operations on it. A map means: I get the number 1 and I multiply it by 2. A filter means: I get the number 1 and I don't let it through because it's not divisible by 2. You specify functions that either let the pipe transform the data flowing through it or filter the data flowing through it. And then you can plug in aggregators: say you're reading a thousand lines from a TSV file, and I don't need to know what each line is, I just want to know that there are a thousand lines in that file; that's an aggregator, a count aggregator, for example, that you can plug in. If you notice, apart from "map", which is more like the functional map, there is no MapReduce at all here.
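In plain Scala collection terms, those three kinds of pipe operations come down to something like this:

```scala
val numbers = List(1, 2, 3, 4)

val doubled = numbers.map(_ * 2)         // a map: transform every record flowing through
val evens   = numbers.filter(_ % 2 == 0) // a filter: only let some records through
val count   = numbers.size               // an aggregator: one value for the whole pipe
```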
You don't have to think about MapReduce; you just think in terms of data flows: I have a pipe of data, I'm going to map it, filter it, fork it, join it, things like that. And Cascading takes care of actually creating the MapReduce jobs under the hood; we've had some 20 lines of code producing 13 or 14 MapReduce jobs in the past. So here's a word count in Cascading. You define a source, which is a tap reading from your input path, and you have a sink. Then you initiate a new pipe, saying I want a pipe called "word count", and these are the operations you do on the pipe: I split the line into different words, then I group on the word, then I aggregate a count per word, and then I assemble the pipe. Again: no map, no reduce, no fancy terms; it's all logical chunks of how you would deal with data. But we still weren't happy with this, because it's still quite verbose, and you're doing all this reassigning, all this mutation; it didn't look functional enough to us. So we thought there must be a better way, and Scalding is it: a wrapper on top of Cascading. This is how it fits: you have the raw Hadoop APIs, you have Cascading, which gives you the high-level abstractions, and then you have Scalding, which gives you a Scala API on top of the Cascading APIs. Scalding is not the only such tool; there are others like Scoobi; but we chose Scalding because Twitter uses it heavily and there's a huge community around it. So let's look at the same word count example in Scalding: four lines. In fact, if I just showed it to you for one second and took it away, you wouldn't even say it's MapReduce code; those of you familiar with Scala would probably say it's Scala code, because it looks almost exactly like how you'd do it in Scala. You read from your input, you flatMap the line into a list of words, you group by the word, you take the size of each group, and you write it out to an output. That's about it: four lines. We use Scalding very, very heavily; we have hundreds of jobs that run every day, and it's become a kind of SQL replacement for us. Where for some analytics requirement you'd usually do a SELECT * FROM ... against a SQL database, we just write a four-or-five-line Scalding job, run it, and it gives you the data out. It's really powerful because it empowers you to run these ad hoc queries; otherwise you'd end up writing this, or worse, this kind of code for each of your jobs, and there's going to be a lot of boilerplate. When you're dealing with data, reading and writing the data takes most of the time; network latency and I/O take most of the time. The thing you need to be slightly worried about is whether this gets transformed into a single MapReduce job: if you write raw MapReduce, you can make it a single job, whereas Scalding could potentially be less optimal and spawn two or three jobs. But from our experience we have never seen that happen; it's really good.
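For reference, the four-line Scalding word count is roughly this, using the conventional 'line and 'word field names from Scalding's own examples:

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                           // source tap
    .flatMap('line -> 'word) { line: String => line.split("\\s+") } // line -> words
    .groupBy('word) { _.size }                                      // count per word
    .write(Tsv(args("output")))                                     // sink tap
}
```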
They have a very good flow planner, which generates close to the minimum number of MapReduce jobs, the same as if you'd written them by hand, and you don't think about the raw MapReduce jobs at all. As I said, we had a 20-line program that produced 13 MapReduce jobs, because of course we read from multiple sources and we join. We do this kind of thing: we read from one pipe and then join it with another pipe on the same key. Say you have some records with fields A to E, and then another input source with some other set of fields, and you want to combine them; you can do that with this kind of high-level abstraction. Writing a join in raw MapReduce is really, really painful, and we have jobs that do four or five joins where the code is only eight to ten lines. And we have not seen any noticeable performance hit; we did start with raw MapReduce and then moved to Scalding, it was an evolution, so we really didn't feel the pinch. Scalding uses Cascading, and Cascading at the end generates the actual jobs.

As for Spark: we also have some components in our data science platform that use Spark. Spark is mostly for in-memory crunching. For example, take a group-by operation on a field, say grouping by store: in our system, Amazon has millions of URLs in its store, and if you try to do that in Spark today, you will run out of memory. There's a JIRA open to address it, but at least as of now you can't do a group-by on a very large set; it needs to fit in memory. Spark is very useful when your data set fits in memory and you want to do iterative things, like in machine learning where you want to run the same algorithm multiple times; that's where it's very useful. I'm not going to go too much into that. Scalding is still Hadoop-slow, but the data need not fit in memory; that's the flexibility you get.

Okay, I spoke about how important metrics and aggregations are: they're like your heartbeat, you monitor them and you see whether you're doing well. In our system, for example, we have multiple stores, and one aggregation we do for every store is counting the number of products we have in that store. Say that count suddenly drops: we're expecting it to be around eight million and it's only 200,000; we know something went wrong, either we didn't crawl properly or some system ate the files or the records, something like that. So these metrics and aggregations are very, very useful. Take count, for example: you want to count records, but your data no longer fits on one machine, it sits on multiple machines. How do you do an aggregation over multiple machines? You do local aggregations: count one, count two, count three, and then you send only those three values to a central node and sum them up. You don't have to ship all the data to a single node to count it.
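A toy sketch of that local-aggregation idea, using in-memory shards to stand in for machines:

```scala
// Three hypothetical shards of records, each living on its own machine.
val shards: Seq[Seq[String]] = Seq(
  Seq("url1", "url2"),
  Seq("url3"),
  Seq("url4", "url5", "url6")
)

// Map side: each machine counts its own records locally...
val localCounts: Seq[Long] = shards.map(_.size.toLong)

// ...and only these tiny partial counts travel to a central node to be summed.
val total: Long = localCounts.sum // 6
```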
And count is not the only kind of aggregation you do: you also do max and min. Max and min are useful for threshold detection and outlier detection. For example, there are e-commerce sites that accidentally list products with a price of 20 million; you want to remove those from your system, because you're showing analytics to your customers, and your average would suddenly become very, very high. So you need max and min, and you want sum and things like that; if you want to do an average, you need a sum. Now, if you look at parallelizing an aggregation operation like this, you can only do it if the operation satisfies three properties. First, the operation needs to be associative: (a · b) · c should be the same as a · (b · c), because you don't know in which order, on which nodes, things are going to be combined. Second, the result should belong to the same set: you add two integers, you need to get an integer. Third, an identity element is important: for example, if you're doing counts and one node didn't even have the thing you're counting, it has to send a zero, and adding zero to some number gives you that number back; for count, zero is your identity element. So can someone think of a structure that satisfies these properties? Right: this again ties back to algebra; this is nothing but a monoid. A monoid has exactly these three properties we spoke about. To explain what a monoid is: a monoid gives you a view on your data. Your data here is x and y, and the monoid, this pink circle around them, is a view on that data: a max or a min or a count or a sum is a view, because you're finding a particular representation of the data. If you can come up with an operation on that view that satisfies these three properties, then that view qualifies to be a monoid. We spoke about max, min, and sum: those are all monoids, because they're views with operations that satisfy the properties. More monoids: a list is a monoid, because when you concatenate two lists you get a list back, and the identity element is nil; concatenating nil with any list gives you that list itself. Similarly for sets. So why are monoids important? Just like I explained why monads are useful: because all I need to ask is, how do I represent this computation as a monoid? I want to do some aggregation and I want to parallelize it, and I don't need to go to my colleague and explain, hey, I want to do this and that; I just need to ask him: how do I represent this as a monoid? You form your own lingo. And if you can't represent a computation as a monoid, it means you cannot parallelize it. That's one cool thing about having that abstraction in place. If you're using Scalding, you can easily define a monoid like this and plug it in. Because what usually happens in a company is that different teams work on the same set of problems: in my system I have a need to do a max, and maybe the system next to mine also needs a max. If you don't have this common contract, a common way to look at your aggregations using monoids, I'm going to write my own max function, he's going to write his own max function, and the code gets duplicated.
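With Algebird, the library Scalding uses for this, defining and plugging in a monoid takes just the identity element and the combine operation. A sketch with a hypothetical MaxPrice wrapper (Algebird already ships a Max; this is only to show the shape):

```scala
import com.twitter.algebird.Monoid

case class MaxPrice(value: Double)

implicit val maxPriceMonoid: Monoid[MaxPrice] = new Monoid[MaxPrice] {
  def zero: MaxPrice = MaxPrice(Double.NegativeInfinity) // identity element
  def plus(a: MaxPrice, b: MaxPrice): MaxPrice =         // associative, closed combine
    if (a.value >= b.value) a else b
}

// Once it's in your shared library, every team aggregates the same way:
val overall = Monoid.sum(Seq(MaxPrice(99.0), MaxPrice(599.0), MaxPrice(20e6)))
```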
If you think in terms of monoids, Scalding already gives you a set of monoids, and you can also define your own using Algebird, which is what Scalding uses. You can, for example, define a set monoid: you just need to define two things, the identity element and the actual operation you need, and then you plug it in. That's it: once it's in your standard library, anybody who wants to aggregate sets in the future can just use it. You're not going to have code duplication, and it gives you a very nice abstraction. Of course, there are always exceptions: not every computation can fit in as a monoid. For example, if you want the median of your data, there's no easy way to do it; you can't parallelize it; you need to bring the data to a single machine and find the median there. But at least from our experience these cases are very rare: I don't remember the last time somebody asked me for the median of a data set. It doesn't apply to us; maybe the situation is different for you. In those cases, you simply don't have a monoid.

So let me summarize everything we've gone through. The first lesson is immutability: why it's useful, why you want to move away from databases as shared mutable state, with multiple components writing to different columns in the same row. Using an append-only store with read-time reconciliation of the data gives you a lot of flexibility in how you merge and how you create views, and it also allows you to scale really well. Second, think in terms of data flows instead of MapReduce: if you're using the raw MapReduce API, please use Cascading and Scalding; they will definitely make your life easier, and they give you a very natural way to think about your data in terms of forking, joining, and so on. Finally, monads and monoids offer very good abstractions over the common operations you do, and they establish a vocabulary within your company: can I represent this as a monoid, how do I sequence this as a monad, things like that. That really helps in having good clean code and sharing stuff across teams. So that's about it.