I'm going to talk about the abstractions of a managed streaming platform, and about how we run one at scale at Flipkart. Before we get into that, let me tell you a bit about myself. I am a senior architect at Flipkart with about 10 years of industry experience. I love solving deep, complex problems. I am a conference traveller; I just love attending and presenting at conferences. I think conferences have great energy and are a great place to learn.

Coming to the agenda for today. To start with, I'll present some use cases and examples to help you build an intuition for what stream processing is all about, using examples we have at Flipkart. Then I'm going to talk about why we need a stream platform at all: these could be different, independent jobs, so why build a platform around them, and what characteristics does such a platform need? Then I'm going to introduce fstream, which is our in-house managed stateful stream processing platform. Mind you, it's a closed-source project right now; the idea today is to present the intuitions and abstractions that we feel are important. At some point it will go out to the open source community for people to use. Finally, I'm going to talk about the architecture components and the various things we thought through and built while solving the stream processing problem.

So, what are the different use cases? Let me tell you about some of the use cases we have at Flipkart. How many of you have bought something in a flash sale? Raise your hands. Quite a few. And I'm pretty sure buying in a flash sale is pretty hard, right? Flipkart was actually the first company to run flash sales in India, in 2014, and I was part of the team that ran the first one. We realized afterwards that a flash sale is a lot more complicated than it looks, not just for the users but also for the internal users at Flipkart: how a flash sale should be run, and what needs to be measured. One thing that came up was that the flash sale happened so quickly, and the inventory ran out so quickly, that it was very difficult for analysts to understand what had really happened. At that point we had a batch processing system; metrics would only start coming in by the next day, and we would realize we had burned a lot more than expected. So we needed a system that could process the data in-stream and give near real-time information to people like the analysts who plan what fulfilment should look like. If a sale has gone really well, maybe taper down the next round of sales so that we can control the burn and protect the supply chain.

Another example is trending products. Buying is a very psychological behaviour: we like to buy things that others are buying, and we want to know what others are doing. So another stream processing use case is trending products. We collate information from the people buying on Flipkart and use it, in a near real-time manner, to tell a user who has just come in what people are buying and therefore what is trending on the website. This again requires a lot of near real-time processing of data so that we can surface this information back to the users who have just arrived.
So this again is an example of near real-time processing, where browsing behaviour and purchase data are consumed. Yet another example is search auto-suggest. If you go on the website and type "NI", then depending on which popular search result matches, it will give you either a Nikon camera or a Nike shoe, whichever is trending at the time. But if you know the user's intent, if he has previously shown an intent to look for a shoe, then when he types "NI" it makes more sense to show him Nike instead of Nikon, right? How do we do that? It means we have to process his session information, his in-session behaviour, because across sessions, which could be a few hours or a few days apart, we might have lost the context. Within a session we have to process that information and use it to augment the search results, giving him a much more personalized search experience. So today on Flipkart, if you type something like "NI", you will get a personalized search result most of the time. This is another example where you need to process data in a near real-time manner, within the session context, and act on it. There are other examples too, for example reseller fraud detection and search ranking improvements; going into their depth would take a while, but I hope by now you have an intuition for why we need real-time processing to solve various business use cases. Today at Flipkart we have 500-plus jobs processing at a peak throughput of 400K events per second, solving a bunch of business use cases that were simply not possible before.

Now, going back to those use cases, what do they mean for processing? Let me take you back to the flash sale example. There we were trying to process order data. The order data needs to be grouped by category or product, so that the supply chain analyst looking into it can get the data for the particular product whose flash sale has just finished. You also need to create time-windowed aggregates, because you are taking this information, putting it into a time-window bucket, and aggregating the number of orders for that product. Then you push that to a reporting database from which report visualizations can be created. Similarly, for the trending products example, you look at browse events and group them by category, because you want to show trending products from the same category to another user. Again you need time-windowed aggregates, because you want to factor in only the current or most recent time windows in which a trend has risen; for older time windows the data is no longer relevant. And then you need to sink that down to some trending database. So these are the compute paradigms that are required: you need joins, time windows, aggregations, transformations, and writes to external sinks. Over the course of this talk I'll explain how, on top of these primitives, we built up our stream processing platform.
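To make those primitives concrete, here is a minimal sketch of the flash-sale aggregation written against Flink's open source DataStream API, purely for illustration; the hardcoded elements stand in for a Kafka source of order events, and print() stands in for the reporting-database sink:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FlashSaleCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // (productId, 1) pairs; in production this would be a stream of order events from Kafka
        env.fromElements(Tuple2.of("phone-x", 1L), Tuple2.of("phone-x", 1L), Tuple2.of("tv-y", 1L))
           .keyBy(t -> t.f0)                                           // group by product
           .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))  // time-windowed buckets
           .sum(1)                                                     // orders per product per window
           .print();                                                   // stand-in for the reporting-DB sink
        env.execute("flash-sale-aggregates");
    }
}
```

That's the whole shape of the problem in four steps: group, window, aggregate, sink.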
Now, why do we require a platform? These could be independent jobs written by different people, so why a platform? Typically, for stream processing, the entry bar is pretty high. A lot of domain expertise is needed, because unlike batch processing, which has really matured, stream processing is still new and requires a lot of understanding of how data will be processed. It requires complex stateful operators: you need time windows, you need state management, because you have to start tracking when the data arrived, what the window size is, and so on, and retain that information to be able to process the data, and you are doing all of this at high scale.

Let me explain through an example why stateful operations are difficult and why the entry bar is so high. As events happen, there are two notions of time. There is the event time, the time at which the event actually occurred, and there is the processing time, the time at which you actually process the event. In an ideal world, event time and processing time would be equal, but that is not true: processing time is typically skewed from the event time. That could be because of processing delays, event failures, network failures, and so on. So your processing time skews, and you end up with late data or out-of-order data, and that's why you need to start remembering state in order to process it. (I'll make this concrete with a short code sketch after we survey the engines.)

Again, infrastructure management is hard. You have to manage compute, you have to manage storage, and expecting everyone in the company to focus on this rather than on the business problem defeats the purpose of doing it in the first place. Finally, housekeeping is also hard. You need metrics, alerting, job configuration, job management, and all the bells and whistles around them, so that a developer or an analyst can focus purely on the business computation and let the platform handle the rest.

So in our opinion, an ideal streaming platform rests on six pivots: a good programming model, stateful operations, a low entry bar, infrastructure management, monitoring and alerting, and job lifecycle management. These are the six pivots around which we started thinking about a platform. But there are a lot of open source alternatives, right? Why couldn't we use one of them? Let me take you through our analysis. Storm was the first stream processing engine, built in 2011. It had the concepts of spouts and bolts, a master node, and ZooKeeper to manage the computation state. I won't go into Storm's architecture, but I'll evaluate it against the pivots we defined: what it fulfilled and what it did not. It lowered the entry bar, because you could just write some spouts and bolts, stitch them together, and deploy your job. You could test the job in your local environment by pointing it at your configurations, and that helped developers move faster.
There was some sort of monitoring, because Storm has a user interface on which you can see how the jobs are performing, the throughput, how many spouts and bolts are running, and so on; of course, there was no alerting around it. It also had some job lifecycle management, because you could deploy a job and kill it from the UI. But it did nothing for infrastructure; you had to manage infrastructure on your own. There was no concept of stateful operations: information was processed from one spout to a bolt and then it was gone; no state was retained. And there was no real programming model in Storm, no concept of how to think about the computation. Spouts and bolts are computing primitives, but not really a programming model.

Then came Spark, essentially the second-generation stream processing engine, with the concept of a driver and a bunch of executors across which the work is executed. Again, against our pivots for the ideal streaming platform: apart from lowering the entry bar and providing lifecycle management, Spark also had a UI, similar to Storm's but much more detailed; you could see stages and tasks and get more information from it. So it had monitoring, and it also lowered the bar. But most importantly, Spark came with a programming model. By that I mean that in Spark you could create a computational DAG; you could think in paradigms like map, flatMap, or reduce, and you could write to a sink, to HDFS or elsewhere. Spark was the first engine that really came with a programming model. With that, it became easy for developers to start thinking about data processing in terms of the computation they want to do, and not in abstract terms like bolts and spouts.

And then finally, the coolest, the new-generation stream processing engine: Flink. Flink does a lot more things, and from our pivots of the ideal streaming platform, what it really brought to the table was stateful operations. In Flink, every compute node has associated state; you can store state and then do things like triggering: you can wait for a time window and fire whenever the window is reached. It had a much more evolved programming model than what Spark offered, with constructs beyond map and reduce: triggers, early triggers, handling of late-arriving data, and so on. So its programming model really evolved beyond what Spark offered. With respect to job lifecycle, entry bar, and monitoring, there was some native support but not a lot; they focused mostly on stateful operations and the programming model. And finally, there was no infrastructure management, so you had to do that on your own.
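To make Flink's event-time constructs concrete, and to come back to the event-time versus processing-time problem from earlier, here is an illustrative fragment. The Order type is a placeholder I've invented; watermarks absorb out-of-order data, and allowedLateness keeps the window's state around so that a late event re-fires the window's trigger:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    // Placeholder event type: a public POJO with an event-time field.
    public static class Order {
        public String productId;
        public long eventTimeMillis; // when the order actually happened (event time)
    }

    public static void orderCountsByEventTime(DataStream<Order> orders) {
        orders
            // tolerate events arriving up to 30 seconds out of order
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((o, ts) -> o.eventTimeMillis))
            .map(o -> Tuple2.of(o.productId, 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG)) // type hint for the lambda
            .keyBy(t -> t.f0)
            .window(TumblingEventTimeWindows.of(Time.minutes(5))) // buckets by event time
            .allowedLateness(Time.minutes(10)) // state is kept; late data re-triggers the window
            .sum(1)
            .print(); // stand-in for a real sink
    }
}
```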
So these are the options available in the market right now, and if you have to build a new stream platform you can choose any of them, but each has its own problem set; none of them offers everything you want your platform to be. And that is why we thought we needed to build something that provides a higher-level abstraction than what Flink or Spark or Storm provide, and for that we came up with fstream, our in-house managed stateful stream processing platform. I'll talk about it through the lens of the same pivots, but before that let me give you a brief tour of fstream's architecture. At its core is the stream processing layer, which has a programming model around map, filter, groupBy, and dedupe; we have created additional constructs around deduplication, aggregations, joining of streams, and so on. We natively connect to multiple sources: Kafka for streaming, but also HDFS or Hive, so the same job you have written can be fed by a batch load, and I'll get into that. Then there is a bunch of sinks to which the data can be written. It connects to an execution engine in a pluggable way, so you can plug engines in and out: if for a particular use case you don't want to use Spark, you can swap Spark out for Flink. We have our own state store, managed externally and built on HBase, and you can plug in your own state store simply by extending our interfaces. It has a job repository with which you can manage jobs: configurations, artifacts, their versioning, and so on. At a high level, this is the fstream architecture. I'll now walk through each of the pivots and what we considered the right constructs to offer our customers, the people writing stream processing jobs.

As I said, there is a bunch of operators; they read from a source and write to a sink. That, in essence, is our programming model. All the interfaces are pluggable, so you can plug in different sink implementations; if tomorrow you want to write some data to Redis, you just implement a Redis sink and you're done; the rest of the code does not change. It supports platform-managed checkpointing. In Spark, and even in Flink, the checkpoint is maintained internally by the engine, so if your code changes, the checkpoint has to be managed by you, the person writing the job. In fstream we improved on that by having the platform do the checkpointing itself: we maintain our own checkpointed state, so whenever the job is restarted, with whatever version of the code, it remembers the last offsets, the last checkpoint it had previously read up to, and restarts from there. Sinks are the terminal outputs, what the program emits at the end; the interfaces are all pluggable, and you can also emit change notifications by using the change-notification APIs. In between the source and the sink there is a bunch of operators, and the operators can be written in a fluent style: on your stream you can do stream.map, then a join, where you look into another stream, then possibly a groupBy, and then write to a sink. You are describing the computation in a fluent style, plugging in whichever operator you feel is right for your stream processing computation.
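As a rough illustration of that fluent style, here is what such a pipeline might look like. fstream is closed source, so every name below is invented; treat this as a sketch of the shape of the API rather than the real thing:

```java
// Hypothetical sketch only: FStream, KafkaSource, RedisSink, and the event
// types are invented names, not the actual fstream API.
FStream.fromSource(KafkaSource.of("browse-events"))   // pluggable source (Kafka, HDFS, Hive, ...)
       .map(BrowseEvent::fromJson)                    // transform raw records
       .join(addressStream, on("addressId"))          // look into another stream
       .groupBy(BrowseEvent::category)                // group by a dimension
       .aggregate(Count.perWindow(Time.minutes(5)))   // time-windowed aggregation
       .sinkTo(RedisSink.of("trending"));             // pluggable sink; swap implementations freely
```

Because every stage is an interface, swapping the sink for, say, an HBase one is a one-class change; the pipeline description itself stays untouched.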
I talked about join, so let me deep dive into joins and explain why they are typically hard, harder than you would think. Before we get into fstream's approach: in a join scenario there are two use cases. One is stream-to-stream join: both sides arrive in streams and you want to join them. The other is table-to-stream join: there is some existing data and you want to join it with the data coming in on the stream. Now, in some cases the join window can be indefinite. An example: suppose you created an address in 2014 and you are ordering against that address now, and in a stream processing job you want to join the address with the order. The join window here is indefinite; you could not have said "I'll do a five-minute join and that's it", because the address data would not be in the stream right now; that address was actually created back in 2014. So we are talking about a join window that is indefinite. Then there is a lot of late-arriving data: as I mentioned, with network failures and processing issues, data can arrive later than usual, so it could miss the window you defined for your join.

So we came up with the concept of mandatory and non-mandatory joins, and with this we were able to offer a choice. You can go for low latency, processing the data with really low delay. You can go for correctness: that's the mandatory join; if the join has not succeeded, the data is not processed. We also introduced the concept of eventual correctness: if the join is not successful right now, park the event, and whenever the other side arrives, even as late data, process it as soon as it comes. And the third option is to process the data right now, and when the joining data arrives, process it again; that mixes low latency with eventual correctness. If you think about it, all these join types mean you have to start storing state somewhere: you have to store the state of the join so that when the event arrives you can trigger the pending joins. To achieve this we created a declarative syntax for defining join scenarios, and internally we implemented a state store on which the joins are done. With this, people are able to reuse pipelines: if one person has written an order-to-address join and someone else now wants it, it's reusable. And delayed joins are handled by the platform natively.
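Here is a hypothetical sketch of what that declarative join syntax could look like; since fstream is closed source, all names below are invented for illustration:

```java
/** Hypothetical, invented names: a declarative spec for the three join behaviours above. */
public class JoinSketch {
    enum JoinMode {
        MANDATORY,            // correctness: do not emit the event until the join succeeds
        EVENTUALLY_CORRECT,   // park the event in the state store; re-trigger the pending
                              // join when the missing side (e.g. the address) finally arrives
        LOW_LATENCY_REPROCESS // emit immediately, then process again once the join side arrives
    }

    record JoinSpec(String rightSide, String joinKey, JoinMode mode) {}

    // An order-to-address, table-to-stream join with an indefinite window:
    static final JoinSpec ADDRESS_JOIN =
            new JoinSpec("addresses", "addressId", JoinMode.EVENTUALLY_CORRECT);
}
```

The state store is what makes the last two modes possible: the parked event and the pending-join bookkeeping have to live somewhere until the other side shows up.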
So that was join. Another complex operator is stateful aggregation. Typically, stateful aggregations are computed per time-window bucket, keyed on certain fields. Think of them as metrics you want to compute against dimensions. An example: if you are visiting a product page, the metric would be the number of product page views, and the dimension would be the category of the product. Was it lifestyle or was it electronics? Here the category is the dimension, and product page views is the metric. Now you want to compute that in a time window, and to do that you have to start storing aggregation state: you need to remember that you have processed this data and that it belonged to a particular time T1. As long as you continue receiving data for that time-window bucket T1, you keep aggregating into it. Typically such an aggregation job looks like this: first you take the data and transform it into a (metric, dimension) kind of tuple; then you filter, because you only want to process the data relevant to the aggregation you are interested in; then you do a reduceBy, because you are reducing the data for a particular dimension; and then you map it back into the format in which it can be written to the sink. All of this complexity is taken away by the platform through a single aggregation operator: the user just calls .aggregate in his job, and all that complexity disappears. Today fstream powers about 1,000-odd reports; the platform typically processes over a billion events, computing more than 50 dimensions.

So far we have talked about the programming model and stateful operations. Now, how does fstream lower the entry bar? We built an extensive test suite. With it, no user has to worry about deploying software or figuring out how to run it; the entire job can be written and then tested from the local IDE. You don't have to deploy your code anywhere. It covers the functional aspects and even some non-functional ones (not scale, but, for example, if you are using Spark, it brings up a Spark environment in your local IDE, runs the job, and lets you inspect the results after the temporary Spark environment has been shut down). From that point of view it has become very easy for developers to iterate quickly. They no longer have to write code, ship it to staging or production, discover a bug there, and repeat the whole cycle; they do everything in their IDE.

We have also built multiple fault-tolerance capabilities into the platform. Faults can occur because of the network or the infrastructure, or data can be corrupted. So we came up with the concepts of retry topics and sideline topics. Whenever your job is processing and, say, writing data to a sink or doing a lookup from some intermediate store, the platform can automatically detect which errors are recoverable and which are non-recoverable. For example, if your stateful store, say HBase, is temporarily down, the job continues processing, and all the requests that failed to be written are automatically put on a retry queue. Your job does not fail even though the HBase sink is down; at some point the person managing that sink or cluster will bring it back up, and the stream job you wrote will automatically process the records that were pending. No manual operations are needed from the fstream job writer; the faults are inherently tolerated by the platform and the data is made available for processing. For unparsable records, which retries cannot recover, we have the concept of a sideline topic: such data gets pushed to a sideline queue, the users are made aware that data is waiting there, and they can take manual action to figure out what is wrong with that data and reprocess it asynchronously.
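The retry/sideline routing is essentially the classic dead-letter pattern over Kafka. Here is a minimal sketch with plain Kafka producers; the topic names and the recoverable/unrecoverable split are assumptions for illustration, and in fstream this routing is done by the platform, not by user code:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FailureRouter {
    private final KafkaProducer<String, byte[]> producer;

    public FailureRouter(KafkaProducer<String, byte[]> producer) {
        this.producer = producer;
    }

    /** Route a failed record to the retry queue or the sideline queue. */
    public void handleFailure(String key, byte[] value, Exception error) {
        if (isRecoverable(error)) {
            // Transient fault (e.g. the HBase sink is temporarily down): requeue so
            // the record is retried automatically once the sink comes back up.
            producer.send(new ProducerRecord<>("orders-retry", key, value));
        } else {
            // Unparsable/poison record: park it for manual inspection and
            // asynchronous reprocessing, so the job itself keeps running.
            producer.send(new ProducerRecord<>("orders-sideline", key, value));
        }
    }

    // Assumption: treat I/O and timeout failures as transient, everything else as fatal.
    private boolean isRecoverable(Exception e) {
        return e instanceof IOException || e instanceof TimeoutException;
    }
}
```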
So with the extensive test suite and this fault management, we were able to lower the entry bar. Next, the job lifecycle. We created the concept of a job repository in which jobs can be versioned and we capture metadata about each job: when it last ran, who ran it, who owns it. With this we were also able to build lineage, because typically in stream processing you process a lot of data, push it to an intermediate message queue like Kafka, and then another job picks it up and processes a bunch of information again; you get a chain of jobs. With job lineage you can trace from the end output back to the initial input by linking all of this together, and we could do that because we built the job repository, where all the metadata is captured and made available to our users. On top of that, users get launch and kill operations for their jobs.

Finally, monitoring and alerting. Spark and Flink provide a lot of metrics, but we felt we wanted more granular metrics around the computation being done. For example, if you are doing a join, you want to know how many joins actually failed and how many joins are waiting for a retry, right? These are some of the metrics you want. Similarly, you want alerts: has your job gone into a scheduling lag, meaning it is processing slower than expected? Is the processing latency high? All these metrics are much more granular than what Spark or Flink provide, and that is what we built automatically into the platform. A user writing a stream job does not worry about creating the metrics he needs for monitoring; all of that is done internally by the platform and made available on a dashboard like Grafana where it can be watched. Here are some examples from our jobs. We have a section for the engine, which in this case is, say, Spark, where we capture metrics like the number of records processed and whether there was a scheduling delay. Then we have a section for the source, reading data from Kafka: how many records have you read, how many have you parsed, how many failed parsing, and is there a lag in processing from Kafka? For example, if you are reading from offset y and the current offset being written by a producer is, say, x, what is the delta between x and y? All these metrics are automatically surfaced to the user, so he gets an understanding of what is causing the problem: is the Kafka read the problem, or is Spark having a problem, and so on. Likewise for joins, as I mentioned: you want to figure out how many joins succeeded, how many failed, and so on. Similarly, some of the alerts we have built internally: is there a lag in processing, did the job fail, how many records are going to the sideline or retry queue, is the source lag very high? All of these alerts are built in and made available out of the box for end users.
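That source-lag metric, the delta between x and y, is straightforward to compute from Kafka itself. A minimal sketch (how the consumer and partition list are obtained is left out as a placeholder):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class SourceLag {
    /** Sum over partitions of (latest produced offset x - last committed offset y). */
    public static long totalLag(KafkaConsumer<?, ?> consumer, List<TopicPartition> partitions) {
        Map<TopicPartition, Long> end = consumer.endOffsets(partitions);       // x per partition
        Map<TopicPartition, OffsetAndMetadata> committed =
                consumer.committed(new HashSet<>(partitions));                 // y per partition
        long lag = 0;
        for (TopicPartition tp : partitions) {
            OffsetAndMetadata om = committed.get(tp);
            long y = (om == null) ? 0 : om.offset();  // nothing committed yet: treat as 0
            lag += end.get(tp) - y;
        }
        return lag;
    }
}
```

Alert thresholds on exactly this number are what tell you whether a job has fallen behind its source.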
So this is most of the work we have done so far. Some of the future work revolves around creating a SQL interface. Today, the only piece of code someone needs to write describes the job, something like dataflow.map.join.aggregate.sink; that is the one line of code people typically write. We are now building a SQL interface, so the computation I just described in code can be described in SQL, which then generates code that gets deployed for processing. This will really open the platform up to a whole section of users, like analysts, non-tech people who find it hard to write actual code; for them SQL is the language of choice, and with it all your data analysts can express the same streaming computation. Next, we want to create a serverless compute construct around fstream, essentially the ability to automatically scale up and scale down. We do that today in some manner, but this is more about how to absorb really spiky loads and then recover from them automatically. We also want to bring in constructs for multi-tenancy: we want to provide isolation guarantees to our different types of users, because multiple use cases run on the platform and sometimes one use case can overwhelm the others, so we want those multi-tenancy guardrails. And finally, we are working on a user interface for the job lifecycle manager. Today users interact through programmable APIs; we want to make the same operations self-serve through a user interface.

From all this, I think the key takeaways for you would be these. Think about the programming model: in a streaming job the programming model is very important, because it brings clarity to what a job should look like. Keep a low barrier to entry, no matter who your end users are, developers or analysts; always try to lower the barrier for them. Monitoring, alerting, and job management may look trivial, but they are very important: when things go bad, and things do go bad every so often, how robust your monitoring, alerting, and job management are is what makes the users of the platform really like using it; otherwise you will be spending sleepless, late-night cycles. And finally, job management matters because it gives the users of the platform a good hands-on experience of managing their jobs: they can think about versioning, about when to onboard a job and when to kill it, without the platform getting in their way. So for stream processing, these are some of the key abstractions that are required, if you have such use cases in your company. That's pretty much it. If you have any questions, I'm open to the floor.