Thank you for having me. My name is Aamun Deep. I work as a senior software engineer with Hybrid Greentech, and before that I was a principal IT consultant in banking, on event streaming projects built around Kafka. These are the ten best practices I have put together here, learnt out of experience, learnt out of hard times, learnt out of crazy bad mistakes and the back-and-forth of realizing them. The ten practices are not in any particular order; I would say they are all equally useful.

To get started: why do we need event streaming? Anything happening in the world affects a business, and if a business runs 24/7, it needs to respond to those events. We are not talking just about the stock market, volatility, price fluctuations and executing trades. We are talking about new products emerging around us, like flash loans in the crypto space, if you are watching the DeFi universe; smart grids in electricity distribution, where you track who is contributing to the grid at what time, measure the voltage every 250 milliseconds or so, and have a relay act on the result; and of course e-commerce, with transaction events and all the user interactions, where if you can detect that a user is repeatedly showing a particular behavior, you can take quick action. These are all cases for event streaming.

Now, there is no single argument that settles "do I invest in streaming, or do I store the events and produce and consume them the typical, classical way?" Do I need streaming if I have storage? Maybe, maybe not. You can have a data lake, a database, a time series database, or another data warehouse. But what about your data's shape, size, query criteria and properties, and data cleanup: how big will that responsibility be? And honestly, how well has the existing data warehouse been helping you so far? If you have to segregate those events anyway once they are collected, why not collect them in the right way, in the right shape, or stream them, from the start?

Some terminology, which I think most of the audience here will know. A topology, as we call it in event streaming, is a graph of stream processing nodes: pieces of logic arranged in a certain sequence. There are source processors, there are intermediate processing nodes, and there are sink processors.
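To make that shape concrete, here is a minimal sketch of a topology in the Kafka Streams DSL. The topic names and the trivial transformations are hypothetical; the point is just to show one source processor, two processing nodes, and one sink processor:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.KStream;

    public class MinimalTopology {
        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();
            // source processor: reads events from an input topic
            KStream<String, String> source = builder.stream("readings");
            source
                .filter((key, value) -> value != null)   // processing node
                .mapValues(String::toUpperCase)          // processing node
                .to("readings-clean");                   // sink processor
            return builder.build();                      // the graph of these nodes
        }
    }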
Then there is the concept of mediator EDA versus broker EDA, of which I believe you have a basic understanding, and CQRS, Command Query Responsibility Segregation: for the same data you can have a read-heavy interface, and for the same data you can have a write-heavy interface, so commands and queries are segregated. While we are in this area, let me also throw in the Lambda architecture, which we discuss all the time: real time plus batch. And then there is the Kappa architecture, which has the characteristics of both batch and real time. I am not getting into a discussion of those here, but this is what you will be hearing around you, and all of these architectures, data warehouses and data lakes go along fine, until number one.

Number one out of the ten: setting expectations. Setting expectations related to events: why are we doing this? Events are coming and we are going to act upon them, sure. But what is the order of these events? Do we receive the events in the same order as they occur in the real world? Reliability, availability, guarantee of delivery: they look like broad buzzwords, but here they are actually business terms, and guaranteed delivery is a whole discussion in itself. Event volume: how many? Event priority: which event matters more than another? Event size: what payload size is coming, and how much should I plan for? If I have to act quickly on a certain incoming event, it had better be lean so I can pass it along quickly. Maybe an event first has to be synced up with some other event in the outside world before you can call it an actual event; maybe it was a false alarm, so it is not an event at all. Setting all those expectations up front is the most important thing, I will say, and I might end up repeating the word "most" for all ten, because I find all ten essential.
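To give one concrete handle on that delivery-guarantee discussion: in plain Kafka a lot of it comes down to producer settings. The sketch below is a minimal, hypothetical configuration, not a recommendation for any particular workload; each of these settings buys a stronger guarantee at some cost in latency or throughput, which is exactly why the expectation has to be set as a business decision first:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReliableProducerProps {
        public static Properties build() {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // a write only counts as delivered once all in-sync replicas have it
            p.put(ProducerConfig.ACKS_CONFIG, "all");
            // broker-side deduplication, so retries cannot create duplicates
            p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
            return p;
        }
    }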
The next one: keep in mind that you are not trying to build a legacy application again. What was the purpose of dumping data into a data warehouse as one big chunk that could frame anything? How about splitting the data into domains instead, and then having a read-heavy and a write-heavy way of approaching the data of each domain? Maybe that is the first start. At this point we are not even defining events; you are first defining those blocks, those domains, whose data will change based on the events coming in. Deconstructing those domains and subdomains is very important when we classify the events and create event dictionaries that correspond to changes in the data dictionary.

Part of this is being aware of the business gerunds. "Gerund" is a fun way of saying "complicated relationship": from an ER diagramming perspective, gerunds are the composite entities, and that is where data ownership responsibilities sometimes fall between the chairs. You split the domains, and then you cannot really place such an entity, because it is composite: some attribute values come from this domain, some from that domain, and it serves a business purpose that belongs to neither. Who will maintain its data ownership? For example, say an entity for a customer's relationship history with a bank is received, containing both customer data and customer transactions, while customer is one domain and transactions is another. Customer transaction history is a gerund. If this data object has to be transported, whose responsibility is that? If a certain event comes in and has to carry it as a payload or sub-payload, which domain acts as the data change owner, the responsible domain? Those are decisions you are going to have to get into.
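As a small illustration of what event-dictionary entries per domain might look like, here is a hypothetical sketch in Java records; the type names are mine for illustration, not from any real system mentioned in the talk:

    import java.time.Instant;
    import java.util.List;

    // customer domain
    interface CustomerEvent {}
    record CustomerAddressChanged(String customerId, String newAddress, Instant at)
            implements CustomerEvent {}

    // transaction domain
    interface TransactionEvent {}
    record TransactionBooked(String transactionId, String customerId,
                             long amountCents, Instant at) implements TransactionEvent {}

    // the gerund: a composite entity drawing on both domains; deciding which
    // domain owns producing and transporting it is exactly the open question
    record CustomerTransactionHistory(String customerId,
                                      List<TransactionBooked> recentTransactions) {}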
Next: not over-investing in a single platform. Maybe the slide title is a slight misnomer; what I am actually trying to say is, don't be so hard on yourself as to declare "we have this technology and that's all we've got." This typically happens when you are streaming events in Kafka and the data flows from one topic, with certain processing, enrichments and aggregations, to another topic and another, the way a typical topology works. To quote my previous organization: "our data is in SQL." Sure, but it is JSON data passing through: JSON transformations, JSON aggregations, JSON field-value injections, and then it lands on a new topic. And there is an intermediate state store, and that intermediate store is SQL, so this still-intermediate JSON gets turned into relational rows to go into a SQL database, read back from there, and converted to JSON again. My point about SQL versus NoSQL is: make a pragmatic decision. If I find NoSQL convenient for intermediate topic storage, what is so wrong with that? If my core business database is SQL, sure, the data will end up there eventually, but I will not force every intermediate step through it.

I will not turn this into a vendor comparison between Confluent, Apache Kafka, Redpanda and Amazon's MSK. Each has its own processing cost, which you should measure, and maybe you need more than one. For example, Redpanda ships as a single binary and is Kafka API compatible, so if you have already implemented a Kafka system, and Kafka brought a lot of JVM into your technology stack that your new developer hires do not like, Redpanda can sit very well alongside the existing Kafka estate. Same thing with Confluent: if you are not just streaming with the Kafka Streams DSL and you have gotten attached to ksqlDB, which is very convenient, you have an alternative in Materialize, a database that gives you materialized views over different topics, and then you can do almost everything you would do with ksqlDB. People who use Apache Kafka or MSK will tell you that, for those on Confluent, ksqlDB is the one thing, along with some of the connectors, that keeps them there. So go pragmatic; do not over-invest in one single platform.

Then, very important, and I know I am repeating it: developer upskilling. Developer upskilling is all about thinking in streams, thinking of anything happening in the world as a stream. If water is coming at you and your job is to process the water, to sweeten it, say, the typical approach you would pick is: "I'll take this water, hold it somewhere, and do something with it." You do not have that option in the streaming universe. There is a flood coming: you can cut a canal, you can create tributaries, but you cannot stop the water anywhere and hold it until your process finishes, because it is always data in motion, always flowing. The developer's mindset really needs to rise to that: it is a streaming system, not "pick value A and value B from somewhere, and the world can wait while I do." By the time you have picked A and B, maybe a thousand more of them have arrived. And the topologies you build, those tributaries — after this, do this, then this, then this — are, as you can understand, very hard to debug. They also always change: you always figure out that something better can be done, and if the data models change at any time, on the source side or the destination side, that can change your topology as well. Okay, a field gets added, fine; but if it leads to continuous code refactoring on a regular basis, then probably your data modeling skills need upskilling too. A question also comes up about event stores versus API endpoints: should I expose API endpoints, or the event streams themselves? There is no final, straight answer; based on your use cases you can always work out whether it is important to expose something as an API endpoint, and I will come back to that.

Event processing topologies can be extremely challenging to debug, as I said. A key challenge that comes with adopting any streaming platform like Kafka Streams is debugging: the output of an event streaming technology at the end of each stage can be overwhelming. Maybe you use stream visualization tooling, and there are a few options in the Java universe, but it can still be hard to break into your aggregation results, and if any bug creeps into an aggregation stage early on, it will propagate all the way through and skew all the aggregation results. peek is one quite useful tool: the Kafka Streams DSL operation peek is essentially the System.out.println of streaming, if you are a Java person. At every stage you just print and look: peek gives you a copy of the stream, so you can see what it contains after a given intermediate transformation.

Whatever the technology, an inefficient topology can have lasting consequences. Nobody, and I know nobody, and probably nobody has ever told you otherwise, writes a data processing pipeline or topology right in one attempt. Nobody. It always gets transformed, changed, improved; a better way always emerges. That is why I was talking about breaking down the domains, and it is just as important to break the data transformations down into simple chunks. One thing I actually like about Kafka over, say, Apache Spark, is that you cannot do a six-way merge or a five-way aggregation. In Kafka you always work two at a time: I aggregate topic one with topic two; maybe that result can now be aggregated with topic three; maybe the aggregate of one, two and three can then be aggregated with topic four. It sounds complicated, but it is actually not when it is written as neat topology code, and it really helps to have your data transformation broken into chunks. With the peek operation, as I was saying, you can watch at each stage how the incoming stream is getting transformed, and when those chunks are manageable and small, you have much better control of the topology.
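Here is a hypothetical sketch of that two-at-a-time style with a peek after each stage. The topic names and the string-concatenation joins are placeholders; the shape is what matters:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.KTable;

    public class PairwiseJoins {
        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();
            KTable<String, String> t1 = builder.table("topic-1");
            KTable<String, String> t2 = builder.table("topic-2");
            KTable<String, String> t3 = builder.table("topic-3");

            // stage one: combine topic-1 and topic-2, two at a time
            KTable<String, String> t12 = t1.join(t2, (a, b) -> a + "|" + b);
            t12.toStream().peek((k, v) ->
                    System.out.println("after 1+2: " + k + " = " + v));

            // stage two: combine that intermediate result with topic-3
            t12.join(t3, (ab, c) -> ab + "|" + c)
               .toStream()
               .peek((k, v) -> System.out.println("after (1+2)+3: " + k + " = " + v))
               .to("topic-1-2-3");

            return builder.build();
        }
    }

Because each stage is a small, named chunk, a skewed result can be traced to the exact stage that produced it, rather than to one opaque five-way operation.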
Okay, my favorite one here, the seventh item: dumb pipes, smart endpoints. It is not my own creation; "smart endpoints and dumb pipes" is Martin Fowler's phrase. Data movement should be dumb: the pipe should just move the data along. But when the data is required for something, it should be served in a presentable, nicely aggregated way, and that is the job of the smart endpoint you write. This is Martin Fowler's guiding principle, and it also helps an organization keep complete control over its data and its processing logic, keeping the data in raw form, whichever way they would like to keep it, for auditing purposes, replaying purposes and everything else. But this approach is not in the interest of vendors selling data products, so you can google and find articles declaring the end of dumb pipes and proclaiming smart pipes as the new thing. The vendor brings a data processing tool: they pick up the data, do some transformation or changes, maybe a time zone conversion or normalizing timestamps into a uniform format, and deliver it; and as I said, they promise a lot more, workflows and whatnot, fetching aggregated table data from the rest of your universe and assembling something for you to deliver at the destination. Eventually the job of such a data tool vendor is to make you dependent on that data pipe tool, which is by then a smart pipe, because a lot of processing has been injected into it. So you will hear a lot of "the end of dumb pipes, the smart pipes are here," with pre-created workflows where you set up not just data transfer instructions but whole fancy flows. Well, it depends; it might work for some. But if you want complete ownership, and we talk with these vendors all the time, and we do use some of them for certain data transformation and movement purposes, it is always good to keep your ear to the ground on this principle, dumb pipes and smart endpoints, so that you know what data is actually coming and what your processing stages are, with no processing stage obscured inside some vendor implementation. And if you get stuck with a certain topology living inside one of those smart pipes, you have much better control over redoing it, whether you change vendors, want to start again from the raw data, have found a better algorithm, or want to split things differently this time. You have better control when you follow this principle.

Stateful workloads. This comes up because in Kafka, or any event streaming framework, we have the concepts of stateful transformations and stateless transformations. Stateless transformations do not bother us much. But these pieces of logic, the topologies, are deployed on, for example, Kubernetes or OpenShift pods, and the way distributed computing and pods work is that pods can go down and come back up; that is simply how it is, and they were designed for stateless workloads. Stateful workloads are a difficult thing for a Kubernetes pod, a new thing altogether. Then we also have state stores: with Kafka you can always create state stores, RocksDB under the hood, and the state store has to outlive the pod. The pod comes back up, and where is the state store? It is, as the word says, state. The Kubernetes CSI (Container Storage Interface) is one of the things to think about here: it takes care of making the storage available to you as a state store, whichever pod is hosting it. Why does this keep coming up in discussion? Because Kubernetes, or any container orchestration system, has always been designed for stateless workloads, while data streaming involves a lot of stateful operations, so you will always land on this question somewhere.

Frequently changing needs can pose challenges. There is no such thing as a final aggregation over an event stream: even when you have reached the precise time windows and the precise aggregation functions, changing requirements can always create the situation of a topology change. And imagine stream replaying becomes a requirement for you. It should never become a requirement: in event streaming, the data comes in and reaches a certain destination, and in normal circumstances you should not be asked to replay it. I came across an interesting product here, Redpanda's tiered storage: its purpose is that you can collect the data and keep it in a binary, proprietary format, and as long as you use their licenses it is a good strategy, because the data can be read back and replayed from the topic it belongs to. Good. But Kafka, and every other event streaming system, is not a database; they are distributed logs. Their job is to carry information, help you build your data streaming topology, and put the data at the destination you want; from there you should have complete control to read it back, do something with it, or make a new source connector against that data. If you are in a situation where replaying has become a business requirement: no, no, no, something is absolutely not okay here, and it needs to be openly discussed. Such frequently changing needs can be a challenge to your whole event streaming objective itself.

The last one: centralized ownership. Centralized ownership can be a threat, yes. We started with defining business domains and decoupled business processes, and decoupled business processes are what allow you to event-stream between them: when an outside event comes in, you pull certain items into one domain and certain items into another. I can give a banking example: a customer address change. A notification comes in, and many different processes have to spin off inside the bank. It is a customer master data change, so the anti-money-laundering scripts and systems are alerted: for this address change we need to run a secondary check, a whole workflow. And specifically for an address change, the customer experience team, who treat their customers nicely, send a welcome letter home: welcome to your new home, thanks for banking with us, and here is the nearest branch available to you. You might argue that centralized ownership of all this would be better, but more centralized ownership creates more problems, because these are independent business workflows we are talking about; decoupled, domain-driven design is a nice approach for dealing with it (I will sketch the fan-out in code below). So: the more you decouple your business processes at the start, and the more you decompose your data processing into a series of manageable chunks, the more you can say, and this basically sums up all ten of my points: these are my source data structures and processors, this is topology one and this is topology two, this is how it works, I have complete control of every chunk of every topology, and it delivers the final data out to the right place.
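Here is that sketch: a hypothetical illustration of how cheaply the address-change fan-out decouples on Kafka. Two independent services subscribe to the same topic under different consumer groups, so each receives every event and neither owns, or even knows about, the other's workflow. The topic and group names are made up:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class AddressChangeSubscribers {
        static KafkaConsumer<String, String> forGroup(String groupId) {
            Properties p = new Properties();
            p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // every distinct group id gets its own copy of every event
            p.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
            p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p);
            consumer.subscribe(List.of("customer-address-changed"));
            return consumer;
        }
        // the AML screening service runs forGroup("aml-screening");
        // the customer experience service runs forGroup("welcome-letters");
        // adding a third workflow is a new group id, not a change to either
    }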
So basically those are the ten points I wanted to cover. I put a lot of text on the slides as well, maybe because it is useful: it catches the eye, and you can use the slides for follow-ups afterwards. Other than that, I am open to questions. Thank you.