A very good afternoon to all of you present here. I'm Krishnam Priya from MadStreetDen, and the topic of this talk is plumbing data science pipelines. Data science, artificial intelligence, machine learning: these are terms and jargon we hear quite often. Why is building an application that falls into any of these categories a challenge? The real challenge is the data itself, because data has to be aggregated and made compatible for every phase of the application, every phase has to be chained together, and the application has to be real time and scalable.

Let me explain this with a couple of examples. Airbnb, which is a popular alternative to rentals and accommodation worldwide, uses Airflow, a workflow management system. They had a couple of issues to resolve: their tasks were mission critical, they had a scheduling and sequencing problem, and their process was evolving. They needed a scalable, robust architecture that would resolve all of these issues, and that's what Airflow did for them.

The next example is something we've all heard of: providing recommendations to the user. It might appear really simple, but we give recommendations based on the user's browsing history, and that history is not just what the user browsed yesterday. The history is also real time; it includes what the user has browsed in the current session, which is near real time, and we provide recommendations based on that. So how do you build a pipeline for that application?

And the next one is a popular example: logging. Every one of us who builds an application has to take care of logging, because only if you log will you know how your application has performed, what has gone right, what has gone wrong, and how the performance has been over a period of time. For this talk, I'm going to take up a small use case of logging to explain how we resolved its problems by building a pipeline end to end: mostly the dos and don'ts of plumbing the pipeline.

Okay, the plumbing story. You can split it into three major parts. The first part is the preparation phase: as I said, you need to prepare the data first, figure out what problem you're going to solve, ask questions, and collect and organize the data. In the next phase you write algorithms, apply models to the data, process it, and come up with an analysis. In the final phase this analysis is applied: you provide recommendations, send reports, or do some visualization.

The tech stack for the day is Celery, RabbitMQ, Redis, and the ELK Stack. Why Celery? Celery is robust and scalable, and it helps you build a real-time application end to end. RabbitMQ is a queuing platform, and it is also a broker, so it's not just another queuing platform. What RabbitMQ does is manage the queuing: for example, if a subscriber has not acknowledged a particular message, RabbitMQ can send the message again and let you find out what has gone wrong. It literally manages the queuing, and that's why it's a broker. And the ELK Stack is a combination of aggregate, analyze, and visualize: Logstash helps you parse the logs, Elasticsearch is where you actually put and analyze the data, and Kibana helps you visualize what you put there. So this is a brief of the use case I'm going to explain.
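Before walking through the pipeline, here is what the basic Celery-to-RabbitMQ wiring looks like. This is a minimal illustrative sketch, not the speaker's actual code; the app name, broker URL, and task body are assumptions.

```python
# Minimal Celery app with RabbitMQ as the broker. RabbitMQ queues each
# task message and redelivers it if a worker never acknowledges it.
from celery import Celery

app = Celery("log_pipeline", broker="amqp://guest:guest@localhost:5672//")

@app.task
def process_log(line):
    # Stand-in for the real log-processing logic.
    return line.strip()

# A producer enqueues work asynchronously:
# process_log.delay('127.0.0.1 - "GET /index.html" 200')
```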
So you have the logs, they are passed through RabbitMQ, RabbitMQ sends them to the Celery workers, and finally they go to Elasticsearch. Here is the ETL workflow in a little more detail. We have CloudFront and the CloudFront S3 bucket, where the logs are stored; from there they are sent through SQS, you poll the messages from SQS, and RabbitMQ is the broker that manages sending them to the Celery workers. Why does Redis appear there? I'll explain that in a bit. Finally, the messages are pushed to Elasticsearch. The Kinesis and Redshift here are backups in case the push to Elasticsearch does not work; we're not going to explain those here, we'll stop with Elasticsearch.

Okay. I've split the use case into three simple parts: the first is the polling from SQS, the second is processing the logs, which the Celery workers do, and the third is the push to Elasticsearch.

A little more about Celery. With Celery you can handle both compute-optimized and memory-optimized tasks and assign workers to each. You can use all the cores, maximizing CPU utilization even to 100%, while keeping the memory really low. It is heavily parallelized, and of course it's asynchronous.

Let's get to the actual methods, the poll and the push. The polling polls from the SQS queue, and there is something called max retries that I've mentioned there. Similarly, in the push there's something called a rate limit. I'll explain both with use cases.

Max retries: suppose on the polling side the connection to SQS is lost, or some failure has happened on the SQS side, and you have to handle it. What max retries does is back off for a particular amount of time and retry. Based on the back-off time and the number of retries, you get enough time to fix the failure, and if it's fixed before the last retry, the application carries on.

Next is the rate limit. How do you find the rate limit? RabbitMQ has a beautiful management UI that you can enable from the command line (rabbitmq-plugins enable rabbitmq_management), where you can see how many messages per second RabbitMQ is actually handling. Because we have two methods, the polling and the pushing, we have two queues in RabbitMQ: the poll queue and the ELK queue. The rate-limit problem in this use case usually occurs on the ELK side. For example, you're polling a lot of messages, but based on its cluster size Elasticsearch is not able to ingest them at the rate of the polling, so you need to rate limit so that your messages are not lost. This is really a producer-consumer problem: you're producing at a very high rate, but the consumer is not able to consume messages at that rate. So find out how RabbitMQ is performing. Say it is doing 73 messages per second; if you imagine 10 Celery workers, each doing 73 per second, that's about 700 tasks per second being pushed to Elasticsearch. If Elasticsearch is not able to handle that, you rate limit. What I have done is rate limit to 20, which means with 10 workers you sum it up and only have 200 tasks per second.
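Here is a hedged sketch of what those two knobs look like in Celery. The max_retries, retry(), and rate_limit options are real Celery task APIs; the task bodies, the fetch_from_sqs helper, and the numbers are placeholders, not the talk's actual code.

```python
from celery import Celery

app = Celery("log_pipeline", broker="amqp://localhost//")

def fetch_from_sqs():
    """Stand-in for a real SQS poll (e.g. via boto3)."""
    return ["log line %d" % i for i in range(10)]

@app.task(bind=True, max_retries=5)
def poll_sqs(self):
    try:
        messages = fetch_from_sqs()
    except ConnectionError as exc:
        # Exponential back-off: 60s, 120s, 240s... which buys time to
        # fix the failure before the final retry gives up.
        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)
    for msg in messages:
        push_to_elk.delay(msg)

# rate_limit is enforced per worker: 10 workers at "20/s" means the
# cluster sees at most about 200 pushes per second in total.
@app.task(rate_limit="20/s")
def push_to_elk(message):
    print("indexing:", message)  # stand-in for the Elasticsearch push
```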
Maybe Elasticsearch will be able to handle that. This is probably conservative, and it will take some time for you to figure out the balance between the producer and the consumer and come up with a proper rate limit.

Redis. We saw Redis in the ETL workflow, right? Redis is an in-memory DB where you can store some amount of information in memory. Let me explain a use case for it. Suppose the messages on SQS drop to zero. There are no messages, so you don't want to keep polling an empty queue, right? Instead, what you can do is save the attributes of the queue in a session on Redis; the same applies if the number of messages on the SQS queue is really high. When you regularly fetch the attributes of the queue and save them in the Redis session, the workers can go and get that session information, and based on it you have a runner that manages the whole thing. The runner can be dynamically throttled: if there are no messages on the queue, you put a lot of sleep time on the runner and try polling again after a while, and if there are a lot of messages on the SQS queue, you reduce the sleep time. You throttle the runner based on the queue depth. That's where Redis comes into play.

But how do you refresh Redis? That's where the Celery beat worker comes in. What the beat worker does is go and refresh the sessions on Redis at regular intervals, which you mention in the program. This is where you specify how often you want the beat worker to do the refresh.

We're talking about workers, right, and the producer-consumer problem. Once you've figured out how many poll workers and how many ELK workers this application can handle, increase the number of workers in blocks, because there is a balance between the producer and the consumer. You have one block of poll workers and ELK workers handling things; when you're adding more, add another block of workers rather than randomly adding ELK workers and poll workers, so that your application always stays in balance. And Celery has a really beautiful way of letting you know that a new worker has joined the party, so you can inspect the workers and find out how they have been performing.

Then there is something called autoscale. You can set a worker to autoscale if a particular task needs sudden throttling, if you need the workers to scale up during peak times. That's where autoscale comes into place. But remember the rate limit: when you're autoscaling, the maximum autoscale number you've set, the rate limit, and the number of concurrent workers should all be in balance, so that the consumer side still does not have a problem and you don't lose any data.

htop. htop is your best buddy. After doing all this, go to htop and see how much CPU and how much memory you've utilized. The ideal situation is one where the CPU can go even to 100%, we've used only 70% to 80% of it, and the memory is also really low.
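Here is a sketch of how the Redis-backed throttling and the beat refresh might fit together. It assumes the redis-py client; the key name, interval, queue depth, and runner loop are illustrative, not the speaker's code. Celery's beat_schedule setting and the --autoscale worker flag are real.

```python
import time

import redis
from celery import Celery

app = Celery("log_pipeline", broker="amqp://localhost//")
cache = redis.Redis(host="localhost", port=6379)

# The beat worker refreshes the cached SQS attributes at this interval.
# Run the scheduler with: celery -A tasks beat
app.conf.beat_schedule = {
    "refresh-sqs-attributes": {
        "task": "tasks.refresh_queue_attributes",
        "schedule": 30.0,  # seconds between refreshes
    },
}

@app.task(name="tasks.refresh_queue_attributes")
def refresh_queue_attributes():
    depth = 42  # stand-in for an SQS GetQueueAttributes call
    cache.set("sqs:approx_messages", depth)

def runner():
    """Polling loop that throttles itself on the cached queue depth."""
    while True:
        depth = int(cache.get("sqs:approx_messages") or 0)
        if depth == 0:
            time.sleep(60)  # empty queue: back off instead of polling
        else:
            # poll_sqs.delay()  # hand the actual poll to a Celery worker
            time.sleep(1)   # plenty of messages: keep the sleep short

# Autoscaling between 3 and 10 worker processes is a CLI flag:
#   celery -A tasks worker --autoscale=10,3
```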
But in case you're using the CPU completely and your memory is also really, really high, then there is something you have to change in the code. So how do you do the memory profiling? Before I get into that, let me explain a little about the ELK Stack.

The ELK Stack, as we saw, is a combination of Elasticsearch, Logstash, and Kibana. Logstash helps you with the parsing: you can specify a pattern by which it parses the logs and then pushes them to Elasticsearch. We're not using Logstash here; it's out of the scope of the talk, but it's fairly simple to set up. After you push the data to Elasticsearch, you visualize it on Kibana. On a Kibana dashboard you don't really have to know how an Elasticsearch query is written. If you know how you want to filter and visualize your data, you can just apply the filters right away: what you want to see on the X-axis, whether you want to see the count, how you want to aggregate. You set the range over which you want to see the data, and it is visualized automatically. In case you're familiar with Elasticsearch queries, you can do that too; there is a tab called Dev Tools where you can execute a query and get results right away.

So, we were talking about memory profiling. There are a couple of things to look at when you're profiling memory. First of all, clean up the code: look for cyclic references in your code, think about lists versus generators, and if you have objects that are really large, occupying and accumulating a lot of memory, take a look at all of that. We also had Redis sessions, right? The Celery workers maintain the state of those sessions in memory, so you'll have to take a look at that and monitor it too. And connection objects: we have two connection objects in this use case, an SQS connection object and an Elasticsearch connection object. It's best to keep the connection objects outside the Celery task, because every time you create a task, you would otherwise create a connection object that sits in memory. Instead, keep these connection objects in connection pools, which will help you with the memory profile.
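A sketch of that advice, assuming the boto3 and elasticsearch-py clients (the index call signature here follows newer elasticsearch-py versions); the index name and endpoints are made up.

```python
# Connections are created once at module import, i.e. once per worker
# process, and reused by every task instead of per task invocation.
import boto3
from celery import Celery
from elasticsearch import Elasticsearch

app = Celery("log_pipeline", broker="amqp://localhost//")

# Both clients maintain their own connection pools under the hood.
es = Elasticsearch(["http://localhost:9200"])
sqs = boto3.resource("sqs", region_name="us-east-1")

@app.task(rate_limit="20/s")
def push_to_elk(doc):
    # Reuses the pooled connection; no per-task client lingers in memory.
    es.index(index="logs", document=doc)
```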
A summary of all this: we built a real-time, scalable streaming solution. It helped us handle customers in real time at high demand, along with the search history, which, as I said, pertained to a session and was also near real time. And we were able to get in-memory scaling, lower latency, everything.

To put it all together, some pointers on what a perfect plumbing can have. Find out which parts of your program are compute optimized and which are memory optimized, and assign separate workers for the compute-optimized and memory-optimized tasks. Then find the producer-consumer balance, and keep monitoring on htop to check that your memory stays low and to see how your computation is doing. And finally, RabbitMQ: it is best to keep RabbitMQ on a separate instance, not on the same instance where you're actually running the Celery tasks. So that's about it. Any questions? I see one there.

Hi, thanks for the talk. My question was, I was just curious why there are two queuing services. You have SQS and RabbitMQ serving, I don't know, similar purposes. Could you possibly have the workers directly poll SQS and then push to Elasticsearch, or maybe push the logs directly into RabbitMQ? I'm just wondering why you had two sorts of queues.

Okay, so that was the use case we had. The messages were already coming to us from SQS, so we didn't really play around with the SQS part; the use case started from SQS for us. How do we get the messages from SQS, push them to Elasticsearch, and visualize them? So we didn't think of resolving the SQS part of it. Also, Logstash is blocking; it does not have an asynchronous call, so you needed something like Celery to make it asynchronous. Next question?

Yeah, my question is, you are using RabbitMQ and the ELK Stack, right? Did you consider using anything else, like Kafka in place of RabbitMQ, and then Spark instead of ELK?

Yeah, so this is probably a medium-scale application, and what RabbitMQ does is help you do it in real time: as and when a message is published, you get it on the other side. Kafka handles things at a much higher scale, and setting it up is going to require a lot more work. For an application as simple as this, this tech stack helped us do it much more easily. Questions? I guess that's it. Thank you.

Thank you, Krishnam Priya.