So, guys, today I'm going to talk about building and scaling a log analytics platform using a serverless approach. Sorry, I had to refer to my notes, it's been so long. Shrini spoke about all our microservices, and I'll just continue from where he left off. This is not going to be a comical one, it'll be more of a technical one, and we'll have more or less a live demo in the second half of the talk. Let's get started.

Before I start, a bit about me. I'm Narayan, I work as a product engineer at Mad Street Den, which, as you probably know by now, is a startup. I love coding in Python and Golang, and I'm an open source enthusiast. That's how my usual day looks: I'll be sitting in a corner and coding; if I'm not coding, I'll be cycling outside; and if I'm doing neither, I'll be treating myself to some good food. I'm fairly active on Twitter, you can follow me on that handle, and that's my website and blog.

Before we go into the talk, let me do a quick poll: how many of you here have used serverless in production? Okay, and how many of you have heard of serverless, you know what serverless is? So can I assume everyone knows what serverless is? Yes or no?

So as Shrini explained, we broke down a very large monolithic architecture and brought up microservices. It operates pretty well, but it's somewhat chaotic, like this. We need to know what each microservice is doing, how much latency it's adding, and whether it's affecting our SLA. For that, we needed to come up with a proper logging framework. The teams got together, we had our usual long conversation, and we came up with an architecture. Even before coming up with the architecture, we analyzed the behavior of the different services. As Shrini said, we have a few services which are just the ingestion part, the crunching of the data; those services are mostly idle and kick off only when the client sends their data. At the other end is the API service, which is active all the time, with surges and spikes and real-time data. So these are the typical behaviors of our services, or of any microservice: hard scaling during traffic spikes and rapid scale down afterwards.

These are the expectations from a logging framework: it should be dynamically scalable along with your microservices, it should be highly available, secure and highly performant, it should not become the bottleneck for your microservices, and there shouldn't be any data loss or overwriting. Before designing our architecture, we separated our microservices, or rather the logs we are getting, into two groups, real-time analysis and historical analysis, so that we could architect our logging framework accordingly. The API parts mostly need near-real-time analytics, because they directly impact the client's SLA or their website. The first part, ingestion, we can cover with historical or batch analysis, like how much data we crunched over the past month; we don't need any real-time or mission-critical analysis on that.
So then we came up with a conventional architecture. We had the clients, that is the microservices, pushing their logs to the logging pipeline, and we had a highly persistent queue, which was Kafka. We had two consumers: one directed the logs into the real-time pipeline, the one at the bottom, and the other one, the top one, into cold storage for historical analysis. For cold storage we tried a couple of stores and settled on InfluxDB, and to visualize it we took a batch from the cold storage and used Kibana and Grafana. To visualize the real-time logs and set alerts, we went for the traditional ELK stack: Logstash Docker instances, with Elasticsearch and Kibana set up on separate instances. So this is how our old architecture looked.

We faced some problems, especially with ELK. If you have been using ELK, you will know there is something called the thread pool queue size, which defaults to 200 and depends on your machine size. It adds latency, and if your log or data throughput increases, it becomes a bottleneck. Of course, the only way around this is to scale up your machine, but then you are burning more money. Even after scaling up the machines, we were not able to auto-scale, since it is a distributed data store: scaling up and down was not easy, and it took time to redistribute the data and the shards across the different nodes of the cluster. Adding Logstash filters, which means adding filters to modify the logs on the fly, was a pain: people need to know Grok, which is a kind of regex, to add a Logstash filter, or else you have to know Ruby and go edit the Logstash Ruby code to add new filters. The other problem we faced is: who is going to monitor the system that is monitoring all the other systems, which is very meta. Since we had Kafka and had to maintain our own clusters, we had to keep an eye on the uptime of the clusters, the offsets, the nodes; and in the ELK stack, as I said, auto-scaling was not possible, so we had to burn some money on idle time.

So we designed the first version of the architecture, and then we found something very common in all these blocks, which is servers, and every other day people would ask me why the server went down last night. Then we figured out, okay, we are designing a logging framework which is supposed to ease the pressure on the developers as well as on the other microservices, and the logging framework itself should not become a pressure. So we decided to go serverless. Since a few people said they are not aware of serverless, here's the Wikipedia definition. To make it simple: you don't have to maintain your servers, you just deploy your code, and everything will be managed by your provider. When AWS announced Lambda, the CTO of Amazon said, "No server is easier to manage than no server." But all your serverless architectures and services still run on servers, so why do we call it serverless? It's just because all the heavy lifting is hidden from the developer, and you don't need to maintain the servers. Whenever you upload your code it is stored in a container, you set an event to trigger it, a new instance spins up in a few milliseconds, and the code gets executed. So you pay only for the time your code actually executes.
And yeah, so our first version involved servers, and we decided, okay, how do we go with a serverless architecture? Since we heavily use AWS, we thought, fine, we'll explore the serverless services offered by AWS. What's the one thing that comes to your mind when someone says AWS serverless? Yeah, Lambda. But hold on, Lambda is not the only serverless service AWS offers; even before Lambda they were offering different managed services which are serverless. Let's see all of them.

This is how our FaaS, function-as-a-service, architecture looks. We have the clients, which push the logs to the serverless pipeline. We replaced Kafka with Amazon Kinesis; Kinesis is a managed streaming service where you can easily configure your producers and consumers and, pretty much at the push of a button, deliver your data to the destination. We use Lambda, instead of Logstash, to transform the logs on the fly. For cold storage we use AWS S3; you probably already know about S3, you can store data at massive scale, and half of the internet uses S3, so when S3 goes down half the internet goes down, which happened last December. We use Athena, which is a beautiful serverless querying service: you point it at your data source, which can be S3 or a CSV file, and you can run SQL queries on the unstructured data. And we visualize the batch data using QuickSight, another managed visualization tool provided by AWS. That's the historical analysis part.

Now, coming to the real-time part, we used Kinesis Analytics for running SQL queries on the real-time data; to an extent, Kinesis Analytics can replace Spark Streaming or Storm, so you can filter your data and run SQL queries on the fly, in real time. Then, to visualize it, since QuickSight does not support real-time data visualization and is more for batch or business insights, we used CloudWatch for the real-time data, and we set up a few triggers in CloudWatch connected to APIs or PagerDuty so we get notified about incidents or failures. That's how our FaaS architecture looks; I think I explained it in a single slide, so let's go through it.

You can manage any serverless setup in a few ways: through your provider's web console, which is mostly point and click; through the SDKs, and since our provider is AWS and we use Python, we use Boto3, the official SDK; or you can interact with the serverless services through your command line.

Let's get to the demo. I actually have a live demo, with tabs open and running, but according to Murphy's law I am going to mess it up on stage, so I'm going to show you a screencast which I took yesterday. This is our FaaS architecture. The first part is Kinesis: we are setting up the Kinesis stream. This is the Kinesis dashboard; we give a name for the stream and the number of shards, and since we are only going to push sample logs, we give the shard count as one and create the Kinesis stream. So the entry point to our serverless pipeline is created.
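(As a rough illustration of that step done outside the console: the same stream can be created and fed with Boto3. This is a minimal sketch; the stream name, region and log field values are placeholders modelled on the demo, not the exact ones used in it.)

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create the entry-point stream; one shard is enough for sample logs.
kinesis.create_stream(StreamName="service-logs", ShardCount=1)

# Wait until the stream is ACTIVE before writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="service-logs")

# Push one sample log record shaped like the demo script's dictionary.
record = {
    "service": "batman",        # placeholder service name
    "response": 200,
    "created_at": time.time(),
    "client_latency": 42,       # placeholder latency value
}
kinesis.put_record(
    StreamName="service-logs",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["service"],
)
```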
Next, if you remember, the next thing is Lambda: we need to transform the records, so we create a Lambda function. Lambda has different templates, you can search for a Python or Kinesis template, or you can just author from scratch, which means you write it yourself. We name our function and set our roles; this is the Lambda dashboard. Since we are going to use Python, we change our runtime to Python, then we paste our code. What this code does, at least for this demo, is nothing: it just acts as a pass-through, but you can manipulate the logs using whatever conditions you like, and you can even do conditional routing in Lambda. So we have typed the code into the console. There are a few other features: you can set up environment variables if you need to, plus roles and permissions; each Lambda function gets some memory, and you can increase the memory and the timeout; then you can set up VPCs and test your code. You don't have to push real data every time to test your Lambda function: for each function, Lambda provides test events, so you select a test event and configure it, and the test event mocks the data from your data source. Our data source is Kinesis, so this event gives some mock data as if it came from Kinesis, and when it executes we are able to see the data we sent through.

So we have created a Lambda to transform the data; now we have to deliver the data to the cold storage. For that we create a Kinesis Firehose delivery stream. We name it and set the source to the Kinesis stream we created a while back, the entry point of our data. Then it asks whether you want to transform the records using Lambda; we just wrote a Lambda function, so we say yes, I'm going to transform the records on the fly, and we configure it by giving the Lambda function's name. One thing to note here is that Lambda allows you to do versioning: for example, if you have added a new feature to your function, you can publish it as a new version and tag it, so whenever you find the function failing for some event, you can roll back to a previous version. So the source is the entry-point Kinesis stream, and the destination, our cold storage, is the S3 bucket, so we are telling the Firehose delivery stream to put the logs into our S3 bucket. We specify a prefix, which is basically a parent folder inside the bucket. We don't need any backup, so we skip that and set the buffer size: Firehose holds a batch of data, and whenever it reaches the mentioned buffer size or the buffer interval, it pushes the data to the destination. Now we choose the permissions and roles for the delivery stream, review the summary of our configuration, and create it.

So we have created a Kinesis stream, which is the entry point, we transform the records using Lambda, and we created a Kinesis Firehose to take the data from that stream and deliver it to our S3 bucket. Now we push some sample logs so that we can run queries on them. I have written a very simple script which pushes a dictionary with service, response, created_at and client_latency as keys into our pipeline. When we start pushing, it should flow through the streams, get transformed by the Lambda, and then get delivered to S3. So we start pushing, and you can see the Kinesis monitoring dashboard.
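(For context on the transform step wired into Firehose above: a pass-through transformation Lambda generally looks something like the sketch below. It assumes the standard Firehose transformation event shape, with base64-encoded records going in and coming back out; it is not the exact function from the demo.)

```python
import base64
import json


def lambda_handler(event, context):
    """Pass records through a Firehose delivery stream unchanged.

    Firehose hands the function a batch of base64-encoded records; each one
    must be returned with its recordId, a result status and re-encoded data.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # This is the place to filter, enrich or conditionally route the log.
        transformed = json.dumps(payload) + "\n"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",   # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```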
We can see the logs flowing through, multiple spikes of events happening, and you can use this dashboard to debug or to pull your metrics, the throughput of your data. Then, checking the S3 bucket, we are able to see our logs. The logs are stored as multiple files, since we gave the buffer size as 1 MB, which is very small, so within a short period of time multiple files were created.

And this is AWS Athena. Now that we have our logs in the S3 bucket, we need to run SQL queries on them. We write a SQL query, a create-table statement; if you remember, I showed you the key names of the JSON log object in the code, let me show that again, hold on, yeah, these are the keys, and we mention them inside the SQL query along with the data source, which is the S3 bucket. When we run the query, we see a table getting created; this data is pulled from S3, and a relational table got created for us. We can now run any ad-hoc SQL queries on the data in the S3 bucket, so we are able to get things like per-service latency, the services being Joker, Gordon and so on.

Next we need to visualize our batch or historical logs, so we create a new data source in QuickSight. Our data source right now is Athena, since we queried the S3 data through it and it presents the data to us as SQL tables. It prepares the SQL query, and we select the table we saw earlier in Athena. In the left column we can see all the keys from the log messages we sent, and we are able to visualize and take quick ad-hoc business insights from the batch.

Next we go to the real-time analysis pipeline. Here we are using something called Kinesis Analytics, which, as I told you, can run SQL queries on your real-time data. We create the Kinesis Analytics application, and the source of the data is a Kinesis stream, so we point it at our Kinesis stream, and we are not going to pre-process anything. Then there is something called schema discovery: as the logs flow through, Kinesis Analytics tries to discover the schema based on the keys you are sending, so after the first few records it has a schema and is able to show the keys as columns in a table. Then we go to the query editor. Kinesis Analytics has something called streams and pumps: a stream is, say, the main stream of real-time data flowing through the system; we can create another stream by picking specific fields from the main stream, apply some SQL queries on the data, and start pumping data into it. So we are creating a stream called the eBay stream from our source stream, and once we go to the real-time analytics tab and run the query against the main stream, a new stream gets created along with a pump which carries only the eBay data: it runs the query on the real-time incoming data and pumps the matching records into that sub-stream, the eBay stream. So the eBay stream got created, and as you can see, as the logs come in, the simple SQL queries we wrote here are applied to each and every record flowing through the pipeline in real time. We can visualize this using the Amazon CloudWatch service; CloudWatch has many dashboards, more or less like Kibana, except that you don't have to maintain the servers behind it.
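(Going back to the Athena step for a moment: scripted with Boto3, creating that table over the S3 logs and running an ad-hoc query would look roughly like this. The bucket, database, table name and column types here are assumptions based on the log keys shown in the demo, not the exact ones used.)

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# External table over the JSON log files Firehose delivered to S3.
CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS service_logs (
  service        string,
  response       int,
  created_at     double,
  client_latency int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-log-bucket/logs/'
"""

def run_query(sql):
    """Submit a query to Athena; results land in the given S3 output location."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
    )
    return response["QueryExecutionId"]

run_query(CREATE_TABLE)

# Ad-hoc query, e.g. average client latency per service.
run_query(
    "SELECT service, avg(client_latency) AS avg_latency "
    "FROM service_logs GROUP BY service"
)
```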
You can create many dashboards with different widgets. These are some of our production metrics, CPU utilization for a few of our services, Robin CPU utilization, Batman CPU utilization, and the ingestion part, which does the data crunching.

So that's it: within a few minutes you are able to assemble the components from AWS and architect a logging pipeline, which is what we did. We pushed sample logs through it, ran SQL queries in real time, and ran SQL queries in batch mode as well. The pipeline we set up is horizontally scalable, and since all the server-side work is managed by AWS, your provider, you can push this pipeline to production and it will scale automatically, theoretically infinitely, based on your throughput.

To summarize: with the usual architecture which we designed at first, a certain level of expertise was required, we needed someone to go learn Kafka and its ecosystem, and it took weeks and months to deploy and maintain. With the serverless architecture, as you saw, you are able to do it in minutes, with no maintenance and nearly zero developer intervention after deployment. Part of the architecture which we showed is in production right now, and the last time we visited it was when we deployed it, when it started ingesting the data; after that it scales up and down automatically. Yeah, I am running out of bullet points, so I'll take any questions.

Q: I'm really interested in the cost. You're using Lambda for processing, and every log message is being sent to Kinesis, right? How does the cost behave? If a developer who has just joined unknowingly adds one log statement to, say, a central service like Robin that takes a lot of traffic, my Lambda cost is going to shoot up and I won't even know, because Lambda is billed per invocation. How do you handle that and keep the cost under control?

A: For this service, right now we know what Lambda costs, and you can put a few monitors on your serverless setup, on the logging framework itself, and set cost alerts in your AWS account. No, we didn't face that. Any other question? And yeah, the slides are on my website.

Q: Will there be any latency? Sometimes, say for an online learning portal, the work is completely synchronous; in serverless, how does the latency compare with a regular server?

A: There can be some latency, but we haven't experienced it, because it's a logging framework at the end of the day, and only the first part, where the clients push the logs, has to maintain the SLA. That first part is the Kinesis stream, and so far we haven't faced any issues with it; Kinesis works over HTTP calls, I guess. We did face this issue with SQS, which is another managed queue: when we were pushing logs from our microservices to SQS, that specific line was affecting our SLA, so we put a Redis-like cache in front of it, which solved the problem. So it depends on the use case; there can be latency, but you have to decide which part of the pipeline needs low latency.
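(On the cost question above: one concrete way to get those alerts is a CloudWatch alarm on the account's estimated charges. This is a hedged sketch; the threshold and the SNS topic ARN are placeholders, and billing metrics have to be enabled on the account.)

```python
import boto3

# Billing metrics are published in us-east-1 and require
# "Receive Billing Alerts" to be turned on for the account.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-charges-over-100-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,              # billing metrics update a few times a day
    EvaluationPeriods=1,
    Threshold=100.0,           # placeholder monthly budget in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
)
```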