Hi, good evening. I'm here to present a case study on my experience using Erlang. About me: I'm Chaitanya. I actually kick-started my career in Erlang. On the first day of my first job, I was given a book on Erlang to go through and start learning. I fell in love with the language so much that for 14 years I've been doing only Erlang programming, and I'm still an Erlang programmer. I'm a part-time architect and a full-time developer. I work for NetStratum, a start-up focused on VoIP technologies and cloud-based solutions, based in Kochi, Kerala. I only have 20 minutes, so I'll rush through this. Here is the plan: first establish the context, then the challenges we faced, how we approached those challenges, and then the solution.

How many of you have worked on, know, or have developed solutions for call centers? Let me get to the context. Whenever we call a call center, we get a series of prompts played at us. A recording plays: "Dear customer, your call is valuable to us. Please hold." In the backend, the service is going to connect you to one of a number of agents, who are attached to different queues. What are queues? Typically there's a prompt playing: press one for sales and offers, press two for technical support, press three for billing and payments. Each of those is a queue. Not every agent can serve every queue: some can only help you with billing issues, some with sales, and some with payments. The agents are connected to the queues, and the call center service diverts your call by picking an agent and ringing them.

In terms of terminology, there are agents, there are queues, and calls are given to the queues. The first term is the call offer: when a customer calls the service, it tries to offer the call to an agent. The agent then tries to answer the call; that's the call answer. The agent can also choose to reject the call. Maybe you never knew about call rejection, because the service simply offers the call to a second agent; when an agent doesn't want to answer a call, he can reject it. The third case is a little tricky: the agent just doesn't answer. His phone rings, rings, rings, and rings while he's off hanging out with his friends. The call center service has a timeout, after which it considers that call a no-answer bounce. Let's keep these terms in mind; we'll need them for the rest of the talk.

How many of you have ever worked on a datamart or a reporting backend? Typically a datamart works this way. You have completed events, or CDRs (call detail records), things that have already happened. A batch process pulls those in, denormalizes them, and creates summaries based on the reports you need. These summaries are placed in a denormalized table structure, and the portal just comes in, pulls out those reports, and shows them. Now, this batch process gives no guarantee about the recency of a report: the delay, or how fresh a report is, is tied to the timing of the batch and to any delay in the batch process that pulls the data out. It all depends on when you ran the batch and when you pulled the data.
And most of the time, in call centers or in any telecom report generation, it's acceptable to have a lag of one day. That means the recency factor can be up to a day: they can tell you your usage pattern, your calls, and so on only a day later, not sooner. They're not obliged to do better, though they might; as I've tagged on the slide, recency is at best four hours, with certain tolerable errors. The good thing about datamart reporting is that a certain tolerable error is acceptable. It need not say that 23.65% of the calls are dropping; it can simply say 23%. Approximation is acceptable.

Now, the problem we were facing was performance. Crunching one single day's data took about 30 minutes, and we had a lot of issues with scalability. The system was designed for about 100 tenants and about 100k calls per day. Manageability was really bad, and we were spending a lot of operations effort just fixing issues with the summary data. So the client wanted a replacement, and their expectations were a little unrealistic: instead of a day, they wanted 15 minutes of recency. That means running that batch process every 15 minutes, and remember, it doesn't just process those 15 minutes of data; it has to run across the whole day to produce the summaries. So that was never going to work. This went off on a different tangent: they approached a lot of vendors for alternative solutions and couldn't get one. I don't know what happened. Finally they came back saying, okay, you have this much budget, try doing something with it.

So we came up with an approach: why not take the events from the call processing engine in real time, and from those call processing events create a sort of semi-cooked data, from which the reports can then be built? It's like all our Indian restaurants: you look at a menu and see 300 items, but they all come from four or five base preparations. It's the same idea: semi-cooked data from which we can pull out all the reports for the portal.

Now, how do we do that? We made a first iteration of it. Whenever the call processing engine wrote the CDRs, it also pushed the events to this engine. We had this semi-cooked data; I'll show you how it can be designed. The semi-cooked data you see here — call arrived, call answered, the talk time, and so on — was pushed to the engine, and the engine acted like a counter, updating them in that slot. What you see there, one part here and one part there, is a slot, and these are counters: one type of semi-cooked data, called counters. And these are the durations. The engine took the events and updated the counters, and it had timers running to update the talk time. This solution was good; it worked very well. Pulling from the semi-cooked data was really fast, and there was no lag at all. But the problem was running the timers.
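Just to make that first iteration concrete, here's a minimal sketch of the counter-style update, assuming an ETS table and 15-minute slots. The table name, event shape, and key layout are my own illustration, not our actual code:

```erlang
%% Iteration 1 (sketch): the engine acts as a counter, bumping the
%% current slot whenever an event arrives.
-module(slot_counters).
-export([init/0, handle_event/3]).

-define(SLOT_SIZE, 900).  %% 15-minute slots, in seconds

init() ->
    ets:new(slot_counters, [named_table, public, set]).

%% Bump the counter for this entity/slot/event combination, e.g.
%% handle_event(QueueId, acd_calls_arrived, erlang:system_time(second)).
handle_event(EntityId, EventType, TimestampSecs) ->
    SlotNo = TimestampSecs div ?SLOT_SIZE,
    Key = {EntityId, SlotNo, EventType},
    %% update_counter/4 with a default makes the bump one atomic op.
    ets:update_counter(slot_counters, Key, {2, 1}, {Key, 0}).
```

Counters like this are cheap; it was the durations, such as talk time, that forced a running timer per call — and that's what broke this design.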
Anybody who has experience writing massive Erlang code, or very high-performance systems, understands that you cannot have that many timers running. Think about it: there are about 100, or 500, calls going on simultaneously, and for each call you need a timer running on top of the call timers themselves. That's a massive performance cost. There is a timer module that does this, but if you read the Efficiency Guide, it strongly discourages using it under load. There are other tricks, where you call erlang:send_after and then, once you receive the message, invoke send_after again. But either way, running the timers wasn't efficient. Our design philosophy became: we should not run timers.

And we had another problem. The call processing engine just pushes the event and forgets about it, while this engine sits there running timers, updating the slot every 15 minutes (or whatever the slot time is; let's say the slot time is 15 minutes) to record the talk time used in those 15 minutes. Now suppose, for some reason, the end-call event doesn't come — it's all prone to errors. The end-call event doesn't come, and the timers are still lying there, updating the slot forever. That's also not a good idea.

So we came up with the second iteration. Instead of doing the aggregation, the duration calculation, and the timers on the pre side of the semi-cooked data, we do it on the post side of the session. What we do here is take the events and mark them onto the slots. When an event comes at, say, 15 minutes 10 seconds, we mark it in the 15-minute slot at 10 seconds, note that this event came here, and forget about it. That's how we were able to consume a lot of data: we do no processing to create pre-cooked data, we just consume events and mark them onto the timescale. On the post side, whenever a report needs to be generated, the scanner starts at the first slot and runs down, keeping state: okay, a call started here — put it in the state, keep going; the call ended here — what's the duration? Pack it up; that's the duration for this state. Then it flows down: there's an event here, a call came in, great, put it in; further down the call ended — okay, what's the duration? Write it down. So the whole aggregation over the semi-cooked data is now done when the report is asked for. Okay?

Let's look at these event maps; then we can get to the design, which you'll understand more clearly. On the left-hand side you can see the three events that typically come for a successful call scenario: there's an incoming call, the agent answered the call, and then either the agent hung up or the end user hung up. For each of these events, when it arrives in the system, you can see the mapping of all the semi-cooked data it needs to push into the data pot. You can see the marks. The blue color is the incoming call — it's not very clear here, but this is the incoming call, this is the agent answered, this is the agent hung up or the end user hung up. The blue one says: I need to update the ACD calls arrived, and I need to update the call wait time.
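For reference, the send_after trick mentioned above usually has this shape — a minimal sketch, with module and message names of my own choosing:

```erlang
%% Sketch of the self-rescheduling timer pattern: instead of the timer
%% module, re-arm erlang:send_after on every tick.
-module(tick_loop).
-export([start/1]).

start(IntervalMs) ->
    spawn(fun() ->
        erlang:send_after(IntervalMs, self(), tick),
        loop(IntervalMs)
    end).

loop(IntervalMs) ->
    receive
        tick ->
            %% ... update the slot's talk time here ...
            erlang:send_after(IntervalMs, self(), tick),
            loop(IntervalMs);
        stop ->
            ok
    end.
```

Even written this way, it's still one ticking process per live call, which is exactly the overhead our no-timers philosophy ruled out.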
Now that the call has arrived, we have a call wait time until an agent answers. There's an event that says a queue call arrived, and the peak-waiting counter is incremented by one. All this semi-cooked data is being pushed into the system. Usually a product person doing business analysis would come up with this mapping from events to semi-cooked data.

Okay, let's dig into the design and discuss it. The first module is where the event lands — say, an incoming call. From the incoming call event, we need to fan out and put half a dozen pieces of semi-cooked data into the data pot. This module is called the event interpolator. The event interpolator is a callback module implementing a behaviour provided by the data pot, in which you write down which events and event types you want pushed into the data pot. The structure of the event is on the slide; I think you can see the slides on the site and download them from there. The event interpolator, as shown on the previous slide, maps one event to multiple semi-cooked events. It takes the event, and it does have state: it keeps a call state, and based on the call state and the type of the event landing at it, it pushes its own semi-cooked data in.

Now, this is the design of the data pot. The data pot has slots, and when an event is pushed in, the difference between the start of the slot and the time the event came in is what we call a mark. These two pieces of data are put into one record: it's basically a Mnesia table, and the record it writes has the slot ID, the UUID of the entity (the entity could be a queue, an agent, or the ACD), and a binary-packed format of which event came in and at what mark.

It's very important that we segment, or fragment, this data pot, which could be running for several hours or several days depending on what traffic you expect; you can have many thousands or millions of events landing in it. So it's very important to fragment it, and each fragment, in my case, is just a Mnesia table. The naming is computed in constant time: the slot number comes from the timestamp div the slot size, and the fragment number comes from the timestamp div the fragment size. That's how we make the fragments. Each fragment is a Mnesia table, each slot is a logical entity, and each entity within a slot has one row in Mnesia.

When we scan, we start at the beginning of a fragment and scan to the end of it — not literally the end, but up to the point you need the report for. While scanning, exactly as I said before, it walks the events from the start of the fragment: if it's a counter, it updates the counter and moves on; if it's the start of a timer, it notes the start; if it's the close of a timer, it calculates the duration and updates the duration. It just flows through.
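Here's a minimal sketch of that slot/fragment arithmetic and the dirty write, assuming 15-minute slots and one-day fragments. The record layout and binary packing are invented for illustration, and each fragment table is assumed to be created elsewhere with record_name set to pot_row:

```erlang
-module(data_pot).
-export([push/3]).

-define(SLOT_SIZE, 900).     %% 15 minutes, in seconds
-define(FRAG_SIZE, 86400).   %% one day per fragment (one Mnesia table)

%% Illustrative row: one record per {SlotNo, EntityUuid}, with events
%% appended as binary-packed {EventCode, Mark} pairs.
-record(pot_row, {key, events = <<>>}).

push(EntityUuid, EventCode, TimestampSecs) ->
    SlotNo = TimestampSecs div ?SLOT_SIZE,
    FragNo = TimestampSecs div ?FRAG_SIZE,
    Mark   = TimestampSecs rem ?SLOT_SIZE,    %% offset from slot start
    Frag   = frag_name(FragNo),
    Key    = {SlotNo, EntityUuid},
    Packed = <<EventCode:16, Mark:16>>,       %% toy binary packing
    Row = case mnesia:dirty_read(Frag, Key) of
              [#pot_row{events = Old} = R] ->
                  R#pot_row{events = <<Old/binary, Packed/binary>>};
              [] ->
                  #pot_row{key = Key, events = Packed}
          end,
    %% No locks, no transactions: dirty_write keeps the hot path cheap.
    mnesia:dirty_write(Frag, Row).

frag_name(FragNo) ->
    list_to_atom("data_pot_frag_" ++ integer_to_list(FragNo)).
```

Because the hot path is append-only dirty writes keyed by {SlotNo, EntityUuid}, consumers don't contend with each other, which is what makes the scale-out story later on work.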
Now, these scanned events, from the start of the fragment to wherever you stop, are then packed up, or aggregated, based on the aggregation interval asked for by the report. Our slot could be 15 minutes, but the report might ask for an aggregation interval of one hour, meaning it needs rows of one-hour duration. In that case an aggregation on top of the scan is done: four 15-minute slots are combined into one hour, whatever is in them is aggregated, and the result is pushed to the report.

Then there's a report handler module, which is where the API is exported. It takes the request for the report that needs to be produced, asks the data pot to do the scanning, and gets back the aggregation. Before it gets the aggregation, the data runs through a time zone shift. This system is designed for multiple tenants — tenants that can be spread across the globe — so it must work for multiple time zones. A time zone shift is applied while the data is being presented to the user; that's the filter it runs through. Any questions?

Now, the report mapping. On the left-hand side you see the semi-cooked data that went into the data pot; on the right-hand side is what you need in the reports. For each thing you need in a report, you match one or more events in the semi-cooked data, pull out those events, and map them in; that's how the report gets generated.

Let's see the demonstration. This is the typical agent performance report. You can see there are about 10 agents; let's take a week's time and set the aggregation interval to 15 minutes. These agents are connected to several queues — any agent can be associated with multiple queues — so it's up to you to choose which queue to run it for. You run it, and the whole data comes up. The analysis of this data is done entirely on the fly when you ask for the report. In fact, look at the number of samples: about 10 agents with about 100 time samples per day over a week's time — around 5,000 samples collected over the week, with all this analysis provided on the fly. And this report is available to the viewer in real time, while the calls are happening right now.

And then there's a queue report. Is there anyone who wants to do a lightning talk after this one? We can take the time if there's no one. So this is a queue report. It shows that the day starts roughly around 8 and ends around 7. This again came back in about a couple of seconds, in which it crunched the whole week's data across all those queues and let us zoom in and see what's happening in each of them. Any questions?

Now let's go to scalability. Since the data pot doesn't need any locking, no transactions, nothing of that sort, the call processing engine pushes all these events to the data pot engine, and the data pot engine pushes them into Mnesia as dirty operations. Any portal or server scaled horizontally alongside it — you can have as many as you want — and since the data pot can read from Mnesia in constant time, there's no real restriction on scalability.
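The scan-then-aggregate step can be pictured like this — a minimal sketch that folds per-slot counts from a fragment scan into the report's interval, where the slot-result shape is my own simplification:

```erlang
-module(report_agg).
-export([aggregate/2]).

%% Fold per-slot scan results into coarser report buckets.
%% SlotResults :: [{SlotNo, Count}], with 15-minute slots;
%% IntervalSlots = 4 turns 15-minute slots into one-hour rows.
aggregate(SlotResults, IntervalSlots) ->
    Buckets = lists:foldl(
        fun({SlotNo, Count}, Acc) ->
            Bucket = SlotNo div IntervalSlots,
            maps:update_with(Bucket, fun(C) -> C + Count end, Count, Acc)
        end,
        #{}, SlotResults),
    lists:keysort(1, maps:to_list(Buckets)).
```

For example, report_agg:aggregate([{0,3}, {1,5}, {4,2}], 4) gives [{0,8}, {1,2}]: slots 0 to 3 collapse into the first one-hour row, slots 4 to 7 into the second. The time zone shift is applied before presentation, which effectively changes which slots fall into which local-time bucket.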
It just scales out. Now, what's the outcome? We spent about two months building this and getting it rolled out. What did we change in this system? The earlier one was designed for about 100k calls per day; this one can do up to 50 million calls a day. The previous one, at best, could serve data that was 4 hours old; this one is at most a minute old — really, as the call is happening, it marks. The granularity of the earlier one was 15 or 30 minutes; this one is about a minute. As for performance, it can consume about 500 calls a second, and it can publish a week's data in 1 to 5 seconds. The previous one had limited vertical scalability: it ran as a stored procedure in MySQL, always on one single core; it could never use multiple cores. This one scales out well. Manageability used to be tedious; now we have a lot of instrumentation and APIs to manage any errors or misinterpretation of the reports and the like. Post-mortems are very important, too, when things go really bad on the calls side: in the previous system we had to re-run the whole base data, which took 30 minutes; in this one the fix is quick and isolated, done just by replaying that event back into the system. And time zones: previously we had to rely on the underlying DB — it depended on whether MySQL had time zones implemented in its database. Here, the time zone shift is applied at presentation.

And the roadmap. Right now the slot and fragment sizes are static, fixed at design time. We want to evolve the system so the slot and fragment sizes can be adjusted as traffic changes, and to support multiple slot and fragment sizes. Right now it only supports Mnesia, because that's the customer's current need; we have plans to include support for other stores — Couchbase and so on. And for the event maps and report mappings I showed, we're planning a web-based studio so you can create those event maps yourself on the web and apply them to the data pot.

So what did we use in this? Erlang. Cowboy for the web API. Jiffy to encode the JSON objects. That's all, nothing else. And this will be open source — we plan to publish it as 2.0. Watch this space. Thank you.