So, I'm Mark. It's been about a year since I came back to the meetup, I guess. Was it a year? I think more like that. Okay, welcome back.

When Dan asked me to give a talk, I was thinking about talking about what we built for the Viki subscription engine. But after reviewing it, I realized that it's not going to be very useful for people outside of Viki, so I decided to drop it. One topic that kept coming back again and again while I prepared the talk is how we build a reliable system, and I think that's a more general thing that you will enjoy. So I changed it a little bit: there will be things about the subscription engine, but the talk is centered on the robustness of the system. I'm not an expert on a lot of this, so if you have any ideas, feel free to chime in at the end.

So, let's start. The word robust stems from a Latin word that means hard, strong, and solid. The first thing that came to my mind when I read this was Lego. The other thing is the Nokia phone that I have had for 10 years. Those are examples of things that you would consider very robust: they last over time and they function well. I think as engineers, when we build software, most of the time we idealistically aim for a robust system, because then there's less drag for us, fewer support requests to handle, and we feel better about the system and about our work.

But it's not a very easy thing to do. Even if you look at well-known, popular open source projects nowadays, there are times when things don't work right. For example, in 2014 we had Heartbleed. Recently there's this thing called Dirty COW on Linux. Then just a week ago there was this DNS incident, where the DNS system didn't function as expected either, and a lot of things went down.
And then closer to home, look at the systems that you use daily. If you are aware of something called Jepsen, it's a framework to test these systems. They ran a lot of things like MongoDB, Riak, Consul, and etcd through it, and they discovered that these systems, even though they are well known and have a lot of good developers working on them, don't function well all the time. So it's very hard to build robust software.

On the other hand, what you consider robust might be very subjective as well. For example, let's just look at this very simple function. How many of you think this is an example of simple and robust? Okay. Everyone immediately recognizes that this function doesn't work when you give it an input that will overflow the result. If you compare this with the other version that I have below, you will see that the second looks more reliable. What happens in that case is that we define a very clear input and input boundary, we define a very clear output and what should happen, and the function works as expected. This shows that you can build a robust function, but it's not as straightforward as you might think.

Most of the time, when we look at software, I would argue that what we tend to have in mind is closer to the first form, which is very close to the mathematical definition of the function. Actually, you can get the same effect by using int64 or a larger integer type. If you work with a language for a while, you learn its quirks and you learn to live with them, but conceptually we don't recognize the problem right away. The second function's form is very close to a mathematical function, where you have a defined set of inputs and a defined set of outputs, and it works all the time. None of this is new: if you have been programming in, say, OCaml for a while, this is what you tend to meet.
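The two versions of the function were shown on slides and do not appear in the transcript. As a rough sketch of the contrast being described, assuming the function was something like integer addition, the first form below silently wraps around on overflow, while the second states its input boundary and reports when the result does not fit. The names `Add`, `SafeAdd`, and `ErrOverflow` are hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// ErrOverflow is returned when the sum does not fit in an int32.
// (Hypothetical name, not from the talk.)
var ErrOverflow = errors.New("sum overflows int32")

// Add is the "first form": it looks like mathematical addition,
// but it silently wraps around when the result overflows.
func Add(a, b int32) int32 {
	return a + b
}

// SafeAdd is the "second form": the valid input range is explicit,
// and anything outside it produces a defined error instead of a wrong value.
func SafeAdd(a, b int32) (int32, error) {
	sum := int64(a) + int64(b) // cannot overflow: both operands fit in 32 bits
	if sum > math.MaxInt32 || sum < math.MinInt32 {
		return 0, ErrOverflow
	}
	return int32(sum), nil
}

func main() {
	fmt.Println(Add(math.MaxInt32, 1)) // wraps around to a negative number

	if _, err := SafeAdd(math.MaxInt32, 1); err != nil {
		fmt.Println("rejected:", err)
	}

	sum, _ := SafeAdd(40, 2)
	fmt.Println(sum)
}
```

The second form costs a few more lines, but every input now maps to a defined output, which is the property the rest of the talk builds on.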
Still, most of us don't get the chance to work with a functional language all the time, and there are practical reasons for that. One is [unclear], for example. Still, we can take advantage of this kind of functional thinking in our design, in the design of the system as a whole, so that we write better software. That's what I'll try to talk about today, through the example of a piece of the Viki subscription system that I worked on recently.

So, the background story is this. We have a subscription engine in Viki. It allows people to buy a subscription on the web or via in-app purchase on iOS and Android, and enjoy it elsewhere in the system. People just need to buy once, and it should propagate everywhere else. The previous system didn't do so well, in the sense that sometimes you buy something, you get billed, you get charged, but the subscription doesn't show up reliably. Not a very good experience for the user. And since we have more people paying us recently, it became an urgent concern and we started a rewrite.

What we have is fairly simple. Depending on the bundle that you buy on Viki, your subscription ends up in one of four states. The first time, you get into a trial. Then you pay, you renew, and you get into a paying state, and after that you keep renewing to remain a subscriber. If at any point in time you fail to pay, you enter something we call a dunning state. Dunning is the industry word for all this: we try to charge you X number of times, and if we can't complete the charge, your subscription ends. So this is a very simple model, and this is the general subscription model in Viki.

For Google: what I was working on at that point is the Google piece of the subscription system, that is, how people buy an Android subscription and then enjoy it across Viki. A Google subscription works more or less the same way, with just a few constraints.
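The four-state model described above can be sketched as a small transition table. This is only an illustration under assumptions: the state names (`Trial`, `Paying`, `Dunning`, `Ended`), the event names, and the exact set of allowed transitions are inferred from the description, not taken from the actual Viki code.

```go
package main

import "fmt"

// State is one of the four subscription states described in the talk.
type State int

const (
	Trial State = iota
	Paying
	Dunning
	Ended
)

// Event is a hypothetical billing event that moves a subscription between states.
type Event int

const (
	PaymentSucceeded Event = iota
	PaymentFailed
	RetriesExhausted
)

// step is the key of the transition table: a state plus the event seen in it.
type step struct {
	from State
	on   Event
}

// transitions lists every allowed move; anything absent is an error.
var transitions = map[step]State{
	{Trial, PaymentSucceeded}:   Paying,
	{Trial, PaymentFailed}:      Dunning,
	{Paying, PaymentSucceeded}:  Paying, // a normal renewal
	{Paying, PaymentFailed}:     Dunning,
	{Dunning, PaymentSucceeded}: Paying,
	{Dunning, PaymentFailed}:    Dunning, // another retry
	{Dunning, RetriesExhausted}: Ended,
}

// Next returns the state after the event, or an error for an undefined move.
func Next(s State, e Event) (State, error) {
	next, ok := transitions[step{s, e}]
	if !ok {
		return s, fmt.Errorf("no transition from state %d on event %d", s, e)
	}
	return next, nil
}

func main() {
	s := Trial
	events := []Event{PaymentSucceeded, PaymentFailed, PaymentFailed, RetriesExhausted}
	for _, e := range events {
		next, err := Next(s, e)
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("state %d -> state %d\n", s, next)
		s = next
	}
}
```

Keeping the transitions in a table means every (state, event) pair is either listed or rejected, so there is no silent default behavior.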
For Google, we are allowed to call an API which returns something like what you see on the right side. There is a start time, an expiry time for the subscription, and a flag called auto-renewing. But what we need to capture is the state of the subscription, which is not available here. The reason is more or less business: we need to treat users in different states differently in Viki. So we have to take the expiry time and infer what state the user is in based on that and on historical data.

So this is more or less the context, and the flow looks something like this diagram. People buy an Android subscription. The Google engine gives them an identifier called a purchase token. The Android client sends that, together with a session token, to our API, at which point we link that subscription to the user. From then on, Viki is responsible for tracking and refreshing the subscription status of that user, and any other client can just call the Viki API to get that information. It's pretty standard, nothing special.

But the question when we had to redesign the system was: how do we make sure this flow works reliably in the face of network partitions, failures from our database, sometimes even failures from Google, and so on? To do that we ran through a very long design verification phase and came up with a big list of questions, a few of which I will list here. We divided them basically into groups. One group is: how do we classify the raw input from the Google API? The second group is: how does our system run in the face of a network partition, or when a dependency doesn't respond? The next one is that we need a historical audit trail so that we can trace back whatever happened in the system, because when money is involved you need to argue with people about why they got charged and so on.
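The response shape mentioned at the start of this part (a start time, an expiry time, and an auto-renewing flag, with no state field) can be sketched as a Go struct. The JSON field names and the millisecond-in-a-string encoding below are assumptions based on how the Google Play Developer API has commonly represented subscription purchases; they are not confirmed by the talk. The sample payload values and the `ParsePurchase` helper are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Purchase holds the three fields the talk mentions.
// Field names and the quoted-integer encoding are assumptions.
type Purchase struct {
	StartTimeMillis  int64 `json:"startTimeMillis,string"`
	ExpiryTimeMillis int64 `json:"expiryTimeMillis,string"`
	AutoRenewing     bool  `json:"autoRenewing"`
}

// ParsePurchase decodes one API response body. (Hypothetical helper.)
func ParsePurchase(body []byte) (Purchase, error) {
	var p Purchase
	err := json.Unmarshal(body, &p)
	return p, err
}

// Expiry converts the millisecond timestamp into a time.Time in UTC.
func (p Purchase) Expiry() time.Time {
	return time.Unix(0, p.ExpiryTimeMillis*int64(time.Millisecond)).UTC()
}

func main() {
	// Made-up sample payload; note there is no "state" field to read.
	body := []byte(`{"startTimeMillis":"1474139803000","expiryTimeMillis":"1476731803000","autoRenewing":true}`)

	p, err := ParsePurchase(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("expires:", p.Expiry(), "auto-renewing:", p.AutoRenewing)
}
```

Because the payload carries no subscription state, anything like trial, paying, or dunning has to be inferred from the expiry time and from history, which is the problem the following sections work through.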
And then of course there's no perfect software, so sometimes we introduce bugs, and we need a way to go back and fix the resulting data, not manually but automatically and reliably. And the last one is: when we have to scale and deal with multiple machines, what do we do? I think, depending on how reliable your system needs to be, you will have to answer something akin to these at some point.

So let's take on the first question first: what are the possible inputs that we get from Google? Conceptually it seems very simple. Google just returns an expiry time, and we just need to check that. So we wrote a simple first cut of the code: check the expiry time, subtract the last expiry time from it, and compare. Obviously we made some design flaws there, but what we found more interesting is that the expiry time that Google provides is not very deterministic in any fashion. That's interesting, because I expected Google to be precise. But we have data to show it, and we had to design around it.
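To make the flaw in a first-cut comparison concrete, here is a minimal sketch. It is not the code from the talk: it assumes a fixed 30-day billing period and an exact-match check, both hypothetical, and shows how a small drift in the reported expiry time defeats the check.

```go
package main

import "fmt"

const (
	hourMs   int64 = 60 * 60 * 1000
	periodMs int64 = 30 * 24 * hourMs // assumed fixed 30-day billing period
)

// NaiveRenewed reports a renewal only when the expiry moved forward
// by exactly one billing period. (Hypothetical first-cut logic.)
func NaiveRenewed(prevExpiryMs, newExpiryMs int64) bool {
	return newExpiryMs-prevExpiryMs == periodMs
}

func main() {
	prev := int64(1476700000000) // made-up previous expiry, in milliseconds

	fmt.Println(NaiveRenewed(prev, prev+periodMs))       // true: exactly one period
	fmt.Println(NaiveRenewed(prev, prev+periodMs+90000)) // false: 90 seconds of drift
}
```

An exact comparison encodes an assumption about the input that the real data does not honor, which is why the input space has to be measured rather than guessed.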
For example, here is data from one user that we have. He started a subscription that lasts until the 17th of October, down to a specific time of day. Next he went into what we call the dunning state, where he failed to pay: Google extends his subscription by one day while it retries, and as you can see the time is not exact anymore, it varies a little bit. The next day the time increases again, and it increases again a third time. Then Google was finally able to charge the user and bring him to the correct state, but then it deducted some time from his subscription period. So not very...

What did Google say? We contacted them, but it takes them a long time, so we just had to go ahead and work out the differences ourselves. This is a problem, because without understanding what Google can give us, we can't write very robust code. So while we were waiting for Google, I was just doing trial and error, and I was finally able to get a more or less working version of the code and put it in production, because of time constraints.

A few days later I was still worrying about the system when I watched a talk called "Life Beyond the Illusion of Present". It's a very good distributed systems talk that you might be interested in watching. In that talk, the speaker argues that when you build a distributed system, you should consider time as just another input to your system, and not as something that is constantly flowing. Somehow a few things clicked for me. I flipped this around and said: what I have been doing all along with trial and error is just figuring out the input space from Google. Because we had had the system running for a while, I pulled the data out into a statistics tool and plotted a histogram of how far Google extends the subscription expiry date.

The result was interesting: we were able to group the Google data into four distinct groups. One is an extension from 0 to 88, which seems to represent dunning. Then we see that it extends people who are on a monthly subscription by anywhere
between 26 and 38 days, and if you are on a yearly subscription it extends you by 357 to 365 days. And then there is a bad case where Google gives you an expiry date that is earlier than the one you previously had. Not very good, but with this we were able to go back into the code and do a very nice case analysis, and because the input divides nicely into cases, the code we wrote was much simpler. This is an example of the bad case I mentioned: this user has an increasing expiry time until he stops renewing, and in that case the data on Google's side somehow becomes corrupted. This happened for both renewing and non-renewing subscriptions.

This exercise gave me a thought. It seems that I was able to make a much better design by treating my system as a function and trying to map out its input space. So what if it's not just the input from Google? What if we take all the things in our system that we can think of as input and map out the complete space? Then we can build a very rigorous system. So let me do some visualization. I imagine the tracking system I'm building as a function with a one-dimensional input coming from the Google API, and I have mapped out that space. So that works.

Then I came back to the design questions we had and tackled the second group: what happens when there are dependencies that I use and I am not able to contact them, or some variation of that? Obviously when we write code we already do a lot of error handling for this, and Go is a pretty good language here because it forces you to do it consciously, so I guess everyone is familiar with that. What I realized is that what this whole error handling exercise ultimately does is, again, map out an input space, where the input is the health of your dependencies. When you write the code, you try to define the output, which is either a happy path where everything works or some failure path where something doesn't function well. And if you can clearly define what your system
does in all these cases, again you get another level of robustness.

Then, coming back to the design questions, what else can we do? We were looking at how to do the audit trail and generate analytics. We are already logging things as we go, and for us that looks like a side effect of running the function. If you look at it this way, I guess: you have input, you have output, you have side effects, which is what it actually is, just very big and complicated.

So we have the audit log defined as a side effect. Next: how can we go back and fix historical errors? This is an interesting one, because it happens sometimes, and if you are not careful you end up writing directly to the DB to override your mistake, for example, which is not good. I did have a bug, in the analytics, not in the subscription status, but I still needed to go and correct it. Luckily, when we wrote the system we were already sampling the input and putting it into a database. I primarily did this so that when customer service people come and ask me why, I can just show them and say this is a Google problem. But it turned out to help here. I started to design some small tools that take these samples and run them through my system again, basically replaying them, and generate a different set of outcome data.

And here I realized something. If I take the sample data together with the point in time at which it was sampled, and think of that as the input, and the audit log, the analytics data, and the output of my system as the other side, then when you aggregate all of that, it equals a version of history. If your code has a bug, it generates one version of history. If you fix it and replay the historical events, you get another version of history. Then to correct the result, all you need to do is merge the histories, and visually this is very similar to how we resolve conflicts with a git merge. If you think about it this way, what you can do is build automatic
tools to resolve problems in your system without too much human intervention. For us, because I'm just dumping analytics data, what I did was simply overwrite what I had written earlier. But imagine you are running a bank, and you end up with two versions of history: one in which you charged the user, and another in which you didn't. You have to merge these together, and when you merge, the action you need to take is to refund the charge that you made wrongly. Because all of this is deterministic, you can write a tool that does it automatically, and you can be mathematically sure of the result.

Then there's the last part, which is very common in distributed systems: what do you do when a machine goes down, and do you have redundancy to take over? It's a well-explored problem in distributed systems, and the advice generally given is that you design with network partitions in mind and with machine failures in mind; if you design your system for that, it works in all the other cases.

With that I kind of completed the model I have for the system. My inputs are a few things: the Google API, the dependencies, the time at which I do things in the system, and the network partition conditions that can occur. And I have a clear outcome that I want for each. After this was done, we went back to the code, and it became much clearer what we needed to write to achieve the desired behavior.

Of course, this is a very theoretical model. In reality it's not always clear how you can get a good partitioning of the input space, especially if you are building a system for the first time. But for certain things it is: you know that the network can be either up or down, that's always there, so you always design for both cases. And if you have any other type of input, you can try to do data collection and some statistical analysis. That gives you some insight, and it will give you an easier time designing the system.
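As a rough sketch of the "system as a function over a mapped input space" idea, the code below combines two of the inputs from the talk: dependency health and the size of the expiry extension reported by Google. The bucket boundaries (0 to 88, 26 to 38 days, 357 to 365 days, and a negative extension as the bad case) come from the talk. Everything else is an assumption made for illustration: the talk does not state the unit of the 0 to 88 bucket (hours are assumed here), and the names `Input`, `Outcome`, and `Track`, as well as the "keep last known state" failure outcome, are hypothetical.

```go
package main

import "fmt"

// Outcome is what the tracker decides for one observation.
type Outcome string

const (
	KeepLastKnown  Outcome = "keep last known state" // failure path: a dependency is down
	Dunning        Outcome = "dunning"
	MonthlyRenewal Outcome = "monthly renewal"
	YearlyRenewal  Outcome = "yearly renewal"
	BadData        Outcome = "expiry moved backwards"
	Unclassified   Outcome = "unclassified"
)

// Input bundles the things the decision depends on.
type Input struct {
	GoogleOK       bool  // did the Google API call succeed?
	DatabaseOK     bool  // is our own storage reachable?
	ExtensionHours int64 // new expiry minus previous expiry (unit assumed)
}

// Track is a pure function: the same Input always gives the same Outcome.
func Track(in Input) Outcome {
	if !in.GoogleOK || !in.DatabaseOK {
		return KeepLastKnown
	}
	const day = 24 // hours per day
	h := in.ExtensionHours
	switch {
	case h < 0:
		return BadData
	case h <= 88:
		return Dunning
	case h >= 26*day && h <= 38*day:
		return MonthlyRenewal
	case h >= 357*day && h <= 365*day:
		return YearlyRenewal
	default:
		return Unclassified // outside every observed bucket: flag for review
	}
}

func main() {
	fmt.Println(Track(Input{GoogleOK: true, DatabaseOK: true, ExtensionHours: 24}))
	fmt.Println(Track(Input{GoogleOK: true, DatabaseOK: true, ExtensionHours: 30 * 24}))
	fmt.Println(Track(Input{GoogleOK: false, DatabaseOK: true, ExtensionHours: 30 * 24}))
}
```

The explicit `Unclassified` branch is a design choice in this sketch: inputs that fall outside the measured buckets are surfaced instead of being forced into the nearest case.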
If you do it for a while, you're going to tie the system down to f(x, t). I mean, it's basically a function, f(x, t).

So it leaves me with some questions. Of course, in reality you can't get to 100% with this analysis. But still, given that a lot of this is common, like network partitions and so on, is there a way for us to mechanically generate all these cases, so that as programmers we can look at the scenarios beforehand and structure the program accordingly? This might be especially important when you design a distributed system. My realization is that most of the time when we design, we take the problem, sit in some corner, and somehow magically come out with a model. A lot of it depends on your own experience: more experienced people have more domain knowledge and know the input space better. But if there were a tool, you could be much more efficient at designing.

Another thing: when you look at this as a model function, it deals with a lot of known knowns. So how do you prepare the system for unknown unknowns? And the last thing: again, this is a model, and you have to transform it into code. How do you know your code actually implements the model?

So these are the three questions that are still kind of hanging. Maybe if you have built a similar system and have some experience, you can tell everyone else. With that I would like to conclude the talk.

Wow, [unclear], very great. What do you think, like, a scenario [unclear]?

I mean, what I did... Basically I was lucky, because in the initial design we decided to capture every single input, and we leave time out. In some systems you would use something like time.Now in your code to check the current time, and your code behaves differently depending on some value compared to time.Now. In our design we leave that out too: we take a time that is external and use it as input. So we have a very complete input to the system. So what I did was basically spin
up another instance and pipe all this input into it again, and it generates a different set of output. My output basically replaces the production data. It's event-based: what I did was basically transform it into an event-based system, and because events can be stored and replayed, you get all the benefits. I guess that's the model.
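The replay-and-merge idea from the talk can be sketched as follows. Each recorded sample carries the time at which it was taken, so the decision logic never reads the clock itself and can be re-run later with identical results. The types, function names, and the specific bug below are all hypothetical, chosen only to show two versions of history being compared.

```go
package main

import "fmt"

// Sample is one recorded input: what Google reported and when we asked.
type Sample struct {
	SampledAtUnix int64 // clock reading captured at sampling time, passed in as data
	ExpiryUnix    int64 // expiry time reported by Google
}

// Decide is the shape of the tracking logic: a pure function of a sample.
type Decide func(s Sample) string

// buggyDecide treats a subscription as expired one hour too early (made-up bug).
func buggyDecide(s Sample) string {
	if s.SampledAtUnix >= s.ExpiryUnix-3600 {
		return "expired"
	}
	return "active"
}

// fixedDecide is the corrected logic.
func fixedDecide(s Sample) string {
	if s.SampledAtUnix >= s.ExpiryUnix {
		return "expired"
	}
	return "active"
}

// Replay runs the logic over the recorded samples and returns one version of history.
func Replay(samples []Sample, decide Decide) []string {
	history := make([]string, len(samples))
	for i, s := range samples {
		history[i] = decide(s)
	}
	return history
}

// Correction describes one entry where the two histories disagree.
type Correction struct {
	Index    int
	Was      string
	ShouldBe string
}

// Merge compares the old and the replayed history and lists what must be corrected.
func Merge(old, fixed []string) []Correction {
	var out []Correction
	for i := range old {
		if i < len(fixed) && old[i] != fixed[i] {
			out = append(out, Correction{Index: i, Was: old[i], ShouldBe: fixed[i]})
		}
	}
	return out
}

func main() {
	samples := []Sample{
		{SampledAtUnix: 1000, ExpiryUnix: 10000},
		{SampledAtUnix: 7000, ExpiryUnix: 10000},
		{SampledAtUnix: 11000, ExpiryUnix: 10000},
	}
	old := Replay(samples, buggyDecide)
	fixed := Replay(samples, fixedDecide)
	for _, c := range Merge(old, fixed) {
		fmt.Printf("sample %d: was %q, should be %q\n", c.Index, c.Was, c.ShouldBe)
	}
}
```

In the analytics case from the talk, applying a correction is an overwrite. In the bank example, each correction would instead map to a compensating action such as a refund.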