I'm Nivedita, and I'm Srihari, and today we're going to talk about our experience building an experimentation platform in Clojure. We're both from Nilenso, a software cooperative based in Bangalore. Over the last year and a half we've worked with a subsidiary of Staples called Staples SparX, which uses machine learning, predictive modeling, and other systems to build niche products for Staples. What we've built with them is a multivariate testing platform that serves all of Staples' experimentation needs in one box. Here are some orders of magnitude of data and sessions, to give you perspective for the rest of the talk. We have a very strict SLA, a 99.9th percentile measured in milliseconds, because we sit synchronously in every request that goes to Staples.com. Here, roughly, is what you'll take away from this talk. You'll learn about various ways you can set up your experiments. You'll learn that traffic is precious (I've said it once now) and how to use it efficiently. You'll learn some nice things about Clojure and see some really good examples; you'll also see that you can build beautiful assembly lines using Clojure, something we should do more of, and you'll see how you can test a complex system using simulation testing. This is the overall structure of the talk: we'll spend some time briefly explaining the domain of experimentation, especially in the context of the app we've built, then go into the implementation details of the app, and then talk about how we test it using simulation testing. So, the first portion of the talk, the domain itself: science. The scientific method has been in use since the 17th century. We propose a hypothesis, test it, make some measurements, and based on the measurements we made during an experiment on that hypothesis, we say the hypothesis was correct or incorrect. Experimentation is the step in the scientific method that lets us compare two competing hypotheses. The most common example would be a vaccine trial, where you divide your target population into a control group and a treatment group: the control group gets no vaccine, or a vaccine already on the market, and the treatment group gets the vaccine you're trying to bring to market. After a certain period of time you compare the treatment group against the control group to see how effective the vaccine has been. Historically, experimentation has been used extensively in fields like medicine and genetics, but more recently we're applying it to businesses. We evaluate business ideas against each other to see which one is better, and the underlying principle is that we use hard data to make decisions. By experiment we specifically mean a controlled experiment. We'll go through a few basic terms before we jump into the details of experiment infrastructure. A term like "hypothesis" can use some explanation: it can be as simple as, on the web, a red button is more compelling than a blue button, or the other way around.
You can also hypothesize that one model for making an offer to a user is more effective than another model. You can even evaluate entire products: say I've been using one email service for five years and I want to try a new one but I'm not sure which to use; you can use experiments to make that decision, or to compare different coupon services. Another keyword we use a lot is "treatment", which we colloquially call a "bucket"; we'll use the two interchangeably through the talk. A treatment is basically a value for a variable in your system. In the red/blue experiment, the color is the variable, and the values red and blue are the treatments. Typically an experiment has a control treatment and a test treatment: control is effectively no treatment, and test is your hypothesis. There's no need for an experiment to be restricted to two treatments; you can have as many as you want, say red, blue, green. Another term we use is "coverage". Say you're testing a recommendation engine with our experimentation platform, and we tell it, yes, show a recommendation on this product. But when it tries to apply the treatment in real time, it finds that the product it's recommending is actually out of stock, so it can't show the recommendation. It then doesn't confirm that treatment with us, so that particular session or user should not be considered part of the experiment. When we measure the experiment's success, we shouldn't attribute a purchase made by that user to a recommendation we never actually showed. Coverage of a treatment is fundamental to measuring an experiment precisely. In our system, by design, we assume every treatment we hand out will not be shown; an explicit acknowledgement is required from the client, saying, I have applied the treatment you provided, so please consider this user, this session, part of the experiment. This matters a lot: in our case only about 10% of actual traffic is covered, so your metrics are really skewed if you aren't looking at coverage. We're not getting to any Clojure yet, but there's lots coming; the domain is heavy, and we want to get through these parts first so the Clojure is actually meaningful.
Here, for example, is the sequence of interactions we'd have if you were a client of the experimentation platform and I were the platform. First you say, there's a new user, here's the user's session ID, and I acknowledge it. Then you say, give me the treatments for this user, I want the checkout-page experiment and the red/blue experiment, and I give you the buckets for that user. Then you tell me you actually showed the button to this user; that's the confirm-treatment call. Then the user does something interesting, say a purchase, and you want to let me know it happened because you later want to measure against it: how many sales dollars did this treatment make over the other, what's the margin, and so on. We'll sketch this exchange in code in a moment. Now, the basic experiment infrastructure. The most basic experiment has two treatments under it, control and test. You divide the traffic coming into your system into those two buckets, where control is essentially no treatment and test is the hypothesis you're trying to test; you run the experiment for a certain period and then compare the measurements of test against control. Traffic is split between control and test, and as we said, you're not restricted to two treatments; you can have as many as you want, but note that traffic is split between them. You're then free to compare any treatments you like: control against test one, control against test two, or even test one against test two. Once you've set up one experiment, you'll want to run multiple experiments at the same time. One way is to share traffic between them; this is what we call the "messy" way of running. The term messy isn't standard in the domain, it's just what we call it. You share traffic between two experiments running in parallel, so a user who enters your system can be part of both experiment one and experiment two. This setup is useful when you're testing two orthogonal hypotheses. For example, if you have one experiment on your home page and another on your checkout page, you know the user won't be affected by the two together, so the measurement of experiment one won't be affected by experiment two; hence, orthogonal hypotheses. The other way to run experiments at the same time is by splitting traffic between them, which you do when you want precise measurements in each experiment. A silly example: say you're experimenting on your checkout page and you've changed both the typeface and the button color, and you're not sure whether the typeface or the button is more compelling, but you want whatever decision you make to be well-informed and precise. So you run them with split traffic: a person who sees the typeface change will never see the button change, and vice versa. But you can see the traffic is split at every point, so this is quite expensive.
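As promised, here's that client/platform exchange sketched as plain data. This is a hedged illustration, not our actual wire format; the call names and shapes are made up for the example.

```clojure
;; Hypothetical shapes for the four calls described above (not the real API).
[{:call :start-session     :session "abc-123"}                ; new user arrives
 {:call :get-treatments    :session "abc-123"
  :page :checkout          :experiments [:red-blue-button]}   ; reply: {:red-blue-button :red}
 {:call :confirm-treatment :session "abc-123"
  :experiment :red-blue-button :treatment :red}               ; "I actually showed it"
 {:call :record-event      :session "abc-123"
  :event {:type :purchase :dollars 49.99}}]                   ; something to measure against
```

The third call is what makes coverage work: until the client confirms, the treatment counts as not shown.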
This is the first setup we went into production with, where we allowed a client to run both messy experiments and precise experiments at the same time. We'd split the traffic at the top level into a precise portion and a messy portion; the precise portion was split further between the precise experiments, and the messy portion could potentially become part of all the messy experiments. That's the first version we shipped. After a couple of months in production, we learned one important lesson: traffic is precious. Splitting traffic at every level, for every experiment, makes it very hard to get measurements quickly. For us it took four weeks of running an experiment with 100% of Staples' traffic to get statistically significant data for that experiment, which was expensive, because we wanted a quick feedback loop and we wanted to run experiments on smaller portions of traffic in a more fine-grained manner. This isn't only true for Staples; Google says exactly the same thing, and the next model we took to production is inspired by Google's overlapping-experiments paper. We call it the nested, or layered, model. It's basically a tree. This structure gives us all the benefits of the precise-and-messy model, except it also gives us fine-grained control over traffic: we can restrict an experiment to exactly the kind of traffic, the kind of people, we want in it. The nesting works by letting an experiment be nested under a bucket of another experiment. To give a more illustrative example (sorry, the colors have really washed out): say you have three experiments at the top level, checkout, Super Labs, and search, where Super Labs is your fictitious experiment. In Super Labs you want to compare the world of mutants against the world of non-mutants; that's your top-level division. Within the world of mutants you want to test the powers of adamantium, and you also want to test the powers of telepathy. You run those in parallel, with no split of traffic, so adamantium and telepathy share the same users. Within adamantium, you want to test its powers in Wolverine's bones versus in Ultron's shell; at this point you want your measurement to be precise, so you split the traffic between them. Because adamantium and telepathy share traffic but Wolverine/Ultron is a split, if I'm in the world of mutants and being experimented on, I can be a Jean Grey with Wolverine's bones, or a Professor X with Wolverine's bones, or with Ultron's shell. And with this model, if I want to run an experiment on all the Wolverines, I can do that, restricted to only the Wolverines, which is something I could never fathom doing in the older model.
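One way to picture the nested model is as plain Clojure data, with an experiment nested under a bucket of its parent. This is a hedged sketch; the keys and weights are illustrative, not our production schema.

```clojure
(def experiment-tree
  {:name :mutants-or-not
   :buckets
   {:non-mutants {:weight 0.5}                    ; shared control group
    :mutants     {:weight 0.5
                  :experiments                    ; in parallel: traffic is shared
                  [{:name :adamantium
                    :buckets {:wolverines-bones {:weight 0.5}  ; split: precise
                              :ultrons-shell    {:weight 0.5}}}
                   {:name :telepathy
                    :buckets {:jean-grey   {:weight 0.5}
                              :professor-x {:weight 0.5}}}]}}})
```

Running an experiment on "all the Wolverines" then just means nesting a new experiment under the `:wolverines-bones` bucket.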
This is the same setup from the previous slide using a different visualization, the layered visualization used in Google's overlapping experiments infrastructure paper. It's the same model cut into layers; each layer is orthogonal to the others, so traffic is shared between layers, and a slice of traffic through a layer gets the same treatment. In this case, if I slice my traffic here, every person in that slice gets the mutant treatment; if I slice here, every person in that slice gets the Wolverine and Jean Grey treatments. The interesting part is that non-mutants is a shared control group for all the mutants: I can compare Wolverine to non-mutants, Ultron to non-mutants, or all the mutants together against non-mutants. You can clearly see this is a shared control for all these running experiments, which we implement in our structure using a shared bucket. (Wow, the colors are not visible at all, but I'll use the pointer.) This shared non-mutants bucket can be compared against the Wolverine, Ultron, and Jean Grey buckets. If we didn't have the concept of a shared bucket, you'd have to sacrifice some traffic under adamantium to make a control group just to compare Wolverine to non-mutants. Because traffic is precious and we can't split it at every single level, shared buckets again help us use traffic efficiently. Another thing that's vital to any experimentation platform is the null test, the A/A test, which helps you rule out the null hypothesis. It compares a treatment against itself: you take an A bucket and divide it into A1 and A2, the same treatment on two splits of the same traffic. There's always some randomness, and you want to be sure your treatment, not randomness, is responsible for your measurements. When the reports on A1 and A2 come out similar, you know you've attained statistical significance in the traffic. It also tests the experimentation platform itself, confirming we have no bias in how we bucket users. (Is that legible? Can people read that? Sorry, there's something wrong with the converter.) The most frequent question people ask when we explain what we're building is: why not use a third-party service like KISSmetrics or Optimizely? We thought about that, and there are reasons we built it ourselves. We wanted to run multiple experiments over the traffic coming into our system, and beyond that, fine-grained control over how that traffic flows through all our experiments. Second, our platform is very e-commerce-opinionated: the data we collect and the metrics we report on are margin, conversion, and sales dollars, and the system is built to measure those well. We really needed low latency, because if you go to staples.com right now and search for a product, staples.com is
asking us which search algorithm it should use before giving you the result. We sit in every single request that goes to staples.com, and that's why we need to be very fast. We needed real-time reports, because that's what people want: to see the performance of their experiments as they run, not after two months. We wanted controlled ramp-ups of experiments: if you're running a risky experiment, say a new model our data scientists came up with and they're not sure how it will behave in the real world, you don't want to expose a huge chunk of your traffic to it. You start with a small amount, 5% or 10%, and as you get feedback from our real-time reports, you increase the amount of traffic going through it. Through that ramp-up you also want stickiness: as the traffic through the experiment changes, a person who received one treatment at the beginning of the experiment should keep receiving the same treatment until the end; we'll sketch this in code shortly. We also wanted statistically sound data that can be audited by our data scientists and others, and we wanted a system that could be deeply integrated with the other systems in our client's ecosystem, so we could leverage the data from those systems in addition to the data we collect. We're calling this slide "parantu" because we couldn't come up with a better word; it basically means "but": watch out for these things. If you do decide to build something like this yourself, understand that it's quite complex. You'll have to read thick statistics books; the finer details of the field are still in the research phase, and you'll have to read some white papers to understand a few things better. It's also a significant investment of time. Take Google, Netflix, or Microsoft as examples: it takes them years to build this, and it has taken us years to build and get right. So understand that it's a big undertaking, and for all you know, what you need is a third-party service that gets you up and running quickly; try that before you're sure you want something like this. And of course we didn't build it alone: we had a team of statisticians helping us. The next section is the implementation details of the system we built. This is the overall structure. It isn't very complex; there are four major parts: the API service, the reporting service, the Postgres cluster, and the ETL service. We'll go into a little more detail on each in a bit, but briefly, the API service serves all the clients' runtime needs: they start a session with us, they ask us for treatments, we give them treatments, they confirm a treatment, they record events saying a purchase happened; all of that happens here.
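Here's the promised sketch of a controlled ramp-up with stickiness, assuming a hypothetical in-memory assignment store (an atom). The point is the `or`: a prior assignment always wins, so raising the ramp from 5% to 50% never flips anyone who's already in.

```clojure
(defn treatment-for
  "Hedged sketch: returns the session's treatment for an experiment,
   or nil if the session isn't (yet) part of it."
  [store session-id {:keys [id ramp treatments]}]
  (or (get-in @store [session-id id])       ; stickiness: reuse any prior bucket
      (when (< (rand) ramp)                 ; only `ramp` of new traffic enters
        (let [t (rand-nth treatments)]      ; fair die over the treatments
          (swap! store assoc-in [session-id id] t)
          t))))

(comment
  (def store (atom {}))
  ;; risky experiment starts at 5% of traffic...
  (treatment-for store "sess-1" {:id :new-model :ramp 0.05
                                 :treatments [:control :test]})
  ;; ...and later ramps up to 50% without reshuffling existing sessions.
  (treatment-for store "sess-1" {:id :new-model :ramp 0.5
                                 :treatments [:control :test]}))
```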
The reporting service is where clients set up an experiment, schedule it, and report on it. The Postgres cluster is essentially the state behind all our functionally written code. We maintain a Postgres cluster ourselves, with one master and three standbys, where one of the standbys acts as a reporting DB; we'll talk about that in a bit. Then we have an ETL service, again written in Clojure, which talks to Redshift. In the API service, one of the more interesting parts is how we bucket users, and here's a simplified version, which I'll walk through quickly (we'll sketch the walk in code in a moment). It's basically a tree traversal: we walk the tree breadth-first, and at each level, for each experiment, we bucket you. The way we bucket you ensures that if you've been given a bucket before, we give that same one back, and that holds at multiple levels: during this session, on that device, on that group of devices, we ensure you get the same treatment. If you haven't been assigned to this experiment before, we roll a fair die, put you into one of the treatments, and give that back to you. On the Postgres cluster: since ours is a very data-centered domain, we wanted to ensure the data we save has integrity and that we don't lose much of it. We need a very quick failover mechanism: if our database crashes, we need another database to come up in its place quickly, and we need to be able to write to it. We were using Postgres as our runtime database because it's awesome, as an OLTP database at least, but there was no out-of-the-box Postgres cluster-management tool we could use. RDS didn't have replicas for Postgres when we started building this, and even with RDS I don't think there was a failover mechanism in place. So we built it ourselves using repmgr, an open-source tool built and maintained by 2ndQuadrant. You have a cluster of nodes, and repmgr runs a fairly simple leader-election process: if a node fails, the remaining nodes elect a new leader, which comes up as the new master, and the rest of the nodes start streaming from it continuously. The interesting part was the multiple lines of defense we built against failover scenarios, not just at the cluster level but inside our application itself. The first line of defense: when a failover happens, repmgr itself pushes a notification to all our application servers saying, your master node has changed, make sure you're writing to the correct one. The next line of defense: all our applications keep polling for the master, repeatedly checking, can I write to this master? The moment they can't, they go through their list of standbys, figure out which one they can write to, and start writing to the new one, because in our cluster only the master is writable; the rest are hot standbys, read-only by default.
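Stepping back to the bucketing walk for a moment, here's a minimal sketch of that breadth-first traversal, reusing the nested-tree shape from earlier. As before, this is illustrative, with an atom standing in for the real assignment store; it isn't the production code.

```clojure
(defn choose-bucket [store session-id {:keys [name buckets]}]
  (or (get-in @store [session-id name])        ; same bucket as before, if any
      (let [b (rand-nth (keys buckets))]       ; otherwise roll a fair die
        (swap! store assoc-in [session-id name] b)
        b)))

(defn bucket-session
  "Walks the experiment tree breadth-first; descends only into experiments
   nested under the bucket the session actually lands in.
   Returns {experiment-name bucket} for every experiment entered."
  [store session-id experiments]
  (loop [queue (vec experiments), assigned {}]
    (if-let [{:keys [name buckets] :as exp} (first queue)]
      (let [b      (choose-bucket store session-id exp)
            nested (get-in buckets [b :experiments])]
        (recur (into (subvec queue 1) nested)
               (assoc assigned name b)))
      assigned)))
```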
The last thing we did was use ZFS instead of the regular file system on Ubuntu. ZFS gives us mirrors and incremental snapshots, which can be backed up and recovered from easily. The other side of experimentation is where you measure things. We do most of our real-time measurement in Postgres, and we feel Postgres has a sweet spot: it's not great as a warehouse, not as good as Redshift with months of data, but give it 300 to 500 gigs and it's really good at pulling reports off them fast. Another restriction we had is that, because we built it as a cluster and the reporting database streams live from the master, we have to use the same schema as the master; we don't have a separate reporting schema. What we did instead was change what's underneath: we configured the file system and the hardware to read from disk as fast as possible, which is essentially what you need for large queries. We did end up doing a lot of crazy Postgres optimizations; you can come talk to us later about those. Here's another parantu: maintenance is definitely underestimated with respect to Postgres. We do have a vacuum strategy in place on an ever-growing DB, but the reclamation and reuse of space is still a sort of unsolved problem for us. And a shout-out to the PostgreSQL community: we've gone to IRC on Freenode loads of times and they've always given us answers and explanations. So Postgres provided the real-time reports we needed, but we still needed a way to report on historical data, the months and years of data recorded in our experimentation platform so far, so we looked at real OLAP solutions. We tried Greenplum first, because it's a cluster built on top of Postgres; loading and reporting were pretty fast, and it even had the upsert/merge strategy that most OLAP solutions lack when loading data. But it isn't hosted, and standing up and maintaining a Greenplum cluster, at least to untrained people like me, seemed very expensive. So we went with Amazon Redshift, which is maintained by Amazon; we also chose it because other systems in our client's office were already using it and we wanted to report across systems, so we went with the same data warehouse.
To move data from our system to Redshift, our OLAP DB, we built an ETL service. The way it works: our API writes a stream of events to disk, and those event files go to S3. The ETL service pulls the files from S3, transforms them, and loads them into Redshift. All of this is written as an assembly line built with core.async. core.async gives you a thing called a pipeline, which has one input channel and one output channel, and takes a transducer that transforms things from the input channel and puts them onto the output channel. You can imagine each step of the ETL process as a different group of people in a factory working an assembly line: the extract step is done by some number of workers (the second parameter to the pipeline is how many threads you want for that particular job), extracting data and putting it onto the next conveyor belt for the transform workers to pick up, and so on. It's essentially an assembly line; we'll sketch a minimal version at the end of this section. Apologies for getting to this so late in the talk, but we couldn't skip it. I'd say one of the most beautiful examples of Clojure was the previous slide, but given that, let me go to this one. Most of this slide can be summed up as "we really like Clojure, and I really like Rich Hickey", but to be honest, it really lets us focus on the actual problem: we haven't found ourselves digging into how something is implemented in Clojure. It's really expressive; the assembly line shows that, and we'll show more examples ahead. We wanted to be on the JVM because we want to be fast and we want the standard tools for debugging and profiling. It was also the established language of choice among the teams at SparX, and we wanted to leverage the Clojure talent there. It's not that we couldn't have written this in, say, Scala or Go, but we wanted to be on the JVM with a GC, and while we'd really like to spend a year learning Haskell, we can't afford that at the moment. Here's another example of interesting, expressive code. If you remember the precise/messy architecture, this code sits critically in the middle of it. It's basically a map: the left-hand side is a vector of the session's tag and the type of experiment being requested, and the right-hand side is how to bucket it. I find that really expressive, and people who joined our team ramped up pretty fast because of code like this. And this is the graph of how we generate the report on the measures. At the top of the graph are the things that are actually recorded, and below are the computations we do to finally get to a value that's meaningful to our client, something they see in a report. This whole complicated graph maps onto Clojure in a beautiful way: everything on the left is a node in the graph, and on the right is a very expressive way of saying how that value is calculated. It's another beautiful piece of code that shows how expressive Clojure is.
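To give a flavor of that assembly line, here's a minimal core.async sketch. The S3 and Redshift steps are hypothetical stand-ins; the real point is the shape: each `pipeline-blocking` is one station, with its own worker count, transducer, input belt, and output belt.

```clojure
(ns etl.assembly-line
  (:require [clojure.core.async :as a]))

;; Hypothetical stand-ins for the real extract/transform/load steps.
(defn fetch-event-file    [s3-key] {:s3-key s3-key :raw "bytes"})
(defn parse-events        [file]   (assoc file :events [:parsed]))
(defn load-into-redshift! [batch]  (println "loaded" (:s3-key batch)) batch)

(defn run-etl [s3-keys]
  (let [extracted   (a/chan 10)
        transformed (a/chan 10)
        loaded      (a/chan 10)]
    ;; 4 workers extracting, 4 transforming, 2 loading; each station pulls
    ;; from its input channel and puts results onto its output channel.
    (a/pipeline-blocking 4 extracted   (map fetch-event-file)    (a/to-chan! s3-keys))
    (a/pipeline-blocking 4 transformed (map parse-events)        extracted)
    (a/pipeline-blocking 2 loaded      (map load-into-redshift!) transformed)
    (a/<!! (a/into [] loaded))))

;; (run-etl ["events/0001.json" "events/0002.json"])
```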
Next is code defining a state machine in Clojure, where each row defines the transition of the system from one state to another. Imagine the circles-and-arrows diagrams you drew in your automata class; this maps them onto Clojure in a very beautiful way, and we'll dig into it later when we go into simulation testing. That's probably a good moment for an anecdote. One of the things we normally find ourselves failing at with Clojure is really understanding its laziness. In the absence of good practices around laziness, you can mess up pretty badly. For example, you run a function in the REPL and it doesn't run, and you go, oh, it's lazy, that's why, and wrap it in a do block or something. Or you're profiling something, and your actual function doesn't show up in the profiler because it's lazy and the work shows up somewhere you wouldn't expect; that's bad. The anecdote I was talking about: we had an application cache, and we were putting Datomic datoms in it. One thing about Datomic datoms is that they hold a reference to the root of the database, which means a link to the entire database in every datom. We were putting datoms in the cache thinking that shouldn't blow anything up, but we kept climbing to four gigs and crashing with out-of-memory errors. It turned out we were putting a lazy seq into the cache, which is obviously very embarrassing, and we had to have Stuart Halloway come and fix it for us. But it's true, and it's something you need to watch out for. As Abhinav was telling us the other day, the way to fix this is an engineering practice: at every layer of your architecture where data crosses from one layer to another, make sure there's no unrealized laziness left between the two layers, where they don't know what they're getting.
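A toy version of that bug, for the record. The cache is a plain atom, and `expensive` is a stand-in for the real transformation; imagine the elements being Datomic datoms, each holding a reference back to the whole database.

```clojure
(defn expensive [x] (str x))   ; stand-in for the real per-element work

(def cache (atom {}))

(defn remember! [k xs]
  ;; BAD: (swap! cache assoc k (map expensive xs))
  ;; the unrealized lazy seq retains `xs` and everything it references.
  (swap! cache assoc k (mapv expensive xs)))  ; realize eagerly at the boundary
```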
This is the last part of our talk. We built a system and we thought it was working, because all our tests were passing: unit tests, integration tests, all of it. But you don't really know your system works until it's working in production, and since this was such a critical piece of code, our client wanted to be really sure it would work there before it shipped. Simulation testing is one tool that helps you get that confidence. It lets you simulate the working of your system with realistic usage patterns and the varying degrees of load you'll actually see in production; we'll go into the detail of how it does that. Why test using simulation testing? Imagine your test pyramid: unit tests at the bottom, and as you go up, integration tests, API-level tests, UI tests, functional tests. Simulation testing sits on top of all of that, because it tests not the local parts of your application but your application as a whole, and not just your application: it can test the whole system of subsystems your application integrates with. Another way to look at it: with example-based testing, humans can't possibly think of all the possible test cases, which is why you move to property-based testing. Simulation testing is property testing at the system level, from the outside, rather than at the local application level. The tools we used for simulation testing are Simulant, Causatum, and Datomic. Simulant is a library and schema for developing simulation-based tests, written and maintained by Cognitect. It gives you abstractions like models, tests, and sims. A model defines the different activities that can happen in your system; based on the model you define your test cases, which are generated rather than written out one by one; and a sim is an instance of a test executed against your system. We'll go into all of this later. Causatum is the library we use to generate streams of time-based events, and we'll look at the state machine we give it in a moment. Datomic is the data store of choice because it works well with Simulant. This diagram is borrowed from Michael Nygard's talk at Strange Loop, which you should watch; it's a great explanation of the simulation-testing toolkit that was used to test this exact system, and I'll go through it briefly here. You have the system under test, the big circle, and the simulation runner; those are the two main parts. Starting from the top: the activity model is a state machine describing user behavior patterns on staples.com, say, and from that state machine you generate a bunch of events: this is how 100,000 users will behave on my system. You take those events and hand them to the simulation runner. Before running the sim, the runner has to establish state on the target system; once the target system is in an expected state, you run the sim, and while it runs you record every single thing that happens, not in a log but in a database. At the end of a sim you have the system's output and the entire log of the sim inside the database, which you can then use to run validations. This is the state machine Nivedita mentioned earlier, in a bit more detail. All the keywords you see here are states you can be in as, say, a staples.com user: you can log in, you can make a purchase, you can go to some special page that I make a note of, and so on. Each row represents a state transition, so you can go from logging in to, say, a page where a promo is displayed, and the third column is the probability of that transition. The last two columns are timing constraints: the min delay says how fast two actions can realistically follow each other, and the max delay enforces things like a session timing out if the user takes too long, so your simulated users have to behave that way.
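This isn't Causatum's actual API; it's a hand-rolled sketch of the same idea, so you can see the shape of the table and what generating an event stream from it means. Each row: from-state, to-state, probability, and a [min max] delay in milliseconds.

```clojure
(def transitions
  ;; from        to           p    [min max] delay (ms)
  [[:login       :browse      0.9  [100 2000]]
   [:login       :abandon     0.1  [100 2000]]
   [:browse      :promo-page  0.4  [500 5000]]
   [:browse      :purchase    0.2  [500 5000]]
   [:browse      :abandon     0.4  [500 5000]]
   [:promo-page  :purchase    0.5  [500 5000]]
   [:promo-page  :abandon     0.5  [500 5000]]])

(defn next-state [state]
  (let [rows (filter #(= state (first %)) transitions)
        r    (rand)]
    (loop [[[_ to p [lo hi]] & more] rows, acc 0.0]
      (when to
        (if (< r (+ acc p))
          {:state to :delay (+ lo (rand-int (- hi lo)))}
          (recur more (+ acc p)))))))

(defn event-stream
  "Lazily walks the machine from `state`, stamping each event with a
   simulated time; stops at states with no outgoing transitions."
  [state t]
  (lazy-seq
    (when-let [{next :state d :delay} (next-state state)]
      (cons {:state next :rtime (+ t d)}
            (event-stream next (+ t d))))))

;; (take 5 (event-stream :login 0)) — one simulated user's timeline
```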
Everything here goes into Causatum, which gives you back an event stream you can then use to test your target system. For each event in the stream Causatum generates, you then define what executing that event looks like. Here's an example event: I'm asking my experimentation platform for treatments for some user. The important thing to note is that it makes a request and then records the result, and if it gets an exception, it records that too. Whatever happens, it just records it and moves on, because it's not asserting anything here. It isn't waiting for some special response and acting on it; it records everything and moves on. You can compare simulation testing to functional programming, in a way: the target system under test can be imagined as a pure function. Not completely pure, because it has state, but given a certain state, it should behave in the same manner. So you extract the setting up and tearing down of state as explicit steps in your simulation. Before you run the simulation, you set up a certain state on the target system; in our case, set up this merchant, set up these experiments for that merchant, and the business rules we need for that particular merchant. Then you run the streams of events you generated, the streams of activities users can perform on the system. Then you do the teardown, where you say: I gave you some state, I ran a simulation on you, now I want your end state, and I retrieve the reports. Then I run whatever cleanup is needed to put the system back into the state it was in. That brings us to how you actually write validations over all the data you recorded. Like I said before, the data you collect while running the simulation and the system state afterwards are all inside the database, so a validation is essentially a single query. In our case, because we used Simulant and Datomic, each validation is a simple Datalog query, which looks somewhat like this. A simple validation would be to assert that there were no 500s during the entire sim: look at the third line in the query, which looks for any action logged with an exception, and the assertion fails if there was any such exception. Other validations could be things like: I never assigned two buckets of the same experiment to the same user; or, if my traffic distribution is 60/40 between two buckets and I had a thousand users, the split comes out 600 and 400, within a certain error margin.
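Here's a hedged reconstruction of that "no 500s" validation, assuming a hypothetical schema in which every action recorded during the sim carries `:action/sim` and `:action/http-status`. The validation really is just one query over the recorded data.

```clojure
(require '[datomic.api :as d])

(defn server-errors
  "All actions in this sim whose recorded response was a 5xx."
  [db sim-id]
  (d/q '[:find ?action ?status
         :in $ ?sim
         :where
         [?action :action/sim ?sim]
         [?action :action/http-status ?status]
         [(<= 500 ?status)]]
       db sim-id))

;; the validation itself:
;; (assert (empty? (server-errors db sim-id)))
```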
We can also compute the reports that the experimentation platform generates inside the simulation-testing suite, compare that report against the actual report from the system, and ensure they match as well. Simulation testing actually helped us find some critical bugs in our system that were missed by our QA team, bugs you couldn't really see happening otherwise. The other thing it gives us: because you're recording everything that happens in your system and not missing anything, you have diagnostics for what happened and why things went wrong. Say we found a bug: I gave out two buckets of the same experiment for one user. Why did this happen? What was the state of the system when it happened? In our simulation-testing tool, this is the thing we run: print the timeline for this token, where a token is the thing that identifies a session, and it gives a result like this (a very cropped-down version of what the actual timeline looks like). We can see that at this time the session started, this is how long the target system took to respond, at this point it made this request, this is the response it gave, what the status code was, and so on, and that helps diagnose the actual problem you found in your system. Simulation testing is really good and you should definitely use it, but I'd also like to point out a few things that are hard. It's hard to maintain the simulation-testing suite, because you're testing a cluster of systems, and every system moves at its own pace; there are so many moving parts that if one system changes its API, you need to make sure the simulation-testing tool has been updated for it. Over the last year and a half we've had one or two people from the team dedicated to keeping the testing tool up to date. And the reason you run tests is to make them part of your development process, to put them on CI. Putting a unit-test suite on CI for one service is easy. An integration-test suite: also easy. An end-to-end test: slightly hard. A browser test: also slightly hard. But all of that, for n services, becomes super hard. We've been trying to put our simulation-testing suite on CI and, as I said, we haven't managed it yet, but it's still in progress; we'll get there. Then simulation testing can become part of your normal software delivery process, where every three hours a simulation test runs against your systems to make sure they're all working the way they should. In conclusion: traffic is precious, and that's, I think, the fourth time I've said it. Keep it in mind; it's not just us, not just Staples, it's Google, it's everyone you can think of who's doing experimentation. You can build assembly lines really well in Clojure, and you should probably use that; ETL is a good example of where. Test your system from the outside using simulation testing. And yes, use Clojure.
And here are some really good papers and books we went through. These will stay on the slides, and we'll put them up somewhere so you can go through them later. That's it, thanks. If you have any questions, we can take them now. Can we go to that slide? Okay, all right, we can take this offline. Thanks.