We are happy to have another guest speaker today for the quarantine talks. Siddon Tang is the chief engineer at PingCAP, working on TiDB and TiKV. He's here today to talk about the testing infrastructure, a framework built to test TiDB, called Chaos Mesh. Probably what Siddon is most famous for, and you can feel free to disagree, is that he is the author of LedisDB, which is one of the hottest Redis clones on the internet now. Is that a fair assessment, Siddon? Okay. So, before I turn it over to Siddon and we begin with the talk: again, if you have a question, by all means interrupt, say who you are and where you're coming from, and then ask your question. Siddon is calling in today from China, so it is 4 a.m. where he's at. If the internet cuts out, we'll do the best we can. Okay, Siddon, the floor is yours. Go for it.

Thank you very much. Glad to see you here, and thanks, Andy, for giving me the chance to talk about how we use Chaos Mesh to test a cloud-native database. Today's talk is "Testing a Cloud-Native Database with Chaos Mesh." Okay, let me check. Before the talk, let me introduce myself briefly. My name is Tang Liu, and you can call me Siddon. I'm the chief engineer at PingCAP, and as Andy said, I'm a maintainer of TiDB, a cloud-native distributed database, and of course of TiKV, a distributed key-value store, and of Chaos Mesh, and I also work on chaos engineering.

Here's today's agenda. First, I will introduce TiDB, and I will use TiDB as the example to show you how we use Chaos Mesh to test a cloud-native database. I will tell you some of our painful on-call experiences with TiDB, to show you why chaos engineering is so important in the distributed world. Then I will show you why you need Chaos Mesh if you want to apply chaos engineering to your system.
And of course, I will tell you how to start a chaos experiment with Chaos Mesh. In the end, I will talk about some future plans for Chaos Mesh, in case you are interested and want to contribute.

Okay, let's begin. The first part is "Hello, TiDB." What's TiDB? TiDB is a well-known distributed transactional database, widely used in the world now. You can see that it has more than 1,000 production users, and more than 24,000 GitHub stars. TiDB supports many features. For example, it speaks the MySQL protocol directly, so if you use MySQL and you find that your business grows bigger and bigger, and MySQL cannot hold your data but you don't want to manually shard your MySQL, you can try TiDB. TiDB is also an HTAP database; we have published a paper about it in the newest VLDB, and you can read the paper soon. It supports hybrid transactional and analytical processing, so you can run both your OLTP and OLAP workloads together in TiDB directly. And because TiDB is a cloud-native database, I think of it as in the same family as Amazon Aurora and Google Spanner; as far as I know, some of our customers have migrated to TiDB from Aurora and Spanner.

Here is the architecture of TiDB. In the center part, we provide two storage layers: one is TiKV, which is row-format storage, and the other is TiFlash, which provides column format. In the left part is TiDB, which is the MySQL-compatible server; you can use any MySQL driver to communicate with TiDB, mainly to run your OLTP workload, though it can also run some OLAP workload. In the right part, we built a Spark connector called TiSpark, so you can run all your OLAP workloads in Spark against TiKV. As you can see, a TiDB cluster is made up of several components.
Maintaining these components in production is not an easy thing, and of course we have met many problems. Here are just a few that I think you will find interesting.

The first: can the Linux kernel be trusted? In TiDB, because we are a database, we want to save data to disk to keep it safe, and we use the common file APIs to write data. But in some places, we forgot to call flush or fsync. You might think that's mostly fine, because even if the TiDB process crashes, the kernel can still guarantee that the data is safe: the kernel will flush the data from the page cache to the disk through its writeback mechanism. So our data is safe, right? But is it really safe? We hit a Linux kernel bug in a user's production environment. When the kernel tried to flush the page cache to the disk for us, it hit a bug; you could see an error through dmesg, something like the slab allocator being unable to allocate memory. And when the kernel hit this error, it just abandoned the page cache directly. So we lost our data. That is horrifying in production, but we hit it. From that time on, I have not trusted the Linux kernel anymore, and every time I want to save my data, I use fsync or flush to force it to the disk.

Here's another on-call story. One of our most important customers runs TiDB on the cloud, and one day they called us because the read latency had increased, but the customer's workload hadn't changed. The only abnormal metric we could see was the cached memory: it dropped a lot at that time, and we had no idea why. Finally, we identified the problem, and it's very interesting: the cloud vendor ran a script on the host machine.
In the script, it would plug or unplug memory devices randomly, which could flush the page cache to disk. That is why the cached memory dropped. But for TiDB, since the data was no longer in the page cache, TiDB had to read it from disk again, so the latency increased. Even now I still can't understand why the cloud vendor wrote such a script.

How did you find out they ran a script? Did you ask them and they told you? How did you discover that they ran a script on the host machine?

It's very interesting how we found this problem. We ran some scripts of our own in the background, regularly, and when the latency increased, we found an abnormal process, and we identified that this process was created by a script. So we found the problem.

So you asked the cloud vendor, and they confirmed this? Or you just assumed it?

The cloud vendor didn't know about this bug before. We showed them how their script could cause the problem, and then they knew there was a bug in their script.

Okay, so you told them, and they were like, oh yeah, you were right, there was a bug.

Yes.

Can you say who it was? Was it Google, Amazon, Microsoft, Alibaba?

No, no, no. It's just a cloud vendor in China.

Okay, okay. But not Alibaba.

Not Alibaba, not Tencent.

Got it, fair enough.

So as you can see, I still can't understand why they created such a bug, because before we found it, the script had already run on hundreds of machines.

And those are just two examples. We have met many more problems than these two. We have seen network packet problems on some cloud vendors. And because some cloud vendors host multiple tenants, sometimes we were
throttled by another user's job on the same machine. And one of our customers' machines was hacked, and the hacking script killed TiKV randomly. So we have seen many things. The conclusion is: errors can happen anywhere, at any time. Building a distributed database is hard, but making sure it works as expected is even harder. We need to survive in this complex distributed world, and in my opinion that is hardly an overstatement. So here we need chaos engineering.

Let's talk about chaos engineering. What is chaos engineering? On the web, you can find that chaos engineering is the discipline of experimenting on a system to help you build confidence in that system. I want to emphasize that it is about building confidence: chaos engineering is a way to help you gain more confidence in your system in the distributed world.

Here is a brief history of chaos engineering. It came from Netflix around 2010, with Chaos Monkey. Later, the Principles of Chaos Engineering went online; I recommend that if you are interested in chaos engineering, you read those principles, so you can learn chaos engineering more deeply. And at the end of 2019, Chaos Mesh was open-sourced by PingCAP. PingCAP has also contributed to chaos engineering more broadly: we wrote a chapter about chaos engineering on a database in the book "Chaos Engineering." It would be my pleasure if you read that book after this meeting.

Okay. So chaos engineering is just a way to help you gain more confidence. How do you do chaos engineering? In my opinion, doing chaos is very easy: if you want to do chaos in practice, you just need four steps. And don't worry about it.
I will use TiDB as the example. The four steps are: first, define the steady state; then make a hypothesis; then introduce real-world variables; and then try to disprove the hypothesis.

Let me walk through them with TiDB. First, the steady state. In chaos engineering, the steady state is what you use to confirm that your system works well under normal conditions. For example, here we use the QPS metric to confirm that our TiDB cluster is working well.

Then we make a hypothesis. TiKV uses the Raft consensus algorithm to support fault tolerance. So if we kill one TiKV instance, and that instance happens to be a Raft leader, QPS may drop, because the client cannot write data through that leader anymore. But the other replicas, using the Raft algorithm, will elect a new leader soon to serve the client's writes again. So our hypothesis is that QPS will drop and then recover.

Then we run the experiment and introduce the real-world variable, which here is simply: kill one TiKV instance randomly.

Finally, we try to disprove the hypothesis. Say we run the experiment and kill the TiKV instance, and we find that QPS dropped but never recovered. Then we have found a bug in our system, and we will try to locate the problem, fix the bug, and run the kill experiment again. As you can see, there are four steps, very easy.

So now we know how to do a chaos experiment, but where do we start? I think we need a chaos platform. In my opinion, a chaos platform should have the following features. First, it must support running chaos experiments automatically; no one wants to run chaos experiments one by one manually, which is tough work. And the platform must provide lots of different real-world variables, so you can simulate as many of the real problems in the world as possible.
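As an aside, the kill-one-TiKV-instance experiment described above maps directly onto a Chaos Mesh config file. Here is a sketch of what such a definition might look like; the exact `apiVersion`, scheduler fields, and pod labels have varied across Chaos Mesh releases and TiDB deployments, so treat this as an illustration rather than a copy-paste recipe:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: tikv-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill            # kill the selected pod
  mode: one                   # randomly pick one matching pod
  selector:                   # limits the blast radius to TiKV pods
    labelSelectors:
      app.kubernetes.io/component: tikv
  scheduler:
    cron: "@every 5m"         # repeat the experiment every five minutes
```

Applying the file with `kubectl apply -f` starts the experiment, and `kubectl delete -f` stops it.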
And of course, the most important thing: the system under test must not know anything about the chaos experiment. So, is there such a chaos platform? Yes: here is Chaos Mesh.

Here's some background on Chaos Mesh. When we began to develop TiDB, maybe five years ago, we also began to develop an internal chaos engineering platform, which we call Schrodinger. But as the platform grew better and better, we realized that not only PingCAP but also our customers and the community would need a chaos engineering platform. So we decided to start a new project, which we call Chaos Mesh. We started the project around April 2019 and open-sourced it at the end of 2019.

Did you guys consider Chaos Monkey at all? Why not use Chaos Monkey, which had been around for nine years when you started? Was there something about Chaos Monkey that was insufficient for database testing, and that's why you had to build Chaos Mesh?

We didn't try Chaos Monkey, because at that time, as you can see, TiDB was already a complex distributed system, and we wanted to customize our chaos for it, so we decided to build it ourselves. So we didn't really try Chaos Monkey or any other tool, but we were inspired by Chaos Monkey; those tools guided us in how to do chaos engineering.

From what I remember of Chaos Monkey, it just killed random servers in the data center to see how the overall system would recover. My question is: is there something specific about a distributed database system like TiDB that merited building something from scratch?
Okay, I will answer your question about why Chaos Monkey wasn't enough. I will also compare Chaos Mesh to Chaos Monkey soon, so you can see why: we wanted to customize our failure injection for our own chaos engineering, and to do many other things beyond killing servers.

Okay, so that is the background of Chaos Mesh. Now let me tell you why you need Chaos Mesh if you want to do chaos engineering. In my opinion, Chaos Mesh has the following advantages: it is based on Kubernetes, it provides lots of real-world failure simulation, it is easy to use, it provides a dashboard so you can observe your system, and it has a growing community. I will expand on these one by one. And in my opinion, the most important reason you need Chaos Mesh is that it can really help you find bugs; I will show some of the many bugs found by Chaos Mesh soon, too.

First, it is based on Kubernetes. We want to test a cloud-native database, and everything cloud-native should run on the cloud; we know that Kubernetes is the operating system of the cloud, so Chaos Mesh itself also runs on Kubernetes. It uses some hacky ways: it runs as a DaemonSet, and it can also run as a sidecar on Kubernetes directly. Using these two ways, your application or your system doesn't know anything about Chaos Mesh.

Another reason is that Chaos Mesh provides lots of real-world variables. Chaos Mesh can kill a process or pod randomly, delay the network, make file reads or writes fail, and even inject failures into the Linux kernel directly. Here is a big picture of why you should choose Chaos Mesh over other chaos engineering platforms: you can see that Chaos Mesh
provides nearly all of these ways to inject failure, while Chaos Monkey, by comparison, provides only a few. That is why we didn't want to use it at the time, and instead began to develop our own chaos engineering platform.

Chaos Mesh is very easy to use because it is based on Kubernetes: you can deploy it directly on a Kubernetes cluster, and it doesn't require other special dependencies. Your system under test doesn't need to know anything about Chaos Mesh, and you don't need to modify your system's deployment logic. If you want to run a chaos experiment, you just provide a YAML config file; and if you don't want to edit files, you can even use the dashboard to run the experiment.

Chaos Mesh also provides a dashboard to help you manage and monitor all the chaos experiments, and it gives you observability. The dashboard looks like this; in it, you can see the chaos experiments directly. For example, here we ran three chaos experiments. The first two are short-term pod failures, and you can see QPS recovered fine. But for the third experiment, a long-term pod failure, we found that QPS did not recover for a long time, so we found a bug in TiDB. As you can see, the dashboard is a very convenient way to see whether your system is behaving well or not.

As for the growing community: you can see the stars on GitHub have increased a lot, and we now have maybe 37 contributors, not only from PingCAP but also from Red Hat and other companies. We have many adoptions now: not only PingCAP, but many other companies use Chaos Mesh to test their own business.

Chaos Mesh has all of the above features, and it can really help you find bugs. Here is
a list of bugs found by Chaos Mesh. As you can see, some of these bugs are very serious, but luckily we found them before we released the version.

Okay, so that was why you need Chaos Mesh if you want to do chaos engineering. Now we can start a chaos experiment with Chaos Mesh. But before doing a chaos experiment, you should note the following things. First, if you are a newbie to chaos engineering, don't run chaos experiments in the production environment directly. Because you are new, you can't know whether your system is strong enough to survive the chaos experiment; you need to run the experiment in a testing environment first.

The second thing: if you do want to run experiments in production, you need to control the blast radius, and increase the radius only gradually. As an analogy: first inject failure into one specific person; then increase the radius and inject failure into the people on one street; then increase it to the city. If you immediately inject failure into the whole system and make it crash, I guess tomorrow your boss will let you go.

The third thing, which I think is the most important: doing chaos experiments is not just injecting failures randomly and blindly. Time is limited, and we must do the most valuable things. So when we do a chaos experiment, we should ask: what are the most valuable experiments to give us more confidence in the system?

In your list on the previous slide of all the bugs you found in TiDB: were these things you found in the first one or two minutes of running Chaos Mesh, or did it take several days for those bugs to manifest themselves?

As far as I know,
some bugs can be found in just one or two minutes. For example, I mentioned that we had hit network corruption bugs before; using Chaos Mesh, we just corrupted network packets and found many bugs very quickly. But some bugs take a long time to find, maybe one or two days. For example, TiKV uses RocksDB to save data, and finding bugs there is not easy, because RocksDB is robust; it's very strong and has been stable for a long time. But RocksDB still has some bugs, and finding them with Chaos Mesh can take a long time. I remember that we located a serious RocksDB bug that had been in RocksDB for more than three years; we ran Chaos Mesh for about one week to find that bug.

Let me continue. When you want to run a Chaos Mesh experiment, you must make sure that your application is already running on Kubernetes, because Chaos Mesh can only work on Kubernetes. So let's begin. Using Chaos Mesh is very easy: if you have a Kubernetes cluster, you can install Chaos Mesh in just one step, and if you want to try it on your local machine, you can install Chaos Mesh with kind.

Here is a simple example of Chaos Mesh. We just need to provide a config file to tell Chaos Mesh how to run the chaos experiment. In the config file, you define the action; here it is pod-kill, which means we kill a pod. The mode is "one," which randomly selects one pod. We use the selector to control the blast radius, and the scheduler says how often to run the experiment. So through this config file, we will randomly select one TiKV instance and kill it every five minutes. After we write the config file, we can use kubectl apply to start the chaos experiment, and we can use kubectl delete to stop it. And through the dashboard we can
see the results of this experiment: the chaos experiment runs every five minutes and meets our expectations, as the next slide shows. Of course, that was the config-file way, and I am a lazy person, so sometimes I don't want to use the command line. Through the Chaos Mesh dashboard you can also configure the experiment easily, as this example on the next slide shows. That's all for how we run experiments.

In the end, I will talk about the future plans for Chaos Mesh. Chaos Mesh is now widely used, especially in China, but there are many things still to do. First, we need to improve our dashboard, because right now the dashboard only works with TiDB and some specific distributed applications; we want users to be able to integrate their own business into the dashboard directly. Another thing is that we want to do chaos on AWS or GCP directly, which means using Chaos Mesh to, say, shut down a whole region for your business. Another thing: right now, all of the chaos experiments are defined by ourselves, and we want to integrate with machine learning or AI, to let AI help us define the chaos experiments. And we will integrate with CI/CD systems such as Argo and Jenkins, which means you can run Chaos Mesh from Argo, and after Chaos Mesh finishes, you can release your version and publish it through Argo or Jenkins. If you want to know more about Chaos Mesh, you can visit our website, GitHub, or Twitter.

Okay, thank you very much. That's all for today's talk.

Okay, awesome. I will applaud on behalf of everyone else. I think Lenny, you have a question? Yes, thank you. Lenny, you should be
allowed to unmute yourself. There you go.

Oh, okay, nice. Thanks, Andy, and thanks, Liu. It's a pretty interesting talk. Oh, by the way, my name is Ling; I'm a PhD student here working with Andy on databases. One question I want to ask, and am very curious about: how do people at TiDB actually come up with the set of chaos experiments to run to test various aspects of TiDB? Especially given that there are many failure options supported by Chaos Mesh, and there are also various combinations of possible failures. Basically, I'm just curious.

Okay, that's a good question, and thanks, Lenny. So the question is how we combine so many kinds of failures to test TiDB, right? How we do the combination?

Right, exactly. Especially since I think in your talk you also said that you don't want to just blast everything, right? You want to have some sort of focus. So I don't know how you do that at PingCAP.

First, TiDB has different layers, so mostly we run chaos experiments within some scope. Mostly we do chaos engineering against IO or the network, because we have found that most of our serious issues are caused by the network and by the file system, so mostly we test those. The other experiments we have run the most are randomly killing a pod or randomly delaying a pod's network. So, as you can see, mainly we focus on IO.

I see.

Then we do some related things for the network, like delays; then process-level failure injection. And we have really injected failures into the Linux kernel, but only after we have already met some interesting bug and want to see how that failure model can reproduce it.

I see, I see. Thanks for
sharing.

Yeah, thanks. A question from Van. You want to unmute yourself?

Oh yeah, sure. Hi, I'm Van. I'm working with Andy on the testing infrastructure, and I'm particularly interested in the decision of why you only focus on Kubernetes instead of on bare metal or on VMs, like Chaos Monkey does. It occurs to me that the runtime you are working on, the container layer, is an additional layer between the container and the host; so in terms of IO, there will be an additional layer, which introduces additional possibilities for bugs that will only be more troublesome, it sounds to me. I'm not quite sure what the thinking from PingCAP is.

Okay, thanks. Your question is why we only support Kubernetes and don't support VMs. Because TiDB is a cloud-native database, and we believe that in the near future most applications should run on the cloud, and Kubernetes is the operating system of the cloud, we think that supporting Kubernetes is the future.

I see, thanks. What percentage of your customers, if you know, run TiDB via Kubernetes? Is it like 75%, a large percentage, or is it still pretty rare?

Is it very common that people run TiDB with Kubernetes, or is it a small number of customers? Maybe one year ago it was small, but this year many customers, especially in China, tell us they want to migrate their infrastructure to Kubernetes and run TiDB on Kubernetes. And outside of China too, in America, I think most of those customers run TiDB on Kubernetes.

Okay, a question from Don.

Hi, this is Don. I'm an incoming faculty member at Penn State University. Very nice talk. This whole thing reminds me of Jepsen; I don't know if you're familiar with that. It's a very popular distributed
database testing framework. I think TiDB is Jepsen-certified?

Yes, I saw that before.

So I'm kind of wondering, because we're very lucky in this particular slice: Jepsen also allows people to inject different kinds of faults, though I don't know if it's as broad a set of faults. So I want to hear your comments on how Chaos Mesh is related to, or orthogonal to, Jepsen. And another thing I want to ask: right now Chaos Mesh supports Kubernetes, where all the different services are bounded by a container boundary. Have you ever thought about extending this boundary, making it more general? For example, there are multiple parts connecting with each other in a general process; can you inject problems, or CPU burn, memory burn, whatever, at the level of a process, not a container?

The first question is how Chaos Mesh relates to Jepsen, right?

Yeah.

You mentioned that TiDB has passed the Jepsen test, and we have worked with Jepsen for a long time to make TiDB pass it. Jepsen is one way to test a database: for example, with TiDB, Jepsen will start a TiDB cluster and also inject some failures. But the most important thing about Jepsen, I think, is that it is a mechanism to check consistency, to check whether your implementation works well. And at PingCAP, internally, we have already integrated Jepsen with Chaos Mesh: we still use Chaos Mesh for the failure injection, but we use Jepsen for verification, using Jepsen to run TiDB and check whether TiDB meets linearizability or snapshot isolation.

So just to interject: so Jepsen introduces the same low-level syscall or hardware failures, but it has the rules above
it to determine whether you're violating transactional consistency guarantees.

Yes, so you can think of it as: we can work well with Jepsen. And by the way, we also ported Jepsen from Clojure to Go; we wrote a Go Jepsen.

Really?

Yes, because we don't like the JVM and didn't want to write Clojure; maybe it's a hard language. We ported it all from Clojure to Go, and we call the project TiPocket, at PingCAP. Oh, sorry, what was the second question?

Oh yeah, the second one was pretty much: right now the platform support is Kubernetes, so I would guess that most of the fault injections are injected in units of containers. I'm just curious whether this can be generalized to more common applications with other boundaries. For example, could I inject all this kind of stuff into a general process, rather than having it bounded by a particular Kubernetes container? So, for example, you could cut it off and inject different kinds of syscall failures into it; you could also try to wrap around a particular process's sockets and listen, drop packets, create timeouts, and all this kind of stuff.

Okay. So the question is that right now users can only inject into containers, and we should increase the scope, right?
Yeah, I definitely see the reason why Kubernetes is very appealing, but I think there's also a chance that you probably want to make it more general, to let any kind of distributed application benefit from it.

Okay. Chaos Mesh has two ways to help you inject failures into the system. One way is the chaos DaemonSet: the chaos daemon is just like a daemon process running on the Linux host directly, so it can help you inject failures not only into containers but also into other processes running on Linux. The other is the chaos sidecar: it is another container running in the same pod as your application, and the sidecar can do failure injection, for example, injecting failures into a socket or the binding of a socket, like you said before. So in my opinion, Chaos Mesh can not only inject into containers directly; it can also do the other things you mentioned. And there is also a trick: although we said Chaos Mesh can only run on Kubernetes, its failure-injection mechanisms can still be used on a local machine or on bare metal directly. So you can use the failure-injection parts of Chaos Mesh not only on Kubernetes but also on a local machine.

Okay, thanks. Another question from Dominic.

Hi, yeah, thank you. I'm Dominic, I'm a principal engineer at Cisco Systems, and I have a question for Siddon Tang; I hope I pronounce the name correctly. I saw on your Twitter profile that you also use TLA+, the temporal logic of actions, and I wonder what your opinion is: how do chaos engineering and formal specification relate? Do they complement each other, or are they completely unrelated?

You're talking about formal verification?

Yes, TLA+, the temporal logic of actions, and how do
they relate?

Okay. Your question is how chaos engineering compares with TLA+ or other formal methods. In my opinion, the difference is this: TLA+ is a way to verify that your algorithm is right. For example, we use TLA+ to verify that our transaction algorithm can work well. But after you use TLA+ to prove that your algorithm is correct, you still cannot guarantee that it works well in the real world, because you may introduce bugs in the implementation. So we use Chaos Mesh and other ways to verify that the implementation works well in the real world. In my opinion, TLA+ is a way to verify the algorithm before you really write the code, and Chaos Mesh is a way to verify that your code is right.

Okay, so you're saying that formal verification is to verify the specification, and Chaos Mesh is to verify the implementation.

Yeah.

Cool, thank you.

Okay, anybody else? So I guess, the memory burn test: what is that? Is that flipped bits? What does it actually do? CPU burn is when the CPU just starts computing garbage; is memory burn where you start getting flipped bits?

The memory burn is simple: maybe you start an application that takes up most of the memory, to simulate an OOM problem.

So what does it actually do: does it flip values in bits, or does it tell you there's no more memory?

Maybe it simulates an OOM problem, yeah.

Have you tried doing the bit-flip thing, where you spawn a thread and then start randomly flipping bits in the address space just to see how it behaves? I know that the DuckDB guys have tried that, and I've heard of it in other commercial systems. I guess Chaos Mesh can't do this: you malloc a bunch of space for the database system in memory, and then you have another thread randomly flip bits in memory, just to
see what breaks, and how fault-tolerant you are. Have you tried that one?

Oh, yes, sorry. For an in-memory database, you want chaos that flips bits in memory like this. For this we can use kernel chaos, and as you can see on the next slide, yes, we can help you inject failures into the memory directly.

Got it, okay. Okay, cool, awesome. Alright guys, so again, thank you for being here. We appreciate you staying up; I realize it's 5 a.m. in the morning where you are, so whether you're going to go to bed or the day is just getting started, we appreciate you getting up early and spending time with us.