Hello everyone. Can everyone hear me okay? I sound British. That's just a thing James understands, so it's good to have another Brit in the room, you know, for moral support. So I'm Chris. I love Elixir. I've been doing Elixir for a few years now and given a few different talks about some of the things I've done in the past. As Desmond already said, we host a podcast together. It's definitely worth checking out; we talk about a lot of the things we've been going over today as well, so it might be worth listening to the back catalogue. And Desmond likes to plug his consulting business on there too, so you can hear about that. I also help organize the other Empex, the one that's in NYC, which hopefully we'll see some of you at, and a shout-out to all the New York crew who are here today. It's really good to see everyone.

So I work at a company called Frame.io. I'm a director of engineering there, where we build review and collaboration software for video teams. We're used by big companies like Vice and Turner and so on. There are loads of stats here: 450,000 customers, founded in 2014, based in NYC. And what I'm going to talk to you about today is quite a long journey that we've been on over the last year, where we've gone from Ruby on Rails to Elixir and Phoenix. I'm going to give you a whirlwind tour of some of that journey. Honestly, if they would have let me, I probably could have spoken for the whole day about some of the things we've been doing. And that's not because I like speaking; it's because there's just been a lot. So hopefully you'll get a little taste of that today and see some great use cases for Elixir in the wild as well.

So let's start with: why did we do a rewrite, and why was it in Elixir? First of all, yes, we did that thing. I'm sure everyone's familiar with this blog post. Sorry, Joel. Yes, we did a rewrite, and I'm going to try to justify some of the reasons why. Andrew actually put this really well and succinctly earlier: software grows. Software grows over time, and it gets complex and it gets messy. And when you're in a startup especially, requirements change, right? You're working on one thing one week, and the CEO comes to you the next week and says, "You know what? We need to pivot." And often the code base reflects some of that. That was very much the position we were in at Frame.io, with a lot of debt in the code base.

So I'll run you through some of the key issues that we had. It was a Ruby 1.9.3 app running Rails 3.2, two major versions out of date now for anyone who keeps up with the Rails community. It had no tests. None at all, for anything in the app whatsoever, which also, coincidentally, makes rewriting it quite difficult, but I'll get to that in a bit. It also had a custom ORM that was backed onto DynamoDB, which you might be thinking is quite unusual, given that Rails comes with this framework called Active Record. And yes, it was quite unusual. This talk's recorded, and I'm going to feel really bad for our CTO, who wrote quite a lot of this, but hopefully he won't watch. There was a lot of metaprogramming everywhere across the app; it took a really long time to trace any call to its source. And there was absolutely no logical separation of any aspects inside the application.
So, being a collaboration app, it's all about what you can access, and when, and how, and there was none of that. It was very much your classic MVC: put everything in the models and have really fat controllers. And there were also no metrics or visibility into performance.

Okay, so that's the API side; let's touch on the database side. I mentioned that it was backed by DynamoDB. For those of you who are familiar with DynamoDB, it's effectively a NoSQL key-value store, so it's really, really good for horizontal scaling, except we weren't using it that way whatsoever. Everything was a string. Everything, even nulls: there was a value called "JTNullValue", which was apparently a null, but also a string. All of our foreign keys were stored on the entity but encoded as a JSON list, so nothing could be updated atomically. We would actually drop customer data left, right, and center if two updates happened at the same point in time. And DynamoDB cost us a lot of money to run this workload. Then, to top it all off, there was no pagination whatsoever, because of the way DynamoDB was structured, which resulted in some of our largest customers having response times of up to 45 seconds. That's no exaggeration: I literally had to raise the load-balancer timeout to 45 seconds to make sure we could deal with those requests.
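To make that concrete, here is a hypothetical sketch of the shape a legacy item had. The field names are invented for illustration, not taken from the talk:

```elixir
# Hypothetical legacy DynamoDB item, expressed as an Elixir map.
# Every value is a string, even numbers and nulls, and the foreign
# keys live in a single JSON-encoded attribute. (Names are invented.)
%{
  "id" => "asset-abc-123",
  "view_count" => "42",                 # an integer, stored as a string
  "deleted_at" => "null",               # a null, also stored as a string
  "comment_ids" => ~s(["c1","c2","c3"]) # foreign keys as a JSON list
}
# Appending to "comment_ids" means read, decode, append, re-encode, and
# write the whole attribute back, so two concurrent writers can silently
# overwrite each other: exactly the dropped-data problem described above.
```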
So yeah, that's kind of it in a nutshell. I came into the company and we were all talking about doing this rewrite, and we really felt we'd got to a point where moving forward wasn't a great option for us with the existing code base. I'm sure many of you in the room are thinking, "You know what, my code base is bad, but I probably wouldn't rewrite it," and I completely understand that perspective. Rewrites are big, they're hairy, they're dangerous, they're costly, right? But we decided that if we were going to make a dent in this thing, which was a monolithic API, our best bet was starting over, rewriting it, and carrying on.

So why Elixir? Again, I think the speakers today have done a fantastic job of arming those of you in the room who haven't used Elixir yet with a lot of the reasons why it's so fantastic. For us: we had a lot of Ruby and Python developers, and I think the ramp-up from there is pretty straightforward; a lot of people have made that transition. The concurrency was a key aspect of why we chose Elixir. There's also the fact that we're building on top of something that's been out there for years; as Emma so rightly pointed out earlier, it's a fantastic thing to be building on years of legacy and a very well architected system. And, you know, lovely language attributes as well. But the biggest reason was really that someone else had already sold it internally, and I was just coming in and helping to bring that thing to life.

So next I want to take you through what the system looked like and what we shipped, and then give you a tour through some of the more interesting parts of the system. Sorry, this diagram might be quite small for those at the back; I was trying to fit in as many of the services as possible. Really, we had this big monolithic API in the middle. It was a Rails API. Then we had a few Node microservices around the edges that did a few different things inside the app. But really, you can see this as one big monolithic system.

There was a high degree of coupling between all of these different components. They all relied on the same schemas; they all relied on the same database, ultimately going through DynamoDB, as I mentioned before. And to tell you the truth, some of these services were not great. The email digest service on the right-hand side here we had to restart every single morning. We'd gone through a really large growth phase, and dealing with that put an absolute strain on systems that weren't well architected from the beginning, because, you know, you're a startup, and who knows if you're going to be that successful at that point. The digest service in particular was the bane of my existence for quite a few months. I basically woke up every single morning, went into EC2, clicked the restart button, and hoped for the best. But it kind of worked out. Yeah.

So, we had all of these different components. In terms of clients, we have an iOS app, we have a web application, and we have these embedded Adobe applications as well. So there's a lot out there in the wild, and things we can't easily update. What we replaced in the system are all of the parts I've highlighted here: basically most of it, and actually kind of all of it. There was actually a bit more as well, but I won't go into that today because it's probably out of scope for this talk. But, you know, we really had to address this head-on.

So we thought about it and said: if we're going to tackle this, we're going to have to ship all of these different parts. What we actually shipped was this Elixir-powered API, a new notification system, a new real-time service, and a support tool so our customer support team could access all of this data. And at the same time, we took all of our data from DynamoDB and migrated it over to Postgres. We did this whole migration in about eight months, start to finish, with a team of three developers, two of whom had never worked in Elixir before, if that's some kind of metric of adoption and how good it is. As well as that, we Dockerized the whole thing. Like many of the other people on the deployment panel earlier, we run everything inside Docker, and we completely rebuilt our tooling from the ground up. So this was a large project, something I've spent a good amount of the last few months dealing with. We actually shipped all of this in November, so I can give you some real stats, because it's actually been running in production. When I wrote this proposal, it wasn't, so I'm glad about that.

What the system looks like after we shipped everything is a bit like this. Again, this might be slightly difficult to see, but effectively, in the middle here, we have a very large monorepo that contains everything inside an umbrella application. At the bottom there's a real-time service that we actually shipped before we did any of this other work: we swapped our Socket.IO Node.js service for an Elixir Phoenix service so we could understand what it takes to run Elixir and Phoenix in production, how to scale it, how to run it in Docker, all of those kinds of ideas, before we approached this big monolithic chunk. That was a really great learning experience for the team.
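The talk doesn't show the real-time code, but a Phoenix channel of the kind described is roughly this shape. A minimal sketch, with invented module and topic names:

```elixir
# Minimal sketch of a Phoenix channel like the one that replaced the
# Socket.IO Node.js service. Module and topic names are invented.
defmodule RealtimeWeb.ProjectChannel do
  use Phoenix.Channel

  # Clients join a per-project topic such as "project:abc-123".
  def join("project:" <> _project_id, _params, socket) do
    {:ok, socket}
  end
end

# Elsewhere in the system, a change gets pushed to every subscriber:
#   RealtimeWeb.Endpoint.broadcast("project:" <> id, "comment:created", payload)
```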
And then inside the big monolith there, I don't know if you can read it at the back, but there's a part of the API on the left called Munger. Munger, for us, is a kind of intermediary API layer, and I'll explain more about that later, but you can think of it as a consumer of the new API that acts as a data-translation layer between our clients and the new API itself. All of that bit in the middle is Elixir. We separated out all of our core business logic, we moved everything to Postgres, and we added a bit of Memcached in there as well. So this is an overview of the architecture.

So, everyone wants to see some wins, right? The grand reveal: some really cool stats. We were running around 40 EC2 instances before we made this migration, about 20 for our API service and about 20 for our Node.js real-time service. We moved all of that down to five. We run everything now on five EC2 instances, and they're roughly comparable in size, so this is an okay comparison. The difference now is that we use Docker and ECS, so we're packing all of those services into those instances. There's no more "our API service gets deployed over there and runs on its own server"; we treat compute and memory as one big allocation that you put services inside of, and that's how you get some of these wins. If you haven't looked at ECS or Kubernetes or Mesos or any of those Docker scheduling systems, it's well worth a look. And, I think it's heretical to say, but the Erlang VM runs really well inside of there. I'm sure other people have different opinions.

In terms of response time, we're looking at about a 30-millisecond response time at about 120 requests per second. Pretty great. That was down from about 300 milliseconds previously, and at the 95th percentile it was actually a fair bit higher than that. Our database costs went down by 91%. That's largely because of the workload we were running on DynamoDB, and honestly, you can't attribute the DynamoDB part to Elixir, I know that, but it's a nice case study for using the right database for the job. And we now have about 70% test coverage across the board, with about 2,000 tests, so we feel safe continuously deploying this thing, which is great. Before, we'd literally cross our fingers and hope every deploy worked out and we didn't bring down production.

Visibility-wise: if you're not running Elixir in production yet, some of you might have questions about New Relic substitutes and things like that. For us, we just run StatsD. We pipe everything through StatsD to Datadog, and we get really great visibility just from looking at memory utilization. We also use a tool called Recon, which lets you take a really deep look into the VM and pull out some of those nice scheduler statistics as well.
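As a sketch of that kind of visibility, here is roughly how scheduler utilization can be pulled out of Recon and shipped as a gauge. This assumes the Statix StatsD client, and the metric name is invented:

```elixir
# Sketch: sample per-scheduler utilization with Recon and report it as
# StatsD gauges. Assumes the Statix client; metric names are invented.
defmodule Metrics.SchedulerReporter do
  use Statix

  def report do
    # :recon.scheduler_usage/1 samples scheduler wall time over the given
    # window in milliseconds and returns [{scheduler_id, usage}, ...],
    # where usage is a float between 0.0 and 1.0.
    for {id, usage} <- :recon.scheduler_usage(1_000) do
      gauge("vm.scheduler_usage.#{id}", Float.round(usage * 100, 1))
    end
  end
end
```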
And for me, one of the best things is that we started this project with a clear goal about our architecture. We'd learned a lot of lessons from the previous API, and I think this goes without saying when you do a rewrite: you're taking what was there before and thinking about how to improve it. What we ended up with was a really clean, modular, documented and, I would say at this point, pretty maintainable code base. We did a really good job of separating out our concerns. The authorization logic is now a separate module that we can test independently, all of our service entities are separate, and the persistence layer and business logic are separate, so we can test them all in isolation. And for me, the best thing of all: we've been running this since November and, literally touch wood, even though it's plastic, I haven't been woken up in the middle of the night, and none of my team have either. We're doing a not-insignificant amount of traffic through this service. It's not web scale, but it's pretty high, right? It's significant. And we've had no real large incidents yet. It runs stably, the graphs look consistent, and there have been very few incidents in general so far.

So I want to give you a whirlwind tour through some parts of the system. We're going to go through these four parts, and hopefully they'll be interesting to those of you who might be doing something like this yourselves.

First of all, the intermediary API. As I said before, it's a consumer of our new API that maintains the contract between the legacy API and our clients. And that contract basically means stringifying all the things, which isn't that fun, but someone had to do it. What it bought us was the ability to ship this brand-new infrastructure and architecture without having to change everything downstream. For us, that was a really great way to show wins inside the business and move things forward while making sure everything that came before still worked. If we hadn't done this, we essentially would have had to update all of the client applications; those teams would have stopped development on the product while they waited, and the rollout of this whole new infrastructure would have been bottlenecked on those services. And, you know, this is an absolutely throwaway part of the system. It effectively meant we were building two APIs, the brand-new API and then this intermediary API as well, so we added about 100% to our development time. But given the way we looked at resourcing, we felt it was the best way to approach the problem. And we're looking to throw it away: all of the client teams now have it on their roadmaps to move past the old API and the old schema and onto the new API.

So I want to show you a little bit about how this works. I haven't shown the HTTP request here, but basically there's a controller, and it calls out to a service. In that service, we replicate what the old data contract was. In this case, that means fetching an asset, which for us is something like a video, and then all of the comments associated with that asset. We wrote a really simple request-library thing that basically executes those requests in parallel and joins the results together, and then we can take that data and do something with it. What we do with it is encode it from a new type to the old type: we're going from the new database schema to the old database schema. I'm a huge fan of protocols in Elixir for things like this; they're a really good way to get this kind of polymorphic, generic behavior into your code base.
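The slides aren't reproduced here, so this is a reconstruction of the pattern with invented names: two fetches run in parallel via Task, then a protocol encodes each new-world struct into its old-world equivalent. `NewAPI.get_asset/1` and `NewAPI.list_comments/1` stand in for the real client functions:

```elixir
# Sketch of the intermediary-API flow. All module names are invented.
defmodule NewAPI.Asset, do: defstruct [:id, :name, :comment_count]
defmodule OldWorld.FileReference, do: defstruct [:id, :name, :comment_count]

defprotocol OldWorld.Encoder do
  @doc "Encodes a new-world struct as its old-world equivalent."
  def encode(new_struct)
end

defimpl OldWorld.Encoder, for: NewAPI.Asset do
  # Move attributes around, then cast to the legacy struct.
  def encode(asset) do
    %OldWorld.FileReference{
      id: asset.id,
      name: asset.name,
      comment_count: asset.comment_count
    }
  end
end

defmodule Munger.AssetService do
  # Execute the two fetches in parallel, join, then encode old-style.
  def show(asset_id) do
    [asset, comments] =
      [fn -> NewAPI.get_asset(asset_id) end,
       fn -> NewAPI.list_comments(asset_id) end]
      |> Enum.map(&Task.async/1)
      |> Enum.map(&Task.await/1)

    %{
      file_reference: OldWorld.Encoder.encode(asset),
      comments: Enum.map(comments, &OldWorld.Encoder.encode/1)
    }
  end
end
```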
So, in our case, what we call an asset in the new world is a "file reference" in the old world. What we're doing is basically moving a bunch of attributes around and then casting the result as the struct for the old world, the file reference. You can imagine the input being a new video, a new asset, and the output being an old file reference. And then we have to serialize everything. We made this really janky little type-coercion thing that takes a typed value and basically turns everything into a string. I did this partly to troll my CTO. So we take all the inputs, turn them into strings, and serialize them back out to the clients. And those clients are none the wiser: they think they're still interacting with the old system.

Okay, the next thing I want to walk through is a hotly debated subject. Again, I think Andrew did a really great job earlier of speaking at length about how to structure umbrella apps well, thinking about contexts and breaking things down. We use a single monorepo, which I think has been very effective for our team size: everyone can contribute to this one source of code. There are 11 apps inside it, which is probably pushing it at this point, and we'll probably start thinking about breaking those out sometime soon. But for now it's been great. It means we're all working in one place, and our deployments are quite simple: every time we push, we can deploy all of this.

The apps look like this. On the left-hand side we have four Phoenix apps: our API, the support tool, Munger (the intermediary API), and an auth app. Then we have a bunch of business logic, and a bunch of shared components. The shared components on the right here are kind of interesting. Really, these could be extracted into a private Hex organization, now that that exists, but we've kept them inside our umbrella application for the time being. For example, everything we do with StatsD we want to share across all the services; we don't want to rewrite that and all its configuration every time, so we just dump it in a little app and import it. Similarly, middleware: we have to set a lot of secure headers, since security is a big concern of ours, working in the video space with large clients, so we put those kinds of shared components in a middleware library. And the DB app is basically configuration and access for the database, with some shared types and things like that we need elsewhere.

Just highlighting a few of these applications: the five I've highlighted here are things we actually deploy and run. These aren't just shared code; they're things that we run and that people can make requests to. All of those applications are built and deployed as separate Docker containers.
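For reference, the umbrella layout looked roughly like this. The app names here are illustrative rather than the exact eleven:

```
apps/
  api/          # Phoenix: the new public API
  munger/       # Phoenix: the intermediary API keeping the legacy contract
  support/      # Phoenix: the internal support tool
  auth/         # Phoenix: authentication
  core/         # business logic: services, policies, persistence
  stats/        # shared StatsD configuration and helpers
  middleware/   # shared plugs, e.g. secure headers
  db/           # repo configuration and shared database types
```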
We use CircleCI 2.0 extensively to do all of this, building releases via Distillery. If you're not familiar with CircleCI 2.0, definitely have a look if you're doing Docker-based workloads. We've been able to parallelize all of our builds, so across all of those applications the slowest build is the one that determines the total time. For us, they all take about five minutes to run. We do blue-green deploys via ECS; you get a lot of this stuff out of the box with ECS, the Elastic Container Service, which I should have said earlier. And everything we do now is auto-scaled, using CPU and memory threshold alarms. It works really well. We basically set it and forget it, and we can see some nice graphs of where we've auto-scaled and how we've met traffic demands.

The other part of the app I want to walk through is the core application. For us, core is, as it might sound, all of the core business logic: everything to do with services, persistence, access policies, lots of deferred logic, everything like that. We've separated it into two contexts. One is for accounts, so everything to do with managing teams and users on the platform, and one is for projects, which covers assets, adding people as collaborators, and so on; we have lots of different collaboration concepts in the application. Effectively, the API and the support tool make use of this core business-logic layer, and they act as very, very dumb wrappers that call into a nicely defined business-logic layer to do all their processing. And actually, the fallback controller in the latest Phoenix has been really great here: you basically write a two-line with statement, and the error cases get handled almost magically in the fallback controller. It's well worth looking at.
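A minimal sketch of that pattern, with invented module names: the controller is a dumb wrapper around the core layer, and anything that doesn't match the with clause falls through to the fallback controller.

```elixir
# Sketch of the thin-controller + action_fallback pattern (names invented).
defmodule API.AssetController do
  use API, :controller

  action_fallback API.FallbackController

  def create(conn, params) do
    # All the real logic lives in the core business-logic layer.
    with {:ok, asset} <- Core.Projects.create_asset(params) do
      render(conn, "show.json", asset: asset)
    end
  end
end

defmodule API.FallbackController do
  use API, :controller

  # Any non-matching return value from the with lands here.
  def call(conn, {:error, %Ecto.Changeset{} = changeset}) do
    conn
    |> put_status(:unprocessable_entity)
    |> put_view(API.ChangesetView)
    |> render("error.json", changeset: changeset)
  end

  def call(conn, {:error, :unauthorized}) do
    conn
    |> put_status(:forbidden)
    |> put_view(API.ErrorView)
    |> render("403.json")
  end
end
```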
And inside this core layer, we have a lot of tests, lots and lots of tests. They run in about 10 seconds; they're very database-dependent, which is usually the bottleneck for us. But it feels good to have that many tests, coming from zero. It's definitely a win for us.

So the third thing I want to walk through is our event system. This might seem really familiar; it's actually really interesting how well some of these talks have flowed together today, so good speaker selection, everyone. Our event system looks a lot like what Andrew was talking about previously: a single system where every change in the app is pushed through an event bus and broadcast to lots of different consumers of that change. It gives us a really powerful way to decouple parts of the system. We use it for things like audit logs, so you can say which person did what thing in the system, and for analytics. I'm sure everyone here is using something like Segment, where you want to record that this event happened for this user. So we use the event bus to decouple these concepts and keep our services really clean: they just push events. Everything internally is implemented using GenStage and protocols. I'm not going to show you the GenStage code, because it's literally taken from the GenStage documentation's example of how to do the :gen_event replacement, and it's been working really well for us.

So, to give you a little example of how this works: we have a service here which, in this instance, is creating an asset in the system. Some function does the create and returns an asset, with the nice tagged tuples we use in Elixir, and then we create a struct. Every single event in the system has a corresponding struct that acts as a kind of placeholder, so we can hook in and do some nice protocol magic in a minute. Then we push it through our broadcaster, which sends it out onto the event bus. And then, as I said, I'm not going to show you the broadcaster, but this is one of the consumers. Our auditor here is where we write audit logs in the system. An audit log is which person did what thing on what resource, and we write all of that through this model. The event comes into the consumer, we call the protocol with the event, and if it returns something we're expecting, that is, if the with statement matches, then we insert an audit entry. It's very simple, but it works very well for us right now, and we've scaled this pattern out a lot.

This is what the implementation might look like. On the right-hand side here you've got the asset-created event, and we're implementing the protocol for that event struct. That thing returns an audit, which, if you look at the previous slide, is what the with statement was expecting. We just build up that struct, return it, and we're done. And we've actually scaled this pattern out for a ton of consumers: we have audits, we have analytics, we have all of our notifications, we have things that broadcast to our socket service. Lots and lots of different events, and they all flow through this single event bus and get sent out accordingly.

One caveat with our implementation at the moment: the broadcaster here notifies every single one of those consumers, because that's how we're using GenStage right now. There are definitely different ways to do this, as Andrew showed you before, where you might have more of a pub-sub model: a consumer says "I only want to subscribe to this type of event," and you maintain some kind of subscription list. And all of our consumers are implemented with dynamic supervisors, so we get lots and lots of processes handling each one of these requests as well. Cool.
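Again, the slides aren't visible here, so this is a reconstruction of the pattern with invented names: an event struct per change, a broadcast after the successful write, and a protocol that each consumer implements for the events it cares about. `Events.Broadcaster` stands in for the GenStage producer taken from the docs:

```elixir
# Sketch of the event-bus pieces (all names invented).
defmodule Events.AssetCreated, do: defstruct [:asset, :actor]

defmodule Core.Projects do
  # The service creates the asset, then pushes an event onto the bus.
  def create_asset(actor, params) do
    with {:ok, asset} <- %Asset{} |> Asset.changeset(params) |> Repo.insert() do
      Events.Broadcaster.broadcast(%Events.AssetCreated{asset: asset, actor: actor})
      {:ok, asset}
    end
  end
end

# Each consumer decides what an event means to it via a protocol.
defprotocol Audits.Auditable do
  def to_audit(event)
end

defimpl Audits.Auditable, for: Events.AssetCreated do
  def to_audit(%{asset: asset, actor: actor}) do
    %Audits.Audit{actor_id: actor.id, action: "asset.created", resource_id: asset.id}
  end
end

defmodule Audits.Consumer do
  # Inside the GenStage consumer, each event is handled roughly like this:
  def handle_event(event) do
    with %Audits.Audit{} = audit <- Audits.Auditable.to_audit(event) do
      Repo.insert(audit)
    end
  end
end
```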
Cool. So the last thing I want to talk about in this overview of the system is how we moved the millions of records we had: how we went from DynamoDB to Postgres. What we made use of here is Flow. I'm not sure if everyone in the audience is familiar with Flow; it's an abstraction on top of GenStage that acts a bit like a stream, where you take in a source of data and then fan out, with loads of processes acting on that stream of data. It's a really great fit for our use case: we stream the entirety of a DynamoDB table and then fan out, with loads and loads of processes taking those DynamoDB records, translating them, and inserting them into the Postgres database. Flow is a great tool for data processing, and I'll show you a little snippet of Flow code and how easy it can be to use something like this in your own code base as well. Our largest table was nine million records, and in total we had to sync every single table; there were about 22 tables in the system.

It took us about an hour of downtime. We had to turn off the old database, shut down the whole system, kick off all these data-migration tasks, wait an hour for everything to finish, and then turn on the new system. Yeah, don't do that. Don't ever do that. It was a good idea, and it kind of worked for us.

How we implemented this: we have all these schemas that are really thin wrappers around DynamoDB tables. You might recognize the shape, because it looks a lot like Ecto. We wrote a really simple wrapper that acts like an Ecto schema, so we can define lots of different typed fields, and this thing does all of our type coercion from strings into the appropriate types for us as well, which made it a bit easier to deal with some of this data. So we define lots of these schemas, and then we have a bunch of Flow code, which I'll spend a couple of minutes walking through so everyone understands it.

The Statix.measure part is how we get statistics and timings off this kind of function call. Inside that, we scan the DynamoDB table. A scan basically says "give us lots and lots of records," and we apply a limit to how much we fetch at a time, which differed per table depending on the size of the data set. So we start the stream by scanning the table, and then we start the Flow pipeline. Flow has a from_enumerable function which, as you might imagine, builds a flow from a list or a stream. Then we map over it, turning each item coming out of DynamoDB into one of those Dynamo structs I showed you before. Then we partition, so we partition by the IDs and fan out to lots and lots of processes, and each of those processes in turn calls a function that takes the Dynamo schema, turns it into a Postgres schema, and inserts it.
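Putting the walkthrough together, the pipeline looks roughly like this. Names are invented: `DynamoLegacy.Asset` stands in for the thin Ecto-like wrapper, `scan_stream/2` for the paginated Dynamo scan, and the `from_record` step described next is shown at the bottom:

```elixir
# Sketch of the migration pipeline described above (all names invented).
defmodule Migrate.Assets do
  use Statix  # assumed StatsD client, for the measure/2 timing call

  def run do
    measure("migrate.assets", fn ->
      DynamoLegacy.scan_stream("assets", limit: 500) # lazy stream of raw items
      |> Flow.from_enumerable()
      |> Flow.map(&DynamoLegacy.Asset.cast/1)        # raw item -> coerced struct
      |> Flow.partition(key: & &1.id)                # fan out across stages by id
      |> Flow.map(&from_record/1)
      |> Flow.run()
    end)
  end

  # Translate one legacy record into the Postgres schema and insert it.
  defp from_record(%DynamoLegacy.Asset{} = old) do
    %Core.Asset{}
    |> Core.Asset.changeset(%{
      id: old.id,
      name: old.name,
      view_count: old.view_count  # already coerced from string to integer
    })
    |> Core.Repo.insert!()
  end
end
```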
I know that's a lot, so I can answer questions at the end, and honestly I wish I had more time, because I could go through lots more of this; if you want to talk about it, grab me after the talk. The from_record step is that function which takes the Dynamo schema, turns it into a Postgres schema, and inserts it into Postgres. We're doing lots and lots of those in parallel, as many as we can, and we spent a really long time tuning the workload so we could do this quickly and efficiently; lots of per-table tweaks and configuration are needed to do this kind of thing. And every single table migration ran in its own Docker container, so all of these table processes run in parallel at once: we're parallel-processing and then parallel-inserting all of these records. We basically got the biggest Postgres box we could get and just let it go wild. And that's how we did, I think it was about 20 million records, in an hour and 10 minutes in total. Very, very cool; Elixir has many tricks up its sleeve. Flow is part of the GenStage library, I believe, but I could be wrong. Is it... no, it's separate. Thanks, James; he gave me a nod. It's really worth looking at if you're doing these kinds of large data-processing workloads and you don't need something like Hadoop or one of those bigger MapReduce systems, because Flow is effectively doing a lot of that for you. We kept running this thing weekly, kept dumping all this data, and were doing trial runs all the time to make the process smooth, because, you know, taking all of your customers' data and moving it to a new database is kind of terrifying, so we did as much as we could to mitigate the risk of error here as well.

Cool. So I want to talk a little bit about some of the challenges and takeaways we've had from doing this. There was a lot to go into, and I know I've gone through some of it quite quickly, but I'll recap some of the biggest problems we had on this journey. As I said before, when you have a system with no tests, where basically everything about how it works is hearsay, re-implementing that system is really difficult and error-prone. So we ended up with tons of bugs, tons, especially in that intermediary API, because, oops, we'd find out that someone had actually implemented something a different way, and the system worked in a way we wouldn't have expected. There weren't many ways to deal with that problem other than lots and lots of manual QA, and that was an error-prone and slow process. We eventually got there and ironed out most of the kinks; it just took us about six weeks longer than we thought it would to ship the system, which I think is a reasonable slip given the complexity of what I've gone through today. As I said before, we also had lots of new developers trying to learn Elixir and ship a project at the same time, which is a lot to take on. These were brand-new members of the team as well, so they didn't immediately have that institutional knowledge about how things should work. So what we did was have a few of the more experienced members of the team go ahead, establishing a bunch of patterns and laying out the foundations, and
then letting the other members come in and work on top of that, and that actually worked really well for us. We implemented lots of style guides, we used Credo as a linter, and we try to enforce a lot of quality in the code: everything has to have documentation on all of our public functions, and we generate docs from everything here as well. We enforce lots and lots of conventions on the team, and I think that really helped people get over some of the learning hurdles.

The other thing is that shipping a new system on a brand-new database, where you haven't run the workload at production scale, is very difficult, especially with Postgres, where the performance between what you're running in dev and in production can be wholly different. In dev and your QA environment you might have a few thousand records, while in production you might have a few million, and the way indexes work and things like that are very different between those two. So what we ended up doing was replaying a bunch of traffic and really thinking about those kinds of issues before we shipped, so we could get a handle on how it would run in production. And the other part, as I said before, was that we used a single service as a test case to get Elixir into the team and to learn how to run it at scale, and that was insanely valuable. So if you're thinking about doing a migration like this, break it down if you can: start with something small and then work inwards from there.

So, a few takeaways. The first: Elixir was definitely a huge win for us, but it might not be for you. Rewriting a system just because there's a new language might not be the best idea, right? Think really carefully before you embark on a rewrite. Really think about what you're trying to solve and whether there are other ways to do it. Second-system syndrome is a real thing: doing it is going to take you twice as long and cost you twice as much. So just think very, very carefully about doing it. Elixir was great for us basically because of the concurrency model, and because we could ramp a lot of our developers up and use these really great attributes of the language to promote explicitness and get a much more maintainable code base out of it. But "your mileage may vary" is a saying Desmond and I use a lot on the podcast, and it's definitely true here as well.

Takeaway two: if you do a rewrite, don't move databases at the same time. This is basically flying-by-the-seat-of-your-pants stuff. Moving millions of records of business-critical data is scary; it's terrifying. I kept reassuring the CEO: "Yeah, it'll be all right. Don't worry, we've tested it. We tested it." But even on the day we did it, I was crossing my fingers and hoping for the best. We actually developed a lot of tooling to help us with this process, so we could easily re-run a single customer's data if it didn't migrate properly, and things like that. Fortunately, we only lost one customer's data during the entire migration, and we were able to recover that data as well, so we lost none. That was a win.

And the third takeaway, which I think I put in every talk because I absolutely love this article: this quote comes from an article called "Write code
that's easy to delete, not maintain," and I really, really love this quote from Tef: good code isn't about getting it right the first time; good code is just legacy code that doesn't get in the way. And that's how I like to think about our systems. We've re-architected, we've embarked on this new journey, and hopefully we've put much better boundaries around the different parts of the system, so those bits of code don't get in the way, and if we do need to get rid of one, it's much easier to do so. Design your systems like that. Think about keeping things nice and isolated. A module is basically just a grab bag of related functions, right? It gives you a really easy way to collect together all these things that do the same job. Start thinking about that modularity before you embark on a project like this, or any kind of big system in general.

Cool, so thank you very much, and I'll definitely answer questions. There's a question, oh yes. So he asked: how can you lose one customer's data? That's a very good question, and I don't know the exact answer, but we basically had a bunch of duplicate users in the system, and when we were doing our import, we used the user as the record we fanned out and down from. We were cross-referencing the records to guarantee consistency and make sure everything was there, but I think that one user escaped our wide net.

Yeah, so the question is: did you keep the DynamoDB database around before and after the migration? The answer is that it's still there; we just lowered the read and write capacity to one. So we basically have a backup of that data, which is good in case someone comes around and says, "Hey, I don't have any of my data, and I haven't logged in for six months," and we could hopefully help them. I'm still nervous to delete it, to be honest; I think I always will be. I think what we're actually going to do is take a backup and keep it in S3 or something.

Great question: did I have a contingency plan to go back if things didn't go right? Absolutely. We actually ran drills of this deployment about three or four times before we did it, and we had clear rollback points at every step of the way up until the data migration. Once we'd turned off all the services, if we had turned everything back on and then realized we needed to revert, we would have had a gap in the data, and that might have been an acceptable loss for us given the severity of the bug. Fortunately that didn't have to happen, but we practiced it in our drills as well.

That was a question about Flow and how we got to our ideal batch size using it. The answer is basically a lot of trial and error. It was different for us per table: the sizes of the tables varied, and so did the records inside them. I mentioned on one of the slides that one of the largest was about 100 kilobytes per record, and Dynamo actually has a scan limit on how much you can return in a single request, so we were bounded by that. Then it's a question of how much you can pump through Dynamo and fan out. Really, it was massive trial and error: tweak it, see if it took longer, drop the DB, run it again. Lots and lots of trial and error, but that's why we started doing these migrations way before we got to the finish line.
Sorry, the follow-up question is how many weeks that took. I would say we were doing it for about three months, probably running it once a week. And we needed all the data in dev and our QA environment anyway, because that's where we were doing all of our acceptance testing against the new API and the intermediary API, so we needed to run it all the time regardless.

Great question as well: did we continue feature development during this time? The answer is: somewhat. Unfortunately, again being in that startup environment, it's very hard to stand still. We did pretty much get to a point where we said we're not adding any new features to this API, but the front ends kept moving, so what we tried to do was stay in sync between the client teams and the back-end team, to make sure their changes were reflected in the new schemas and we weren't drifting apart there.

Oh yeah, great question as well: how did we determine the percentage of test coverage? I think a test-coverage percentage is a bit BS, to be honest, and I think a lot of people disagree with this, but for me anything over about 70% seems okay. What we really optimized for was testing the business-critical paths and a lot of the common cases. I mean, there's so much to cover, right? It's a whole three-year-old system re-implemented from scratch. So look, I'm not dogmatic about having 100% coverage or anything like that on my team. I just think every feature has to have tests, and we need to make sure we're testing the main cases, and hopefully we can use something like property-based testing in the future to try and improve that as well.

And yes, the question is: we had three developers who were learning Elixir, and I mentioned that we had some patterns in place; what were examples of those? One example is that I actually did all the bootstrapping of the app. I would go in, do the controllers, break out the service, break out the policies (the "can someone access this?" logic), lay out all of those foundations, then get it code-reviewed and do a walkthrough and a pairing exercise with one of our other developers to make sure they were fully up to speed. Then they could run with it and follow a lot of those patterns. There was a lot of hand-holding in the beginning, but now, you know, this team is running off, building new features and code themselves, and everything's great. We've actually brought on more members of the team now as well, and the others are able to teach them, so it's been great.

Yeah, and the last question: did we consider any other framework besides Phoenix? Do you mean outside of Elixir as well? Right. We didn't. I'm actually of the opinion that Phoenix is small enough that I don't feel there's a lot of bloat there. The whole Sinatra-versus-Rails thing, I understand, because Sinatra felt a lot smaller than Rails did, and there's a lot of superfluous stuff in Rails, especially now with Action Cable and so on. But Phoenix is small enough that if we don't need some part of it, we just remove it: we didn't need channels in the API, so we got rid of them. That said, I'm actually not that familiar with any of the other web frameworks either. I guess we could have just used Plug, but I don't know if I'd want to do that. Cool, thank you. Thanks a lot, Chris.