So yeah, good morning everybody. I'll be talking about Amanda, our distributed services platform. I won't be showing as many pretty pictures, I just hope the talk can live up to the amazing work everybody else has been doing. A couple of things about myself first: I've been a software developer at MPC since about 2010. I've been working with Python since 2009. I love services and everything that's plugin-based. I'm slightly obsessed with monitoring after various phone calls at three in the morning. And I had the great opportunity to actually hold an Oscar for Life of Pi. It was a great time down there.

So I'm part of the infrastructure team at MPC. I've been working there since, like I said, 2010, and we create visual effects for advertising and feature films. These are a couple of the movies that we have been working on recently. And we do this across eight sites with what we call a fully integrated cross-site pipeline, which makes sure that our data flows from one site to the other, depending on what the departments are and where they work.

So I guess not everybody here might be specifically familiar with what visual effects are. This is a quick quote from Wikipedia. It pretty much comes down to everything that is either expensive, dangerous, or would hurt an actor during filming, so we try to avoid it. But a couple of actual images of the work that we do are probably going to explain it a bit better. So this is a shot from World War Z as we got it in from the client, and this is the actual work that we did to it. Everything that you see there in the background is absolutely fake. Same thing here, this is a shot that we got in from Godzilla, one of our latest movies: this is what we got in, and this is the actual work that was done to it. If you look closely, you can even see that the guys in the tank got replaced with CG characters. That's how far we push things these days.
So to do this we work, of course, with a lot of assets, where an asset is something like a creature or a texture or whatever else we need that is fake, and we make sure that it flows through the whole system. To do that, the artist first does a bit of his magic. Once that is done, he creates what we call a daily, which is a short movie to show the work that he has been doing, and that can then be reviewed by the supervisors. The supervisor can then approve the asset and, of course, add some comments and notes from there. Once it's approved, we go through a releasing stage where a lot of things happen: we create directories where we store our data, we add in actual metadata about the asset as well, and we make sure that everything flows into the next department. So here, for example, we've got our modeling team, which creates an actual character, and then the next department adds some textures. And while we release, we make sure that we update all of the dependencies, we notify all of the different artists that new things have been released, and we also make sure that we sync any data that we have to all of the different sites.

So, of course, we have to keep a couple of things in mind. Doing this, there's not one artist working, there are about 1,600, and they release thousands of versions of assets a day. We have to keep that in mind, but also an ever-changing schedule: one day it might be quiet, and the next day we might have a completely different schedule with a trailer that needs to be delivered in a couple of weeks. Which means that we have a whole lot of different sources that we use, coming from a database, third-party APIs, storage, a whole lot of different locations. And they're used by in-house tools that we have been writing, third-party applications that artists tend to work with and, of course, a whole lot of different environments.
So we don't work with one single environment; we've got a whole range of different environments, which means that for every single show we can have a specific set of tools that they use, with specific versions, where a different show might be using completely different ones. Other things to keep in mind are the users themselves. The artists want something that's quick and easy to use, something that's consistent. They don't want to have to worry about "am I using this API, so I have to do it this way, or am I using that API, so I have to do it that way". We also have to keep in mind that these artists are not necessarily trained developers, but they do write code; they hack around quite a bit. We need to make sure that we can present them data in a way that is safe for us and safe for them, so we want to expose only certain parts of our data to them in a nice and consistent way. Similar for developers: we have developers of any level coming in. Some of them are trained more on the visual effects side of things, others are trained in asset management, but they're not necessarily trained in anything that's distributed or scoped across eight different sites around the world.

So to do this, we developed a service-based architecture called Amanda. We provide it as a platform as a service to all of our different artists and developers. It's a multi-protocol setup with multiple transports and multiple concurrency models. I'll be going into every single bit throughout the different slides, but this is just a small introduction to what it is. And we try to provide an ecosystem where developers of any level can write a service. Anybody that comes in on their first day should be able to write a service during that day and get it into production by the end of the day. We're currently running our second generation, which was written in 2012 and went live in 2013.
It replaced our first generation, which was a push model, and that caused a lot of problems. As soon as a request would come in, it would start scaling with extra threads and just keep running and running and running, and there was no way for us to limit that in any sort of way. So we have now moved to a queue-based model, which allows us to limit things a bit more nicely and make sure that we have a specific flow and can control that flow in a way, way nicer way.

So just some stats. Godzilla, which is one of our latest movies, like I said before. We have a render farm with thousands of CPUs, but if we had rendered it, which means creating those final images, on one single machine, it would have taken 444 years, which I guess is a fair amount of time. And we've got 650 terabytes worth of data that went through the system as well. That generated, during our peak times, about 250,000 Amanda requests a minute, which is 120 million requests in eight hours. And since we're in Germany, most of you will have seen the Brazil-Germany game: it's about four times the amount of tweets that were sent about that specific game, and congrats, Germany, on winning, by the way.

So I'm now gonna step into how we have been setting up the whole system from the ground up. I'm gonna start with the actual service, and the way we have done that is that the service is nothing but a class. So we're gonna make a make-movie service here. We've got 20 minutes to make a movie, which is probably gonna be a bit short, but let's try anyway. We're gonna start with greeting the director, because we need to get some work in, which is your typical hello-world scenario. And the important bit here is that it's a class. It's absolutely standalone. It's completely testable.
And you don't depend on any of the tools or any of the scaling features of Amanda, which is very important for us because we don't want people to have to worry about any of these things. We have these little decorators here called @public. We also have @protected and @private, which allow us to declare which methods are available throughout the system for other people to use. @public means that an artist and a developer can use it from outside; @protected means that you can only call it from a different service.

So cool, we have a service now, but it's not actually doing anything useful, and it's definitely not gonna help us get the kind of ratings that we've been having on Rotten Tomatoes. So let's actually make it do something. To do that, we provide inter-service calls, as we call them, which allow us to call different services. The way we do this is by declaring a dependency inside that class. So I can say I have a dependency on the storage service, and here I'm using the storage service to check if the data is on disk: I can do self._storage.check_exists and pass in the parameters that it needs. At that point, of course, we also need some information about our show itself, and we can do that with what we call infrastructures. An infrastructure is a way for us to formalize our access to backends such as databases, logging, configuration, sessions, and those kinds of things. In here you see the _db, which is an actual infrastructure, and it just provides the users with a nice, clean and consistent way to access databases. Infrastructures are themselves services, but they're stateful services, so that we can do things like pooling and caching and those kinds of things. And they are local to the service, so these services are not spread across the system.
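The service-as-a-class idea with visibility decorators and injected dependencies could be sketched roughly like this. All the names here (the public/protected decorators, MakeMovieService, check_exists, the storage and _db objects) are my illustrative assumptions, not the real Amanda API:

```python
# A minimal sketch, assuming hypothetical names; not the real Amanda API.

def public(method):
    # Mark a method as callable by artists and developers from outside.
    method.amanda_visibility = "public"
    return method

def protected(method):
    # Mark a method as callable only from another service.
    method.amanda_visibility = "protected"
    return method

class MakeMovieService:
    """A plain, standalone, fully testable class: no Amanda imports needed."""

    def __init__(self, storage, db):
        self._storage = storage   # dependency on the storage service
        self._db = db             # stateful, service-local infrastructure

    @public
    def greet_the_director(self, name):
        return "Hello, %s!" % name

    @protected
    def asset_on_disk(self, path):
        # Inter-service call through the declared storage dependency.
        return self._storage.check_exists(path)

class InMemoryStorage:
    # A test double, showing why the plain-class design is easy to test.
    def __init__(self, paths):
        self._paths = set(paths)

    def check_exists(self, path):
        return path in self._paths
```

In a test you can hand the service an InMemoryStorage and call it directly; in production the platform would inject the real storage service instead, without the service code changing.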
They are inside that same Python module, which is what allows us to do the pooling, of course. And the really, really nice bit about this, and I'll be hammering on this quite a bit throughout the whole talk, is that we can swap any of those services for other services. For example, in this case, here at the bottom you've got our config, where we're getting something out of the configuration. In a development environment this could be a dictionary; in production this could be an XML file, a YAML file, whatever file. And we can swap that in and out with different services without, once again, the actual developer of the service having to change anything in his code.

So now we've got something that does something, but it's not very useful in any sort of way. It's not scaling, it's still local on one person's machine, and we also don't have that bit where we can provide a consistent interface. But we did create all of the abstractions that we need, so that we can change any of the parts that we already have for other parts that we might want to use in the future. So let me introduce you to the service provider, and this is how you actually create one of these service providers. This is what gives us that consistent interface: it hosts the services for us. At the bottom you can see we create a make-movie service here and our storage service, we pass them in, and we can then call them with services.make_movie and make my movie magic happen, or services.logging and actually change the logging level. That's the kind of thing that we allow them to do. But we're still not able to scale in any sort of way. So we came up with the idea of proxies. Proxies are stand-in services for the requested service.
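The service-provider idea, one object hosting the registered services and handing out a single consistent interface, could be sketched like this. The ServiceProvider class and the registered names are assumptions for illustration:

```python
# A rough sketch of the service-provider idea; names are illustrative.

class MakeMovieService:
    def make_movie_magic(self, title):
        return "Now filming: %s" % title

class ServiceProvider:
    # Hosts the services and hands out one consistent interface to them.
    def __init__(self):
        self._services = {}

    def register(self, name, service):
        self._services[name] = service

    def __getattr__(self, name):
        # Attribute access falls through to the hosted services, so callers
        # can write services.make_movie.make_movie_magic(...).
        try:
            return self._services[name]
        except KeyError:
            raise AttributeError("no service registered as %r" % name)

services = ServiceProvider()
services.register("make_movie", MakeMovieService())
```

The point of the indirection is that what sits behind a name can be swapped (a local instance, a proxy to a queue) without the caller's code changing.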
So they pretend to be the service that you want, but they're not really that service; underneath the hood they just stick your data into a queue, and the queue can pass it on to whatever is on the other side. We also tend to call these queues transports, and once again they're completely transparent to the user. The user doesn't have to care, and the service developer doesn't have to care, where his data is coming from. Queues, or transports, allow us to abstract away technologies like RabbitMQ, ZeroMQ, UDP, any of those things, and they allow us to transparently swap out things like adapters. So if one day I want to use librabbitmq and the next day I want to use py-amqp, I can, without, once again, having to change my service, my service provider, or anything else; I just have to change the transport bit, which is configuration.

So at this point we can scale a bit, but it's still going to be expensive to run 250,000 requests simultaneously, because we'd need a whole lot of these services running. So of course you want to do some parallel processing and some concurrency kind of things. Service developers, for us, shouldn't have to worry about how they're doing concurrency and how that works. They do need to know if something is going to be CPU-intensive or IO-intensive; that is something that we do want them to think about. But we don't want them to think about "oh, I'm going to have to spawn a thread there and do this, that way, that way". We think that we should accommodate both, because some tasks can be CPU-intensive and other tasks can be IO-intensive, and we don't want them to worry about that. We want to be able to use threading in one place and greenlets in another, or even multiprocessing if we wanted to. So far we have been building this little block here, which we have been seeing, and what we did is actually stick a worker pool in front of it.
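The proxy/transport idea can be sketched with a plain in-process queue.Queue standing in for RabbitMQ or ZeroMQ. ServiceProxy, worker and StorageService are illustrative names, not the real Amanda pieces:

```python
# Sketch: a proxy pretends to be the service but only puts the call on a
# queue; a worker on the other side runs it on the real service.
import queue
import threading

class ServiceProxy:
    def __init__(self, transport):
        self._transport = transport

    def __getattr__(self, method):
        def remote_call(*args, **kwargs):
            reply = queue.Queue()
            self._transport.put((method, args, kwargs, reply))
            return reply.get(timeout=5)   # block until the result comes back
        return remote_call

def worker(transport, service):
    # Consume one request from the transport and run it on the real service.
    method, args, kwargs, reply = transport.get()
    reply.put(getattr(service, method)(*args, **kwargs))

class StorageService:
    def check_exists(self, path):
        return path.startswith("/jobs/")

transport = queue.Queue()
storage = ServiceProxy(transport)   # callers just see "the storage service"
threading.Thread(target=worker,
                 args=(transport, StorageService()), daemon=True).start()
found = storage.check_exists("/jobs/godzilla/plates")
```

Swapping the queue.Queue for a RabbitMQ or ZeroMQ adapter changes nothing for the caller, which is exactly the transparency the talk describes.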
The worker pool provides a simple interface across the various concurrency models, and the pool is fed with requests from internal queues that are filled by consuming from our transports. Once again, workers can be changed and extended, just like you would do with middleware, so you can build a whole nice setup here. At this point we've got a nice little building block that we can reuse everywhere. So we really have all of the building blocks that we need to start building a slightly larger system. And the nice thing about this is that we can start chaining these blocks together, and that's what we did in production.

So in production we have a cross-language pipeline. We don't have only Python; we've got 95% Python at MPC for most of our tools, but of course we need some C++ for anything that is really, really heavy, like graphics. We have some JavaScript lying around for some of the web tools, we have Lua, we have a whole bunch of other ones, and we want to be able to present all of the data that Amanda and all of the services have to all of these different languages in a nice and consistent way. So what we did is replace our first worker pool with uWSGI and Flask. Nice, lightweight and simple. Here's just a little zoom-in so you can see where it changed. That allows us to use HTTP quite effectively: it allows for simple clients in every single language. I mean, any language these days should be able to make an HTTP call, and it's a nice, simple client that people can use, and people don't have to worry, "ooh, I'm gonna have to do threading to use this transport or that transport". We take care of that and we take that away.
It does limit us to native types, because our HTTP transports carry either JSON or XML, since JSON and XML are pretty much available across all those languages as well. But it does limit us to native types, so we need to start extending the encoders and decoders to deal with those issues. Our front end here is a uWSGI/Flask worker, and we don't really do any work in Flask except for session handling, which is itself an actual service. The rest is just proxied across to RabbitMQ, where RabbitMQ takes care of the distribution across all of the different services that we might have running.

So at this point we've got a system that can be distributed and that is available to all of these different languages. Of course, we wanna make sure that it's fault-tolerant as well. So what we did is run two instances of those uWSGI/Flask workers and stick NGINX in front of them to do load balancing and failover, nice and easy. And we actually run a non-clustered RabbitMQ setup. Rather than clustering RabbitMQ, for those who are familiar with it, we run multiple instances of RabbitMQ. What that gives us is that we can use our proxies to consume from multiple queues and transports at the same time. So like I said before, we can swap any of these transports for a different transport, and we can go as far as running one RabbitMQ and another RabbitMQ, but we can also run RabbitMQ, ZeroMQ and Redis at the same time, and we can start consuming with one single proxy from all of these different transports at the same time. So if in the future something nicer comes along, or something better comes along, or our whole setup changes, we don't have to rewrite the services, we don't have to rewrite anything else, we can just swap all of these bits in and out.

So with that going, the last bit that is left is monitoring, which I'm quite keen on and which is something that needs to be done.
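Extending the encoders and decoders to get past JSON's native types might look like this. The AssetVersion type and the "__type__" tagging convention are assumptions for illustration:

```python
# Sketch: teach a JSON encoder/decoder pair about one custom type.
import json

class AssetVersion:
    def __init__(self, name, version):
        self.name, self.version = name, version

class AmandaEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, AssetVersion):
            # Tag the payload so the decoder knows what to rebuild.
            return {"__type__": "AssetVersion",
                    "name": obj.name, "version": obj.version}
        return super().default(obj)

def amanda_decoder(payload):
    # object_hook runs on every decoded dict; rebuild the tagged ones.
    if payload.get("__type__") == "AssetVersion":
        return AssetVersion(payload["name"], payload["version"])
    return payload

wire = json.dumps({"asset": AssetVersion("kong_fur", 12)}, cls=AmandaEncoder)
roundtrip = json.loads(wire, object_hook=amanda_decoder)
```

The wire format stays plain JSON, so simple clients in any language can still read it; only clients that know the tag get the richer type back.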
So what we did is assign an ID to every single request. As soon as it comes in, we make sure that it has an ID, and that ID follows the request throughout the system. So if I go from service A to service B to service C in Vancouver, and it blows up in Vancouver, I'll have a trace that it blew up in Vancouver, because every single request is logged, and I can search on those request IDs throughout the system and find the whole trace of all the different requests.

Since we really love our services and service-based architectures, we made sure that we have a statistics service and a logging service. So the data that we have in here, for example at the bottom, where we have a calculation of how long it takes to get from the front end, so from uWSGI to the end of RabbitMQ, or the amount of time it actually took to execute the request, we can map these onto the system itself, and we send those to the statistics service or the logging service. What that allows us to do, once again, is that if we're now using Carbon and at some point we want to use StatsD, we can change the statistics service without having to change everything else. The nice thing that we did with our workers, since they can be wrapped and a whole lot of things can be done with them, is that we have one single worker that executes the request, and that worker is wrapped in a statistics worker. As soon as the request is done being handled, and since we have transports and queues, we already have the result going back to the client; that is the point where we actually start doing our stats calculation. So we don't have the overhead of doing our stats while we're still executing the request; it's done afterwards. Of course, there's a bit of bookkeeping up front, but other than that, it all happens afterwards.
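The wrapped-worker idea could be sketched like this: one worker executes the request, and a statistics worker around it records the timing only after the inner worker has finished, so the measurement adds nothing to the caller's latency. All names here are illustrative assumptions:

```python
# Sketch: middleware-style worker wrapping for after-the-fact stats.
import time

class ExecuteWorker:
    def __call__(self, request_id, func, *args):
        return func(*args)

class StatsWorker:
    # Wraps an inner worker: run it, then record the timing.
    def __init__(self, inner, stats_sink):
        self._inner = inner
        self._stats = stats_sink

    def __call__(self, request_id, func, *args):
        start = time.time()
        result = self._inner(request_id, func, *args)
        # In the real system the reply is already queued back to the client
        # by this point; the stats calculation happens after the fact.
        self._stats.append((request_id, time.time() - start))
        return result

stats = []
worker = StatsWorker(ExecuteWorker(), stats)
answer = worker("req-42", lambda a, b: a + b, 20, 22)
```

Because the wrapper has the same call signature as the worker it wraps, layers can be stacked or swapped like middleware, which is the reuse the talk describes.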
Same thing for logging: all of our logs go through a logging service, which allows us to dynamically change our logging levels. Via an Amanda request I can say "change my logging level to debug for this specific service", and we can make those changes on the fly as we need them.

Maintenance-wise, we use Salt. For those who don't know Salt, check it out; it's a really cool tool, similar to Puppet and Chef for those who know those. It's in Python, and we extended it with an Amanda module, so Salt can now bring up Amanda services and can bring up the whole framework for us. And we actually wrapped the Salt client itself in a service, so we use Salt to investigate the system on the fly via actual Amanda requests, to know what's going on in the system without really having to log into the master node. What it really, really gives us is that predefined, repeatable configuration that we need, because we've got eight sites to look after. We want to be able to make sure that what's running in site A is going to be the same as in site B and in site C. We wanna make sure that it's all the same.

So we've got an adaptable, extendable, configurable system at this point. We can change services, swap them in and out as we want to, and we can swap our transports for whatever tools we need. By the way, a big thank you to everybody who has been writing a lot of these modules, like librabbitmq or simplejson or whatever; we use them a lot, and thank you for that. It's very extendable and configurable, and it's all configuration-based. We can abstract the whole system from the system level all the way down to the service level. And we really have a best-of-breed system at that point, where we might build a system for a particular show or pipeline or for any of our specific use cases. There's a couple of things that we're still looking at. Containerization is one: we don't want service A, if the CPU is going crazy on it, to take out service B.
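Changing a logging level on the fly, as a request would, can be sketched with the standard library's logging module. The LoggingService name and set_level signature are assumptions:

```python
# Sketch: flip a named service's logging level live, without a restart.
import logging

class LoggingService:
    @staticmethod
    def set_level(service_name, level_name):
        # Look up the named service's logger and switch its level on the fly.
        logging.getLogger(service_name).setLevel(
            getattr(logging, level_name.upper()))

logging.getLogger("make_movie").setLevel(logging.INFO)
LoggingService.set_level("make_movie", "debug")   # e.g. driven by a request
```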
So we're looking at containerization. We're looking at auto-scaling as well. If you have done investigations just like us and you wanna have a chat about it, that would be great. And we're also looking at the possibility of actually open-sourcing the whole system. So that's pretty much it for the technical side. Sorry I didn't go a lot into actual Python code itself; it's 20 minutes, so really digging into it would be quite tricky. Just a couple of slides: we've got a lot of things in production at the moment. The Jungle Book, keep an eye out for that one; it should be a really, really cool movie. And of course, we are hiring as well, across all our studios, across everything. So either have a look on the website, or have a look at recruitment, or come and talk to me after the talk, of course, as well. Thank you, and yeah, any questions?

The microphone for questions is over there. Just stand up and go there if you have any questions.

Do you deal with versioning of these services in any way?

Do we do what, sorry? Versioning of these services; you provide a lot of services. Yeah, so every single service, as it's deployed, gets assigned a version, and that's why we use Salt as well: we have a configuration set of those services, and if a service changes, we run up a different version of it. We can have a staging and a development environment where we can push those changes out first, run a bunch of tests against them, and spread it out to a couple of users to start using before actually pushing it into production. So every single service is versioned.

Hi, I have a question. How does Amanda differ from, let's say, a standard enterprise service bus? Because I don't understand why you wrote the code from scratch and didn't use, for example, a service bus where you can plug in different services and so on. You mentioned Celery, right?
I'm saying ESB, enterprise service bus, because when you do want to do, let's say, service-oriented architecture, you just use an ESB, and I don't know why you haven't done that? I don't know, to be honest. I'm not too familiar with ESBs, to be honest. It's a technology, not a tool; you use an enterprise service bus when you want to integrate a lot of different environments and so on. You just use an ESB with multiple protocols and so on, and this looks quite the same to me, and maybe we can chat about it. Yeah, let's do it, let's do it, yeah, of course. Interested to learn more about that one, yeah.

Hello, it was a great talk. I have a question about load balancing. What do you use to do it? Have you got any algorithms and metrics? Sorry, I couldn't hear you. What about load balancing? What technology do you use to do it? To do load balancing? Yeah, approximately, or something like that. So at the front end we've got NGINX, which we use for load balancing; we've got multiple uWSGI and Flask instances set up there, and NGINX load-balances between them. And on the other side, in production, we use RabbitMQ to do load balancing. We have our proxies set up, and they have a certain amount of requests that they can handle simultaneously, and if we see that the queues are getting too long, we just start spinning up more services. That's why we're looking at auto-scaling as well, to deal with those issues.

Hey, what do you do about the large amounts of data? Some services operate on data like source images, and they're not available at your other locations around the world. How do you make sure that the data is available, and how does it get pushed around the world? So we've got various things that we do.
One of the infrastructures that we have is what we call a cross-mesh infrastructure. Of course, we cannot always check if something is on storage locally; if we look at storage, we might not have it on storage in Vancouver. So we can make what we call a cross-mesh call: you can do self.cross_mesh with a given site, and you can then use the same service interface to go and call that specific method, say in Vancouver, to check with the storage service in Vancouver whether it's available down there. And then we've got, of course, our syncing queues, which take care of syncing all of the data across all of the different sites, which happens at release time. We have specific rules set up as part of a service that say, okay, this asset has been released, does it need to be synced to any of the other sites?

So that's all service-based? And is it the same for generating data? How do you prevent some artist from generating, like, terabytes of data? Do you just do monitoring and look at the operations? So the requests going through Amanda are very small and very lightweight, so we wouldn't be sending terabytes of data through there. We use our sync service, as we call it, to detect and make sure that the data that needs to be synced is going to be synced. We have large dependency trees on these assets, where we can say, oh, this asset has this texture and this texture and this texture and these kinds of rigs; go and check if we need them in the other sites as well, or are they just doing something like lighting, where they only need the rendered frames, for example? Thanks. Yep. Any more questions? No? Thank you very much. Thank you.
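The cross-mesh idea from that answer, using the same service interface to ask another site's storage whether an asset exists, could be sketched like this. CrossMesh, StorageService and the site layout are illustrative assumptions:

```python
# Sketch: route the same service call to a named site's own instance.

class StorageService:
    def __init__(self, known_paths):
        self._paths = set(known_paths)

    def check_exists(self, path):
        return path in self._paths

class CrossMesh:
    # Routes a service call to the named site's own service instance.
    def __init__(self, sites):
        self._sites = sites   # site name -> {service name -> service}

    def call(self, site, service, method, *args):
        return getattr(self._sites[site][service], method)(*args)

mesh = CrossMesh({
    "london":    {"storage": StorageService({"/jobs/godzilla/plates"})},
    "vancouver": {"storage": StorageService(set())},
})
here = mesh.call("london", "storage", "check_exists", "/jobs/godzilla/plates")
there = mesh.call("vancouver", "storage", "check_exists", "/jobs/godzilla/plates")
```

In the real system the remote leg would of course go over a transport rather than a dictionary lookup, but the caller-facing interface stays the same either way.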