Okay, hello everyone, and welcome to the session. We are very excited to be here talking about our Swift implementation, so thank you very much for coming. My name is Max Katch, I'm a Senior Infrastructure Engineer and a member of the Cloud Builders team at Mercado Libre. Together with Maximiliano, we'll show you how Mercado Libre is freeing itself from expensive NFS storage appliances by simply using OpenStack Swift on top of cheap commodity hardware. If you don't know Swift yet, it is a highly scalable, fault-tolerant and distributed object storage solution with a flexible and intuitive RESTful API that allows you to store large amounts of data. In this session we will walk you through the different stages of our architecture. I will start by talking about what we had at the beginning, what problems we needed to solve, and our move to the cloud, and then Maximiliano will talk more in depth about our Swift implementation, how we improved the solution, and he will also share some tips and experiences about what we did.

But I'm sure some of you have never heard about MercadoLibre.com, so let me give you a short introduction to the company and the challenges we have today. MercadoLibre is the leading e-commerce platform in Latin America. It is the eighth largest online retailer in the world and one of the fastest-growing companies of 2012. It has a presence in 14 countries, including the biggest in the region, Brazil, Argentina and Mexico, and we have more than 81 million registered users. Now that we are an open platform, our APIs receive more than 2.5 million requests per minute, with around 4 gigabytes per second of bandwidth. Those numbers increase in the holiday season to 4 million requests per minute and around 6 gigabytes per second of bandwidth. We have more than 1,600 employees, including an IT staff of more than 380 people between developers and engineers.

But to get here we had to walk a long way and change a lot of our infrastructure, so let me tell you how we started. We felt like cavemen in the Stone Age, because MercadoLibre had a monolithic application, a single product, that was running on physical servers, and we were storing everything in a single NFS volume. As the site grew, we implemented flat Xen virtualization, but we kept storing more and more data on the same NFS storage appliances. As you know, this is hard to scale up, and when you have problems, the maintenance is difficult to manage. Let me give you some examples of why we wanted to change. We had iSCSI block volumes that we needed to create manually and that were directly attached to physical and virtual servers; this was hard to troubleshoot, and we wanted to change that. We also had NFS volumes mounted on top of other mount points, which was confusing, and we even had data loss because of it. We also had performance issues when purging files, because we had millions of files stored in the same directory; this was out of control, and it seriously affected other applications stored on the same filer. So as you can see, every single event happening on the storage side was affecting our platform; we had spikes and even downtime because of it. Obviously the wheel wasn't rolling for us. By this time we had more than 50 terabytes of images and site data stored in the same place, and just one storage cluster per data center taking all the bullets and receiving more than 1 million IOPS.
We were very slow in terms of deploying new infrastructure and analyzing issues, and the way things were, we couldn't really handle a caching layer outage. So performance wasn't good, the IT department was tagged as the bottleneck in the company, and the overall situation was pretty bad for us. We knew this couldn't go on any longer, because we felt like we were chasing an unreachable carrot, and there was no point in adding more and more hardware to our infrastructure if we couldn't take some of the complexity out of our architecture and simplify the way we work, our workflow. So we dreamed about a solution with no vendor lock-in, one that would be flexible and that could scale out just using cheap hardware. We realized that what we needed was a cloud, and together with it, to move the storage layer to the cloud and provide a storage-as-a-service solution. Then we would be happy. So we got fed up and we decided it was time to fight back. It was time for a big change of paradigm in the way we work, in the way we buy hardware, in the way we deploy it, in the way we use our infrastructure and our human resources. That's when we decided to take a leap of faith and move to the cloud. I can tell you that the flight had some turbulence, but in the end we succeeded. Let me tell you the biggest changes we made in our architecture.

First of all, we changed from a heavy, hands-on ops model to a NoOps model. What this means is that instead of having to hand-hold our internal users, we are now free to focus on what really matters to us, which is developing and researching new products and features for our cloud. Second, we migrated away from our flat, manual Xen virtualization and deployed a full OpenStack compute cloud. Now our cloud users are able to provision what they need themselves with simple HTTP calls, so they can get all the resources they need; this frees us, and we are not the bottleneck anymore. Third, we stopped making handcrafted virtual machines using the command line or scripts, and we started to fully automate the installation, configuration and deployment of these virtual machines using Chef. This also freed up some valuable and precious time, because there are a lot of recipes and cookbooks that you can just download and use. Fourth, we changed from stateful to stateless, and this was a really important point, because we had to work hard to change our developers' minds: they needed to understand that resources in the cloud will eventually fail. With that in mind, they can now create applications and services that are fault tolerant by design, so they are cloud ready. And fifth, we stopped manually creating and provisioning storage volumes; now we have a storage-as-a-service solution. Our cloud users are able to create block volumes for themselves, take snapshots if they want, destroy them, and attach and detach them from instances. And what's more important, we are freeing ourselves from the NFS volumes, and they are now using Swift as a backend to store all their files (a small sketch of what that looks like through the API follows below).

So as you can see, what we did was take our crappy, flat infrastructure and transform it into full infrastructure as a service. And with this big change we also managed to upgrade our monolithic application and transform it into a fully open and distributed platform. So we can say that, fuck yeah, we did it, we changed and we evolved.
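To make the storage-as-a-service idea concrete, here is a minimal sketch of what storing and reading a file through the Swift API can look like with python-swiftclient and Keystone authentication. The endpoint, credentials, container and object names are illustrative assumptions, not Mercado Libre's actual setup.

    from swiftclient import client as swift_client

    # Illustrative endpoint and credentials -- replace with your own.
    conn = swift_client.Connection(
        authurl='https://keystone.example.com:5000/v2.0/',
        user='imaging-app',
        key='s3cr3t',
        tenant_name='frontend',
        auth_version='2.0')

    # Create (or reuse) a container and store a small object in it.
    conn.put_container('product-images')
    with open('item-1234.jpg', 'rb') as f:
        conn.put_object('product-images', 'item-1234.jpg',
                        contents=f, content_type='image/jpeg')

    # Read it back: get_object returns (headers, body).
    headers, body = conn.get_object('product-images', 'item-1234.jpg')

This kind of API call is what lets developers treat storage as just another service instead of a mounted filesystem.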
Now we are not cavemen in the Stone Age anymore; we reached the cloud. And let me tell you why we succeeded in this: it was because we chose OpenStack, and we also chose Swift. First of all, we chose them because we love open source. It's in our DNA now, and we love the philosophy behind it; I can tell you that MercadoLibre.com is almost entirely running on open source software. We also love that it is written in Python, because we can take a look into the OpenStack code, we can hack it, debug it, change it and customize it to our own needs, and we can also give back to the community by submitting bugs in Launchpad or submitting patches. Also, as you know, with open source you have no vendor lock-in, you can save costs, and our managers love that. And because OpenStack was built by Rackspace and NASA, which have these big, large infrastructures, it was conceived with scalability in mind and it is highly available. As you know, every core component in OpenStack has a RESTful API, so you can interact with these components and develop your own tools, as we did at MercadoLibre. Also, we have been using OpenStack since the very beginning, so we feel very comfortable with it; we know it from the inside out. And finally, OpenStack grew a lot in the last two years, it got a lot of hype, and it is now a fully mature cloud OS; with every new release you get a lot of very exciting new features. That's why we stick with OpenStack.

Now let me tell you what we are running, which versions of OpenStack we are using in production in our cloud at MercadoLibre.com. For the whole cloud we are using Nova, the Essex release. We use it to run more than 7,000 virtual instances on top of more than 1,000 physical servers. The whole website, I can tell you, is running on this architecture; we have it in production, it's working very well, and we are very happy. Second, we have Keystone, also the Essex release, integrated with our LDAP servers. That means any internal user can just take resources from the cloud without any problems, and we also use it to authenticate our internal applications. We also have Glance, the Essex release, for image provisioning. We mainly use Ubuntu 12.04, the long-term support release, and we also have support from Canonical, which is very useful for us, plus some Red Hat images for better compatibility with our Oracle databases. We also have Quantum, again the Essex release; we mainly use it for better isolation of specific parts of our cloud where we need an extra layer of security on the network side. And last, for our Swift cluster implementation we are using the Folsom release. And now I will invite Maximiliano to tell us in more detail what we have.

That's it, yeah, okay. So first of all, hello to everyone. My name is Maximiliano Inesio, and I am a Cloud Builder at MercadoLibre.com. I will start by talking about our first OpenStack Swift deployment, and then we will go through the different points that we considered important in the process of getting to our current OpenStack Swift deployment. As you can see in this picture, we installed 24 storage nodes divided between two of our data centers in Virginia, each one with four hard disk drives of 3 TB each, 64 GB of RAM and two hexa-core processors. We installed the entire Swift package, version 1.4.8.
That is the Essex release. We configured all the services, the three main ones, account, container and object, and also the background services, replicator, updater and auditor, to run on all the storage nodes. In addition to that, we created four ring zones, also divided between both data centers, and we keep three replicas. This is to ensure that at least one copy of each object in our cluster lives in each of the two places. You can also see the proxy nodes; there are not 12 in the picture, but we have 12. Twelve proxy nodes divided into two groups that expose the Swift HTTP API, with memcache configured to cache ring metadata. And a load balancer layer, as you can see, composed of F5 BIG-IP appliances; we use the F5 to balance traffic between those proxy groups. And you can notice Keystone there in the middle of the proxy groups. Keystone is a really important part of our infrastructure, mainly because we have about 100 departments of internal users inside Mercado Libre, and we needed the flexibility to grant, in a granular way, access to the different cloud services and applications depending on the department and the roles each user has. We use Keystone to do that, and we also integrated Keystone with Swift. Finally, we connected all that infrastructure to a single traffic VLAN, using two 1-gigabit interfaces in each of the storage and proxy nodes in bonding mode, link aggregation.

So we went to production like that, and the results were not as expected. There are many reasons why we could not reach our stated goals, and we listed them here so we can take a look: slow response times, concurrency issues, big issues with support on the hardware side, all traffic on the same VLAN, killing Keystone due to the lack of token caching, peaks of no more than 60K requests per minute, and a lack of Swift architecture knowledge. I will walk through those different points and try to share our experience solving each of them.

I will start with Keystone: killing Keystone due to the lack of token caching. As you know, by default every HTTP request to Swift needs to be authorized by Keystone, and this authorization process caused an increase in traffic to our Keystone cluster that impacted the Keystone service directly. As a result, we got extra overhead on every HTTP request to the Swift API. To solve that, we started caching tokens in our proxies' memcache (a rough configuration sketch of this follows a little further down). This actually solved the problem, because it reduced the traffic to our Keystone cluster and also reduced the overhead on all the authenticated HTTP requests to the Swift API. But we also needed to be prepared for the increase in traffic to our Keystone cluster, so we added more servers to that cluster as well.

Okay, some hardware considerations: high-quality service support and enterprise-grade SATA drives. Those are two points you need to take care of when using Swift in a large-scale cluster of commodity hardware, mainly because all the drives in our Swift cluster are under a heavy concurrent write load around the clock, and this load can cause many broken drives that we need to replace as soon as possible to avoid performance issues. Believe me, we spent a lot of time fighting with our service support to get broken drives replaced, so high-quality service support gives us better response times for those replacements when failures happen.
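Going back to the token-caching fix for a moment, here is a rough sketch of what enabling it can look like in the proxy-server.conf of a Keystone-integrated Swift of that era. Section and option names vary between releases, and the hosts and credentials below are made up, so treat this as illustrative rather than as Mercado Libre's exact configuration.

    # /etc/swift/proxy-server.conf (illustrative excerpt)
    [pipeline:main]
    pipeline = catch_errors healthcheck cache authtoken keystoneauth proxy-server

    [filter:cache]
    use = egg:swift#memcache
    memcache_servers = 10.0.0.11:11211,10.0.0.12:11211

    [filter:authtoken]
    paste.filter_factory = keystoneclient.middleware.auth_token:filter_factory
    auth_host = keystone.example.com
    auth_port = 35357
    auth_protocol = https
    admin_tenant_name = service
    admin_user = swift
    admin_password = SECRET
    # Reuse Swift's memcache so validated tokens are cached at the proxy
    # instead of hitting Keystone on every single request.
    cache = swift.cache
    token_cache_time = 300

    [filter:keystoneauth]
    use = egg:swift#keystoneauth
    operator_roles = admin, swiftoperator

The key idea is simply that the auth middleware reuses the proxies' memcache, so a token is validated against Keystone once and then served from cache until it expires.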
Enterprise-grade SATA drives, for their part, give us better performance in all our Swift services and also reduce the failure rate of the drives in our cluster. We think that in this kind of application, enterprise-grade SATA drives make the difference, because they give us better performance, more durability and a good balance between cost and benefit. We now use SATADOM flash modules to install the operating system on our storage and proxy nodes. This solves a problem we had in the previous OpenStack Swift deployment, which was that we installed the operating system on the same drives where we ran the rest of the Swift services, and when one of those drives failed, the entire server failed. We now avoid that issue by using SATADOM modules for the operating system. And now we are using solid-state drives for the account and container services. As you know, the account and container services use SQLite files to store data about your accounts and containers, so with solid-state drives we improve the concurrent access and the I/O performance of those services. So we use solid-state drives for the account and container services, and we leave the rest of the enterprise-grade SATA drives to be used by the object service. On the network side, we upgraded from 1-gigabit to 10-gigabit network interfaces, also to improve throughput and performance in our networks.

I think this is the most important slide regarding the performance improvement in our current OpenStack Swift deployment: isolate the account and container services from the object service. As I said on the previous slide, the account and container services use SQLite files to store data and metadata about the accounts and containers in your cluster. On the other hand, the object service stores, retrieves, deletes and updates binary files on your filesystems, using extended attributes, and this last service is the most I/O intensive. So when we mixed account, container and object services on the same storage nodes and configured those services to use the same drives, we got performance degradation in all of them, because of the impact the object service has on all the drives. A good practice to avoid that is to isolate the account and container services from the object service, preferably using solid-state drives for account and container, as I said on the previous slide. So, as I said, and I'll say it again: we use solid-state drives for account and container and leave the rest of the SATA drives to be used by the object service.

In addition to that, our users split their data across a larger number of containers to improve the concurrent write performance of their applications, because this way they can balance their write operations between different SQLite files (a minimal sketch of this idea follows at the end of this part). This is really important when you have fast clients, which is our case: we have a large number of fast clients that generate a big amount of concurrent traffic of small objects against our service. So multiplying the number of containers is good advice to improve performance for those users. We also multiplied the number of drives in our cluster to get a better balance of IOPS; by multiplying the number of drives we get a better balance of IOPS and also a better distribution of ring partitions across the drives in our three main rings. So now we decreased the size of our drives and increased the number of drives in our cluster.
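Here is a minimal sketch of the container-splitting idea described above: spread writes across a fixed set of containers by hashing the object name, so concurrent PUTs land in different container databases. The container prefix, shard count and helper name are hypothetical, not Mercado Libre's actual code.

    import hashlib

    NUM_CONTAINERS = 16          # e.g. thumbs-00 ... thumbs-15
    CONTAINER_PREFIX = 'thumbs'  # illustrative name

    def pick_container(object_name):
        """Map an object name to one of NUM_CONTAINERS containers so that
        concurrent writes are balanced across several container databases."""
        digest = hashlib.md5(object_name.encode('utf-8')).hexdigest()
        shard = int(digest, 16) % NUM_CONTAINERS
        return '%s-%02d' % (CONTAINER_PREFIX, shard)

    # With a python-swiftclient connection (conn) as in the earlier sketch:
    # conn.put_object(pick_container('item-1234.jpg'), 'item-1234.jpg',
    #                 contents=data, content_type='image/jpeg')

Because the same hash is computed on reads, no lookup table is needed to find which container a given object lives in.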
On the network side, we use two different network VLANs to isolate internal from external traffic. As you know, the background consistency processes, the replicator for example, generate a big amount of traffic, using rsync, to ensure the consistency of the data across your cluster. On the other hand, we have a big amount of traffic that comes from our proxies to our storage nodes. So we now isolate these two kinds of traffic using two different VLANs of 10 gigabits each.

It's also really important for us to have good monitoring of all the pieces of our Swift cluster. We use New Relic to monitor our Swift cluster response times and traffic, mainly because Mercado Libre has been using New Relic for a long time to monitor different kinds of applications. Here we just took advantage of that and integrated New Relic into our Swift proxies, which is really easy, by the way. With New Relic we can quickly figure out that we are having an issue in some part of our Swift cluster, or that some user is changing the way they use our Swift service. So it's really useful for us, and for that reason we use New Relic. But you can also use StatsD. StatsD support comes integrated with Swift; it's really easy to configure and install, and you can use it to get detailed operational metrics in real time from the different Swift services. We are testing StatsD and planning to implement it in the next few months. We also use Kafka, an open source distributed messaging service, to centralize all the logs from our storage nodes and proxy nodes in a single repository for future analysis with all that information, for example with Hadoop. And drive audit: as you know, drive audit is the Swift service that analyzes all your storage node drives looking for I/O errors. We customized drive audit to send us emails as soon as something happens to one of our drives, so this way we also improve the response time in case of failures. We also have custom scripts to check the consistency of the ring and configuration files. As you know, we change our ring files many times, because we need to add new drives, remove broken drives or change weights. Each time you make a change to your ring files, you need to rebalance and replicate those files across your cluster, so it's really important to check their consistency and make sure you have the same files on all the nodes of your cluster, to avoid big issues in the Swift service. So we have custom scripts that check the consistency of our ring files across the cluster. We know there is a swift-recon service that is really useful for that; we are testing swift-recon and planning to implement it in the next few months.

But what if something fails? And everything will fail. Hardware failures are really common when using Swift in large-scale clusters of commodity hardware, precisely because of the commodity hardware. We know that Swift can handle most kinds of failures, but in some cases we saw performance degradation caused by that handling, so we need to know what to do when failures happen. If a storage node fails and we can't fix it quickly, we just remove all of that node's drives from our ring (a rough sketch of how that can look with the ring-builder CLI follows below).
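A rough sketch of what such a gradual removal can look like, using the standard swift-ring-builder CLI driven from a small Python script. The builder path, device search values, weight steps and waiting time are illustrative assumptions, not Mercado Libre's actual procedure, which also has to push the rebalanced ring files out to every node.

    import subprocess
    import time

    BUILDER = '/etc/swift/object.builder'
    # Search values for the failed node's drives (illustrative), assuming
    # they were added with an initial weight of 100.
    FAILED_DEVS = ['z1-10.64.0.21:6000/sdb', 'z1-10.64.0.21:6000/sdc']

    def set_weight(dev, weight):
        subprocess.check_call(['swift-ring-builder', BUILDER,
                               'set_weight', dev, str(weight)])

    # Step the weights down instead of dropping the drives at once, so the
    # replication traffic triggered by each rebalance stays manageable.
    for weight in (66, 33, 0):
        for dev in FAILED_DEVS:
            set_weight(dev, weight)
        subprocess.check_call(['swift-ring-builder', BUILDER, 'rebalance'])
        # ...distribute the new ring.gz files to all nodes here, then wait
        # at least min_part_hours before the next step.
        time.sleep(6 * 60 * 60)

    # Once the weight reaches zero, the drives can be removed for good.
    for dev in FAILED_DEVS:
        subprocess.check_call(['swift-ring-builder', BUILDER, 'remove', dev])
    subprocess.check_call(['swift-ring-builder', BUILDER, 'rebalance'])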
We do it gradually, by decreasing the weight of each of the drives. If a single drive fails, the impact on performance is obviously lower, but again, if we can't fix the issue quickly, we just remove that drive from our ring. And if a proxy service fails, we have good F5 health monitors that detect the broken proxy quickly and remove it immediately from our F5 proxy pool, because a failed proxy could cause big issues in our Swift API.

So here we have the architecture that resulted from all those previous points. As you can see, we now have 32 storage nodes configured with 12 enterprise-grade SATA hard disk drives each, so we have three times more drives than the first time; we multiplied the number of drives to get that benefit. They also come with two solid-state drives for account and container, so we have the services isolated in two layers, the account and container layer and the object service layer. We are using SATADOMs, and we have 64 GB of RAM and two hexa-core processors, which is the same as before. We divided all the storage nodes between the two data centers as the first time, and we have the same ring zone and replica configuration. We have 18 proxy nodes now; we added more proxy nodes to improve throughput, and they are caching tokens, so we have a token caching layer now, and they are also caching ring metadata. We have the same load balancer layer composed of F5 BIG-IPs, and we have a bigger Keystone cluster to support the traffic that comes from Swift. And we now connect all that infrastructure with two traffic VLANs of 10 gigabits each, so we have an internal traffic VLAN and an external traffic VLAN. This architecture looks similar to the first one, but here we applied all the previous points we saw to deploy a completely different architecture. And the results are also really different, because we are about 70% faster, we are serving 300K requests per minute on average, we have better service support with better response times in case of failures, and a lower hardware failure rate. So we are stronger and we are proactive.

So here we have some lessons learned. Hardware buying and planning is key, because good service support reduces response times, and we have better hardware: solid-state drives, enterprise-grade drives, SATADOM modules to install the operating system, and better network interfaces. Size matters, because we multiplied the number of drives we use in our Swift cluster to get a better balance of IOPS. Divide and conquer, because we isolated some parts of our services to increase performance. It's not bandwidth, it's throughput, because we don't have a bandwidth issue for now; it's all about throughput. Keep an eye on the hash ring, because we need to be sure that we have the same ring files on all our nodes, so we need to check the consistency of those ring files across our cluster. And hardware will fail, and it will fail a lot, so we need to know what to do when failures happen.

Okay, so that's all, it's Q&A time; we have time for a few questions. Also, sorry, before that, here are our contact details in case you want to write to us or have any questions. Yeah, that's one. Well, that depends on your traffic. We can notice that we need more proxies because our response times get slow. We use New Relic to monitor our Swift traffic.
So we can notice when we are having some issue with our response times and we need to add more proxies. But the proxies we have today support a big amount of traffic. Yeah. Thank you so much. How many servers do we have in the Keystone cluster? Oh, yeah, about 20 instances. No, it's also for Nova; we have it for a lot of services, the core OpenStack components that I mentioned before, and we are also using it for some internal applications, because we integrated it with our LDAP servers so we can also authenticate internal users with it. How many memcaches? Well, the same number as proxies, because we run memcache on the same servers as the proxies; we use proxy and memcache on the same server. Actually, memcache integrates well with the Swift proxy service. With Gluster? No, for the time being we are using this; we use XFS, and we have thought about evaluating Ceph because we also want to integrate it with other parts of our OpenStack deployment, but no, for now we haven't considered implementing Gluster. XFS? Yes, it's XFS.

Yeah, it depends on the use case. In our case this worked for us, because we have fast clients and we know we didn't have too much load on the CPU side; we have two hexa-core processors, so that's plenty. The bottleneck was on the drives; that's why we chose to go with the higher-density storage nodes with 12 drives instead of 4 as in the first implementation. Yep.

It's not easy. It's not easy, but we also use New Relic to monitor the API, and we use it to check the server workload; you can check the I/O on the network and on the drives. But it's usually in your drives, of course; it's usually in our drives, yeah. You need a really good monitoring system to check in which layer you are having issues. We usually hit it on the hard drive side. Sometimes it's the network, and that's why we also upgraded from 1 gigabit to 10 gigabit to get better throughput, but usually, in our case, the drives were where we had the problems.

No, actually, as Maximiliano said before, we are using custom scripts to replicate all the ring files and configuration files, to replicate them when we rebalance and to check their consistency, yeah. We are actually not using it for that; we use it mainly for the installation and configuration of our instances when they are launched. We also want to make its use wider. We use Chef because it has grown a lot, with many recipes and cookbooks, so we don't need to implement everything ourselves, yeah.

On the side of disks, yes, 32. Yeah, we have around a quarter of a petabyte, 250 terabytes, sorry, in terms of size. This is the current implementation we have, but as you know, you can add more hardware and scale out, because you can distribute the partitions in your ring and then you start moving data and using more space. Our maximum size, I can't remember the exact number, but it gives us many petabytes. We have a lot, because as we told you before, our internal clients are mainly fast clients. We started to migrate some of the images that we had on the storage appliances to the Swift solution, so we mainly have a lot of small objects, between 40 and 100 kilobytes more or less. Some of our internal clients also use it to store large files, but that's very small traffic. We have a mix, yep.
No, actually, as I said, all of this moved from the old infrastructure to the new one, so we needed to work and talk with our developers, our internal users, because they needed to adapt to this new cloud and this new type of resources so they could talk to the APIs and tweak their apps. Because of that, not everyone is migrating at the same pace, but they are all trying to do it, because there are a lot of benefits: it's distributed, you have fault tolerance, you get better performance than NFS, you don't have a bottleneck, and you have no vendor lock-in. That's why it's better, and everyone is moving, some faster, some slower, but they are moving. The main use case, well, we have a mix of internal users in our company, so they use Swift for everything. We are planning to move the pictures, the images of our site, to Swift, but that's still just a plan. And the other question, the replication factor? Yeah, we have three replicas, yes. As we explained, we have two data centers where this cluster lives, and with three replicas we ensure that we will have at least one copy of each object in each place, so in case of disaster recovery you are covered.

Yeah, no. No, because they are physically close, so we have a low-latency interconnect between them, but we are planning to move some part of Swift to Atlanta, to another data center, so maybe we will do that. We need to test it, but it's possible. The way it is working right now is very good for us. As you saw and as we told you, the first implementation wasn't good and we learned a lot from it, and that's why we changed and arrived at this solution. And I think it's really coupled to the way your clients work with it. Because we have fast clients, this layout is working very well for us; if our clients change the way they use it, then maybe we will need to adapt to their needs, but with fast clients and the way this is distributed, it's working very well for us.

Sorry, I didn't hear you. Yeah, right now we have internal users, but we don't bill them yet; we might do that in the future. We implemented this, and as Maximiliano told you, we are planning to implement StatsD, which gives you better usage metrics about your Swift deployment. Yeah, also Ceilometer is integrated with Swift, I saw that. So we might do it in the future, but right now we are not billing anyone; we just offer it for internal use and we watch how everyone is using it. Every time a new client arrives, we check their usage and request patterns, but no billing yet.

Anyone else? SSDs? Yeah, we have two SSDs per storage node. No, no, they are not RAIDed together. RAIDed, you mean if they are RAIDed together? No, no. What you want to do is keep the hardware layer as simple as possible, because in case of any failure you want to be able to replace things. So we use commodity hardware, and every storage node has its own SSD drives and enterprise-grade SATA drives, so in case of any failure you can just pull the drives out and hot-swap them, okay?

We don't have more time. Yeah, I think we need to wrap up. We can take it outside if you want, okay? Oh, yeah, sorry. So, here it is: we want to thank you, and if you have any questions or want to contact us, you're welcome to. Thank you. Thank you.