Let's get started. Hi everyone, today we will talk about how to foster cross-community collaboration and the lessons we learned, by sharing our own experience. We will use Alluxio and Presto as the example, walk you through the process, and share our lessons and takeaways.

First, a brief introduction to the speakers. I'm Kay, Kay Wong, a software engineer at Meta. I've been working on Presto for about four years now. Here's Ben.

Hi everyone, my name is Ben. I'm the founding engineer and VP of open source at Alluxio. I have been working at Alluxio since the very beginning of the company's journey, and I have been working with Meta and the Presto team since about the end of 2019.

All right, let's get to today's topic. First, a short introduction to the two projects, Alluxio and Presto. Ben, do you want to introduce Alluxio?

Yeah. Alluxio is an open source project incubated at the UC Berkeley AMPLab. In the beginning it was more of a research project called Tachyon, which helped Spark persist RDDs as files using in-memory storage. Nowadays it has become a general-purpose system that different applications, such as big data analytics or machine learning frameworks, can use as an abstraction layer to access different storage systems through a unified API and a unified namespace.

Moving on to Presto. Presto is an open source SQL query engine that is fast, reliable, and efficient at scale. It's widely used to run interactive and ad hoc queries with sub-second performance over high volumes of data. It was originally developed at Meta and open sourced in 2013. Now it's well adopted inside Meta and at other companies such as Uber, Twitter, and Alibaba.

So this is how everything started. It began with a prototype project called RaptorX.
RaptorX proposed a new disaggregated architecture and aimed to boost query performance by at least 10x by adopting a comprehensive caching solution. I won't go into too much detail, but you can learn more about RaptorX in this blog. After prototyping and getting good performance numbers, we wanted to move fast and reuse a library that could meet our requirements for SSD caching. However, at that time the existing solution at Meta was not open sourced and was on a different language stack. Presto is an open source community, so ideally we hoped to solve this problem through an open source collaboration. And magically, at a Presto meetup, we met the Alluxio developers and instantly found a match. I will let Ben tell the rest of the story.

Yeah, thanks, Kay. For the purpose of this presentation we will not go into the technical details, but rather use the story to show you how two different communities, two different teams of developers, worked together toward the same goal. The goal here was to combine Presto and Alluxio to fulfill what RaptorX was aiming at.

Not surprisingly, there are a lot of challenges when two teams from two different open source communities work together. The first challenge, as with many other projects, is how to make this a priority in your organization, here in both organizations. You can imagine that both I and, at that time, the tech lead on the Presto side had a lot of different items on our plates. There can also be alternative solutions competing for that priority. Unless this is a priority endorsed by the leadership, it's hard to really get it going, even though we found instantly that there was a match between the two projects and we could go hand in hand. You have to make sure this is blessed.
We needed to make sure this was endorsed and got enough resources allocated for the collaboration. Otherwise, it's hard to make any progress. How did we achieve that? First, we wanted to convince the leadership from a technology perspective. So we set up quite a few meetings. I remember visiting Meta, at that time Facebook, quite a few times to pitch to the leadership and also to give the engineers a more detailed deep dive: what Alluxio is, how we implement it, who the contributors and developers are, and more. In this way, we answered the technology questions with technology. That's part one.

Part two is that we wanted to illustrate the potential, the benefit of doing this together. These are two open source projects, and we both want to boost open source adoption. There's nothing better than having another open source project to work with: we can work with the other community, get recognition, and get more leads and more information from that community. By putting these together, we saw that this really works well. At least for me and for the Presto team, we got this blessed by the leadership. Next slide.

The next challenge: just like a lot of other open source projects, or any other projects, you almost always face a very tight timeline, and this was a first-time collaboration. We started talking with the Presto team leads at the end of 2019; I was on vacation in Hawaii when we started the conversation. At that time we heard that the timeline to get RaptorX shipped to production was six months. We needed to start the project from zero, test it, write a lot of documentation, and also run staging tests. So it was actually a very aggressive timeline.
So we divided this into different stages and really tried to meet the timeline for each step. If you take a look at the timetable here, we spent about one month working together on the design. After that, we spent maybe about two months in total on the code implementation, I would say the initial drop of the code. After that, we spent four or five months on testing, optimizing, and identifying issues in a staging environment.

Kay, you may still remember that time. The code was there quickly, but it actually took a lot of time to polish it to a stage close to production ready, because there were a lot of small issues here and there, like monitoring or resources, all kinds of things that show up in a pre-production staging environment. So I guess the takeaway here is: do allocate enough time for testing, especially testing near production or with real production workloads, which will surface a lot of problems.

Yeah, I totally agree with you. Right now we are starting a new project as well, and I really feel the code is actually the easy part in some ways. If you want to go to production, the long-tail issues are the real pain, and they might take longer than the actual coding. So, like you said, give yourself enough time for the project.

So I'm really glad that at the beginning of this collaboration we set a pretty aggressive timeline for code completion and then spent 60% of the time just on testing, to make sure this project could be adopted in production environments. There's a lot of adoption and integration work that needs to be done. And even though I said this was in production in June 2020, by no means does that mean it was done, perfectly done.
Kay may still remember that we spent a lot of time together even after that. There were a lot of issues we needed to hunt down, and we will go into more detail in later slides. I'm just saying that June 2020 in this timeframe is not the completion of the entire project. I would say it's just the first major milestone.

Another interesting challenge in having two different projects work together is that they may have totally different release cycles and bug escalation processes. For example, at that time Alluxio was doing what we call a quarterly release: every three to four months we had one release. When you hit an issue in staging or production within the same repository, the same project, you can always do a hotfix: go to the branch, make the fix, and create another patch. In this case, though, Presto is using Alluxio as a dependency. It's a Maven dependency, since it's a Java project. So we needed to make a release in order to deliver a patch to Presto. That increases the complexity and also dramatically increases the time to deliver a patch.

To address this, I talked to different people in the Alluxio community and we agreed: why don't we do a lightweight release instead of a full-blown release cycle? In a full release you have to do the documentation, run the release tests, do a lot of different things, and publish the final results on the website, including the documentation. Instead we just do a Maven release; we call it a lightweight Maven release.
Whenever a bug was reported by the Facebook team, we would create a fix first and later create a Maven release, which did not include the documentation or any other release-related tests, just to speed things up. This way I could release very frequently. I remember I could do this two to three times a week, just by myself, to address issues in a timely manner.

And on the Facebook side, we could test it. I also remember that at that time you created a kind of command for us to build a temporary package that included all those additional logs, so that the next time we ran in the environment we could collect more stats, and if the issue happened again we had more logs to look at.

Yeah, we had to be very clever to make it work. It's more complicated than working in the same project, but it's pretty fun.

Communication. At the beginning of this collaboration, I remember we discussed how we wanted to communicate. We definitely wanted meetings, and we set up regular cadence meetings. But there are still a lot of one-off questions you have to answer, or you just want to leave a message asking other people what's going on. So we decided to create a shared Slack channel between the Facebook team and the Alluxio team. That way we could be very responsive. For example, whenever Kay found something not going well, she could just ping me: hey, there's something you may need to pay attention to, and here are the symptoms. They didn't need to wait for the next weekly meeting for me to answer. I found this worked really well. But besides Slack, we also used email.
The way we did this is that daily development happened on Slack, but major critical decisions and important agreements still went to email, because that is something we can keep in written form, so everyone can be aligned in a more formal way. We found this combination of email and Slack works pretty well.

I want to add one more thing: people really matter in communication. If the people in the collaboration are not responsive, the problem gets deferred and the issue is resolved later. I found that people from both our companies were very responsive at that time. Whenever an issue popped up, by the next day we had come up with a plan and had already figured out what it was. So it's really important to stay responsive on those communication channels.

Also peer pressure. Once I see Kay is on top of everything, I tell myself: hey, you have to be on top of these things too. We push each other. Yeah, it's good. Next slide.

Also, as you can see, when working in organizations, especially with two organizations working together, we have to deal with responsibility changes as people change or teams grow. This project took two quarters, actually more; even right now we are still working together on it. To make transitions smoother, we took some preventative measures, including writing well-documented knowledge down: project vision, design, and implementation. We have a very long design doc with a lot of very detailed tips. So whenever someone at Facebook needs to be onboarded to this project, we can just transfer the knowledge: this documentation has all the information you need.
Plus, all the appendices were created for a reason, because there was some incident or some discussion behind each, so we made that really clear. Good documentation makes the collaboration much easier. It also helps to run the project with both the old owner and the new owner together for a little while to smooth the transition. For example, originally I was the primary owner for a lot of things on the Alluxio side of this project. Then another key team member, Beinan, joined and took over the ownership from me. I shadowed him for a month or two, and I found he was driving this through perfectly, so I left it to him, and it has run very well so far. So it requires the two owners to work on this together, so that the new owner knows what the conventions are and what the meetings were like before. We also want to settle the ownership change very clearly: who is working on what. Internally we have very clear documentation of who is responsible for what. That way, even when a certain responsibility changes, there's no confusion.

At Meta, we had James start working on this project, then Rohit, then me, and then a lot of other people came to work on these things. It's amazing that we could still keep the momentum going from January to December, when we finally landed those changes in production reliably. So it's really a great collaboration, and I think the documentation really helps in this situation. I also keep a lot of internal documentation just to remind myself who is doing what and where we are. It helps me greatly.

An extra challenge we saw, especially after 2020, is the pandemic, as you can imagine.
Starting in early 2020, a lot of companies in Silicon Valley began to implement work-from-home policies, and this impacted both the Alluxio team and the Meta team. We had fewer chances to meet each other, and it added more communication cost and overhead. This is a generic problem, not only for us or this project; I guess it's not tied to my company, and it probably applies to your projects too. To me, it really comes down to relationship building: how do we build trust even when we don't meet face to face? Prior to the pandemic we had regular in-person meetings; we grabbed lunch together, or sometimes coffee. But now, because of the pandemic, we have to get to know our collaboration partners more personally, as friends, understand each individual's goals and the project goals, and pay extra attention and effort to align those goals and interests. It's also very important to have a regular meeting cadence to make sure people are on the same page, especially when they are not in the same office. Prior to the pandemic, especially during the very active development phase, each week either I went to the Facebook office or someone from the Presto team came to my office, so we could sit in the same office and talk to each other. That helped dramatically. The last one is... yeah, go ahead.

I just want to say that hopefully people won't face the same challenge again. But those are good practices.

Yeah, those are good practices, and I also think teams are getting more virtual nowadays in the Valley. More and more team members are joining companies while working from different cities.
I don't know, maybe this will be the norm in the future. But especially when working with remote or distributed teams, I would put it that way, and especially across different companies, having a cadence meeting and knowing each other well is very important. I keep emphasizing this.

So I think we have covered most of the process and communication challenges. I want to spend a little more time on the technical challenges too, because I think they are interesting and also show the value of this project. First of all, to me the key challenge is this: I wrote the code, and the code is not that complicated. But if you want to deploy the code, deploy your algorithm, at Facebook or Meta scale, there are a lot of different challenges. For example, at Meta they have a huge amount of resources that you have to manage carefully. That means we have to use resources carefully in every operation, because with the total volume of queries, careless resource use becomes very expensive.

We also see that disk and file system errors have to be expected, especially when deployed at Meta scale. I mean file system failures and hardware failures. We did see one or two cases where system calls, Linux system calls to the disk, could no longer be considered reliable. So we had to implement something like a two-phase commit for some common operations, in order to save the data safely and prevent potential data corruption. And when handling cached data at Facebook or Meta scale, you also have to handle node restarts.
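
To make the "something like a two-phase commit" idea above a bit more concrete, here is a minimal sketch in Python. The talk does not describe how Alluxio actually implements it, so everything here is an assumption for illustration: the names `put_page`, `discard_uncommitted`, and the `.tmp` suffix are hypothetical, and the technique shown is the common write-to-temp-file, fsync, then atomic-rename pattern, not necessarily the one used in production.

```python
import os
from pathlib import Path

TMP_SUFFIX = ".tmp"  # hypothetical marker for pages that are not yet committed


def put_page(cache_dir: Path, page_id: str, data: bytes) -> Path:
    """Write a cache page in two steps so readers never see a partial page."""
    final_path = cache_dir / page_id
    tmp_path = cache_dir / (page_id + TMP_SUFFIX)

    # Phase 1 (prepare): write the bytes to a temp file and force them to disk.
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Phase 2 (commit): atomically publish the page under its final name.
    os.replace(tmp_path, final_path)
    return final_path


def discard_uncommitted(cache_dir: Path) -> int:
    """On startup, drop temp files left by a crash between the two phases."""
    removed = 0
    for leftover in cache_dir.glob("*" + TMP_SUFFIX):
        leftover.unlink()
        removed += 1
    return removed
```

The design choice in this sketch is that a page either exists under its final name with complete contents or does not exist at all, so a failed write turns into a cache miss rather than corrupted data.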
Because a restart can be planned or unplanned, there can be a lot of data you have to bring back into the library. In the first iteration, the implementation took 40 minutes to read back all the data stored on the disk, which is totally unacceptable. We didn't realize this at first because we didn't realize how much data a single Facebook node holds. After the feedback, we worked together to find a way around this: instead of blocking the caller, we moved the recovery into the background. Essentially, designing this well for Facebook scale is quite challenging.

The last one is a very interesting experience. One day, after we rolled out a new release for Facebook, they told us: hey Ben, we've noticed CPU utilization increased by 1%, so we need to fix this. That was never an issue before. We have our own in-house tests, but I never really paid attention to a CPU utilization increase at the level of 1%. But that means a lot for a company at Meta scale. It turned out to be a configuration problem: we were reading the configuration twice, and some of the parsing is considered expensive. So we needed to work around it. The lesson for me is that when working at that scale, saving every single 1%, whether of CPU utilization, memory utilization, or any other resource, means a lot.

Another interesting challenge is that we are two different companies. The Presto team developers can go to the production hosts or production services to check what's going on there. I cannot access their production environments, and I cannot access the production logs.
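
Going back to the restart problem for a moment, moving cache recovery off the startup path could be sketched roughly like this. This is only an illustration under stated assumptions, not the actual Alluxio code: the class `PageIndex` and its methods are hypothetical names, and the sketch assumes one file per cached page. The point is that the node starts serving immediately, and any page that has not been re-indexed yet is simply treated as a cache miss.

```python
import threading
from pathlib import Path


class PageIndex:
    """In-memory index of cached pages, rebuilt from disk in the background."""

    def __init__(self, cache_dir: Path):
        self._cache_dir = cache_dir
        self._pages = {}  # page_id -> size in bytes
        self._lock = threading.Lock()
        self.restored = threading.Event()  # set once the disk scan is complete

    def start_restore(self) -> threading.Thread:
        """Kick off the disk scan without blocking the caller."""
        worker = threading.Thread(target=self._restore, daemon=True)
        worker.start()
        return worker

    def _restore(self) -> None:
        # Walk the cache directory and re-register each page one by one.
        for path in self._cache_dir.iterdir():
            if path.is_file():
                with self._lock:
                    self._pages[path.name] = path.stat().st_size
        self.restored.set()

    def lookup(self, page_id: str):
        """Return the page size, or None (a miss) if it is not indexed yet."""
        with self._lock:
            return self._pages.get(page_id)
```

A caller would construct the index, call `start_restore()`, and begin serving right away; early lookups fall through to remote storage until the scan catches up.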
The best I can do is ask them: hey, I know you had another incident, so how can you help me identify what's going on there, especially given that they run this on thousands or tens of thousands of nodes? So we had to build a comprehensive metrics system so that we could understand what happened faster from their dashboards. They have a wonderful dashboard system, by the way; to me it's really awesome. Essentially, we had to sit down together and carefully write down all the important or potentially important metrics, so that from these metrics they could at least tell me what they observed and help me improve the system. In the next slide I can talk about this more.

Having metrics alone is not enough, especially since there are often different error cases, and again, I cannot access their production environments or production logs. What we did, and I think this is a pretty interesting way to solve it, is that for each error case, each exception thrown from the system, we assigned a unique counter. This counter can be exposed to their dashboard system. So even though our team cannot access their production logs, at least we know which type of errors they are seeing, and these are really specific errors. Given the sequence and the time series of these counters, this is very powerful for helping me debug system issues.

Sometimes the issue is a misconfiguration, so human communication is very important. There was one issue caused by configuration: they only had 600 gigabytes available on the disk, but they configured the cache to use one terabyte, and it turned out this created some problems for Presto.
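
The per-exception counters described a moment ago could look roughly like the sketch below. It is an assumption-laden illustration rather than the real Alluxio or Presto metrics code: `ErrorCounters`, `read_page`, and the metric naming scheme `cache.<operation>.errors.<ExceptionType>` are all hypothetical. The idea is only that every distinct error type gets its own named counter that a dashboard can plot over time, so someone without log access can still tell which errors are happening.

```python
import threading
from collections import Counter


class ErrorCounters:
    """One named counter per (operation, exception type) pair."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def record(self, operation: str, exc: BaseException) -> None:
        # Hypothetical naming scheme; the real metric names are not in the talk.
        name = f"cache.{operation}.errors.{type(exc).__name__}"
        with self._lock:
            self._counts[name] += 1

    def snapshot(self) -> dict:
        """Copy of the current counts, e.g. for export to a dashboard."""
        with self._lock:
            return dict(self._counts)


def read_page(counters: ErrorCounters, reader, page_id: str):
    """Read through the cache; count any failure and report it as a miss."""
    try:
        return reader(page_id)
    except Exception as exc:
        counters.record("get", exc)
        return None  # the caller falls back to remote storage
```

Exporting `snapshot()` periodically gives a time series per error type, which is the kind of signal the talk describes using for remote debugging.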
Essentially, even with a lot of dashboards and a very accurate, very convenient way to watch what's going on, human errors are still inevitable, and we need to watch closely in order to reduce the chance of human error.

Okay, a quick summary of the technical part. When you're really running at Meta scale, disks are not that reliable; you have to design the system and the algorithms to cope with disk failures. And humans are not reliable either: misconfiguration can happen, misoperation can happen. So you need to be prepared for that.

So we talked a lot about the collaboration and the technical challenges. Where are we today? At Meta, this project, RaptorX, landed at the end of 2020, and it has been used widely at Facebook. We have a blog about it, I think there is one more blog on the Presto website, and there is actually a paper recently published at SIGMOD that covers this collaboration as part of the paper. This also influenced other community members. At Uber, they read about the Facebook collaboration between Alluxio and Presto and asked how they could replicate the success. So Uber started to work with us on a similar project. It has been in production since last year, and Uber has also published a blog on it. Next slide.

Okay, I believe we're getting to the end; this is the last slide. In terms of the journey, what are the takeaways? First of all, I think it's great that both communities, Presto and Alluxio, are open source communities. That creates a lot of synergy. And we are both in the big data analytics community.
Technically, the trend toward a disaggregated architecture makes caching important in Presto deployments, both at Meta and elsewhere, and a client-side cache gives tighter integration with the compute. That was also a product learning for Alluxio. To sum up, I think this has been a good journey for me personally, and also for my team, working with the Meta team. I believe we learned a lot along the way, and the collaboration is still continuing.

Yeah, I want to add two aspects to this slide as well. One is the people: people need to be responsive and responsible for all the issues we see, like I mentioned before. The second is the culture. I think moving fast is the kind of culture we need when we collaborate between two companies. Those are the two aspects I want to add here.

Okay, we're just sharing what we saw in this collaboration. Hopefully you find it useful. If you have any questions, feel free to find us, or find me in the Alluxio Slack channel. I'm happy to answer. I'm running the open source initiative at Alluxio, so I'm open to more collaborations. Thank you.

All right. Thanks. Thank you everyone.