My name is Ryan Tanner. I'm a software engineer at Conspire, a startup in Boulder, Colorado. Today I'm going to tell you about the past year: our experience working with Akka and Scala, what that's been like in a startup, and how that decision has played out. This talk isn't going to be as technical as the others we've seen this weekend, but I am going to go into the technical challenges Conspire faced and why we chose Akka and Scala to tackle them. I'll also go into the successes we've had, but more so the failures and the lessons we've learned over the past year, including the egg-on-my-face story of why everything crashed and burned the first time we tried to go into production. So I hope you'll learn from my experience bringing Scala and Akka to a skeptical team, and I hope you'll avoid the pitfalls we've fallen into.

So, Conspire: what do we do? We analyze email data to tell you the best way to get introduced to anyone. We score every relationship in order to understand the difference between a colleague and someone you met for five minutes three years ago. That isn't an easy task, and we've obviously expended considerable engineering effort to reach that understanding. I joined that process while Conspire was still in Techstars, Techstars Cloud down in San Antonio. After they graduated I became the first hire, and we began thinking about the system I'm presenting to you today.

So, the technical challenges we faced: what were we thinking about when we chose Akka? On average we have about 100,000 messages per user, with outliers reaching up to four or five million. When a user joins, we have to have their initial data processed and available within 60 seconds, and we need all of it processed within two minutes. And we need to maintain that goal when we get, for instance, a press hit, which can lead to a lot more data coming into our system. One press mention can easily represent billions of new messages to be processed, but even under that load we still have to hit the two-minute goal. We can't break the internal SLA we've set for ourselves.

That was the challenge we faced in June of 2013 when we began designing this system. We needed something resilient, scalable, and easy to understand, and we only had a team of three people to build it: the two co-founders and myself. And we chose Akka for that. If you'd like the technical details on the Akka cluster itself, they're on our blog at blog.conspire.com.

Initially we actually decided to use Akka with Java, not Scala, but obviously I wouldn't be presenting here at a Scala conference if that decision had stuck. None of us had used Scala professionally, none of us. I'd worked at albin.net before, and the other guys were lawyers. So that's how we started: Akka and Java. The back end for Conspire's MVP had been written in Java, so using Akka with it was an appealing option. But I pretty quickly got tired of writing 20 lines of Java when all I needed was a case class to define a message. And then I got tired of mutable-by-default and for loops, and I'm preaching to the converted; I don't need to convince you why we don't want to write more Java. So I started small: I started defining our actor messages as case classes. We were using SBT anyway, so it was easy to have a mixed code base.
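To give you a sense of that win, here's what actor messages look like as case classes. A minimal sketch; these particular message names are made up for illustration, not our real protocol:

    // Actor messages as immutable case classes: equality, pattern matching,
    // and a readable toString all come for free, one line per message.
    case class FetchMailbox(userId: Long)
    case class MailboxFetched(userId: Long, messageIds: Seq[String])
    case class AnalysisFailed(userId: Long, reason: Throwable)

The equivalent Java is twenty-odd lines per class once you've written the fields, constructor, getters, equals, hashCode, and toString.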
And then I took a bit more of a risk and wrote a small piece of our analytics code in Scala, just to demonstrate how immutable collections with higher-order functions made the code safer and easier to reason about than the traditional imperative for-loop approach. And the guys were okay with that. I was able to clearly demonstrate the benefits, and it's a lot easier to take that kind of risk in a three-person startup, where you have a lot more freedom to make these sorts of decisions. That's half the fun of working for a company of that size.

So this is Paul. Paul is our CTO and co-founder, and at this point I'd demonstrated enough Scala to pique his interest, so he decided to give it a shot himself. He's a Java developer; that was his experience. But having never written Scala, he wrote the first pass of our analytics service in one sitting and went from no prior Scala experience to an analytics service that worked correctly the first time it compiled. Now, it took him about six hours to wade through all the type errors. But once it type-checked, it actually worked correctly the first time, and that code still forms the backbone of the company today. He was hooked, and we've barely written any Java since.

I'm not going to claim that Scala has been perfect for us. We definitely have challenges with it. Hiring Scala developers is hard; that's not something we've had a lot of success with. Another big challenge, which I'm sure a lot of you have faced, is wrangling the different styles of Scala that each person on the team writes. Mine tends towards the functional side: type classes, et cetera. Paul sits somewhere in between. And there's still the occasional Option.get that shows up in a pull request and has to be chased down. But Scala's place in our stack is assured.

At the same time, however, Ruby also has a solid place in our stack. We're not purists. By and large we write front-end code in Ruby and back-end code in Scala, and I know from my conversations here and with other companies using Scala that that's a pretty common arrangement. It's worked well for us, and it seems to be working well for others.

And here's a little heresy that I'm sure is going to make some of you mad: are we just using Scala as a better Java at Conspire? A little bit. I would say we're in the shallow end. We've moved beyond for loops with mutable state, but Scalaz is not in our build files. Selling the core of Scala as a huge productivity improvement was actually very easy. But all the really cool stuff we've been seeing this weekend? It's awesome, and I love learning about it, but it's harder to sell a team on that when the team doesn't have hardcore functional experience or the math background. My own introduction to functional programming was with J, an APL dialect; anyone familiar with that? I see one hand in the back. One hand, yeah. It scared me away from functional programming for many, many years. Taking the average of a list in four characters of code is pretty cool, but when your professor asks you to write a content management system in J, you kind of run away screaming. I'm not making that up; that was actually an assignment. But the five or ten percent of Scala that we do use at Conspire has been a tremendous boost for us, and I'm slowly trying to introduce the cooler stuff.
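To make "the shallow end" concrete, the before-and-after that sold the team looked something like this. A toy example, not our real analytics code; the scoring logic is a made-up stand-in:

    // Imperative, Java-style: a mutable accumulator and a loop.
    def strongScoresImperative(scores: List[Double]): List[Double] = {
      var result = List.empty[Double]
      for (s <- scores) {
        if (s > 0.5) result = result :+ (s * 100)
      }
      result
    }

    // Immutable collections and higher-order functions: no mutable state,
    // and the intent reads straight off the code.
    def strongScores(scores: List[Double]): List[Double] =
      scores.filter(_ > 0.5).map(_ * 100)

Nothing fancy, no type classes, no Scalaz, but it's safer and easier to reason about, and that's the pitch that worked.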
I don't know if Scalaz is ever going to show up in our build files, but shapeless might be coming. The real kicker for us, though, is the ecosystem around Scala, and that brings us around to Akka.

I mentioned the three goals of our system at the beginning of the talk: resilience, scalability, and clarity. So how did we do, a year later? On the first two goals, I would say we totally hit the mark. That 60-second SLA we set for ourselves is a lot easier to maintain when you can trivially add a couple more EC2 instances to your cluster when you know a press mention is on its way. Our first hit on TechCrunch saw a 320% increase in the amount of data we had to process, but because it was so easy to add more capacity, we weathered that storm with no discernible impact to our users and no serious problems from our Akka cluster.

And scalability of that form, of course, requires resilience, because things are going to fail. Due to network problems, we actually see nodes lose touch with the rest of the cluster multiple times a day. But we've built in the resilience required to make that a non-event: a node loses touch, eventually it kills itself, restarts, rejoins the cluster, and things move on. (There's a small sketch of that idea at the end of this section.) And oftentimes these crashes are out of your control. Was anyone else on EC2 two months ago, when Amazon pushed out that Xen upgrade and gave you all of about 72 hours' notice that your instances were going to be rebooted whether you liked it or not? Yeah, that was fun. But it's not so bad when you know your system is just going to restart itself and pick up where it left off. Failures are compartmentalized, and the work impacted by a failure gets redistributed elsewhere. It's very easy to build those kinds of concurrency and supervision models on Akka, and you can make them clear and reliable.

But on clarity overall, I would say we missed the mark. I'm not talking about individual functions or actors; I'm talking about the design as a whole. The doctrine of microservices has been making the rounds in the Scala community lately, and Viktor Klang had his blog post about it. I very much agree that this stack is suited to that sort of architecture, and it's something we're transitioning to. But we found that Akka actually makes it very easy to back yourself into a tightly coupled architecture. I'm not giving the details here because, frankly, they're not very interesting to you; I say this as a warning for people considering a similar stack. Because you can build such solid concurrency and supervision models, it's easy to shove more and more functionality into the same cluster. We became so enamored with the ease of building a distributed system that we forgot to build a decoupled system at the same time. We didn't give proper consideration to that aspect of our design, and we wound up with a highly coupled distributed system. Trust me when I tell you that's not somewhere you want to be.

This isn't Akka's fault. We're not blaming the tool; it's our fault completely. We wound up with one cluster responsible for multiple facets of our product: one code base, one deployment. That locks you into a lot of poor decisions you don't want to be stuck with. It also makes onboarding new developers incredibly difficult. It's already hard enough when you're taking non-Scala developers and teaching them Scala on the job; it's even worse when you're teaching them Scala through a code base held back by poor design decisions.
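Since I mentioned it, that kill-and-rejoin behavior falls out of Akka's cluster events. Here's a minimal sketch of the idea, not our production code. It's written against the Akka 2.3-era API and assumes something at the OS level (upstart, systemd, whatever you use) restarts the JVM after it exits:

    import akka.actor.{Actor, ActorLogging}
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent.{MemberRemoved, UnreachableMember}

    // Watches cluster membership. If this node gets downed by the rest of
    // the cluster, shut the JVM down and let the OS-level supervisor
    // restart it so it can rejoin with a clean slate.
    class ClusterWatcher extends Actor with ActorLogging {
      val cluster = Cluster(context.system)

      override def preStart(): Unit =
        cluster.subscribe(self, classOf[MemberRemoved], classOf[UnreachableMember])

      override def postStop(): Unit =
        cluster.unsubscribe(self)

      def receive = {
        case UnreachableMember(member) =>
          log.info("Member unreachable: {}", member.address)
        case MemberRemoved(member, _) if member.address == cluster.selfAddress =>
          log.warning("Removed from cluster, shutting down to restart fresh")
          context.system.shutdown()
      }
    }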
And so I said I'd talk about lessons learned, and this is the first. The single responsibility principle isn't news, but I wish we'd taken it to heart a year ago. We wound up with one cluster with many responsibilities, and we're working towards multiple clusters, each with a single responsibility. This has nothing to do with Akka or even Scala; I recognize that. But as we began using Akka, we found that there's not a lot of accessible literature on exactly how you design these systems. None of us had distributed systems experience, and while the literature is certainly out there, we didn't know where to start. So we dove in head first, and we wound up locking ourselves into decisions we regret a year on. Akka's documentation itself is fantastic, certainly some of the best I've seen from any open source project, but the design patterns around Akka are not as well documented. The community has made a lot of progress on that over the past year, and we're excited about it. For us, the biggest anti-pattern was building a system based around tightly coupled orchestration. That mistake has become an impediment to the business, not just a technical problem: it's made it harder for us to ship new features, and it's a straitjacket we're still trying to break free from. So I say this as a warning to others considering the same path.

Now I'm going to move on to why everything crashed and burned the first time we tried to take this into production. This was back during the early days of the 2.2 branch, right after clustering had been moved out of the experimental stage and declared production-ready. Everything looked good locally. We had our tests passing, including some nice big complicated multi-JVM tests. We could load email data, analyze it, and so on. Then we moved up to EC2. One user through the pipeline: fine. Two at a time: okay. Five, ten: things started falling apart. And you end up with the never-ending (that slide's unreadable, but it doesn't matter) never-ending unreachable nodes, association exceptions, all sorts of fun stuff. Some heartbeats get missed, a node gets quarantined and marked unreachable, and then you've got a split-brain cluster to deal with.

The problem was, well, there were a number of things we didn't think about. We were doing so much CPU-intensive work in our system, but we had failed to take any kind of back pressure into account. In fact, our own userland work was so intensive that Akka's internal actors for remoting and clustering couldn't get any CPU time. Or we'd get the really fun problem where the out-of-memory killer on Amazon Linux would come around and just axe our JVM process. That one's fun, because you get no output whatsoever in your own logs until you finally think to go check syslog and see the little kill statement in there. It took us a while to find. So whenever someone on the Akka mailing list asks "why does my thing keep dying?" and you see a reply saying "check the out-of-memory killer on your operating system," that's almost always me going around saying that.

So we tried to go live, and we failed miserably. It was like Grumpy Cat here on a gumball rampage. That's how it feels when you try to go live and fail completely.
If you dig into the history of the Akka user mailing list, you'll find a bunch of panicked messages from me wondering what in the hell to do to keep my cluster from falling apart. We were honestly afraid that the four months of work we'd put into building this system were going to be for nothing, because we were completely unsure whether we could actually take it into production. I came very close to stripping out all of the clustering entirely and just throwing RabbitMQ in the middle of everything and communicating over that. Very close.

So we spent about two weeks rethinking every aspect of supervision, of how work was dispatched throughout the cluster, and of how we did so concurrently. In the end we were able to build a very strong and reliable cluster that could handle failures, could handle network problems, and, most importantly, had back pressure baked into every single aspect of the system.

So how did we do that? First, here's what we didn't do, because it didn't work. At first we just kept moving up to bigger and bigger EC2 instances, which is a band-aid: it gets you a little way, but it's not a long-term solution, especially if you don't want to burn through all of your venture capital in about a month and a half. Then we tried tweaking the dispatchers in Akka, the underlying executors. We moved our work onto its own executors, we moved Akka's internal actors onto their own dispatchers, we tweaked every setting, every knob in the documentation. None of that really worked either. You do need to tweak it a little, and as of 2.3 Akka's internal actors are on their own dispatcher by default, which is fantastic, but the real question comes back to back pressure.

And this was the problem right here: taking care of futures within actors. One of the most common themes in the Akka documentation, on the mailing list, and so forth is the danger of mixing actors and futures. Most of it focuses on mutable state, closing over sender, and so on, which is dangerous, but we didn't make that mistake. There's a more insidious mistake, which looks a bit like the sketch below. A message comes into your actor; you immediately spawn a future that goes and gets hundreds of megabytes of data over the network from AWS or wherever; then you do something computationally expensive with the result, on the default dispatcher, inside the actor's receive function; and then you do it for hundreds of users at the exact same time. I call this "reactive the hard way." The core of the problem is that you're pulling in many gigabytes of data at a time and performing computationally expensive tasks with no upper bound whatsoever on the amount done at once, because that's the way the receive function works. As soon as that future is spawned, the actor goes and gets the next message, and if that message is also going to fetch a ton of data and do something expensive with it, you're going to have a lot of problems.

This flaw was so serious, and so thoroughly permeated our code base, that it necessitated a near-total rewrite of all of our supervision and work-distribution concepts while we were trying to go into production. We had press hits lined up and ready to go, we were about to pull the trigger, and we couldn't go live. That was not a lot of fun. But because our individual actors followed the single responsibility principle pretty well, the rewrite was not as painful as it sounds, and really, that rethink is why we have such a stable system today.
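Here's that insidious mistake reconstructed. This is a sketch of the shape of the problem, not our actual code; the message types and the fetchMailbox and analyze helpers are made up for illustration:

    import akka.actor.Actor
    import akka.pattern.pipe
    import scala.concurrent.Future

    case class ProcessUser(userId: Long)
    case class UserProcessed(userId: Long, result: String)

    class NaiveWorker extends Actor {
      // The default dispatcher, shared with the rest of the system.
      import context.dispatcher

      def receive = {
        case ProcessUser(id) =>
          // The future escapes the actor immediately, so the mailbox keeps
          // draining: nothing bounds how many of these run at once.
          Future {
            val raw = fetchMailbox(id)      // hundreds of MB over the network
            UserProcessed(id, analyze(raw)) // CPU-intensive analysis
          } pipeTo sender()
      }

      def fetchMailbox(id: Long): Array[Byte] = ??? // illustrative stand-in
      def analyze(raw: Array[Byte]): String = ???   // illustrative stand-in
    }

Notice there's no mutable state and no closing over sender, the mistakes the documentation warns about, and it will still melt a cluster: a hundred ProcessUser messages means a hundred downloads and a hundred heavy computations all in flight at once, starving the dispatcher that, on the 2.2 branch, Akka's own remoting actors also depended on.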
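And on the dispatcher tweaking I mentioned a moment ago, the piece that survived is isolating your own heavy work onto a dedicated dispatcher so the remoting and cluster actors can breathe. A rough sketch; the dispatcher name and pool sizes here are illustrative:

    // In application.conf, a separate dispatcher for CPU-heavy work:
    //
    //   analysis-dispatcher {
    //     type = Dispatcher
    //     executor = "fork-join-executor"
    //     fork-join-executor {
    //       parallelism-min = 2
    //       parallelism-max = 8
    //     }
    //     throughput = 1
    //   }

    // In code, pin the heavy actors to it:
    import akka.actor.{ActorSystem, Props}

    val system = ActorSystem("pipeline")
    val worker = system.actorOf(
      Props[NaiveWorker].withDispatcher("analysis-dispatcher"))

It's necessary, but as I said, it's not sufficient: without back pressure you've just built a nicer-looking bottleneck.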
Going through that mess forced us to code defensively. We learned the hard way that you must bake back pressure into every step of your pipeline, and learning from that mistake has led to a much stronger system. The core of that back-pressure strategy, for us, has been to always pull work, never push it.

Akka will gladly let you hang yourself when it comes to work dispatch. Pushing is easy; it's the most basic function of an actor: actor, tell, message, that's it. But there are problems with that. It makes accounting for work a lot more difficult. If you tell a message to a router, you don't have a clue where that message went or which of the actors behind that router is actually processing it. It's even worse if it's a cluster-aware router and the routee isn't necessarily on the same box: now you're sending work off over the network with no idea where it's going. So if you then get back a failure notification, a DeathWatch notice that one of those actors has failed, you don't know a priori what it was working on. Maybe you have some kind of confirmation coming back, but what if it dies in flight? How do you know which message to retry?

Another complicated aspect of just pushing work is that little code snippet I showed you earlier. Every actor in your hierarchy has to be part of your back-pressure strategy, because you can't rely on the receive function blocking until it gets the next message. If futures or child actors are involved, and your work immediately goes off into a future or a child actor, that worker actor is going to move on to the next message. You can't rely on an actor's mailbox as a form of back pressure, and if you do want to push, you have to involve every layer of your actor hierarchy in your back-pressure strategy.

We find that the drawbacks of blindly pushing manifest themselves primarily in three scenarios: when you're sending work off to a remote node, when there's some concern about the stability of the worker actor, or, as in our case, when you need specific control over the amount of work being done concurrently. Fortunately, it's very easy to build a pull mechanism on top of that basic unit of pushing. This isn't our pattern; credit, I think, has to go to Derek Wyatt for first writing up the work-pulling pattern on the Let It Crash blog, and I'd encourage you to take a look at that. If you let your workers pull work from a supervisor instead, your system becomes dramatically simpler. At first that was counterintuitive to us; we thought, no, we're writing more code now, it's not simpler. But even though there's a little boilerplate involved, it makes it far easier to reason about the amount of work being done at any given point in time in your system, because worker actors only request another piece of work when their entire process, however it's implemented, is actually complete, rather than simply letting work queue up in a mailbox. (There's a miniature version of the pattern below.)

For the details, I'll point you to our blog again: about a year ago I wrote a five-part series on how we've implemented all of this, with code. We use this pattern everywhere. Any kind of work dispatch anywhere in our cluster uses work pulling, and we have found that it is absolutely key to maintaining a stable cluster under heavy load.
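Here's the work-pulling idea in miniature, after Derek Wyatt's write-up. The message names and worker count are illustrative, and a real version needs DeathWatch and retry on top:

    import akka.actor.{Actor, ActorRef, Props}
    import scala.collection.immutable.Queue

    case object WorkAvailable
    case object GimmeWork
    case class Work(payload: String)

    class Master extends Actor {
      var pending = Queue.empty[Work]

      // Workers are children, so the master can nudge them all.
      override def preStart(): Unit =
        (1 to 4).foreach(_ => context.actorOf(Props(new Worker(self))))

      def receive = {
        case w: Work =>
          // Don't push: queue the work and tell workers it exists.
          pending = pending.enqueue(w)
          context.children.foreach(_ ! WorkAvailable)
        case GimmeWork if pending.nonEmpty =>
          val (work, rest) = pending.dequeue
          pending = rest
          sender() ! work
        case GimmeWork => // nothing queued; WorkAvailable will nudge the worker later
      }
    }

    class Worker(master: ActorRef) extends Actor {
      override def preStart(): Unit = master ! GimmeWork

      def receive = {
        case WorkAvailable => master ! GimmeWork
        case Work(payload) =>
          process(payload)   // the entire job, however it's implemented
          master ! GimmeWork // only ask for more once we're genuinely done
      }

      def process(payload: String): Unit = () // illustrative stand-in
    }

The amount of concurrent work is bounded by the number of workers, no matter how fast work arrives at the master.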
So, given the pitfalls I've described here and the huge problems we actually had with Akka at the beginning, does Conspire consider Akka and Scala to have been a success for the business? I would say that yes, overall, we do. We're rethinking how we use Akka, and especially how we deploy it, but its place in our infrastructure and our stack is secure. Our cluster has lasted about a year, and we're a small startup; for us, that's actually about the right timeframe, so we consider it a victory. Our requirements have changed massively since this thing went live. A year ago, all we did was email users some vanity stats about their email once a week; today we have a much different product: we show you the best way to get introduced to anyone. We're a startup, we move fast, and despite some very poor and short-sighted decisions early on, Akka and Scala have been able to keep up with us. We've also put Scala into other parts of our infrastructure. We have a custom Neo4j plugin that powers all of our graph search, and that's written in Scala, and we have a Play instance that serves all of that data up as an API. So Scala: we love it, frankly, and we're excited about it. But we know there are teams out there who, like us, don't have experience in distributed systems or functional programming but are considering going down this path, and there are pitfalls to be aware of. Much of it, admittedly, was our own inexperience, and we're not blaming the tools, but there are things to be aware of.

So going forward, what did we learn? Keep your services small, as small as possible, and avoid building the monolith. Everyone knows that's not what you want to do, but it's easy to find yourself doing it accidentally. Always think about how you would onboard a new developer, especially with a language like Scala, where you're often not hiring someone who already has Scala experience; always be thinking about how you'd teach a Python developer to be productive in your code base. And I'm not trying to say don't use the hard stuff. Absolutely use the hard stuff when it's called for. My point with that second bullet isn't language constructs. If a type class is the right way to go, if an Iteratee is the right way to go, go for it. The point is the design: think about how you would explain the design behind your system to a newcomer. We didn't think about that enough, but I've found it's a good canary in the coal mine for telling you when you're going down the wrong path: if it would take you a very long time to explain to someone how to be productive in a project, something's wrong. And then, the third point: just assume it's all going to crash.

So how is Conspire implementing these suggestions? Currently we're taking that monolith and breaking it into a series of much smaller, specific services that all communicate via Kafka. I know Kafka is pretty exciting right now, coming out of LinkedIn. Right now we have a single top-level pipeline that ushers a user through every step in our process, and we're breaking that apart and sticking Kafka in the middle, which is admittedly not dissimilar from the RabbitMQ plan B I mentioned earlier. We're also throwing out Chef and Vagrant. That's what we'd been using for a year, and I'm glad to say it's now all been deleted. A year ago we set up Vagrant to provision our EC2 instances.
We had Chef building code and pushing it up, and that was just horrible. We hated it. I see a lot of nodding heads; absolutely. So about two months ago we deleted all of that, and we're now deploying everything on CoreOS with Docker. We Dockerize all of our Akka services, push the images up to Docker Hub, and let CoreOS pull them down and run them. I'll hopefully have some blog posts up about that in two or three weeks. And if you haven't checked it out, the SBT Docker plugin is fantastic. I don't remember who wrote it, but it's a great, great SBT plugin. One SBT command and your stuff's up in Docker, good to go. (There's a rough sketch of that setup at the very end.)

So if you would like to join a team that's made these kinds of horrible mistakes, we are hiring. If you'd like to come hang out in Boulder, Colorado, and you're interested in working with a small startup doing some cool stuff with these technologies, come talk to me, or email me at ryan@conspire.com, and I'm on Twitter.

And so, thanks. I know this presentation was pretty short, but it's not as technical as the others and we ran over anyway, so that worked out well. And that's a short URL to the five-part blog series from a year ago. I hope those of you considering a similar path will avoid the mistakes we've made. Always be thinking about back pressure. Always be thinking about how you would bring someone into your system. I hope that steers some people away from the pitfalls we've fallen into. Thank you.
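As promised, here's a rough sketch of the sbt-docker setup, written from memory against the marcuslonnberg/sbt-docker plugin with sbt-assembly producing the fat JAR, so treat the exact keys as assumptions and check the plugin's README:

    // build.sbt (assumes sbt-docker and sbt-assembly are in project/plugins.sbt)
    enablePlugins(DockerPlugin)

    dockerfile in docker := {
      // sbt-assembly builds the fat JAR containing the service.
      val artifact: File = assembly.value
      val artifactTargetPath = s"/app/${artifact.name}"

      new Dockerfile {
        from("java:7-jre") // base image, illustrative
        add(artifact, artifactTargetPath)
        entryPoint("java", "-jar", artifactTargetPath)
      }
    }

Then "sbt docker" builds the image locally and "sbt dockerPush" sends it up to Docker Hub, ready for CoreOS to pull down.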