And this is theCUBE, SiliconANGLE's continuous production of the MongoDB day event here in the Big Apple. We are checking in with John Hoffman, who's an engineering infrastructure lead at Foursquare. John, welcome to theCUBE.

Yeah, great to be here. Thanks for having me.

So I'm here with my co-host Jeff Kelly, and Mongo does these things all over the place. We had Max on earlier today, who was telling us about the ecosystem and how it's growing. So how's this day going for you? What are you learning here? What attracts you to events like this?

I think the biggest attraction for me is meeting up with other big users of Mongo. And we can talk about, you know, different challenges that we've had scaling things out, either with Mongo or with other things. But for me it's really about meeting the other people who do what I do.

Yeah, so is that the big challenge? Is that what keeps you up at night, figuring out how to scale and do so reliably?

Yeah, that kept me up at night a lot more in the past than it does now. We actually solved that problem. I don't know if we solved it, but we're, you know, a little bit more stable than we were in the past.

Lots more gnarly problems, is that...

Yeah, yeah, we have other problems. Our business problems are actually keeping me up at night more than our scalability problems.

So talk a little bit more about your role at Foursquare.

Sure, I lead the infrastructure team in New York and we do a few things. One of them is building higher-level services on top of Mongo itself. And that involves a lot of, you know, building tools that other developers can use to interact with Mongo. We're also working on a big project right now to split up our monolithic application server into multiple smaller applications. So we're developing this service-oriented architecture and we're building a lot of the tooling to make that possible.
My team also works on all of the offline processing at Foursquare. So we take all of the data that we have in Mongo, we snapshot it, and we pull it into Hadoop along with other log data. That allows us to run, you know, business analytics, and it allows us to create a lot of the signals that we use to power our recommendation engine offline.

So you're an early Mongo user. Talk about what the motivation was to bring in Mongo, why Mongo, and what specifically you're doing with Mongo.

Sure, so we started using Mongo about three and a half years ago. And at the time we were running on a standard SQL engine, a popular open source SQL database. And the reason we moved over, there are actually a bunch of reasons, but one of the major ones was that we knew we were going to have a lot of data. We knew that our biggest data source, which at the time was check-ins, was growing at an exponential rate and we would have to split that data up in some way in order to scale. We could no longer keep it on just one server, even if we bought the biggest server out there. So we would have to shard our data, and there are a lot of different ways you can do that. You can do that on top of a standard SQL engine by managing the data splits yourself, or you could use a tool like Mongo where a lot of that infrastructure and heavy lifting is handled for you. So that was really an attractive proposition for us.

Talk a little bit more about that alternative. That kind of brute force, I guess I'd describe it as load balancing, homegrown load balancing, right? Versus what's inherent to Mongo. Can you describe that in a little bit more detail and help people understand the inherent nature of that capability?

Sure, so let's say you have a data set that has billions of things and you need to split them across multiple servers for one reason or another. Either there's not enough space to store them on one server, or the rate at which you're writing data is too high for a single server to support.
And you could manage that yourself, on top of any sort of storage mechanism, in a few different ways. One way is to just say, okay, I'm going to pick some unique identifier that lives inside this piece of data and, based on some sort of hashing function, decide which of the multiple hardware backends running your storage engine you're going to put the data on. And that works out pretty well, but you have to manage a lot of the complications yourself. Like what happens when one node becomes more unbalanced than another? What happens when, you know, you start out with two or three nodes, then you have to add a fourth and you want to rebalance your data? There's a lot of accounting work that you have to do there to make that work well.

It's a lot of manual intervention to rebalance that load.

Yeah, there's a lot of manual tooling that you have to do. A lot of people have done that successfully, but if you haven't done it before it's a real challenge.

Well, it's hard to make that scale, right? I mean, at mega scale.

Yeah, I mean you can make it scale, it's just that there's a lot of, you know, work and expertise involved to get it to work correctly.

I guess what I mean is the business model doesn't scale well. I mean, because you just keep throwing people at the problem.

Right, exactly. So, you know, you're...

People with specific skill sets.

Exactly. So, you know, maybe you have to waste an engineer's time for a year to get this to work right.

And you're suggesting, or I'm inferring from what you said, that Mongo somehow magically helps you deal with that and can make it easier to automate that capability.

Yeah, exactly. So Mongo has built-in features called auto-sharding and auto-balancing. And what you do is, for your type of data, you can say, okay, this is the key that I want to split things up on, and it will handle distributing it across multiple nodes.
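
To make the do-it-yourself approach described above concrete, here is a minimal sketch in Python. It assumes the simplest possible scheme, a stable hash of the identifier taken modulo the number of nodes; the function names and the `checkin-N` identifiers are hypothetical, and this is not how MongoDB itself places data. It only illustrates why adding a fourth node creates so much accounting work.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and so would not give stable placement.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the node count changes."""
    moved = sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical identifiers standing in for real records.
checkin_ids = [f"checkin-{i}" for i in range(10_000)]

fraction = moved_fraction(checkin_ids, old_shards=3, new_shards=4)
print(f"{fraction:.0%} of keys change shards when growing from 3 to 4 nodes")
```

With plain modulo placement, a key stays put only when its hash gives the same remainder for 3 and for 4, which happens for about a quarter of keys, so roughly three quarters of the data has to be copied to a different node. Tracking and performing that movement safely is the manual tooling being described.
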
When you add a new node to your cluster it will automatically rebalance the data for you. It will continue to keep things in balance as, you know, things become unbalanced as your application grows.

So, I want to talk a little bit as well about the business side of the equation. Obviously we hear a lot, you know, with the big data meme, about monetizing your data, and that's kind of the goal for a lot of companies. And obviously that's what Foursquare does, you're a data-driven business really. It's all about data. So, explain how Mongo helps you do that. Specifically, what was it in Mongo versus something like Cassandra or some other database you could be using? What about Mongo specifically allows you that flexibility, or whatever the characteristics are, to really leverage the data to drive services that drive revenue, drive your business?

I think one of the biggest attractions of Mongo with regard to flexibility is the ability to move quickly. And we can do that in a few ways. So, one is that the data model doesn't need to be specified in advance. So, you know, one of the features of Mongo you hear about is that it's schemaless, which means basically that you have to enforce the schema somewhere else in your stack, probably in your application. But it allows you to modify the structure of your data on the fly without having to do an expensive migration that might take a really long time to accomplish in other storage engines. Another piece of flexibility is that we can add secondary indexes very easily. And that's something that's very challenging if you're using a product that's more of a key-value store.
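
As an illustration of why lookups by anything other than the primary key are hard on a plain key-value store, the following Python sketch maintains a lookup table by hand. Everything in it is hypothetical: `KeyValueStore` is a toy in-memory stand-in, and the venue records and the `city` field are made up for the example.

```python
class KeyValueStore:
    """A toy in-memory key-value store (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class VenueStore:
    """Venues keyed by id, plus a hand-maintained lookup table by city."""

    def __init__(self):
        self.venues = KeyValueStore()   # venue id -> venue record
        self.by_city = KeyValueStore()  # city -> set of venue ids

    def save(self, venue_id, venue):
        # If the city changed, the old lookup entry must be removed by hand.
        old = self.venues.get(venue_id)
        if old is not None and old["city"] != venue["city"]:
            self.by_city.get(old["city"]).discard(venue_id)

        self.venues.put(venue_id, venue)

        # Then the new lookup entry must be written, also by hand.
        ids = self.by_city.get(venue["city"]) or set()
        ids.add(venue_id)
        self.by_city.put(venue["city"], ids)

    def find_by_city(self, city):
        ids = self.by_city.get(city) or set()
        return [self.venues.get(i) for i in sorted(ids)]


store = VenueStore()
store.save("v1", {"name": "Coffee Shop", "city": "New York"})
store.save("v2", {"name": "Bagel Place", "city": "New York"})
store.save("v1", {"name": "Coffee Shop", "city": "Brooklyn"})  # v1 moves

new_york = [v["name"] for v in store.find_by_city("New York")]
brooklyn = [v["name"] for v in store.find_by_city("Brooklyn")]
print(new_york, brooklyn)
```

Every additional way of looking up the data needs another table like `by_city` and another set of update rules in `save`. A database with secondary indexes takes over that bookkeeping once an index is declared on the field.
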
The problem is if you have a piece of data and you want to look it up not only by some primary key but by some other aspect of the data, what you often have to do is manage all of these lookup tables yourself, and managing all these things and keeping them in sync can slow down development. Because Mongo allows all these secondary indexes, we can move very quickly. We can look up the same type of data in a lot of different ways, and that allows us to move quickly.

So, you're on the infrastructure side, but in terms of developing and deploying new applications on top of Mongo, that's really one of the keys, especially for a business such as Foursquare, which is very much a mobile application that's got to move quickly in terms of the way users expect to interact with an app, and stay current with new ways of providing services to customers.

Right, exactly. Foursquare isn't a static application. We're never done with it, right? We're constantly adding new features, tweaking things, and we want to be able to do that on top of the infrastructure that we've already created. We don't want to have to spend, you know, a long time re-engineering the data, creating entirely new data sets just to look up the data that we already have.

And you mentioned you also use Hadoop to analyze the user experience, I would imagine, the data that's coming from the applications, how people are using Foursquare.

Right.

Is that kind of the use case? Because I'm interested in understanding Mongo side by side with Hadoop and how they complement one another.

Yeah, so we use Mongo for all of the online data and application. But we do a lot of things offline, like we just want to see how many users did some action yesterday or in the past week. How's this tweak to the product going? How has that affected usage? And we do all of that analysis offline. And for that, you know, Mongo wasn't exactly the right fit for us.
It might be now. They're adding a lot of aggregation features, which are really attractive, but they're really new. And we've had this problem for a long time. So what we do is we store all of that data in Hadoop so that we can analyze it offline. And, you know, I think it's probably not a good idea to use the same infrastructure that's used for online processing for offline queries, because we want to be able to do tremendously complicated things and not have it impact our online service. So we kind of keep those separate.

Well, it's interesting you say that because, you know, there is some talk about the worlds merging, the transactional with your analytic capabilities. Specifically, if you want to do real time, really analyze that data in real time as it's coming in, whether it's from an application or whatever source system, make a quick decision on what that means, and then turn that back around into an action to impact the transactional system. Is that something, do you think that architecture, that concept is just difficult to deploy? I mean, what are your thoughts?

I think you could do that. You just have to have, you know, a hard boundary between the two systems and keep them isolated. So as long as you're not using the same resources to do that analysis, then you're fine. So there are a lot of different ways of achieving that. You know, you could stream all of your online data to two different places, and as long as you're able to keep them isolated, that could be fine. That's not what we do. We do more batch processing because we don't need the answers, you know, right this minute. We can wait a day. But if you do, then, you know, as long as you keep things isolated, that could work.

Very interesting.
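
The offline question mentioned above, how many users did some action on a given day, can be sketched as a batch job over a snapshot rather than a query against the live database. The Python below is a single-process stand-in under stated assumptions: the field names `user_id`, `action`, and `timestamp`, and the sample events with their dates, are all made up for illustration, and nothing here reflects the actual Hadoop jobs discussed in the interview.

```python
from collections import defaultdict
from datetime import date, datetime


def distinct_users_by_action(events, day):
    """Count distinct users per action among events that fall on `day`.

    `events` is any iterable of dicts read from a snapshot or log dump,
    so running this never touches the online database.
    """
    users = defaultdict(set)
    for event in events:
        if event["timestamp"].date() == day:
            users[event["action"]].add(event["user_id"])
    return {action: len(ids) for action, ids in users.items()}


# Hypothetical snapshot contents.
events = [
    {"user_id": "u1", "action": "checkin", "timestamp": datetime(2013, 6, 20, 9, 0)},
    {"user_id": "u1", "action": "checkin", "timestamp": datetime(2013, 6, 20, 18, 30)},
    {"user_id": "u2", "action": "checkin", "timestamp": datetime(2013, 6, 20, 12, 15)},
    {"user_id": "u2", "action": "tip", "timestamp": datetime(2013, 6, 20, 12, 20)},
    {"user_id": "u3", "action": "checkin", "timestamp": datetime(2013, 6, 21, 8, 0)},
]

counts = distinct_users_by_action(events, date(2013, 6, 20))
print(counts)
```

Because the input is a copy of the data, an expensive or badly written analysis can only slow down the batch side, which is the isolation argument made in the answer above.
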
So I wonder if we could talk about your infrastructure. Like a lot of companies, when you were starting up you used the public cloud, you used AWS, and I believe you've begun to move or have moved back into an in-house infrastructure. Maybe it's a combination. I wonder if you could talk about that a little bit and the decisions to do that.

Yeah, so when we first started, we were hosted on AWS, and I think, you know, AWS and other cloud hosting services are definitely the way to go if you're just starting a new company and you're not sure how you're going to grow, you're not sure exactly what sort of resources you're going to need, because you can be extremely flexible and move really quickly. If you need to spin up 10 machines, you just start them up with an API call. It's super simple. But once you start to get a little bit more stable, you understand how you're growing, you understand exactly what hardware you need, then you kind of have to reevaluate things from both a cost perspective and perhaps, you know, a stability and reliability perspective. And we started doing that, I guess, maybe a year and a half ago. And we only did this because we already had some in-house expertise on how to go about managing your own hardware infrastructure. And what we did was we started looking at the costs and weighing that against, you know, the trade-offs in flexibility. And we decided that it made sense for us to move at least some of our infrastructure from the AWS cloud onto our own hardware.

The pieces that were more predictable perhaps, right?

Yeah, the pieces that were more predictable. And it turns out that one of the more predictable pieces is the database itself. You know, you don't scale a database overnight. It's something that kind of grows slowly over time. So we could predict how much hardware we were going to need a few months out and purchase that in advance.
Yeah, so you found, like a lot of companies, that renting is more expensive than owning for certain applications?

Yeah, definitely for certain things. And it also depends on what sort of expertise you have in-house. If you've never done this before, if you've never managed your own hardware, if you're like a five-person startup, I definitely wouldn't recommend this. You know, it's kind of a question of what percentage of your overall technology budget is going to be devoted to the people required to get this to work.

Be careful what you wish for in that regard. Talk about what you would advise young people looking to get into this business, looking to get into this new world of development and big data and NoSQL. What do you advise these young people?

Yeah, so I think for people who just want to learn the new technologies out there, I always tell people to figure out a problem that you want to solve and just solve it using that technology. It's very hard to read a book and go through tutorials of all these contrived problems that you don't really care about. It's much more interesting to find something that you really passionately want to do and, you know, pick that new technology that you want to learn about to do it in. I think that's the best way to learn new things. If you're interested in working for a company, a startup company like Foursquare or others out there, I mean, we're all hiring, so come talk to us, please.

What kind of person are you looking for?

We're looking for just any engineer. We're just looking for really smart people who are passionate about development. You don't necessarily have to have prior experience with the technologies that we're using, because we're using Scala and we're using Mongo, and these kind of have smaller user bases compared to the other things that are out there. But we're looking for people who can just learn quickly and are interested in these things.

Excellent.
All right, John, well, thanks very much for coming on theCUBE. Really appreciate your perspectives. It was good to have you.

Sure, great to be here. Thank you.

All right, keep it right there, everybody. Jeff Kelly and I will be back. This is theCUBE, we're live from the Marriott Marquis. This is the MongoDB days, we'll be right back.