Galvanize Campus in San Francisco. It's theCUBE, covering the Apache Spark Maker Community Event, brought to you by IBM. Now, here are your hosts, John Walls and George Gilbert.

And welcome back to San Francisco on theCUBE. We're continuing our coverage here of the Apache Spark Maker Community Event. It's all day today, and then general sessions tonight here at Galvanize, which is downtown, just about a mile from the Hilton, which is where the Spark Summit is going on Tuesday and Wednesday. We'll be over there for coverage on those days as well. I'm with George Gilbert, I'm John Walls, and we're joined by J.J. Allaire, who is the founder and CEO of RStudio. J.J., thanks for being with us.

Absolutely, it's a pleasure to be here.

Making your CUBE debut.

Absolutely, yeah.

So tell us a little bit about RStudio, if you will, just to set the stage for the folks at home who might not be familiar with it.

Sure. RStudio is a company that builds tools around the R statistical computing environment. And R is actually, I call it a statistical computing environment, but it's also a programming language. It was invented 40 years ago at AT&T Bell Labs; at that time it was called S. And it has amazingly continued to grow in relevance. It's a 40-year-old programming language that's sort of just hitting its stride. It's really a remarkable story. At the time it was built, all the data analysis was done in Fortran, and there were all these decades-old, high-performance, robust statistical routines. But statisticians and analysts weren't able to use them productively, because they were all Fortran. So S was created to provide an interactive interface to all of this great computational power. And that sort of grew up into R. And now my company, RStudio, builds a lot of tools around R.

So now you've got this 40-year-old language marrying up with this five-year-old computing framework in Spark.
What is that relationship all about? How did that come about?

Well, really what it is is that Spark has this incredibly rich environment for doing distributed computing: distributed data sets, distributed machine learning. It's really powerful, and it's high performance enough that you can work interactively. So the interesting marriage here is that R was built from day one to be an interactive data analysis environment. It's a programming language that was built to support a conversation between the analyst and the data. As a result it's very productive, very easy to use, and it has a lot of really great data visualization libraries. And I think what people are really interested in is that there are millions of people now using R, and they want to take that user experience and use it within a Spark environment. And that's a really, really powerful combination.

So was it waiting, do you think, for a framework like Spark to come along? Because now all of a sudden this is the great value. I mean, you've hit the grand slam, basically.

In a way. As I said at the beginning, R was about interfacing to Fortran code, this treasure trove of Fortran code that nobody could get to. And over the years, more so today, it interfaces with a lot of native code, but it also interfaces with distributed computing frameworks like H2O. The creator of R, John Chambers, gave a talk a couple of years ago at useR! where he said R is an interface language. And he really emphasized that whatever is out there that's interesting, we want to be able to create a really nice interface for it. So R hasn't been waiting; it's been interfacing with things that are interesting. And I think there's a great opportunity to provide a great interface to Spark for people who use R.
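As a sketch of what that interface looks like in practice, here's a minimal example using sparklyr, RStudio's R package for Spark. It assumes sparklyr and dplyr are installed and that a local Spark installation is available; the table name and summary are just for illustration:

```r
# Minimal sketch: analyzing data in Spark from R with sparklyr.
# Assumes the sparklyr/dplyr packages and a local Spark installation.
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark as a distributed data set
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Familiar dplyr verbs are translated to Spark SQL and executed in Spark
result <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()  # bring the (small) summary back into R

print(result)
spark_disconnect(sc)
```

The point is exactly the marriage described above: the analyst keeps R's interactive, conversational workflow while the distributed computation happens in Spark.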
So tell us, there's a bit of religion in the choice of any language. This event is Spark focused, which means we're talking about Java, Scala, Python, and R. Who are the tribes?

Yeah, that's a great question. I would say in general that R was created by statisticians for statisticians, so statisticians tend to really readily understand R; it makes a huge amount of sense to them. Software engineers tend to look at R and be confused by what they see, because it doesn't work the way the other languages they know work. There's plenty of crossover, but people from a statistics background tend to like R, and people from a software engineering background tend to like Java and Scala and Python. And Python is sort of high level enough that you can achieve many of the same benefits as R, so Python kind of plays in both communities.

And so, if you were to look back over the last couple of years, we've seen this explosion, partly in terminology, classifying data scientists, but also more broadly in applying data science to applications. What are some of the things that have pushed the walls back to allow greater application of these capabilities?

I think the tools available, at least in the ecosystem that I work in, these sort of open source data science tools, have been advancing at an incredibly fast pace. There are environments like RStudio, environments like the Jupyter Notebook; there are so many projects now and so much energy going into making open source data science productive. And there are so many frameworks for data visualization, for distributed computing, everything happening with Spark. So as more companies have gotten excited about and invested in using open source tools, it's really fostered an explosion of innovation.
You mentioned something really interesting, and it brings up one of our earlier interviews, where you talk about the rich choice now of open source tools and the energy in frameworks for visualization and distributed computing. What IBM was telling us earlier was that we now have that wonderful age-old trade-off between specialization and integration, and they tried to make the case that the only way you could really serve all your constituencies was if you could flow the tools seamlessly across them, not the same tools but what's behind the tools, I guess the models. How does that work?

Well, it depends. There are different scales at which people do data analysis. A lot of the focus here is on using the Spark ecosystem as a sort of central organizing platform for data, and that has all the virtues I think IBM was talking about. But there are all kinds of other, smaller-scale data problems that people solve with these tools as well. The most popular data analysis tool in the world is Excel, and it handles nothing large, but there are still a lot of interesting problems that people wrestle with in that domain. So a lot of these tools do span, R certainly spans and Python spans, both smaller data sets that aren't in a distributed environment and now also the larger data sets. Part of the interesting challenge, and you see this all the time, is trying to leverage this incredible back end that's usually interacted with in one way, together with this incredible front end that usually interacts with a certain scale of data, and putting them together.

And how far along are we? Where are we constrained?

I think all the right things are happening. I don't see technical constraints to making these tools work really well in a scaled-up, distributed environment.
I think everybody knows what they need to do, and in the next 12 to 24 months you're going to continue to see lots of great stuff show up.

You know, with as much as we're talking about, structured and unstructured and streaming and all these great capabilities now, stuff that just wasn't available not too long ago: what is R allowing us to do, marrying up with these new capabilities, that excites you, that gets you going?

The stuff that I'm really excited about in R is that we've managed to build, and it's the R community, we've invested a lot in it, a really incredible platform for communicating about data and for building custom applications around data. We have a system for creating reproducible, production-quality output, and this could be documents, presentations, or dashboards, within R, called R Markdown, that people are getting tremendous value from, typically on these smaller-scale data sets. And we have another technology called Shiny, which is a web application framework for R that allows R analysts to take their work and translate it into a web application very, very easily, without learning all the details of the web development stack. So I look at things like R Markdown and Shiny, which deliver tremendous value against small and medium-scale data, and marrying those things to Spark back ends is, I think, going to be tremendously exciting and powerful.

So an example, R Markdown. How would I put that into practice?

Basically, think of the way you traditionally use a report writer or a business intelligence tool: you're trying to make a case for something, you're trying to tell a story about something, and what you want to do is combine some narrative, some visualization, some models, some direct data browsing into a presentation or a document. That's really what R Markdown does.
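As a rough illustration, not from the interview itself, a minimal R Markdown document interleaves narrative with R code chunks; rendering it (with the rmarkdown package and pandoc) produces a reproducible report. The title and chunk contents here are invented for the example:

````markdown
---
title: "Quarterly Analysis"
output: html_document
---

Narrative goes here: the context, the question being asked, and the
story the data tells.

```{r}
# An R code chunk: its output (tables, plots) is embedded directly
# in the rendered document, so the report is reproducible from the data.
summary(cars)
plot(cars)
```
````

Because the document is regenerated from code and data, the figures and numbers can never drift out of sync with the analysis, which is what "reproducible production quality output" refers to above.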
It lets you communicate in lots of different ways, because with data there's typically a story being told: you're trying to make a decision, you're trying to understand a process. There's always context and narrative around data, and that's really the focus of that tool. It's different from other programming languages in that, with those, typically you're building an application of some kind or a batch process of some kind, and there isn't a story per se around it. With data there's always some narrative, always some point, always a need to communicate about it. So that's what R Markdown is about. Shiny is really about creating the very shortest path between data scientists getting a handle on how to manipulate data for understanding, and then letting an end user do that directly, by creating a web application for it.

And are those differentiators for you then, setting you apart from Java and Scala and all the others?

Those are very big differentiators. I think R has the best system for doing that kind of work.

You mentioned a couple of things in this last set of questions. So data needs to tell a story; on that angle, the analysis informs a user. Does the user then operationalize that directly within an application, through the user interface of an existing app?

It can be. So you might create, I'll just make one up, a simulation for an emergency room that wants to figure out capacity planning and how many people to have on staff at different times. You might build them a tool that, based on historical trends and based on today's data, says what the staffing level should be: an application that says, this is how you need to flex your staffing level. Or we know of tools where people in medical practice will understand the optimum doses to give, things like that. So there are lots of ways that these applications help people immediately operationalize.
Sometimes it's just for understanding, in support of a decision, but sometimes it is for direct operationalization.

And so then with Shiny, where you're creating the shortest path between the data scientist and the end user, is the goal there to help the end user, like, to embed it in an app and say, based on what Shiny tells you, hit the yes button or the no button?

Absolutely. People can embed Shiny applications in bigger apps, or the Shiny app itself can have that kind of workflow built into it.

Okay, okay.

If you think about what people would do traditionally, they'd create these little applications with Excel workbooks. They'd do the same sort of thing: I've got some data, here are some graphs, you can flip these parameters around and see different graphs. And that's great and it works really well, but when you have the full power of R, a full programming language, incredible data visualization, all of R's statistics and modeling functions, and you can build a tool from that, that's really interesting for people.

Okay, elaborate.

Well, you're basically saying that instead of fitting what you're trying to express into a template of, you know, rows, columns, named fields, and graphs, which works very well because it's easy to get started with, you can instead create an arbitrarily complex or interactive web application. You can do anything. You can let the user brush over data, drill into data, sort of anything you can imagine for a user interface that the analyst can build.

In other words, an intelligent, data-aware canvas.

A data-aware canvas that's purpose-built for a given task. You're really asking: what's your task here, a task of understanding or a task of decision? I can build you a purpose-built tool that leverages all the data, leverages great data visualization, leverages modeling.
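As a hedged sketch of the emergency-room idea, here's what a minimal Shiny app looks like. The staffing rule below is invented purely for illustration; the point is the shape of the code: a `ui` definition paired with a `server` function, all in R, with no web stack to learn:

```r
# Minimal Shiny sketch of the ER staffing example from the interview.
# The staffing rule (one staff member per five patients) is made up
# for illustration only.
library(shiny)

ui <- fluidPage(
  titlePanel("ER Staffing Planner (illustrative)"),
  sliderInput("patients", "Expected patients per shift:",
              min = 10, max = 200, value = 60),
  textOutput("staff")
)

server <- function(input, output) {
  output$staff <- renderText({
    # Toy rule: one staff member per five expected patients
    paste("Suggested staffing level:", ceiling(input$patients / 5))
  })
}

# shinyApp() returns an app object; running it (e.g. with runApp(app))
# serves the interactive web application.
app <- shinyApp(ui = ui, server = server)
```

This is the "shortest path" described above: the data scientist's R logic becomes an interactive tool the end user can drive directly, and the app can stand alone or be embedded in a larger application.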
And is that yet to be done, or are you saying you could?

Oh yeah, people do that all the time with Shiny. Absolutely.

So, I guess there's a lot to grok in here, but I'm having one of my early-onset moments, so you might have to...

Well, just before we wrap up, let's talk about the next phase, the next wave for you then. Where do you see this relationship growing between what you're doing in the Spark community and what R is able to fuel? What kind of fire is going on?

Yeah, as I was just saying earlier, we want to light a bonfire under the use of R with Spark. So we're working closely with the Apache Spark community to make sure all the tools are in place so that R users can fully leverage Spark, and people who have invested in the Spark ecosystem can fully leverage R. We're putting a lot of energy into that. We're continuing to put a lot of energy into data visualization, into the reporting and publishing framework and into Shiny, and we're also building a bunch of servers and tools to facilitate using those things at scale in organizations.

Oh, I think I recovered from my Alzheimer's deficit, which is: let's say you have one of these tools like Pentaho, which does data prep and integration. It has analytics, it has visualization, so it may not be the very best at any one thing, but you take an existing legacy app, you embed this, and you essentially upgrade the app to an insightful app.

There are some analogs, but what I would tell you about the approach we've taken is that there's a similar kind of flow and pipeline, but the users can build whatever they want, whatever they can imagine. So the end product is a web application, and I mean that in every way: it's the web application you'd write with traditional web application development tools, it's just built using R.
So there are no static notions of what the end user can see.

Fewer guardrails, more degrees of freedom.

Fewer guardrails, more degrees of freedom. More custom, and I think as a result, higher value. More effort required, but not that much effort. Relative to the business value, I think people are finding it to be a good trade.

And George, it wasn't lost on me. I mean, I was blown away by S becoming R. I got hung up on that back at the beginning, so that's all right. But actually that's an interesting comment, because you're taking what were the core operational, custom-built, you know, hairy-knuckle COBOL programs, and you're infusing them with analytics.

Yeah, absolutely, yeah.

Sorry if I offended anyone.

No, not at all, not at all.

J.J., thanks for being with us.

Oh, thank you.

We appreciate the time, and thanks for the insight on R.

All right, great to be here.

J.J. Allaire, thank you very much. J.J. Allaire, founder and CEO of RStudio. Back with more from theCUBE in San Francisco in just a bit.