So the first question, the first thing we would like to talk about, is how to start with these technologies. Do you think it is better to take a big-bang approach, with a very large architecture for the company's big data projects, and build something big? Or is it better to go for a small project, to look for a use case where you can get real advantage from the technology, and start smaller? What do you think is the best approach? Because there is something of a debate around this. Who goes first?

I'll take that one. In my experience, the adage holds that behind every successful large project is a successful small project. I can't think of any time I've seen a company say, "let's take a technology we have no experience with and bet the company on it," and have it work out. Starting with something small and getting some experience with it is a much better approach, in my experience.

And I have the sense that a lot of people in this room are technologists in a company, and it's always much easier to sell that kind of project to management by suggesting a small pilot project and making sure it's something that can succeed. If management is not familiar with big data as a concept, a demonstration can make an enormous difference.

One thing I will add, because I've seen people do it: the other problem is making sure that the small project is an actual project, as opposed to an excuse to play with these systems. I've seen people put things in place just because the technology was there and they wanted to play with it. Almost as bad as the large project Jonathan is talking about is the project you didn't really need, the one you put together just because the technology exists. Making the right selection, understanding what your choices are and why you need to go with something, is almost as important as the technology choice itself.

Okay, then we have something like a consensus here.
It's better to look for a small project with a concrete use case, not something just to play around with, and start with that. Okay. And what do you think is the best department or area of a bank or an insurance company to start such a project: the architecture department, the technology department, the business intelligence department, the data science department? Who do you think is best placed to start with the projects and the technology and land it in the organization? Because we have seen different things here in Spain. For example, if the business intelligence department controls the architecture, sometimes there is resistance, because they are used to the classical technology, and then it is like putting the fox in charge of the henhouse. So it can be dangerous. Who do you think should lead this advance?

So I think there is no single answer; it really boils down to different cases. In finance, which is the market we've been working in for a long time, risk analysis was usually a good candidate, for two reasons. First, it's a batch system, which fits well with analytics, and it is usually the first to hit big data problems, versus a trading application, for example, which is more latency sensitive and less likely to hit big data problems to begin with. The second is the kind of application that basically has no choice: if you need to do certain things that simply can't fit into the existing technology, that's an easier bet, because you don't really risk much. Well, you still take a risk, but the alternative is not being able to do it at all. Those are usually the two good candidates I've found in those industries; they tend to fall nicely into a project and then become successful. And as you rightly mentioned, choosing the right project matters: you need to find a project that has value to the business, and then it becomes successful.
So one rule of thumb might be: start where you already have developers working on things. I think a lot of big data solutions right now are not at the point where you can take an analyst or a sysadmin and say, hey, can you go implement this solution for us? You need somebody with more technical knowledge and more training to understand it. BigQuery may be a counterexample. But if you're just dipping your toe into the big data ecosystem and starting to handle larger and larger volumes of data, I think you need to start where you have engineers you can throw at the problem.

Though I would add that you also have to have analysts to throw at the problem somewhere. Big data is this interesting crossover area between questions that have traditionally been answered by statisticians and analysts and problems that have traditionally been solved by engineers, and you need both. You have to have the good questions, and engineers often ask very good questions, but you need input from the management of the company, perhaps through an analyst or a business intelligence person, in order to know what sorts of questions need to be answered. That's crucial to getting the project continued funding: management has to be happy with the quality of the answers they're getting. But I think the sophistication right now is coming from the developer side.

And on your point about the business intelligence department being the fox in charge of the henhouse: the general point there is sound, which is that management has to buy in to the fact that there's a problem. They have to want change; they have to be willing to try something new, and not just say, well, let's throw some more money at Oracle.
If you've got people saying, well, if we'd given Oracle half a million dollars we would have solved this a month ago, what are we doing here still spending time on this? You really need people who are willing to try something new. You don't want to be selling them on the fact that they have a problem at the same time as you're trying to give them a solution. It's going to work much better if they recognize that they have a problem and they're willing to look for the solution.

Related to this question, we were talking during the coffee break about the systems you saw in the first presentation. It was really amazing how they improved water and energy generation and so on. The system and the technology were very important, but just as important were the people writing the algorithms, and all the data and statistics behind those systems that made the improvements possible: the smart systems routing traffic on the roads in Los Angeles, and the people behind them. How important are those people? Because in Spain we sometimes have the feeling, in big companies, telecoms, banks, insurers, that you are buying a very good, very sophisticated airplane, but you don't have the pilots to fly it. So how important is it?

Well, I can add an important detail to both of those examples, one that perhaps I should have mentioned in the talk: in both cases, a lot of the software they use was written by employees of those organizations. It's rare to see that in government organizations, where the laws are more or less written to force people to hire contractors for that sort of thing. Part of the dam software is written by engineers, but all of the city traffic light software is written by engineers who work for the city of Los Angeles, and they are mostly traffic engineers who then started writing software.
They had to bring in software engineers eventually, because it became a big data problem. But in both cases, domain expertise is the way to think of it. The people who developed these systems had a total command, an absolute command, of the problem they were working on. There are plenty of people who can approach problems they're not already deeply familiar with, but in those cases in particular, the software grew up gradually over several decades, and the people building it were internal experts in the field. It's hard to do that from scratch in a short period of time.

But for Netflix, the algorithm was very important, because it went from 4% to 10%; 6% is quite significant. So it seems that you do have to take care of that statistical side of the data. Do you think the universities are keeping pace and preparing people for the data analysis we have, or will have, in the future?

So I think the university question is a tough one. In fact, we were talking in the speakers' lounge earlier about the idea that so many people are coming out of school right now knowing Java, for example, rather than C or C++ or anything else. And if you're teaching people Hadoop today, you're going to teach them Hadoop 1.0, and by the time they graduate, MapReduce has been replaced with YARN and all the other things that are happening. Or you teach them the fundamentals: understanding algorithms rather than learning Java specifically, understanding how to do data analysis at large scale, understanding how to architect these systems. That is probably much more important than teaching one particular system or set of systems, because it's so hard that by the time you get a curriculum together, you're out of date with respect to wherever the industry has moved.
And to elaborate a little on what John was saying, I think it actually takes a team to make this work. I don't think one person can do it, because one thing I realized when I started working in big data is that I had no idea what linear regression analysis was. I didn't study it in school, even though I studied math. So you build teams with different strengths. The people who are good at building the tools you want, who can build the languages on top of Hadoop or whatever, aren't necessarily the people who understand the right statistical questions to ask. And the people who can do the statistics may have some notion that the right data is there, but they couldn't necessarily write a tool to process it efficiently. They can write a first pass at it, a first take, but it will be crude and inefficient, and it will work for a little while until they need to go to somebody whose core competency is writing software. So I think it comes down to building teams of people who can do this together.

Yeah, and I think it's important for an organization to be self-aware enough to recognize that maybe it's not ready to be an early adopter of Hadoop or Cassandra or MongoDB or whatever. Sometimes technologies need a little longer to mature before they're ready for a more mainstream audience, and certainly some of these big data tools are starting to get there. But some organizations should take a more cautious approach than others, and there's nothing wrong with that.

The reference or example that comes to my mind with regard to university involvement in education for big data is grid computing, which was, I would say, largely generated by universities.
There was the Globus project and all of those things, and I think we all realized that that project never got the momentum it supposedly had the potential for. On the other hand, if we look at the pattern of what did work, of how a technology so dramatically different from what people were used to could still succeed, the pattern I see is roughly the following: the Googles and Amazons of the world are becoming the new IBM and the new Oracle, the new leaders, but the difference is that they eat their own dog food. So when they pitch something, unlike Globus and unlike the university-based projects, you know it's going to work, because they've been using it themselves. They also come with the right compromises. We all know that big data and NoSQL are a world of compromise, and defining or finding the right, practical compromises can mostly be done by people who are actually eating their own dog food and building those large systems.

So the involvement of universities in this movement is, for me, very disappointing, and I think there is much better potential in participating in those open source projects, in getting students to be part of open source projects. For whatever reason, universities are reluctant to be part of the open source movement; at least in Israel, for example, it's because it's not considered pure research. There are other reasons too, but this is a role I would expect universities to be more active in.

So I think there's a follow-on to that, which is that with big data you need lots of machines. You need a cloud, and universities often don't have the resources to have their own cloud if they want to do pure research on how to connect machines better.
When you get to a certain scale, unless you're a Google or an Amazon or a Microsoft or one of the big players, I think it's going to be difficult for good research about big data ideas to come out. I'm not sure there's a way around that, and even if universities had access to those clouds, I'm not sure the researchers would know which questions to ask. It comes back to needing real problems that you're trying to solve: unless you're constantly dealing with big data problems, you're not going to know how to get to the big data solutions.

I think there are two sides to that, because on the one hand, sure, the university here probably won't be working on routing across a network of tens of thousands of machines, but students here could certainly get involved with, say, the new Pig refactoring of the query parser or something like that. So there are different ways universities could definitely get involved and help move the state of the art forward.

My last question, and then we can take questions from the audience if you have any. We have seen a combination of technologies, and sometimes we see that big data, in the end, is an ecosystem of different technologies working together, as you have explained. We have seen Cassandra working with Solr, with Lucene, with an indexing database, and we really liked the idea about the virtual machine, because some of these integrations compete with or duplicate parts of the code and the features. But when should you combine? When, and why, would you use Cassandra with Lucene? Or a tougher one: when would you use Cassandra with MongoDB, for instance? Because I like to be a troublemaker...

You can talk about Cassandra and MongoDB, because I'm not touching that one. Okay, then let's talk just about the combinations. I'll try to step into that minefield, in a way.
I've been in this industry for 10 years, and there's always the tendency, especially if you're a vendor, to look at the customer's problem, try to solve it, and in doing so end up wanting to consume or own the entire stack around that problem. I think that's the general tendency, but it's also a mistake. Looking back, there are definitely things I would not do again in that regard. The temptation to own everything is very big, because from a technology perspective it usually looks like a very small step, a very small optimization you could add. In big data, I think that option can't really exist, mostly because the range of problems, and the specialized solutions you need for the different problems, as laid out in your presentation, is such that no one solution is going to cover everything. There is not going to be one system that is good for batch and good for real time, good for SQL-based analytics and good for BigQuery and good for all of those things in the same box. I think it's mandatory that solutions integrate well together rather than try to conquer the world.

I think one of the times you want to combine is when you identify something you're sure everybody needs, you've seen it work somewhere, and you know it succeeds. If you just start out from the beginning and try to plan it all, the old waterfall model, designed up front, I don't think that's going to succeed. If you sit there and say, big data needs this new feature, I'm going to add it, and then everybody's going to integrate with it, people won't necessarily integrate with it.
So I think it comes down to a test: build it somewhere, see it work, then figure out how to share it with other systems. When you see other systems replicating it, you realize, okay, this is something everybody needs; we need to pick one, and then we need to work together with people on how we integrate it.

That exhausts my question-and-answer time, but please go ahead. Okay, a short response: I just think that when you have big data, it's really hard to have a one-size-fits-all solution, and that's why you may need Cassandra and MongoDB. You may need five different types of NoSQL storage engines, and five different types of query engines, because each one can be optimized to do one thing very fast. And because the data size is so large, the penalty for failure is so large: if you take a little bit of extra time per record, it can become a massive problem, and if the workload is slightly different, the time it takes can go up by orders of magnitude. So I think you're going to need lots of different types of solutions.

All right, so thanks a lot. I wonder if there are any questions here. First of all, I would like to clear up a possible misunderstanding: the Spanish university is hosting this event, so the relationship with open source is something they may want to have a say in. But anyway, please go ahead. Is there any question you would like to ask?

Just a comment related to small projects versus big projects. I really think it has to do with attitude. If the small departments, the engineers or developers in the business, have the right attitude, and the managers too, then it doesn't matter whether the project is small or big. It depends on whether you want to push your company ahead of your competitors, whether you want to innovate. And mainly I think that businesses in Spain seldom have an attitude of innovation.
Probably mostly banks, actually, because of the needs of the market, but not much beyond that.

Hello, I would like to comment about the Spanish university, because as you know we support the open source software world; a lot of people here have studied here or collaborated with us. But I must say that sometimes even open source projects make it hard, because if we send students who are not yet very expert, the projects are not very patient with them. It's also hard because we send not only the very good students but also the weaker ones, since we have all kinds of students, and sometimes in those interactions the projects are not very patient. I don't mean your projects, but in the past we have had the experience of projects throwing our students out because they were not considered smart enough. So it's a very hard problem to place a lot of students, because you don't know who the good ones and the weak ones are until you put them into the open source project. Questions, please.

Hi, I work for a consulting company. We work a lot with big financial firms, and we're seeing a lot of interesting technologies, but we're also seeing different departments in the same firm running their own startup POC projects, and further down the road we see a potential conflict, because obviously you don't want to have two or three Hadoop clusters in the same company. So the question really is: is there anything, in terms of technology or in terms of process, that can help set up a multi-tenant cluster with an array of big data technologies?

Alan may be better suited to explain this than I am, but my understanding is that, at least with Hadoop, some of the things added in the last couple of releases are features like quotas. The way it was explained to me, it was precisely because multiple departments had their own Hadoop clusters: rather than having three Hadoop clusters, you have one Hadoop cluster where each department has a fixed set of resources it has access to. Yeah, that's correct.
So, I mean, it was observed among big users of Hadoop that you really do want to consolidate clusters as much as you can, because you want that shared data set, but you have to find some way to divide it up. So quotas, queues, and those kinds of things have been put in, along with better security.

To answer the question a little more, though: part of what you're getting at is the homegrown flavor of this. We're saying take a small cluster, take a small project, start out, try it, in a lot of places, and I don't know that that's all bad. Competition is good, and if different people try different things, it may turn out, as you said, that the tools they picked really are best for their particular case and you really do want multiple tools. It may also turn out that one tool is better than another and one group didn't make the best choice for whatever reason, and that competition will help surface that. So maybe it's a slightly painful process, but it has value.

And I think if you get to the point where launching a Hadoop deployment is very simple and fast, then you effectively create a multi-tenant environment that fits different departments. For example, if the process of creating a cluster takes 10 minutes, you build your own cluster: you could have one in your local cloud, one in the public cloud, one on bare metal, and that's the equivalent of multi-tenancy in terms of the ability to share the same distribution and the same infrastructure. There are tools like Chef and Puppet that do a pretty good job of that; I've talked about Cloudify and how we enhance those tools. I think that's the easy way to do things in that regard right now, and building automation around that would give you a lot of what you're looking for.
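As a rough sketch of what the queue mechanism discussed above can look like in practice: with YARN's CapacityScheduler, departments sharing one Hadoop cluster can each be guaranteed a slice of its resources in capacity-scheduler.xml. The queue names here ("risk" and "bi") and the percentages are hypothetical, chosen only to illustrate the idea.

```xml
<!-- capacity-scheduler.xml: split one shared cluster between two
     hypothetical department queues, "risk" and "bi". -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>risk,bi</value>
  </property>
  <property>
    <!-- Guaranteed share for the risk department: 60% of the cluster. -->
    <name>yarn.scheduler.capacity.root.risk.capacity</name>
    <value>60</value>
  </property>
  <property>
    <!-- Guaranteed share for business intelligence: 40%. -->
    <name>yarn.scheduler.capacity.root.bi.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- Let risk borrow idle capacity, up to 80% of the cluster. -->
    <name>yarn.scheduler.capacity.root.risk.maximum-capacity</name>
    <value>80</value>
  </property>
</configuration>
```

On the storage side, the HDFS quotas mentioned earlier (set with `hdfs dfsadmin -setSpaceQuota` on each department's directory) complement these compute queues by capping how much of the shared data set each tenant can write.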
One last, very quick question, if there is one. Yes: my question is how to define big data, because you are supposed to go to the boss and explain what the data is, what this is, what kind of results to expect. The boss always repeats this question. That was a quick question but a very big one, so just one answer, please.

There are two simple answers I can give you. One of them is: it's the things you can't really do with your existing relational database. And that could be not just because of capacity; it could be because of velocity, meaning the speed at which data comes in, and also the flexibility you need in the schema, because the data keeps changing. Usually that's the easiest boundary where we would say, well, it's probably a big data problem, because it doesn't fit the existing types of solutions. Beyond that you could get into a much more complex definition, but practically it won't get you much further.

I can't remember who said this, but someone said big data happens when the cost of deciding what to throw away is greater than the cost of storing the data. Very good. So please, everyone, put your hands together for this magnificent round table.