Live from Las Vegas, it's theCUBE, covering EMC World 2016, brought to you by EMC.

And welcome back inside theCUBE here in the Sands Expo at EMC World 2016. I'm John Walls along with Dave Vellante. It is a pleasure to welcome now to theCUBE set Mike Bishop, who's from Prescient, the chief systems architect there. Mike, welcome to the show. We appreciate your time here.

I'm delighted to be here, thank you.

Tell us a little bit about Prescient, if you will. You're Chicago based, and I know risk management, risk mitigation, but tell us a little bit more about your core mission and focus.

Sure. Well, about seven years ago we got into business in the federal space, so we still have an existing practice down in McLean, Virginia, headquartered there. But just this last June, June 2015, we opened a new commercial headquarters in Chicago, Illinois. That was part of productizing a number of services we've provided in the national security space for private sector use. One of which is traveler safety. We also have a due diligence practice. They're all based out of Chicago.

And the data that you're bringing in, from where, and for what?

Yeah, so the impetus for what we've come to market with commercially was that we were approached by a law firm that was very concerned about duty of care for people traveling around the world, both from a liability standpoint and also just traveler safety, helping them avoid danger. And so we looked at existing offerings of what helps keep people safe and realized that no one was using technology to really gather that information. So the first thing we built, to answer your question, is a big data curation system that helps us identify where information is available that relates to the safety of individuals, and companies for that matter. Right now we've got about 41,000 sources of information indexed. They range from social media to RSS feeds to news broadcasts to forums, blogs, and news articles.
And so we're doing a lot of natural language processing to figure out which of those are relevant for specific locations, and then most importantly, figuring out which of those sources relate to the individuals traveling, so we send them the alerts that relate to them most.

So you're basically building a machine intelligence engine, and then your endpoint is the individual.

Yes, but there's a lot of human-in-the-loop analysis as well. What this curation system does, across those 41,000 sources I mentioned, and we add about 150 a day, is percolate up what's most relevant for a given area, for given types of people, and for given threats. We've modeled various threats to business continuity and personal safety: physical, health, environmental. So for a given location, across those 40-plus-thousand sources, you might say, what are the threats for cyber monitoring or cybersecurity in a given country, and there might be just five. That's what's relevant, and that allows human analysts to focus much more tightly on what's relevant for people in that area.

So humans are the last mile of this.

We are. The phrase we use is, we humanize intelligence. No one was looking at traveler safety and risk management in a contemporary fashion using the technology we've invested in here, which allows us to aggregate data in such volume. But the goal is really to still put it in front of a human analyst. We have geospatial analysts, we have threat analysts that are looking at geopolitical instability, crime, and it's all data driven. We figure out where in the world there is danger, down to a city level, a street level. And that's what the mobile app provides: proximity alerting as you get close to these areas where there's actual danger.

So that domain expertise is still critical, but you can't scale without that software infrastructure.
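The threat-domain relevance filtering described here can be sketched as simple keyword tagging. The keyword lists below are hypothetical and purely illustrative; the real system uses HANA's NLP and far richer models:

```python
# Hypothetical keyword lists per threat domain -- illustrative only,
# not Prescient's actual object models.
THREAT_DOMAINS = {
    "physical": {"riot", "shooting", "robbery", "protest"},
    "health": {"outbreak", "epidemic", "cholera", "contamination"},
    "environmental": {"earthquake", "tsunami", "flood", "wildfire"},
}

def tag_threat_domains(text):
    """Return the threat domains whose keywords appear in the text."""
    words = set(text.lower().split())
    return {domain for domain, keywords in THREAT_DOMAINS.items()
            if words & keywords}
```

Tagging each incoming item this way is what lets an analyst ask, say, for only the environmental threats in a given area.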
Right, and traditionally what happens is companies will send a threat brief to a traveler before they go someplace, and it's 30 pages long and six months old. So it's not really well tailored to the individual that's reading it; it's more than they can really ingest, if they even have the time for it. What this does is put it into a very concise package that, again, resonates with the individual, down to actual demographics, the attributes of the traveler. So a female going someplace is going to get different alerts, perhaps, than a male.

It's not an exact analog by any means, but you think about fraud detection in the last several years and how that's really moved to real time. So presumably you're getting as close to near real time as possible.

We are, and in some cases it is real time. With an RSS feed, we're subscribing to USGS and weather alerts; those are highly structured. So if there's a tsunami alert or an earthquake, we have a latitude, a longitude, a magnitude, and we can send out an instantaneous alert to people in the affected region, as well as to the security stakeholders that would want to know.

So paint a picture of how you're doing all this stuff. What do the infrastructure and the architecture look like?

Sure. The primary components: we're running SAP HANA for real-time geospatial and linguistic analysis. It helps us perform sentiment analysis to figure out if people are saying good things or bad things about people we care about, but also entity extraction and fact extraction; HANA is a big part of that. Geospatially, we're correlating where threats are in proximity to travelers, and all that traveler data is in MongoDB. And so we have a number of web applications, some of them mobile applications, some of them like a dashboard, a secure dashboard that provides accountability of where people are.
But we take that positional data that's coming out of the mobile app and bounce it off threat areas that we've persisted in a PostGIS database. So we have geospatial threat analysts using Esri ArcGIS as well as QGIS, an open source geospatial framework, and they're persisting these big polygons that say, here's statistically an area of town where bad things happen. Then there's also Hadoop, an integral component of our data lake. And from a hardware perspective, we're running Isilon and XtremIO. An example I often use is when we're making snapshots of large chunks of textual data, like social media, and running sentiment analysis. At one point, just testing the throughput, we did about 350,000 tweets per minute for sentiment analysis.

So you're doing Apache Hadoop, and you're a Hortonworks customer as well. You subscribe to their service.

We are, and we also use Apache NiFi. I believe we're one of the first companies to use it for real-time aggregation and curation of data. We have a number of processors that have been written that allow us to, again, extract what's being discussed in RSS feeds as it relates to those three threat domains I mentioned earlier: physical, health, environmental.

And are you looking at Spark? Are you using Spark at all?

A little bit, yeah. And we're really just scratching the surface on some of this technology. I mean, the core components that I mentioned: we're using SAP HANA smart data access and a number of APIs that we've developed ourselves. But yeah.

So HANA gives you the in-memory piece of it.

The in-memory columnar data store. So Spark would be redundant to that. But there are things we're looking at that we can do within Hadoop, and Hortonworks is integral in helping us kind of broaden our horizons there.

And the NLP is embedded in HANA?

Yeah.
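Bouncing a traveler's position off a persisted threat polygon amounts to a point-in-polygon test. The production system does this inside PostGIS and HANA; the dependency-free ray-casting function below is just a sketch of the geometric idea:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting point-in-polygon test.
    `polygon` is a list of (lon, lat) vertices; returns True if the
    position falls inside the threat area."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does a horizontal ray from the point cross this edge?
        if (yi > lat) != (yj > lat):
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside
```

In practice a spatial database would index these polygons so the check scales to many travelers and many threat areas at once.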
So that's really one of its major strengths: it can perform entity extraction in 33 different languages based on our own object models, our own definitions and keywords. So those main threat domains I keep referencing, physical threats, health threats, environmental: those are broken down into a number of factors like physical crime and civil unrest, and there's a number of keywords associated with each one of those. And when we scrape all those sources I mentioned before, we actually persist the entire corpus of text. One of the things we'd like to begin doing with that in the future is far more advanced natural language processing. One algorithm our data engineers are playing with right now is word2vec, to look at semantic proximity: when you see this word or this phrase, what else might you look for? And that's another reason to be persisting this large corpus of text as we ingest it.

Yeah, I was thinking, Mike, you're talking about 41,000-some sources from which you're drawing data, and you have federal agencies with whom you're working. I'm curious, are there any secure sources, confidential sources, or top secret sources of some type that require a different layer of protection or consideration within your system, that you have to be wary of because it is the federal government with whom you're working?

Sure, excellent question. Everything we're doing for Prescient Traveler, the name of this particular offering we've been describing, is for the private sector, and so it's all open source information. However, we are working very closely with NS2. It's a division of SAP that works specifically within the intelligence community, which is again where we came from six, seven years ago. And so there are a number of intelligence community customers that we're looking to help with these capabilities, and that means tapping their sources as well.
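The semantic-proximity idea can be illustrated with cosine similarity over word vectors. The four-dimensional embeddings below are invented toy values; a real pipeline would learn high-dimensional vectors from the persisted corpus with word2vec:

```python
import math

# Toy embeddings with invented values -- a real system would learn these
# from the persisted corpus with word2vec.
EMBEDDINGS = {
    "riot":    [0.9, 0.1, 0.0, 0.2],
    "protest": [0.8, 0.2, 0.1, 0.3],
    "flood":   [0.1, 0.9, 0.3, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word):
    """The most semantically proximate word in the toy vocabulary."""
    return max((w for w in EMBEDDINGS if w != word),
               key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))
```

"When you see this word, what else might you look for" then becomes a nearest-neighbor lookup in the embedding space.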
We still do have a number of federal contracts that involve classified sources, but they in no way come in contact with what we're doing on the commercial side.

One of the complaints we hear a lot, of course, from Hadoop practitioners is the complexity: they spend more time wrangling the data and getting it right than they spend adding value. What are EMC and, I guess, Hortonworks, I guess they're partnering on this, what are they doing to make that complexity simpler?

Well, Hortonworks sent out an engineer that sat down with our developers and helped us stand up the clusters, figure out what we really needed based on the type of data we're acquiring, and what we needed to do to transform it into a form we can access across the data lake. I often refer to Isilon as this connective tissue that allows us to touch these different repositories regardless of where the data came from or where it lives. We're also running a Vblock, so we virtualize these clusters. And that data transformation process is ongoing; it's not a problem, it's just one of the jobs to do. Anytime you identify a new source, you have to figure out how to structure it so it can be accessed by other queries.

So it's interesting to hear you talk, right? Because in the early days of Hadoop it was like, oh yeah, it's all commodity hardware, scale-out, white boxes, anybody's storage, go to Fry's and you'll be good to go. Hadoop obviously has entered the enterprise realm. So you're talking about Vblocks, you're talking about Isilon. You're using a lot of modern but traditional-like infrastructure. Why? Help the audience understand that.

Sure. Well, what we wanted first and foremost was flexibility. This was a greenfield project. When we started looking at this roughly 18 months ago, there was nothing else that existed like it, so it wasn't like there was a template, but we also didn't have legacy systems that we needed to worry about integrating with.
And so we wanted to go with the systems that afforded us the greatest flexibility, and virtualizing those processes was a core requirement. We also didn't really have insight into what the volume of data or the throughput would look like, so we wanted to do what was most flexible. And really, in working with Hortonworks, they've been able to keep us very agile in terms of what types of data we can access. They also recently acquired Onyara, and with NiFi playing a major role, I refer to it as a data manifold: it's through NiFi processors that all of this data is conditioned, and ultimately that's kind of the edge for our ETL, our extract, transform, load.

So your challenge is, it's like golf: you never beat golf, you know? You're never done, right? It's a continuous improvement cycle. When you think about continuous improvement, you said 41,000 data sources, and you have limited resources like anybody. If you have 100 bucks to spend, do you spend it on ingesting more data, or do you spend it on improving your algorithms and that piece of it?

First and foremost, I would look at the optimization of the algorithms, and that's where NiFi, over time, as we really have this thing in practice, is going to allow us to tune that, because one of the things NiFi gives you is provenance. You know where the data comes from, but you can also look at those elevated threat areas that are drawn on a map and say, what specifically were the data points that resulted in that being drawn? Were there crime statistics, and where did they come from? Were there incidents being reported on a per capita basis by some NGO or some government, and where did those come from?
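The provenance idea, tracing a flagged threat area back to the specific data points and source feeds behind it, can be sketched as follows. The record fields, the three-incident flagging threshold, and the source names are all hypothetical, invented for illustration:

```python
from collections import Counter

def explain_threat_area(records, area_id, threshold=3):
    """Trace a threat area back to its contributing data points.
    Each record carries provenance (its source feed); the threshold
    of 3 incidents is a hypothetical flagging rule for illustration."""
    contributing = [r["source"] for r in records if r["area"] == area_id]
    return {
        "area": area_id,
        "flagged": len(contributing) >= threshold,
        "sources": Counter(contributing),  # which feeds paid off, which to cull
    }
```

The source tally is what supports the culling decision discussed next: feeds that never contribute to a drawn area are candidates to drop.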
And then you can say, well, these other sources weren't relevant to that, and so it allows you to really cull your sources. That's one way you can limit bloat and limit tracking feeds that maybe aren't paying off. But you can also use those same data points to corroborate and kind of validate the veracity of information, which is very important, obviously.

Is there a crowdsourced component to this, today or in the future?

In the future, absolutely. Within the mobile application, beyond giving a user the ability to selectively be monitored if they want, they can have a low-resolution mode, where we literally truncate the GPS coordinate before it comes back to the dashboard. So you just know generally where they are, plus or minus a few miles. But we also have alerting mechanisms where they can actually say, this is a suspicious area, I feel unsafe, or I'm seeing violence in the area, whether it's a Ferguson-riot-type situation, and that's information we can pull back in. So that's one component of crowdsourcing. But within the mobile app we're also thinking of adding things in areas that are data-sparse, where there's not a lot of information to support anything other than maybe the overhead imagery analysis we do a lot of. You could literally probe people and say, do you feel safe here? Just a yes or no. And that would be a crowdsourced survey mechanism, a feedback loop from the ground.

Mike, I probably should have asked this, because it hit me early on and we went down a different road. But when you're talking about threat levels and threat areas and what have you, how is what you're doing different than what something like the State Department would do? We hear about issued warnings and what have you. What granularity of data do you have at your disposal that perhaps they don't, or how do they dovetail?

Well, this is where it truly is unique.
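Truncating a GPS fix for low-resolution monitoring is simple to sketch. One decimal degree of latitude is roughly 7 miles, so keeping a single decimal place reports only a general area, plus or minus a few miles, as described:

```python
def low_res_position(lat, lon, decimals=1):
    """Truncate a GPS coordinate before it reaches the dashboard.
    With one decimal place kept, latitude is only known to within
    roughly 7 miles, so the dashboard sees a general area, not an
    exact position."""
    factor = 10 ** decimals
    return int(lat * factor) / factor, int(lon * factor) / factor
```

Note this truncates rather than rounds, so the reported cell never jumps across a grid boundary due to rounding.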
We look at the individual person that's using the application, and based on their own profile: where they're from, Des Moines, Iowa; a male of a given ethnicity, a given nationality. Every threat score that person sees is relative to their home location, so you always have a baseline. No one else is doing that, and that, I think, is a very elegant manifestation of the power of big data. You can say, all right, you have all the statistical information, and you know where threats are likely to manifest against a given person, or type of person, or a given company, based on these attributes, and report them relative to their home environment. So every threat assessment they see, and I mentioned the verbosity of those longer reports that traditionally get sent out, starts with just a bar graph. You can click on it and see more data, more information, and see an analyst-annotated report that's still only about a page, very concise, but it starts with literally a bar graph: compared to your home location, it's this much more dangerous or this much less dangerous. So you immediately know, do I need to ratchet up my level of vigilance, or is it roughly about the same as Des Moines? That's very unique; no one else is doing that, and it's all based on attributes.

And the business model is selling the app, or is it selling the data?

Well, we do have an API, so companies can subscribe to the data itself, but the application is a service we provide for monitoring and real-time threat reporting. So we've performed several hundred assessments around the world.
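The home-baseline idea behind that bar graph can be sketched as a simple per-domain delta. The 0-100 scale and the scores below are invented for illustration, not Prescient's actual model:

```python
# Hypothetical 0-100 threat scores per domain; the scale and numbers
# are invented for illustration, not Prescient's actual model.
HOME_BASELINE = {"physical": 20, "health": 10, "environmental": 15}  # e.g. Des Moines

def relative_threat(destination_scores, home=HOME_BASELINE):
    """Express destination threat scores relative to the traveler's
    home baseline: positive means more dangerous than home, negative
    means less dangerous."""
    return {domain: destination_scores[domain] - home[domain]
            for domain in destination_scores}
```

The resulting deltas are exactly what a "compared to your home location" bar graph would plot.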
In the not-distant future we'll have every city of more than a hundred thousand people. It's probably going to take us several months to get there, because of the level of effort that goes into the human analysis, which sometimes involves this overhead imagery analysis and looking at disparate data sources where you have a lot of gaps. So there's a lot of work that goes into building those street-level maps, but that's all available through the mobile app, on a month-to-month subscription fee.

And an individual pays, or an organization, or both?

Either one, yeah. So we have the mobile app both for iOS and for Android, and then for security stakeholders that have traveling teams of employees, whether it's 10 or 10,000 or 400,000, they have dashboards that can be deployed at a site level or in an operations room that provide that real-time accountability of where the people are, and also the ability to interact with them. So in a Boston Marathon bombing scenario, they can immediately geofence an area and send out a sign-of-life message request; everyone checks in with their PIN, and you immediately know, of my 13 people there, this one needs help. And so that's again something that really differentiates what we're doing.

It's fascinating stuff, it really is. We talk about it a lot at this show and many others, technology impacting business, and in this case the business is saving lives.

It is.

Fascinating application. Mike, thanks for the time. Appreciate it.

Thank you.

Mike Bishop from Prescient in Chicago. That's it for this edition, or at least this hour, here on theCUBE, back with more from the Sands Expo in just a bit. We'll see you then.