Hello, everyone, and thank you for joining the webinar today. My name is Eric Fransen, and we'd like to thank you very much for being here. This is a production of Dataversity. Our speakers today are Nova Spivack and Dominic Terheid of Bottlenose. Today Nova and Dominic will be discussing how to analyze 72 billion messages a day to find trends.

Just a few quick points to get us started. Due to the large number of people attending these sessions, you are muted and will remain muted during the webinar. We will, however, be collecting questions in the Q&A box in the bottom right-hand corner of your screen. At some points during today's presentation, the layout of your screen may change. This is due to the type of media the presenters need to show and how they need to show it. If that happens, please be aware that a drop-down navigation panel will be available at the top center of your screen, and you can always return to the standard view using it; you'll see a return arrow there. You will still be able to access the Q&A panel and other modules using that drop-down. As always, we will send a follow-up email within two business days after this webinar, containing links to the slides, the recording of this session, and any additional information that comes up during the webinar.

And now a few words about our speakers. Dominic Terheid has built complex information systems since age 16, incubated numerous web projects in big data and the social and semantic web, and created and run development teams in Europe, Japan, and the U.S. He is the former CTO of the machine-learning startup Serigo in Japan and an expert in machine learning, big data architecture, and analytics. He is owner at Synaptify.com, an advisor to several startups in the EU and Asia, and holds a bachelor's degree in intelligence systems engineering. He is also one of the top 50 tech rock stars to follow on Twitter.

Nova Spivack is a technology futurist, serial entrepreneur, angel investor, and one of the leading voices on the next generation of search, AI, big data, and the web. Nova is presently CEO of Bottlenose, which uses big data mining to discover emerging trends for large brands and enterprises. He also advises and/or invests in several startups, including Klout, which was acquired by Lithium, The Daily Dot, NextIT, WhoKnows, BigMatrix, which was acquired by OVGuide, PublishThis, Sensensia, and Cambrian Genomics. He started EarthWeb and Dice, both of which went public. Nova is also an advisor to several early-stage venture funds in the areas of emerging internet technology and clean energy.

So with that, please allow me to welcome Nova. Nova, you may be on mute, so please take yourself off of mute. And Dominic, welcome, guys.

Hi, everybody. We're happy to be here. This is Nova speaking. And this is Dominic. Thank you, Eric.

So I'm going to kick things off, and hopefully you can see my slides. Today what we're going to do is take you through a high-level presentation on our architecture and what we do, as well as a live demo of some functionality, some of which has never before been shown publicly, around the platform that we're building. And so without further ado, we'll jump right in. As Eric mentioned, I'm the CEO and co-founder of Bottlenose, here with Dominic, my co-founder and CTO. What we do is discover threats and opportunities that impact enterprise customers, using a patented stream intelligence technology. And we'll get into what all of this means.
But effectively, this is the next generation of business intelligence, focused on streaming data. The goal is to automatically detect threats or opportunities by analyzing this data in real time, using a variety of advanced discovery techniques, at very high volume. So for example, detecting risks or impending crises, competitive threats, reputational threats, cyber threats, or helping to detect opportunities around new markets or new customer needs, competitive intelligence opportunities, or product and marketing intelligence opportunities. And so this is really an intelligence stack that goes beyond analytics to provide actionable intelligence to customers by looking at real-time streaming data.

So what is stream intelligence, and what is our mission? We're trying to build a business intelligence company focused on this type of data. Stream data is the fastest-growing segment of data, and it includes all types of live and historical unstructured or structured time-stamped data. For example, email and messaging data in the enterprise; social media messages from Twitter, Facebook, Tumblr, YouTube, et cetera; mobile application data; news data from news wires, blogs, or other sources of news; live television and radio; IT log file data; CRM data, support data, sales data; any kind of web or app analytics data; financial data; or even sensor and device data, for example from the Internet of Things or wearables. So basically any kind of unstructured, semi-structured, or structured time-stamped data that's streaming in and out of applications is fair game for this category. And what's important here is that we stepped back and looked at the overall problem of dealing with data in motion and generating actionable intelligence from it, and we've built the first unified platform and application for doing this. We focus on what we call, therefore, stream intelligence.

So, a little bit of background on the size of the opportunity and why it's here. On the left side you see a chart with a set of red bars growing very quickly. The red bars are enterprise unstructured data, and there's a little blue sliver, which you can barely see, that is structured data in the enterprise. So what should be evident from this is that unstructured data is vastly outpacing structured data and is becoming the primary type of data that enterprises need to be concerned with. On the right side are some Gartner quotes: data discovery will displace traditional IT-authored static reports as the dominant BI and analytics user-interaction paradigm by 2015, and I think we've seen this happen. By 2017, over 50% of analytics implementations will make use of event data streams generated from instrumented machines, applications, and/or individuals. So that's really the opportunity, and all of that data is effectively streamed data, and the majority of it will be unstructured or semi-structured.

Now there's a problem here. If you look at the growth of data, this orange line, it's outpacing the growth of the people who are qualified to help with that data, the sort of bluish-greenish line. So basically data analysts and data scientists are growing at a roughly linear rate while the data itself is growing nonlinearly, and so we have this exponentially widening gap between the demand, which is the information that has to be analyzed, and the supply, which is the people who can actually do that analysis.
And right now we're really just at the very beginning of the divergence between these two curves. What we're seeing already, if you talk to analysts and data scientists in large global corporations or government agencies, is that they're feeling very strained, and we're barely seeing the beginning of this gap. In coming years, it's literally going to be completely unmanageable. So how do we solve this? We can't create more humans. We can't produce more data scientists and analysts; universities and industry produce them at a pretty much stable rate, and that doesn't look like it's going to change. So this is a classic supply-demand problem where mass production, or automation, is actually the solution, and that's what we're doing. We're effectively going to automate the analyst and mass-produce data science to close this gap.

And so when we look at the three Vs of big data, volume, variety, and velocity, what we see are various products that have been engineered toward one of the Vs. The Hadoop infrastructure is really engineered toward dealing with volume. It's a batch-processing ecosystem where, for the most part, you're doing large batch computations that are not very fast and are not suited to streaming data in motion. On the velocity axis, you see event data processing systems focused around complex event processing. Two good examples are IBM InfoSphere Streams and TIBCO StreamBase. These are legacy technologies built on older infrastructures that come out of the financial trading arena, and they are really rules-based data-bus processing systems for rapidly making decisions against fairly low-volume streams of data, at least compared to what we see today. The challenge with those technologies is that they typically require fairly extensive professional services, and they're not really making discoveries. They're rules-based, so somebody has to come in and build a lot of complex rules to find patterns that they already know they're looking for. On the variety axis, we have things such as the ELK stack, which is Elasticsearch plus Kibana, and other products which try to make it easy to deal with a wide variety of data sources. Splunk is kind of getting close to the center, which we view as the Goldilocks zone. That's where we are. We're engineering a system at Bottlenose which is equally suitable for all three Vs: high-volume, high-velocity, and high-variety data. So Splunk is probably very close in some respects. However, historically Splunk has been focused more on IT data, and it was also built on an older stack; it really comes out of Apache Lucene originally, and what we're doing is much more based around Elasticsearch and newer technologies, which we'll get into.

So stream intelligence is the third generation of BI, BI 3.0. If you look at traditional BI, it was background analyst teams looking at SQL databases. When we got into BI 2.0, the era of big data, we had complex, large unstructured data sources, the concept of the data scientist emerging, Hadoop and MapReduce. And as we move into the 3.0 era of BI, it's much more about analytics at the moment of decision: real-time, interactive analytics that are rapid and agile, and that actually give insights to decision makers, not just IT and engineering people, at the moment they need them. So just in time, in real time, at industrial scale.
There are a lot of advantages to doing this, and I'll quickly run through them. The first thing I want to point out, however, is that companies today have made relatively small investments in dealing with high-velocity data, even though that is the primary form of data they should be focused on. Most investment in big data has gone into volume and variety. Velocity has so far not really received the attention it needs, and I think that is a friction point that CIOs and enterprises are going to start to feel in coming years until it is balanced out. But organizations that are successful at dealing with velocity, that is, real-time data, data in motion, visualizing analytics in real time, are more successful: for example, higher growth in the sales pipeline, more cash generated, greater operational efficiencies. In another study of top-performing organizations by Accenture, they also found that real-time business capabilities had significant operational benefits. The high performers were able to analyze the costs and benefits of business processes and improve them more effectively, embed decision-making tools into business processes, develop and capitalize on insights more effectively, and provide their employees access to key information across devices more effectively. So top-performing organizations had these traits compared to other organizations. Another interesting perspective on this is that organizations today, while they're very good at understanding their cash position in real time, are really bad at understanding things like operational risk or financial performance or other aspects of their business in real time. And these are all opportunities for dealing with high-velocity stream data.

So let's get into what we do and how we solve this. The Bottlenose platform has been in development since 2010. We have around nine patents and dozens of pending patents around it, and it's designed to provide a data-agnostic, unified solution for generating intelligence from any kind of streaming data. So first of all, what kinds of data? We began with social data because it was readily available and used to be free: fire hoses and APIs out of the social networks. We've gone beyond that today. We've added, for example, news sources, blogs, and other data sources, as well as secondary sources we're able to scrape from the social data, such as video sites and photo-sharing sites. So we've got a variety of different social and traditional data sources coming into the platform. We also ingest 98% of all live TV and radio broadcasts in the US, UK, Canada, and the Middle East, with other markets coming. That's 40 hours of video and audio per minute coming through our system with transcripts and getting analyzed. Where we're going, and the thing that we haven't announced yet but you're seeing for the first time here, is that we actually have the capability to ingest any kind of enterprise stream data. That includes emails, web analytics, IT data, and so forth, as well as commercial data, financial data, public data sources, and also machine and sensor data, for example. So for any type of streaming data, whether it's unstructured, semi-structured, or structured, the platform makes it quite easy to add it. We've built new tools which analyze the data, figure out how to map it in, and then suggest a mapping very quickly.
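As a rough illustration of what such a suggested mapping might look like (the actual format of Bottlenose's mapping configuration isn't shown in the talk, so the field names and structure below are hypothetical), a new source could be described declaratively and then normalized record by record:

```python
# Hypothetical sketch: mapping a raw e-commerce event stream into a normalized,
# time-stamped record that downstream analytics layers can treat generically.
# Field names and the mapping format are illustrative, not Bottlenose's actual schema.
from datetime import datetime, timezone

SUGGESTED_MAPPING = {
    "source": "orders_stream",                 # name of the incoming stream
    "timestamp_field": "created_at",           # which raw field holds the event time
    "timestamp_format": "%Y-%m-%dT%H:%M:%SZ",  # how to parse it
    "text_fields": ["customer_note"],          # unstructured text to send to NLP augmentation
    "entity_fields": {                         # raw fields promoted to tracked entities
        "sku": "product",
        "store_id": "location",
        "customer_id": "person",
    },
    "metric_fields": {"total_usd": "revenue"}, # numeric fields tracked as metrics
}

def normalize(raw: dict, mapping: dict) -> dict:
    """Apply a suggested mapping to one raw record, yielding a normalized event."""
    ts = datetime.strptime(raw[mapping["timestamp_field"]],
                           mapping["timestamp_format"]).replace(tzinfo=timezone.utc)
    return {
        "source": mapping["source"],
        "timestamp": ts,
        "text": " ".join(str(raw.get(f, "")) for f in mapping["text_fields"]),
        "entities": {etype: raw[f] for f, etype in mapping["entity_fields"].items() if f in raw},
        "metrics": {m: float(raw[f]) for f, m in mapping["metric_fields"].items() if f in raw},
    }

# Example: one raw event from the hypothetical stream
event = normalize(
    {"created_at": "2015-06-03T14:05:00Z", "sku": "A-1001", "store_id": "NYC-7",
     "customer_id": "c-42", "customer_note": "arrived late", "total_usd": "19.99"},
    SUGGESTED_MAPPING,
)
```

The point of the suggestion step is simply that an analyst confirms or tweaks a mapping like this once, and from then on the stream flows through the same enrichment and analytics layers as every other source.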
And once the data is mapped in, many layers of analytics start to take place. So, a high-level view of the pipeline: data in motion, that's streaming data in hot storage, coming in right off the wire into the system. The first thing that happens, after we've mapped it in once, is that we now have a way to understand it. Dominic will speak to this a bit more, and we can also talk about it in the Q&A. We ingest the data and we enrich it with augmentations; we'll talk about that. The next step above that is data mining and analytics. Currently around 30 different entity types are recognized, and for every entity we're tracking about 150 different metrics, generating time series data. Above that, we have a trend detection layer where we apply a range of different algorithms and heuristics for discovering trends, anomalies, and other kinds of patterns in the time series data that we see. By the time we get to the trend detection layer, it's already completely data agnostic; it doesn't really know much about what the data is. It's really looking at time series, although it is possible to put in specific trend detection modules for specific kinds of data or use cases. The same is true above that, in the rules and agents layer, where we're able to create business rules or other kinds of intelligent agents that watch for certain kinds of situations, such as a complex event processing scenario where we're looking for multiple different kinds of things happening in some configuration.

The key here is that we can look all the way down at the very raw data, or we can look at high-level patterns. So we can make rules that look for trends or anomalies without having to say exactly what they are: if the system detects an anomaly or a trend with a certain score for a certain data source, and/or something else is happening with another data source, and the configuration of these two things is interesting, then do X. That's important because we don't have to specify every possible anomaly or every possible kind of trend or pattern; the system finds those. We can write rules effectively at a higher level. We're writing rules and agents that deal with trends and patterns, rather than having to deal with every possible lowest-level configuration.

So the effect of this platform is that we data-mine everything that's coming through these different streams, and we find both the knowns and the unknowns; I'll show you what this means in a second. Basically, the point is that we want to be able to detect things that we didn't know to ask for in advance. In order to do that, we have to data-mine and track every entity in the data, measure a range of different metrics about every one of those entities, and look at their statistical behavior and higher-order temporal distributions in order to detect unusual or interesting activity around any of these entities. That's how we find unknowns. So for example, a particular customer might be interested in looking at risks around their business. They can't specify every possible risk; there may be threats that have never happened before. We need to be able to detect those, and by effectively scanning the entire environment continuously, to use an air-traffic-control metaphor, we can see if any flights are coming in that are unexpected, or even if an expected flight, if you will, is doing something unexpected.
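To make the entity-and-metric layer a little more concrete, here is a minimal sketch of what tracking metrics per entity as time series can look like. The bucket size, metric names, and data structures are illustrative assumptions, not the production implementation:

```python
# Hypothetical sketch: roll normalized events up into per-entity, per-metric time series.
# In production there are ~30 entity types and ~150 metrics per entity; here we track
# just a couple of illustrative metrics in hourly buckets.
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # hourly buckets (assumption)

# time_series[(entity_type, entity_value)][metric] -> {bucket_start_epoch: value}
time_series = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

def bucket(ts: datetime) -> int:
    """Truncate a timestamp to the start of its bucket (epoch seconds)."""
    return int(ts.timestamp()) // BUCKET_SECONDS * BUCKET_SECONDS

def track(event: dict) -> None:
    """Update every tracked metric for every entity mentioned in one normalized event."""
    b = bucket(event["timestamp"])
    for etype, value in event["entities"].items():
        key = (etype, value)
        time_series[key]["mentions"][b] += 1                     # simple volume metric
        for metric, amount in event.get("metrics", {}).items():  # numeric metrics from the mapping
            time_series[key][metric][b] += amount

# Example: feed one normalized event through the tracker
track({
    "timestamp": datetime(2015, 6, 3, 14, 5, tzinfo=timezone.utc),
    "entities": {"topic": "birth defects", "brand": "Pfizer"},
    "metrics": {"impressions": 1600.0},
})
```

Each of these per-entity series is then what the higher layers can score, without needing to know whether the underlying stream was tweets, emails, or log lines.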
So as well as tracking concepts, topics, or issues customers may already be aware of, we'll detect anomalous activities or anomalous items, people, topics, IP addresses, et cetera, that are behaving in unusual ways. And the effect here is that we do quite a bit of analytics to accomplish this. We analyze 72 billion data records every day on a continuous basis, and this will scale dramatically, even this year, as we add lots of new data sources. A few metrics on this: we're looking at about 3 billion live and historical messages per hour. We're doing predictive analytics on 7.2 billion data points a day, ingesting 67 million new messages a day, and that's growing quite rapidly as we bring on new data sources. We're doing trend detection in our trend layer at a rate of about a million events per second analyzed, and looking at hundreds of billions of time series, in the 200-terabyte range today, growing significantly beyond that. We're finding several thousand high-level detected trends in these data sources every hour. And according to our own internal studies, about 80% of the time we're detecting breaking news and emerging threats or opportunities, or even trending keywords, between tens and hundreds of minutes ahead of the media, ahead of Twitter and the ad networks, and we see similar advantages with non-textual data sources.

So let me quickly give you a little demo of what this does, and then we'll come back to the slides. I'm going to switch over to Chrome, and you should see this now. We're inside of Nerve Center, which is our web-based dashboard for analysts. This sits at the top of our stack and is built with our APIs and visualization libraries, using what our platform provides. In this particular case, we're looking at Pfizer, and we're looking at social data, which is a pretty straightforward example; I'll show you some others in a minute. This visualization is called Sonar, and it's one of our patented visualizations. What it's showing is the topology of real-time trend activity around Pfizer. So first of all, using natural language processing, we extract trends, and we extract topics, for example, without using any dictionary. We have a machine-learning-based approach to doing this, where we have detectors that we've trained that are very good at finding topics in messy Twitter-like messages, which do not necessarily contain punctuation, which may have vowels missing in order to make messages fit in 140 characters, and which may have, for example, more than 100 different spellings of the same word. And so we've made a system that can detect things which are emerging, and we can go back and look at spikes.

So let's look at a spike a few days ago. If I click on that, you can see it's a different view. If I look at this spike over here, for example, we'll see a different view. Now, how do we interpret this graph? The size of a node is its relative volume compared to the other topics we found, so bigger means more volume. The color, if it starts to turn orange and glow, indicates high or spiking momentum. The little red and green dots indicate negative or positive sentiment. The connections indicate co-occurrence relationships between the terms. So here I can see "birth" with Pfizer. If I click on that, a couple of days ago what we see is a trend in this word, "birth", in the context of Pfizer, and scientists warning of a potential link between Zoloft and birth defects.
And so that was a spiking trend at that time. We can also look at other views. So I can add in hashtags, and I can add in the mentioned people, so what usernames were mentioned around Pfizer at this particular time and who was being talked about, and the types of messages. Here we can classify messages by the type of content they are as well, and we see things like emotions and forms of conceptual thinking, as well as types of content. All of this is designed to deal with unexpected content that we can't anticipate; it's based on just looking at what's out there in the wild.

The next view, the trends view, surfaces the most statistically significant events. So, for example, a few days ago we saw this link between antidepressants and birth defects. If I pop that open, we can drill into it in real time, and we see 22 messages shared by 23 people, with a cluster of five different signals that were detected and a spike type of behavior. We can actually go in and look at the metrics for this particular trend. This comes as an email alert, or you can see it in the tool, and we can look into the trend and explore what exactly it was and what was going on. That was a very weak signal when it was detected, but we detected it early. For a more powerful signal, we can look at this trend. Here, what we're going to see is a larger trend; let's just reload that and see if we can get it to load. This will have more people and more data. So, two messages shared by four people, but a lot of impressions. If we go and look even more deeply into this, say a couple of days ago, here's one in Spanish that reached a lot of people. This had 1.6 million impressions reached, which are potential views. 27 people shared this one message, but they were fairly high influence. So the system detects these weak signals. You can control what you're getting alerted on, and you can set the strengths and so on. You can decide if you want to see weak signals, strong signals, signals that are in the middle band, and so on.

There are many other views here. For example, a psychological profile: what are the emotions of people who talk about Pfizer, and how is that changing? If we look at, say, 30 days, here we're doing a linguistic analysis of emotional language based on statistically validated data sets for these different emotions, in order to chart, for example, whether there is a rise in anxiety or a rise in sadness. Around the birth defects issue, for example, we see a big spike in sadness, which is that pink line. So we're able to detect whether audiences are going bearish or bullish, or whether there's more aggression or restraint versus glory and affection. We can measure sentiment as well, traditional sentiment. Looking at sentiment, these are the kinds of things which are almost commodities; sentiment's a commodity today, emotion is not, and that's unique.

We can get into all of the different signals. So let's look at all the topics we detected: 5,000 topics in Pfizer's airspace in the last 30 days, and here's birth defects. We can drill into that and explore the conversation: 513 messages shared by 378 people, and here's the cluster of different links, tags, topics, and so forth that was detected. We can do the same thing for types of messages. So looking at opinions about Pfizer, for example, nearly 2,000 opinions shared by about 1,400 people; we can go in and look at what the opinions around Pfizer were in this period of time and so forth.
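The emotional-language analysis described a moment ago is, at its simplest, a matter of matching message text against validated word lists for each emotion and charting the hit rates over time. Here is a minimal sketch of that idea with a toy two-emotion lexicon; the real system uses statistically validated data sets and far richer linguistic handling:

```python
# Hypothetical sketch: lexicon-based emotion scoring aggregated into a daily time series.
# The word lists below are toy examples, not the validated lexicons used in production.
import re
from collections import defaultdict
from datetime import datetime

EMOTION_LEXICON = {
    "anxiety": {"worried", "afraid", "nervous", "scared", "risk"},
    "sadness": {"sad", "heartbroken", "tragic", "loss", "grief"},
}

def emotion_counts(text: str) -> dict:
    """Count lexicon hits per emotion in one message."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {emotion: len(words & vocab) for emotion, vocab in EMOTION_LEXICON.items()}

def daily_emotion_series(messages):
    """messages: iterable of (datetime, text). Returns {emotion: {date: hit_count}}."""
    series = defaultdict(lambda: defaultdict(int))
    for ts, text in messages:
        for emotion, hits in emotion_counts(text).items():
            series[emotion][ts.date()] += hits
    return series

# Example: two toy messages about the same issue
sample = [
    (datetime(2015, 5, 29, 9, 0), "So sad and worried about this tragic report"),
    (datetime(2015, 5, 30, 11, 0), "Nervous parents are scared after the news"),
]
print(daily_emotion_series(sample)["sadness"])
```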
So, coming back to the demo, we can explore all of this interactively and in real time. I'll also mention that we can see all the people and all the links, but we won't go into that in the interest of time. And we're able to see things like all the languages and the geography. So, 74 languages. Let's look at Russian, for example, about Pfizer, and then explore that conversation around Pfizer.

So far what I've shown you is activity using social data. Now I want to show you that this is live and searchable. I'll take a topic out of the blue today and just show you that this works live as well. Let's look at FIFA today: what's happening with the FIFA scandal? So it's pulling in data in real time, analyzing it in real time, visualizing it, and showing us what's going on. I'm able to quickly get a sense of whatever it is that people are sharing. What are the trending links? What are the recent links? Exactly what's going on with this issue in real time. We can then walk through it and explore it. If I see, for example, this particular hashtag, I can just click on it and explore it; it pulls in new data and analyzes it in real time.

Okay, so going back to where we were looking at these standing queries, we were just pulling in data about Pfizer. Let's look at Pfizer on TV and radio as another example. Here, this is 3,000 radio stations and TV stations, looking at Pfizer. They were just mentioned on Market Call. We can pull up the media with the transcript and actually play it, and in real time we'll see what was said. So here's Pfizer mentioned in the transcript. We're actually able to detect these, so we can go through and play the media and so forth. And we can see all the different TV stations that talk about Pfizer and get their analytics, and we can see the different programs where Pfizer is being mentioned, and so forth.

All right. Now let's jump out of social and TV and look at something completely different, because the big point here is that this is data agnostic. So now let's look at 500,000 Enron emails. These are emails between 1999 and 2002 inside Enron, and what we're looking at here is the secret BCC network. So who was BCC'd the most? Apparently Mark Fisher. And what was the BCC network, and at any given time, what was happening on it? If I go in and look at a particular moment in time in Enron, just going back in time here, it'll pull in the data and visualize it in a moment. So this is what was happening on the BCC network in Enron at that moment in time. Perhaps instead of BCC we want to see topics; at that moment in time, those were the key topics that people were talking about. Let's look at a different spike here in February and see what they were talking about. And so we can see zip codes and other things that people were talking about. We can drill in, open these emails, look inside them, and see what people were talking about, for example, this possibly suspicious email. So we're able to analyze historical email and also look at it over time. We can look at sentiment over time. So here's sentiment in Enron; let's do this over all time and look at the sentiment behavior in these 500,000 Enron emails over this multi-year period. Just give it a second; it'll pull the data and load it in. And here we can see a visualization of how sentiment was behaving in this time period. We can also go in and analyze a little more deeply.
Let's look at the people in Enron and graph them by anxiety. So who was most anxious in Enron? Here we see the most anxious people in Enron; let's add them to the graph. In particular, we see Vince Kaminski here; he was one of the most anxious people in Enron. So it'll just graph this, and let's look at Vince Kaminski and drill into him. This is Vince Kaminski, and these are messages, and these are the people that he was talking about or sharing things with. Let's go to a certain period of time, like October of a particular year here, and look at what's going on. And what we see, depending on whether I've gone to the right zone, is that when Vince Kaminski got nervous, when he had high anxiety, and let me see if it shows up here, let me find the right spot, what he did was he would CC, or rather BCC, company messages to his personal AOL account. Here it is. So this is his AOL account. When his anxiety spiked, he would start sending company documents to his personal AOL account. This is a behavior pattern that we can automatically detect, and it's one of many kinds of behaviors. We'd also be able to look at, for example, the topics at that particular time, what Vince Kaminski was interested in at that moment, what he was talking about, as well as what topics he BCC'd to whom. So the system can go in, do this analysis, and figure out the structure of the topics and recipients and so forth in this period of time. So that's an example looking at email.

Now let's look at something quite different. This is network data. We're looking at IT data from our own networks and infrastructure around a free app that we share called Sonar Solo; it's a free, noncommercial-use-only app. And what we're looking at is IP addresses and IP locations: IP addresses hitting us and where we see those coming from. Let's go in and analyze this for a moment. So here, let's look at all time and just look at the remote IP addresses. We can see various IP addresses hitting our system during this period, and what we notice are particularly interesting spikes and anomalous behavior. Again, these are the kinds of trends that we can detect and alert on. We can drill into any one of these and look at what the IP address was, what request path was used, what the referrer domain was, who the referrer was, and what kinds of status messages or other activity we see around it.

Let's look at another type of data, and then I'm going to move more deeply into where we're going. This is Chinese data from Sina Weibo, a similar case. Let's look at fraud data from the dark web. Here we're looking at what hackers and cyber criminals are sharing, credit cards and so forth. We can do the same thing that we did on social data on these other kinds of data. Let's look at one more example, perhaps app server errors. So, looking at infrastructure errors and trying to understand error activity, rather than looking at the organization or referrer here, I might want to just look at the request method, request path, and so forth. So we're able to do all kinds of different analytics across these different data types. I'm going to go back to the deck, so you should be able to see my slides again, and let's continue on and get into the technical stuff. So I've shown how we do data-agnostic stream intelligence.
Now I want to talk about the platform that lets us do this on 72 billion messages a day. There are two layers: there's the app, which I've shown you, which is Nerve Center, and then there's a platform under it with APIs. We've got a discovery engine, an analytics engine, and below that we ingest, augment, and store the data. So Dominic, why don't you start covering this, and just let me know when to advance the slides.

Yeah, you can skip ahead to my section. Yep, 22. All right. So, in order to get deep insights out of streaming data, you need to do a lot of processing, and in fact you need all of these processing layers to get the kind of insights that we have in Nerve Center. Each layer needs to work at scale, and it needs to work with the other layers. Just skip ahead.

Okay. So it starts with the lowest layer in our system, which is the augmentation engine. The augmentation engine enriches unstructured data. Examples of this are sentiment analysis on text, natural language processing, geocoding, and psychographic analysis on social media. And as Nova mentioned, we have some very specific augmentations in there for dealing with messy social media data, like, for example, the 150 different spellings of the word "tomorrow" on Twitter. You can skip ahead, Nova.

Yep. So the layer on top of the augmentation engine is the analytics engine. It takes the unstructured data and turns it into analytical outputs. We allow advanced search and aggregation using a simple OLAP-like interface. Can you go back? Now, we have a mapping system in there, so we can simply add a new data source by configuring a new mapping that describes the data and the insights that you get from it. The final point about the analytics engine is that it's really fast. It's radically different from traditional batch-processing-based approaches. You can ask the analytics engine a question, and it will give you an answer within a second. That is what allows the type of interactive querying that Nova showed in the demo. Can you skip ahead, Nova?

All right. So, as you can see from the demo, you can get insights pretty quickly by just clicking around. But there are many different ways that you can slice and dice the data; there are countless ways. Each data stream in the system really has billions of potential data points, and as data gets added and streams into the system, more and more data points become possible. So there's an inherent signal-to-noise problem with analytics. Next slide. A lot of the other generalized analytics solutions kind of stop there. They allow you to query the data and get analytics from it, but that's where they stop. We try to continue to elevate these analytics into intelligence, and the detection engine is the first step in this intelligence mining process. Next slide. In the detection engine, we systematically walk through all the data points. So that clicking around in the user interface that Nova showed you, we automate that in the background. We take all those data points and do predictive analytics on them, and the result is a continuous stream of categorized signals. This is essentially a recommender system for the analysts. It tells the analysts: hey, if you're looking for gold, these are the specific places that you should be drilling, and you have a high chance of finding it.
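Conceptually, that background sweep amounts to enumerating entity and metric combinations, pulling each one's time series from the analytics engine, scoring it, and emitting only the interesting ones as signals. The sketch below is a hypothetical simplification of that loop; the scoring function, signal categories, and interfaces are placeholders, not the actual detection engine:

```python
# Hypothetical sketch of the detection sweep: walk (entity, metric) pairs, fetch each
# time series from the analytics layer, score it, and emit categorized signals.
# fetch_series() and the scoring threshold are illustrative stand-ins.
from statistics import mean, pstdev
from typing import Callable, Iterable

def spike_score(series: list[float]) -> float:
    """Score how far the latest point sits above the history (simple z-score)."""
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return 0.0
    sigma = pstdev(history) or 1.0
    return (latest - mean(history)) / sigma

def sweep(pairs: Iterable[tuple[str, str]],
          fetch_series: Callable[[str, str], list[float]],
          threshold: float = 3.0):
    """Yield a categorized signal for every (entity, metric) whose series looks anomalous."""
    for entity, metric in pairs:
        score = spike_score(fetch_series(entity, metric))
        if score >= threshold:
            yield {"entity": entity, "metric": metric,
                   "category": "spike", "score": round(score, 2)}

# Example with a fake fetcher: one spiking series and one flat series
def fake_fetch(entity, metric):
    return [4, 5, 5, 4, 6, 5, 40] if entity == "birth defects" else [4, 5, 5, 4, 6, 5, 5]

signals = list(sweep([("birth defects", "mentions"), ("aspirin", "mentions")], fake_fetch))
# -> only the "birth defects" series is emitted as a spike signal
```

In the real system the scoring is done by the Anticipate component described below rather than a single z-score, but the overall shape of the loop, sweep, score, filter, categorize, is what the talk describes.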
Next slide. Here's an example of a spike detection on social media. This is a social media stream that monitors intelligence topics throughout the world. In the top graph, you can see that the system detected the Nairobi car bomb explosion. When we zoom in on that specific detection, the system picked up a BBC article, but before the BBC article, it had already picked up the topic Nairobi in the data, and it clustered those together. Next slide, Nova.

Here's a screenshot of an internal tool that we have. This is a data tool called StreamSense; it is a workbench for data scientists. Here it's showing the output of the detection system on the web traffic data that Nova quickly showed. So here we're detecting things in the HTTP log files, and we've reduced all those billions and billions of data points, over one year of data, to about 15,000 detections, and we're now looking at the highest-scoring spikes. Next slide, Nova. And indeed, if we click on these detections, we see one referrer domain standing out, and we see the co-occurring entities: what IP address, what location, the different request paths. Here, this detection is in fact a third party that was stealing some code and causing a lot of traffic.

So, zooming in on the detection engine a little bit more: even within this layer, there are a lot of different processing algorithms going on. There's not one magic algorithm that magically gives you the results; it's really layer upon layer of processing in order to accomplish these things. It starts with the detector. It continuously hammers the high-performance analytics engine; in the production system, it's currently analyzing a million events per second. It takes those data points and converts them to time series. Those time series are then sent to a component that we developed called Anticipate, which is a predictive analytics engine. Then the detections go through a context-gathering phase, where we find co-occurring metadata and other entities, and then the detection goes through a clustering phase, where we look at previously detected things and try to combine them together. Next slide, Nova.

The Anticipate library is where the math of all this lives. We use several statistical and machine learning heuristics to accomplish the time series classification. We have a lot of tools in here for training the system, so, training it to recognize certain types of spikes. It's high performance; we can scale it up by just adding more CPUs. Next slide, Nova.

The fourth layer in the system is the correlation engine. In this engine, we systematically compare time series in order to find hidden patterns that we wouldn't pick up by looking at the entities individually. Next slide. And then the final, fifth layer of the system is our rules and agents engine. Here we can query all the outputs from the previous processing layers, and we can program rules that trigger webhooks. Our system is very different from traditional complex event processing, where you need programming knowledge in order to set very specific rules that trigger actions on your data. The query language that we're developing here is really high level. Example queries are: alert me whenever one of my employees has an anxiety increase of 20% compounded over two weeks, by analyzing the emails, for example; or, alert me whenever one of my company's assets shows up on the dark web.
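As a rough illustration of what such a high-level rule might boil down to once the lower layers have produced their outputs, here is a sketch of the first example, the anxiety alert, written against hypothetical detection outputs. The actual query language and webhook interface are not shown in the talk, so every name and threshold here is a stand-in:

```python
# Hypothetical sketch of a rules-and-agents check built on top of the lower layers'
# outputs: a weekly anxiety score per employee (from the augmentation/analytics layers)
# and a webhook action. Names, thresholds, and interfaces are illustrative only.
import json
import urllib.request

def anxiety_increase(weekly_scores: list[float]) -> float:
    """Fractional change in anxiety between the last two weekly averages."""
    if len(weekly_scores) < 2 or weekly_scores[-2] == 0:
        return 0.0
    return (weekly_scores[-1] - weekly_scores[-2]) / weekly_scores[-2]

def check_anxiety_rule(employee_series: dict, webhook_url: str, threshold: float = 0.20):
    """Fire a webhook for any employee whose anxiety rose more than `threshold` week over week."""
    for employee, weekly_scores in employee_series.items():
        change = anxiety_increase(weekly_scores)
        if change > threshold:
            alert = {"rule": "anxiety_increase", "employee": employee,
                     "change": round(change, 3)}
            req = urllib.request.Request(webhook_url,
                                         data=json.dumps(alert).encode("utf-8"),
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)  # POST the alert to the configured webhook

# Example input: weekly anxiety scores per employee (toy numbers)
series = {"employee_17": [0.10, 0.11, 0.15], "employee_42": [0.12, 0.12, 0.12]}
# check_anxiety_rule(series, "https://example.com/hooks/alerts")  # employee_17 would trigger
```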
The dark web example would be a query where we take a dynamic stream of internal company data, for example IP addresses and employee names, and compare that dynamic set to the things that we've detected on the dark web. Next slide, Nova. And I'll give it back to you, Nova.

Yeah, so that's it at a very high level. Underneath the hood, how do we do this? What kinds of infrastructure do we use? Our platform is proprietary, but it lives on top of some open-source technologies like Elasticsearch, Cassandra, Redis, and others, and Dominic can speak about that a little bit more. The idea here is that we're providing intelligence as a service. This is the next step after analytics, and we want to do this across functions and across data types for every organization. Today, what we see are highly siloed solutions. You have solutions like Splunk in IT; you have solutions like Omniture for marketing and web analytics; you have solutions in the financial space that do sales and transaction analytics. But enterprises today need something that cuts across all these categories, because the gold is not just within one data type, it's really in the relationships between them. So, for example, what is the relationship between your spending on TV advertising and sales in different markets? Or what is the relationship between sentiment about your brand on social media and trading activity for your stock? Can we discover particular patterns that are either indicative of impending change or even predictive? And in fact, yes, we can. So what we're looking for is not just these threats, opportunities, or other trends or anomalies within one type of data, but the ability to find them across all the different data streams that an enterprise looks at, both internal and external. So with that, I think we'll open it up for questions in the time we have remaining, and we can just have a discussion here with the folks that are part of the panel.

Great. Nova and Dominic, thank you so much for this fascinating presentation. While we give people a chance to ask any questions that they may have, I want to let everyone know that you'll have an opportunity to meet Dominic face-to-face later this summer at the 2015 Smart Data Conference. It's going to take place in San Jose, California, August 18 through 22, and I hope you're all able to join us. It'll be an opportunity to meet Dom and speak to the Bottlenose team about what they're doing. It looks like we don't have any questions yet, but please do put them in; we'll wait another minute here while people are gathering their thoughts.

While we're waiting, Nova and Dom, let me ask you: is there any type of data that you've found particularly difficult to work with, or that needs to go through any additional cleaning process, before you're able to bring it into the Bottlenose system?

Dominic, why don't you answer that? We haven't done a lot of Internet of Things data yet. Every data type has its own challenges when it comes to data preparation. We have some tools that we developed to allow you to quickly filter data before loading it in, but the system is pretty flexible. So, yeah, no issues yet.

Yeah, a lot of what we've done over these last five years is mitigate that issue by putting in all kinds of algorithms and heuristics that are resilient to bad data. So, you know, Twitter data, for example, is really bad data. It's really, really messy. Even when it's in English, it doesn't look like it's English.
And traditional NLP systems tend to produce pretty poor results when there's no punctuation and there are missing letters, missing words, strange grammar, and, you know, odd things like hashtags and usernames appearing in the middle of sentences. So we've made a system that's pretty resilient to bad data. And the mapping layer, when we bring in a new data source for the first time and normalize it into the kind of high-level internal representation that we use, also solves some of those issues. The augmentation layer can do further data cleaning, for example by adding metadata about entities and other things that are found. Now, for matching between ambiguous terms in different data sets, there's always work to be done. Disambiguating a name that occurs in one stream and possibly also occurs, slightly differently, in another stream is, you know, a problem that is not unique to us; everybody has that challenge. We have some interesting ways of handling it by looking at the topology of the context around those terms. Effectively, we're building a real-time semantic graph.

Yeah, in traditional natural language processing and NER, that is, named entity recognition, the document is always used as the context. But with a lot of small events and streaming data, you usually don't have enough context in the document; a small tweet, for example, has very little context. So if you take the stream of data as the context, you can start making some of those disambiguation decisions after the fact, at the stream intelligence level.

Yeah, so it helps with human curation at that point. I mean, the system automatically cleans and organizes the data in an emergent, bottom-up fashion and then does discovery to identify the most likely candidates for further analysis. So it's giving you the analytics: you can drill down, explore, and pretty much walk through the data in real time to your heart's content. But the system is also making these high-level recommendations: here's a trend you should look at, here's something unusual that you should look at. And it alerts analysts. One of the big pieces of feedback we've always gotten from analysts is that they'd like to be able to leave the office once in a while. They don't want to be glued to their terminal watching all the time, especially when they're dealing with high-volume data. They want to get notified when there's something really important according to their criteria, and then they'll look. And so we've spent a lot of time working on a system that can do that in a unified way, at a highly abstract level, across any kind of data. One of the important points is that the trend detection and the other anomaly and pattern detection algorithms and heuristics work at a level that effectively doesn't care what the original data was. And that's really important because we're entering a world where the variety of data is exploding. Every app has its own little data format; some of them are using standards, a lot of them are not. So you have to be able to bring in a huge variety of data. Within an enterprise there could be hundreds of different data types that you have to look at, and you have to be able to bring those in without having to rewrite your detector layer for each new data type.
So the idea here is that as data flows higher in the pipeline, it gets abstracted into a mathematical representation, really a universal representation, that we can then apply data science automation to without having to know a lot about the underlying data itself.

Another sector that of course has a ton of data is government, and we have a question here: do you currently have any government agencies using this system? Well, I really can't comment on that. However, I can say that what's publicly known is that we've done quite a lot of work tracking ISIS and radicalization. In fact, there was an article on the front page of the Boston Sunday Globe last month about this, showing some of our work in that arena. So, for example, we're able to show exactly how radicalization happens, how youth, for example, are radicalized. We're also able to see where different key disseminators are likely to be, what they're talking about, which religious expressions they're using, and which sites they're talking about. We can also see, for example, what locations or topics are of interest to expatriates or others from the region who live in other countries, and we can look at trends in the global media, and so forth. All of this is based on public data; we're not even looking at any private or classified data to do that. In the case of ISIS in particular, a lot of their strategy, if you will, has been to use public media, public social media, to get their message out and also to recruit. So we've been able to analyze that and learn a lot about how they operate, and that's been quite valuable intelligence for organizations that are interested in it.

All right, I think we have time for one more question here. You spoke about rules that use patterns for predictive analytics. What are some examples of recommendations or actions that can be taken?

So, for example, we can detect a customer uprising that has a lot of influencers involved, which starts perhaps as a weak signal but explodes rapidly and starts to spread geographically as well as across media outlets. We can detect that in the early stages. Dominic showed one example where we detected that Nairobi bomb. In fact, we're able to detect these kinds of breaking weak signals tens to hundreds of minutes ahead of their being picked up by larger media outlets. Sometimes it's just a hashtag or a word that suddenly starts to spike in a certain region, because people on the scene are sharing messages about it before it's in any news outlet. In any case, when we find something like that, it generates an alert. So, for example, one of our clients is a very large, I guess you could say crop sciences and bio-pharmaceutical-oriented company, and they are tracking threats against their company, which could be, for example, content that has just come out and could be inaccurate, or protests that are suddenly being organized, or even discussion about them by hackers. We're able to see these just when they begin, and if they spread or behave in a way that's different from what we would predict, that's of particular interest. So we're constantly running, as Dominic mentioned, predictive analytics against all these time series, and what we're doing is effectively figuring out what the baseline behavior for each time series is, on a running basis, and predicting an hour to a few hours ahead what we think it will do.
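A minimal sketch of that idea, keeping a running baseline for a series and flagging points that land far from the short-horizon forecast, might look like the following. The exponential smoothing model and the tolerance band are assumptions for illustration, not the actual Anticipate algorithms:

```python
# Hypothetical sketch: maintain a running baseline for one metric's time series with simple
# exponential smoothing, forecast the next point, and flag large deviations from the forecast.
# Alpha, the band width, and the warm-up length are illustrative choices, not production values.

def forecast_and_flag(series: list[float], alpha: float = 0.3, band: float = 3.0):
    """Return (next_forecast, deviations), where deviations are (index, value, forecast) triples."""
    level = series[0]            # smoothed baseline level
    abs_err = 0.0                # smoothed absolute forecast error (roughness of the series)
    deviations = []
    for i, value in enumerate(series[1:], start=1):
        forecast = level                          # one-step-ahead forecast is the current level
        error = value - forecast
        if i > 5 and abs(error) > band * max(abs_err, 1e-9):
            deviations.append((i, value, forecast))   # value far outside the expected band
        abs_err = alpha * abs(error) + (1 - alpha) * abs_err
        level = alpha * value + (1 - alpha) * level   # update the running baseline
    return level, deviations

# Example: a quiet hourly mention count with one sudden burst near the end
hourly_mentions = [5, 6, 5, 4, 6, 5, 5, 6, 5, 48, 30]
next_forecast, flagged = forecast_and_flag(hourly_mentions)
# flagged -> the burst at index 9 (value 48) is reported as a deviation from the baseline
```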
And when we see significant deviations from that expected behavior, that begins to percolate upwards through our detection layers, where other algorithms and heuristics are applied in order to determine whether it's signal or noise. So the actions that can result could be as simple as somebody saying, this is inaccurate information, and making a response, perhaps on a website or through their PR channel. Or it could be a security threat, in which case the risk team or the security team or the fraud team in an organization can respond to that signal in real time, or they can do forensics around it after the fact. So basically, pre-boom and post-boom intelligence around security threats generates a range of different kinds of actions.

Sometimes we also detect opportunities. For example, a major software company launching a gaming platform used us to predict what the main themes for the launch of this gaming platform would be. We were able to measure historical activity as well as live activity, and they used that, several days in advance, to develop content for their launch and also to choose which advertising keywords to buy, and that was actually very successful. In some of those cases, brands and agencies have been able to use our intelligence to cut their spending in half but get double the results, because we're able to show them what they really should be targeting. In other cases, it's discovering new markets, new opportunities, finding a new customer need. For example, we discovered a rising need and opportunity around coconut milk in the coffee industry, and we discovered that something like six months before Starbucks actually announced coconut milk in all their stores. We found that, as well as other findings, and shared the information publicly. So we're able to see where industries and markets are moving and make predictions. So there's a real-time type of scenario: we see, for example, live emergency landings and live bomb threats all over the world. During the Ebola situation, we were able to see people on airplanes or in airports talking about the person next to them exhibiting symptoms that could be Ebola symptoms. So that's live threat intelligence. And then there's the longer term, whether it's longitudinal, historical, or even future types of trends that we're able to provide intelligence around.

That's great. Thank you so much, Dominic and Nova, for this great presentation and your time today. I'm afraid we are over time here by a minute, so we're going to cut it short. Just to remind everyone, we will be posting the recorded webinar and slides to Dataversity.net within two business days, and if you registered for the live webinar today, you will receive a follow-up email to let you know how to access that material. Thank you again for attending, everyone, and I hope you have a great day. Thank you, Dom. Thank you, Nova. Thank you very much. Thank you.