Welcome, my name is Shannon Kemp, and I'm the Chief Digital Manager at our University. We'd like to thank you for joining the current installment of the monthly Dataverse Smart Data Webinar series with our speaker, Adrian Bowles. Today Adrian will discuss knowledge as a service, an introduction to the emerging pre-built knowledge market. Just a couple of points to get us started: due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #SmartData. If you'd like to chat with us and with each other, we certainly encourage you to do so; just click the chat icon in the top right-hand corner for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar.

Now let me introduce our speaker for today, Adrian Bowles. Adrian is an industry analyst and recovering academic, providing research and advisory services for buyers, sellers, and investors in emerging technology markets. His coverage areas include cognitive computing, big data analytics, the Internet of Things, and cloud computing. Adrian co-authored Cognitive, I can say that today, Cognitive Computing and Big Data Analytics, published by Wiley in 2015, and is currently writing a book on the business and societal impact of these emerging technologies. Adrian earned his BA in Psychology and MS in Computer Science from SUNY Binghamton, and his PhD in Computer Science from Northwestern University. And with that, Adrian, I'm going to turn it over to you.

Hello and welcome. Computer science, I think that's a first. My tongue is just tied today. That's too bad that our topic today isn't natural language processing. Indeed. Well, thank you, Shannon. As always, it's good to be here, and I'm looking forward to having a lively discussion at the end. So I guess I'd better get started at the beginning.

Yeah, thanks. So as Shannon said, I want to talk about knowledge as a service, or more broadly the idea of pre-packaged knowledge as a commodity to be bought and sold for those that are developing software solutions, and in particular, cognitive solutions. So let me give you a quick overview. See, this is where it's just sitting here, and I've lost my cursor. There we go. So I'm going to spend a few minutes on the nature of the problem that we're solving with pre-built knowledge and the types of issues that we need to look at, then talk about the idea of designing systems with knowledge or data in mind. And then we get into kind of the meat of it: what are the available sources out there, and how do you get to them? And then a few words on getting started before we go into the Q&A.

So let's dive right in. What is the problem that we're solving with pre-built data or pre-built knowledge? Going back to my traditional overview of the world of AI and cognitive, what I've said for the last couple of years is that when I look at modern AI, the way things are today versus a few years ago when I got started, we're still trying to solve some of the same problems in terms of problem solving, natural language processing. There you go, Shannon, that one's for you. Machine learning and all that good stuff. That all fits in classic AI.
But what has changed in the last two decades, and certainly in the last decade as the pace speeds up, is that we seem to be focused a lot on big data and deep learning, and the two are intertwined. For most types of deep machine learning, you really need a lot more data than we used in the past. So what we want to look at today is how that impacts the way we design systems, and how we get access to data so that we don't have to do everything ourselves.

So in my view of the world, we've sort of got this red circle there, which is the heart of cognitive computing systems, where we have understanding, reasoning, and learning. And for context here, what I'm showing outside with the shaded arrows are the inputs and outputs to our system. So we're getting input, which may be data. Well, it's all data, but it may be coming in the form of human input in natural language or gestures, or tracking people's emotions based on audio signals, et cetera, or it's coming from machines, so it can be coming from sensors in IoT systems. And on the output side, obviously, we may have some output for people or for machines. But that blue circle between the I/O and the cognitive red circle, data management, often gets short shrift when we're talking about these systems. And that's really what I want to focus on today: what is the data that's coming in? What are the different alternatives? How do we get them? How do we process them? And perhaps, how do we design systems with the data in mind?

So if I take the same diagram and change the scale a little bit, really what's happened is I've moved some things, like the idea of concepts and emotions and meaning and intent, into this data management level and out of the cognitive portion. The actual border is a lot fuzzier than I show it here. But the idea is that we need to be able to store a lot of data about the problem domains we're working with, and sometimes that's going to be done before we get into this understanding and reasoning part. By that I mean we want to be able to get a lot of data sources that are already organized in a way that we can use them. So today we're going to look at some of the considerations, how we classify the data, and then what the sources are and how we can actually acquire or procure data that allows us to do the interesting stuff. I say that with all due respect to the folks who think that procuring and manipulating the data is the interesting stuff, but I mean the actual cognition, if you will, or the pseudo-cognition, within the understanding, reasoning, and learning loop.

So let's look at the way the data is going to be stored so that we can actually represent not just data but some meaning associated with the data. I'm generally pretty casual in this sort of a chat about the distinction between data and knowledge, but we'll look at how data gets refined to be knowledge and some of the important considerations when we're choosing data sources. Another part of the reason this is important is that historically, when we built systems that had some semblance of or some attempt at intelligence, the fundamental approach was to use a lot of the human intelligence of the system designer and code that into the application so that it could handle the data we were looking at.
And back in my days as a computer science professor, I always said that if you gave the same programming assignment to a freshman, sophomore, junior, and senior, you could tell by their code what level they were at, because a freshman would write something that would solve the problem just for the test data you gave them. A sophomore would try to show off, perhaps, and write code that would solve every problem in the world, including the one that you asked for. But by the time you got a little more mature, you realized that the important thing was to solve for the broad class of data that the application was supposed to handle, make it robust, and not do much more than that, because anything more adds complexity. So if you look at this chart here, going left to right: in the early days we were really focused on being very smart with what we put in the application. Now with machine learning, and particularly with deep learning, we can let the data direct the activities of the application. It doesn't mean that the system doesn't have to have intelligence or logic. It just means that we're putting a lot more responsibility, if you will, on the data itself. And so the data has to be pre-processed or organized in a way that will allow the system to act intelligently. So the significance of the data, the complexity of the data, the completeness of the data, if you will, is a lot more important today, I would say, than it was just a few years ago.

Now, as we get into different domains and tasks, the nature of the data also changes. This is actually from something on chatbots that we're publishing, I believe next week, with Aragon Research. The idea is that the requirements on the data change as it gets more specific: if the system handles all of health care versus just pharma, it's a completely different set of requirements for the data. And if I'm trying to make a system that will handle any type of input, starting to get up towards AGI, or artificial general intelligence, that's a whole different set of constraints on the data. Usually the sweet spot for a chatbot is down in the lower left: we want something that's tightly defined in terms of task and domain. But the types of systems that we're looking at today would go more towards the middle. I want to be able to build systems that will handle more general tasks or more general domains. And so we want to get past the chatbot level of data and look at what the options are for data sources.

I have a couple more thoughts on organizing this data. I'm going to just make the broad statement that when we're building applications today, I certainly go along with the semantic web approach, where any data that we use should have semantic attributes, or meaning, associated with it. It doesn't mean that everything needs to be completely specified, but using OWL, for example, the Web Ontology Language, enables us to specify more clearly the meaning of data in a way that lets data be linked across datasets and across applications. And so as we look at data sources, one of the things we want to look at is how much metadata or contextual data or semantic information is associated with them, or available with them, so that they can be used in multiple contexts. And that gets to what I was starting to allude to at the beginning, in terms of really imbuing, if you will, the data in the dataset with attributes that you might associate with knowledge.
So things have meaning, and some of the meaning is supported in the dataset by the way it's organized or by the way it's specified. And this is something that, if you've been with us for the last year or two on this series, I've alluded to in the past. I'm a very firm believer that one of the dumbest things we've done in terms of terminology is to make a distinction between structured and unstructured data. I try not to use those terms because, frankly, if it's not structured, if there is no structure, then to me it's not data, it's noise. So when we look at different data sources or different approaches to capturing data, really what I want to look at is the degree of difficulty, if you will, in identifying the structure so that it can be used in a particular application. And just very quickly, to justify that: if you say something is unstructured, it means that it is not structured. Very simple. It's not like flammable and inflammable in English. But if we're ever going to be able to use something as data, it has to have structure. It's just a question of how hard it is to find. If you go on the beach and you find a gold bar on the sand, it's treasure. Well, it's still treasure if it's covered by a few grains of sand, and it's still treasure if it's deeply buried. Saying something has no structure just because the structure isn't obvious, just because we have to go through maybe six layers of a deep learning algorithm to identify the edges and the shape and then what it actually is: to me, that's meaningless. So for all of the sources we look at, when the providers talk about structured versus unstructured, or about finding structure, finding some meaning in a Wikipedia article, just think of it in these terms. Something at the surface is like the gold bar actually sitting on the sand. You can see it. It's obvious. Typically we use that to mean something that was written for machine consumption versus human consumption. But it really shouldn't matter when we're looking at these sources, because for any source that can be used as data, there is a process that will transform it or identify that structure.

So as we're looking at these sources, I'll try to point out when things are organized as taxonomies. A taxonomy is sort of a hierarchical representation, so that we don't just have a flat list. Say you're interested in doing some insurance calculations. You may find a data set of all the registered vehicles in a particular state. Well, just having a list of all the vehicles, without information about how they fit in the world of vehicles, is not particularly interesting. That's data at its most raw form, something with no level of abstraction; it's one level above just letters. But if we can organize that data into a taxonomy as a starting point, it's going to be a lot more useful. Again, it's sort of a loose definition here. We want to go a little further and have an ontology that actually specifies the definitions and the attributes of the entities within the domain, and the domain here is what we're looking at. So when we're looking at data sources, we want to be able to get them properly organized to make the most use out of them. There will certainly be instances where we'll look at fairly raw data and it'll be up to us to discover the meaning. But in general, I'm going to give extra points, if you will, to things that are organized like this.
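To make that concrete, here's a minimal sketch of what lifting that raw vehicle list into a taxonomy with semantic attributes might look like, using Python's rdflib library. The namespace, class hierarchy, and sample entry are all hypothetical; the point is just the progression from flat list to taxonomy to ontology-style attributes.

```python
# A minimal sketch of adding semantic structure to a raw vehicle list,
# using rdflib. The namespace, classes, and sample entry are hypothetical,
# just to illustrate the list -> taxonomy -> ontology progression.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/vehicles#")
g = Graph()
g.bind("ex", EX)

# Taxonomy: a simple class hierarchy gives the flat list its first structure.
g.add((EX.Vehicle, RDF.type, RDFS.Class))
g.add((EX.PassengerCar, RDFS.subClassOf, EX.Vehicle))
g.add((EX.Truck, RDFS.subClassOf, EX.Vehicle))

# Ontology-style attributes: each registered vehicle gets typed properties,
# so downstream applications can link and reason over the entries.
g.add((EX.vin123, RDF.type, EX.PassengerCar))
g.add((EX.vin123, EX.registeredInState, Literal("MA")))
g.add((EX.vin123, EX.modelYear, Literal(2015)))

print(g.serialize(format="turtle"))
```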
So I'm going to point out a couple of things now, and then again at the end, in terms of things to watch out for. One of the issues, even when you're building systems yourself and you have control over the data, is to make a distinction in your logic between things that you know and things that you believe. There's a very big difference, I think, in the way these need to be handled. Specifically, if it's something that is truth, it's immutable; it shouldn't change. If it's something that's a belief, then we need to have data attributes associated with it that will let us know temporally, and perhaps spatially, when that belief was held to be valid. As a starting point here, I used the DSM, the Diagnostic and Statistical Manual of Mental Disorders from the American Psychiatric Association; that was my background. It shows how the definition of autism has changed, or evolved, over the last few releases of the DSM. We're now at DSM-5, which came out in 2013. We have a spectrum; we talk about someone being on the spectrum. There's a list of attributes that are used to define the behavior that defines membership in that set of being on the autism spectrum. Well, human behavior hasn't really changed between DSM-III, DSM-IV, and DSM-5, but where people are classified has changed. And so it's a question of: what is truth? If you're trying to use data, let's say case data on an individual, and trying to categorize them using the DSM classification scheme, then it's very important to know when the diagnosis took place, because you're going to get a different classification for the same symptoms based on the assumptions that were in place for the different releases. It may seem like a fine point, but it's really important to have that temporal information, because there are things that are, as I say, universal, and they shouldn't change. Mathematics, for example. You may have new discoveries in physics, or new elements in the periodic table, but for the things that are there, we know the organization. A lot of what we capture in systems that are going to be cognitive systems, though, is subject to interpretation, and that evolves over time. And so the time element needs to be part of the data stream.
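Here's a minimal sketch of what carrying that temporal information might look like in code. The record layout and field names are hypothetical; the point is just that a belief carries its source and its validity interval with it, so the same symptoms can resolve to different classifications depending on when you ask.

```python
# A minimal sketch distinguishing immutable facts from time-bounded beliefs,
# in the spirit of the DSM example. Field names and records are hypothetical.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Belief:
    subject: str
    assertion: str
    source: str               # e.g., "DSM-IV", "DSM-5"
    valid_from: date          # when this classification scheme took effect
    valid_to: Optional[date]  # None = still the current belief

# The same symptoms classify differently under different releases, so any
# stored diagnosis must carry the scheme and its validity window.
dx_old = Belief("patient-42", "Asperger's disorder", "DSM-IV",
                date(1994, 1, 1), date(2013, 5, 18))
dx_new = Belief("patient-42", "autism spectrum disorder", "DSM-5",
                date(2013, 5, 18), None)

def beliefs_at(beliefs, when):
    """Return the beliefs that were held valid on a given date."""
    return [b for b in beliefs
            if b.valid_from <= when and (b.valid_to is None or when < b.valid_to)]

print(beliefs_at([dx_old, dx_new], date(2010, 6, 1)))  # -> the DSM-IV era record
```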
The last thing on this part, in terms of preparing to bring in data streams from different sources, is to recognize that today a lot of the applications we're going to be building will need to integrate a combination of historical data and current data to make predictions. Whether we're dealing with relatively straightforward predictive analytics or going into some deeper logical systems, usually in a complex environment we're dealing with a combination of both types of data. The current data may be relatively slow moving, and we'll see an example of that, or it may be high-speed streaming data. So when we're looking at the available sources, it's important to understand whether we're going to be able to use everything or whether we're going to sample, and what the implications of that are. I was speaking with a firm yesterday that's doing political polling, and they're doing it 100% based on Twitter. To me there's an issue there, because their historical data includes things other than Twitter, but their current data is only Twitter. And even if you use 100% of the tweets on a particular topic, that still represents only the population of people who use Twitter, which is a subset, a sample, of the people who vote. So we need to be able to look at the sources, how they're created, and how we're going to be able to integrate them. We're not going to have time today to talk about specific products for doing the integration, but maybe we'll cover that another time; if you're interested, just reach out and email me about it.

So the question that I often raise when people are talking about going out and getting new data sources is: do you need to start by identifying a need for specific data? Or are you looking for new opportunities, and you want to find out what's out there and then find new ways to use it? Either way, I think it's important to map out the characteristics of the data that you're going to need, and I do it on these two dimensions. One is how fast the values change; that's the rate of change. Are we dealing with something that's relatively static? It doesn't need to be completely static; it can be slow change versus high-frequency change. The other is how specific the domain is; that's the level of abstraction. The two key things in computer science are always how abstract we can get, because abstraction is power, and how long we can delay binding to a decision. And that really ties to the rate of change.

So natural language, for example. Say we're going to take in a data set that gives us a lot of information in, or about, natural language. Let me stick to the one I know best, which is English. English changes. It's not completely static. There are new words added; there are new meanings added. But compared to the weather, the language is fairly stable. And if you take the totality of the language, then it's pretty general: you can describe pretty much anything we would want to describe in English. That's versus restricting the language to a specific domain, a specific subset, or restricting the meaning to a specific subset. You can still use all the words in the language, the entire vocabulary if you will, but they'll have very specific meanings within a domain like medicine. If we talk about temperature just in the abstract, the word temperature has a lot of different implications depending on how it's used. But if it's used in a patient chart, then there's really only one meaning. We're narrowing that down, and that doesn't change very often. So with the APA diagnostics, it's not that they don't change, but it takes years between editions, versus something that's streaming; the weather changes fairly frequently. And yes, stock prices change faster than the weather. If you combine those two dimensions, when we start to look at what sources are out there, you can find different opportunities for the data depending on where the data fit on this chart. I always turn off my phones, but that's one that's not even connected. Sorry about that.

All right. So the answer to the question, do you determine the need first or identify the resources first? I say it depends on the application. There are times when you're going to build an application and you know what it is: it's in your domain, you want to create some new functionality. But I'm going to encourage you, certainly at the end here, to look around and find new sources of data, and find new uses for existing sources of data, that are now much easier for you to get.
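Before we get into the sources themselves, here's a hypothetical sketch of that mapping exercise in Python. The enum values and example sources are illustrative placeholders, not a prescribed scheme; the point is just that placing each candidate source on the two dimensions drives different ingestion decisions.

```python
# A hypothetical sketch of mapping candidate data sources onto the two
# dimensions discussed above: rate of change and domain specificity.
from dataclasses import dataclass
from enum import Enum

class RateOfChange(Enum):
    STATIC = 1     # e.g., the periodic table
    SLOW = 2       # e.g., English vocabulary, DSM releases
    STREAMING = 3  # e.g., weather, stock prices

class Specificity(Enum):
    GENERAL = 1    # e.g., all of English
    DOMAIN = 2     # e.g., medicine
    TASK = 3       # e.g., temperature readings in a patient chart

@dataclass
class DataSource:
    name: str
    rate: RateOfChange
    specificity: Specificity

candidates = [
    DataSource("WordNet", RateOfChange.SLOW, Specificity.GENERAL),
    DataSource("DSM classifications", RateOfChange.SLOW, Specificity.DOMAIN),
    DataSource("Weather feed", RateOfChange.STREAMING, Specificity.DOMAIN),
]

# Slow-changing sources can be ingested in batch; streaming ones force a
# use-everything-or-sample decision up front.
for s in candidates:
    plan = "stream/sample" if s.rate is RateOfChange.STREAMING else "batch load"
    print(f"{s.name}: {plan}")
```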
So why are we using pre-built knowledge? I'll pick up the pace a little bit here. It's simple: if you can get knowledge that's already been packaged and put in a form that's usable to you, you still want to do some tailoring, obviously, so that it fits your application, but it's going to save you time and money, and let you bring something to market faster. The example here is from Cycorp, down in Texas. They've been working on a model of the world's knowledge, if you will, all together for about 30 years. And what they have with OpenCyc is a platform where you can get access to this full representation. You can read it; I'm not going to go through all the numbers there, but it's an ontology that is an attempt to capture human consensus on the meaning of words and the relationships between them. So we've got hundreds of thousands of terms, triples in RDF, and all of the semantic information I was talking about has already been captured. As a starting point, if you were to go in and map out your data needs and say, all right, I need some common-sense understanding of the English language as a foundation for my natural language understanding, this is one of the places you could go to get a data source that's already been validated to some extent, in terms of the process used to capture the meaning. To me, it's very similar to the way the OED, the Oxford English Dictionary, was developed, by getting a large population to contribute meaning based on word usage. So this is a project that's been around for decades and is now being made available to the public.

We also use pre-built knowledge in more specific applications. On the earlier chart, this would be something that's high on the specificity axis: it's not generalized, it's very task-specific and domain-specific. I'm using the example from IBM of Watson Conversation and Virtual Agent, where if you're trying to build an account management system, for example, you can use the pre-built knowledge that they package with their functionality via APIs, so that you're not spending your time creating things for a customer calling in and wanting to do an email change, for example. It's something that's been done a million times. You can just package that. What we want to look at for the remainder of the time is things at either extreme, this being a task-specific, domain-specific set, and the Cycorp work, which is much more general use of English, and see that in the last several years things have opened up to the point where it's really easy, and in general really inexpensive, to get access to data that would have been impossible, or impossibly expensive, to get just a few years ago. And that's where we're going with this.

So how do we procure the data? I'm going to look at four approaches, and which one or combination you choose is going to depend on the application itself. But this is trying to get you to think in terms of: okay, if I know what I need, I can see if any combination of these four sources will help me. And if I don't know what I need and I'm trying to create something innovative within my industry, let me look at what's out there that I can either acquire or repurpose. With that in mind, I'm going to start with YAGO, Yet Another Great Ontology. There's no shortage of great terms here, even when we go to four letters.
So YAGO, excuse me, is a joint project of the Max Planck Institute and Télécom ParisTech, and I'll bring this one out first. It's open source, and this is information from their development site: a semantic knowledge base derived from Wikipedia, WordNet, and GeoNames. Now, Wikipedia is obviously a great source for a lot of things. There's obviously stuff in Wikipedia that is wrong, but there's more that's right with it than there is wrong, is the way I would put it. And Wikipedia is something that actually changes fairly rapidly, particularly when there's a world event; it changes a lot faster than the old encyclopedias it tends to replace. So this is an ontology derived from Wikipedia and WordNet, and I'll mention WordNet in more detail in a minute, and it has over 10 million entities. What I like about it, besides the fact that it's quite comprehensive and there are a lot of people involved in it, is that a fact, or an entity, can also have a temporal and a spatial dimension associated with it. As I was saying before, there are things that apply within a time range. If you look at dietary restrictions, for example, what we think is safe today may not be what we think is safe tomorrow, and what we think today is definitely not what we thought just a few years ago. So it's good to know where and when something applies, in terms of truth or belief. YAGO also has a really nice graph browser. You can sort for things by region and by time, and it's a great starting place. A lot of the things in there were actually used by the original IBM Watson team when they were building the knowledge base to play Jeopardy, finding those relationships. It's good to have a pre-built solution to use as a starting point.

The goal of DBpedia is to extract "structured" data, and I put it in quotes, I hate the term, from Wikipedia. Again, this is from their entry. Because it's open source, you can get access to it. It describes the entities using an ontology. What it's trying to do is provide some organizational properties to the raw data that's available in Wikipedia. Instead of you trying to build something that's going to scan all of Wikipedia and then process that, if you start with DBpedia or YAGO, someone else has done the work of building that framework, building that ontology, building those relationships, and doing the mapping. That will save a lot of time. As we'll see, there are some issues, because you want to make sure that the assumptions being used in these sources are consistent with your own; there are a couple of examples at the end that I'll use as cautionary tales.

WordNet was mentioned earlier; just another plug for WordNet as an ongoing project based out of Princeton that has already organized the English language, to a large extent, in a way that makes it much easier to build systems that read and interpret natural language. When we talk about natural language processing, you break it up into natural language understanding and natural language generation. In terms of the understanding, a lot of what you need to do if you're going to build a system, maybe a question answering system or a diagnostic system, is that first step of breaking out the language and then storing it in a way that you're going to be able to use it. If you use the meanings that have been adopted by the teams building things like WordNet or Cycorp, then again, it's a huge time savings.
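To show how little work it takes to tap one of these pre-built sources, here's a minimal sketch of querying DBpedia's public SPARQL endpoint with the SPARQLWrapper Python library. The endpoint and the dbo: ontology prefix are DBpedia's own; treat the specific query as illustrative.

```python
# A minimal sketch of pulling pre-structured, Wikipedia-derived facts from
# DBpedia's public SPARQL endpoint, using the SPARQLWrapper library.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/Boston> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    # Someone else already built the ontology and did the mapping.
    print(row["abstract"]["value"][:200])
```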
You can build systems today far quicker and spend your time on the differentiating value you add, based on your interpretation of that higher-level meaning. As I said, the more abstract you get, the more powerful you get. These projects have taken the raw data, which is very concrete: it comes in as characters, words, sentences. You start with your parsing and then your semantic analysis. If you're taking in a raw data stream of text, you only have to do that first syntactic part yourself, and there are plenty of open-source parsers out there to do it for you. Then the next step, in terms of understanding, is already done if you're using one of these ontologies.

How do you do it in practice? You can build your system pretty much from scratch, but the types of data streams I've been talking about, the relatively static ones, because they're language-based, are all available as open-source feeds. I'm going to go through representative sets from three of the larger commercial players, who have a vested interest in getting you to use their cloud services. It's kind of like the telephone carriers that love to have you buy different types of content because it uses the minutes, or however you're being metered. Here's one of the examples, for AWS. These are the public datasets you can get if you're building your system on AWS. It has everything from machine learning datasets to public datasets on banking to geospatial datasets. These are all free; they're all available to you, whatever sort of app you're building, as ongoing datasets. And I will note that some of those, like the NASA datasets or the EPA risk screening indicators, are things that I'll describe more generally as open data from the government, and I'll have just a minute on that. But basically, to build a system today, if you identify the nature of the data that you need, one of the public cloud providers will probably give you access to it. So that's a representative set from AWS.

From Google, with the Google Cloud Platform: on the left are the commercial datasets, a set of datasets you can get through Google that are fee-based, so Dow Jones, AccuWeather; versus on the right, the BigQuery datasets, which are in general free. And you'll see again that a lot of those include government data, or open data from government, like the New York City 311 service requests. That's a public database you could get directly from the New York City government site, or you can get it through Google and some of the other cloud providers.

Microsoft is another one: a lot of government agency data, from the federal level to the state level, down to the New York City taxi data. I think it's fascinating that you can get taxi trip records for free. The data comes from New York City, and one of the ways you can get it packaged so that you can use it through an API is on the Microsoft Azure platform. And again, a lot of this is provided by government agencies.
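As a quick illustration of the cloud-provider route, here's a minimal sketch of querying the NYC 311 open dataset through BigQuery's public datasets, using the google-cloud-bigquery client. The exact table path and column names are assumptions; check the current public-datasets catalog for the live names.

```python
# A minimal sketch of querying an open government dataset (NYC 311 service
# requests) through Google BigQuery's public datasets. The table path and
# column names are assumed; verify against the public-datasets catalog.
from google.cloud import bigquery

client = bigquery.Client()  # requires GCP credentials in the environment

query = """
    SELECT complaint_type, COUNT(*) AS n
    FROM `bigquery-public-data.new_york_311.311_service_requests`
    WHERE created_date >= TIMESTAMP '2018-01-01'
    GROUP BY complaint_type
    ORDER BY n DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.complaint_type, row.n)
```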
Customers are another great source of data, especially if you can get them to opt in. The example here is that most modern cell phones, mobile devices, have a number of sensors, and sensors that are built into your phone for specific functions the carrier provides can also provide data to applications. I would encourage you to think about all the different ways this can be used. The barometer in your iPhone, for example, is used to measure altitude. If you're using a health app that tells you how many stairs you've climbed, it's likely that it's actually reading the information from the barometer, but that can also be used for crowdsourced weather reporting. Here's just one example. This is a weather app from the Dark Sky company, whose claim to fame is hyper-local forecasts. They're able to do that at a finer level of granularity than the National Weather Service or some of the other services because they have a large population of opt-in users providing barometric data that they can localize down to, you know, plus or minus a meter. So by looking at what sources are out there, particularly as more and more IoT sources emerge, you can even identify new business opportunities.

I mentioned open data; just a couple more words on that. Most data sets produced by government agencies, particularly in the US, are made available to the public in the name of transparency, unless there's a privacy or security issue involved. Sometimes what you can get through an API is fairly crude, but everything that New York City produces, for example, you can go onto a New York City website and get access to. In some cases, value will be added by providers that organize it, maybe run it through either a proprietary or an open ontology, start to ascribe some meaning and interpretation to it, and then turn that into a feed. But the raw data is there.

One of my favorite apps, since I spent a lot of time driving in Boston, is one they developed a few years ago called Street Bump, and this uses the same kind of data that's available to you. The way it works is they have people opt in with an application, and as you're driving around Boston, obviously the app knows where you are; the GPS in your phone handles that. But by using data from the accelerometer, they can start to map aberrations in the road, which are interpreted so that they can understand where there are potholes. That was the original intent. Okay, well, if I took that data and I took weather data from another source, I may be able to create a service that's going to predict automobile failures, or predict demand for ride-sharing services. There are all sorts of ways to combine this. Almost every city has something similar where you can get this data. They don't all have something that's using it for potholes, but they're certainly gathering it for planning, and in most cases it's available to you.

The city of Chicago, another place I've lived in my checkered past, has been using sensors to do a lot of environmental monitoring. These are placed on lampposts. And again, a lot of this is available: if you just know that it's being collected, then you can figure out which government agency has it and start to put it to use for your own commercial purposes. And there's free data that's used by companies like Uber. Obviously Uber's not going to just give you their data, but the way they're able to provide services is by leveraging a lot of this open data that is available. So it's a combination of that opt-in customer data and freely available open data.
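Just to make the Street Bump idea tangible, here's a toy sketch of flagging candidate potholes from opt-in accelerometer and GPS samples. The threshold and field names are hypothetical; a real system would do far more filtering before calling anything a pothole.

```python
# A toy sketch, in the spirit of Street Bump: flag candidate potholes from
# opt-in accelerometer + GPS samples. Thresholds and field names are
# hypothetical; a production system would filter far more carefully.
from dataclasses import dataclass

@dataclass
class Sample:
    lat: float
    lon: float
    z_accel: float  # vertical acceleration, in g

BUMP_THRESHOLD = 0.5  # deviation from 1 g that we treat as a jolt

def candidate_potholes(samples):
    """Yield locations where the vertical jolt exceeds the threshold."""
    for s in samples:
        if abs(s.z_accel - 1.0) > BUMP_THRESHOLD:
            yield (s.lat, s.lon)

readings = [Sample(42.3601, -71.0589, 1.02),
            Sample(42.3605, -71.0592, 1.71)]  # the second one is a jolt
print(list(candidate_potholes(readings)))
```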
Location and proximity data: a lot of this is done at the hardware, handset level. I just mentioned this. If you're starting to think in terms of applications that would use this data, then you might need to know where it's coming from, and in many cases it's the IZat location technology built into handsets that use the Qualcomm chipset. So looking at what you need and what's available, somewhere in there, I think there's usually a good combination.

Now, just a couple of words of caution. I don't want to get all Heisenberg on you here, but just making data available will sometimes change the behavior being measured, as more people know about it. This is an example that I picked up a couple of years ago, where a lot of side streets that were not heavily trafficked in the past have become bottlenecks because applications like Uber are starting to send people down these side streets to get around the traffic jams. And so in some places people are basically trying to post decoy reports to thwart that routing. So, depending on the risk associated with your use of the data, it's important that you know the data itself is accurate and that you're not dealing with phantoms. Sort of like local hackers trying to influence an election; here it's local hackers trying to influence traffic flow.

There's also the issue, as I said, that you need to think about how the companies are modeling things, and it's probably not something most people give much thought to. When you're trying to build an app and you're looking for somebody's location based on their IP address, maybe you're doing political polling, maybe you're doing some other type of forecasting. Some of the services that provide geolocation based on IP address, when they can't resolve the actual address, have to give some address rather than an error. MaxMind is one place that provides such a service, and when an address can't be resolved, they default to the center of the region. In the specific case that came out a couple of years ago, it was determined that a farmhouse in Kansas was at the center of the US, and the service providing the data basically listed it as the home of 600,000 IP addresses. So a lot of folks who were trying to track down the source of some dubious characters based on their IP address got the default physical address, and that led to some uncomfortable situations for the homeowners, who were being associated with those 600,000 IP addresses. It was simply a case of that being the default, and up until the point that they ran into legal issues, nobody had thought about why that was a problem as a default. So if you're depending on the veracity of the data, you need to understand any assumptions that are being made.
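Staying with the MaxMind example for a second, here's a cautious sketch using their geoip2 Python library: rather than trusting a point coordinate, check the reported accuracy radius, so a region-centroid default isn't mistaken for a street address. The database path, the lookup address, and the cutoff are assumptions; the reader API is geoip2's.

```python
# A cautious sketch of IP geolocation with MaxMind's geoip2 library: check
# the reported accuracy radius before treating a coordinate as a location.
import geoip2.database
import geoip2.errors

MAX_TRUSTED_RADIUS_KM = 50  # hypothetical cutoff for "usable as a location"

with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    try:
        loc = reader.city("198.51.100.1").location  # placeholder address
        if loc.accuracy_radius and loc.accuracy_radius > MAX_TRUSTED_RADIUS_KM:
            print("Coarse region centroid; do not treat as a street address.")
        else:
            print("Approximate location:", loc.latitude, loc.longitude)
    except geoip2.errors.AddressNotFoundError:
        # Return "unknown" rather than defaulting to some farmhouse in Kansas.
        print("No location on record for this address.")
```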
One more of the interpretation issues is with beliefs. Something I'm spending a lot of time on right now is looking at anonymizing profiles so that we can normalize natural language. You probably all know people who exaggerate, or people who underestimate, and within your circle of friends, the same thing said by different people has different meanings. Well, that's fine if you know the people, but as we start to build systems that handle natural language, you need to know how they're interpreting or ascribing meaning when they don't have a large enough sample from an individual. The very meaning of words changes with context, and sometimes that's reflected in the individual who's speaking them. That's something we're getting better at, but it's an important consideration. This is an example I took from my old psych studies; it's the same point, just looked at a different way. There's a study where they asked children to draw a quarter, and they found a statistical relationship, an inverse relationship, with the income of the family: children from lower-income families drew the quarter as a much larger coin than children from higher-income families. And that reinforces the point that in interpreting adjectives and putting that data into a repository, we need to be able to know more about the source.

All right, so wrapping it up. Most of the data that you're going to need for the next generation of applications is already out there, whether it's something in Wikipedia that you can get from Wikipedia, where you then have to factor in things like how stable the specific entry has been before you can decide how much weight to give it, or one of the systems that takes that raw material and makes it more meaningful, or things like the National Weather Service. There are a lot of sources out there. What I'm going to leave you with today is the idea that you can get most types of data through the large cloud services providers via API. You can get a lot of great data from your customers, opt-in or otherwise. I think I mentioned recently a case where a large retailer is using heat maps to follow people around the store, and it can try to determine whether a group of people is a family unit, in which case they will have one checkout occurrence versus multiple. But there are so many sources out there now. They're being made available publicly via the government; individuals are contributing willingly, directly to the Internet of Things and the public sources; and there are the sensors via your customers again, whether it's opt-in and you're getting the sensor data from their phones, or it's through another method of monitoring. There are new opportunities available in all of these areas. And with that, I'm going to turn it back to Shannon and see if we have any questions.

Adrian, thank you for another great presentation. If you have any questions for Adrian, submit them in the Q&A in the bottom right-hand corner of your screen. And just to answer the most commonly asked question: a reminder that I will be sending a follow-up email for this webinar by end of day Monday, with links to the slides and links to the recording of the session. Everyone's very quiet today, Adrian. They're taking pity on me for losing my voice. Or maybe they're just worried that I'm going to be talking again after being so tongue-tied this morning. Here we go, we've got a question for you. What is the best way to discover public information available for a specific location or business domain? What's the best way to see what's available?

Yeah, to discover public information available for a specific location or business domain. Okay. If it's something that's collected by the government, well, again, speaking of the U.S., I can't really address many other places, but almost every city has a website where they list their open data. I can maybe make a list; I won't be able to do a comprehensive one, but I can list a couple of samples that you can send out when you send out the slides. And the federal government's data is generally organized by agency, so for example, if you're interested in air quality, you would go to the EPA site; if you were interested in traffic patterns or accidents, things like that, the Department of Transportation would have that.
There's not one single site that has everything. You really need to pick the region and then pick the agency. So, for example, a follow-up question: what's the best way to discover data for California, or for healthcare? Okay, so for California, I would go to one of the Sacramento sites. I'll try and pull that out, because I don't have a directory. Those are, well, it's one question, but it's two different answers. If you're looking at something like healthcare and you wanted to limit it to California, then you're going to be dealing with one of the Sacramento sites. But if you're looking at healthcare across the U.S., then you would have to look for an agency like Health and Human Services, and it's going to be all the .gov sites that would have that. I'll see if I can put together a sort of starting point, and I'll include that when you send out the follow-up letter to folks.

So, Adrian, what is TCPH, and how does it relate to open data? What is, I need some context there, TCPH. Yeah, that one, I'm thinking. Mike gave a spelling for the acronym. I'm not familiar with this either. TCPH, no. Okay, well, let me come back to that. Oh, did you find something? I was just wondering if somebody's, well, it's always fun to play stump the analyst. It's the world's largest and most authoritative dictionary database of abbreviations and acronyms. How about that? Oh, there you go. You got me there.

Well, hot in the news today, this week especially, it's been Facebook in the last couple of weeks. Is there anything on this topic in relation to the Zuckerberg hearings? Sure, sure. Can I have another hour? I mean, the shortest thing I can say on that is that it was interesting to me. I just assume that everything is being monitored; the question is what's being done with it afterwards. And Zuckerberg's testimony was very emphatic that they didn't sell the data about individuals to companies. What they actually sold was access to people based on the data. So if I wanted to do something and target a specific group, I couldn't say, give me the list of people that fall into this set, give me the people that are 21 to 35 in Chicago, specific race, religion, et cetera. But I could say that that was a group of interest and I want to send those people this message. So they held on to the personally identifiable information, or that's the claim. But to me, that's to some extent a distinction without a difference. They have the information, they've captured the information, and they're providing that as a service, if you will, rather than providing the data.

That's a distinction that maybe I should have made when I was talking about some of these data sources. For the most part, the sources that are commercially available, like the list that I had on Google, and other providers have them too, like Dow Jones, for example, that data is provided as a service. The distinction is that the other types of data you can download, and then in many cases you have free use to build on it, but it's not something that you own. With Zuckerberg and Facebook, what they're saying is that they own that data about you, and they can provide access to that subgroup based on it, but they won't give or sell the data itself.

All right. Well, Adrian, that is all the time we have for today. Thank you for another great presentation, and thanks to everybody for being engaged in everything we do.
Again, I will send a follow-up email by end of day Monday with links to the slides and links to the recording. And Adrian, I'll send you over the additional questions that we didn't have time to get to, so you can comment. Thanks. All right, everybody, thank you so much. Hope you have a great day. Thanks, Adrian. Thanks. Take care.