That's right. Here we go. All right. So again, for those just coming in, we're going to talk about the triggers, the techniques, the inputs, outputs, scaling, and mapping. Now I feel really loud. All right. So some of the triggers that I've seen over the years for unstructured data projects are when a lot of people say, hey, didn't we already do that? We did this. We have this data somewhere. Or, I found it. I found it once. I know it's in there somewhere. And then the corollary to that, which is: I found it and I saved it on my local drive, and now we have 57 copies of it across the enterprise. And which one is the right one? Or you're looking for an expert in your organization. Who knows how to do this? I was a metadata architect at Raytheon for a number of years. And you guys have all been there, especially if you're a large company: dozens, I mean dozens, of document management systems. We used DocuShare from Xerox at the time. If you're a business owner, if you're responsible for a product: can't we reuse the content we already have to solve some new problem or create some new product? That is frequently a trigger. And then I've also heard feedback come in from, say, the sales staff. You know, their customers want a feature. But how does that information get to the folks who are in charge of determining the priorities for the product? Or, if you think that it may be something you might want to do, how do you get back to the customers who actually wanted it so you can dive deeper and actually give them something that they want? That is sometimes a problem. So these comments are from June 2003, from a survey that we did when I was at Raytheon. We did a data discovery project, and along with the machine analysis, we did a survey. And these are some of the comments that we got back in the free-text part of it. So: "The search engine proved inadequate. I need to find the data sheet."
"Had 366 entries, most of which were not relevant. I spent far too much time looking through the search results. It was not effective." "I would just as soon have used Google" is basically what we kept hearing. Google was then, in 2003, just becoming a household name. But you can't, because it's internal documents. It was a defense company; we had to be a little bit careful with our data. "But it's hidden in there with the Ark of the Covenant," and we all just had that image of the end of Indiana Jones, right? And then the other one was about our intranet being a wasteland of information. And I'm sure that a lot of you feel that way too, sometimes. And finding anything could be like, you know, finding Shangri-La, basically. What we're talking about here when we talk about unstructured data is unstructured and semi-structured. We're talking about Word documents, PowerPoints, Excel, Access databases, HTML pages, email, instant messenger, that sort of thing. In this particular context, being in the defense industry, we were early adopters of some of these capabilities. So we had IM in 2002, I think. So we were working virtually. Decisions were made over instant messenger. We needed to be able to track that stuff. Email, Microsoft Project files, things that are not in a standard database. And as I said, dozens of document management systems. So where does that get us, though? We had some really wonderful content objects that had no metadata attached to them at all. So we were reliant on full-text search and indexing and natural language processing. In the early days, we didn't have the NLP. We didn't have entity extraction. We didn't have rules-based classification. It was just pattern matching: is the text string in there, with some character replacement? So, yay, okay. Can we get there? No. Then you have the things that have leftover metadata, maybe from somebody else who filled out the file properties.
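That bare pattern matching can be sketched in a few lines. This is an illustrative reconstruction of the idea, not the actual tooling we had; the function name and the separator rules are my own:

```python
import re

def loose_match(term: str, text: str) -> bool:
    """Early full-text search in a nutshell: is the text string in there,
    with some character replacement? Case-insensitive, and treats spaces,
    hyphens, and underscores as interchangeable separators."""
    parts = re.split(r"[\s_-]+", term.strip())
    pattern = r"[\s_-]+".join(re.escape(p) for p in parts)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(loose_match("data sheet", "Find the DATA-SHEET for the X100"))  # True
print(loose_match("data sheet", "no such phrase in this document"))   # False
```

Note there is no understanding here at all: no entities, no classification, just character-level matching, which is why 366 mostly irrelevant results was the typical experience.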
Or just something that meant something to the individual, not to the organization. Then you just get into this real maze. It's really hard to find things. Bad metadata makes it worse, I think. Then, let's see if I can get this thing to work. Sorry, gang, it doesn't really want to play nicely today. Maybe the battery's dying. Then, as I said, with all those different repositories, you get those silos. Now, let's take a moment to think about this metaphor. These are decrepit old silos. In fact, I think someone has made one a climbing wall. We have no idea what's in there, right? We don't know how to get anything out safely. But originally, each one of those had a very specific purpose. A silo is not a bad thing. It means the farmer knows that my corn is in that tower, my barley is in that tower, et cetera, et cetera. I know that it's in there. I know that it's safe. I know how to get it in. I know how to get it out and use it. I know how to keep it clean. But as soon as it stops getting used, like the zombie instances of a cloud system, for example, that we learned about from Mr. Chisholm on day one, that's when these things happen. But each of those silos provides context. If they originally had a purpose and that purpose was not diluted too badly, if you know that all of your HR documents are on a certain server, you have that context. And it can help inform your downstream decisions. Hey, coming back here, I'm sorry. So it may seem odd to be talking about techniques for defining requirements. I've seen this task at all places on the spectrum, from loosey-goosey to anal-retentive. I suspect that one of the reasons that unstructured data projects have sunk lower on the priority list is because it starts really squishy, really touchy-feely, and a lot of folks don't like that. But it needs to be addressed. There's a lot of really great information in these objects.
So how do you figure out what your requirements are, and how do we separate the requirements from the technologies? And that's what we're gonna talk about. Sometimes it's obvious. You just look around your organization, and you know that when you go into a meeting, folks will say, oh, it's on such-and-such server, and you get several different answers. Or they've taken information out of a data warehouse, they've gotten one of their analysts to extract information, and then they've pushed it into a PowerPoint. And that's the only place that you've seen that information; you don't actually get to it from its source, which is a problem, because you lose the provenance of the data. You wonder where it came from, whether it's authoritative. You can do a budget analysis. One of the few ways in the early days of Six Sigma that we found hard dollar savings for an unstructured data project was to determine the cost of the servers, how much it was costing us to keep them going, given that the information rate was increasing. We were doubling the amount of information that we were storing every year, and not all of it was that valuable. Instead of putting everything on the same high-cost server, we realized that we could take the stuff that was less valuable (and we had a metric for determining what was valuable) and put it on a lower-cost server, the stuff that was maybe a little bit more valuable on a mid-tier server, and the stuff that we really had to keep for the long term on the really expensive machines. But that needs to be done intelligently, so you need to figure out what's in there. So this next slide is a graphic that I dreamed up one morning, and I'm not a graphic designer, so bear with me. But first, I'll answer your question: most of how we actually proved our case to the business champions was in soft dollar savings, so productivity.
But that stuff's harder to justify. Raytheon is an engineering and manufacturing company at the end of the day, so Six Sigma was not super friendly to the services part of the organization. Soft dollar savings eventually got easier to prove, but at this point in time, we had to come up with hard dollar savings. I may have a slide in one of my decks that shows the graph; I don't remember if it's in this one. But the person who came up with the graph was the financial analyst who was assigned to our team, so they bought it. I mean, it was accurate, and we proved it in the out years, but it's one of the ways we sold the project. Of course, I'm also the person who did a presentation a few years ago at SemTech called Six Weeks of the Semantic Web, where we basically, on white-space time, took six weeks and begged, borrowed, and stole data from all across the organization, just put it together, came up with a semantic thing, and tried to sell it that way. So I'm, you know, within constraints, obviously, not afraid to try things and push buttons. So, I'm not good at graphics, so this is just something I whipped up pretty easily. So what happens frequently is, this is sort of supposed to be Manhattan, where you have the grid, right? You have a fairly good set of systems where you know what's in there, your structured databases. And then you have, you know, the Village and the Battery, all the old cow paths where things sort of organically happened, and that's your semi-structured data: your HTML pages, email, that sort of thing. And really, anything digital is at least semi-structured, because you've got your file system metadata. And then there's the wash of things that are truly unstructured. Now, in some of your older organizations, as we had with Raytheon, a lot of stuff was still in technical notebooks; it was still on paper.
So, I'm not sure if any of you are with an organization that has that particular challenge as well. You know, in defense, the development times on these things are long. The Patriot Missile began development in the 40s. The first one was not fired in anger until the first Gulf War, in the 90s. So most of the research and the notes were in lab notebooks; they weren't online. When we started overhauling the Patriot Missile for the Defense Department between the Gulf Wars, we were pulling stuff out like you wouldn't believe. Young junior engineers would come into the library and say, apparently I need to learn Fortran, that sort of thing. So, does anybody have that, you know, the actual paper challenge in their organization? You do? Yeah. Nobody else? Everybody else is pretty much digital. So, for a lot of people, though, tackling unstructured data projects is too much like the knowledge management projects that most of us had to try, and most of us got burned with. But it's not all of KM. Unstructured data projects are not like all of KM. It is explicit. We're not talking about the whole tacit and implicit side of KM. We're definitely talking about the explicit side. So these unstructured data projects can benefit from some of the techniques and practices that KM folks use. So, one of the things that I will recommend to you, if you need to figure out how to generate ideas, and I've got a couple more slides, and then we can actually get into some of your cases. Have you ever seen the IDEO Method Cards, any of you? No? IDEO is the company that, I think, became well known because they helped Apple design many of its products; they're a usability design firm. They have published a deck of cards, and if you don't have them, I would recommend trying to find some; they're hard to get now. But they have all of the techniques that the designers at IDEO use.
So: observation processes, and when is a good time to use them, how to use them, you know, what the challenges are. And frequently what these folks will do is they'll get together as a team on whatever product they're supposed to be delivering, and they'll each look through their card deck and say, okay, I think this technique might work for us to do some research into this challenge that we have. And they'll throw them onto a pile in the middle, and then they'll decide which techniques among them they're actually going to use. And they're all fairly well defined. So if you can get ahold of one of these decks, I strongly recommend it. The other deck there is the KM Method Cards. Now, the reason I put this on there: these are from a company called Straits Knowledge. They're a consulting firm in Singapore, founded by a British expatriate by the name of Patrick Lambe, who's written a book on taxonomies and whatnot for the enterprise. Excellent book. L-A-M-B-E is his last name. He did the same thing, but with more of a content slant. So these would help you with taxonomy projects, metadata projects, and that sort of thing. What are the techniques? So, are any of you Six Sigma certified or anything? So you know, in that sort of realm, they have all those tools that we're supposed to use, like the Five Whys and Venn diagrams and all of those techniques. That's what these are, just in a design mentality, or in a knowledge management mentality. So they're very useful. I'm almost wishing I had brought them. So I've been doing unstructured data since '99. And these are the things that I use. Surveys, focus groups, observation analysis, watching somebody. It's very much what your interaction designers and information architects will be using. SWOT analysis, capabilities analysis. User personas.
So I referenced, you know, you go into a meeting, and even though you may have a great big Microsoft Project system housing your project details and your Gantt chart and who's supposed to do what, what happens nine times out of 10? The PM and all the various people who need to report in show up with an Excel sheet or a PowerPoint at the weekly staff meeting. Anybody ever experienced that? So these people who need that data, shouldn't they get it from the source, and shouldn't they present it from the source? Shouldn't there be a tool to let them interface directly with the source instead of having to do that copy-paste? So create a user persona. How would they like to do that? Ask them. Give them a story. This is definitely a design technique, but it works for unstructured data. Because there's no data model book you can go to here. You can't open a book and say, how do I model this? Because it's unique to your organization. You have to customize it. Those pre-existing models, those patterns, are very useful, but nine times out of 10 you're gonna need to riff on them, to use the music metaphor. You're going to need to modify them a little bit for your organization. You're usually not gonna just use them off the shelf. So you do a knowledge audit. For this particular project, let's say you're working with HR: what do we need to know? And where do we get that information? We need to know information about the employees. So we have the HR databases. We have the LDAP. We have project assignments. How do we, you know, figure that out? If we have a repository of white papers and other scholarly articles that our people may have written, include that as well. Dashboards. Who has what dashboard? What project managers are using what kind of dashboard? What are the executives using? What's the data that they need? Where are the gaps?
Where is it when somebody says, I really need to report on this every month or every week? But I have to take this piece over here and this piece over there and do just a little bit of math, and then I can present it. That's a gap; note it down. Also important is to notice your redundancies and your overlaps, the duplications. You can get rid of those. A business process map: has anybody ever done one of these? Okay, yeah. And stop me, again, really seriously, I prefer interactive; stop me if you have questions. So you document the steps in the process for how the information is created. So why did this report show up in a PowerPoint? I mean literally, what was the process? The person who wrote it didn't just fire up PowerPoint and say, let me throw some stuff in. They had to go somewhere to get the data. Where did they get the data? Did they get it from a database? Did they get it from a team member? Did they get it from their finance counterpart? Did they get it from sales? Where did they get it? Document all of that. Figure out where your inputs and outputs are, yeah. Exactly, because they're corner cases. And yet Gartner and other organizations estimate that 80% of the data in your organization is unstructured. And my analysis, after over a decade of doing data audits, is that that is an accurate number. We had 85% at Raytheon in the unclassified systems. 13% of that had metadata, and most of that metadata was crap. I had this one thing, I was chasing something down, and Ethan Frome came up as a top term in a machine-generated taxonomy. Now, to my knowledge, and I did have a security classification at that time, Ethan Frome was not the code name for a project. What happened was somebody took their laptop home and let their son use it to write his English paper. And the kid was smart enough to put in the metadata. And then his father just reused the file: Control-A, Delete, start typing.
How useful is that in your data discovery, in managing your unstructured data? It's not, not at all. And believe me, I had lots of contacts, because I started at Raytheon as an actual librarian. So I had access to a lot of different things, and everybody swore to me that it was not a code name for a project. But, and obviously I didn't get anywhere sensitive, because I'm still here. So identifying the process, where it comes from, is really important. When you do a structured database, you have a pretty good idea of where the data is coming from, don't you? You know you have to ingest information via whatever process, ETL, whatever process you have; you know where it's coming from. You know what you do to modify it, to put it in the format that you need for your structured system, which your business analysts and the other folks then use to extract it. When I was at Dow Jones, we were a publishing company. Everybody thinks, oh, Dow Jones, financial services. Okay, we did a lot of financial data. We published financial data. We had some smart editors who wrote about financial data, but we were a publishing company, and we had a lot of third-party content that we acquired, integrated with our own, and then published out. Thousands of third-party sources. Let me assure you, they did not all use the same format. It would take us a week to get their feed, however we were gonna get it, be it an FTP drop, be it XML transfer, and we would have to map it to our schema, and then we could include it in our processing platform and republish it. Think about the one-to-many on that, or the many-to-many as it was. It was a mess. It was truly a mess. But we needed to keep that provenance information. We needed to know where it was coming from. Yeah, yeah. Yes, absolutely, yeah. And we'll talk a bit about that.
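That feed-mapping work can be pictured with a small sketch. The source names and field maps here are invented for illustration; the point is that every source needs its own map onto the internal schema, and provenance has to travel with the record:

```python
# Hypothetical per-source field maps: every third-party feed arrives in
# its own format and must be mapped to the internal schema before publishing.
SOURCE_MAPS = {
    "vendor_a": {"headline": "title", "pub_dt": "published", "body_txt": "body"},
    "vendor_b": {"TITLE": "title", "DATE": "published", "TEXT": "body"},
}

def to_internal(source: str, record: dict) -> dict:
    """Map one incoming record onto the internal schema, keeping
    provenance: we always need to know where it came from."""
    fmap = SOURCE_MAPS[source]
    out = {internal: record[external]
           for external, internal in fmap.items() if external in record}
    out["source"] = source  # provenance travels with the record
    return out

item = to_internal("vendor_b",
                   {"TITLE": "Rates rise", "DATE": "2003-06-01", "TEXT": "..."})
print(item["title"], item["source"])  # Rates rise vendor_b
```

With thousands of sources, that `SOURCE_MAPS` table is thousands of entries long, each one taking a week or so to build and verify, which is exactly the many-to-many mess described above.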
So much happens in the hallway or in cubicle meetings or at lunch, if you have a cafeteria, for example, in your organization, that a lot of this is absolutely missed. But those usually just generate the conversations, right? They still have to go back to their desk and sit down and do something or get information from somewhere. So if you can get past that to the, okay, and what happened next? You talked about, you had this great idea at lunch, okay, fabulous. When you went back to your desk, what did you do? And that is probably, a lot of times, gonna be as good as you're gonna get. So when you add the metadata, and there are lots of studies, and I'm not gonna get too deep into metadata today. I'm happy to talk about it with anybody. Obviously, it's really my sweet spot. There are lots of studies that ask, just like with design and the seven-plus-or-minus-two rule, how many metadata fields is the average person gonna tolerate entering information for? Is it three, is it five, is it seven? We can talk about that. The key thing you need to know: a bunch of stuff should be done by the system. I should know who authored it because of their login credentials. There's no reason they should have to enter their name anywhere. You might have to change your systems to make that happen. It's a different project. But you can solve that problem. Date should be a no-brainer. Modification date, though: you need to have the creation date and the modification date. You need to keep both. Another useful thing, say, from your LDAP or your HR systems, is the context. So for a lot of the folks in your organization, you're going to be able to know that they're in HR, they're in finance, they're in engineering. There are problems when people get loaned out, yes. But if you have their project, the contextual information for their project or where they're assigned at that point in time, you should also be able to add that automatically.
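The principle above, that the system fills in what it already knows and the user adds only a few fields, can be sketched like this. The directory contents and field names are made up for illustration; in practice the lookup would hit LDAP or the HR system:

```python
from datetime import datetime, timezone

# Hypothetical directory lookup, keyed by login credentials;
# in practice this comes from LDAP or the HR system.
DIRECTORY = {"jsmith": {"department": "HR", "project": "Benefits Portal"}}

def system_metadata(author_id: str, created: datetime, modified: datetime) -> dict:
    """Fields the system supplies automatically: author from login,
    BOTH creation and modification dates, and organizational context."""
    ctx = DIRECTORY.get(author_id, {})
    return {
        "author": author_id,
        "created": created.isoformat(),
        "modified": modified.isoformat(),  # keep both dates, not just one
        "department": ctx.get("department"),
        "project": ctx.get("project"),
    }

def full_record(author_id: str, created: datetime, modified: datetime,
                user_fields: dict) -> dict:
    """The user only has to add a few key fields on top."""
    return {**system_metadata(author_id, created, modified), **user_fields}

rec = full_record("jsmith",
                  datetime(2003, 6, 1, tzinfo=timezone.utc),
                  datetime(2003, 6, 5, tzinfo=timezone.utc),
                  {"title": "Benefits FAQ", "doc_type": "FAQ"})
```

The split matters: everything in `system_metadata` costs the author nothing, which is how you stay under that tolerance threshold of a handful of manually entered fields.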
And then they should only have to add a few key fields. Does anybody use OmniFocus on the Mac, by any chance, or any of you Mac people? Yeah. When I come here, it's like, yeah, I'm a Mac person. So anyway, so you have a project and the context. So you can say, especially if you're in IT, this is a project for HR, and this specific project is X. That is important information to add, the project and the context. One of the things that I notice, especially with the quote-unquote schemaless design used in systems like Hadoop, reminds me, and again, as I mentioned earlier, I'm a librarian by training, of something called colon classification, because of the way you set up your columnar table system. Colon classification was created by S. R. Ranganathan, a mathematician and librarian in India, in the 1930s. And he believed that you could classify everything in the world, books, concepts, everything, by five key facets of information. And those were personality, matter, energy, space, and time. And when I look at things like Hadoop, it's the same thing. It is the same thing. You have your personality: what's it about? You have your matter: what is it? So in a digital system, what's the file type, what's the MIME type? Energy is the action, the verb: what'd you do to it? Is it sales, for example? Space is where; so, I'm in the eastern region, that could be an example. Time is pretty straightforward. The canonical example is 18th-century wood furniture design in France. You get all that information out of those five facets. You can classify something. So you can get it down to something like just five fields. All right. I'm good at tangents. I'm very good in the abstract layer, so if I get totally off track, just wave at me. Social tagging. Do any of you have social tagging tools inside your organization? Do you let people add their own metadata? No. Do you guys not blog?
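As a rough illustration (my own mapping, not a standard schema), the five PMEST facets could be captured for a digital object like this, using the canonical furniture example:

```python
from dataclasses import dataclass

@dataclass
class PMEST:
    """Ranganathan's five fundamental facets, applied to an object.
    For a digital file, 'matter' might be the MIME type and 'energy'
    the business action performed on or with it."""
    personality: str  # what it's about, the core subject
    matter: str       # what it is
    energy: str       # the action, the verb
    space: str        # where
    time: str         # when

# The canonical example: 18th-century wood furniture design in France.
example = PMEST(personality="furniture", matter="wood",
                energy="design", space="France", time="18th century")
```

Five fields, and the object is classified; that is the appeal the speaker sees echoed in columnar, schemaless systems.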
Let people blog in your organizations? That's a shame, okay. If you had it available, you could use social tagging. So I won't spend too much time on this, since none of you are using it. Okay. There is a hierarchy, a continuum if you will, of complexity in controlled vocabulary systems. And again, remember, I'm coming at this from an information science background, not a computer science background. To classify and organize information, you can start with the folksonomy. This term was coined by a guy named Thomas Vander Wal, also from a library science background. And that is the tags that people, average people, add to their content objects. So when you go to Delicious and you tag something, that's a folksonomy tag. If you go to Flickr and upload a photo and give it tags, those are folksonomy terms. They are not controlled. They could have spelling errors; they could be completely meaningless to somebody other than the author, the person who added the tags, and whatnot. Does that all make sense? Yep, yep. It is the name given to the phenomenon of people tagging, classifying information that they've stumbled across. Yes. It's very organic. Folksonomies are very useful for determining your controlled term list, your vocabulary, your taxonomy, whatever. So it goes on up from folksonomies: synonym rings, glossaries, taxonomies, thesauri, ontologies, et cetera. Folksonomies are at the bare-bones bottom level, and the various kinds of ontologies sit at the top. No control. You cannot guarantee spelling. You cannot guarantee anything. I mean, have you ever gone to Delicious and run across a tag like "to read"? Well, if I tried to read everything that somebody has tagged "to read", I mean, it's useless except within a context, going back to that silo again. It's only useful within my silo.
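Even uncontrolled tags can be mined. Here is a minimal sketch, assuming you have harvested raw tags from a social tagging tool, of turning folksonomy tags into candidate controlled terms by collapsing spelling and separator variants and counting frequency:

```python
from collections import Counter
import re

def normalize(tag: str) -> str:
    """Collapse case and separator variants so 'To-Read', 'toread',
    and 'to_read' all count as one candidate term."""
    return re.sub(r"[\s_-]+", "", tag.strip().lower())

def candidate_terms(raw_tags, min_count=2):
    """Folksonomy tags are uncontrolled; frequency after normalization
    is a cheap first cut at a controlled term list."""
    counts = Counter(normalize(t) for t in raw_tags)
    return [t for t, n in counts.most_common() if n >= min_count]

tags = ["to-read", "toread", "To_Read", "radar", "Radar", "misc"]
print(candidate_terms(tags))  # ['toread', 'radar']
```

The output is only a starting point: a human still has to decide the preferred spelling, spot synonyms, and throw out silo-bound tags like "to read" that mean nothing outside one person's context.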
So it's a good way, though, if you're stuck trying to determine a controlled term list or taxonomy for whatever your project is, and you can use this in structured as well as unstructured systems. How are people tagging it? Even go out to the web and look. Use Google's Zeitgeist, if I'm pronouncing that right. You know, that'll tell you. And then you figure out: first, is it spelled right? What's the right way to spell it? Are there other terms that mean something like it? Because those terms might be useful in your system as well. And are there trends? You know, we've all seen it: what's trending today on Google Plus or on Twitter or whatever. Twitter hashtags are folksonomies. There's nothing controlled about that. Okay, so ask the people. With an unstructured data project, it could be bottom-up: somebody in the organization gets really frustrated and says, I wanna fix this, let me do it, and makes their case, and they get, okay, fine, you can spend a couple hours a week fixing it. Or it could be that somebody on high got ticked off at something and wants it fixed. That's top-down, right? So ask them: what's the problem? Surveying. I hesitate to make a generalization, but if you're in the kind of role that doesn't usually have to go talk to your customers, cut it out; go talk to them. Tell your project manager, sorry, I need to go talk to these people, facilitate it somehow for me, and find out what's really bugging them. One of the things that we were taught in library school that I really think is important in this area is how to do a reference interview. People will come into a library and they'll say, so I checked out this book a couple years ago and I need to find a reference in it. There's something in there that I needed for my current project, and it was green. Can you find me that?
This has actually happened to me when I was a real librarian. I'm like, what are you talking about? So I looked through his records: well, you've checked out these books. Did you actually check it out? Or did you just take it back to your office? So finally I had to say, okay, this is not working. His description isn't working, because people's memories are faulty. So: what do you need it for? What problem is this going to help you solve, so that I could at least narrow it down to books on a certain topic? Well, it turned out it was a white book that he had checked out seven years before, for four months, and it was on radar or something like that. So I mean, people don't remember well. So you have to keep digging. So when you talk to your folks, more often than not, what they ask for explicitly is not what they actually want. Be careful, I'm not suggesting you step on toes or piss people off, but keep digging, keep probing. Look, observe, see what kinds of information about that person you can access. What are they trying to solve for? What are the benefits? You know, is it going to generate more sales? More importantly, are there any restrictions? Are you in an organization or an industry that has regulations, either government regulations or internal regulations? Figure out what those are, because you definitely don't want to get yourself in trouble. So these projects really aren't that different from what you usually do. You have to analyze what the needs and the wants are. You have to define the requirements. You have to commit to getting it done. That all-important: yes, somebody has signed off, you have the money to do it.
You get your resources. (I have a separate slide deck on the kinds of people who are good at this, which you can find on my SlideShare account if you're interested in more depth than we're going to go into today.) And then you develop and deploy it, and then you define and publish maintenance processes and governance rules. It's interesting to me that there are a lot of talks on data governance at this conference and related conferences of late. For years, I've been telling my internal clients, and now that I'm a consultant, my consulting clients: if you're not going to plan for this now, I'm not doing this project. A very large bank, headquartered somewhere here in the South, in 1999 spent $3 million to develop a taxonomy to improve their search results for their unstructured data. They got it developed, they installed it, and then they never updated it. They never maintained it. And in 2004, they started the process all over again. What a waste. Because what they created in 2005 looked a lot like what they had done in 1999, just with some new terms added. Terminology changes, technology advances, new terms are created, the usage of terms changes. This is the linguistics part of metadata. You have to make sure it's up to date, whether it's in an unstructured system or a semi-structured system. Now again, the other bias that I have is that I'm much bigger in the semantic community than I am in the structured community. I'm used to working in systems where I can change things whenever I feel like it and have much less agita about breaking things. I can do deprecation and subsumption: I can say, we're not using this anymore, and I can subsume the information from this node, this entity, or entity instance somewhere else. So yes, there are concerns there. But in a lot of the systems that you're gonna deploy to manage unstructured data, you have a bit more freedom.
So if you're ever on one of these projects, talk to your vendor, obviously, about how they do it under the hood, to figure out what exactly your capabilities are. Okay, so I've referenced some of this data earlier, so we can go through some of these examples a little bit faster this time. So what are the inputs to your project, to try to figure out what your project is? You've been told: go fix search for unstructured content. We need to find our information better; we need to know what's in there. Did anybody go to John Ladley's session on Enterprise Information Management and how you value it? Yeah, I did too; it was great. He actually has algorithms that he's published on how you can value your information assets, which is really hard to do. His book is at the bookstore, if you're interested. But if you're going to do an information asset valuation, this is some of the stuff that you're gonna have to do to really get a handle on what you have in your organization. So as I mentioned, it's no secret, I worked at Raytheon. We had 80,000 people; we had 85% unstructured data; 90% (did I say 90% earlier?) had no metadata, and most of what there was, was bad. 13% of the documents were exact duplicates. Exact duplicates. We have no idea how many were near duplicates. You know, somebody changed one number on a spreadsheet. No idea. Because that's actually harder to find. I mean, with pattern matching, exact duplicates are easy. The true age of the documents was harder to figure out, because IT would randomly say, well, we're gonna decommission this server, so we're gonna programmatically move this from here over here. Oh, guess what, it got new creation dates. Really? So, to my example about having the tiered storage system: not actually knowing the true provenance of the information, and how old it really was, so that we could put the older data on less expensive tiered storage to reduce those costs, became really difficult.
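Why are exact duplicates easy and near duplicates hard? A hash catches byte-identical files, but a spreadsheet with one changed number needs a similarity measure, not a fingerprint. Here is a minimal sketch; shingling with Jaccard similarity is one common approach, not necessarily what was used at Raytheon:

```python
import hashlib
import re

def exact_fingerprint(text: str) -> str:
    """Exact duplicates are easy: identical content hashes identically."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 3) -> set:
    """Overlapping k-word windows from the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str) -> float:
    """Near duplicates (one changed number) need a similarity score;
    that's why they're so much harder to find than exact copies."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

doc1 = "q3 revenue was 1.2 million with margin at 8 percent"
doc2 = "q3 revenue was 1.3 million with margin at 8 percent"
print(exact_fingerprint(doc1) == exact_fingerprint(doc2))  # False
print(jaccard(doc1, doc2) > 0.4)                           # True: near duplicate
```

A single changed character defeats the hash entirely, while the similarity score stays high, so near-duplicate detection means comparing every document against many others, which is what makes it expensive at enterprise scale.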
So these are all good things to know as you're trying to figure out what metadata is important to capture. How do people search for information? Look at your search engine logs; look at your server logs. How do people traverse your document system? How do they search in it? Do they try to refine or filter their search by their department? If they do, then you're going to want to add metadata for that. In my experience, these are the ones that traditionally float to the top: function, organization or business line, date, document type. People have it in their heads that they know it was a PDF or a PowerPoint, and usually they're right. It's definitely easier to know if it's Excel. And if the content has any tags at all, you need to provide additional sorting and filtering. If you have a faceted search capability, where you can stack multiple metadata filters, that's great. This last bullet is definitely from the mouth of Tim Berners-Lee: don't change the location. Give it a persistent location so that, say, I can write a macro in my presentation tool of choice that says: every week I've got to go to this meeting and tell people how we stand, so go get the data for me so I don't have to do it by hand every week. Just fetch the data and display it right here where I want it. That's more of a syndication capability, but you can do it, and you can do it week after week after week if you know the most recent data is always in the same place, instead of having to go find it. Oh man, I'm blanking on the name of the analyst at Delphi who did this research; it will come to me. The Delphi Group did a survey about how much time was wasted, and this goes to that soft-dollar conversation we were having: if people are wasting a couple of hours a day just looking for stuff, that's a waste.
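That persistent-location idea is easy to automate once the location really is stable. A minimal sketch, where the URL is of course made up for illustration:

```python
import urllib.request

# Hypothetical persistent URL: the current report is always published
# here, so the weekly macro never has to hunt for this week's copy.
REPORT_URL = "https://intranet.example.com/status/latest.csv"

def fetch_latest_report(url=REPORT_URL):
    """Pull whatever currently lives at the stable location."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

The design point is that the contract is the address, not the file: publishers overwrite what lives at the stable location, and every consumer that bookmarked it keeps working.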
Now, the argument, of course, and this is the argument I've heard a thousand times, is: well, you can't prove that if we gave them back that time they'd actually do something productive with it. Okay, that's how you feel about your employees? I can't change that. But you can measure it; it's a performance thing. If you know you've improved their ability to find stuff and get more done, and they don't start getting more done, then that's a conversation for the manager to have with the employee. At least half of that search time is wasted because they just don't find it. So after spending the time looking for something, with a good chance of not having found it, they recreate it, which goes back to having hundreds of near duplicates in your systems. And everybody puts their own spin on things, so you don't know how accurate it is at the end of the day. Are you making major product decisions based on questionable data? Possibly. And the factors that contributed the most were data changing location, things moving, and just bad tools. So this is where we really start to talk more about the output, what comes out. You've figured it out: you've got all your inputs, you've done all your surveys, you've talked to people, and you've analyzed their processes. Sorry, I just need to catch up on my slides and make sure I'm getting everything in here; speaker notes on an iPad are kind of small. I lost my place, that's okay. Where to store your metadata is one of the big requirements you're going to have to determine. Do you embed it in the object itself, or do you attach it to the object in your storage system of choice? We've all got document management systems, right, that we've had to deal with. We all upload a file of some type. You click upload; what usually shows up after the window that lets you select the file from your desktop? A what? Right: a description form, a check-in form.
Little fields that you have to fill out, right? That's metadata. What is important for you to have on that page? What's really important, as opposed to what you think is important? Delimiters. Sorry, you just made me go back to my library MARC kind of delimiter hatred. Yes. In theory, based on your login, as we said, a lot of that stuff should be pre-populated; you should not have to worry about it. So you have to figure out, and this is where you do focus groups and user studies and actual prototypes, just like doing an alpha for a software release: what's the tolerance? How many fields are people going to fill out? I wish I could tell you the research was conclusive on this. There have been studies showing that authors are the best people to do their own metadata: they know exactly what they've written, they know what's important, they know what they're talking about. Then there are studies showing that authors are the worst people to do their metadata. It's all due to perspective. Yes, the author knows what they were talking about, but what was important to them when they were writing it may not be what's important to the person trying to find it. So this is one of those things where you have to do your research and make a judgment call for your organization. The jury is out. I would say have some fields that the author absolutely has to fill out, yes. But then let people tag it with a folksonomy: provide a mechanism for users to also add their own tags. And yes, you have to federate. You have to federate your metadata. The problem with embedding the metadata in the object is that it makes the object a little larger, and there are more things to break as you transfer it from system to system.
However, if you don't embed it in the object, and something breaks as you transfer the object from system to system and you lose the metadata, that's a problem as well. So that's another choice that needs to be made. But you need to be able to federate the searching of the metadata, and the metadata record has to maintain its link to the object in question. You've gone into a library, looked something up in the catalog using the metadata, title, author, subject, and then gone to find it on the shelf because the computer tells you it should be at this location, right? And it's not there when you get there. Now, it could simply be that it's been checked out, but the computer should have told you that and you wouldn't have bothered to go look. But you get there and it's not there, because somebody put it back in the wrong place and the librarians haven't yet discovered that and reshelved it. With that sort of system, there are tools that are basically metadata registries, and all of those tools can link into that. And the authoritativeness with which you ingest and transfer metadata is something you need to define; it's like creating an algorithm. So in your SharePoint systems, you can have controlled term lists, right? As people upload something, they can pull down a list and choose from it. You don't have to have these huge taxonomies. To go back to that conversation about colon classification, personality, matter, energy, space, time: it is more effective to have, say, four facets of information with 10 terms each, which you can combine into 10,000 different things, than to have a single taxonomy of 10,000 terms. Right? But where that is managed is in your central store, and there are lots of taxonomy and metadata plug-in tools for SharePoint. Is anybody here from Microsoft? This is recorded.
SharePoint, even though I know the most recent release is much improved over earlier editions, is still not a good metadata system. It stinks at taxonomies, and we encounter this all the time. So what you do is have your SharePoint system look to a centralized metadata system: SchemaLogic, which got bought by Smartlogic, or Synaptica, or Data Harmony, those kinds of tools. SharePoint gets all its terms from there, but then you also have your social tools. For example, if you're using an open-source tool like Drupal, you can have it look to that same metadata registry to get the same terms to select from. And you can have the system, as people start to type, pop up a suggestion from your controlled vocabulary. If they need to add a term that isn't already in there, there's a way to do that: you let them add the term, but as it gets pushed to your central metadata system it gets flagged along the way, so the editor of your taxonomies has to review it, and either it gets added, perhaps amended into something more formal, or it goes away. Does that make sense? Did I answer your question a little bit? It's a really complicated topic, so I know that's not a full answer, but does that give you an idea? Okay, great. So apparently we're running out of time, but that's okay, because I told you I'd be happy to make this more conversational. Let's see where else we get to go. So you can improve efficiencies. You can reduce storage costs. Told you the slide was in here somewhere, or I thought it was. So you have to identify the opportunities. If the inputs say, I want you to curate the content for me, that's an opportunity. If you want to reuse it, if you want to annotate your content: how many people are starting to see requests for annotation of content? Anybody? This could be more of my semantic realm. Refine the content: I want to take something that already exists and refine it just for my needs.
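That type-ahead-plus-review flow can be sketched in a few lines. This is a toy model of the behavior just described, not any vendor's API; the class and method names are mine:

```python
class ControlledVocabulary:
    """Toy model: suggest controlled terms as the user types; unknown
    terms are flagged for the taxonomy editor instead of entering the
    vocabulary directly."""

    def __init__(self, terms):
        self.terms = set(terms)
        self.pending = []          # new terms awaiting editor review

    def suggest(self, prefix):
        # Type-ahead: controlled terms starting with what was typed.
        prefix = prefix.lower()
        return sorted(t for t in self.terms if t.lower().startswith(prefix))

    def submit(self, term):
        if term in self.terms:
            return "accepted"
        self.pending.append(term)  # flagged on its way to the central store
        return "pending review"

    def approve(self, term):
        # The taxonomy editor promotes a flagged term into the vocabulary.
        self.pending.remove(term)
        self.terms.add(term)
```

The design choice this illustrates is the one from the talk: users are never blocked from tagging, but nothing enters the controlled vocabulary without an editor's review.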
It may not be that I want to change the data; it may be that I just want to use part of what's available, not the whole thing. So: content reuse and repurposing. The slides are on the CD, or on the server, if you want them. These are just some examples of how you can reuse and repurpose for different skill levels. Say you're a student in an organization or a university: the introductory-level learners only get maybe this much of the content, the slightly more experienced get this much, and the experts get all of it. That's one example. So for the outputs, to the example I was just giving you, your CRM content has to be indexed with categories. And it's a hub-and-spoke model, the model I described to you: you have your central hub, and then you have your spokes with your additional metadata, which is your controlled vocabularies and your business rules. Just like with an infrastructure system, you have to have rules that you enforce for naming conventions and whatnot. Let's see what else. These are nothing new: you still have functional, user, administrative, and technical requirements for your systems. That's not changing, especially if you're looking for a metadata system. You still have to authenticate, authorize, and have security. The example we were given is: if you can give a lift to your metadata tagging using LDAP, for example, you should do that. If you need to restrict access based on security classification for government or industry, you can do that. You need to figure out how much you need to scale: how much unstructured data do you have? How big is your company? These are typical questions. What do your users need to do, and what kinds of users are they? Most metadata management systems have dozens of user permission profiles, not just two or three. Does it have to be interoperable? Most of these things can work in the cloud. Is it going to be XML or something else? And where does it need to go?
Document management systems, digital asset management systems, social media systems. Again, we've been talking about this; this is just rehashing it another way. Build versus buy: you build it or you buy it. Taxonomies are becoming kind of a commodity; more likely than not, you can buy one. You can go to Taxonomy Warehouse, a site that I used to own when I was at Dow Jones. It got sold, along with Synaptica, to the current owners of Synaptica, and it is a site from which you can find taxonomies. In fact, now you can download them in a variety of formats, not just XML and Excel and Word but also RDF and SKOS, that sort of thing. So you can actually buy them and not have to build them yourself. Okay, and then, as I was saying, governance. And then you can scale it. Okay. Sorry, with so much side conversation we didn't quite get there, and like I said, I think I might do another presentation on how you map this to technologies: natural language processing tools, entity recognition, rules-based classification. Have any of you worked with any of these? Yeah. Those are the kinds of tools you use to give yourself a leg up. There's an open-source tool called GATE; if you're interested in playing with this, I suggest you check it out. The Apache Foundation obviously has several useful tools for this as well, but GATE will identify the entities for you, and it is open source. Okay, so sorry I got rushed there at the end. Anybody have any questions about what we actually got to talk about? Yeah. Ranganathan. Ranganathan, R-A-N-G-A-N-A-T-H-A-N. He was an Indian mathematician who studied at Oxford, and when he got home to India he ended up becoming the national librarian. In fact, colon classification, which is what his scheme is called, is still used in Indian libraries today. Colon classification. And I see it used in these kinds of realms a lot.
In fact, a lot of people have recognized that. It's very useful. All right, so sorry, again, any questions? Great, thank you very much for coming. Really appreciate it.
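Since colon classification came up twice, the arithmetic behind the faceted argument from earlier (four facets of ten terms beating one flat taxonomy of 10,000 terms) is easy to check. The facet names below follow Ranganathan's PMEST; the terms themselves are placeholders:

```python
from itertools import product

# Four of Ranganathan's PMEST facets, each with ten placeholder terms.
facets = {
    "personality": [f"personality-{i}" for i in range(10)],
    "matter":      [f"matter-{i}" for i in range(10)],
    "energy":      [f"energy-{i}" for i in range(10)],
    "space":       [f"space-{i}" for i in range(10)],
}

terms_to_maintain = sum(len(terms) for terms in facets.values())
descriptors = list(product(*facets.values()))  # every facet combination

print(terms_to_maintain, len(descriptors))  # 40 terms yield 10,000 descriptors
```

Forty terms to govern and keep current, versus ten thousand nodes in a flat taxonomy: that is the maintenance argument for facets in one line of combinatorics.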