Hi, everyone. Welcome. Thanks for coming. My name is Kenning Arlitsch. I'm the Dean of the Library at Montana State University. This is Martha Kyrillidou from ARL, and Patrick O'Brien, also from Montana State University. Our fourth team member is stuck in Wisconsin for a year away, so I'm going to stumble through the first part, which is her presentation, and I'll ask my colleagues to jump in and help me where I fail.

This is a presentation about web analytics, and it stems from research that Patrick and I started about three and a half or four years ago at the University of Utah. So some of what you're going to see today is data from the University of Utah, even though we're not there anymore. We started this research trying to figure out why our digital collections were not showing up in internet search engines the way we expected them to, and it has mushroomed. It has gone in a number of different directions, and the web analytics piece is just one of those branches.

Here's an interesting take on an aspect of web analytics that we're not going to talk about very much today: there is a privacy issue, and it's a policy issue, and depending on where you stand on either side of this very fine line, you can be a hero or a pariah. We hope that libraries can find some good middle ground there. But just know that as we measure and track user behavior, there are potential privacy issues.

Beyond privacy, there are a number of other considerations. One of the big things that Patrick's going to talk about later is the measurement of non-HTML web assets, because that's one of the major findings here: if you have an institutional repository, and users use a search engine like Google Scholar and click on the link that comes up on the right side of the search results that says [PDF] from such-and-such repository, they're probably bypassing any counting mechanism you have in place. So we think there is a large number of downloads of and visits to institutional repositories that are happening but not being counted.

Blinded by the numbers: it is important to look at numbers, but numbers go up and down; they spike, they tank. It's more important to look at trends over time than at any particular period or any specific count of downloads.

Spiders, robots, proxy servers, and caching are all things that can throw off log file analysis. There are basically two methods of capturing this data. One is to analyze log files with software built for that purpose, and the other is to use page tagging software like Google Analytics. They both have their pluses and minuses, and the issue of spiders, robots, proxy servers, and caching can really obfuscate log file analysis. And then there's the issue of quantitative versus qualitative data. Martha will talk a little bit more about qualitative data that comes from surveys. What we're mostly talking about when we look at Google Analytics or any other analytics software is quantitative data.

So first, the big question: before going too far, it's important to know what you want to learn. What is your objective or desired outcome, and what goals will get you there? Then you can consider some of the more specific aspects. Often the reason for collecting these kinds of analytics is simply to report to a parent institution.
Many of you are ARL institutions, and the Association of Research Libraries, as you know, is interested in gathering data and producing reports on an annual basis. So some of the impetus for this is driven by an organization like ARL, but also by the administrations we all work for at our home institutions, as well as our own annual reports.

A little bit about the what, some of the kinds of things you might want to know: what are users searching for, which features are used, where are users coming from, which documents are being downloaded, how long do users stay on the site, and what are their navigation paths. That last one is really crucial. What we've found in our research is that, more often than not, Google Analytics is configured to measure siloed repositories rather than being configured to measure across repositories. So if a user comes into your website, looks at a digital object, and then clicks on a link and gets sent to another server, the analytics software tends to lose them, because it hasn't been configured properly to track them across domains and across properties.

Then there's the how: what kinds of metrics do we use? Hits, as many of you probably know by now, are one of the worst measures you can collect, because a hit is simply a measure of a file download. If your page has five image objects on it and a user visits that page once, it will count as six hits: five for the images, one for the HTML page itself. So it's a completely inaccurate measurement. Page views are a much more accurate measure, because a page view tells you the user visited the page once, regardless of how many files on that page were downloaded. Time spent on page can be important, as can whether the visitor is a unique visitor, a new visitor, or a returning visitor. And bounce rate can be an interesting measure as well. You might see a high bounce rate, something like 60 or 70%, in your analytics. That means a visitor went to the page and then left immediately. You might think that's not a good thing, but it can be: it can mean they found exactly what they were looking for and then they're done.

These are some of the tools we use. Google Analytics, up at the top, tends to be the most popular page tagging analytics software. It's free, it's very powerful, and it can be configured in a multitude of ways. But there are other systems as well; WebTrends is one that we used at Utah. There is also log file analysis software. Heatmaps can tell you what users are doing once they get to your page: what they're clicking on, where they're spending the most time, and so forth. So there's an array of measurement methods available; that's the point we want to get across.

Thank you, Kenning. As Kenning indicated, ARL has an interest in collecting data and describing what research libraries are doing, and over the years we have built a history of experience in trying to capture information about how users interact with the electronic environment. Some of that need is driven by the need to demonstrate our worth, by the transformation that's happening in libraries, and by changes in user behavior and increasing demands. These are a number of our external drivers.
And where we are at this point in time, with the history of the last 10 years behind us and the more recent history of the Lib-Value IMLS grant behind us, is at a point where we have a number of different tested methodologies that can serve as an incubator for really scaling up the kind of data collection we need to be doing, and that includes web analytics work. Historically, the Association of Research Libraries has been driven by this objective to describe and measure the performance of research libraries and their contribution to research, teaching, learning, and community service. Our Lib-Value work will have its toolkit available in January, along with a call for participation in different pilot activities.

To give you a sense of how we've gotten here: this has a century-old history, going back to 1908, when the library director at the University of Minnesota, James Gerould, started collecting data across research libraries, and that's a legacy that's been with us for more than a century. The more recent experience that has been scaled across many different institutions and hundreds of thousands of users is our experience with LibQUAL+. More than 1,300 libraries, across 20 different languages, have implemented this protocol, and the database now holds millions of user perceptions of their libraries. Over the years we have also experimented with protocols that capture the behavior of users when they download electronic resources; the MINES for Libraries protocol, for example, captures purpose of use at the point of use, when users download a PDF or another electronic resource. We received an NSF grant that looked more closely into the digital library experience, DigiQUAL, and I see some of you here who were involved in the formative work of that. And ClimateQUAL is another protocol, one that looks into organizational climate and diversity issues.

That's the background history. I will look a little more specifically into a couple of these tools and some of the evidence we have at hand as it relates to the web analytics work. When it comes to the ARL Statistics, the century-old legacy, we did try over the years to refresh it and modernize it. It is really great to see Sherrie Schmidt here, because she was the leader behind the e-metrics pilot activity dating back to 2001, when we first tried to capture the usage of electronic resources. Some of that work led to ARL being one of the founding members of COUNTER, and also to testing data elements. Some of those data elements have been carried over into the annual ARL Statistics and some have been let go. One of the data elements we let go is the number of hits on the library's web pages, partly because of the variation we were seeing across institutions and really the lack of meaning in those data. But some data elements we have kept and moved into the annual ARL Statistics: the number of downloads, the number of searches, and the number of federated searches.

Now, these are not without challenges, because the latest challenge is discovery systems and how discovery systems capture usage of electronic resources. We have two interesting quotes here, gripes really, that appeared December 5th on a recently formed collection assessment listserv. Is anybody here following that list? I'm going to read those quotes.
The first one says: unfortunately, because of the way discovery systems work, they never interact directly with the source database or platform. Instead, all searches are conducted entirely within the discovery system's platform. Thus there is no search to record at the source database, so it is not reported in the DB1 reports (these are the COUNTER-compliant reports). So these things are not recorded as federated searches or sessions, and this is a big problem for us.

The second quotation, from the same thread, is from Dana Thomas, who presented here at CNI last year. She was saying: clients need to complain about the lack of good stats available to us and to demand something better. I know of some institutions that have implemented Google Analytics tracking for their discovery system so that they can collect information about the content that users click on from within the discovery system's index. We really shouldn't have to be doing this. These systems should do a better job, and we need to be demanding that as clients.

Now why are these things important? What evidence do we have at hand that tells us we need to do a better job? I'm going to show you three slides from the LibQUAL+ data that capture trends over the last 10 years on three key questions about faculty perceptions. These are data from ARL libraries, about 15 to 25 ARL libraries per year, tracking roughly 5,000 to 8,000 faculty members annually. The bottom blue dotted line is faculty's minimum expectations, and the top blue dotted line is their desired expectations. The red dotted line is how faculty perceive the library as performing.

The three items I chose tell a story. The first item is "the printed library materials I need for my work." Looking at faculty perceptions of how libraries are doing over the last 10 years, we finally started meeting their minimum expectations for print materials right around 2007, and by the time we meet their desired expectations, print will probably be obsolete. The second item is "the print and/or electronic journal collections I require for my work." Interestingly enough, faculty are increasing their ratings of how libraries are doing in this area, but at the same time their minimum expectations are also increasing. So even though we are trying to catch up, we are swimming upstream, and we are more or less in the same spot. The third item is "a library website enabling me to locate information on my own," and this is an item where we are losing ground. Over the last 10 years we have not been meeting faculty's minimum expectations, their minimum expectations are increasing faster, and their rating of how we are doing has stayed the same. We need to be doing much more work to make those websites better, and web analytics is one tool that can help us get there.

Our users are increasingly coming to us from off campus and using our resources remotely. We have data on that from the MINES for Libraries protocol; we have done studies with the Scholars Portal at the Ontario Council of University Libraries (OCUL) in Canada. This pie chart shows where users were coming from in the 2010 data collection: about 70% of people who use electronic resources are coming from off campus.
This was about 45% in 2005, and we are now in 2013. It will be interesting to see whether and at what point that stabilizes, but clearly the increase in usage from off campus is happening, and we have data that capture it. It is increasingly important to have good web analytics data.

The work where we have engaged directly with Google Analytics is one of the pilot projects in the Lib-Value IMLS grant. One strand of research there looked at special collections, and specifically at how Google Analytics can help identify the value of a specific special collection. Gail Baker and Ken Wise from the University of Tennessee did most of that work. We actually have a Lib-Value webcast available that captures how they did what they did, and we hope that with the toolkit available, the webcast available, and the future work we are doing with OCLC and Montana State, we can launch more activities that bring institutions together in a concerted effort to articulate the value of their digitized special collections and other digitized resources, institutional repositories, and all the forms digital collections take. Through a more concerted effort we will start having better and more comparable data on these resources. So on that note, Kenning will continue.

When Patrick and I started working together at Utah in 2010, I knew we had a problem with our digital collections. I knew the numbers were low, I knew people weren't getting to those collections; I didn't know how bad the problem was. So we started to put some benchmarks in place. This is the number of search queries submitted to Google each month by Americans: 12 billion. When you start to think about numbers that big, you realize how small a slice of the pie libraries are actually getting. This is the percentage of our digital collections that were findable, that were indexed by Google. And this is even worse: the percentage of the scholarly papers in Utah's institutional repository that were indexed in Google Scholar. Very, very low numbers. It was no wonder people weren't getting to us. Combine that with OCLC's well-known surveys in 2005 and 2010, which showed that very few people start their research at library websites, and it starts to make sense why they're not getting to our digital collections: most of them are going to search engines like Google and Google Scholar.

Over a period of time we managed to have quite an effect on these numbers simply by doing what we call traditional SEO: making sure servers are configured correctly; making sure there's decent metadata; making sure there are appropriate messages to search engines, redirects and so forth, when collections are changed; and making sure there is actually a relationship established with the search engine, using Webmaster Tools and Google Analytics. We brought those numbers way up, and that resulted in a very significant change in the number of page views every day, which of course meant more referrals and more visitors. We took the number of visitors to the digital collections up by 136% and the number of referrals from Google-domain search engines up by 500%, meaning Google was telling more and more of its users to go to our site, and it was having an effect.

All this research resulted in some general themes. One is that it's pretty clear that SEO has been an afterthought when libraries have created digital collections.
We talk so much about scanning, image quality, metadata, building the website, and we don't think about this until we realize that nobody's coming to see those great collections we created. Librarians also think very small about potential traffic. Start thinking about those 12 billion searches a month happening at Google, then look at the numbers actually coming to your collections, the tens or hundreds or maybe thousands, and you realize there's a big gap.

Organizational communication is often poor. SEO tends to get left to a few people in IT organizations, but it really should be the domain of anyone who has an interest in making information available via the web. The classic example I like to give is that we had a systems administrator and a programmer sitting about 20 feet from each other. The programmer had submitted sitemaps to Google, and the systems administrator was responsible for the robots.txt file on the server. So the sitemaps were saying to Google, come and index us, and the robots.txt file was saying, no, you can't come in here, stay out. Bad communication.

Data in repositories are often messy, and when I say data, I mean metadata. We re-key too much, and we make mistakes when we re-key. Analytics are usually poorly implemented, and that's really the focus of this talk. Vendors tend to be slow to catch on to SEO problems, vendors of commercial products like CONTENTdm but also providers of open source products like DSpace and EPrints. And then there's the semantic web issue, the second part of SEO that we have been focusing on more this last year. But you can't take advantage of the benefits of semantic web SEO if you don't do the traditional SEO first; you have to be in the search engine indexes in order to get the click-through rates that semantic web SEO can provide.

So in general, what we have been recommending is that SEO be institutionalized, that it become part of the library's strategic plan, and that accurate measurement tools be put into place and managed from the top, not from pockets of IT. Then make sure the traditional SEO work is in place, that you are getting your digital objects into the search engine indexes. The next step is taking advantage of the growing semantic web SEO possibilities: taking advantage of linked data sources and of semantic web techniques like applying schema.org to metadata, moving from flat, one-dimensional text strings to contextual metadata that establishes authoritatively what it is you're describing.

This has resulted in a book that we wrote, a LITA Guide, and two significant articles. The one we're going to focus on now is the one on the right, Invisible Institutional Repositories, which appeared in Library Hi Tech in 2012. We got Google to index nearly 100% of the University of Utah's institutional repository. Over a period of time the indexing ratios went up and the IR was showing up in Google. Great, we were golden. But we were not showing up in Google Scholar. Google and Google Scholar are separate organizations; Google Scholar has its own rules, its own recommendations for how your metadata should be structured and for how pathways to the digital objects should be set up. So all that work we did still got us nowhere in Google Scholar, and Google Scholar is really where an IR wants to be.
That's where the academic research is happening. So we figured out, and published in this paper, Invisible Institutional Repositories, that the real challenge is in how citation metadata are structured in most institutional repositories. This is how we typically put a citation into a Dublin Core field in an institutional repository: it may follow a certain style, maybe Chicago or APA, but it's all the pieces of the citation in one field. Great for humans; a machine has no idea what to do with that.

This is what Google Scholar wants. Google Scholar recommends against using Dublin Core; it says in its documentation, use Dublin Core only as a last resort. It wants you to use one of four metadata schemas: Highwire Press, which is what this is, PRISM, EPrints, or bepress. And it wants every part of the citation in a separate field. It can do something with that; it understands what to do with that. Not only does it understand how to pull that into its index, it also helps it push citations back out to the user in any citation format, any style, they want. Patrick is going to tell you how we figured this out, and the rest of what we figured out.

Thanks, Kenny. I do have to point out that the first slide, the one with Julian Assange and Mark Zuckerberg, is one Ricky put up there. So if there's any controversy, let's say Ricky did it; if there's praise, we did it. And it's available on Facebook.

So we kept asking questions as we went through these steps, and the thing we realized was that less than 1%, and we're throwing numbers around here, 1% being generous, of the 8,000 scholarly papers were indexed by Google Scholar. Kenny said that's where scholars are going to find research, and we think that may also be why expectations are increasing: Mendeley, Google Scholar, Microsoft Academic Search are becoming easier and easier to use, and the content in them is getting larger and larger.

So we conducted three pilots, and again, this is a summary of the paper Kenny mentioned. In the first pilot we followed what we thought were the published Google Scholar standards at the time, and got nothing. Then, through OCLC, we were able to make contact with Google Scholar and get some direction. We did another pilot and got a 62% indexing ratio. Let's just say our laboratory wasn't very robust at the time; our server went down, in fact there were a lot of power outages at the university, and it was basically an old desktop machine plugged into a wall outlet that we were running our services on. But we got 62%. We did it again, expanded the number of digitized items, and got to 90%. We found the magic mix. Then we took that magic mix, cleaned up three of the IR collections, and put it into production around July 2012, and we went from close to zero to over 4,000 items indexed by Google Scholar. Those are verified items.

The question, to me, when I look at this, is: this is great, but why is there still a gap? Don't know. Don't know if it's our methods, because Google Scholar is not very public about how you find and verify indexing through APIs or anything like that, as far as I'm aware. But that's an interesting question we want to look at. At the end of the day, though, we didn't see our Google Scholar visits and page views increase, and that didn't make sense. So we dug deeper.
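(To make the citation metadata point above concrete, here is a minimal sketch of what Google Scholar-friendly markup looks like: each piece of the citation in its own Highwire Press style meta tag rather than one concatenated Dublin Core field. The tag names follow Google Scholar's published inclusion guidelines; the record and the small helper function are hypothetical, just to show the shape of the output an IR item page would need to expose.)

```python
# Hypothetical sketch: emit Highwire Press style <meta> tags for one IR item
# page, one tag per citation element, which is the structure Google Scholar's
# inclusion guidelines ask for. The record below is invented.
from html import escape

record = {
    "citation_title": "An Example Scholarly Paper",
    "citation_author": ["Doe, Jane", "Roe, Richard"],
    "citation_publication_date": "2012",
    "citation_journal_title": "Journal of Examples",
    "citation_volume": "30",
    "citation_firstpage": "60",
    "citation_lastpage": "81",
    "citation_pdf_url": "https://ir.example.edu/bitstream/1234/paper.pdf",
}

def highwire_meta_tags(rec):
    """Build one <meta> element per citation field, repeating citation_author."""
    tags = []
    for name, value in rec.items():
        for v in (value if isinstance(value, list) else [value]):
            tags.append(f'<meta name="{escape(name)}" content="{escape(v)}">')
    return "\n".join(tags)

print(highwire_meta_tags(record))
```

The contrast with a single concatenated citation string is the whole point: a parser can pick out the author, the volume, and the PDF location without having to guess at citation styles.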
When we dug deeper, we discovered that most analytics have potential accuracy issues. There are basically two methods. The first method I'm going to talk about is log file analysis. This is not to say that DSpace and EPrints aren't aware of the issue; they are, and they're implementing things to deal with it, and it's not a complex issue. But structurally, there are things that cannot be addressed by using log files alone. Basically, every request that reaches your IR ends up in the log file, whether it comes from a person arriving through a commercial search engine or from something else entirely. And we know the private sector has largely stopped relying on log file analysis, because it overcounts visits and downloads due to spiders. There are over 240 unknown spiders, and new spiders are being created every minute, because there's hijacking, there's malware, there's all sorts of stuff; people and businesses want access to content, because that is how they generate revenue. That isn't your target audience. That's not who the stakeholders who fund the library care about, so it may not be a good idea to count it. And then the internet is constrained by bandwidth, and a lot of work has gone into what's called caching, so that someone can request an item and never come to your website at all; they get it from somewhere else, and it will not show up in your log. So as much as you do to improve the products, those are two fundamental issues you have to deal with, and there has to be a hybrid.

The private sector has moved to what are called analytics services, and those do not track non-HTML files or downloads out of the box. Here's what I mean by that. The typical analytics services are Google Analytics, WebTrends, and Omniture, and they are making a fortune, a lot of money, because of the data. What happens is a user makes an HTML request, a standard web page, and that hits your IR. The analytics services use a technique called page tagging, which is JavaScript embedded in the page, and that JavaScript only executes when an HTML page loads. However, users want to download files, and files like PDFs, Word docs, PowerPoints, any of those, aren't going to get captured without some special type of configuration. Google has come out with a 3.0 version that's improving this and making it easier, but you're still going to have to do something, and not everybody goes beyond the basic implementation. So analytics services tend to undercount non-HTML assets, for example PDF downloads, and if you run an IR, we think this can be a pretty big issue.

So this is an example of someone going to Google Scholar, and you'll notice there are two choices here. I want to talk about the first one, the HTML request. The user clicks on that and gets taken to this page. If the person scrolls down and downloads the file, the repository page will capture that information, but Google Analytics will not, unless there's a special configuration. So now you have the problem of data coming from two places, and that never works out well. The next scenario is a likely scenario, and it's only a problem if you are indexed by Google Scholar; if you're not indexed by Google Scholar, you're probably not going to notice this. Analytics services don't track direct requests for non-HTML files.
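(An editorial aside on what that "special configuration" can look like: because a direct request for a PDF never loads the page-tagging JavaScript, one possible workaround is to have the repository's own server report the download to the analytics service. The sketch below assumes Google's Measurement Protocol, introduced with Universal Analytics around this time; it is not the method the presenters describe, the tracking ID and paths are hypothetical, and a production version would need persistent client IDs and bot screening.)

```python
# Hypothetical sketch: the server reports a PDF download to Google Analytics
# via the Universal Analytics Measurement Protocol, since page-tagging
# JavaScript never executes for a direct file request. IDs and paths are made up.
import uuid
from urllib.parse import urlencode
from urllib.request import urlopen

GA_ENDPOINT = "https://www.google-analytics.com/collect"
TRACKING_ID = "UA-XXXXXXX-1"  # hypothetical Analytics property

def report_pdf_download(file_path, referrer=""):
    """Send one download event after the repository serves a PDF."""
    payload = urlencode({
        "v": "1",                  # Measurement Protocol version
        "tid": TRACKING_ID,        # property to credit the hit to
        "cid": str(uuid.uuid4()),  # anonymous client ID (simplified: random per hit)
        "t": "event",              # hit type
        "ec": "Downloads",         # event category
        "ea": "PDF",               # event action
        "el": file_path,           # event label: which file was fetched
        "dr": referrer,            # referrer, e.g. a Google Scholar results page
    })
    urlopen(GA_ENDPOINT, data=payload.encode("utf-8"))

# A repository request handler might call, for example:
# report_pdf_download("/bitstream/1234/paper.pdf", "https://scholar.google.com/")
```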
Think about who makes those direct requests. A scholar, somebody using a scholarly search engine, requesting information and accessing it: that's real high value, right? Someone seeking information that way is probably a scholar, not a kid on Facebook coming over looking for stuff, which may be interesting, but is probably not what faculty, department heads, and college deans are interested in. Or e-mail: if a link embedded in an e-mail goes straight to the non-HTML file, it will not get captured. Google is moving to a better solution, but it's really sophisticated and difficult to implement, and again, consultants are making a fortune implementing Omniture, Google Analytics, and WebTrends.

So, the PDF request. Let's say the scholar does the search, reads the description, and goes, you know what, I don't need to go to the web page, I want to read that paper, and they click on that link. They go directly to the paper, which means they probably want to read it; they don't need the metadata page they're already familiar with. The paper is what they want.

So going back: we did all this indexing, and we increased the indexed content considerably, but we did not see the traffic. There were no referrals. Why is that? Well, we believe, at least based on this one data point, that a large number of Google Scholar users are being undercounted. We used some pretty sophisticated software, Splunk, to comb through the data and separate the wheat from the chaff (a simplified sketch of this kind of log filtering appears at the end of this transcript). And we identified a minimum of 125 unique users making requests for scholarly works, clicking on PDFs from Google Scholar, that we did not see in the analytics. Those 125 probable scholars downloaded at least 200 papers. Splunk is pretty sophisticated, it's very expensive, and they throttle what you can analyze, so we looked at five days of data. The authors of those 200 papers would probably be interested to know that people are finding them through Google Scholar. And again, we have one data point; that's part of the issue. We don't know how prevalent this is, and we don't know how well the software, DSpace for instance, is dealing with it. But we'd also want to know: where are those requests coming from? Do those 125 people have .edu or .gov addresses? Because those are probably real researchers who may cite the work, and that demonstrates the value of the IR. And if you can produce numbers like that for a department or a college, there's probably going to be increased participation. Again, this is theory, but it makes sense. So this is where Kenny comes in.

We hope this is a pretty compelling case; we think it is. This is about establishing the value of institutional repositories and doing it with numbers, being able to report those numbers to funding organizations, to parent organizations like our universities, and even beyond, to organizations like ARL. Our next steps in this partnership of Montana State University, ARL, and OCLC Research are basically to start gathering more data. As Patrick said, we have one data point. Over five days, you can see how many PDF downloads were not being counted; extrapolate that over a year and those are significant numbers, but we need to make sure those numbers hold true across numerous repositories. So the biggest problem we have is getting our hands on more data sets. If there's anybody who wants to participate with us, we would love to talk to you. Time and resources. Yes.
So what Patrick is saying is that we're probably going to be looking for a funding source to support this research. We hope to develop solutions, including policies, and perhaps training for institutions that don't currently have the expertise to appropriately configure web analytics. And in doing so, we hope to improve the situation for all of us. So we'd be happy to answer any questions. I hope all that was clear, but if it wasn't, please ask.
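(As promised earlier, here is a rough sketch of the kind of log filtering Patrick described doing with Splunk: pulling PDF requests referred from Google Scholar out of a web server access log and counting the distinct clients that page-tagging analytics would never see. It assumes an Apache/NGINX combined log format and uses a deliberately short bot screen; a real analysis would need a far more thorough user-agent and IP filter, and the file name is hypothetical.)

```python
# Hypothetical sketch of the analysis described above: find PDF downloads
# referred from Google Scholar in a combined-format access log and count the
# distinct clients making them.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
# Deliberately short screen; real spider lists run into the hundreds.
BOT_HINTS = ("bot", "crawler", "spider", "slurp")

def scholar_pdf_downloads(log_path):
    """Count (ip, user-agent) pairs that fetched a PDF with a Google Scholar
    referrer and a user agent that does not look like a known spider."""
    clients = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LOG_LINE.match(line)
            if not m:
                continue
            if not m["path"].lower().endswith(".pdf"):
                continue
            if m["status"] != "200" or "scholar.google" not in m["referrer"]:
                continue
            if any(hint in m["agent"].lower() for hint in BOT_HINTS):
                continue
            clients[(m["ip"], m["agent"])] += 1
    return clients

if __name__ == "__main__":
    downloads = scholar_pdf_downloads("access.log")  # hypothetical file name
    print(f"{len(downloads)} unique clients, {sum(downloads.values())} PDF downloads")
```

Even a rough pass like this yields the kind of number the presenters describe: apparently human clients downloading papers via Google Scholar referrals that never appear in the page-tagging analytics.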