Hello, let's start again. Welcome to the second session; I hope you could have a bit of a breather outside while we still have the sun here. Let's dive further into scientific workflows, as we already started before the break, and hear what scientists have to tell us: how they can help us decide, and what is expected of us as a community. We have three researchers here that I'd like to introduce to you. I'll introduce them, they will each give us a quick overview of what they are doing and how they are dealing with data, and then the floor is open to questions: the ones I already have here and, of course, all the questions that you have. So you have the scientists here; fire away, I would say, figure of speech, of course. There on the right we have Joris van Zundert. He is a researcher and developer in computational and digital humanities at the Huygens Institute. He is experienced in computational aspects of empirical literary and linguistics research, applying digital and statistical technology. He is a profound advocate of empirical humanities research, evolutionary modeling, iterative development, and agile and lean project management. Next to him is Aaike De Wever; he is a science officer for the BioFresh data portal at the Royal Belgian Institute of Natural Sciences here in Brussels. As a taxonomic backbone for the project, they are maintaining and extending the Freshwater Animal Diversity Assessment database; he will tell us more about it. Professor Dirk Van den Poel is a professor of marketing analytics, big data, and analytical customer relationship management at Ghent University. He specializes in customer intelligence and analytical customer relationship management, including customer acquisition, customer retention, and cross- and up-selling; data mining is an important thing for him. He has also been the programme director of the Master of Marketing Analysis since 2000. I give the floor to you.
I start with Joris: please tell us what you're about and what you're doing, and maybe you already have some questions for us.

Sure, thank you for having me over, first of all; it's very nice to be here. I actually wrote that introduction you had there myself, and I'm quite surprised to hear it back after, I think, a few years now, because nowadays I might describe myself a little differently. That doesn't matter, it's a fine introduction; it's just maybe a bit strong on the computational side. So, what am I doing? I'm from the Huygens Institute. Somebody actually made a nice little mistake of putting Belgium down as my nationality, which I'm very happy happened, because I think my roots are finally coming from here, but I really couldn't aspire to take the nationality of such nice people. I'm actually Dutch, working at the Huygens Institute for the History of the Netherlands, and that is an interesting flock of researchers, because we have them in all sorts and varieties. We have researchers that are completely at the hermeneutic side of things: humanities researchers entirely focused on interpreting data, interpreting text, and creating new insights on the basis of solid and subtle reasoning. I, on the other hand, could be seen as a somewhat newfangled type of researcher in the humanities, which is all about trying to formalize the data that we're working on, and I guess that might be interesting for the people in the audience today as well. So we have been looking into ways of, for example, formalizing the stylistics of literature and literary resources.
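The kind of formalization Joris mentions can be made concrete with a small sketch. This is an illustrative toy, not the actual method used on the manuscript he discusses next: the texts are invented, and plain relative frequencies of a few function words stand in for the real, more sophisticated statistics.

```python
from collections import Counter

def profile(text, vocab):
    """Relative frequencies of the given common words in a text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocab]

def distance(p, q):
    """Mean absolute difference between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

# Toy 'known' samples of two authors and an unknown fragment.
author_a = "the knight rode the horse and the knight saw the castle"
author_b = "a maiden sang a song while a bird flew over a tree"
fragment = "the squire took the sword and the squire left the hall"

vocab = ["the", "a", "and", "while"]  # high-frequency function words
pa, pb, pf = (profile(t, vocab) for t in (author_a, author_b, fragment))
closer_to_a = distance(pf, pa) < distance(pf, pb)
print("fragment resembles author A" if closer_to_a else "fragment resembles author B")
```

In an actual study, profiles like these, computed over many function words in sliding windows of the manuscript, are what let you spot where one author's habits give way to the other's.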
So, just to give you one example of what I've been working on: trying to tell apart two different authors in a 14th-century Middle Dutch manuscript, of which the text itself claims that it was written by two authors, but nobody really knows where one author took over from the other. And indeed, using statistical means, simple word counting one could say (it's slightly more difficult than that, but we can go into it if you're interested), we were able to pinpoint the transfer from one author's text to the other's. What is interesting is not so much the fact that we could pinpoint it, because that was somewhat known, although unsure; what we also found, which was rather new to the field, is that the later author took over from the first author and changed his text, and that we could only tell with certainty by taking a more statistical, data-model view of the text. So that's the kind of research I'm interested in, and it also gets you into the data-modeling aspect within the humanities, which is, I would say, an intricate and complex thing that we're still figuring out. And yes, certainly at some point we will need all the data centers and libraries to help us out with how to store, manage, and exchange that data, because that is still very much a search and a quest for us. So, that's hopefully enough from me.

Thank you very much. Over to you.

I'm coming from a completely different background: I'm a biologist. I studied here at Ghent University and did a PhD on aquatic microbial ecology, and this brought me into contact with several types of data. One of them is genetic sequence data, for which there's a very nice repository called GenBank, which every scientist who has worked with sequences knows about. On the other hand, I've worked a lot with experimental data, where it's very much each researcher on their own,
except, well, if you collaborate on a particular experiment; but it's typically a rather small group of people working on such a data set. After my PhD I also worked a little with remote sensing imagery. On the one hand with satellite data, where you're just relying on the big providers, and it's more a matter of describing what you actually did with the data, to have some metadata on it; and on the other hand with airborne imagery, where the team I worked with actually paid for a plane to go out and take pictures of a certain area, and then it's also up to the group to take care of the storage of such data. These pictures are very expensive to make, and they could be reused for different purposes, so there it becomes really important to get a decent archiving facility. About three years ago I started my job at the Royal Belgian Institute of Natural Sciences; for those living in Belgium, it's known as the Dinosaur Museum. It's quite a big institution: it's among the top ten natural history museums in the world in terms of the size of its collections. I'm working on an EU project called BioFresh, which is focusing on freshwater biodiversity. We're trying to get data together on which species occurs where on the planet; for instance, if people catch a fish in a stream and report it, then BioFresh is interested in that kind of data, to get a global overview of where species occur. In this job I come into contact with data every day, and the big problem is that there's this long tail of data, where individual researchers go out to a pond to see what's living there and, if you're lucky, they make a publication out of
it. But sometimes they just don't care, and the data gets lost: in the best case it's in a paper, but the data is not deposited; in the worst case what they found is just on their hard drive and is never going to be used. And freshwater is really the environment under the heaviest threats: biodiversity is decreasing in freshwater at rates much higher than in the marine and terrestrial environments. If we want a good idea of what is actually going on, why this is the case, what we can do to stop it, and what will happen under climate scenarios for different species, we need a better view of where species live and under which conditions. Bringing all this data together, and especially the data in the long tail, would help a lot. But there's still a big effort to go through, because scientists are simply reluctant to make this kind of data available. So that's it from me for now.

Thank you.

Hi, I'm Dirk, professor of marketing analytics here at the university. I got started with my PhD about twenty years ago, which sounds scary. Most of the data we had available then was company-internal data. I'm in marketing, so what we do is analyze marketing data, consumer behavior data: we try to figure out how the consumer behaves under certain conditions, and we try, from the point of view of the company, to optimize how things work. At that time we only had internal data. Now, even just internal data is a lot of data in the case of marketing, because, as you know, you're all consumers: you go to the supermarket, you scan your loyalty card to get your points, and what we get out of that process is lots of data on what you purchase and, more importantly, what you purchase together in one basket. We as researchers want to understand what's going on in this purchasing process. Even this purchasing data, just recording what
you purchase, is in fact the end result of a whole process that has been going on in your mind and in your social environment. When I started twenty years ago we only had the end result, the purchase. In the meantime we of course had the development of the internet and all kinds of social media, and what that made possible is additional data on what's going on between, say, you viewing a commercial on TV and you thinking about a product: should I buy it, shouldn't I buy it, and so on. The whole process is much better documented now, because we also see your information searches on the internet. So as opposed to just having the end result, the purchase, we now have the whole process information available. Twenty years ago we only had hundreds of gigabytes of data, which is still a lot but limited; now we have terabytes, and some companies even go up to petabytes, because of those intermediate processes. These companies can observe every tweet you send out about them, and sending out that tweet may reveal something of the thought process that brought you to either buy or discard a certain product. Now, this is just observing what we do as consumers, but there is more to it: we are not always rational consumers. You also want to tap into the attitude a consumer has towards a brand, so in addition to the existing data we also collect survey data, and we try to augment the existing data with what people think about products. So nowadays there's a big stream of research that combines the behavioral side of things with the attitudinal side, and the combination of both should give us more insight into why people purchase certain products.

Well, now we know what your worlds and your research are; it's quite a diverse
field, I would say. Now, what do you do with the data? Is it archived and curated in your institution? Do you do it yourself, or do you use a general disciplinary data bank?

For my part, the humanities, I should distinguish history and literature research from several other types of humanities research. Speaking for that part of the humanities, the situation at my institution is that we don't really publish our core data. We do have it: for example, if I run an analysis I get statistical data, and I do keep that, and sure, I'm prepared to deposit it with the Dutch national data archive. But the fact is that we're not doing very much to educate humanities researchers and instill this idea of data curation and data stewardship; I think we could do better at that. There is, however, a complication: for the humanities, for literary and history scholars, it's still quite unclear what their data actually is. Of course we all use text, we all use documents, but what is a text, what is a document? If you start asking humanities researchers what they define as data, you get a distinctly different answer from every one of them. The trouble with humanities data is that it's highly complex, highly varied, highly heterogeneous. We feel that people on the linguistics side of research have a distinct advantage over us, because they tend to look at spoken word and written language, at the string, the characters, or the phonology. I don't want to downplay their research at all here, but from a literary or historian's perspective that is a very baseline kind of data: very storable, very clear, very common. Now take the research of a plain-vanilla, average humanities researcher: he has tens, maybe hundreds of different documents, different images, different interviews, in various languages, with various
compositions that are meaningful to him or her, and there is not really a way yet to express that richness and complexity. Sure, there's the semantic web, but that is still not really close to what we as humanities researchers need from data modeling and data representation.

Is that a question of data definition, of what data is, or is it a question of good ways to describe data?

Both, actually. But we're also only grappling at the bottom, I think, of formalizing the way we do research in the humanities, and that word alone triggers mighty negative responses. If you talk to a humanities researcher about formalization, if you just ask questions about how they do their research, you tend to run into this wall of "well, my research is so specific and so much tied to interpretation that it cannot be formalized." I think that is a somewhat easy fallacy humanities researchers use to avoid really trying to establish whether a formalization of research is possible. At the same time, the computer science side of things needs to come up with, and is in fact trying to come up with, ways of supporting the vague and incomplete and subtle reasoning that humanities researchers use, because we're in fact not about mathematics, not about counting, but about reasoning, and there are ways to support that, but they're also still in development. So on the one hand we're figuring out what our data is and how to represent it; on the other hand we're trying to come up with ways to approach this data computationally. It's research on both sides, so we're nowhere near, I think, a generalized infrastructure for these kinds of research, but we're hopefully getting there.

At this moment, this is all stored on your own servers, in your own institute?

Currently, yes, I'm afraid.

Okay. Well, in biology, the kind of data we're focusing on is really quite different: it's really straightforward what we're looking for in terms of data. It's
primary biodiversity data, also termed the "what, where, and by whom": which species has been observed, where, and by whom. There are standards for this, so that is not really the problem. Now, I'll go into a little detail on the different types or sources of data we're targeting. One is, for instance, museum collections; another is monitoring data, as in the frame of the Water Framework Directive, where member states of the EU have to report on water quality based on which organisms occur; and the third is the more experimental, ad hoc sampling done by researchers. Each of these three has its own peculiarities. For museum collections it's clear: museums have a public role and an obligation, they're starting to digitize their collections, and most of those institutions actually make their data available. This is done through what we call a network of interoperable databases: the data physically stays on a server at the museum, but it's made available through open standards and can be harvested through a central initiative, which in my case is the Global Biodiversity Information Facility. For monitoring data, we have tried within the BioFresh project to mobilize this kind of data, and there it's a problem of political willingness. We really see that the data is out there, but people seem reluctant to share it, because based on the original raw data one could recalculate the quality index and criticize it, that kind of thing. Then it becomes more a political matter: certain member states have no trouble sharing these data, others say the quality may not be good enough; they distrust it. That is what you see in the BioFresh project.
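The "what, where, and by whom" record Aaike describes has a widely used vocabulary, Darwin Core, and GBIF serves harvested records through a public REST API. A minimal sketch follows; the species and values are invented for illustration, and while the term names and the occurrence-search endpoint are drawn from the public documentation, treat the exact details as something to verify rather than gospel:

```python
# A minimal occurrence record using Darwin Core term names
# (the standard behind the "what, where, by whom" data model);
# all values here are invented for illustration.
occurrence = {
    "scientificName": "Salmo trutta",     # what
    "decimalLatitude": 50.85,             # where
    "decimalLongitude": 4.35,
    "eventDate": "2012-06-14",            # when
    "recordedBy": "A. Researcher",        # by whom
    "basisOfRecord": "HumanObservation",
}

# Records like this, exposed by the interoperable museum databases, are what
# a central initiative such as GBIF harvests and serves back out; its
# occurrence search endpoint takes queries along these lines:
gbif_query = ("https://api.gbif.org/v1/occurrence/search"
              "?scientificName=Salmo%20trutta&limit=5")

print(occurrence["scientificName"], "at",
      occurrence["decimalLatitude"], occurrence["decimalLongitude"])
```

Because every provider emits the same flat, standardized record, aggregating the long tail of small field studies becomes a harvesting problem rather than a data-modeling one.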
You stumble upon that obstacle: not everyone is willing to share, and there's that data-quality question.

Yes. And then you have the researchers themselves, and there you run into different stories again, like "I don't get enough credit for this." The presentation by Sarah Callaghan summarized that situation very well, I think. So we have the whole range of everything in place.

Do you think it's necessary? How do the peers react in the project: do they force each other, or really demand that the data be opened up or made available somewhere for sharing?

Well, the nice thing within our project is indeed that the researchers in the project have their own data, and because we're a project that's supposed to show off how you can make this data available, they are under a certain pressure. But even there, we're now running into the last year of our project, and there are still a lot of data sets that we need to really get from the researchers within our own project. So there's a long way to go.

What's your experience with that?

Well, given that we are working with highly sensitive data (I think most of us wouldn't like our purchases being thrown on the internet), it is a very, very sensitive topic in our field. We have to sign lots and lots of non-disclosure agreements; NDAs are really a pain in our line of business, but that's the way it is. Of course you could argue: why don't you anonymize the data sets? To some extent you can, but even that is really tricky, because there's always this one data source external to your data that could augment it in such a way that you could actually identify who is behind the data. So we have to be extremely careful with that. Still, it is a very important issue, so important, actually, that one of the leading journals in our field, Marketing Science, published
in their editorial of January-February 2013, so this year, a new policy they put in place: you have to share the data with them, and even more, not just the raw data but also the programs you used, to validate the results.

And so you're already there; you have the question in front of you.

Absolutely. So there's a push towards this openness, and I'm very much in favor of it, because we're also doing lots of open-source software development, so we actually support that, even though the companies that fund us don't always support that part. There is a big push; the open question is whether it will happen, because even this journal, if you read the editorial, puts a mandatory process in place but says at the end that you may apply for exemptions.

When we talk about openness, we can't avoid it: there are some data that we cannot put out in the open, but we have to make sure they're somewhere safe, that the work can be repeated, that the data can be curated. Do you have help on that front from your universities or institutions? Or do you say we really need a more national approach, or that every library needs to step in and help on this subject? What is your feeling about that?

Do I understand your question correctly, that you're asking whether we have sufficient help and support in opening our data?

It's the archiving and curating as well, so it doesn't have to be open; but in both senses. When you want data out in the open, you also have to make sure it stays open, that you curate it, that you archive it. And some data will never, or not quickly, be open, because it's too sensitive, I can imagine.

Yes. There's lots to be said about that, I guess, but let me give you two things that I think are pointers, or could be important, at least for the humanities side of things. First, there's again the question of what our data is. I mean, if it's journal
articles, I think the situation is more or less under control, in the sense that we can publish these, and libraries are doing their jobs, so these get stored, archived, preserved, opened, and made discoverable. That's something we trust in, as it were. But then there's the born-digital data, the data that gets created while doing your research. We're now experiencing, not a full switch, but at least an augmentation of humanities research with this idea of open-science approaches, where your data is sometimes produced by crowdsourcing, and the research process itself is much more open and collaborative. It also means, for example in the case of digital editions, that these might not be closed down anytime soon; they might never be "done" at some point. So how do we cope with data that gets updated time and again, with versioning and things like that? These are very difficult problems we're still struggling with, and yes, we could certainly use a lot of help from archives there. Then there's the thing about open data, open-access data. Apart from some things that you can't disclose for ethical reasons, of course, I think hardly anybody will deny the usefulness of open data, and within the humanities it's usually not very sensitive data; the fear that somebody will scoop my research is not a very big issue. What is a big issue is crediting the people who create this data. Creating data sets, opening up data, publishing data online digitally: you are usually never academically credited for doing so.

So you say there should be a policy initiative there?

I'm not sure how we should implement that policy, actually; I don't say top-down, but somewhere there must be, yes, there's definitely an urgent need for crediting just, you know, creating data, creating data sets, publishing
data. If we keep crediting only monographs and journal articles, open-access data will not happen very soon, not even if it is increasingly a demand from your funders; there's no good stewardship in pressing people into it. A lot of research is being done now on that level: how can other means of assessment also take opening up data, data publications, and things like that into account?

I want to ask you: you have really big, well, rather big data sets. I was thinking about the petabytes, that's something else, but it really is big data. Do you version the data, and how many versions; how do you want to preserve them?

It's actually a very big issue in our field, because the other thing is that we do modeling on the data. We do not just describe what's in the data; we actually try to figure out causal relationships. We hope that in the data there are some natural experiments where, for example, prices in the supermarket were lower at certain points in time and higher at others, and we try to use these differences to infer what price is doing, and whether a lower price actually helps in getting people to buy more. This is just one example; we have this for many aspects, for example in how many supermarkets my product is physically available, the distribution coverage in my distribution channel. Moreover, we need to version the data so that we can do out-of-period predictions, because we are not just in the business of describing data: what we really want to do is predict your behavior, and use these predictions to optimize company policies. So that's a third layer: we really need different versions of our data, different histories, so that we can go back and ask how we built models on last year's data, use them to predict next year, and then verify whether what we predicted happened.

And do you feel that you can cope with that now, or is this still a struggle?
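The train-on-last-year, verify-next-year loop Dirk describes can be sketched in a few lines. This is a deliberately toy version: the customer numbers are invented, and a trivial purchase-count threshold stands in for the real data-mining models; the point is only to show why the historical snapshots of the data need to be kept.

```python
# Toy "out-of-period" validation: fit on a 2011 snapshot, verify on 2012.
# Each customer is a (purchases_last_year, churned_next_year) pair -- invented.
snapshot_2011 = [(12, False), (9, False), (3, True), (1, True), (7, False)]
snapshot_2012 = [(10, False), (2, True), (4, True), (8, False), (1, True)]

def fit_threshold(snapshot):
    """Pick the purchase-count threshold that best separates churners in-sample."""
    best_t, best_acc = 0, -1.0
    for t in range(0, 15):
        acc = sum((p < t) == churned for p, churned in snapshot) / len(snapshot)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

t = fit_threshold(snapshot_2011)  # "trained" on last year's version of the data
hits = sum((p < t) == churned for p, churned in snapshot_2012)
out_of_period_accuracy = hits / len(snapshot_2012)
print(f"threshold={t}, out-of-period accuracy={out_of_period_accuracy:.0%}")
```

Keeping the 2011 snapshot around, rather than overwriting it with the current state of the database, is exactly what makes this back-test possible, which is the versioning burden Dirk points at.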
What is the help that you expect?

Good question. Currently we cope with it ourselves, but it puts a big burden on the PhD students and on our department, and you almost need an IT background to be able to deal with these things. So help from libraries or IT departments would very much be appreciated, because massaging your data takes a lot of time; people forget that preparing a data set is almost 80 to 90 percent of a PhD researcher's time. Nobody credits that. In our field it's a little different, because it's usually the PhD student who publishes who also prepares the data, so to some extent they get credited for their work; but we forget that it is most of the time involved in the whole process.

Absolutely. So you need help from the libraries, but I think we have to work together to get the right frame: which data do we need to collect, and how is it described? I don't know; I really want to know. And what do you think in the audience? What would you like to ask the scientists? What information do you need in order to help them? What pops up in your mind; anyone with a remark or a question? Yes?

I just wonder: what do you wish your libraries, and more generally your universities, would be doing for you that they aren't doing at the moment, that would allow you to concentrate on doing the research and not on the things that distract you from it?

Well, I understand that libraries are very much connected to these institutional archiving efforts, and I think there are various needs for encouraging scientists to really deposit their data as well, not only the paper but also the data. There are various initiatives: initiatives at the level of the funding agencies, where they require depositing data, and requirements from the journals, like
for instance, working with Dryad, on which we will have a presentation this afternoon, I guess. I think the libraries could play a role, on the one hand together with the institutional archives, in pushing forward this archiving of the data itself, as a link in the whole chain. And now I've lost my thread.

I would like to answer two things to that, at least; there's lots more, probably. First of all, if you ask any humanities researcher what data or information they want, it's simply: get me all text, worldwide, in the same format, exchangeable. Sounds simple; actually it is pretty simple, but it's a hard job doing it. What specifically lags, although it's becoming more and more fashionable, I guess, is API access to textual data: a means for my server, for my algorithm, to travel to your servers and parse or analyze the data you have available from your collections of text. Of course I'm text-oriented, so there will be similar questions from other perspectives: get me all the pictures you have, et cetera. So machine access to the data is very important, and it's still very obscure at the moment. The other thing I would like to do, and would be very much interested in, is small-scope, small-scale projects together with libraries and librarians to figure out what that actually means. Because it sounds simple, "machine access," but if you go into really doing research, you discover that there are all kinds of workflow and description problems you want to get solved, and we need far more information on actually doing that. I think we gain this information, this knowledge, not by doing some grand, broad, Europe-wide one-off project, but by a multitude of small projects with specific libraries and specific research questions, so that we can accumulate knowledge on how humanities researchers actually access that information. So go
into your institution, talk with a scientist or a group of scientists, and determine what the specific needs are and how they can be picked up in a workflow, with standards and protocols that are more general?

Can I add to that? I would definitely encourage all of you to not just think about the data, but also about how people process the data. The workflow is maybe even more important than the data, because the data you can perhaps recollect if something is lost; but imagine, for example, the case we saw a few weeks ago of those two Harvard professors who did a very famous study on austerity. Their conclusions have been used by our political leaders, but it turned out that they made a small mistake in their Excel analysis: they excluded six data points, and these were six outlying data points that made sure the conclusions were, some would argue, wrong; others would say otherwise. So the importance of the process, the programs, and the code is as great as that of the original data.

Just a small addition: linked to the role of libraries and the institutional archives, I believe there's an important role in guiding researchers towards certain repositories: say they have this type of data, okay, this would be the best fit for your data to deposit in. I think that could be an important role for the library.

And scientists would accept that from us librarians?

Well, it's a valid question. Sometimes they don't, sometimes they will. The institutional archives themselves are not too popular, I would think, in general, but it's something they have to deal with, and if at that stage you can offer them some help, some guidance, I think it's a welcome addition.

Thank you. I have to end here, I'm afraid; let's talk more at lunchtime. Thank you very much.
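As a footnote to the Excel anecdote above: a toy illustration, with made-up growth figures rather than the actual study's data, of how silently excluding a few points can flip a conclusion, which is why the panel's point about archiving code along with data matters.

```python
# Invented growth rates for high-debt countries; the negative values play
# the role of the excluded outlying data points in the anecdote.
growth = [-1.8, -0.9, -0.3, 0.4, 1.1, 1.9, 2.6]

mean_all = sum(growth) / len(growth)
kept = [g for g in growth if g >= 0]          # silently dropping the outliers
mean_trimmed = sum(kept) / len(kept)

print(f"all points:      mean growth {mean_all:+.2f}%")
print(f"points excluded: mean growth {mean_trimmed:+.2f}%")
# Same spreadsheet, different conclusion -- only the shared code reveals
# which numbers went into the average.
```

Without the analysis script archived next to the data, a reviewer sees only the final average and has no way to know the exclusion happened.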