As you can hear, this meeting is being recorded. A couple of things before we start. As you could hear in the intro, we're raising funds to support students in need. We have already helped quite a number of international students and we're hoping to help many more, so please consider our request. But now on to today, and a few house rules, as they say. First of all, please mute your microphones — that would be greatly appreciated. Also, if you can, please switch on your video and show us your face, because it's nice for the speaker to see people looking at them and not just black screens. And please switch off your mobile phone, because we find that mobile phones sometimes interfere with the webinar broadcast. I think that's it for the house rules, and then I'll hand over to today's speaker.

We're very happy and delighted to announce today's speaker: Hannes Datta. He works as an associate professor of marketing in our marketing department. Maybe you know, maybe you don't, but the marketing department he works in is one of the world's leading marketing departments, ranking very highly in the research world. So it's great that he is willing to give this webinar today. Hannes works on research in the area of digital media — things like digital TV or streaming — and today he's going to tell us more about his hub, Tilburg Science Hub. So I'm not going to waste any more time. Hannes, the floor is yours.

Thank you very much, and thanks for the invite. It's really an honor to be speaking to the alumni of Tilburg University — it's actually, I think, the first time I'm speaking to this group. I think you can all see my screen, is that correct? You can confirm in the chat. We'll make this a very interactive session, so I monitor the chat all the time. So today I'll talk about what I learned about... — Could you start the screen share, please? — Of course I can. There you go. — Thank you.

So today I'm going to talk to you about what I learned in terms of managing data- and computation-intensive research projects. My slides are available on our site, so I invite you all to follow along with that deck. You can click on the links if you find certain things interesting and explore more than what I'm telling you about right now.

This is the agenda for today. First, I'm going to introduce myself a little bit, in case you don't know me yet. But I also would like to know who you are and what you're currently working on, so that I can personalize what I'm talking about today and make sure it's really relevant to you. Then I'll spend most of the time on a few key building blocks of efficient workflows — and what "workflow" really means I'm going to get to, so bear with me for a moment. And I'll wrap up with some tips on how I can help you with the implementation of what I'm telling you today.

First, a little bit about my background. I did my PhD at Maastricht University, back seven years ago, in quantitative marketing. I joined Tilburg seven years ago and I'm an associate professor now. As Frederique said, I'm broadly interested in digitization and in subscription-based business models. So I have a bunch of projects on online music streaming — for instance, about how music consumption changes when users adopt Spotify. But I'm also broadly interested in marketing mix modeling and branding, some of those topics that you would generally associate with marketers.
But I must warn you: this session is not about marketing today. It's about something else. Before I really get into the topic, I'd actually like to know what your link to Tilburg is, so I'll launch a poll. It would be awesome if you could let me know your graduation year at Tilburg. So, a couple of quite recent graduates — which actually already brings me to my third question. But first, I'm very curious whether there are any former students of mine in this session. If you are, put your name in the chat; I'd be happy to see some of my former students.

All right, so we have a cohort of very recent graduates. That's cool, because you're probably still in a somewhat junior function compared to other people who may be here, which gives you the ability to use many of the things I will talk about, because you probably still work operationally. So that's actually a very cool thing.

I would also like to get an idea of which programs you graduated from. Please put that in the chat, because I need to know: are you data scientists? Are you marketeers? Are you maybe not even from the business and economics world, like I am? — Oh, Marketing Analytics, Mitch, that's great, so you're from my program. Economics, okay, good. Cool. Oh, Ingo, great that you're here — Ingo is a former student of mine. Cool. So we've got a bunch of marketeers, people from business. Okay, that's a good thing.

You can also put in the chat where you're working right now and why you think today's session will be relevant for you — or why you joined — and that will help me as I move on. So, what is the motivation for this class? I will not monitor the chat for maybe three minutes, but please put your answers there and I'll read them in a bit.

I do quantitative work. That means I work a lot with big data sets. When I did my PhD, I was coding a lot: I prepared large data sets and estimated econometric models. But I never really learned how to structure my work — I just did everything by intuition. And that created complete chaos in my first project. Today I will show you that chaos, and if you have my slides — once again, they're available at tilburgsciencehub.com/slides; I think it's the first thing in the chat, but I'm putting it there again — you can check out how my project directory looked at that point in time, which you can also see on the screen.

This is a paper which got published in one of the most reputable journals in our field, so I did very good work here. But how I got to this work was maybe not the most efficient way. For instance, if you look at my folder structure, I have a folder called "code" that, properly, stores the code for my project. But then there is another subfolder that says "new data preparation". Hmm. So which is the code that I actually ran for this project — the code in "code" or the code in "new data prep"? And then there is also something called "SARA content". SARA was a big computer that I used when I estimated these models. I have different model runs in there and different versions of the same code files, and I versioned them with version numbers, but I don't actually know what those mean anymore.
So I'm actually unable to find the code that I used to estimate my model, and I'm not even able to really get at the data set that I used at the time to run my analysis. That, arguably, is not a very efficient way to run your project. So this is where I'm starting from; that's my motivation. I need to go back to my slides — there we go.

So what was so bad about this? Well, first, I couldn't reproduce the results whenever I wanted to — not even while I was still working on this project. And second, it was completely inefficient: when making changes to my data or the model, I had to go back to the very beginning and repeat all of those steps.

Let me show you real quick — I'll open this in a new tab so I don't lose it. In this data preparation folder I have certain files, right? Files to prep my advertising data, one file that preps my customer data, then some marketing data, like direct marketing data, that I was reading in. But even if I know which files I used, I don't know in what order I executed them. So it would take me a very long time to understand how that was done. And a colleague at Tilburg asked me, maybe two or three years ago, to share this data because she wanted to use it for an ongoing project — which, by the way, recently got published. It took me a very long time to dig out the data set she needed, because nothing was properly documented. And I hadn't learned about any of this; I thought I was doing a good job because I did everything by intuition, but as I know now, I didn't.

So why should you care? You will also change your code continuously while working on a project. And if you're honest with yourselves, many of the things you do may not be automated: you still do things manually, like opening an Excel file, saving it to CSV, and then opening that in your statistical software. That is not good practice. Also, colleagues will probably look at your code too, because maybe they want to help you, or, if you leave the company, they need to continue with your code base. That actually recently happened to somebody I know: she was taking over the job of somebody else, and now she has to understand the code base that person wrote. I can promise you that even the small efficiency gains from what I'm going to talk about today will pay off very soon.

Now I'll spend a few seconds reading about your current functions. Another former student of mine — hi, cool that you're here. I see financial risk, process building... good. So we've got a couple of people, but it's really hard to sense from the chat how advanced the things you already do are, so I'll just continue.

Here's my learning objective: I want to show you a few — and the emphasis is on a few — ways to increase your efficiency when working on data- and computation-intensive projects, and I'm going to define all of those terms on the next few slides.

So what do I mean by efficiency? I have a slide on what is efficient and what is not, in my view. It's efficient if you don't make mistakes — but frankly, everybody makes mistakes — so it's also efficient to catch mistakes early.
In my projects it once happened that I caught something very late, and that cost a lot of time down the road. So you'll probably all agree with me that catching mistakes early is important — or catching mistakes at all. In that paper I showed you before, there is probably still a mistake somewhere in the code, just because I wasn't coding very well at that point in time.

What's also very efficient is if you can bring down the setup cost of restarting a project. As academics, we work on projects, we write a paper and send it out for review, and then we have to wait — depending on the journal that can take about three months, I think that's a good estimate — and then we get the work back and need to address the comments of the review team. What happens many times in between is that you forget how you did your project: in what order files had to be executed, which was the most recent version of the data that you used. The tips that we are sharing in this presentation, and also on tilburgsciencehub.com, help you bring down these costs. They're not zero, but they are much lower than if you didn't do what we advocate here.

Another really cool part of efficiency is being able to prototype the final product and refine it later. Let me give you an example. I'm working on a project right now in which I try to assess what I call platform power versus the power of digital content owners. Think about the music industry — an industry where I do a lot of research. You have Spotify, which is a platform, and you have the content providers, the music labels that put their content on Spotify, like Universal, Warner, Sony — the major labels. In this project I want to estimate who is more powerful on the digital platform: is it Spotify, the owner of the platform — you may think they are, because they own the store — or is it actually the suppliers of the content? And we quantify that power.

That's interesting, but what's even more interesting is to see in which categories of music consumption the power is larger or smaller. The most traditional music categories that you can probably think of are genres: pop, rock, hip-hop, EDM — very established. But on streaming platforms you also see new categories, like workout music, or sleep music, or music to concentrate to while working. You may be familiar with these kinds of categories. So you may wonder: are there differences in power across these categories? That's essentially the setup.

But in my data I don't observe categories, so I use a machine learning algorithm to learn these categories from the data. I just haven't had much time yet to invest a lot of work in this algorithm. So what I did is build a prototype algorithm — it's simple k-means clustering, off-the-shelf stuff. What I get out is okay, but I want to work on it more later. However, I have a co-author who also wants to work on the project, and to make progress, my co-author already needs some kind of classification in there. So what we did is build a very simplistic clustering algorithm, but we built the entire pipeline: the whole data preparation is done, the clustering algorithm runs in between, and its output can be passed to the next stage, so my co-author can work with this data. That's very efficient, because my co-author can already make progress on the project while I'm still refining and fine-tuning the algorithm. Will it change the results much? I don't think so — if I get better categories it won't change the results dramatically — but at least we can work in parallel that way. So what I mean to say is: it's efficient if you can build a pipeline that builds your entire project already, and then refine the individual steps to make them better.
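To make that concrete: below is a minimal sketch of what such a prototype stage could look like. The file names, column names and the choice of off-the-shelf k-means via scikit-learn are my illustrative assumptions here, not the actual project code; the point is only that the prototype writes its output in the agreed format, so downstream stages (and the co-author) never need to know how the categories were produced.

    # categorize.py -- prototype stage: quick-and-dirty k-means, to be replaced by a
    # better algorithm later without changing the output contract (illustrative sketch)
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # hypothetical input produced by the data-preparation stage
    features = pd.read_csv("gen/data-preparation/output/listening_features.csv")
    X = StandardScaler().fit_transform(
        features[["share_workout", "share_sleep", "share_focus"]]
    )

    # off-the-shelf prototype; swap these lines for the refined algorithm later
    features["category"] = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(X)

    # output contract: same file name, plus a 'category' column -- downstream stages
    # depend only on this file, not on how the categories were obtained
    features.to_csv("gen/data-preparation/output/categories.csv", index=False)

As long as the refined algorithm later writes the same categories.csv, nothing downstream has to change.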
Rotating tasks is also extremely efficient. For me that's switching between projects: if I get brain-dead on one project, I can start working on another, because I have zero setup costs to switch. That's very good. Sharing and reusing code is also a good idea. If your code is written in a way that other people can look at it and actually use it, that's great, because they may find bugs and tell you about them.

And that's maybe the last point: have others audit your code. I started to release my code on GitHub — either code snippets or entire projects — and sometimes I get comments from people saying: look, you overlooked something, I think this should be the right way. One example from my work: I use an algorithm to classify music label names into their parent labels, so I can learn whether a particular label is part of Sony, Warner or Universal — I need that for my research project. I simply put that algorithm online. And it turns out a data analytics firm from Silicon Valley started using my algorithm to benchmark whether their own data is correct. But they also shared their own list with me, and now we can both improve the algorithm — just because we started being a little more open in sharing algorithms and data. That's a win-win: it's a win for them because they can better identify these labels, and it's a great win for me because that's what my research project is about.

Okay, I'm spending a lot of time on these slides, but I think it's important to give you a few examples. What's not efficient? Waiting. I hate waiting; I'm very impatient — and I see some colleagues here in this stream who will probably agree that I'm very impatient. So waiting for results or waiting for estimations is bad, and the bad thing about waiting is that I personally get distracted. When I wait more than ten seconds for something to show up on my screen, I start looking at my smartphone, I start looking at my email, and then I lose my focus.

What's also inefficient is forgetting how things are or were done — historically in a project, or in the current project. Losing data — I think you'll all agree that's very inefficient. Using code which isn't properly documented, so you don't really know how to use it or what comes out of it. And when you feel you've lost the overview of a project — that's also not a good thing.

All of this is important in any kind of project, I think, but it gets worse in data- and computation-intensive projects. When I talk about data-intensive projects, I mean data that meets at least one of the four V's that define big data. So, first, volume: it's large data.
For example, you may have to prototype your project on a small data set, because building your entire project on the large data set would simply take too long. Then there's the variety of data that you may use: I've talked to recent graduates of mine who joined companies, and they're just blown away by the number of data sources they have to integrate. And sometimes, in these companies, the way it happens is that people manually take an extract here, manually do this, manually do that — and that takes a lot of time and is potentially error-prone.

Many companies are also at the stage of investing a lot in data cleaning. Some of the companies I talk to are integrating different data sources from different departments and then making sure these data pass the audit checks. And when a particular team has already invested a lot in cleaning up some data source, it is also very efficient to share that with others in the organization.

And the last thing: sometimes you're building projects — predicting customer churn, for example, or generating an annual report on something, whatever it may be, you can fill in the blanks yourself — that you will redo over and over again. It's very inefficient to do that manually every time, and I know many people who do, and it takes them a lot of time every year. Churn prediction is a good example: people think, let's do that once a year, but you could in essence do it every week to update your numbers. The workflows I will present in a bit enable you to do that. And computation-intensive just means it takes a lot of time to compute.

So, I want to know: in your work right now, what do you feel has been very efficient, and what has been very inefficient? Do you recognize a few of the things I pointed at? You may also be way more advanced than I am — that's very well possible; maybe you've discovered the holy grail and need to teach me. But what's the current state? Give me a couple of examples of inefficiencies you recognize, or efficiencies you've already created in your projects. And does the description fit — are you working on big data projects? You can also unmute your mic — we're not with that many — and tell me about it; that's also fine.

Yes — so Frederique is saying she's being delivered data over time that does not carry the same attributes. Yes, that's very inefficient; probably somebody on the other side is also doing things manually. And yes, version control helps — that's a good point; I have a couple of things on version control in this deck too. And structuring data, yes. So — I don't know your first name, I just see "Melchosh" — what firm are you working at, or where are you at? I want to place you; it would help me if you could give me some background here.

So, my learning objective: I want to share a couple of tips, and the way I want to achieve that is by familiarizing you with — let's call it — the Tilburg Science Hub way of setting up projects, and by telling you a little bit about Tilburg Science Hub itself. My school, the Tilburg School of Economics and Management, seeks to professionalize the way research is done and to generate efficiencies there.
Together with Professor Tobias Klein — I think he's also here in the stream — I received funding to bring tilburgsciencehub.com live, as a kind of manual for students, researchers and entire teams of data scientists to increase their efficiency in collaborating with each other. I have used the site in my teaching this year, and that's been a very good experience. But the site is out there for everybody to use, and currently we're at the stage where we've built a couple of things, we'd like to receive feedback, and we want to make it relevant for other people to use. It works very well for our teams of students and the researchers we're working with, but we see there is more value to it. That's why we're giving this seminar, and we hope to receive some feedback from your side — but of course we also hope you can use it. And I think there's a batch of a hundred students from the Marketing Analytics program coming who know this stuff, so that's also a cool thing.

I have a disclaimer before I show you a couple of things. This is a very short class — I took about half an hour just to set up my introduction here — so I can only give you a rough overview. Tilburg Science Hub seeks to provide the deep dive, but we're still working on the site, so a few things may still be rough here and there. If you have feedback, we're happy to hear it.

I want you to think of Tilburg Science Hub more as something like Scrum than as a software product or a platform. For those who don't know Scrum: it's a way in which IT teams manage their projects, which I really like, by the way. In discussions about Tilburg Science Hub I often get the question: do you need to log in? Is this a platform? Is this a software product? No — it's just a collection of templates and workflows that we're sharing, which have proven to be very efficient in our work. And probably they will be efficient in your work too.

A couple of cool examples of inefficiencies coming in via the chat: typing manual queries, and manual conversions to other programs. I totally agree with those, and we think we solve them.

One more thing I want to tell you, because most of you work in businesses and are not researchers like I am: I don't use the tools that are popular in business. Power BI may do certain things that we do, Tableau may do certain things that we do, but we need a more flexible tool stack than these commercial software packages — for instance, to be able to move our computations to the cloud, where we don't have a Power BI license or a Tableau license or anything like that. So we use mostly open source software, because that allows us to port something we prototype on our machine to a supercomputer and run it in the same configuration.

And then, I'm not a computer scientist — I'm not even a data scientist. I'm a marketer who learned these things from co-authors; I did a lot of reading myself and developed many things myself. If you talk to somebody who studied computer science and works in a software firm, they're probably going to say: yeah, we do this all the time. But the point is, we in business and economics research don't do it like this yet — at least, some people do, but not a broad group.
And at least when I talk to my recent graduates, nobody in the firms they join does it that way. So maybe there's something to gain here — but I don't claim I've discovered the Holy Grail; I think other people have done that. Some things may seem very simple to some of you and very advanced to others, depending on your level. We prototyped these things in the configuration I'm displaying on the screen; it also works for smaller projects, and it will work, at least to a degree, for larger projects. But we didn't build pipelines that could replace the infrastructure of a Netflix or something. No — we are researchers; our experience comes from doing this for research, and that is where this has proven very stable. So I don't know where you are at and what the scale is of the things you're building, but maybe you need to be a little cautious there too.

I hope this is not a traditional lecture, so please keep using the chat — I'd like to make this an interactive live stream. The agenda for my last half hour: I want to share a couple of these workflows with you, I will show you one of the templates that we have on tilburgsciencehub.com, and my wish is that you actually start using our site.

This is how our site looks, and you reach it at tilburgsciencehub.com. The site is going to be overhauled and will probably change a bit, but the major sections will stay roughly the same. The first section explains all of our principles for setting up workflows that are efficient and reproducible — I'll give you a few highlights of this in a bit. The second section explains common setup procedures for software programs. The important thing is that the software is set up in a way that it can be called automatically from scripts on your system. I'll do this right now: if I open a terminal on my Mac and type "R", R actually opens. If you do that and you haven't followed the instructions we give, it doesn't work. The cool thing is that if you set up your computer that way, you can, let's say, remote-control your software programs — and that is at the core of automation, which you will learn all about in just a bit. We also document our software stack there, so that's that.

Then we have developed project templates and a tutorial to teach you these workflows. The project templates are templates you can simply download, put all your files into, and adjust a little bit; then you should be good to go, because they adhere to the workflow principles that we explain on the site. For those who need a little more guidance, we recorded a tutorial — I will show you a few steps of it as well — and for each of those steps we have YouTube videos that explain them. Video production from my attic; that's kind of how I used the corona time. And we have a bunch of tips and manuals on how to use particular software tools.

I'll quickly spend a few seconds reading the chat, because a few things are happening and I don't want to miss out on them. Okay — I hope I can comment on this later. Let me go back to my deck.
Let me zoom in on what these workflows are that I keep talking about. Maybe this is a little dry, but I hope you see the value of it. What we are building in our research projects is pipelines, and I want to explain that concept. A pipeline refers to the steps necessary to build a project, and I sometimes use the word "workflow" interchangeably. For example: the steps necessary to prepare a data set — you've seen that in one of my old projects before — to run a model, or to produce result tables and figures.

For me as a researcher, a prototypical pipeline looks like this. Stage one is preparing my data set for analysis: I download the raw data, I clean it, I aggregate it. Stage two — maybe somebody else works on this stage in my project — is running a model on this data set, trying various variable configurations and selecting the best-fitting model. And stage three, for me, is producing the tables and figures for the paper.

That may look very different for you. I've seen we have some people in financial modeling here, so you probably have stage one prepping the data set, maybe an auditing stage to check whether your data is actually accurate, and probably a prototype model as well — but then stage three may be something like deployment in live trading: your model gets packaged, you save all parameter estimates, and it's shipped to other units of the firm that actually use it for decision-making. So the pipeline is a very general concept: whatever you're building, break it into steps, and that's your pipeline. And the thing is: you want to automate it.

It's also a good idea to actually draw out the pipeline of a project. For instance, here is a prototypical pipeline for an academic paper. Maybe you have raw data sources in Excel. You have scripts that convert those to CSV files, because then you can load them nicely into R, and another rule binds them together — stacking them, if you like. Then you have your final data set. Then you have two more rules: one of those rules, or scripts, produces your regression results, your model results, and another rule produces some graphs. And the last step of the pipeline may be to wrap those tables and figures together into a PDF. Again, I'm an academic, so I'm building papers; you may build very different things, but this concept probably applies to you too.

The white boxes are your inputs and outputs — and they can be both: the output of one stage can be the input of the next stage. And then there are the transformations — what is actually being done. You can think of those as script files, and those script files can also be reused in different projects. Maybe I'm oversharing a bit, but I store my code on GitHub, which is great because I have a common platform where I can search all the code I've written in my career. I frequently go back to previous things I've written, I can search for them very efficiently and plug them into a new project. So many of these rules that I've written I can just reuse over time. Those are some of the efficiency gains that I get.
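For instance, the "convert the raw Excel files to CSV and stack them" rule from that paper pipeline could be one small, reusable script. A minimal sketch — the paths and the one-file-per-export setup are made up for illustration, and it assumes pandas (plus an Excel reader such as openpyxl) is installed:

    # excel_to_csv.py -- one pipeline rule: read raw Excel files, stack them, write one CSV
    import glob
    import pandas as pd

    frames = []
    for path in sorted(glob.glob("data/raw/*.xlsx")):   # hypothetical raw-data location
        df = pd.read_excel(path)                        # e.g., one Excel export per period
        df["source_file"] = path                        # keep track of where rows came from
        frames.append(df)

    # "bind them together": stack everything into the data set for the next stage
    pd.concat(frames, ignore_index=True).to_csv(
        "gen/data-preparation/output/dataset.csv", index=False
    )

Because the script only reads from data/raw/ and writes a single output file, it is easy to drop into the next project that starts from similar Excel exports.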
So what are the benefits of thinking about your work as pipelines? You write clearer source code, because you can chunk your project into smaller parts. Back in the days of my first project, I had one giant script — or maybe ten giant scripts — but you can break it down into very small units, which makes debugging very fast, and also rebuilding your project very fast.

The beauty of this is that the automation tools we're using can automatically recognize which parts of your pipeline have already been built and which haven't. Suppose the first part of your pipeline takes two days to prep — that's not unrealistic in big data applications. In one project I've been working on, we work with a data set of just 5,000 customers, observed over a period of two years, but we have second-by-second information on which songs these consumers listen to. And we're trying to compare the music consumption of each of our consumers with that of all the other consumers, so that's 5,000 squared comparisons. You could say 5,000 is really not a large number, but 5,000 squared is a lot of computations, and that takes many hours in our application. So it's very inefficient if, just to reproduce some regression output, you have to run your whole workflow from beginning to end. The tools we're using recognize which code has changed and which inputs have changed, and then only build the part of the project that needs to be built. That's this point on the slide: obtain results faster; redo only the things that changed.

It also increases transparency and collaboration: you can work on different parts of the pipeline at the same time. And you can use a fully flexible software stack, because for the tool that binds everything together it doesn't matter whether something is written in Python or anything else — as long as it can be run from the command line, you can use it. So you can use whatever is best at its job. For me as a researcher that's a very nice thing, and probably for you too.

So for me the stages are data prep, model estimation, writing a paper. Give me a couple of thoughts on how you could break down your own projects — put something in the chat. "Campaigns for donations" — okay, so what are the pipeline steps? That's the whole project; how could you break it down? — Frederique explains: what they try to do is take the big pool of people they want to approach and, based on a couple of attributes they identify — which might be different every campaign — create different segments, where every segment gets a different action. — Yes, so the segmentation, I think, is one pipeline step, because the input changes every few months with these donation campaigns, but the algorithm that does your segmentation is probably the same, right? And maybe there's also a data cleaning step involved: you're getting raw data, maybe from the firm that runs the telephone campaigns to the alumni — I think you're doing that, right? So it's an overseeable project in a way, but it can certainly be thought of as breaking down into a pipeline.
"And the results can change all the time" — maybe you can give me a bit more insight there; what do you mean by that? — "Oh yeah, sure — I was just saying that the results of the donation campaigns can change all the time." — Yes, but this general segmentation step — of course you can improve it every time — is a building block in your project, which you can make explicitly one module in your pipeline.

Then you also need to think about the components of the pipeline. These refer to a project's building blocks, consisting of source code, data, generated files and notes. You have probably all written source code before, but what really is source code, what does it do? It takes some inputs — either data or arguments — applies some transformation that you define in the code, and then outputs something: data sets, log files, images. That's all source code is about.

One thing I want to highlight here: data as input is clear, right? Frederique, with her donation campaigns, gets a new data set every — I don't know — six months or so that she wants to put through a segmentation stage. But what about those arguments? The researcher in me is speaking here again: in research you sometimes want to do robustness checks. That means: do your results hold if you exclude certain observations, if you increase your sample size, if you use a different model — things like that. It's very inefficient to build copies of your analysis script just for different data sets. So I sometimes use arguments to specify on what data things should be run, but also what part of the algorithm should be run. All of my important algorithms are in one script, and I pass an argument to that script telling it which part to run. That way you can very easily check certain things in your data, just by setting these flags.

But the general structure stays the same: inputs, transformation, outputs. Examples are: load data from the web and save it locally as a CSV. Or clustering with an argument — by the way, in k-means clustering, I don't know whether you know this, you have to pre-specify the number of clusters you want. So I can call my script like "python cluster.py 10" and then I get ten clusters, and then I can loop over the script to get, say, solutions with five to fifteen clusters, estimate my model, see where the fit is best, and pick that one. You don't have to build five copies of the script for that — you can just pass an argument. I hope that answers your question.
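Here is a minimal sketch of that "python cluster.py 10" idea — the file names and columns are made up, and it assumes pandas and scikit-learn are available; the real script would of course cluster the actual prepared data:

    # cluster.py -- pass the number of clusters as a command-line argument
    import argparse
    import pandas as pd
    from sklearn.cluster import KMeans

    parser = argparse.ArgumentParser(description="k-means with a configurable k")
    parser.add_argument("k", type=int, help="number of clusters to estimate")
    args = parser.parse_args()

    df = pd.read_csv("gen/data-preparation/output/features.csv")   # hypothetical input
    model = KMeans(n_clusters=args.k, n_init=10, random_state=1234)
    df["cluster"] = model.fit_predict(df[["feature1", "feature2"]])
    df.to_csv(f"gen/analysis/temp/clusters_{args.k}.csv", index=False)
    print(f"k = {args.k}, within-cluster sum of squares = {model.inertia_:.1f}")

You can then loop over it from the shell — for example "for k in 5 6 7 8 9 10 11 12 13 14 15; do python cluster.py $k; done" — and compare the fit across solutions, instead of keeping eleven copies of the same script.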
And what about raw data? We store that remotely, and we use a variety of systems, which maybe mirrors the world you're in at your firms, where data sits on different servers or systems. Data can be stored in file-based systems like Amazon S3 or FTP servers, maybe it just sits on Dropbox or Drive, or you may use databases like MongoDB or MySQL. On tilburgsciencehub.com we have a little code snippet — I think we call it the "get data" tool; I may still have to post a link, I need to check — that allows you to flexibly integrate various kinds of data sources into your project. The key really is that the scripts pull the data locally and then run. It's not that they live somewhere in a cloud and do things there; our scripts are written so that everything is built locally on your computer — or you may rent a computer in the cloud and put our workflows on it, that's fine, but everything happens on that computer.

Something that may also be of interest to you is our README template. You can click on it in the slides — and you can browse the slides yourself too. We have a few templates that you can use to describe the data you generate in your project, and I'm very enthusiastic about them, because even just two months after you generated files you forget things. For example, about a year ago I generated a data set for a PhD student of mine. She's working on that data now, and at the time I didn't know about this cool README yet — I only put it up recently. So now the question is: how was this data prepped again? The questions in this template are very good, and they help you write a very good explanation of what the data is. Filling it in may take some time, but it really forces you to document all the important things well.

A few sections I want to highlight, which I think are nice: very good details about the collection process — sometimes you're collecting data and there are hiccups in the collection, or you used a specific seed when you started collecting, so you did some sampling before you actually started to collect the data. It's important to tell future users of your data set about that. And by the way, that future user may be you; you don't only document it for somebody else — maybe you just do it for your future self. Another good section is: was there any pre-cleaning of the data done in any way? And: what do you think the data should, or should not, be used for? That's important for future users — maybe there are certain conditions that make the data set unusable for others. I've run student projects with this, where students created their own data sets and documented them, and I find it extremely useful. I do it with my own data sets too.

Then you have generated files, which are the output files of your code. I want you to distinguish between a few kinds. Final files of your scripts — the final product, like a final data set or a final table — we put in an output folder. If they're temporary, we put them in temporary folders. And if they're useful for auditing — checking whether your code did things correctly — we put them in an audit folder. And lastly, notes: we actually haven't found a good system for those, so if you have a better one, let me know; we just keep a Dropbox or Drive or Teams folder, whatever, and have our documentation there.

Code is the most essential building block of our projects, and this is a nice framework we've drawn up which puts these pipelines and components together. I know it's theoretical, but I'm going to show some code in a bit, so I hope I can get you excited about that too. So, this is your pipeline.
These are the building blocks of your project: prep the data, get the final data set, prototype your model, estimate the final model, and then produce something — in my case, a paper. And they cross with the project components: raw data, source code, generated files. Each of those stages in your project is independent of the others. So, for instance, you can do some prototyping on your laptop, push your code base to GitHub, and then a computer in the cloud pulls the most recent version of your GitHub repository — say, every night or every two nights — and estimates, on a big cluster computer and on the full data, using the most up-to-date algorithm that you prototyped on a smaller data set on your laptop.

I use that for my research. I'm in a phase where I have to do a revision for an important journal; we're still working on the data, but we're also working on the analysis, and I'm running nightly builds of my project. Whatever I did during the day, the cluster pulls that code at night and rebuilds the entire project. In the morning, the only thing I need to do is check my results, because they're there — and check the logs if there were any problems building the project. So in essence, it's not only that you can work in a team like this; you can also work like this just by yourself, because in today's world we work on different computing systems — I have three or four computers open at the same time, virtually.

This brings me to the next important thing: how should you structure your directories? There are many ways to do this; we think we've found one that works very well for us, but you're free to modify it. There are some guiding principles. Others should be able to understand your pipeline merely by looking at the file and directory structure. Each stage of the pipeline should be self-contained and portable, because we want to be able to move one stage of our pipeline to a cluster computer, and the project should just build itself on that cluster computer — we don't want to spend time configuring it; that should happen automatically. And projects should be versioned and backed up, using Git and GitHub.

So this is the directory structure we advocate, which is inspired by the way software engineers structure their software projects. We have a source code folder, and within the source code folder a subfolder per pipeline stage — for instance, our pipeline stages data preparation, analysis, paper. Then we have a folder with generated files; these are the output files of our source code. We never mix source code with output files, because then you end up with the mess I showed you at the beginning of this lecture. And then we have, say, a mirror of some server structure with our raw data. It's important that you bucket the generated files of each stage into four categories: input files — anything your source code needs to build things; temporary files — things that are built in between but are not really important; output files — the final products; and audit files, which allow you to assess whether your code ran correctly. By the way, it is especially important to separate things that are unimportant from things that are important — temporary files are clearly unimportant, but in the project structure I showed you at the beginning I was unable to see what was important and what was not; think about long-term projects here. And we do this for every pipeline stage.
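Roughly — using my own pipeline stages as placeholder names; the downloadable template on the site is the authoritative version — the kind of layout we advocate looks like this:

    my_project/
      data/                      raw data, mirroring where it lives on the server
      src/
        data-preparation/        source code, one subfolder per pipeline stage
        analysis/
        paper/
      gen/                       generated files, one subfolder per pipeline stage
        data-preparation/
          input/  temp/  output/  audit/
        analysis/
          input/  temp/  output/  audit/
        paper/
          input/  temp/  output/  audit/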
You know, I told you at the beginning that this may sound very simple, but once you actually start working like this, it helps you tremendously. I know we only have about eight minutes left, so I'm not going to go through all of these examples, but if you click on these links you can see the directory structure of a more recent project I've been working on, where we applied these ideas — I actually learned a lot about them from colleagues I was working with at the time. Looking back, that directory structure looks way cleaner, it's much easier to share, and people will be able to replicate it much more easily. You look at those folders and they're very clean — no duplicate files and all of that — so I think that alone should convince you to have a look. And if you want to kickstart your project, there's a link in our slides where you can just download a zip file, unzip it, and you have all the directories that we think are important in a project.

Then probably the most important thing today: automation. I'm going to show you what that means with an example, but first a definition. We use build tools that software engineers use; they incorporate recipes for how to make things. They work as follows: they have a target — what needs to be built; they have a bunch of source files — what is required to make that build; and last, they define how the build should actually be executed. That sounds very theoretical; I will show you how it works, and then you can look at tilburgsciencehub.com to actually teach yourself these skills. What you can do with this: you build only what needs to be built, you can quickly move things to a new computer, you can loop through different model specifications, and you can run nightly builds.

I need to watch the chat, because I got a couple of questions from Hans. Hans is asking whether I keep an archive of my old work. Oh yes, I do. We have systems at Tilburg — one is called Dataverse — where we can store data and code, and I certainly have full backups of these things for archival purposes; also, journals increasingly offer storage capacity for hosting replication files. And the other question: can you reproduce results from six or seven months ago? Yes — I can go back in my code base, essentially automatically, click "run", and that replicates my results. And for older projects we store the full history of these things. So yes, that's fully possible.

I will not walk you through all the details; I'll just tell you we're using a tool called make. It's one build tool; there are many other build tools available. The sweet thing is that it's already installed on Mac and Linux systems, and you can easily install it for free on Windows.
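To give you a feel for what such a recipe looks like: here is a minimal sketch of a Makefile for the three-stage paper pipeline from earlier. All file names are made up for illustration, and note that in a real Makefile the recipe lines must be indented with a tab. Each rule states a target, its prerequisites, and the command that builds it; make reruns a rule only when one of its prerequisites is newer than the target.

    # Makefile -- minimal sketch of a three-stage pipeline (illustrative file names)
    all: gen/paper/output/tables.pdf

    # Stage 1: data preparation -- reruns only if the raw data or the prep script changes
    gen/data-preparation/output/dataset.csv: src/data-preparation/prepare.py data/raw/sales.xlsx
        python src/data-preparation/prepare.py

    # Stage 2: analysis -- depends on the prepared data set
    gen/analysis/output/results.csv: src/analysis/estimate.R gen/data-preparation/output/dataset.csv
        Rscript src/analysis/estimate.R

    # Stage 3: paper -- wrap tables and figures into a PDF
    gen/paper/output/tables.pdf: src/paper/tables.R gen/analysis/output/results.csv
        Rscript src/paper/tables.R

Typing "make" builds everything the first time; edit only the analysis script and "make" will skip stage 1 and rebuild just stages 2 and 3.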
I will show you this in practice in the remaining five minutes, and I want to encourage you to try the things I'm doing afterwards — it turns out what I'm showing you are the first few steps of the tutorial, and we have recorded the tutorial, so you can follow it at your own pace after this session.

So I'm going to switch to tilburgsciencehub.com now, to the project templates and tutorial. This is a tutorial I actually built for my class on digital and social media analytics. I wanted to familiarize students with these workflows, but I also wanted to familiarize them with text analysis — that's one part of the course I'm teaching. In essence, this template is a reproducible GitHub repository for text mining. We share the repository, with the folder structure you would now expect, on this GitHub page, and you can find it easily by following these links. It's an example of a reproducible research workflow that does a whole bunch of things: it chains Python with R, and it applies concepts like downloading data from the internet, unzipping data automatically, parsing JSON data into CSV files, loading those CSV files and enriching them with text mining metrics, and then analyzing the result and producing tables and figures. It does the whole thing — but it's a prototype, not a final paper — and you can clone it onto your own system; that's what our tutorial is about.

Maybe this gets a little geeky now, because I'm using the command line interface — let me see whether I can get my terminal up here... where is my terminal? — but I just want to get you excited about it. What I'm doing is cloning this workflow to my system, which takes a copy of the entire workflow to my computer; it parallels the directory structure you've seen before. Now I can cd into the directory — access this directory — and by only typing "make", the whole project is executed. I want you to watch it: here is the status message of what's being done. Now it's downloading the data from a server that I pre-specified. That takes some time, but not too long — I tested it before. And here you see the different software programs communicating with each other: this is Python downloading. Now it has downloaded the raw data into this data folder. And — oh, it's generating a bunch of things — it's already in the last pipeline stage, where it executes some R code to produce tables and figures. So now it has built the entire set of generated files that I talked about: the analysis stage with output and temp files, the data preparation stage with some temp files — and it did everything automatically. If you go to gen/analysis/output, you also get the final results of the analysis — let me put it on the right screen, here. It's a rudimentary analysis, I admit, but it does something in R — some histograms and such.

Now you can start modifying this template to your needs by editing the source code. Let me go to the source code and do just one thing here, because I'm running out of time. This is the file which parses my data — it's technical; it transforms a JSON file, unstructured data, into structured data. I won't say much about it in this session, but I built in a prototype condition here, because I want this workflow to run very fast. That, by the way, is a very good practice: use prototyping conditions that make your project a little smaller while you're working on it, so you can iterate faster when editing code. Now suppose I have finished prototyping my project. I can just comment out this prototype condition, go back to my text mining workflow, and type "make" again. Make recognizes that I've just edited this code file, and it does not re-download the raw data, because nothing changed about the raw data — it only re-parses the entire data set. Say the entire data collected in this prototypical project was about 14,000 tweets about something. You can now look at the analysis and check the numbers — say, the mean was 2251 — and if you refresh, you see the numbers have changed: it's now computed on the full data. So I went from a prototype to, let's say, my production run, with everything updated, in just seconds — and you can bring that to scale with many, many steps.
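That prototype condition is nothing fancy — here is a minimal sketch of the idea; the flag name, sample size and file paths are made up, so the actual tutorial file may look different:

    # near the top of the parsing script
    PROTOTYPE = True                     # set to False for the full production run

    import pandas as pd

    # newline-delimited JSON of raw tweets, as downloaded by the previous stage
    tweets = pd.read_json("gen/data-preparation/input/tweets.json", lines=True)

    if PROTOTYPE:
        # keep only a small sample so the whole pipeline builds in seconds while developing
        tweets = tweets.head(500)

    tweets.to_csv("gen/data-preparation/output/parsed.csv", index=False)

Flip the flag (or pass it as a command-line argument, as shown earlier), type "make" again, and only this stage and everything downstream of it is rebuilt on the full data.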
We teach you all of this in the tutorial, so I really encourage everybody to follow it. Also, think about how you could adapt this template into one that works for you, in the particular industry setting that you're in — because you're probably not doing text mining on, I don't know, data about Fortnite, which is a game popular among people younger than we are. Take it as a framework for thinking about your own projects; we have a bunch of other templates coming to this page soon. I meant to tell you about versioning as well, but that's on the page anyway.

Let me wrap up now with a little summary — these are the takeaways. Manage your project as a pipeline. Have logical pipeline stages: that allows multiple people to work on the same project at the same time, and it allows you to port one stage of your pipeline to a different computer and have it run there while you're still prototyping another stage. Store data, code and generated files separately for each of those stages. Automate your entire pipeline — I've only spent five minutes on make where I maybe should have spent fifty, because it's such a powerful tool, so I hope you take my word for it: it will boost your productivity. And use our templates to kickstart your projects, or make your own templates — you can actually share those with us later on.

How to get started? Don't be over-ambitious; start gradually. If you are working in a joint Dropbox folder with no good directory structure, maybe just unzip our directory structure template right now and start from that — that alone may already help. Then use our project templates to learn from and to build your skills. Again, more tutorials are coming; there's one tutorial that is fully recorded. And last, there's a housekeeping checklist.
That is actually a checklist I developed together with some students and colleagues. There are a bunch of things you can do to make yourself efficient, and you can use this checklist to go through your folder structures and see to what extent you have implemented the guidelines here.

Let me go back to my deck to wrap up. This is an open science project, so you can contribute to our site — if you click this link, or just browse our page, you'll find out how. We appreciate any contributions, and we appreciate it if you share the site with others. We built it predominantly for our own teams, because we found ourselves explaining these things over and over again to every new generation of students we were teaching or working with; now we have this, and we're happy to throw it out there for other people to use. So I hope you find it useful — and given that not many people defected from our stream, I think you did.

Suja asked whether prototyping is the same as subsetting a sample data set. Yeah — the way I've shown it, yes, that was prototyping. But you could also think about trying different algorithms; that's also prototyping. There are many ways to prototype your work, and you can use make for all of them. That's it. I'm sticking around for some questions; I can understand if some people are leaving — I have a reputation for going over time, sorry for that.

— That was great. As a chaotic person, I really love the structure that you're giving us. For all of you who are listening: this is recorded, so it will be in the knowledge section of the online university magazine; we will share it there. And hopefully more people are going to make use of your great project — I really like it. Any questions? Let's hang around a bit; people can open up their mics if they want to talk to us.

— It's not my initiative alone — we're building this with many people. And it's also not something we invented from scratch; we learned from many people. There were some PDFs with some of these thoughts, but there was never an interactive website like this — that's what we built, that's where we take credit — but it's built on the shoulders of many other people, so let's be humble here. By the way, it's good to see that you're learning things, and good to see your reactions in the chat. And if you have any comments you want to address to me: these are my contact details; you'll find a contact form there, or just my email address, whatever works.

So, there's a question: am I also making use of MLOps tooling to try and reproduce my experiments? Actually, I have to say I don't know what that is — which is good, because I can learn from you. What is that? — "There are recently appearing tools like MLflow, where for every experiment run that you do, you just record all your results — accuracy, but also your hyperparameters, which data set you used — just to keep track of all the things you've been doing, so you can get back to them. So I was wondering..." — Sweet, I love it. I don't use it, and I had actually sketched a tool like this in my head that I wanted to build, because I hadn't spent time checking whether it already exists.
So it's awesome that you put this in the chat. No, I don't use this tool yet, but yes, I want to use it — that's amazing. Any other tips? I want to learn from you guys too.

Let me see. So the question is: in my area, do reviewers actually try to reproduce your results? In my experience, the journal editors do — or let's say the managing staff at the journal: when I submit my replication files, they run them and check whether the numbers match up. As a reviewer I do things like that at times: sometimes, when I see things that don't make sense, I write a little R script to run a small simulation, and I share the code with the authors to point out some mistakes. So, short question, short answer: journal editors do, and in the review process it's becoming more common.

And, have I used Git LFS for versioning large files? I don't think large files of the order of magnitude that I work with should be versioned at all. I generate temp files on the order of, I don't know, terabytes, and you need those files to run your workflows, but it is extremely costly in terms of data storage to keep copies of each and every temp file built along the way. I have to say I'm strictly opposed to versioning these temporary files and output files: all those generated files should conceptually just not be versioned, because if you did a good job, your code plus your raw data reproduces each and every version of those files. Do I use Git LFS in general? Yes, I do. In smaller projects I version — let's say I store — my raw data that way. But for my bigger empirical research projects I always have external servers where the data lives. I hope that's a good answer to your question, Tim; if not, let me know. Also, if you don't keep an eye on it, you can suddenly hit a paying threshold with Git LFS — you can then configure your own server — but, okay: conceptually, generated files should not be versioned, in my humble opinion. Good. Okay.

— Good. Don't forget that next week we have Professor Kuno Hausmann on decision making under uncertainty. For now, I really thank you very much for joining us and giving us these insights on Tilburg Science Hub. Thank you.

— Yes. All right. Thank you guys, thanks for joining.