So our next speaker is Shagun Sodhani, who will be talking about how to log ML experiments. Over to you, Shagun. Hi everyone, thank you for coming to this talk on logging machine learning experiments. I am Shagun, a research engineer at Facebook AI, and as a research engineer I have to run a lot of experiments. This talk is a condensed version of my experience and learnings with logging ML experiments. We will cover several things: why we should log experiments, who the potential users of our logs are, what we should be logging, and how we should be logging. I'll spend some time on the logging practices that I use, on where we should be logging things, and on when we should start worrying about logging. And I'll leave some time at the end for questions and answers. The key message of this talk is that logging ML experiments is a holistic exercise. It is not limited to tracking just the performance metrics like accuracy or loss; it goes beyond that. We can think of experimentation as a way to answer questions or to test ideas. For example, a question could be: how do I improve the performance of my NLP model? The different experiments I run, or the different ideas I try out, are a way to answer this question. From this perspective, logging is a way to capture the lifecycle of these ideas, or the lifecycle of the different hypotheses we are testing. So let's say we try out 10 different things to improve our model. Logging is the process of tracking these efforts and improving on them. There are several other essential ideas and items when it comes to applied ML, for example MLOps, managing the lifecycle of our model, feature engineering, model design, and so on. These are all important areas to be aware of, but for the purpose of this talk, we will not be focusing on them.
Just to reiterate the disclaimers once again: I mentioned that the talk is a condensed version of my experience with logging ML experiments, so the presentation is also going to reflect my experiences. I prefer certain workflows over other workflows, but that by no means means that my workflow is the best possible workflow. There's no reason to believe that my workflow would work best for everyone. So please pick up and use whatever makes sense to you and discard the rest. With that disclaimer, let's jump into the talk. The first item is: why should we log ML experiments? The short answer is that we should log because logs are an interface to the experiment. If we look at the standard ML workflow, we run an experiment, the experiment finishes, and then we have questions about it. For example, we may want to know how good the trained model is, how high the validation accuracy is, or how much time the model takes to train, and so on. To answer any of these questions, we need logs. And apart from questions about performance alone, we may have questions like: why did we run this experiment? Or even: did we ever run this experiment? Some of these questions on the slide may sound a bit unlikely. I mean, who forgets whether they have run an experiment or not? But it does happen, and it happens way more often than you would expect. The reasoning goes like this. When we are doing ML experiments, we have some idea, hypothesis, or question in mind. This hypothesis could be as simple as: does decreasing the learning rate help with the convergence of the model? So we run an experiment, we observe the results, and based on the results we come up with a new hypothesis. Maybe we need an even lower learning rate, or probably we need a wider network or a deeper network. So essentially what we are doing is iterating over these hypotheses.
And when we iterate over the hypotheses, two things happen. One is that our original hypothesis has changed. Maybe the new hypothesis is that lowering the learning rate with a wider network would improve the performance. We need to explicitly track this, and remind ourselves that the hypothesis we are testing now is the new hypothesis, not the old one. This is important because if the model does not work, then we need to remember that the hypothesis being refuted is that a lower learning rate with a wider network is better. That is the hypothesis that is refuted. Maybe the original hypothesis, that a lower learning rate alone is good, is still a good hypothesis. We don't know. But that is not the hypothesis we are refuting. So it is important to track why we are running an experiment. It's important to track the hypothesis or the question we are trying to answer with an experiment, because it helps to cut down the search space. The second thing to take into account is that it's generally not possible to run all the new hypotheses and follow-up experiments we come up with. We cannot run every experiment or try every idea we have in mind. We have to do some kind of prioritization, and we have to keep track of the ideas or hypotheses we are going to test and the ones we are not going to test. And the number of these choices increases exponentially. So logging helps us in two ways. First, it helps us keep track of choices: it helps us remember which experiments or ideas we have tried and which we haven't. And second, it helps us narrow down between the choices.
So maybe we have consistently seen in our previous experiments that a wider network is a better architecture when it comes to convergence. This observation might nudge us towards trying a wider network with a lower learning rate. But for this, we need to know whether wider networks have indeed been better in the past, and for that we need logging, so that we can answer questions like: what other experiments have we run in the past, how many of them used wider networks, and how many of those actually worked better? So logging helps us to track hypotheses and to choose between different competing hypotheses. The second item on the agenda is: who are the users of our logs, or who needs these logs? This is an important question because it helps us choose what information, metrics, and metadata we want to track. There are different personas. The first persona that uses the logs is the people designing the experiments, the people who are coming up with the hypotheses. This could be domain experts or data scientists, people who have questions that need to be answered. They need the logs to verify or refute a hypothesis and, based on the observations, come up with new hypotheses. The second persona that cares about logging is the people implementing the experiments. This could include the engineers, the people who are coding these experiments. They need logs to verify that their implementation is correct and their code works. A common example: we take the model and overfit it to a single batch of data, and we look at the accuracy or the other metrics to see whether they are converging or not. The third persona that cares about logging is the people who are actually running the experiment.
A way of thinking about it: let's say you start with some open source code and you use it for running your own experiments. In that case you are not the one who implemented the experiment, but you are the one who is running it. This persona uses the logs to verify that the experiment actually ran, that it did not crash or terminate in between. The fourth persona is the people who are interpreting the experiments and drawing conclusions from them. This could include analysts and domain experts, people who are trying to make sense of the models or trying to decide whether a model should be used or not. A fifth persona is people who are making business decisions based on these experiments; this generally includes people in executive roles. And a sixth persona is the people who will be rerunning the experiments, either to reproduce your results or to build on top of your work. Interestingly, this persona could also include your own future self. Let's say you worked on a project for six months, then you took a break and worked on something else, and then you came back. When you come back to this model or this problem, you do not want to start all the exploration and all the work from scratch. You want to resume from where you left off, and the logs that you have are the starting point for that. There are two things to note here. First, this list of personas is obviously not exhaustive; there are other use cases and personas who may be interested in your logs. Second, there is a good deal of overlap between these personas. The important thing is not that these are the six personas I have listed. The key takeaway is that you are not the only one who needs the logs from your experiments.
So when you are thinking about logging, think of it as a more holistic exercise and not something you do only to cater to your own needs. This naturally leads to the next question: what should we log? The answer is that we should log anything and everything that we, or any of the personas we talked about earlier, may need to answer questions about the experiments. Often we narrowly define "everything" to mean every possible metric. But as we saw earlier, that is not enough. Metrics are necessary, but they are not sufficient. Another way of motivating the question of what to log is with a hypothetical scenario. Let's say you are running an experiment, you are training a model, and the only thing you care about in this experiment is the validation accuracy, so that's the only metric you track. You run your experiment, you train your model, and the validation accuracy turns out to be quite low. There could be many reasons for that: maybe the learning rate is too low, or maybe the gradients are too small, or maybe the gradients are too large. There are many possibilities. So how do you test these possibilities and debug the model? Well, you can start logging additional metrics and rerun this experiment. But rerunning the experiment is going to take time and use more compute resources. An alternative is to plan well ahead of time and track all the additional information you think you may need, even if it is not strictly the information you care about. For example, while training the model the first time, you could have tracked the norm of the parameters and the norm of the gradients. Now there is a tradeoff to be made.
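As a sketch of what "tracking more than you strictly need" can look like, here is a framework-agnostic way to log the global gradient norm alongside the loss at each step. The helper name and the nested-list gradient representation are illustrative assumptions; with a real framework you would read the gradients off the model parameters after the backward pass.

```python
import math

def grad_global_norm(grads):
    """Global L2 norm across a list of gradient 'arrays' (here: nested
    lists of floats). Hypothetical helper, not from the talk."""
    total = 0.0
    for g in grads:
        # flatten arbitrarily nested lists of floats
        stack = [g]
        while stack:
            item = stack.pop()
            if isinstance(item, (int, float)):
                total += float(item) ** 2
            else:
                stack.extend(item)
    return math.sqrt(total)

# Per-step log entry: record the gradient norm alongside the loss,
# even if validation accuracy is the only metric you "care" about.
log_entry = {
    "step": 42,
    "loss": 0.73,
    "grad_norm": grad_global_norm([[0.3, -0.4], [1.2]]),
}
```

If the run later misbehaves, entries like this let you check for vanishing or exploding gradients without rerunning the experiment.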
You can obviously log as much information as you want, but that information could be totally irrelevant to the experiment you are running or the questions you might potentially want to answer, in which case it becomes more noise than information. So a balance is needed between logging the information you may need and logging all possible things. Alongside tracking the metrics, also keep track of why you ran an experiment: log the reason for running it. Here is an example entry. It is actually an experiment config, which I generally write as part of the logging process, and I am tracking a couple of things in it. The experiment has a short description, I have a tag in the config, and then I have an issue, which is basically a GitHub issue. I prefer logging the description, the reason for running an experiment, as GitHub issues, and then I put a short description into the config. Using GitHub issues for the "why did you run this experiment" part has certain benefits, like the reason for running the experiment living near the code. But it also has downsides: you have to switch between tools, and your collaborators or team members may prefer docs or a wiki or some other medium. It doesn't really matter how you do it; you can do it in the config or the metadata as well. What is important is that you do it, that you track the reason for running the experiment. Another important thing to log and keep track of is how to run an experiment. Think of this as all the information that a stranger would need to rerun your experiment and reproduce your results without talking to you. Things in this category include metadata, the config, and the code that was used to run the experiment.
The code can generally be tracked via Git commits; then there are the versions of the datasets or features, the versions of the different software packages you are using, and even things like environment flags, the commands, and the documentation. All the things a person would need to reproduce your results without talking to you, you should consider tracking. Another important thing to track is the metadata for the experiment. This includes things like the config of the device on which you are running the experiment, the name or config of the cluster where you are running it, GPU and CPU usage, and when the experiment started, when it was scheduled, when it ended, and so on. This information is very useful when you are updating your framework. Let's say you are using PyTorch and you are going from 1.8 to 1.9. You can use this information to do basic regression testing and make sure that the updated version does not introduce any performance-related issues. As we discussed earlier, it is also important to track what experiments you have run so far and what experiments you are running right now. It's important because when you are iterating over these hypotheses, it is very possible that you will come back to the same question multiple times. Unless there is an easy way for you to search what things you have tried or what ideas you have experimented with, you will probably end up rerunning those experiments, which takes time, resources, and so on. This becomes even more useful and important when you have multiple people running experiments for the same project, because now there needs to be a way for people to communicate what experiments they are running and what their results look like. The next question is what tools we could use for logging. Interestingly, there are a lot of useful tools available now, and they come in different flavors.
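As a rough sketch of this kind of "how to reproduce" metadata, one might collect the Git commit, interpreter version, and host information at startup, alongside the description, tag, and issue link from the config entry described earlier. The function and field names below are hypothetical, not a schema the talk prescribes.

```python
import json
import platform
import socket
import subprocess
import sys
import time

def experiment_metadata(description, tag, issue_url=None):
    """Collect reproducibility metadata once per experiment.
    Illustrative helper; field names are assumptions."""
    try:
        # current code version, so a stranger can check out the same commit
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = None  # not inside a git repo (or git not installed)
    return {
        "description": description,
        "tag": tag,
        "issue": issue_url,  # e.g. a GitHub issue explaining *why* it was run
        "git_commit": commit,
        "python_version": sys.version,
        "hostname": socket.gethostname(),
        "platform": platform.platform(),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }

meta = experiment_metadata("lower lr + wider network", tag="lr-sweep")
print(json.dumps(meta, indent=2, default=str))
```

Writing this dict next to the metric logs means the "did we ever run this?" and "how do I rerun it?" questions can be answered without talking to the original author.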
Some of the tools are cloud hosted, some are cloud hosted but have on-prem versions, some are framework agnostic, some are specific to particular frameworks, and so on. This is a small, non-exhaustive list. I will not go into the details of these tools; they are all excellent, and you can find their documentation and tutorials online. I do not have a strong preference for one versus the other. The important thing is that you do the logging; it is less important how you do it. The choice of tool you end up using will probably also depend on the kind of setup you are working in, the kind of questions you care about, your collaborators, and so on. Now, I want to spend some time on how I do logging, to illustrate some of the things we have been talking about so far and to show how I apply them in practice. I will focus on the details of logging metrics, because that is something we have to do many times within an experiment, so it has more challenges and nuances. Things like logging metadata and config are obviously part of the logging process, but since they are done once per experiment, I am not going to talk about that aspect in this subsection. In terms of logging and parsing, my stack is basically Python. I use the file system for tracking and saving the logs, and Jupyter notebooks with the usual tools and libraries, like pandas, NumPy, matplotlib, and HiPlot, for analysis. A couple of things I keep in mind when writing logs: I write logs as I generate them. It is okay to buffer logs for some time to do a batch write to the file system, but I do not carry around a list of logs throughout the experiment. Part of the reason is that if, for whatever reason, my experiment terminates in between because of a machine failure or something, I should not lose the logs I have generated so far.
So generate the log and write it to the file system so that it is persisted there. Secondly, I prefer to make each log entry a standalone, self-contained entry. If you give me the i-th row of your logs, I should be able to interpret the different numbers or metrics tracked as part of that particular entry. Here is a concrete example of what I mean. I have a dictionary-based log whose keys are the names of the metrics and whose values are the metric values. For example, in this case I have a key called episode, a key called batch reward, and a key called critic loss, and corresponding to these keys I have the metric values. The good thing is that this log entry tells me everything about all the metric values: I do not need a header, another file, or some other piece of information to interpret them. This is very different from CSV logs, where you have a top-level header row and you have to refer back to it to interpret the metrics. This style obviously leads to redundancy, because you are writing the same set of keys in all the different log entries. There is a lot of repetition and a lot of redundancy. The benefit is that when I am writing a metric, I can log whatever metric I want, wherever I want, whenever I want. So I prefer logging dictionaries because of the flexibility they provide. As I said, the downside is the redundancy, but I think this redundancy is useful, and we will talk about the benefits in a minute. The format is also quite verbose, so it can take up a lot of space. My solution generally is to compress the logs after the experiment: when I am analyzing the results, I will run a job in the background that compresses the logs. Another thing I do is track the logs at different levels. There are the debug logs, or the full experiment logs, which can be pretty large.
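A minimal sketch of this style, assuming JSON-lines files: each entry is a complete dict, and entries are buffered briefly and flushed in batches, so a crash loses at most one small buffer rather than the whole run. This is an illustration, not the speaker's exact implementation; the class and file names are made up.

```python
import json

class JsonlLogger:
    """Append-only JSON-lines logger. Every line is a standalone dict,
    so any single row can be interpreted without headers or other rows."""

    def __init__(self, path, buffer_size=100):
        self.path = path
        self.buffer_size = buffer_size
        self._buffer = []

    def write(self, entry):
        # buffer a serialized entry; batch-write once the buffer fills up
        self._buffer.append(json.dumps(entry))
        if len(self._buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # persist buffered entries so a crash loses at most one buffer
        if self._buffer:
            with open(self.path, "a") as f:
                f.write("\n".join(self._buffer) + "\n")
            self._buffer = []

logger = JsonlLogger("metrics.jsonl", buffer_size=2)
logger.write({"mode": "train", "episode": 1, "batch_reward": 0.12, "critic_loss": 3.4})
logger.write({"mode": "train", "episode": 2, "batch_reward": 0.19, "critic_loss": 2.9})
logger.flush()  # flush any remainder at the end of the run
```

Note that each line repeats the keys, which is exactly the redundancy being traded for flexibility: any metric can be added to any entry at any point.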
They might be tracking a lot of metrics, say at the batch level. And then there are the general logs, the base or essential logs, which are the key logs I care about. The idea with the debug logs is that I need them only for debugging, and probably 90 to 95% of the time I do not even care about them. As long as my experiment behaves the way I expect it to, I will not care about the debug logs. Whereas if I want to debug something, if I want to answer additional questions like what the gradients of the model look like, or what other metrics that I generally do not care about look like, I will find those in the debug logs. So I track the logs at these two levels, and if the experiment goes well and it turns out I do not need the debug logs, I can just go ahead and delete them. To give you an idea, the debug logs I track are three orders of magnitude bigger than the logs I strictly need. And this goes back to the earlier discussion where we said that if I have a question about an experiment after it is over, unless there is a metric or a log I tracked for it, there is no way for me to answer the question other than rerunning the experiment. So the tradeoff I am making is the high cost of this temporary storage versus the cost of rerunning the experiment: the time cost of waiting for it as well as the compute resource cost. If the experiment works well, I do not care about the debug logs and can delete them. If it doesn't, I can dive deep into the detailed logs and use them. To save on space, I generally convert my logs into pandas data frames and then serialize those data frames to the file system. Another problem you may encounter when tracking these logs, specifically the debug logs, is that they may take too much time to parse.
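A small sketch of the two-level idea, assuming JSON-lines logs and illustrative file names: essential metrics go to a base log, high-volume metrics go to a separate debug log that is compressed after the run (and can simply be deleted if the run went well). The `level` field and the inline compression are assumptions for the example; in practice the compression job would run in the background.

```python
import gzip
import json
import os

# Two log levels: essential metrics vs. high-volume batch-level metrics.
entries = [
    {"level": "base", "epoch": 1, "val_acc": 0.81},
    {"level": "debug", "step": 1, "grad_norm": 0.9},
    {"level": "debug", "step": 2, "grad_norm": 1.1},
]

# Route each entry to the base log or the (much larger) debug log.
with open("base.jsonl", "w") as base, open("debug.jsonl", "w") as dbg:
    for e in entries:
        target = base if e["level"] == "base" else dbg
        target.write(json.dumps(e) + "\n")

# After the experiment: compress the bulky debug log and drop the original.
# If the run looked healthy, the .gz file itself could be deleted too.
with open("debug.jsonl", "rb") as src, gzip.open("debug.jsonl.gz", "wb") as dst:
    dst.write(src.read())
os.remove("debug.jsonl")
```

This keeps the cheap base log always available while making the "three orders of magnitude bigger" debug data a temporary, compressible cost.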
In that case, the standard approach is to parse multiple log files in parallel. In our case, since every log entry is standalone and complete in its own right, you can actually parse the different rows within a single file in parallel. That is the benefit of all the redundancy we had earlier: you have all the keys and values you care about within each log entry, so you can write whatever arbitrary filters and transformations you want, without having to take any joins, and you can process each line in parallel. So that redundancy becomes useful now. The TLDR is that I prefer logging dicts. This has downsides, like being very verbose. But on the flip side, dicts are a lot easier to handle. Dicts are first-class citizens in Python, there is a lot of support for JSON-to-dict conversion, and there are plenty of libraries that support very fast parsing of JSON into Python dictionaries. For the last two questions, I want to reiterate that logging is a holistic process. It is not limited to tracking performance metrics like loss or accuracy. Sure, performance metrics are an important part, but they are not the complete picture. Think of logging as a way to capture the lifecycle of a hypothesis or an idea: we start with some idea or hypothesis, we run experiments to verify or refute it, and based on the experiments we change it. Keeping in mind that logging is this sort of multifaceted process, we may want to use multiple tools for logging. GitHub is one good place to put some of the "why" and "how" questions when you are running experiments, partly because this is where the code also lives: you are already interacting with GitHub to save your code, so it can also be used for keeping track of the experiments. Another place you may consider using is documents or wikis.
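Because every line is self-contained, a filter or transformation can be applied to each line independently, with no joins against a header row. A minimal sketch with hypothetical field names, using a thread pool for brevity (a process pool would be the natural choice for CPU-bound parsing of genuinely large files):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def parse_line(line):
    """Parse and filter one log line in isolation. Each line carries all
    its own keys, so no other line or header is needed to interpret it."""
    entry = json.loads(line)
    # arbitrary per-line filter: keep only high-reward steps (threshold
    # and field names are illustrative)
    return entry if entry.get("batch_reward", 0.0) > 0.15 else None

lines = [
    '{"episode": 1, "batch_reward": 0.12}',
    '{"episode": 2, "batch_reward": 0.19}',
    '{"episode": 3, "batch_reward": 0.25}',
]

# map the parser over lines in parallel; order of results is preserved
with ThreadPoolExecutor(max_workers=4) as pool:
    kept = [e for e in pool.map(parse_line, lines) if e is not None]

print(kept)
```

The same pattern scales from lines within one file to many files at once, which is what makes the per-line redundancy pay for itself at analysis time.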
Especially when you want a snapshot view or a bird's-eye view of the different things you tried, or a summarized view of the different experiments you ran. Docs and wikis are very useful for keeping track of the progress you have made across different kinds of experiments. When it comes to writing the metric logs, you can use the file system, some hosted tools, or whatever logger you prefer, and any combination of these tools is fine. There is no perfect answer. Choose whatever combination of tools works for you. Just make sure the tools you are using are reliable, that they are easy to use, and that the entries or logs you are creating are easy to share with other people. For example, do not use a random piece of paper you see lying nearby for logging. That is not easy to share, and it is not reliable. The last question is: when is a good time to start thinking about logging? The answer is probably the sooner the better. For example, you should know what metrics you care about and what metrics you are going to log before you run an experiment, or probably even before you write the code. And since logging is about the lifecycle of an idea and not just some numbers, you should start logging, or writing down the idea, when you first think of it, and you should update this every time you discuss the idea with others. I picked up this habit quite late, but it has made a lot of difference in the quality of my work. The luxury of going back to the start of an idea, challenging the initial assumptions, seeing how your assumptions evolved over time, and basically redeveloping the idea in a different way is a very useful tool. But at the same time, it requires you to be way more diligent when you start working on an idea, and it requires you to think about logging from day one. With that, I will wrap up my talk.
To reiterate the key message once again: logging ML experiments is a holistic exercise. It is not limited to tracking performance. Experiments are a way to test hypotheses, and logging is a way to capture the lifecycle of these hypotheses. With that, I wrap up my talk. Thank you once again for coming.