What is data leakage? Let's say we have a data set here that is time dependent, meaning every sample is collected at a specific instant in time. This is the earliest sample, and this here is the latest sample. Let's say we want to perform cross-validation where we train on these two chunks and test on this later chunk. This has no data leakage, because this is exactly how we would use the data in the real world. In this case, however, we are testing on some data, but the model has been trained on data collected after the test period, that is, on data from the future. This is not representative of how we would train a model in the real world, and so this is bad: we might get a low mean squared error during cross-validation, yet the model will actually perform much worse in production. Hence data leakage needs to be combated.
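To make the leak-free case concrete, here is a minimal sketch of a time-ordered split. It assumes the samples are already sorted from earliest to latest; the function name `time_ordered_folds` and its parameters are hypothetical, chosen only for this illustration. Every fold trains on the earlier chunks and tests on the chunk that follows them, so the model never sees data from after the test period.

```python
def time_ordered_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for time-ordered cross-validation.

    Hypothetical helper for illustration. Assumes index 0 is the earliest
    sample and index n_samples - 1 is the latest.
    """
    chunk = n_samples // (n_folds + 1)
    if chunk == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * chunk
        # The last fold absorbs any leftover samples at the end.
        test_end = n_samples if k == n_folds else train_end + chunk
        yield list(range(train_end)), list(range(train_end, test_end))


folds = list(time_ordered_folds(n_samples=9, n_folds=2))
for train_idx, test_idx in folds:
    # No leakage: the latest training sample precedes the earliest test sample.
    assert max(train_idx) < min(test_idx)
    print("train:", train_idx, "test:", test_idx)
```

The leaky setup described above corresponds to a fold where some training index is larger than a test index, which is exactly what the assertion in the loop rules out. scikit-learn's `TimeSeriesSplit` offers a comparable expanding-window split if you prefer a library implementation.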